Differentiating Kernel Models¶
Differentiating kernel models replace Euclidean distances with kernel-induced distances in prototype-based learning. In the supervised and one-class models, the kernel parameters are adapted via gradient descent alongside the prototypes, so the model learns a non-linear similarity measure from the data. The unsupervised models use a fixed kernel bandwidth.
Mathematical Background¶
Gaussian Kernel Distance¶
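For a Gaussian kernel with bandwidth \(\sigma\):
\[\kappa_\sigma(x, w) = \exp\left(-\frac{\|x - w\|^2}{2\sigma^2}\right)\]
the induced distance in feature space is:
\[d_\kappa^2(x, w) = \kappa_\sigma(x, x) - 2\kappa_\sigma(x, w) + \kappa_\sigma(w, w) = 2\left(1 - \kappa_\sigma(x, w)\right)\]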
This distance is bounded in \([0, 2]\) regardless of input magnitude, making it naturally robust to outliers.
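The bounded range is easy to check numerically. The following is a minimal, dependency-free sketch (plain Python rather than JAX), assuming the standard Gaussian kernel \(\exp(-\|x - w\|^2 / 2\sigma^2)\) and the induced squared distance \(2(1 - \kappa)\); the function names are illustrative and not part of prosemble.

```python
import math

def gaussian_kernel(x, w, sigma):
    """kappa(x, w) = exp(-||x - w||^2 / (2 sigma^2))."""
    sq_euclid = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return math.exp(-sq_euclid / (2.0 * sigma ** 2))

def gaussian_kernel_distance_sq(x, w, sigma):
    """Squared feature-space distance, using kappa(x, x) = kappa(w, w) = 1."""
    return 2.0 * (1.0 - gaussian_kernel(x, w, sigma))

w = [0.0, 0.0]
same = gaussian_kernel_distance_sq(w, w, sigma=1.0)
near = gaussian_kernel_distance_sq([0.1, 0.0], w, sigma=1.0)
far = gaussian_kernel_distance_sq([1000.0, 0.0], w, sigma=1.0)

# The distance saturates: an extreme outlier contributes at most 2.
print(same, near, far)
```

Because the distance saturates, a single extreme sample cannot dominate a sum of distances the way it can with squared Euclidean distance.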
Relevance-Weighted Kernel Distance¶
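Adding per-feature relevance weights \(\lambda_j = \text{softmax}(\text{relevances})_j\):
\[d_\kappa^2(x, w_k) = 2\left(1 - \exp\left(-\frac{\sum_j \lambda_j (x_j - w_{k,j})^2}{2\sigma_k^2}\right)\right)\]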
This combines feature selection with kernel distance, identifying which input dimensions are most important for classification.
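To illustrate how the relevance weights act as soft feature selection, here is a small sketch in plain Python. It assumes the weighted squared difference \(\sum_j \lambda_j (x_j - w_j)^2\) sits inside the Gaussian exponent; the helper names and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of relevance logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relevance_kernel_distance_sq(x, w, relevances, sigma):
    """2 * (1 - exp(-sum_j lambda_j (x_j - w_j)^2 / (2 sigma^2)))."""
    lam = softmax(relevances)
    weighted = sum(l * (xi - wi) ** 2 for l, xi, wi in zip(lam, x, w))
    return 2.0 * (1.0 - math.exp(-weighted / (2.0 * sigma ** 2)))

relevances = [4.0, 0.0, 0.0]   # feature 0 carries most of the relevance
w = [0.0, 0.0, 0.0]

# The same unit offset, once along the relevant feature, once along an irrelevant one.
d_relevant = relevance_kernel_distance_sq([1.0, 0.0, 0.0], w, relevances, sigma=1.0)
d_irrelevant = relevance_kernel_distance_sq([0.0, 1.0, 0.0], w, relevances, sigma=1.0)
print(d_relevant, d_irrelevant)
```

A deviation along a low-relevance feature barely changes the distance, which is what lets the model ignore uninformative dimensions.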
Exponential Kernel Distance¶
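The exponential kernel uses a learned transformation matrix \(\hat\Lambda = \hat\Omega \hat\Omega^T\):
\[\kappa_{\exp}(x, w) = \exp\left(x^T \hat\Lambda w\right)\]
Unlike the Gaussian kernel, \(\kappa_{\exp}(v, v) \neq 1\), so the full three-term distance formula is required:
\[d_\kappa^2(x, w) = \kappa_{\exp}(x, x) - 2\kappa_{\exp}(x, w) + \kappa_{\exp}(w, w)\]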
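The three-term formula can be sketched as follows. This is an illustration under the assumption that the exponential kernel has the form \(\exp(x^T \hat\Lambda w)\) with \(\hat\Omega\) of shape (d, latent_dim); the helper names are hypothetical and the code uses plain Python lists instead of JAX arrays.

```python
import math

def lambda_hat(omega_hat):
    """Lambda_hat = Omega_hat @ Omega_hat^T for Omega_hat given as d rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in omega_hat]
            for ri in omega_hat]

def exp_kernel(x, w, lam):
    """kappa_exp(x, w) = exp(x^T Lambda_hat w)."""
    lam_w = [sum(a * b for a, b in zip(row, w)) for row in lam]
    return math.exp(sum(xi * yi for xi, yi in zip(x, lam_w)))

def exp_kernel_distance_sq(x, w, lam):
    """Full three-term distance; the self-similarity terms do not cancel to 2."""
    return exp_kernel(x, x, lam) - 2.0 * exp_kernel(x, w, lam) + exp_kernel(w, w, lam)

omega_hat = [[0.1, 0.0], [0.0, 0.1]]   # small entries keep the exponent small
lam = lambda_hat(omega_hat)
x, w = [1.0, 2.0], [0.0, 1.0]

self_similarity = exp_kernel(x, x, lam)
d_sq = exp_kernel_distance_sq(x, w, lam)
print(self_similarity, d_sq)
```

Here `self_similarity` depends on the magnitude of `x`, which is why the shortcut \(2(1 - \kappa)\) used for the Gaussian kernel does not apply.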
Supervised Models¶
DKGLVQ¶
Differentiating Kernel GLVQ. Each prototype \(w_k\) has a learnable bandwidth \(\sigma_k\) adapted via gradient descent.
from prosemble.models import DKGLVQ
from prosemble.datasets import load_iris_jax

dataset = load_iris_jax()
X, y = dataset.input_data, dataset.target

model = DKGLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    sigma_init='median',  # per-class median distance initialization
    sigma_min=1e-3,       # prevent bandwidth collapse
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Accuracy: {(preds == y).mean():.2%}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")
The sigma_init parameter controls initialization:
'median' (default): per-class median distance from prototype to class members
'mean': per-class mean distance
float: fixed value for all prototypes
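The three initialization modes can be pictured with a short sketch. This is not the library's implementation: `init_sigma` is a hypothetical helper, it assumes plain (not squared) Euclidean distances, and applying `sigma_min` already at initialization is an assumption as well.

```python
import math
import statistics

def init_sigma(prototype, class_members, mode="median", sigma_min=1e-3):
    """Hypothetical bandwidth initializer mirroring the sigma_init options."""
    dists = [math.dist(prototype, x) for x in class_members]
    if mode == "median":
        sigma = statistics.median(dists)
    elif mode == "mean":
        sigma = statistics.fmean(dists)
    else:
        sigma = float(mode)          # fixed value for all prototypes
    return max(sigma, sigma_min)     # assumed clamp against collapse

prototype = [0.0, 0.0]
members = [[1.0, 0.0], [0.0, 2.0], [6.0, 0.0]]   # distances 1, 2, 6

sigma_median = init_sigma(prototype, members, mode="median")
sigma_mean = init_sigma(prototype, members, mode="mean")
sigma_fixed = init_sigma(prototype, members, mode=0.5)
print(sigma_median, sigma_mean, sigma_fixed)
```

The median is less sensitive than the mean to a single distant class member, which may be one reason it is the default.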
DKGRLVQ¶
Differentiating Kernel GRLVQ. Combines per-feature relevance weighting with per-prototype kernel bandwidth adaptation.
from prosemble.models import DKGRLVQ

model = DKGRLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    sigma_init='median',
    sigma_min=1e-3,
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Accuracy: {(preds == y).mean():.2%}")
print(f"Relevance profile: {model.relevance_profile}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")
The relevance_profile property returns the normalized feature relevance
weights \(\lambda = \text{softmax}(\text{relevances})\), identifying
which features are most discriminative.
DKGMLVQ¶
Differentiating Kernel GMLVQ with the exponential kernel. Learns a global
transformation matrix \(\hat\Omega\) of shape (d, latent_dim).
from prosemble.models import DKGMLVQ

model = DKGMLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    latent_dim=None,      # defaults to input dim
    omega_hat_scale=0.1,  # small init prevents exp overflow
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Omega hat shape: {model.omega_hat_matrix.shape}")
print(f"Lambda hat shape: {model.lambda_hat_matrix.shape}")
The lambda_hat_matrix property returns the symmetric positive
semi-definite matrix \(\hat\Lambda = \hat\Omega \hat\Omega^T\),
which can be analyzed for feature correlations learned by the model.
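Symmetry and positive semi-definiteness follow directly from the factorization. A short derivation, using only the definition \(\hat\Lambda = \hat\Omega \hat\Omega^T\) and an arbitrary vector \(v\):

\[\hat\Lambda^T = (\hat\Omega \hat\Omega^T)^T = \hat\Omega \hat\Omega^T = \hat\Lambda, \qquad v^T \hat\Lambda v = (\hat\Omega^T v)^T (\hat\Omega^T v) = \|\hat\Omega^T v\|^2 \ge 0\]

Assuming the exponential kernel has the form \(\exp(x^T \hat\Lambda w)\), this means the self-similarity \(\exp(v^T \hat\Lambda v)\) is at least 1 and grows with \(\|\hat\Omega^T v\|^2\), which is consistent with initializing \(\hat\Omega\) at a small scale to keep the exponent from overflowing.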
One-Class Models¶
The one-class differentiating kernel models combine OC-GLVQ’s \(\theta\)-based hypothesis testing with kernel distances. In standard OC-GLVQ, the classifier function is:
\[\mu(x_i) = s_i \cdot \frac{d(x_i, w_{k^*}) - \theta_{k^*}}{d(x_i, w_{k^*}) + \theta_{k^*}}\]
where \(k^*\) is the nearest prototype, \(\theta_{k^*}\) is a learned per-prototype visibility threshold, and \(s_i = +1\) for target, \(-1\) for outlier. The OC-DK variants replace the Euclidean distance \(d\) with kernel distances.
Critical design detail: The \(\theta_k\) thresholds are initialized in kernel distance scale, not Euclidean scale. Gaussian kernel distances are bounded in \([0, 2]\), so Euclidean-initialized thetas would be far too large.
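The scale mismatch can be made concrete with a small sketch. It rests on two assumptions not spelled out here: that a sample counts as inside a prototype's visibility radius when its distance is below \(\theta\), and that the classifier function has the GLVQ-style form \(s (d - \theta) / (d + \theta)\). All names and numbers are illustrative.

```python
import math

def kernel_distance_sq(sq_euclid, sigma):
    """Gaussian kernel distance as a function of squared Euclidean distance."""
    return 2.0 * (1.0 - math.exp(-sq_euclid / (2.0 * sigma ** 2)))

def oc_mu(d, theta, s):
    """Assumed OC-GLVQ classifier function: s * (d - theta) / (d + theta)."""
    return s * (d - theta) / (d + theta)

sigma = 1.0
sq_euclid_outlier = 36.0                 # an outlier far from the prototype
d_kernel = kernel_distance_sq(sq_euclid_outlier, sigma)

theta_euclid = 4.0                                   # radius on the squared-Euclidean scale
theta_kernel = kernel_distance_sq(theta_euclid, sigma)  # same radius on the kernel scale

# With a Euclidean-scale theta, no kernel distance can ever exceed the threshold.
inside_with_euclid_theta = d_kernel < theta_euclid
inside_with_kernel_theta = d_kernel < theta_kernel
mu_outlier = oc_mu(d_kernel, theta_kernel, s=-1)
print(inside_with_euclid_theta, inside_with_kernel_theta, mu_outlier)
```

Since the kernel distance never exceeds 2, any threshold above 2 makes every sample look like a target, so the thresholds have to start on the kernel scale for the gradient signal to be meaningful.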
OCDKGLVQ¶
One-class classification with Gaussian kernel distance and per-prototype bandwidth adaptation.
from prosemble.models import OCDKGLVQ
import jax
import jax.numpy as jnp

# Generate one-class dataset
key = jax.random.PRNGKey(42)
k1, k2 = jax.random.split(key)
X_target = jax.random.normal(k1, (100, 4)) * 0.5
X_outlier = jax.random.normal(k2, (30, 4)) * 0.5 + 3.0
X = jnp.concatenate([X_target, X_outlier])
y = jnp.concatenate([jnp.zeros(100, dtype=jnp.int32),
                     jnp.ones(30, dtype=jnp.int32)])

model = OCDKGLVQ(
    n_prototypes=3,
    max_iter=100,
    lr=0.01,
    sigma_init='median',
    sigma_min=1e-3,
    target_label=0,
    random_seed=42,
)
model.fit(X, y)

scores = model.decision_function(X)
preds = model.predict(X)
print(f"Learned bandwidths: {model.kernel_bandwidths}")
print(f"Visibility radii: {model.visibility_radii}")
OCDKGRLVQ¶
One-class classification with relevance-weighted kernel distance, per-prototype bandwidth, and per-feature relevance learning.
from prosemble.models import OCDKGRLVQ

model = OCDKGRLVQ(
    n_prototypes=3,
    max_iter=100,
    lr=0.01,
    sigma_init='median',
    sigma_min=1e-3,
    target_label=0,
    random_seed=42,
)
model.fit(X, y)

scores = model.decision_function(X)
print(f"Relevance profile: {model.relevance_profile}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")
The relevance_profile property returns the softmax-normalized per-feature
weights, identifying which features are most important for the one-class
boundary.
OCDKGMLVQ¶
One-class classification with exponential kernel distance and a learned transformation matrix \(\hat\Omega\).
from prosemble.models import OCDKGMLVQ

model = OCDKGMLVQ(
    n_prototypes=3,
    max_iter=100,
    lr=0.01,
    latent_dim=None,
    omega_hat_scale=0.1,
    target_label=0,
    random_seed=42,
)
model.fit(X, y)

scores = model.decision_function(X)
print(f"Omega hat shape: {model.omega_hat_matrix.shape}")
print(f"Lambda hat (PSD): {model.lambda_hat_matrix.shape}")
Supervised Models with Neural Gas¶
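The supervised DK-NG variants combine differentiating kernel distances with Neural Gas class-aware neighborhood cooperation. All same-class prototypes participate in the loss, weighted by their distance rank:
\[h_\gamma(r_k) \propto \exp\left(-\frac{r_k}{\gamma}\right)\]
where \(r_k\) is the rank of prototype \(w_k\) among the same-class prototypes (0 for the closest), \(\gamma\) decays during training from gamma_init to gamma_final, and the GLVQ margin is computed per prototype:
\[\mu_k = \frac{d_k^+ - d^-}{d_k^+ + d^-}\]
with \(d_k^+\) being the kernel distance to same-class prototype \(w_k\) and \(d^-\) being the nearest different-class prototype distance.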
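As a rough illustration of the rank-based cooperation, the sketch below ranks the same-class prototypes by distance, turns the ranks into weights \(\exp(-\text{rank}/\gamma)\), and computes a GLVQ margin per prototype. Normalizing the weights to sum to 1 is an assumption, as are the helper names and the example distances.

```python
import math

def ng_rank_weights(distances, gamma):
    """Weights exp(-rank / gamma), normalized to sum to 1 (assumed normalization)."""
    order = sorted(range(len(distances)), key=lambda k: distances[k])
    ranks = [0] * len(distances)
    for rank, k in enumerate(order):
        ranks[k] = rank                      # 0 for the closest prototype
    raw = [math.exp(-r / gamma) for r in ranks]
    total = sum(raw)
    return [h / total for h in raw]

def glvq_margin(d_plus, d_minus):
    """(d+ - d-) / (d+ + d-); negative when the same-class prototype is closer."""
    return (d_plus - d_minus) / (d_plus + d_minus)

d_same = [0.4, 0.1, 0.9]   # kernel distances to three same-class prototypes
d_minus = 0.6              # distance to the nearest different-class prototype

weights_early = ng_rank_weights(d_same, gamma=1.5)    # broad cooperation
weights_late = ng_rank_weights(d_same, gamma=0.01)    # close to winner-takes-all
margins = [glvq_margin(d, d_minus) for d in d_same]
loss_early = sum(h * m for h, m in zip(weights_early, margins))
print(weights_early, weights_late, margins, loss_early)
```

With a large \(\gamma\) every same-class prototype receives a noticeable share of the update; as \(\gamma\) shrinks, the weighting approaches the nearest-prototype behaviour of plain GLVQ.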
from prosemble.models import DKGLVQ_NG

model = DKGLVQ_NG(
    n_prototypes_per_class=3,
    max_iter=100,
    lr=0.01,
    sigma_init='median',
    gamma_init=1.5,
    gamma_final=0.01,
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Final gamma: {model.gamma_}")
The relevance-weighted variant (DKGRLVQ_NG) adds per-feature relevance
weights, while the matrix variant (DKGMLVQ_NG) uses exponential kernel
distance with learnable \(\hat\Omega\) transformation.
One-Class Models with Neural Gas¶
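The OC-DK-NG variants extend the one-class kernel models with Neural Gas neighborhood cooperation. Instead of only the nearest prototype contributing to the loss, all prototypes participate weighted by their distance rank:
\[h_\gamma(r_k) \propto \exp\left(-\frac{r_k}{\gamma}\right)\]
where \(r_k\) is the rank of prototype \(w_k\) by kernel distance (0 for the nearest) and \(\gamma\) decays during training from gamma_init to gamma_final.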
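The effect of the decay on the cooperation range can be sketched as follows. The geometric schedule used here is purely an assumption for illustration (the actual decay rule is not specified above), and `gamma_schedule` is a hypothetical helper.

```python
import math

def gamma_schedule(gamma_init, gamma_final, max_iter):
    """Assumed geometric decay from gamma_init to gamma_final over max_iter steps."""
    if max_iter < 2:
        return [gamma_init]
    ratio = gamma_final / gamma_init
    return [gamma_init * ratio ** (t / (max_iter - 1)) for t in range(max_iter)]

def rank_weights(n_prototypes, gamma):
    """Unnormalized weights exp(-rank / gamma) for ranks 0..n-1."""
    return [math.exp(-rank / gamma) for rank in range(n_prototypes)]

gammas = gamma_schedule(gamma_init=1.5, gamma_final=0.01, max_iter=100)
early = rank_weights(3, gammas[0])    # second-ranked prototype still weighted ~0.5
late = rank_weights(3, gammas[-1])    # only the nearest prototype matters
print(gammas[0], gammas[-1], early, late)
```

Under any monotone schedule of this kind, training starts with all prototypes sharing the update and ends close to the nearest-prototype rule of the base one-class model.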
from prosemble.models import OCDKGLVQ_NG

model = OCDKGLVQ_NG(
    n_prototypes=3,
    max_iter=100,
    lr=0.01,
    sigma_init='median',
    gamma_init=1.5,
    gamma_final=0.01,
    target_label=0,
    random_seed=42,
)
model.fit(X, y)
print(f"Final gamma: {model.gamma_}")
The relevance-weighted variant (OCDKGRLVQ_NG) and matrix variant
(OCDKGMLVQ_NG) follow the same pattern, combining their respective
kernel distances with NG cooperation.
Unsupervised Models¶
The unsupervised kernel models use the Gaussian kernel distance for prototype ranking and BMU selection, but \(\sigma\) is a fixed hyperparameter (not learned). Prototypes live in the original data space — only the distance metric changes.
DKNeuralGas¶
Neural Gas with Gaussian kernel distance for ranking.
from prosemble.models import DKNeuralGas
from prosemble.datasets import load_iris_jax

dataset = load_iris_jax()
X = dataset.input_data

model = DKNeuralGas(
    n_prototypes=10,
    kernel_sigma=1.0,
    max_iter=100,
    lr_init=0.5,
    lr_final=0.01,
    lambda_final=0.01,
    random_seed=42,
)
model.fit(X)

labels = model.predict(X)
print(f"Energy: {model.loss_:.4f}")
DKKohonenSOM¶
Kohonen SOM with Gaussian kernel distance for BMU selection. The grid neighborhood is unchanged — only the data-space metric changes.
from prosemble.models import DKKohonenSOM

model = DKKohonenSOM(
    grid_height=5,
    grid_width=5,
    kernel_sigma=1.0,
    sigma_init=2.0,
    sigma_final=0.5,
    lr_init=0.5,
    lr_final=0.01,
    max_iter=100,
    random_seed=42,
)
model.fit(X)

bmu_coords = model.bmu_map(X)
print(f"BMU coordinates shape: {bmu_coords.shape}")
DKHeskesSOM¶
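Heskes SOM with Gaussian kernel distance. The Heskes BMU criterion selects the unit whose entire neighborhood best represents the sample:
\[c^*(x) = \arg\min_c \sum_k h_{ck} \, d_\kappa^2(x, w_k)\]
where \(h_{ck}\) is the grid neighborhood weight between units \(c\) and \(k\).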
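The difference between nearest-unit selection and the neighborhood-averaged Heskes criterion can be seen on a tiny example. This sketch uses a one-dimensional grid with a Gaussian grid neighborhood for brevity; the grid shape, the neighborhood width, and all function names are assumptions for illustration only.

```python
import math

def kernel_dist_sq(x, w, sigma):
    """Gaussian kernel distance 2 * (1 - exp(-||x - w||^2 / (2 sigma^2)))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, w))
    return 2.0 * (1.0 - math.exp(-sq / (2.0 * sigma ** 2)))

def grid_weight(c, k, width):
    """Gaussian neighborhood between grid positions c and k on a 1-D grid."""
    return math.exp(-((c - k) ** 2) / (2.0 * width ** 2))

def nearest_bmu(x, prototypes, sigma):
    """Unit with the smallest kernel distance to the sample."""
    dists = [kernel_dist_sq(x, w, sigma) for w in prototypes]
    return min(range(len(dists)), key=dists.__getitem__)

def heskes_bmu(x, prototypes, sigma, width):
    """Unit minimizing the neighborhood-weighted sum of kernel distances."""
    dists = [kernel_dist_sq(x, w, sigma) for w in prototypes]
    n = len(dists)
    costs = [sum(grid_weight(c, k, width) * dists[k] for k in range(n))
             for c in range(n)]
    return min(range(n), key=costs.__getitem__)

# Unit 0 is closest to x, but its grid neighbor (unit 1) is far away.
prototypes = [[0.0], [5.0], [0.3], [0.5]]
x = [0.1]

plain = nearest_bmu(x, prototypes, sigma=1.0)
heskes = heskes_bmu(x, prototypes, sigma=1.0, width=1.0)
print(plain, heskes)
```

In this configuration the two criteria pick different units, because the Heskes cost penalizes a unit whose grid neighbors represent the sample poorly.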
from prosemble.models import DKHeskesSOM

model = DKHeskesSOM(
    grid_height=5,
    grid_width=5,
    kernel_sigma=1.0,
    sigma_init=2.0,
    sigma_final=0.5,
    max_iter=100,
    random_seed=42,
)
model.fit(X)

bmu_coords = model.bmu_map(X)
print(f"Energy: {model.loss_:.4f}")
Choosing a Model¶
| Model | Kernel | Learned Params | Best For |
|---|---|---|---|
| DKGLVQ | Gaussian | \(w_k, \sigma_k\) | Per-prototype bandwidth adaptation |
| DKGRLVQ | Gaussian (weighted) | \(w_k, \sigma_k, \lambda\) | Feature selection + kernel adaptation |
| DKGMLVQ | Exponential | \(w_k, \hat\Omega\) | Full metric adaptation in kernel space |
| DKGLVQ_NG | Gaussian | \(w_k, \sigma_k, \gamma\) | Supervised kernel + NG cooperation |
| DKGRLVQ_NG | Gaussian (weighted) | \(w_k, \sigma_k, \lambda, \gamma\) | Supervised kernel + relevances + NG cooperation |
| DKGMLVQ_NG | Exponential | \(w_k, \hat\Omega, \gamma\) | Supervised kernel matrix + NG cooperation |
| OCDKGLVQ | Gaussian | \(w_k, \sigma_k, \theta_k\) | One-class with kernel bandwidth adaptation |
| OCDKGRLVQ | Gaussian (weighted) | \(w_k, \sigma_k, \lambda, \theta_k\) | One-class with feature selection + kernel |
| OCDKGMLVQ | Exponential | \(w_k, \hat\Omega, \theta_k\) | One-class with full metric adaptation in kernel space |
| OCDKGLVQ_NG | Gaussian | \(w_k, \sigma_k, \theta_k, \gamma\) | One-class kernel + NG cooperation |
| OCDKGRLVQ_NG | Gaussian (weighted) | \(w_k, \sigma_k, \lambda, \theta_k, \gamma\) | One-class kernel + relevances + NG cooperation |
| OCDKGMLVQ_NG | Exponential | \(w_k, \hat\Omega, \theta_k, \gamma\) | One-class kernel matrix + NG cooperation |
| DKNeuralGas | Gaussian (fixed \(\sigma\)) | \(w_k\) | Unsupervised clustering with kernel distance |
| DKKohonenSOM | Gaussian (fixed \(\sigma\)) | \(w_k\) | SOM visualization with kernel distance |
| DKHeskesSOM | Gaussian (fixed \(\sigma\)) | \(w_k\) | Principled SOM with kernel distance |
Riemannian Variants¶
Differentiating kernel distances can also be applied on Riemannian manifolds. Three models combine the RiemannianSRNG framework (prototypes on manifold, NG rank cooperation, manifold projection) with kernel distance formulas:
RiemannianDKGLVQ — Gaussian kernel on geodesic distance: \(d_\kappa^2(x, w_k) = 2(1 - \exp(-d_{\text{geo}}^2(x, w_k) / 2\sigma_k^2))\)
RiemannianDKGRLVQ — Relevance-weighted kernel in tangent space: \(d_\kappa^2(x, w_k) = 2(1 - \exp(-\sum_j \lambda_j v_j^2 / 2\sigma_k^2))\) where \(v = \text{Log}_{w_k}(x)_{\text{flat}}\)
RiemannianDKGMLVQ — Exponential kernel in tangent space: \(d_\kappa^2(x, w_k) = \exp(v^T \hat\Lambda v) - 1\) where \(\hat\Lambda = \hat\Omega \hat\Omega^T\)
All three support SO(n), SPD(n), and Grassmannian(n,k) manifolds.
from prosemble.core.manifolds import SO
from prosemble.models import RiemannianDKGLVQ

# X_train, y_train, X_test: manifold-valued data and labels (not defined in this snippet)
manifold = SO(3)
model = RiemannianDKGLVQ(
    manifold=manifold, n_prototypes_per_class=2,
    max_iter=100, lr=0.01, use_scan=False,
)
model.fit(X_train, y_train)
preds = model.predict(X_test)

# Inspect learned bandwidths
print(model.kernel_bandwidths)
Riemannian Metric-Adapted DK Variants¶
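Nine additional models combine metric adaptation (global/local/subspace) with kernel distance learning on Riemannian manifolds, completing a 3×3 grid (Gaussian/Relevance/Matrix kernels × SMNG/SLNG/STNG bases). As in the tangent-space formulas above, \(v = \text{Log}_{w_k}(x)_{\text{flat}}\); in the STNG variants, \(r = (I - \Omega_k\Omega_k^T) \cdot v\) is the residual left after removing the prototype's local subspace: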
Gaussian kernel variants (per-prototype \(\sigma_k\)):
RiemannianDKSMNG — \(d_\kappa^2 = 2(1 - \exp(-\|\Omega \cdot v\|^2 / 2\sigma_k^2))\)
RiemannianDKSLNG — \(d_\kappa^2 = 2(1 - \exp(-\|\Omega_k \cdot v\|^2 / 2\sigma_k^2))\)
RiemannianDKSTNG — \(d_\kappa^2 = 2(1 - \exp(-\|(I - \Omega_k\Omega_k^T) \cdot v\|^2 / 2\sigma_k^2))\)
Relevance kernel variants (\(\sigma_k\) + relevance \(\lambda\)):
RiemannianDKRSMNG — \(d_\kappa^2 = 2(1 - \exp(-\sum_j \lambda_j (\Omega \cdot v)_j^2 / 2\sigma_k^2))\)
RiemannianDKRSLNG — \(d_\kappa^2 = 2(1 - \exp(-\sum_j \lambda_j (\Omega_k \cdot v)_j^2 / 2\sigma_k^2))\)
RiemannianDKRSTNG — \(d_\kappa^2 = 2(1 - \exp(-\sum_j \lambda_j r_j^2 / 2\sigma_k^2))\)
Matrix kernel variants (exponential with \(\hat\Lambda = \hat\Omega\hat\Omega^T\)):
RiemannianDKMSMNG — \(d_\kappa^2 = \exp((\Omega \cdot v)^T \hat\Lambda (\Omega \cdot v)) - 1\)
RiemannianDKMSLNG — \(d_\kappa^2 = \exp((\Omega_k \cdot v)^T \hat\Lambda (\Omega_k \cdot v)) - 1\)
RiemannianDKMSTNG — \(d_\kappa^2 = \exp(r^T \hat\Lambda r) - 1\)
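To show how the metric-adapted bases feed into the Gaussian kernel, here is a small sketch operating on an already-flattened tangent vector \(v\). It reads \(r\) as the residual \((I - \Omega_k\Omega_k^T) \cdot v\) from the STNG line above, and assumes \(\Omega\) is stored as (latent_dim, d) for the projected variants and \(\Omega_k\) as (d, k) with orthonormal columns for the subspace variant; the shapes, names, and numbers are illustrative only.

```python
import math

def matvec(m, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def sq_norm(v):
    return sum(a * a for a in v)

def gaussian_from_sq(sq, sigma):
    """2 * (1 - exp(-sq / (2 sigma^2))) for a precomputed squared norm."""
    return 2.0 * (1.0 - math.exp(-sq / (2.0 * sigma ** 2)))

def residual(omega_k, v):
    """r = (I - Omega_k Omega_k^T) v: the part of v outside the local subspace."""
    coeffs = matvec(transpose(omega_k), v)    # Omega_k^T v
    inside = matvec(omega_k, coeffs)          # Omega_k Omega_k^T v
    return [a - b for a, b in zip(v, inside)]

v = [3.0, 4.0, 0.0]                             # flattened tangent vector
omega = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.0]]      # hypothetical (2, 3) projection
omega_k = [[1.0], [0.0], [0.0]]                 # hypothetical (3, 1) subspace basis

# SMNG/SLNG style: kernel on the projected vector Omega . v
d_projected = gaussian_from_sq(sq_norm(matvec(omega, v)), sigma=2.0)

# STNG style: kernel on the residual outside the subspace
r = residual(omega_k, v)
d_residual = gaussian_from_sq(sq_norm(r), sigma=2.0)
print(d_projected, r, d_residual)
```

The projected variants measure what the learned map keeps, while the subspace variant measures what the local subspace fails to explain.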
from prosemble.core.manifolds import Grassmannian
from prosemble.models import RiemannianDKRSMNG

# X_train, y_train, X_test: manifold-valued data and labels (not defined in this snippet)
manifold = Grassmannian(4, 2)
model = RiemannianDKRSMNG(
    manifold=manifold, n_prototypes_per_class=2,
    max_iter=100, lr=0.01, use_scan=False,
)
model.fit(X_train, y_train)
preds = model.predict(X_test)

print(model.kernel_bandwidths)
print(model.relevance_profile)
ONNX Export¶
All 15 differentiating kernel models support ONNX export. The three kernel distance types are implemented as native ONNX subgraphs:
from prosemble.models import DKGLVQ
from prosemble.core.onnx_export import export_onnx

# X_train, y_train: training data and labels (not defined in this snippet)
model = DKGLVQ(n_prototypes_per_class=2, max_iter=100, lr=0.01)
model.fit(X_train, y_train)

# Export to ONNX; the kernel distance is computed in the graph
onnx_model = export_onnx(model, path='dkglvq.onnx')
Per-prototype bandwidths \(\sigma_k\) are clamped at export time
(sigma_min), and relevance logits are normalized via a Softmax node in
the ONNX graph. See the ONNX Export guide for full details.