Differentiating Kernel Models¶

Differentiating kernel models replace the Euclidean distance used in prototype-based learning with a kernel-induced distance. In the supervised and one-class variants, the kernel parameters are adapted by gradient descent alongside the prototypes, so the model learns a non-linear similarity measure from the data instead of relying on a fixed one.

Mathematical Background¶

Gaussian Kernel Distance¶

For a Gaussian kernel with bandwidth \(\sigma\):

\[\kappa(x, w) = \exp\!\left(-\frac{\|x - w\|^2}{2\sigma^2}\right)\]

the induced distance in feature space is:

\[d_\kappa^2(x, w) = \|\phi(x) - \phi(w)\|^2 = 2\bigl(1 - \kappa(x, w)\bigr) = 2\left(1 - \exp\!\left( -\frac{\|x - w\|^2}{2\sigma^2} \right)\right)\]

This squared distance lies in \([0, 2)\) regardless of input magnitude: a sample far from every prototype contributes at most 2, which limits the influence of outliers.
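The saturation can be checked numerically. The following is a minimal sketch using only the standard library; the function name `gaussian_kernel_sq_dist` is illustrative and is not part of prosemble.

```python
import math

def gaussian_kernel_sq_dist(x, w, sigma):
    """Squared feature-space distance 2 * (1 - kappa(x, w)) for a Gaussian kernel."""
    sq_euclid = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    kappa = math.exp(-sq_euclid / (2.0 * sigma ** 2))
    return 2.0 * (1.0 - kappa)

w = [0.0, 0.0]
near = gaussian_kernel_sq_dist([0.1, 0.0], w, sigma=1.0)
far = gaussian_kernel_sq_dist([10.0, 0.0], w, sigma=1.0)
very_far = gaussian_kernel_sq_dist([1000.0, 0.0], w, sigma=1.0)

# The Euclidean distance grows by a factor of 100 between the last two points,
# but the kernel distance has already saturated near 2.
print(near, far, very_far)
```

In floating point the saturated values round to exactly 2.0, even though the mathematical bound is strict.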

Relevance-Weighted Kernel Distance¶

Adding per-feature relevance weights \(\lambda_j = \text{softmax}(\text{relevances})_j\):

\[d_\kappa^2(x, w_k) = 2\left(1 - \exp\!\left( -\frac{\sum_j \lambda_j (x_j - w_{kj})^2}{2\sigma_k^2} \right)\right)\]

This combines feature selection with kernel distance, identifying which input dimensions are most important for classification.
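As a rough illustration of how the relevance weights reshape the distance, the sketch below applies a softmax to raw relevance logits and plugs the result into the formula above. The helper names and the example logits are made up for illustration and do not reflect prosemble internals.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relevance_kernel_sq_dist(x, w, relevance_logits, sigma):
    """2 * (1 - exp(-sum_j lambda_j (x_j - w_j)^2 / (2 sigma^2)))."""
    lam = softmax(relevance_logits)
    weighted = sum(l * (xi - wi) ** 2 for l, xi, wi in zip(lam, x, w))
    return 2.0 * (1.0 - math.exp(-weighted / (2.0 * sigma ** 2)))

x = [1.0, 5.0]
w = [0.0, 0.0]

# Equal logits: both features count, and the large gap in feature 2 dominates.
uniform = relevance_kernel_sq_dist(x, w, [0.0, 0.0], sigma=1.0)
# Strongly favouring feature 1 effectively switches feature 2 off.
first_only = relevance_kernel_sq_dist(x, w, [10.0, -10.0], sigma=1.0)
print(uniform, first_only)
```

With the second feature suppressed, the distance is governed almost entirely by the first coordinate, which is the feature-selection effect described above.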

Exponential Kernel Distance¶

The exponential kernel uses a learned transformation matrix \(\hat\Lambda = \hat\Omega \hat\Omega^T\):

\[\kappa_{\exp}(x, w) = \exp\!\bigl(x^T \hat\Lambda\, w\bigr)\]

Unlike the Gaussian kernel, the self-similarity \(\kappa_{\exp}(v, v) = \exp(v^T \hat\Lambda\, v)\) is generally not equal to 1, so the full three-term distance formula is required:

\[d_\kappa^2(x, w) = \exp\!\bigl(x^T \hat\Lambda\, x\bigr) + \exp\!\bigl(w^T \hat\Lambda\, w\bigr) - 2\exp\!\bigl(x^T \hat\Lambda\, w\bigr)\]
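A small numerical sketch of the three-term formula, assuming an arbitrary symmetric positive semi-definite \(\hat\Lambda\) chosen purely for illustration (the function names are hypothetical):

```python
import math

def quad_form(a, lam, b):
    """Compute a^T Lambda b for a list-of-lists matrix Lambda."""
    return sum(a[i] * lam[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

def exp_kernel_sq_dist(x, w, lam):
    """kappa(x, x) + kappa(w, w) - 2 * kappa(x, w) with kappa = exp(x^T Lambda w)."""
    return (math.exp(quad_form(x, lam, x))
            + math.exp(quad_form(w, lam, w))
            - 2.0 * math.exp(quad_form(x, lam, w)))

lam_hat = [[0.02, 0.01],
           [0.01, 0.05]]  # illustrative symmetric PSD matrix, not a learned one
x = [1.0, 2.0]
w = [0.5, -1.0]

self_kernel = math.exp(quad_form(x, lam_hat, x))  # differs from 1 in general
d_xw = exp_kernel_sq_dist(x, w, lam_hat)
d_xx = exp_kernel_sq_dist(x, x, lam_hat)          # distance of a point to itself
print(self_kernel, d_xw, d_xx)
```

The self-similarity terms are what keep the distance of a point to itself at zero; dropping them, as the Gaussian shortcut \(2(1 - \kappa)\) does, would not be valid here.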

Supervised Models¶

DKGLVQ¶

Differentiating Kernel GLVQ. Each prototype \(w_k\) has a learnable bandwidth \(\sigma_k\) adapted via gradient descent.

from prosemble.models import DKGLVQ
from prosemble.datasets import load_iris_jax

dataset = load_iris_jax()
X, y = dataset.input_data, dataset.target

model = DKGLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    sigma_init='median',   # per-class median distance initialization
    sigma_min=1e-3,        # prevent bandwidth collapse
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Accuracy: {(preds == y).mean():.2%}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")

The sigma_init parameter controls initialization:

'median' (default): per-class median distance from prototype to class members
'mean': per-class mean distance
float: fixed value for all prototypes
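To make the 'median' option concrete, here is a rough sketch of what a per-class median heuristic could look like. It is an assumption about the general idea, not prosemble's actual initialization code, and `median_bandwidth` is a hypothetical helper.

```python
import math
import statistics

def median_bandwidth(prototype, class_members):
    """Median Euclidean distance from one prototype to the samples of its class."""
    dists = [math.dist(prototype, x) for x in class_members]
    return statistics.median(dists)

# Toy class with three members at distances 0, 5 and 10 from the prototype.
members = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
sigma0 = median_bandwidth([0.0, 0.0], members)

# A floor in the spirit of sigma_min keeps the bandwidth away from zero.
sigma0 = max(sigma0, 1e-3)
print(sigma0)
```

A median is less sensitive to a few distant class members than a mean, which is one plausible reason for it being the default.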

DKGRLVQ¶

Differentiating Kernel GRLVQ. Combines per-feature relevance weighting with per-prototype kernel bandwidth adaptation.

from prosemble.models import DKGRLVQ

model = DKGRLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    sigma_init='median',
    sigma_min=1e-3,
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Accuracy: {(preds == y).mean():.2%}")
print(f"Relevance profile: {model.relevance_profile}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")

The relevance_profile property returns the normalized feature relevance weights \(\lambda = \text{softmax}(\text{relevances})\), identifying which features are most discriminative.

DKGMLVQ¶

Differentiating Kernel GMLVQ with the exponential kernel. Learns a global transformation matrix \(\hat\Omega\) of shape (d, latent_dim).

from prosemble.models import DKGMLVQ

model = DKGMLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    latent_dim=None,          # defaults to input dim
    omega_hat_scale=0.1,      # small init prevents exp overflow
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Omega hat shape: {model.omega_hat_matrix.shape}")
print(f"Lambda hat shape: {model.lambda_hat_matrix.shape}")

The lambda_hat_matrix property returns the symmetric positive semi-definite matrix \(\hat\Lambda = \hat\Omega \hat\Omega^T\), which can be analyzed for feature correlations learned by the model.
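The positive semi-definiteness follows from the identity \(x^T \hat\Lambda\, x = \|\hat\Omega^T x\|^2\). A minimal check with a made-up \(\hat\Omega\) of shape (3, 2), written with plain lists instead of the library's arrays:

```python
def lambda_hat_from_omega(omega):
    """Return Omega @ Omega^T for a (d, latent_dim) list-of-lists matrix."""
    d = len(omega)
    return [[sum(a * b for a, b in zip(omega[i], omega[j])) for j in range(d)]
            for i in range(d)]

omega_hat = [[0.1, 0.0],
             [0.2, -0.1],
             [0.0, 0.3]]  # d = 3, latent_dim = 2, illustrative values
lam_hat = lambda_hat_from_omega(omega_hat)

x = [1.0, -2.0, 0.5]
# Quadratic form x^T Lambda x computed directly ...
quad = sum(x[i] * lam_hat[i][j] * x[j] for i in range(3) for j in range(3))
# ... and as the squared norm of the projection Omega^T x.
proj = [sum(omega_hat[i][k] * x[i] for i in range(3)) for k in range(2)]
sq_norm = sum(p * p for p in proj)
print(quad, sq_norm)
```

Both numbers agree, and since a squared norm cannot be negative, neither can the quadratic form.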

One-Class Models¶

The one-class differentiating kernel models combine OC-GLVQ’s \(\theta\)-based hypothesis testing with kernel distances. In standard OC-GLVQ, the classifier function is:

\[\mu_{k^*}(x_i) = s_i \cdot \frac{d(x_i, w_{k^*}) - \theta_{k^*}}{d(x_i, w_{k^*}) + \theta_{k^*}}\]

where \(k^*\) is the nearest prototype, \(\theta_{k^*}\) is a learned per-prototype visibility threshold, and \(s_i = +1\) for target, \(-1\) for outlier. The OC-DK variants replace the Euclidean distance \(d\) with kernel distances.

Critical design detail: The \(\theta_k\) thresholds are initialized in kernel distance scale, not Euclidean scale. Squared Gaussian kernel distances lie in \([0, 2)\), so thresholds initialized from Euclidean distances would typically be far too large.
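The effect of a mis-scaled threshold can be seen by evaluating the classifier function directly. This is a sketch with invented numbers; `oc_glvq_mu` simply transcribes the formula above and is not a prosemble function.

```python
def oc_glvq_mu(d, theta, s):
    """Classifier function s * (d - theta) / (d + theta)."""
    return s * (d - theta) / (d + theta)

# Kernel distances can never reach 2, whatever the input looks like.
kernel_dists = [0.05, 0.8, 1.9]

theta_euclid = 5.0   # a threshold on a typical Euclidean scale (illustrative)
theta_kernel = 0.5   # a threshold on the kernel distance scale (illustrative)

mu_euclid = [oc_glvq_mu(d, theta_euclid, s=+1) for d in kernel_dists]
mu_kernel = [oc_glvq_mu(d, theta_kernel, s=+1) for d in kernel_dists]
print(mu_euclid)  # every value is negative: d - theta < 0 for all inputs
print(mu_kernel)  # the sign changes between near and far samples
```

With the Euclidean-scale threshold the sign of \(d - \theta\) is the same for every possible input, so the threshold separates nothing until training has moved it into the \([0, 2)\) range.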

OCDKGLVQ¶

One-class classification with Gaussian kernel distance and per-prototype bandwidth adaptation.

from prosemble.models import OCDKGLVQ
import jax
import jax.numpy as jnp

# Generate one-class dataset
key = jax.random.PRNGKey(42)
k1, k2 = jax.random.split(key)
X_target = jax.random.normal(k1, (100, 4)) * 0.5
X_outlier = jax.random.normal(k2, (30, 4)) * 0.5 + 3.0
X = jnp.concatenate([X_target, X_outlier])
y = jnp.concatenate([jnp.zeros(100, dtype=jnp.int32),
                     jnp.ones(30, dtype=jnp.int32)])

model = OCDKGLVQ(
    n_prototypes=3,
    max_iter=100,
    lr=0.01,
    sigma_init='median',
    sigma_min=1e-3,
    target_label=0,
    random_seed=42,
)
model.fit(X, y)

scores = model.decision_function(X)
preds = model.predict(X)
print(f"Learned bandwidths: {model.kernel_bandwidths}")
print(f"Visibility radii: {model.visibility_radii}")

OCDKGRLVQ¶

One-class classification with relevance-weighted kernel distance, per-prototype bandwidth, and per-feature relevance learning.

from prosemble.models import OCDKGRLVQ

model = OCDKGRLVQ(
    n_prototypes=3,
    max_iter=100,
    lr=0.01,
    sigma_init='median',
    sigma_min=1e-3,
    target_label=0,
    random_seed=42,
)
model.fit(X, y)

scores = model.decision_function(X)
print(f"Relevance profile: {model.relevance_profile}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")

The relevance_profile property returns the softmax-normalized per-feature weights, identifying which features are most important for the one-class boundary.

OCDKGMLVQ¶

One-class classification with exponential kernel distance and a learned transformation matrix \(\hat\Omega\).

from prosemble.models import OCDKGMLVQ

model = OCDKGMLVQ(
    n_prototypes=3,
    max_iter=100,
    lr=0.01,
    latent_dim=None,
    omega_hat_scale=0.1,
    target_label=0,
    random_seed=42,
)
model.fit(X, y)

scores = model.decision_function(X)
print(f"Omega hat shape: {model.omega_hat_matrix.shape}")
print(f"Lambda hat shape: {model.lambda_hat_matrix.shape}")

Supervised Models with Neural Gas¶

The supervised DK-NG variants combine differentiating kernel distances with Neural Gas class-aware neighborhood cooperation. All same-class prototypes participate in the loss, weighted by their distance rank:

\[h_k = \exp\left(-\frac{\text{rank}_k}{\gamma}\right), \quad \text{only for } \text{label}(w_k) = \text{label}(x)\]

where \(\gamma\) decays during training from gamma_init to gamma_final, and the GLVQ margin is computed per prototype:

\[\mu_k = \frac{d_\kappa(x, w_k) - d^-}{d_\kappa(x, w_k) + d^-}\]

with \(d^-\) being the nearest different-class prototype distance.
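The two formulas can be combined in a few lines. The sketch below assumes that ranks start at 0 for the closest same-class prototype, which the text does not state, and uses invented distances; `ng_weighted_margins` is a hypothetical helper.

```python
import math

def ng_weighted_margins(same_class_dists, d_minus, gamma):
    """Rank same-class prototypes by distance, then compute h_k and mu_k."""
    order = sorted(range(len(same_class_dists)), key=lambda k: same_class_dists[k])
    ranks = [0] * len(same_class_dists)
    for rank, k in enumerate(order):
        ranks[k] = rank  # assumption: the closest prototype gets rank 0
    h = [math.exp(-r / gamma) for r in ranks]
    mu = [(d - d_minus) / (d + d_minus) for d in same_class_dists]
    return ranks, h, mu

# Kernel distances from one sample to three prototypes of its own class,
# and to the nearest prototype of a different class (d_minus).
ranks, h, mu = ng_weighted_margins([0.9, 0.2, 0.5], d_minus=0.6, gamma=1.5)
print(ranks)  # [2, 0, 1]
print(h)
print(mu)
```

Prototypes that are closer than the nearest wrong-class prototype get a negative margin, and the rank weight decides how strongly each margin counts.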

from prosemble.models import DKGLVQ_NG

model = DKGLVQ_NG(
    n_prototypes_per_class=3,
    max_iter=100,
    lr=0.01,
    sigma_init='median',
    gamma_init=1.5,
    gamma_final=0.01,
    random_seed=42,
)
model.fit(X, y)
preds = model.predict(X)
print(f"Final gamma: {model.gamma_}")

The relevance-weighted variant (DKGRLVQ_NG) adds per-feature relevance weights, while the matrix variant (DKGMLVQ_NG) uses exponential kernel distance with learnable \(\hat\Omega\) transformation.

One-Class Models with Neural Gas¶

The OC-DK-NG variants extend the one-class kernel models with Neural Gas neighborhood cooperation. Instead of only the nearest prototype contributing to the loss, all prototypes participate weighted by their distance rank:

\[h_k = \exp\left(-\frac{\text{rank}_k}{\gamma}\right)\]

where \(\gamma\) decays during training from gamma_init to gamma_final.
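To see what the decay of \(\gamma\) does, the weights can be compared at the two ends of the range used in the example configuration. The shape of the decay schedule itself is not specified here, so this sketch only evaluates the endpoints; ranks starting at 0 are an assumption.

```python
import math

def ng_weights(n_prototypes, gamma):
    """Neighborhood weights exp(-rank / gamma) for ranks 0 .. n_prototypes - 1."""
    return [math.exp(-rank / gamma) for rank in range(n_prototypes)]

early = ng_weights(3, gamma=1.5)    # wide cooperation at the start
late = ng_weights(3, gamma=0.01)    # effectively winner-takes-all at the end
print(early)
print(late)
```

Early in training every prototype receives a noticeable share of the update; by the end only the best-ranked prototype matters, which recovers the nearest-prototype behaviour of the plain one-class models.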

from prosemble.models import OCDKGLVQ_NG

model = OCDKGLVQ_NG(
    n_prototypes=3,
    max_iter=100,
    lr=0.01,
    sigma_init='median',
    gamma_init=1.5,
    gamma_final=0.01,
    target_label=0,
    random_seed=42,
)
model.fit(X, y)
print(f"Final gamma: {model.gamma_}")

The relevance-weighted variant (OCDKGRLVQ_NG) and matrix variant (OCDKGMLVQ_NG) follow the same pattern, combining their respective kernel distances with NG cooperation.

Unsupervised Models¶

The unsupervised kernel models use the Gaussian kernel distance for prototype ranking and BMU selection, but \(\sigma\) is a fixed hyperparameter (not learned). Prototypes live in the original data space — only the distance metric changes.

DKNeuralGas¶

Neural Gas with Gaussian kernel distance for ranking.

from prosemble.models import DKNeuralGas
from prosemble.datasets import load_iris_jax

dataset = load_iris_jax()
X = dataset.input_data

model = DKNeuralGas(
    n_prototypes=10,
    kernel_sigma=1.0,
    max_iter=100,
    lr_init=0.5,
    lr_final=0.01,
    lambda_final=0.01,
    random_seed=42,
)
model.fit(X)

labels = model.predict(X)
print(f"Energy: {model.loss_:.4f}")

DKKohonenSOM¶

Kohonen SOM with Gaussian kernel distance for BMU selection. The grid neighborhood is unchanged — only the data-space metric changes.

from prosemble.models import DKKohonenSOM

model = DKKohonenSOM(
    grid_height=5,
    grid_width=5,
    kernel_sigma=1.0,      # fixed Gaussian kernel bandwidth (not learned)
    sigma_init=2.0,
    sigma_final=0.5,
    lr_init=0.5,
    lr_final=0.01,
    max_iter=100,
    random_seed=42,
)
model.fit(X)

bmu_coords = model.bmu_map(X)
print(f"BMU coordinates shape: {bmu_coords.shape}")

DKHeskesSOM¶

Heskes SOM with Gaussian kernel distance. The Heskes BMU criterion selects the unit whose entire neighborhood best represents the sample:

\[c^*(x) = \arg\min_c \sum_k h(k, c) \cdot d_\kappa^2(x, w_k)\]
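A minimal sketch of this criterion follows. It assumes a Gaussian neighborhood function on grid coordinates and uses a tiny one-dimensional grid; the function names and the `nbh_width` parameter are illustrative and not taken from prosemble.

```python
import math

def kernel_sq_dist(x, w, sigma):
    """Gaussian kernel distance 2 * (1 - exp(-||x - w||^2 / (2 sigma^2)))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, w))
    return 2.0 * (1.0 - math.exp(-sq / (2.0 * sigma ** 2)))

def heskes_bmu(x, prototypes, grid_pos, kernel_sigma, nbh_width):
    """Return argmin_c sum_k h(k, c) * d_kappa^2(x, w_k) and all unit costs."""
    d2 = [kernel_sq_dist(x, w, kernel_sigma) for w in prototypes]
    costs = []
    for c in range(len(prototypes)):
        cost = 0.0
        for k in range(len(prototypes)):
            grid_d2 = sum((a - b) ** 2 for a, b in zip(grid_pos[k], grid_pos[c]))
            h = math.exp(-grid_d2 / (2.0 * nbh_width ** 2))  # assumed Gaussian h
            cost += h * d2[k]
        costs.append(cost)
    return min(range(len(costs)), key=costs.__getitem__), costs

prototypes = [[0.0], [1.0], [5.0]]
grid_pos = [(0, 0), (0, 1), (0, 2)]  # three units in a row
bmu, costs = heskes_bmu([0.9], prototypes, grid_pos, kernel_sigma=1.0, nbh_width=0.5)
print(bmu, costs)
```

Each candidate unit is charged for the distances of its grid neighbors as well as its own, so a unit with a good prototype but poorly matching neighbors can lose to one whose whole neighborhood fits the sample.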

from prosemble.models import DKHeskesSOM

model = DKHeskesSOM(
    grid_height=5,
    grid_width=5,
    kernel_sigma=1.0,
    sigma_init=2.0,
    sigma_final=0.5,
    max_iter=100,
    random_seed=42,
)
model.fit(X)

bmu_coords = model.bmu_map(X)
print(f"Energy: {model.loss_:.4f}")

Choosing a Model¶

Model	Kernel	Learned or Annealed Params	Best For
DKGLVQ	Gaussian	\(w_k, \sigma_k\)	Per-prototype bandwidth adaptation
DKGRLVQ	Gaussian (weighted)	\(w_k, \sigma_k, \lambda\)	Feature selection + kernel adaptation
DKGMLVQ	Exponential	\(w_k, \hat\Omega\)	Full metric adaptation in kernel space
DKGLVQ_NG	Gaussian	\(w_k, \sigma_k, \gamma\)	Supervised kernel + NG cooperation
DKGRLVQ_NG	Gaussian (weighted)	\(w_k, \sigma_k, \lambda, \gamma\)	Supervised kernel + relevances + NG cooperation
DKGMLVQ_NG	Exponential	\(w_k, \hat\Omega, \gamma\)	Supervised kernel matrix + NG cooperation
OCDKGLVQ	Gaussian	\(w_k, \sigma_k, \theta_k\)	One-class with kernel bandwidth adaptation
OCDKGRLVQ	Gaussian (weighted)	\(w_k, \sigma_k, \lambda, \theta_k\)	One-class with feature selection + kernel
OCDKGMLVQ	Exponential	\(w_k, \hat\Omega, \theta_k\)	One-class with full metric adaptation in kernel space
OCDKGLVQ_NG	Gaussian	\(w_k, \sigma_k, \theta_k, \gamma\)	One-class kernel + NG cooperation
OCDKGRLVQ_NG	Gaussian (weighted)	\(w_k, \sigma_k, \lambda, \theta_k, \gamma\)	One-class kernel + relevances + NG cooperation
OCDKGMLVQ_NG	Exponential	\(w_k, \hat\Omega, \theta_k, \gamma\)	One-class kernel matrix + NG cooperation
DKNeuralGas	Gaussian (fixed \(\sigma\))	\(w_k\)	Unsupervised clustering with kernel distance
DKKohonenSOM	Gaussian (fixed \(\sigma\))	\(w_k\)	SOM visualization with kernel distance
DKHeskesSOM	Gaussian (fixed \(\sigma\))	\(w_k\)	Principled SOM with kernel distance

Riemannian Variants¶

Differentiating kernel distances can also be applied on Riemannian manifolds. Three models combine the RiemannianSRNG framework (prototypes on manifold, NG rank cooperation, manifold projection) with kernel distance formulas:

RiemannianDKGLVQ — Gaussian kernel on geodesic distance: \(d_\kappa^2(x, w_k) = 2(1 - \exp(-d_{\text{geo}}^2(x, w_k) / 2\sigma_k^2))\)
RiemannianDKGRLVQ — Relevance-weighted kernel in tangent space: \(d_\kappa^2(x, w_k) = 2(1 - \exp(-\sum_j \lambda_j v_j^2 / 2\sigma_k^2))\) where \(v = \text{Log}_{w_k}(x)_{\text{flat}}\)
RiemannianDKGMLVQ — Exponential kernel in tangent space: \(d_\kappa^2(x, w_k) = \exp(v^T \hat\Lambda v) - 1\) where \(\hat\Lambda = \hat\Omega \hat\Omega^T\)

All three support SO(n), SPD(n), and Grassmannian(n,k) manifolds.
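One way to read the RiemannianDKGMLVQ formula, assuming the prototype is represented by the origin of its own tangent space (so that \(\text{Log}_{w_k}(w_k) = 0\)), is as the three-term exponential kernel distance with \(x\) replaced by \(v\) and \(w\) replaced by the zero vector:

\[d_\kappa^2(x, w_k) = \exp\!\bigl(v^T \hat\Lambda\, v\bigr) + \exp\!\bigl(0^T \hat\Lambda\, 0\bigr) - 2\exp\!\bigl(v^T \hat\Lambda\, 0\bigr) = \exp\!\bigl(v^T \hat\Lambda\, v\bigr) + 1 - 2 = \exp\!\bigl(v^T \hat\Lambda\, v\bigr) - 1\]

Since \(\hat\Lambda = \hat\Omega \hat\Omega^T\) is positive semi-definite, \(v^T \hat\Lambda\, v \geq 0\), so this distance is non-negative and vanishes exactly when \(\hat\Omega^T v = 0\).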

from prosemble.core.manifolds import SO
from prosemble.models import RiemannianDKGLVQ

manifold = SO(3)
model = RiemannianDKGLVQ(
    manifold=manifold, n_prototypes_per_class=2,
    max_iter=100, lr=0.01, use_scan=False,
)
# X_train, y_train, X_test: SO(3) samples and labels prepared elsewhere (not defined here)
model.fit(X_train, y_train)
preds = model.predict(X_test)

# Inspect learned bandwidths
print(model.kernel_bandwidths)

Riemannian Metric-Adapted DK Variants¶

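Nine additional models combine metric adaptation (global, local, or subspace) with kernel distance learning on Riemannian manifolds, completing a 3 × 3 grid of kernel type (Gaussian, relevance, matrix) by base model (SMNG, SLNG, STNG). In the formulas below, \(v\) is the flattened log-map vector \(\text{Log}_{w_k}(x)_{\text{flat}}\), and \(r = (I - \Omega_k\Omega_k^T) \cdot v\) is the subspace residual that appears in the STNG variants.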

Gaussian kernel variants (per-prototype \(\sigma_k\)):

RiemannianDKSMNG — \(d_\kappa^2 = 2(1 - \exp(-\|\Omega \cdot v\|^2 / 2\sigma_k^2))\)
RiemannianDKSLNG — \(d_\kappa^2 = 2(1 - \exp(-\|\Omega_k \cdot v\|^2 / 2\sigma_k^2))\)
RiemannianDKSTNG — \(d_\kappa^2 = 2(1 - \exp(-\|(I - \Omega_k\Omega_k^T) \cdot v\|^2 / 2\sigma_k^2))\)

Relevance kernel variants (\(\sigma_k\) + relevance \(\lambda\)):

RiemannianDKRSMNG — \(d_\kappa^2 = 2(1 - \exp(-\sum_j \lambda_j (\Omega \cdot v)_j^2 / 2\sigma_k^2))\)
RiemannianDKRSLNG — \(d_\kappa^2 = 2(1 - \exp(-\sum_j \lambda_j (\Omega_k \cdot v)_j^2 / 2\sigma_k^2))\)
RiemannianDKRSTNG — \(d_\kappa^2 = 2(1 - \exp(-\sum_j \lambda_j r_j^2 / 2\sigma_k^2))\)

Matrix kernel variants (exponential with \(\hat\Lambda = \hat\Omega\hat\Omega^T\)):

RiemannianDKMSMNG — \(d_\kappa^2 = \exp((\Omega \cdot v)^T \hat\Lambda (\Omega \cdot v)) - 1\)
RiemannianDKMSLNG — \(d_\kappa^2 = \exp((\Omega_k \cdot v)^T \hat\Lambda (\Omega_k \cdot v)) - 1\)
RiemannianDKMSTNG — \(d_\kappa^2 = \exp(r^T \hat\Lambda r) - 1\)
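The subspace (STNG) formulas can be illustrated with a tangent vector and a one-dimensional subspace. The sketch assumes that \(\Omega_k\) has orthonormal columns, which the formulas suggest but do not state, and treats \(v\) as a plain list; both helper names are hypothetical.

```python
import math

def tangent_residual(v, omega_k):
    """r = (I - Omega_k Omega_k^T) v for Omega_k with orthonormal columns."""
    n_cols = len(omega_k[0])
    coeffs = [sum(omega_k[i][c] * v[i] for i in range(len(v)))
              for c in range(n_cols)]
    return [v[i] - sum(omega_k[i][c] * coeffs[c] for c in range(n_cols))
            for i in range(len(v))]

def gaussian_from_sq_norm(sq_norm, sigma_k):
    """2 * (1 - exp(-sq_norm / (2 sigma_k^2)))."""
    return 2.0 * (1.0 - math.exp(-sq_norm / (2.0 * sigma_k ** 2)))

omega_k = [[1.0], [0.0], [0.0]]  # subspace spanned by the first axis
v_in = [2.0, 0.0, 0.0]           # lies inside the subspace
v_off = [2.0, 1.0, 0.0]          # has a component orthogonal to it

r_in = tangent_residual(v_in, omega_k)
r_off = tangent_residual(v_off, omega_k)
d_in = gaussian_from_sq_norm(sum(c * c for c in r_in), sigma_k=1.0)
d_off = gaussian_from_sq_norm(sum(c * c for c in r_off), sigma_k=1.0)
print(r_in, d_in)
print(r_off, d_off)
```

Only the part of \(v\) that leaves the prototype's subspace is penalized, so displacement along the subspace costs nothing.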

from prosemble.core.manifolds import Grassmannian
from prosemble.models import RiemannianDKRSMNG

manifold = Grassmannian(4, 2)
model = RiemannianDKRSMNG(
    manifold=manifold, n_prototypes_per_class=2,
    max_iter=100, lr=0.01, use_scan=False,
)
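# X_train, y_train, X_test: Grassmannian(4, 2) samples and labels prepared elsewhere (not defined here)
model.fit(X_train, y_train)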
preds = model.predict(X_test)
print(model.kernel_bandwidths)
print(model.relevance_profile)

ONNX Export¶

All 15 differentiating kernel models support ONNX export. The three kernel distance types are implemented as native ONNX subgraphs:

from prosemble.models import DKGLVQ
from prosemble.core.onnx_export import export_onnx

model = DKGLVQ(n_prototypes_per_class=2, max_iter=100, lr=0.01)
# X_train, y_train: training data prepared elsewhere (not defined in this snippet)
model.fit(X_train, y_train)

# Export to ONNX — kernel distance computed in the graph
onnx_model = export_onnx(model, path='dkglvq.onnx')

Per-prototype bandwidths \(\sigma_k\) are clamped at export time (sigma_min), and relevance logits are normalized via a Softmax node in the ONNX graph. See the ONNX Export guide for full details.
