Differentiating Kernel Models

Differentiating kernel models replace Euclidean distances with kernel-induced distances in prototype-based learning. In the supervised and one-class variants, the kernel parameters are adapted via gradient descent alongside the prototypes, so the model learns a non-linear similarity measure from the data. The unsupervised variants keep the kernel bandwidth fixed.

Mathematical Background

Gaussian Kernel Distance

For a Gaussian kernel with bandwidth \(\sigma\):

\[\kappa(x, w) = \exp\!\left(-\frac{\|x - w\|^2}{2\sigma^2}\right)\]

the induced distance in feature space is:

\[d_\kappa^2(x, w) = \|\phi(x) - \phi(w)\|^2 = 2\bigl(1 - \kappa(x, w)\bigr) = 2\left(1 - \exp\!\left( -\frac{\|x - w\|^2}{2\sigma^2} \right)\right)\]

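Because \(0 < \kappa(x, w) \le 1\), the squared distance lies in \([0, 2)\) regardless of input magnitude. A sample arbitrarily far from a prototype therefore contributes less than 2, which limits the influence of outliers.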
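The bound is easy to check numerically. The following is a minimal NumPy sketch of the formula above, not the library's implementation; the function name `gaussian_kernel_distance_sq` is illustrative.

```python
import numpy as np

def gaussian_kernel_distance_sq(x, w, sigma):
    """Squared feature-space distance 2 * (1 - kappa(x, w)) for a Gaussian kernel."""
    sq_euclid = np.sum((x - w) ** 2)
    return 2.0 * (1.0 - np.exp(-sq_euclid / (2.0 * sigma ** 2)))

w = np.zeros(2)
near = gaussian_kernel_distance_sq(np.array([0.5, 0.0]), w, sigma=1.0)
far = gaussian_kernel_distance_sq(np.array([50.0, 0.0]), w, sigma=1.0)
huge = gaussian_kernel_distance_sq(np.array([5000.0, 0.0]), w, sigma=1.0)

# The distance saturates: moving the sample 100x further away changes nothing.
print(near, far, huge)
```

Both `far` and `huge` saturate at the upper bound, which is why a distant outlier cannot dominate a sum of such distances.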

Relevance-Weighted Kernel Distance

Adding per-feature relevance weights \(\lambda_j = \text{softmax}(\text{relevances})_j\):

\[d_\kappa^2(x, w_k) = 2\left(1 - \exp\!\left( -\frac{\sum_j \lambda_j (x_j - w_{kj})^2}{2\sigma_k^2} \right)\right)\]

This combines feature selection with kernel distance, identifying which input dimensions are most important for classification.
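As an illustration of how the relevance weights reshape the distance, here is a small NumPy sketch of the formula above. The helper names are hypothetical and the relevance logits are hand-picked rather than learned.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def relevance_kernel_distance_sq(x, w, relevances, sigma):
    """Gaussian kernel distance on a relevance-weighted squared difference."""
    lam = softmax(relevances)                # lambda_j, non-negative and summing to 1
    weighted = np.sum(lam * (x - w) ** 2)    # sum_j lambda_j (x_j - w_j)^2
    return 2.0 * (1.0 - np.exp(-weighted / (2.0 * sigma ** 2)))

x = np.array([1.0, 0.0, 3.0])
w = np.zeros(3)

# Uniform relevances: the large difference in feature 3 dominates.
uniform = relevance_kernel_distance_sq(x, w, np.zeros(3), sigma=1.0)
# Almost all relevance on feature 1: feature 3 is effectively ignored.
focus_first = relevance_kernel_distance_sq(x, w, np.array([10.0, 0.0, 0.0]), sigma=1.0)
print(uniform, focus_first)
```

Shifting relevance onto the first feature shrinks the distance, because the large difference in the third feature no longer counts.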

Exponential Kernel Distance

The exponential kernel uses a learned transformation matrix \(\hat\Lambda = \hat\Omega \hat\Omega^T\):

\[\kappa_{\exp}(x, w) = \exp\!\bigl(x^T \hat\Lambda\, w\bigr)\]

Unlike the Gaussian kernel, \(\kappa_{\exp}(v, v) \neq 1\) in general, so the self-similarity terms do not reduce to constants and the full three-term distance formula is required:

\[d_\kappa^2(x, w) = \exp\!\bigl(x^T \hat\Lambda\, x\bigr) + \exp\!\bigl(w^T \hat\Lambda\, w\bigr) - 2\exp\!\bigl(x^T \hat\Lambda\, w\bigr)\]
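The three-term formula can be sketched in NumPy as follows. This is only an illustration of the equation above; the function names are hypothetical, and the 0.1 scale on the random matrix mirrors the spirit of a small initialization rather than any documented default.

```python
import numpy as np

def exp_kernel(x, w, lam_hat):
    """Exponential kernel exp(x^T Lambda_hat w)."""
    return np.exp(x @ lam_hat @ w)

def exp_kernel_distance_sq(x, w, omega_hat):
    """kappa(x, x) + kappa(w, w) - 2 kappa(x, w) with Lambda_hat = Omega_hat Omega_hat^T."""
    lam_hat = omega_hat @ omega_hat.T
    return (exp_kernel(x, x, lam_hat)
            + exp_kernel(w, w, lam_hat)
            - 2.0 * exp_kernel(x, w, lam_hat))

rng = np.random.default_rng(0)
omega_hat = 0.1 * rng.standard_normal((4, 4))  # small entries keep exp() well-behaved
x = rng.standard_normal(4)
w = rng.standard_normal(4)

d2 = exp_kernel_distance_sq(x, w, omega_hat)
d2_self = exp_kernel_distance_sq(x, x, omega_hat)
print(d2, d2_self)
```

The distance of a point to itself is zero because the three terms cancel exactly, even though no single term equals 1.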

Supervised Models

DKGLVQ

Differentiating Kernel GLVQ. Each prototype \(w_k\) has a learnable bandwidth \(\sigma_k\) adapted via gradient descent.
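To make the bandwidth adaptation concrete, the sketch below differentiates the Gaussian kernel distance with respect to \(\sigma\) and checks the result against a finite difference. It is derived from the distance formula given earlier and is not taken from the library; a full update would also chain this derivative through the GLVQ loss, which is not shown here.

```python
import numpy as np

def d2(x, w, sigma):
    """Gaussian kernel distance 2 * (1 - exp(-r / (2 sigma^2))), r = ||x - w||^2."""
    r = np.sum((x - w) ** 2)
    return 2.0 * (1.0 - np.exp(-r / (2.0 * sigma ** 2)))

def d2_grad_sigma(x, w, sigma):
    """Analytic derivative: d(d2)/d(sigma) = -2 * kappa(x, w) * r / sigma^3."""
    r = np.sum((x - w) ** 2)
    return -2.0 * np.exp(-r / (2.0 * sigma ** 2)) * r / sigma ** 3

x, w, sigma = np.array([1.0, 2.0]), np.array([0.5, 1.0]), 0.8

eps = 1e-6
numeric = (d2(x, w, sigma + eps) - d2(x, w, sigma - eps)) / (2.0 * eps)  # central difference
analytic = d2_grad_sigma(x, w, sigma)
print(numeric, analytic)
```

The derivative is negative: widening the bandwidth always shrinks the kernel distance, so the loss decides per prototype whether a wider or narrower kernel helps.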

from prosemble.models import DKGLVQ
from prosemble.datasets import load_iris_jax

dataset = load_iris_jax()
X, y = dataset.input_data, dataset.target

model = DKGLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    sigma_init='median',   # per-class median distance initialization
    sigma_min=1e-3,        # prevent bandwidth collapse
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Accuracy: {(preds == y).mean():.2%}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")

The sigma_init parameter controls initialization:

  • 'median' (default): per-class median distance from prototype to class members

  • 'mean': per-class mean distance

  • float: fixed value for all prototypes
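One plausible reading of the 'median' and 'mean' options is sketched below. The helper `init_sigma` is hypothetical and the library's internals may differ, for example in how prototypes are placed before the distances are measured.

```python
import numpy as np

def init_sigma(X, y, prototypes, proto_labels, mode="median"):
    """Per-prototype bandwidth: median (or mean) distance to samples of the same class."""
    reduce = np.median if mode == "median" else np.mean
    sigmas = []
    for w, c in zip(prototypes, proto_labels):
        members = X[y == c]                          # samples sharing the prototype's class
        dists = np.linalg.norm(members - w, axis=1)  # Euclidean distances to the prototype
        sigmas.append(reduce(dists))
    return np.array(sigmas)

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [10.0, 10.0], [13.0, 14.0]])
y = np.array([0, 0, 0, 1, 1])
prototypes = np.array([[0.0, 0.0], [10.0, 10.0]])
proto_labels = np.array([0, 1])

sigmas = init_sigma(X, y, prototypes, proto_labels, mode="median")
print(sigmas)
```

Tying the initial bandwidth to the data scale keeps the kernel from starting out saturated (bandwidth far too small) or nearly flat (bandwidth far too large).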

DKGRLVQ

Differentiating Kernel GRLVQ. Combines per-feature relevance weighting with per-prototype kernel bandwidth adaptation.

from prosemble.models import DKGRLVQ

model = DKGRLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    sigma_init='median',
    sigma_min=1e-3,
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Accuracy: {(preds == y).mean():.2%}")
print(f"Relevance profile: {model.relevance_profile}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")

The relevance_profile property returns the normalized feature relevance weights \(\lambda = \text{softmax}(\text{relevances})\), identifying which features are most discriminative.

DKGMLVQ

Differentiating Kernel GMLVQ with the exponential kernel. Learns a global transformation matrix \(\hat\Omega\) of shape (d, latent_dim).

from prosemble.models import DKGMLVQ

model = DKGMLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    latent_dim=None,          # defaults to input dim
    omega_hat_scale=0.1,      # small init prevents exp overflow
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Omega hat shape: {model.omega_hat_matrix.shape}")
print(f"Lambda hat shape: {model.lambda_hat_matrix.shape}")

The lambda_hat_matrix property returns the symmetric positive semi-definite matrix \(\hat\Lambda = \hat\Omega \hat\Omega^T\), which can be analyzed for feature correlations learned by the model.
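The positive semi-definiteness follows directly from the factorization, since \(x^T \hat\Lambda x = \|\hat\Omega^T x\|^2 \ge 0\). A quick NumPy check with a random stand-in for \(\hat\Omega\) (not a trained model) looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d, latent_dim = 4, 2
omega_hat = 0.1 * rng.standard_normal((d, latent_dim))  # stand-in for a learned Omega_hat

lambda_hat = omega_hat @ omega_hat.T       # shape (d, d)
eigvals = np.linalg.eigvalsh(lambda_hat)   # real eigenvalues of a symmetric matrix

# The quadratic form equals a squared norm, hence it is never negative.
x = rng.standard_normal(d)
quad_form = x @ lambda_hat @ x
sq_norm = np.sum((omega_hat.T @ x) ** 2)

n_nonzero = int(np.sum(eigvals > 1e-10))   # at most latent_dim eigenvalues are non-zero
print(eigvals, n_nonzero)
```

When `latent_dim` is smaller than the input dimension, \(\hat\Lambda\) is rank-deficient, so the learned metric only measures differences within a `latent_dim`-dimensional subspace.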

One-Class Models

The one-class differentiating kernel models combine OC-GLVQ’s \(\theta\)-based hypothesis testing with kernel distances. In standard OC-GLVQ, the classifier function is:

\[\mu_{k^*}(x_i) = s_i \cdot \frac{d(x_i, w_{k^*}) - \theta_{k^*}}{d(x_i, w_{k^*}) + \theta_{k^*}}\]

where \(k^*\) is the nearest prototype, \(\theta_{k^*}\) is a learned per-prototype visibility threshold, and \(s_i = +1\) for target, \(-1\) for outlier. The OC-DK variants replace the Euclidean distance \(d\) with kernel distances.

Critical design detail: The \(\theta_k\) thresholds are initialized in kernel distance scale, not Euclidean scale. Gaussian kernel distances are bounded in \([0, 2]\), so Euclidean-initialized thetas would be far too large.
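The effect of a mis-scaled threshold can be seen in a small sketch of the classifier function. Two assumptions are made here: the squared kernel distance is fed into \(\mu\) (the text does not state whether the distance or its square is used), and the threshold values are chosen by hand for illustration.

```python
import numpy as np

def kernel_distance_sq(sq_euclid, sigma=1.0):
    """Gaussian kernel distance from a squared Euclidean distance."""
    return 2.0 * (1.0 - np.exp(-sq_euclid / (2.0 * sigma ** 2)))

def oc_mu(d, theta, is_target):
    """OC-GLVQ classifier function s * (d - theta) / (d + theta)."""
    s = 1.0 if is_target else -1.0
    return s * (d - theta) / (d + theta)

d_close = kernel_distance_sq(0.1)    # sample near the prototype
d_far = kernel_distance_sq(100.0)    # sample far away, distance saturates near 2

theta_kernel = 1.0    # threshold on the kernel-distance scale
theta_euclid = 25.0   # threshold on the Euclidean scale (hypothetical value)

# Kernel-scale theta separates the two cases: negative for close, positive for far.
mu_close = oc_mu(d_close, theta_kernel, is_target=True)
mu_far = oc_mu(d_far, theta_kernel, is_target=True)

# Euclidean-scale theta exceeds every possible kernel distance,
# so even the far sample appears to lie inside the visibility radius.
mu_far_bad = oc_mu(d_far, theta_euclid, is_target=True)
print(mu_close, mu_far, mu_far_bad)
```

With the Euclidean-scale threshold the sign of \(d - \theta\) never changes, so the threshold would first have to shrink by more than an order of magnitude before it could separate anything.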

OCDKGLVQ

One-class classification with Gaussian kernel distance and per-prototype bandwidth adaptation.

from prosemble.models import OCDKGLVQ
import jax
import jax.numpy as jnp

# Generate one-class dataset
key = jax.random.PRNGKey(42)
k1, k2 = jax.random.split(key)
X_target = jax.random.normal(k1, (100, 4)) * 0.5
X_outlier = jax.random.normal(k2, (30, 4)) * 0.5 + 3.0
X = jnp.concatenate([X_target, X_outlier])
y = jnp.concatenate([jnp.zeros(100, dtype=jnp.int32),
                     jnp.ones(30, dtype=jnp.int32)])

model = OCDKGLVQ(
    n_prototypes=3,
    max_iter=100,
    lr=0.01,
    sigma_init='median',
    sigma_min=1e-3,
    target_label=0,
    random_seed=42,
)
model.fit(X, y)

scores = model.decision_function(X)
preds = model.predict(X)
print(f"Learned bandwidths: {model.kernel_bandwidths}")
print(f"Visibility radii: {model.visibility_radii}")

OCDKGRLVQ

One-class classification with relevance-weighted kernel distance, per-prototype bandwidth, and per-feature relevance learning.

from prosemble.models import OCDKGRLVQ

model = OCDKGRLVQ(
    n_prototypes=3,
    max_iter=100,
    lr=0.01,
    sigma_init='median',
    sigma_min=1e-3,
    target_label=0,
    random_seed=42,
)
model.fit(X, y)

scores = model.decision_function(X)
print(f"Relevance profile: {model.relevance_profile}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")

The relevance_profile property returns the softmax-normalized per-feature weights, identifying which features are most important for the one-class boundary.

OCDKGMLVQ

One-class classification with exponential kernel distance and a learned transformation matrix \(\hat\Omega\).

from prosemble.models import OCDKGMLVQ

model = OCDKGMLVQ(
    n_prototypes=3,
    max_iter=100,
    lr=0.01,
    latent_dim=None,
    omega_hat_scale=0.1,
    target_label=0,
    random_seed=42,
)
model.fit(X, y)

scores = model.decision_function(X)
print(f"Omega hat shape: {model.omega_hat_matrix.shape}")
print(f"Lambda hat (PSD): {model.lambda_hat_matrix.shape}")

Supervised Models with Neural Gas

The supervised DK-NG variants combine differentiating kernel distances with Neural Gas class-aware neighborhood cooperation. All same-class prototypes participate in the loss, weighted by their distance rank:

\[h_k = \exp\left(-\frac{\text{rank}_k}{\gamma}\right), \quad \text{only for } \text{label}(w_k) = \text{label}(x)\]

where \(\gamma\) decays during training from gamma_init to gamma_final, and the GLVQ margin is computed per prototype:

\[\mu_k = \frac{d_\kappa(x, w_k) - d^-}{d_\kappa(x, w_k) + d^-}\]

with \(d^-\) being the nearest different-class prototype distance.
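The rank weights and per-prototype margins for a single sample can be sketched as below. Ranks are assumed to start at 0 for the closest same-class prototype, and the way the weighted margins are aggregated into the final loss is left out because the text does not specify it; `dk_ng_terms` is a hypothetical helper.

```python
import numpy as np

def dk_ng_terms(dists, proto_labels, label, gamma):
    """Rank weights h_k and margins mu_k over the same-class prototypes of one sample."""
    same = proto_labels == label
    d_minus = dists[~same].min()              # nearest different-class prototype
    d_same = dists[same]
    ranks = np.argsort(np.argsort(d_same))    # 0 for the closest same-class prototype
    h = np.exp(-ranks / gamma)
    mu = (d_same - d_minus) / (d_same + d_minus)
    return h, mu

# Kernel distances from one sample (true label 0) to four prototypes.
dists = np.array([0.2, 0.9, 0.5, 1.4])
proto_labels = np.array([0, 0, 1, 1])

h, mu = dk_ng_terms(dists, proto_labels, label=0, gamma=1.0)
print(h, mu)
```

The second same-class prototype has a positive margin (it is further away than the nearest wrong-class prototype) but still receives a non-zero weight, so it is pulled toward the sample as well.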

from prosemble.models import DKGLVQ_NG

model = DKGLVQ_NG(
    n_prototypes_per_class=3,
    max_iter=100,
    lr=0.01,
    sigma_init='median',
    gamma_init=1.5,
    gamma_final=0.01,
    random_seed=42,
)
model.fit(X, y)
preds = model.predict(X)
print(f"Final gamma: {model.gamma_}")

The relevance-weighted variant (DKGRLVQ_NG) adds per-feature relevance weights, while the matrix variant (DKGMLVQ_NG) uses exponential kernel distance with learnable \(\hat\Omega\) transformation.

One-Class Models with Neural Gas

The OC-DK-NG variants extend the one-class kernel models with Neural Gas neighborhood cooperation. Instead of only the nearest prototype contributing to the loss, all prototypes participate weighted by their distance rank:

\[h_k = \exp\left(-\frac{\text{rank}_k}{\gamma}\right)\]

where \(\gamma\) decays during training from gamma_init to gamma_final.
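The text does not state the shape of the decay. The sketch below assumes an exponential interpolation between gamma_init and gamma_final, a common choice for Neural Gas, purely to show how the neighborhood narrows; the function `gamma_schedule` is hypothetical.

```python
import numpy as np

def gamma_schedule(step, max_iter, gamma_init=1.5, gamma_final=0.01):
    """Assumed exponential decay from gamma_init (step 0) to gamma_final (last step)."""
    frac = step / max(max_iter - 1, 1)
    return gamma_init * (gamma_final / gamma_init) ** frac

ranks = np.arange(3)  # rank 0 is the closest prototype

gamma_start = gamma_schedule(0, 100)
gamma_end = gamma_schedule(99, 100)

h_start = np.exp(-ranks / gamma_start)  # broad cooperation early in training
h_end = np.exp(-ranks / gamma_end)      # effectively winner-takes-all at the end
print(h_start, h_end)
```

Early on, the second- and third-ranked prototypes still receive a substantial share of the update; by the end only the closest prototype has a non-negligible weight.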

from prosemble.models import OCDKGLVQ_NG

model = OCDKGLVQ_NG(
    n_prototypes=3,
    max_iter=100,
    lr=0.01,
    sigma_init='median',
    gamma_init=1.5,
    gamma_final=0.01,
    target_label=0,
    random_seed=42,
)
model.fit(X, y)
print(f"Final gamma: {model.gamma_}")

The relevance-weighted variant (OCDKGRLVQ_NG) and matrix variant (OCDKGMLVQ_NG) follow the same pattern, combining their respective kernel distances with NG cooperation.

Unsupervised Models

The unsupervised kernel models use the Gaussian kernel distance for prototype ranking and BMU selection, but \(\sigma\) is a fixed hyperparameter (not learned). Prototypes live in the original data space — only the distance metric changes.

DKNeuralGas

Neural Gas with Gaussian kernel distance for ranking.

from prosemble.models import DKNeuralGas
from prosemble.datasets import load_iris_jax

dataset = load_iris_jax()
X = dataset.input_data

model = DKNeuralGas(
    n_prototypes=10,
    kernel_sigma=1.0,
    max_iter=100,
    lr_init=0.5,
    lr_final=0.01,
    lambda_final=0.01,
    random_seed=42,
)
model.fit(X)

labels = model.predict(X)
print(f"Energy: {model.loss_:.4f}")

DKKohonenSOM

Kohonen SOM with Gaussian kernel distance for BMU selection. The grid neighborhood is unchanged — only the data-space metric changes.

from prosemble.models import DKKohonenSOM

model = DKKohonenSOM(
    grid_height=5,
    grid_width=5,
    kernel_sigma=1.0,
    sigma_init=2.0,
    sigma_final=0.5,
    lr_init=0.5,
    lr_final=0.01,
    max_iter=100,
    random_seed=42,
)
model.fit(X)

bmu_coords = model.bmu_map(X)
print(f"BMU coordinates shape: {bmu_coords.shape}")

DKHeskesSOM

Heskes SOM with Gaussian kernel distance. The Heskes BMU criterion selects the unit whose entire neighborhood best represents the sample:

\[c^*(x) = \arg\min_c \sum_k h(k, c) \cdot d_\kappa^2(x, w_k)\]

from prosemble.models import DKHeskesSOM

model = DKHeskesSOM(
    grid_height=5,
    grid_width=5,
    kernel_sigma=1.0,
    sigma_init=2.0,
    sigma_final=0.5,
    max_iter=100,
    random_seed=42,
)
model.fit(X)

bmu_coords = model.bmu_map(X)
print(f"Energy: {model.loss_:.4f}")
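The Heskes BMU criterion stated above can differ from plain nearest-prototype selection. The sketch below uses a 1x4 grid with hand-picked kernel distances and assumes a Gaussian grid neighborhood \(h(k, c)\), which the text does not specify; `heskes_bmu` is a hypothetical helper.

```python
import numpy as np

def heskes_bmu(d2, grid_coords, nbh_sigma):
    """Unit c minimizing sum_k h(k, c) * d2[k], with a Gaussian grid neighborhood."""
    diff = grid_coords[:, None, :] - grid_coords[None, :, :]
    grid_d2 = np.sum(diff ** 2, axis=-1)               # squared grid distances, (n, n)
    h = np.exp(-grid_d2 / (2.0 * nbh_sigma ** 2))      # h[c, k]
    cost = h @ d2                                      # neighborhood-weighted cost per unit
    return int(np.argmin(cost)), cost

grid_coords = np.array([[0, 0], [0, 1], [0, 2], [0, 3]], dtype=float)
d2 = np.array([0.40, 1.90, 0.45, 0.50])  # kernel distances of one sample to the 4 units

plain_bmu = int(np.argmin(d2))
bmu, cost = heskes_bmu(d2, grid_coords, nbh_sigma=1.0)
print(plain_bmu, bmu, cost)
```

Unit 0 is nearest on its own, but its grid neighbor (unit 1) is far from the sample, so the neighborhood-weighted cost favors unit 3, whose neighbor is also close.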

Choosing a Model

| Model | Kernel | Learned Params | Best For |
|---|---|---|---|
| DKGLVQ | Gaussian | \(w_k, \sigma_k\) | Per-prototype bandwidth adaptation |
| DKGRLVQ | Gaussian (weighted) | \(w_k, \sigma_k, \lambda\) | Feature selection + kernel adaptation |
| DKGMLVQ | Exponential | \(w_k, \hat\Omega\) | Full metric adaptation in kernel space |
| DKGLVQ_NG | Gaussian | \(w_k, \sigma_k, \gamma\) | Supervised kernel + NG cooperation |
| DKGRLVQ_NG | Gaussian (weighted) | \(w_k, \sigma_k, \lambda, \gamma\) | Supervised kernel + relevances + NG cooperation |
| DKGMLVQ_NG | Exponential | \(w_k, \hat\Omega, \gamma\) | Supervised kernel matrix + NG cooperation |
| OCDKGLVQ | Gaussian | \(w_k, \sigma_k, \theta_k\) | One-class with kernel bandwidth adaptation |
| OCDKGRLVQ | Gaussian (weighted) | \(w_k, \sigma_k, \lambda, \theta_k\) | One-class with feature selection + kernel |
| OCDKGMLVQ | Exponential | \(w_k, \hat\Omega, \theta_k\) | One-class with full metric adaptation in kernel space |
| OCDKGLVQ_NG | Gaussian | \(w_k, \sigma_k, \theta_k, \gamma\) | One-class kernel + NG cooperation |
| OCDKGRLVQ_NG | Gaussian (weighted) | \(w_k, \sigma_k, \lambda, \theta_k, \gamma\) | One-class kernel + relevances + NG cooperation |
| OCDKGMLVQ_NG | Exponential | \(w_k, \hat\Omega, \theta_k, \gamma\) | One-class kernel matrix + NG cooperation |
| DKNeuralGas | Gaussian (fixed \(\sigma\)) | \(w_k\) | Unsupervised clustering with kernel distance |
| DKKohonenSOM | Gaussian (fixed \(\sigma\)) | \(w_k\) | SOM visualization with kernel distance |
| DKHeskesSOM | Gaussian (fixed \(\sigma\)) | \(w_k\) | Principled SOM with kernel distance |

Riemannian Variants

Differentiating kernel distances can also be applied on Riemannian manifolds. Three models combine the RiemannianSRNG framework (prototypes on manifold, NG rank cooperation, manifold projection) with kernel distance formulas:

  • RiemannianDKGLVQ — Gaussian kernel on geodesic distance: \(d_\kappa^2(x, w_k) = 2(1 - \exp(-d_{\text{geo}}^2(x, w_k) / 2\sigma_k^2))\)

  • RiemannianDKGRLVQ — Relevance-weighted kernel in tangent space: \(d_\kappa^2(x, w_k) = 2(1 - \exp(-\sum_j \lambda_j v_j^2 / 2\sigma_k^2))\) where \(v = \text{Log}_{w_k}(x)_{\text{flat}}\)

  • RiemannianDKGMLVQ — Exponential kernel in tangent space: \(d_\kappa^2(x, w_k) = \exp(v^T \hat\Lambda v) - 1\) where \(\hat\Lambda = \hat\Omega \hat\Omega^T\)

All three support SO(n), SPD(n), and Grassmannian(n,k) manifolds.
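The tangent-space form of the exponential kernel distance is consistent with the three-term formula from the Exponential Kernel Distance section, under the assumption that the formula is evaluated in the tangent space at \(w_k\), where the prototype itself maps to the origin (\(\text{Log}_{w_k}(w_k) = 0\)). Substituting \(x \to v\) and \(w \to 0\):

```latex
\begin{aligned}
d_\kappa^2(v, 0)
  &= \exp\!\bigl(v^T \hat\Lambda\, v\bigr)
   + \exp\!\bigl(0^T \hat\Lambda\, 0\bigr)
   - 2\exp\!\bigl(v^T \hat\Lambda\, 0\bigr) \\
  &= \exp\!\bigl(v^T \hat\Lambda\, v\bigr) + 1 - 2 \\
  &= \exp\!\bigl(v^T \hat\Lambda\, v\bigr) - 1
\end{aligned}
```

Since \(\hat\Lambda = \hat\Omega \hat\Omega^T\) is positive semi-definite, \(v^T \hat\Lambda v \ge 0\) and the distance is non-negative, vanishing when \(\hat\Omega^T v = 0\) (in particular at \(v = 0\)).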

from prosemble.core.manifolds import SO
from prosemble.models import RiemannianDKGLVQ

manifold = SO(3)
model = RiemannianDKGLVQ(
    manifold=manifold, n_prototypes_per_class=2,
    max_iter=100, lr=0.01, use_scan=False,
)
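# X_train, y_train and X_test are placeholders: supply manifold-valued
# samples (points on the chosen manifold) and their class labels.
model.fit(X_train, y_train)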
preds = model.predict(X_test)

# Inspect learned bandwidths
print(model.kernel_bandwidths)

Riemannian Metric-Adapted DK Variants

Nine additional models combine metric adaptation (global/local/subspace) with kernel distance learning on Riemannian manifolds, completing a 3x3 grid (Gaussian/Relevance/Matrix kernels x SMNG/SLNG/STNG bases):

Gaussian kernel variants (per-prototype \(\sigma_k\)):

  • RiemannianDKSMNG: \(d_\kappa^2 = 2(1 - \exp(-\|\Omega \cdot v\|^2 / 2\sigma_k^2))\)

  • RiemannianDKSLNG: \(d_\kappa^2 = 2(1 - \exp(-\|\Omega_k \cdot v\|^2 / 2\sigma_k^2))\)

  • RiemannianDKSTNG: \(d_\kappa^2 = 2(1 - \exp(-\|(I - \Omega_k\Omega_k^T) \cdot v\|^2 / 2\sigma_k^2))\)

Relevance kernel variants (\(\sigma_k\) + relevance \(\lambda\)):

  • RiemannianDKRSMNG: \(d_\kappa^2 = 2(1 - \exp(-\sum_j \lambda_j (\Omega \cdot v)_j^2 / 2\sigma_k^2))\)

  • RiemannianDKRSLNG: \(d_\kappa^2 = 2(1 - \exp(-\sum_j \lambda_j (\Omega_k \cdot v)_j^2 / 2\sigma_k^2))\)

  • RiemannianDKRSTNG: \(d_\kappa^2 = 2(1 - \exp(-\sum_j \lambda_j r_j^2 / 2\sigma_k^2))\), where \(r\) presumably denotes the subspace residual \((I - \Omega_k\Omega_k^T) \cdot v\) used by RiemannianDKSTNG

Matrix kernel variants (exponential with \(\hat\Lambda = \hat\Omega\hat\Omega^T\)):

  • RiemannianDKMSMNG: \(d_\kappa^2 = \exp((\Omega \cdot v)^T \hat\Lambda (\Omega \cdot v)) - 1\)

  • RiemannianDKMSLNG: \(d_\kappa^2 = \exp((\Omega_k \cdot v)^T \hat\Lambda (\Omega_k \cdot v)) - 1\)

  • RiemannianDKMSTNG: \(d_\kappa^2 = \exp(r^T \hat\Lambda r) - 1\)
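For the STNG-based variants, the term \((I - \Omega_k\Omega_k^T) \cdot v\) is the part of the tangent vector that lies outside the subspace spanned by \(\Omega_k\). The sketch below assumes \(\Omega_k\) has orthonormal columns, so that the expression is an orthogonal projection; whether the library enforces this is not stated here, and `subspace_residual` is a hypothetical helper.

```python
import numpy as np

def subspace_residual(v, omega_k):
    """Component of v orthogonal to the column span of omega_k (orthonormal columns)."""
    return v - omega_k @ (omega_k.T @ v)

# Subspace spanned by the first two coordinate axes of a 3-dimensional tangent space.
omega_k = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 0.0]])
v = np.array([3.0, 4.0, 2.0])  # flattened tangent vector Log_{w_k}(x)

r = subspace_residual(v, omega_k)  # only the third component survives
sigma_k = 1.0
d2 = 2.0 * (1.0 - np.exp(-np.sum(r ** 2) / (2.0 * sigma_k ** 2)))
print(r, d2)
```

Variation inside the prototype's subspace costs nothing, so each prototype effectively represents a local affine patch rather than a single point.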

from prosemble.core.manifolds import Grassmannian
from prosemble.models import RiemannianDKRSMNG

manifold = Grassmannian(4, 2)
model = RiemannianDKRSMNG(
    manifold=manifold, n_prototypes_per_class=2,
    max_iter=100, lr=0.01, use_scan=False,
)
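# X_train, y_train and X_test are placeholders for user-supplied
# Grassmannian samples and class labels.
model.fit(X_train, y_train)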
preds = model.predict(X_test)
print(model.kernel_bandwidths)
print(model.relevance_profile)

ONNX Export

All 15 differentiating kernel models listed under Choosing a Model support ONNX export. The three kernel distance types are implemented as native ONNX subgraphs:

from prosemble.models import DKGLVQ
from prosemble.core.onnx_export import export_onnx

model = DKGLVQ(n_prototypes_per_class=2, max_iter=100, lr=0.01)
# X_train, y_train are placeholders for any labeled training set.
model.fit(X_train, y_train)

# Export to ONNX — kernel distance computed in the graph
onnx_model = export_onnx(model, path='dkglvq.onnx')

Per-prototype bandwidths \(\sigma_k\) are clamped at export time (sigma_min), and relevance logits are normalized via a Softmax node in the ONNX graph. See the ONNX Export guide for full details.

References