Differentiating Kernel Models¶

Differentiating kernel models replace Euclidean distances with kernel-induced distances in prototype-based learning. In the supervised models, the kernel parameters are adapted via gradient descent alongside the prototypes, so the model learns a non-linear similarity measure tuned to the data. The unsupervised models use the same kernel distance but keep the kernel bandwidth fixed.

Mathematical Background¶

Gaussian Kernel Distance¶

For a Gaussian kernel with bandwidth \(\sigma\):

\[\kappa(x, w) = \exp\!\left(-\frac{\|x - w\|^2}{2\sigma^2}\right)\]

the induced distance in feature space is:

\[d_\kappa^2(x, w) = \|\phi(x) - \phi(w)\|^2 = 2\bigl(1 - \kappa(x, w)\bigr) = 2\left(1 - \exp\!\left( -\frac{\|x - w\|^2}{2\sigma^2} \right)\right)\]

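This distance lies in \([0, 2)\) regardless of input magnitude: it is 0 when \(x = w\) and approaches 2 as \(\|x - w\|\) grows. Distant points therefore contribute a saturated, bounded distance, which limits the influence of outliers.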
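As a quick numerical illustration of the formula above, the following NumPy sketch evaluates the Gaussian kernel distance for a nearby point, a distant point, and an extreme outlier. The function name `gaussian_kernel_distance_sq` is illustrative only and is not part of prosemble.

```python
import numpy as np

def gaussian_kernel_distance_sq(x, w, sigma):
    """Squared kernel-induced distance 2 * (1 - exp(-||x - w||^2 / (2 sigma^2)))."""
    diff = np.asarray(x, dtype=float) - np.asarray(w, dtype=float)
    sq_euclid = np.sum(diff ** 2)
    return 2.0 * (1.0 - np.exp(-sq_euclid / (2.0 * sigma ** 2)))

w = np.zeros(2)
same = gaussian_kernel_distance_sq([0.0, 0.0], w, sigma=1.0)        # identical points
near = gaussian_kernel_distance_sq([0.1, 0.0], w, sigma=1.0)        # small Euclidean distance
far = gaussian_kernel_distance_sq([10.0, 0.0], w, sigma=1.0)        # already saturated
outlier = gaussian_kernel_distance_sq([1000.0, 0.0], w, sigma=1.0)  # saturated as well

print(same, near, far, outlier)
```

The distant point and the extreme outlier end up with practically the same distance, whereas their squared Euclidean distances differ by a factor of 10,000.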

Relevance-Weighted Kernel Distance¶

Adding per-feature relevance weights \(\lambda_j = \text{softmax}(\text{relevances})_j\), which are non-negative and sum to one, together with a per-prototype bandwidth \(\sigma_k\) gives:

\[d_\kappa^2(x, w_k) = 2\left(1 - \exp\!\left( -\frac{\sum_j \lambda_j (x_j - w_{kj})^2}{2\sigma_k^2} \right)\right)\]

This combines soft feature weighting with the kernel distance: a large \(\lambda_j\) marks input dimension \(j\) as important for classification, while a small \(\lambda_j\) suppresses its contribution to the distance.
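The effect of the relevance weights can be sketched in NumPy. The helper names `softmax` and `relevance_kernel_distance_sq` are hypothetical and written for this example; they are not prosemble functions. The two points differ only in the second feature, so shifting relevance away from that feature should shrink the distance.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def relevance_kernel_distance_sq(x, w_k, relevances, sigma_k):
    lam = softmax(relevances)                # lambda_j >= 0, sum_j lambda_j = 1
    weighted = np.sum(lam * (x - w_k) ** 2)  # sum_j lambda_j (x_j - w_kj)^2
    return 2.0 * (1.0 - np.exp(-weighted / (2.0 * sigma_k ** 2)))

x = np.array([1.0, 5.0])
w_k = np.array([1.0, 0.0])  # differs from x only in the second feature

# Equal raw relevances: both features weighted 0.5.
uniform = relevance_kernel_distance_sq(x, w_k, np.array([0.0, 0.0]), sigma_k=1.0)
# Raw relevances that push almost all weight onto the first feature.
downweighted = relevance_kernel_distance_sq(x, w_k, np.array([5.0, -5.0]), sigma_k=1.0)

print(uniform, downweighted)
```

Because softmax weights are strictly positive, a feature is never removed entirely; its contribution only becomes very small.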

Exponential Kernel Distance¶

The exponential kernel uses a learned transformation matrix \(\hat\Lambda = \hat\Omega \hat\Omega^T\):

\[\kappa_{\exp}(x, w) = \exp\!\bigl(x^T \hat\Lambda\, w\bigr)\]

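Unlike the Gaussian kernel, the self-similarity \(\kappa_{\exp}(v, v) = \exp(v^T \hat\Lambda\, v)\) is in general not equal to 1, so the simplification \(2(1 - \kappa)\) does not apply and the full three-term distance formula is required: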

\[d_\kappa^2(x, w) = \exp\!\bigl(x^T \hat\Lambda\, x\bigr) + \exp\!\bigl(w^T \hat\Lambda\, w\bigr) - 2\exp\!\bigl(x^T \hat\Lambda\, w\bigr)\]
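The three-term formula can be checked numerically with a short NumPy sketch. The function names and the randomly drawn \(\hat\Omega\) are assumptions made for this example, not values or functions taken from prosemble.

```python
import numpy as np

def exp_kernel(x, w, lambda_hat):
    """Exponential kernel exp(x^T Lambda_hat w)."""
    return np.exp(x @ lambda_hat @ w)

def exp_kernel_distance_sq(x, w, lambda_hat):
    """kappa(x, x) + kappa(w, w) - 2 * kappa(x, w)."""
    return (exp_kernel(x, x, lambda_hat)
            + exp_kernel(w, w, lambda_hat)
            - 2.0 * exp_kernel(x, w, lambda_hat))

rng = np.random.default_rng(0)
omega_hat = 0.1 * rng.standard_normal((3, 3))  # illustrative small random init
lambda_hat = omega_hat @ omega_hat.T           # symmetric positive semi-definite

x = np.array([1.0, 2.0, 0.5])
w = np.array([0.5, 1.5, 1.0])

self_similarity = exp_kernel(x, x, lambda_hat)   # exp of a non-negative number
d_xw = exp_kernel_distance_sq(x, w, lambda_hat)
d_xx = exp_kernel_distance_sq(x, x, lambda_hat)  # identical points

print(self_similarity, d_xw, d_xx)
```

The self-similarity differs from 1, which is why the shortcut used for the Gaussian kernel cannot be reused, while the distance between identical points is still 0.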

Supervised Models¶

DKGLVQ¶

Differentiating Kernel GLVQ. Each prototype \(w_k\) has a learnable bandwidth \(\sigma_k\) adapted via gradient descent.
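To see why the bandwidth can be trained by gradient descent, note that the Gaussian kernel distance is differentiable in \(\sigma\), with \(\partial d_\kappa^2 / \partial \sigma = -2\,\kappa(x, w)\,\|x - w\|^2 / \sigma^3\). The sketch below checks this derivative against a central finite difference. It covers a single distance only, not the full GLVQ loss, and says nothing about how prosemble computes its gradients internally; the function names are illustrative.

```python
import numpy as np

def kernel_distance_sq(sq_euclid, sigma):
    return 2.0 * (1.0 - np.exp(-sq_euclid / (2.0 * sigma ** 2)))

def grad_sigma(sq_euclid, sigma):
    """Analytic derivative of the kernel distance with respect to sigma."""
    kappa = np.exp(-sq_euclid / (2.0 * sigma ** 2))
    return -2.0 * kappa * sq_euclid / sigma ** 3

sq_euclid, sigma, eps = 1.5, 0.8, 1e-6

analytic = grad_sigma(sq_euclid, sigma)
numeric = (kernel_distance_sq(sq_euclid, sigma + eps)
           - kernel_distance_sq(sq_euclid, sigma - eps)) / (2.0 * eps)

print(analytic, numeric)
```

The derivative is negative: widening the bandwidth shrinks the kernel distance to a given point, which is the lever gradient descent uses to adapt each \(\sigma_k\).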

from prosemble.models import DKGLVQ
from prosemble.datasets import load_iris_jax

dataset = load_iris_jax()
X, y = dataset.input_data, dataset.target

model = DKGLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    sigma_init='median',   # per-class median distance initialization
    sigma_min=1e-3,        # prevent bandwidth collapse
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Accuracy: {(preds == y).mean():.2%}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")

The sigma_init parameter controls initialization:

'median' (default): per-class median distance from prototype to class members
'mean': per-class mean distance
float: fixed value for all prototypes
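One plausible reading of the `'median'` option is sketched below: for each prototype, take the median Euclidean distance to the training samples of its own class. This is an assumption about the heuristic based on the description above, not the library's actual code. The function `median_sigma_init` and the flooring at `sigma_min` during initialization are hypothetical.

```python
import numpy as np

def median_sigma_init(X, y, prototypes, proto_labels, sigma_min=1e-3):
    """Hypothetical per-prototype bandwidth init from same-class distances."""
    sigmas = np.empty(len(prototypes))
    for k, (w_k, c_k) in enumerate(zip(prototypes, proto_labels)):
        members = X[y == c_k]                          # samples of the prototype's class
        dists = np.linalg.norm(members - w_k, axis=1)  # Euclidean distances to w_k
        sigmas[k] = max(np.median(dists), sigma_min)   # assumed floor, see lead-in
    return sigmas

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0],
              [10.0, 10.0], [10.0, 13.0], [14.0, 10.0]])
y = np.array([0, 0, 0, 1, 1, 1])
prototypes = np.array([[0.0, 0.0], [10.0, 10.0]])
proto_labels = np.array([0, 1])

sigmas = median_sigma_init(X, y, prototypes, proto_labels)
print(sigmas)  # -> [2. 3.]
```

Swapping `np.median` for `np.mean` would give the corresponding sketch for the `'mean'` option.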

DKGRLVQ¶

Differentiating Kernel GRLVQ. Combines per-feature relevance weighting with per-prototype kernel bandwidth adaptation.

from prosemble.models import DKGRLVQ

model = DKGRLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    sigma_init='median',
    sigma_min=1e-3,
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Accuracy: {(preds == y).mean():.2%}")
print(f"Relevance profile: {model.relevance_profile}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")

The relevance_profile property returns the normalized feature relevance weights \(\lambda = \text{softmax}(\text{relevances})\), identifying which features are most discriminative.

DKGMLVQ¶

Differentiating Kernel GMLVQ with the exponential kernel. Learns a global transformation matrix \(\hat\Omega\) of shape (d, latent_dim).

from prosemble.models import DKGMLVQ

model = DKGMLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    latent_dim=None,          # defaults to input dim
    omega_hat_scale=0.1,      # small init prevents exp overflow
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Omega hat shape: {model.omega_hat_matrix.shape}")
print(f"Lambda hat shape: {model.lambda_hat_matrix.shape}")

The lambda_hat_matrix property returns the symmetric positive semi-definite matrix \(\hat\Lambda = \hat\Omega \hat\Omega^T\), which can be analyzed for feature correlations learned by the model.
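The stated properties of \(\hat\Lambda\) follow directly from its construction and can be verified with NumPy. The sketch below uses a randomly drawn \(\hat\Omega\) as a stand-in for a trained `omega_hat_matrix`; the shapes and the scale of 0.1 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
d, latent_dim = 4, 2

# Stand-in for a learned Omega_hat of shape (d, latent_dim).
omega_hat = 0.1 * rng.standard_normal((d, latent_dim))
lambda_hat = omega_hat @ omega_hat.T  # shape (d, d)

eigvals = np.linalg.eigvalsh(lambda_hat)  # real eigenvalues, ascending
is_symmetric = np.allclose(lambda_hat, lambda_hat.T)
is_psd = bool(eigvals.min() > -1e-12)     # non-negative up to round-off
rank = int(np.linalg.matrix_rank(lambda_hat, tol=1e-10))

# Diagonal entries normalized by the trace: one simple per-feature summary.
diag_share = np.diag(lambda_hat) / np.trace(lambda_hat)

print(lambda_hat.shape, is_symmetric, is_psd, rank)
print(diag_share)
```

With `latent_dim` smaller than `d`, the rank of \(\hat\Lambda\) is at most `latent_dim`. Off-diagonal entries couple pairs of features, while the diagonal shows how strongly each feature enters on its own.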

Unsupervised Models¶

The unsupervised kernel models use the Gaussian kernel distance for prototype ranking and BMU selection, but \(\sigma\) is a fixed hyperparameter (not learned). Prototypes live in the original data space — only the distance metric changes.
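The ranking and BMU selection steps can be sketched in NumPy with a fixed bandwidth. The prototype values and the function name `kernel_distances_sq` are made up for illustration and do not come from prosemble.

```python
import numpy as np

def kernel_distances_sq(x, prototypes, kernel_sigma):
    """Gaussian kernel distance from one sample to every prototype."""
    sq_euclid = np.sum((prototypes - x) ** 2, axis=1)
    return 2.0 * (1.0 - np.exp(-sq_euclid / (2.0 * kernel_sigma ** 2)))

prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [0.5, 2.0]])
x = np.array([0.9, 0.8])

d_kappa = kernel_distances_sq(x, prototypes, kernel_sigma=1.0)

bmu = int(np.argmin(d_kappa))            # best matching unit (SOM)
ranks = np.argsort(np.argsort(d_kappa))  # rank of each prototype (Neural Gas)

# For comparison: ranks under the plain squared Euclidean distance.
euclid_ranks = np.argsort(np.argsort(np.sum((prototypes - x) ** 2, axis=1)))

print(bmu, ranks, euclid_ranks)
```

With a single shared \(\sigma\), the kernel distance is a strictly increasing function of the Euclidean distance, so the ordering of prototypes matches the Euclidean ordering. What changes are the distance values themselves, which saturate for distant prototypes and therefore affect any quantity that sums or weights distances.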

DKNeuralGas¶

Neural Gas with Gaussian kernel distance for ranking.

from prosemble.models import DKNeuralGas
from prosemble.datasets import load_iris_jax

dataset = load_iris_jax()
X = dataset.input_data

model = DKNeuralGas(
    n_prototypes=10,
    kernel_sigma=1.0,
    max_iter=100,
    lr_init=0.5,
    lr_final=0.01,
    lambda_final=0.01,
    random_seed=42,
)
model.fit(X)

labels = model.predict(X)
print(f"Energy: {model.loss_:.4f}")

DKKohonenSOM¶

Kohonen SOM with Gaussian kernel distance for BMU selection. The grid neighborhood is unchanged — only the data-space metric changes.

from prosemble.models import DKKohonenSOM

model = DKKohonenSOM(
    grid_height=5,
    grid_width=5,
    kernel_sigma=1.0,
    sigma_init=2.0,
    sigma_final=0.5,
    lr_init=0.5,
    lr_final=0.01,
    max_iter=100,
    random_seed=42,
)
model.fit(X)

bmu_coords = model.bmu_map(X)
print(f"BMU coordinates shape: {bmu_coords.shape}")

DKHeskesSOM¶

Heskes SOM with Gaussian kernel distance. The Heskes BMU criterion selects the unit whose entire neighborhood best represents the sample, weighting the distance to each unit \(k\) by the grid neighborhood function \(h(k, c)\):

\[c^*(x) = \arg\min_c \sum_k h(k, c) \cdot d_\kappa^2(x, w_k)\]
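A minimal NumPy sketch of this criterion follows. It assumes a Gaussian neighborhood function on the grid, which is a common choice but is not confirmed here as the library's; `heskes_bmu` and `neighborhood_sigma` are hypothetical names.

```python
import numpy as np

def heskes_bmu(x, prototypes, grid_coords, kernel_sigma, neighborhood_sigma):
    # Kernel distance from x to every prototype w_k (data space).
    sq_euclid = np.sum((prototypes - x) ** 2, axis=1)
    d_kappa = 2.0 * (1.0 - np.exp(-sq_euclid / (2.0 * kernel_sigma ** 2)))

    # Assumed Gaussian neighborhood h[k, c] between grid units k and c.
    grid_diff = grid_coords[:, None, :] - grid_coords[None, :, :]
    grid_sq = np.sum(grid_diff ** 2, axis=-1)
    h = np.exp(-grid_sq / (2.0 * neighborhood_sigma ** 2))

    # costs[c] = sum_k h[k, c] * d_kappa[k]
    costs = h.T @ d_kappa
    return int(np.argmin(costs)), costs

# A 1 x 3 grid whose prototypes lie on a line in data space.
grid_coords = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
prototypes = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
x = np.array([0.1, 0.0])

bmu, costs = heskes_bmu(x, prototypes, grid_coords,
                        kernel_sigma=1.0, neighborhood_sigma=1.0)
print(bmu, costs)
```

Each candidate unit is scored by the neighborhood-weighted distances of all units, so a unit with a close prototype but poorly matching grid neighbors can lose to a unit whose whole neighborhood fits the sample.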

from prosemble.models import DKHeskesSOM

model = DKHeskesSOM(
    grid_height=5,
    grid_width=5,
    kernel_sigma=1.0,
    sigma_init=2.0,
    sigma_final=0.5,
    max_iter=100,
    random_seed=42,
)
model.fit(X)

bmu_coords = model.bmu_map(X)
print(f"Energy: {model.loss_:.4f}")

Choosing a Model¶

Model	Kernel	Learned Params	Best For
DKGLVQ	Gaussian	\(w_k, \sigma_k\)	Per-prototype bandwidth adaptation
DKGRLVQ	Gaussian (weighted)	\(w_k, \sigma_k, \lambda\)	Feature selection + kernel adaptation
DKGMLVQ	Exponential	\(w_k, \hat\Omega\)	Full metric adaptation in kernel space
DKNeuralGas	Gaussian (fixed \(\sigma\))	\(w_k\)	Unsupervised clustering with kernel distance
DKKohonenSOM	Gaussian (fixed \(\sigma\))	\(w_k\)	SOM visualization with kernel distance
DKHeskesSOM	Gaussian (fixed \(\sigma\))	\(w_k\)	Principled SOM with kernel distance
