Differentiating Kernel Models

Differentiating kernel models replace the Euclidean distance in prototype-based learning with a kernel-induced distance. The kernel parameters are adapted by gradient descent together with the prototypes, so the model learns a non-linear similarity measure from the data instead of relying on a fixed one.

Mathematical Background

Gaussian Kernel Distance

For a Gaussian kernel with bandwidth \(\sigma\):

\[\kappa(x, w) = \exp\!\left(-\frac{\|x - w\|^2}{2\sigma^2}\right)\]

the induced squared distance in feature space simplifies, because \(\kappa(v, v) = 1\) for every \(v\), to:

\[d_\kappa^2(x, w) = \|\phi(x) - \phi(w)\|^2 = 2\bigl(1 - \kappa(x, w)\bigr) = 2\left(1 - \exp\!\left( -\frac{\|x - w\|^2}{2\sigma^2} \right)\right)\]

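This squared distance lies in \([0, 2)\) regardless of input magnitude: it approaches 2 as \(\|x - w\|\) grows, so a far-away sample contributes at most a bounded amount, which makes the distance robust to outliers.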
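
A minimal sketch of this distance in plain Python, independent of prosemble. The function name `gaussian_kernel_distance_sq` is hypothetical and only illustrates the formula above; the example shows how the value saturates near 2 for a far-away point.

```python
import math


def gaussian_kernel_distance_sq(x, w, sigma):
    """Squared feature-space distance induced by a Gaussian kernel."""
    sq_euclid = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    kappa = math.exp(-sq_euclid / (2.0 * sigma ** 2))
    return 2.0 * (1.0 - kappa)


w = [0.0, 0.0]
same = gaussian_kernel_distance_sq(w, w, sigma=1.0)             # identical points
near = gaussian_kernel_distance_sq([0.1, 0.0], w, sigma=1.0)    # close to w
far = gaussian_kernel_distance_sq([100.0, 0.0], w, sigma=1.0)   # extreme outlier
print(same, near, far)
```

Because the value saturates, a single extreme sample cannot dominate a sum of such distances the way it can with the squared Euclidean distance.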

Relevance-Weighted Kernel Distance

Adding per-feature relevance weights \(\lambda_j = \text{softmax}(\text{relevances})_j\) and a per-prototype bandwidth \(\sigma_k\) gives:

\[d_\kappa^2(x, w_k) = 2\left(1 - \exp\!\left( -\frac{\sum_j \lambda_j (x_j - w_{kj})^2}{2\sigma_k^2} \right)\right)\]

This combines soft feature selection with the kernel distance: the weights are non-negative and sum to one, so a large \(\lambda_j\) marks an input dimension that matters for classification.
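
The weighted distance can be sketched in plain Python as follows. The helper names `softmax` and `relevance_kernel_distance_sq` are hypothetical and not part of prosemble; the example compares uniform relevances with relevances that concentrate almost all weight on the first feature.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of raw relevance scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_kernel_distance_sq(x, w, relevances, sigma):
    """Gaussian kernel distance with softmax-normalized feature weights."""
    lam = softmax(relevances)
    weighted = sum(l * (xi - wi) ** 2 for l, xi, wi in zip(lam, x, w))
    return 2.0 * (1.0 - math.exp(-weighted / (2.0 * sigma ** 2)))


x = [1.0, 5.0]
w = [1.5, 0.0]  # close in feature 0, far in feature 1

uniform = relevance_kernel_distance_sq(x, w, [0.0, 0.0], sigma=1.0)
first_only = relevance_kernel_distance_sq(x, w, [10.0, -10.0], sigma=1.0)
print(f"uniform weights: {uniform:.4f}, weight on feature 0: {first_only:.4f}")
```

With almost all weight on the first feature, the large gap in the second feature is effectively ignored and the distance drops.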

Exponential Kernel Distance

The exponential kernel uses a learned transformation matrix \(\hat\Lambda = \hat\Omega \hat\Omega^T\):

\[\kappa_{\exp}(x, w) = \exp\!\bigl(x^T \hat\Lambda\, w\bigr)\]

Unlike the Gaussian kernel, \(\kappa_{\exp}(v, v) \neq 1\) in general, so the self-similarity terms do not reduce to constants and the full three-term distance formula is required:

\[d_\kappa^2(x, w) = \exp\!\bigl(x^T \hat\Lambda\, x\bigr) + \exp\!\bigl(w^T \hat\Lambda\, w\bigr) - 2\exp\!\bigl(x^T \hat\Lambda\, w\bigr)\]
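
A small sketch of the three-term formula in plain Python. It assumes \(\hat\Omega\) is stored as a list of d rows with latent_dim entries each, and it evaluates \(x^T \hat\Lambda\, w\) as the dot product of the projections \(\hat\Omega^T x\) and \(\hat\Omega^T w\). The function names are hypothetical and not prosemble API.

```python
import math


def project(omega_hat, v):
    """Compute Omega_hat^T v for omega_hat of shape (d, latent_dim)."""
    latent_dim = len(omega_hat[0])
    return [
        sum(omega_hat[i][j] * v[i] for i in range(len(v)))
        for j in range(latent_dim)
    ]


def exp_kernel(x, w, omega_hat):
    """kappa_exp(x, w) = exp(x^T Lambda_hat w) with Lambda_hat = Omega Omega^T."""
    px, pw = project(omega_hat, x), project(omega_hat, w)
    return math.exp(sum(a * b for a, b in zip(px, pw)))


def exp_kernel_distance_sq(x, w, omega_hat):
    """Three-term feature-space distance for the exponential kernel."""
    return (
        exp_kernel(x, x, omega_hat)
        + exp_kernel(w, w, omega_hat)
        - 2.0 * exp_kernel(x, w, omega_hat)
    )


omega_hat = [[0.1, 0.0], [0.0, 0.1]]  # small entries keep exp() well-behaved
x = [1.0, 2.0]
w = [2.0, 0.0]

self_sim = exp_kernel(x, x, omega_hat)  # not equal to 1, unlike the Gaussian case
d_sq = exp_kernel_distance_sq(x, w, omega_hat)
print(f"kappa(x, x) = {self_sim:.4f}, d^2(x, w) = {d_sq:.4f}")
```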

Supervised Models

DKGLVQ

Differentiating Kernel GLVQ. Each prototype \(w_k\) has a learnable bandwidth \(\sigma_k\) adapted via gradient descent.

from prosemble.models import DKGLVQ
from prosemble.datasets import load_iris_jax

dataset = load_iris_jax()
X, y = dataset.input_data, dataset.target

model = DKGLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    sigma_init='median',   # per-class median distance initialization
    sigma_min=1e-3,        # prevent bandwidth collapse
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Accuracy: {(preds == y).mean():.2%}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")
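
How a bandwidth can be adapted by gradient descent may be sketched as follows. This is an illustration under stated assumptions, not prosemble's implementation: the helper names are hypothetical, the step descends on the distance itself (the model differentiates its full loss), and clamping with `max` is only one possible way to enforce `sigma_min`.

```python
import math


def kernel_distance_sq(sq_euclid, sigma):
    """Gaussian kernel distance 2 * (1 - exp(-r / (2 sigma^2))), r = ||x - w||^2."""
    return 2.0 * (1.0 - math.exp(-sq_euclid / (2.0 * sigma ** 2)))


def grad_wrt_sigma(sq_euclid, sigma):
    """Analytic derivative of kernel_distance_sq with respect to sigma."""
    kappa = math.exp(-sq_euclid / (2.0 * sigma ** 2))
    return -2.0 * kappa * sq_euclid / sigma ** 3


def bandwidth_step(sigma, grad, lr=0.01, sigma_min=1e-3):
    """One gradient-descent step on sigma, clamped from below (assumed mechanism)."""
    return max(sigma - lr * grad, sigma_min)


sq_euclid, sigma = 1.5, 0.8
grad = grad_wrt_sigma(sq_euclid, sigma)
new_sigma = bandwidth_step(sigma, grad)
print(f"grad = {grad:.4f}, sigma: {sigma:.4f} -> {new_sigma:.4f}")
```

The derivative is negative for any non-zero \(\|x - w\|\), so descending on the distance alone would widen the kernel; in the trained model the direction depends on how the distance enters the loss.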

The sigma_init parameter controls initialization:

  • 'median' (default): per-class median distance from prototype to class members

  • 'mean': per-class mean distance

  • float: fixed value for all prototypes
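
The three initialization options listed above can be sketched roughly like this. The function `init_bandwidth` is hypothetical, and the sketch assumes Euclidean distances from a prototype to all training samples of its own class; the library's exact rule may differ.

```python
import math
from statistics import mean, median


def init_bandwidth(prototype, class_members, sigma_init="median"):
    """Initial sigma for one prototype from distances to samples of its class."""
    if isinstance(sigma_init, (int, float)):
        return float(sigma_init)  # fixed value shared by all prototypes
    dists = [math.dist(prototype, x) for x in class_members]
    if sigma_init == "median":
        return median(dists)
    if sigma_init == "mean":
        return mean(dists)
    raise ValueError(f"unknown sigma_init: {sigma_init!r}")


prototype = [0.0, 0.0]
members = [[3.0, 4.0], [0.0, 1.0], [0.0, 2.0]]  # toy samples of one class

for option in ("median", "mean", 0.5):
    print(option, init_bandwidth(prototype, members, option))
```

In this toy example the median ignores the one distant member while the mean is pulled towards it, which is the usual reason to prefer a median-based default.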

DKGRLVQ

Differentiating Kernel GRLVQ. Combines per-feature relevance weighting with per-prototype kernel bandwidth adaptation.

from prosemble.models import DKGRLVQ

# X, y: Iris data loaded in the DKGLVQ example above
model = DKGRLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    sigma_init='median',
    sigma_min=1e-3,
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Accuracy: {(preds == y).mean():.2%}")
print(f"Relevance profile: {model.relevance_profile}")
print(f"Learned bandwidths: {model.kernel_bandwidths}")

The relevance_profile property returns the normalized feature relevance weights \(\lambda = \text{softmax}(\text{relevances})\), identifying which features are most discriminative.

DKGMLVQ

Differentiating Kernel GMLVQ with the exponential kernel. Learns a global transformation matrix \(\hat\Omega\) of shape (d, latent_dim).

from prosemble.models import DKGMLVQ

# X, y: Iris data loaded in the DKGLVQ example above
model = DKGMLVQ(
    n_prototypes_per_class=2,
    max_iter=200,
    lr=0.01,
    latent_dim=None,          # defaults to input dim
    omega_hat_scale=0.1,      # small init prevents exp overflow
    random_seed=42,
)
model.fit(X, y)

preds = model.predict(X)
print(f"Omega hat shape: {model.omega_hat_matrix.shape}")
print(f"Lambda hat shape: {model.lambda_hat_matrix.shape}")

The lambda_hat_matrix property returns the symmetric positive semi-definite matrix \(\hat\Lambda = \hat\Omega \hat\Omega^T\), which can be analyzed for feature correlations learned by the model.
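
Why \(\hat\Lambda\) is symmetric and positive semi-definite can be checked by hand: \(v^T \hat\Lambda\, v = \|\hat\Omega^T v\|^2 \ge 0\). The sketch below builds \(\hat\Lambda\) from a toy \(\hat\Omega\) in plain Python. The helper names are hypothetical and the matrix values are made up for illustration; nothing here calls prosemble.

```python
def lambda_hat_from_omega(omega_hat):
    """Lambda_hat = Omega_hat @ Omega_hat^T for omega_hat of shape (d, latent_dim)."""
    d = len(omega_hat)
    return [
        [sum(a * b for a, b in zip(omega_hat[i], omega_hat[j])) for j in range(d)]
        for i in range(d)
    ]


def quadratic_form(matrix, v):
    """Evaluate v^T M v."""
    n = len(v)
    return sum(v[i] * matrix[i][j] * v[j] for i in range(n) for j in range(n))


omega_hat = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.0]]  # d = 3, latent_dim = 2
lam = lambda_hat_from_omega(omega_hat)

v = [1.0, -2.0, 0.5]
print("off-diagonal pair:", lam[0][1], lam[1][0])  # equal entries: symmetric
print("v^T Lambda v:", quadratic_form(lam, v))     # never negative
```

A large off-diagonal entry \(\hat\Lambda_{ij}\) indicates that features \(i\) and \(j\) are coupled in the learned similarity, while the diagonal reflects how strongly each feature contributes on its own.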

Unsupervised Models

The unsupervised kernel models use the Gaussian kernel distance for prototype ranking and BMU selection, but \(\sigma\) is a fixed hyperparameter (not learned). Prototypes live in the original data space — only the distance metric changes.

DKNeuralGas

Neural Gas with Gaussian kernel distance for ranking.

from prosemble.models import DKNeuralGas
from prosemble.datasets import load_iris_jax

dataset = load_iris_jax()
X = dataset.input_data

model = DKNeuralGas(
    n_prototypes=10,
    kernel_sigma=1.0,
    max_iter=100,
    lr_init=0.5,
    lr_final=0.01,
    lambda_final=0.01,
    random_seed=42,
)
model.fit(X)

labels = model.predict(X)
print(f"Energy: {model.loss_:.4f}")
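
The ranking step can be sketched as follows, assuming the standard Neural Gas neighborhood \(\exp(-\text{rank}/\lambda)\) applied to ranks computed from the kernel distance. The function names and the toy prototypes are hypothetical; this is not the library's code.

```python
import math


def kernel_distance_sq(x, w, sigma):
    """Gaussian kernel distance between a sample and a prototype."""
    sq_euclid = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return 2.0 * (1.0 - math.exp(-sq_euclid / (2.0 * sigma ** 2)))


def neural_gas_weights(x, prototypes, sigma, lam):
    """Rank prototypes by kernel distance and return exp(-rank / lam) per prototype."""
    dists = [kernel_distance_sq(x, w, sigma) for w in prototypes]
    order = sorted(range(len(prototypes)), key=lambda k: dists[k])
    ranks = [0] * len(prototypes)
    for rank, k in enumerate(order):
        ranks[k] = rank  # rank 0 is the closest prototype
    return [math.exp(-r / lam) for r in ranks]


prototypes = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
x = [0.9, 1.2]
weights = neural_gas_weights(x, prototypes, sigma=1.0, lam=1.0)
print(weights)  # the closest prototype gets weight 1, the others decay with rank
```

Each prototype would then be moved towards the sample in proportion to its weight, and shrinking `lam` over training narrows the update to the closest prototypes.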

DKKohonenSOM

Kohonen SOM with Gaussian kernel distance for BMU selection. The grid neighborhood is unchanged — only the data-space metric changes.

from prosemble.models import DKKohonenSOM

# X: Iris inputs loaded in the DKNeuralGas example above
model = DKKohonenSOM(
    grid_height=5,
    grid_width=5,
    kernel_sigma=1.0,
    sigma_init=2.0,
    sigma_final=0.5,
    lr_init=0.5,
    lr_final=0.01,
    max_iter=100,
    random_seed=42,
)
model.fit(X)

bmu_coords = model.bmu_map(X)
print(f"BMU coordinates shape: {bmu_coords.shape}")

DKHeskesSOM

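Heskes SOM with Gaussian kernel distance. The Heskes BMU criterion selects the unit whose entire neighborhood best represents the sample, where \(h(k, c)\) is the grid neighborhood weight between units \(k\) and \(c\):

\[c^*(x) = \arg\min_c \sum_k h(k, c) \cdot d_\kappa^2(x, w_k)\]

from prosemble.models import DKHeskesSOM

# X: Iris inputs loaded in the DKNeuralGas example above
model = DKHeskesSOM(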
    grid_height=5,
    grid_width=5,
    kernel_sigma=1.0,
    sigma_init=2.0,
    sigma_final=0.5,
    max_iter=100,
    random_seed=42,
)
model.fit(X)

bmu_coords = model.bmu_map(X)
print(f"Energy: {model.loss_:.4f}")
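
The Heskes BMU criterion can be sketched in plain Python as below. The sketch assumes a Gaussian grid neighborhood with a hypothetical `radius` parameter; the function names and the tiny 1 x 3 grid are illustrative and not taken from prosemble.

```python
import math


def kernel_distance_sq(x, w, sigma):
    """Gaussian kernel distance in data space."""
    sq_euclid = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return 2.0 * (1.0 - math.exp(-sq_euclid / (2.0 * sigma ** 2)))


def grid_neighborhood(coord_k, coord_c, radius):
    """Gaussian neighborhood h(k, c) on grid coordinates (assumed form)."""
    sq_grid = sum((a - b) ** 2 for a, b in zip(coord_k, coord_c))
    return math.exp(-sq_grid / (2.0 * radius ** 2))


def heskes_bmu(x, prototypes, coords, kernel_sigma, radius):
    """Index c minimizing sum_k h(k, c) * d_kappa^2(x, w_k)."""
    n = len(prototypes)
    dists = [kernel_distance_sq(x, w, kernel_sigma) for w in prototypes]
    costs = [
        sum(grid_neighborhood(coords[k], coords[c], radius) * dists[k] for k in range(n))
        for c in range(n)
    ]
    return min(range(n), key=lambda c: costs[c])


coords = [(0, 0), (0, 1), (0, 2)]   # 1 x 3 grid
prototypes = [[0.0], [1.0], [2.0]]  # one prototype per grid unit

left = heskes_bmu([0.1], prototypes, coords, kernel_sigma=1.0, radius=1.0)
right = heskes_bmu([1.9], prototypes, coords, kernel_sigma=1.0, radius=1.0)
print(left, right)
```

As `radius` shrinks, \(h(k, c)\) approaches an indicator of \(k = c\) and the criterion reduces to picking the single closest prototype, as in a Kohonen SOM.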

Choosing a Model

| Model | Kernel | Learned Params | Best For |
|---|---|---|---|
| DKGLVQ | Gaussian | \(w_k, \sigma_k\) | Per-prototype bandwidth adaptation |
| DKGRLVQ | Gaussian (weighted) | \(w_k, \sigma_k, \lambda\) | Feature selection + kernel adaptation |
| DKGMLVQ | Exponential | \(w_k, \hat\Omega\) | Full metric adaptation in kernel space |
| DKNeuralGas | Gaussian (fixed \(\sigma\)) | \(w_k\) | Unsupervised clustering with kernel distance |
| DKKohonenSOM | Gaussian (fixed \(\sigma\)) | \(w_k\) | SOM visualization with kernel distance |
| DKHeskesSOM | Gaussian (fixed \(\sigma\)) | \(w_k\) | Principled SOM with kernel distance |

References