Differentiating Kernel Models ============================== Differentiating kernel models replace Euclidean distances with **kernel-induced distances** in prototype-based learning. The kernel parameters are adapted via gradient descent alongside the prototypes, enabling the model to learn an optimal non-linear similarity measure from data. Mathematical Background ----------------------- Gaussian Kernel Distance ^^^^^^^^^^^^^^^^^^^^^^^^ For a Gaussian kernel with bandwidth :math:`\sigma`: .. math:: \kappa(x, w) = \exp\!\left(-\frac{\|x - w\|^2}{2\sigma^2}\right) the induced distance in feature space is: .. math:: d_\kappa^2(x, w) = \|\phi(x) - \phi(w)\|^2 = 2\bigl(1 - \kappa(x, w)\bigr) = 2\left(1 - \exp\!\left( -\frac{\|x - w\|^2}{2\sigma^2} \right)\right) This distance is bounded in :math:`[0, 2]` regardless of input magnitude, making it naturally robust to outliers. Relevance-Weighted Kernel Distance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Adding per-feature relevance weights :math:`\lambda_j = \text{softmax}(\text{relevances})_j`: .. math:: d_\kappa^2(x, w_k) = 2\left(1 - \exp\!\left( -\frac{\sum_j \lambda_j (x_j - w_{kj})^2}{2\sigma_k^2} \right)\right) This combines feature selection with kernel distance, identifying which input dimensions are most important for classification. Exponential Kernel Distance ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The exponential kernel uses a learned transformation matrix :math:`\hat\Lambda = \hat\Omega \hat\Omega^T`: .. math:: \kappa_{\exp}(x, w) = \exp\!\bigl(x^T \hat\Lambda\, w\bigr) Unlike the Gaussian kernel, :math:`\kappa_{\exp}(v, v) \neq 1`, so the full three-term distance formula is required: .. math:: d_\kappa^2(x, w) = \exp\!\bigl(x^T \hat\Lambda\, x\bigr) + \exp\!\bigl(w^T \hat\Lambda\, w\bigr) - 2\exp\!\bigl(x^T \hat\Lambda\, w\bigr) Supervised Models ----------------- DKGLVQ ^^^^^^ Differentiating Kernel GLVQ. Each prototype :math:`w_k` has a learnable bandwidth :math:`\sigma_k` adapted via gradient descent. .. code-block:: python from prosemble.models import DKGLVQ from prosemble.datasets import load_iris_jax dataset = load_iris_jax() X, y = dataset.input_data, dataset.target model = DKGLVQ( n_prototypes_per_class=2, max_iter=200, lr=0.01, sigma_init='median', # per-class median distance initialization sigma_min=1e-3, # prevent bandwidth collapse random_seed=42, ) model.fit(X, y) preds = model.predict(X) print(f"Accuracy: {(preds == y).mean():.2%}") print(f"Learned bandwidths: {model.kernel_bandwidths}") The ``sigma_init`` parameter controls initialization: - ``'median'`` (default): per-class median distance from prototype to class members - ``'mean'``: per-class mean distance - ``float``: fixed value for all prototypes DKGRLVQ ^^^^^^^^ Differentiating Kernel GRLVQ. Combines per-feature relevance weighting with per-prototype kernel bandwidth adaptation. .. code-block:: python from prosemble.models import DKGRLVQ model = DKGRLVQ( n_prototypes_per_class=2, max_iter=200, lr=0.01, sigma_init='median', sigma_min=1e-3, random_seed=42, ) model.fit(X, y) preds = model.predict(X) print(f"Accuracy: {(preds == y).mean():.2%}") print(f"Relevance profile: {model.relevance_profile}") print(f"Learned bandwidths: {model.kernel_bandwidths}") The ``relevance_profile`` property returns the normalized feature relevance weights :math:`\lambda = \text{softmax}(\text{relevances})`, identifying which features are most discriminative. DKGMLVQ ^^^^^^^^ Differentiating Kernel GMLVQ with the exponential kernel. Learns a global transformation matrix :math:`\hat\Omega` of shape ``(d, latent_dim)``. .. code-block:: python from prosemble.models import DKGMLVQ model = DKGMLVQ( n_prototypes_per_class=2, max_iter=200, lr=0.01, latent_dim=None, # defaults to input dim omega_hat_scale=0.1, # small init prevents exp overflow random_seed=42, ) model.fit(X, y) preds = model.predict(X) print(f"Omega hat shape: {model.omega_hat_matrix.shape}") print(f"Lambda hat shape: {model.lambda_hat_matrix.shape}") The ``lambda_hat_matrix`` property returns the symmetric positive semi-definite matrix :math:`\hat\Lambda = \hat\Omega \hat\Omega^T`, which can be analyzed for feature correlations learned by the model. One-Class Models ----------------- The one-class differentiating kernel models combine OC-GLVQ's :math:`\theta`-based hypothesis testing with kernel distances. In standard OC-GLVQ, the classifier function is: .. math:: \mu_{k^*}(x_i) = s_i \cdot \frac{d(x_i, w_{k^*}) - \theta_{k^*}}{d(x_i, w_{k^*}) + \theta_{k^*}} where :math:`k^*` is the nearest prototype, :math:`\theta_{k^*}` is a learned per-prototype visibility threshold, and :math:`s_i = +1` for target, :math:`-1` for outlier. The OC-DK variants replace the Euclidean distance :math:`d` with kernel distances. **Critical design detail:** The :math:`\theta_k` thresholds are initialized in *kernel distance scale*, not Euclidean scale. Gaussian kernel distances are bounded in :math:`[0, 2]`, so Euclidean-initialized thetas would be far too large. OCDKGLVQ ^^^^^^^^^ One-class classification with Gaussian kernel distance and per-prototype bandwidth adaptation. .. code-block:: python from prosemble.models import OCDKGLVQ import jax import jax.numpy as jnp # Generate one-class dataset key = jax.random.PRNGKey(42) k1, k2 = jax.random.split(key) X_target = jax.random.normal(k1, (100, 4)) * 0.5 X_outlier = jax.random.normal(k2, (30, 4)) * 0.5 + 3.0 X = jnp.concatenate([X_target, X_outlier]) y = jnp.concatenate([jnp.zeros(100, dtype=jnp.int32), jnp.ones(30, dtype=jnp.int32)]) model = OCDKGLVQ( n_prototypes=3, max_iter=100, lr=0.01, sigma_init='median', sigma_min=1e-3, target_label=0, random_seed=42, ) model.fit(X, y) scores = model.decision_function(X) preds = model.predict(X) print(f"Learned bandwidths: {model.kernel_bandwidths}") print(f"Visibility radii: {model.visibility_radii}") OCDKGRLVQ ^^^^^^^^^^ One-class classification with relevance-weighted kernel distance, per-prototype bandwidth, and per-feature relevance learning. .. code-block:: python from prosemble.models import OCDKGRLVQ model = OCDKGRLVQ( n_prototypes=3, max_iter=100, lr=0.01, sigma_init='median', sigma_min=1e-3, target_label=0, random_seed=42, ) model.fit(X, y) scores = model.decision_function(X) print(f"Relevance profile: {model.relevance_profile}") print(f"Learned bandwidths: {model.kernel_bandwidths}") The ``relevance_profile`` property returns the softmax-normalized per-feature weights, identifying which features are most important for the one-class boundary. OCDKGMLVQ ^^^^^^^^^^ One-class classification with exponential kernel distance and a learned transformation matrix :math:`\hat\Omega`. .. code-block:: python from prosemble.models import OCDKGMLVQ model = OCDKGMLVQ( n_prototypes=3, max_iter=100, lr=0.01, latent_dim=None, omega_hat_scale=0.1, target_label=0, random_seed=42, ) model.fit(X, y) scores = model.decision_function(X) print(f"Omega hat shape: {model.omega_hat_matrix.shape}") print(f"Lambda hat (PSD): {model.lambda_hat_matrix.shape}") Supervised Models with Neural Gas ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The supervised DK-NG variants combine differentiating kernel distances with Neural Gas class-aware neighborhood cooperation. All **same-class** prototypes participate in the loss, weighted by their distance rank: .. math:: h_k = \exp\left(-\frac{\text{rank}_k}{\gamma}\right), \quad \text{only for } \text{label}(w_k) = \text{label}(x) where :math:`\gamma` decays during training from ``gamma_init`` to ``gamma_final``, and the GLVQ margin is computed per prototype: .. math:: \mu_k = \frac{d_\kappa(x, w_k) - d^-}{d_\kappa(x, w_k) + d^-} with :math:`d^-` being the nearest different-class prototype distance. .. code-block:: python from prosemble.models import DKGLVQ_NG model = DKGLVQ_NG( n_prototypes_per_class=3, max_iter=100, lr=0.01, sigma_init='median', gamma_init=1.5, gamma_final=0.01, random_seed=42, ) model.fit(X, y) preds = model.predict(X) print(f"Final gamma: {model.gamma_}") The relevance-weighted variant (``DKGRLVQ_NG``) adds per-feature relevance weights, while the matrix variant (``DKGMLVQ_NG``) uses exponential kernel distance with learnable :math:`\hat\Omega` transformation. One-Class Models with Neural Gas ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The OC-DK-NG variants extend the one-class kernel models with Neural Gas neighborhood cooperation. Instead of only the nearest prototype contributing to the loss, **all** prototypes participate weighted by their distance rank: .. math:: h_k = \exp\left(-\frac{\text{rank}_k}{\gamma}\right) where :math:`\gamma` decays during training from ``gamma_init`` to ``gamma_final``. .. code-block:: python from prosemble.models import OCDKGLVQ_NG model = OCDKGLVQ_NG( n_prototypes=3, max_iter=100, lr=0.01, sigma_init='median', gamma_init=1.5, gamma_final=0.01, target_label=0, random_seed=42, ) model.fit(X, y) print(f"Final gamma: {model.gamma_}") The relevance-weighted variant (``OCDKGRLVQ_NG``) and matrix variant (``OCDKGMLVQ_NG``) follow the same pattern, combining their respective kernel distances with NG cooperation. Unsupervised Models ------------------- The unsupervised kernel models use the Gaussian kernel distance for prototype ranking and BMU selection, but :math:`\sigma` is a **fixed hyperparameter** (not learned). Prototypes live in the original data space — only the distance metric changes. DKNeuralGas ^^^^^^^^^^^^ Neural Gas with Gaussian kernel distance for ranking. .. code-block:: python from prosemble.models import DKNeuralGas from prosemble.datasets import load_iris_jax dataset = load_iris_jax() X = dataset.input_data model = DKNeuralGas( n_prototypes=10, kernel_sigma=1.0, max_iter=100, lr_init=0.5, lr_final=0.01, lambda_final=0.01, random_seed=42, ) model.fit(X) labels = model.predict(X) print(f"Energy: {model.loss_:.4f}") DKKohonenSOM ^^^^^^^^^^^^^ Kohonen SOM with Gaussian kernel distance for BMU selection. The grid neighborhood is unchanged — only the data-space metric changes. .. code-block:: python from prosemble.models import DKKohonenSOM model = DKKohonenSOM( grid_height=5, grid_width=5, kernel_sigma=1.0, sigma_init=2.0, sigma_final=0.5, lr_init=0.5, lr_final=0.01, max_iter=100, random_seed=42, ) model.fit(X) bmu_coords = model.bmu_map(X) print(f"BMU coordinates shape: {bmu_coords.shape}") DKHeskesSOM ^^^^^^^^^^^^ Heskes SOM with Gaussian kernel distance. The Heskes BMU criterion selects the unit whose **entire neighborhood** best represents the sample: .. math:: c^*(x) = \arg\min_c \sum_k h(k, c) \cdot d_\kappa^2(x, w_k) .. code-block:: python from prosemble.models import DKHeskesSOM model = DKHeskesSOM( grid_height=5, grid_width=5, kernel_sigma=1.0, sigma_init=2.0, sigma_final=0.5, max_iter=100, random_seed=42, ) model.fit(X) bmu_coords = model.bmu_map(X) print(f"Energy: {model.loss_:.4f}") Choosing a Model ----------------- .. list-table:: :header-rows: 1 :widths: 20 20 20 40 * - Model - Kernel - Learned Params - Best For * - DKGLVQ - Gaussian - :math:`w_k, \sigma_k` - Per-prototype bandwidth adaptation * - DKGRLVQ - Gaussian (weighted) - :math:`w_k, \sigma_k, \lambda` - Feature selection + kernel adaptation * - DKGMLVQ - Exponential - :math:`w_k, \hat\Omega` - Full metric adaptation in kernel space * - DKGLVQ_NG - Gaussian - :math:`w_k, \sigma_k, \gamma` - Supervised kernel + NG cooperation * - DKGRLVQ_NG - Gaussian (weighted) - :math:`w_k, \sigma_k, \lambda, \gamma` - Supervised kernel + relevances + NG cooperation * - DKGMLVQ_NG - Exponential - :math:`w_k, \hat\Omega, \gamma` - Supervised kernel matrix + NG cooperation * - OCDKGLVQ - Gaussian - :math:`w_k, \sigma_k, \theta_k` - One-class with kernel bandwidth adaptation * - OCDKGRLVQ - Gaussian (weighted) - :math:`w_k, \sigma_k, \lambda, \theta_k` - One-class with feature selection + kernel * - OCDKGMLVQ - Exponential - :math:`w_k, \hat\Omega, \theta_k` - One-class with full metric adaptation in kernel space * - OCDKGLVQ_NG - Gaussian - :math:`w_k, \sigma_k, \theta_k, \gamma` - One-class kernel + NG cooperation * - OCDKGRLVQ_NG - Gaussian (weighted) - :math:`w_k, \sigma_k, \lambda, \theta_k, \gamma` - One-class kernel + relevances + NG cooperation * - OCDKGMLVQ_NG - Exponential - :math:`w_k, \hat\Omega, \theta_k, \gamma` - One-class kernel matrix + NG cooperation * - DKNeuralGas - Gaussian (fixed :math:`\sigma`) - :math:`w_k` - Unsupervised clustering with kernel distance * - DKKohonenSOM - Gaussian (fixed :math:`\sigma`) - :math:`w_k` - SOM visualization with kernel distance * - DKHeskesSOM - Gaussian (fixed :math:`\sigma`) - :math:`w_k` - Principled SOM with kernel distance Riemannian Variants ------------------- Differentiating kernel distances can also be applied on Riemannian manifolds. Three models combine the RiemannianSRNG framework (prototypes on manifold, NG rank cooperation, manifold projection) with kernel distance formulas: - **RiemannianDKGLVQ** — Gaussian kernel on geodesic distance: :math:`d_\kappa^2(x, w_k) = 2(1 - \exp(-d_{\text{geo}}^2(x, w_k) / 2\sigma_k^2))` - **RiemannianDKGRLVQ** — Relevance-weighted kernel in tangent space: :math:`d_\kappa^2(x, w_k) = 2(1 - \exp(-\sum_j \lambda_j v_j^2 / 2\sigma_k^2))` where :math:`v = \text{Log}_{w_k}(x)_{\text{flat}}` - **RiemannianDKGMLVQ** — Exponential kernel in tangent space: :math:`d_\kappa^2(x, w_k) = \exp(v^T \hat\Lambda v) - 1` where :math:`\hat\Lambda = \hat\Omega \hat\Omega^T` All three support SO(n), SPD(n), and Grassmannian(n,k) manifolds. .. code-block:: python from prosemble.core.manifolds import SO from prosemble.models import RiemannianDKGLVQ manifold = SO(3) model = RiemannianDKGLVQ( manifold=manifold, n_prototypes_per_class=2, max_iter=100, lr=0.01, use_scan=False, ) model.fit(X_train, y_train) preds = model.predict(X_test) # Inspect learned bandwidths print(model.kernel_bandwidths) Riemannian Metric-Adapted DK Variants ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Nine additional models combine metric adaptation (global/local/subspace) with kernel distance learning on Riemannian manifolds, completing a 3x3 grid (Gaussian/Relevance/Matrix kernels x SMNG/SLNG/STNG bases): **Gaussian kernel variants** (per-prototype :math:`\sigma_k`): - **RiemannianDKSMNG** — :math:`d_\kappa^2 = 2(1 - \exp(-\|\Omega \cdot v\|^2 / 2\sigma_k^2))` - **RiemannianDKSLNG** — :math:`d_\kappa^2 = 2(1 - \exp(-\|\Omega_k \cdot v\|^2 / 2\sigma_k^2))` - **RiemannianDKSTNG** — :math:`d_\kappa^2 = 2(1 - \exp(-\|(I - \Omega_k\Omega_k^T) \cdot v\|^2 / 2\sigma_k^2))` **Relevance kernel variants** (:math:`\sigma_k` + relevance :math:`\lambda`): - **RiemannianDKRSMNG** — :math:`d_\kappa^2 = 2(1 - \exp(-\sum_j \lambda_j (\Omega \cdot v)_j^2 / 2\sigma_k^2))` - **RiemannianDKRSLNG** — :math:`d_\kappa^2 = 2(1 - \exp(-\sum_j \lambda_j (\Omega_k \cdot v)_j^2 / 2\sigma_k^2))` - **RiemannianDKRSTNG** — :math:`d_\kappa^2 = 2(1 - \exp(-\sum_j \lambda_j r_j^2 / 2\sigma_k^2))` **Matrix kernel variants** (exponential with :math:`\hat\Lambda = \hat\Omega\hat\Omega^T`): - **RiemannianDKMSMNG** — :math:`d_\kappa^2 = \exp((\Omega \cdot v)^T \hat\Lambda (\Omega \cdot v)) - 1` - **RiemannianDKMSLNG** — :math:`d_\kappa^2 = \exp((\Omega_k \cdot v)^T \hat\Lambda (\Omega_k \cdot v)) - 1` - **RiemannianDKMSTNG** — :math:`d_\kappa^2 = \exp(r^T \hat\Lambda r) - 1` .. code-block:: python from prosemble.core.manifolds import Grassmannian from prosemble.models import RiemannianDKRSMNG manifold = Grassmannian(4, 2) model = RiemannianDKRSMNG( manifold=manifold, n_prototypes_per_class=2, max_iter=100, lr=0.01, use_scan=False, ) model.fit(X_train, y_train) preds = model.predict(X_test) print(model.kernel_bandwidths) print(model.relevance_profile) ONNX Export ----------- All 15 differentiating kernel models support ONNX export. The three kernel distance types are implemented as native ONNX subgraphs: .. code-block:: python from prosemble.models import DKGLVQ from prosemble.core.onnx_export import export_onnx model = DKGLVQ(n_prototypes_per_class=2, max_iter=100, lr=0.01) model.fit(X_train, y_train) # Export to ONNX — kernel distance computed in the graph onnx_model = export_onnx(model, path='dkglvq.onnx') Per-prototype bandwidths :math:`\sigma_k` are clamped at export time (``sigma_min``), and relevance logits are normalized via a Softmax node in the ONNX graph. See the :doc:`onnx` guide for full details. References ---------- .. [1] Villmann, T., Haase, S., & Kaden, M. (2015). Kernelized vector quantization in gradient-descent learning. *Neurocomputing*, 147, 83--95.