Differentiating Kernel Models
==============================

Differentiating kernel models replace Euclidean distances with **kernel-induced
distances** in prototype-based learning [1]_. In the supervised variants, the
kernel parameters are adapted via gradient descent alongside the prototypes, so
the model learns a non-linear similarity measure tuned to the data. The
unsupervised variants keep the kernel bandwidth fixed.

Mathematical Background
-----------------------

Gaussian Kernel Distance
^^^^^^^^^^^^^^^^^^^^^^^^

For a Gaussian kernel with bandwidth :math:`\sigma`:

.. math::

   \kappa(x, w) = \exp\!\left(-\frac{\|x - w\|^2}{2\sigma^2}\right)

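Since :math:`\kappa(x, x) = \kappa(w, w) = 1`, the induced squared distance in
feature space simplifies to:

.. math::

   d_\kappa^2(x, w) &= \|\phi(x) - \phi(w)\|^2 \\
                    &= \kappa(x, x) - 2\kappa(x, w) + \kappa(w, w) \\
                    &= 2\bigl(1 - \kappa(x, w)\bigr) \\
                    &= 2\left(1 - \exp\!\left(
                         -\frac{\|x - w\|^2}{2\sigma^2}
                       \right)\right)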

This squared distance lies in :math:`[0, 2)` regardless of input magnitude and
saturates for distant points, which limits the influence of outliers.
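
The saturation can be checked numerically. The following is a minimal NumPy
sketch of the formula above, not prosemble's implementation; the helper name
``gaussian_kernel_distance_sq`` is illustrative only.

.. code-block:: python

   import numpy as np


   def gaussian_kernel_distance_sq(x, w, sigma):
       """Squared kernel-induced distance 2 * (1 - kappa(x, w))."""
       sq_euclidean = np.sum((x - w) ** 2)
       kappa = np.exp(-sq_euclidean / (2.0 * sigma**2))
       return 2.0 * (1.0 - kappa)


   w = np.array([0.0, 0.0])
   near = np.array([0.1, 0.0])        # close to the prototype
   outlier = np.array([1000.0, 0.0])  # far away from the prototype

   d_near = gaussian_kernel_distance_sq(near, w, sigma=1.0)
   d_outlier = gaussian_kernel_distance_sq(outlier, w, sigma=1.0)
   print(d_near, d_outlier)

With :math:`\sigma = 1`, the nearby point yields a value of roughly 0.01, while
the outlier at Euclidean distance 1000 saturates at 2.0 in floating point.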

Relevance-Weighted Kernel Distance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Adding per-feature relevance weights :math:`\lambda_j = \text{softmax}(\text{relevances})_j`:

.. math::

   d_\kappa^2(x, w_k) = 2\left(1 - \exp\!\left(
       -\frac{\sum_j \lambda_j (x_j - w_{kj})^2}{2\sigma_k^2}
   \right)\right)

This combines feature selection with kernel distance, identifying which
input dimensions are most important for classification.
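
As an illustration, the weighted distance can be sketched in NumPy as follows.
The function names and the raw relevance values are assumptions made for this
example and do not reflect the library's internals.

.. code-block:: python

   import numpy as np


   def softmax(z):
       """Numerically stable softmax; the output is positive and sums to 1."""
       e = np.exp(z - np.max(z))
       return e / e.sum()


   def relevance_kernel_distance_sq(x, w, relevances, sigma):
       """2 * (1 - exp(-sum_j lambda_j (x_j - w_j)^2 / (2 sigma^2)))."""
       lam = softmax(relevances)
       weighted_sq = np.sum(lam * (x - w) ** 2)
       return 2.0 * (1.0 - np.exp(-weighted_sq / (2.0 * sigma**2)))


   x = np.array([1.0, 5.0])
   w = np.array([0.0, 0.0])

   # Equal raw relevances: both features get lambda = 0.5.
   d_equal = relevance_kernel_distance_sq(x, w, np.array([0.0, 0.0]), sigma=1.0)
   # A strongly negative raw relevance nearly switches the second feature off.
   d_first_only = relevance_kernel_distance_sq(
       x, w, np.array([0.0, -20.0]), sigma=1.0
   )
   print(d_equal, d_first_only)

Down-weighting the second feature removes its large difference from the
exponent, so ``d_first_only`` is noticeably smaller than ``d_equal``.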

Exponential Kernel Distance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The exponential kernel uses a learned transformation matrix
:math:`\hat\Lambda = \hat\Omega \hat\Omega^T`:

.. math::

   \kappa_{\exp}(x, w) = \exp\!\bigl(x^T \hat\Lambda\, w\bigr)

Unlike the Gaussian kernel, :math:`\kappa_{\exp}(v, v) = \exp(v^T \hat\Lambda\, v)`
is generally not equal to 1, so the full three-term distance formula is required:

.. math::

   d_\kappa^2(x, w) = \exp\!\bigl(x^T \hat\Lambda\, x\bigr)
                    + \exp\!\bigl(w^T \hat\Lambda\, w\bigr)
                    - 2\exp\!\bigl(x^T \hat\Lambda\, w\bigr)
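
A small NumPy sketch of the three-term formula is given below. It assumes
:math:`\hat\Omega` has shape ``(d, latent_dim)``, and the helper name
``exp_kernel_distance_sq`` is hypothetical rather than part of prosemble.

.. code-block:: python

   import numpy as np


   def exp_kernel_distance_sq(x, w, omega_hat):
       """kappa(x, x) + kappa(w, w) - 2 * kappa(x, w), kappa = exp(x^T Lambda w)."""
       # x^T Lambda w equals the dot product of the projected vectors.
       px = omega_hat.T @ x
       pw = omega_hat.T @ w
       return np.exp(px @ px) + np.exp(pw @ pw) - 2.0 * np.exp(px @ pw)


   rng = np.random.default_rng(0)
   omega_hat = 0.1 * rng.standard_normal((4, 4))  # small scale keeps exp tame
   x = rng.standard_normal(4)
   w = rng.standard_normal(4)

   d_xw = exp_kernel_distance_sq(x, w, omega_hat)
   d_xx = exp_kernel_distance_sq(x, x, omega_hat)
   kappa_xx = np.exp(x @ omega_hat @ omega_hat.T @ x)
   print(d_xw, d_xx, kappa_xx)

Because :math:`x^T \hat\Lambda\, x = \|\hat\Omega^T x\|^2 \ge 0`, the
self-similarity :math:`\kappa_{\exp}(x, x)` is at least 1, while the distance
of a point to itself remains zero.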


Supervised Models
-----------------

DKGLVQ
^^^^^^

Differentiating Kernel GLVQ. Each prototype :math:`w_k` has a learnable
bandwidth :math:`\sigma_k` adapted via gradient descent.

.. code-block:: python

   from prosemble.models import DKGLVQ
   from prosemble.datasets import load_iris_jax

   dataset = load_iris_jax()
   X, y = dataset.input_data, dataset.target

   model = DKGLVQ(
       n_prototypes_per_class=2,
       max_iter=200,
       lr=0.01,
       sigma_init='median',   # per-class median distance initialization
       sigma_min=1e-3,        # prevent bandwidth collapse
       random_seed=42,
   )
   model.fit(X, y)

   preds = model.predict(X)
   print(f"Accuracy: {(preds == y).mean():.2%}")
   print(f"Learned bandwidths: {model.kernel_bandwidths}")

The ``sigma_init`` parameter controls initialization:

- ``'median'`` (default): per-class median distance from prototype to class members
- ``'mean'``: per-class mean distance
- ``float``: fixed value for all prototypes
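
To make the ``'median'`` option concrete, here is one plausible reading of it
as a NumPy sketch. The helper ``median_sigma_init`` is hypothetical, and both
the use of Euclidean distances and the clamping by ``sigma_min`` are
assumptions; the library's actual initialization may differ.

.. code-block:: python

   import numpy as np


   def median_sigma_init(X, y, prototypes, prototype_labels, sigma_min=1e-3):
       """Hypothetical helper: one bandwidth per prototype from class medians."""
       sigmas = np.empty(len(prototypes))
       for k, (w, c) in enumerate(zip(prototypes, prototype_labels)):
           members = X[y == c]  # samples sharing the prototype's class
           distances = np.linalg.norm(members - w, axis=1)
           # Clamp from below so that no bandwidth collapses to zero.
           sigmas[k] = max(np.median(distances), sigma_min)
       return sigmas


   X = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [10.0, 0.0], [10.0, 6.0]])
   y = np.array([0, 0, 0, 1, 1])
   prototypes = np.array([[0.0, 0.0], [10.0, 0.0]])
   prototype_labels = np.array([0, 1])

   sigmas = median_sigma_init(X, y, prototypes, prototype_labels)
   print(sigmas)  # medians of {0, 2, 4} and {0, 6}: 2.0 and 3.0

Replacing ``np.median`` with ``np.mean`` would give the corresponding sketch
for the ``'mean'`` option.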

DKGRLVQ
^^^^^^^^

Differentiating Kernel GRLVQ. Combines per-feature relevance weighting
with per-prototype kernel bandwidth adaptation.

.. code-block:: python

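   from prosemble.models import DKGRLVQ
   from prosemble.datasets import load_iris_jax

   dataset = load_iris_jax()
   X, y = dataset.input_data, dataset.target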

   model = DKGRLVQ(
       n_prototypes_per_class=2,
       max_iter=200,
       lr=0.01,
       sigma_init='median',
       sigma_min=1e-3,
       random_seed=42,
   )
   model.fit(X, y)

   preds = model.predict(X)
   print(f"Accuracy: {(preds == y).mean():.2%}")
   print(f"Relevance profile: {model.relevance_profile}")
   print(f"Learned bandwidths: {model.kernel_bandwidths}")

The ``relevance_profile`` property returns the normalized feature relevance
weights :math:`\lambda = \text{softmax}(\text{relevances})`, identifying
which features are most discriminative.

DKGMLVQ
^^^^^^^^

Differentiating Kernel GMLVQ with the exponential kernel. Learns a global
transformation matrix :math:`\hat\Omega` of shape ``(d, latent_dim)``.

.. code-block:: python

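   from prosemble.models import DKGMLVQ
   from prosemble.datasets import load_iris_jax

   dataset = load_iris_jax()
   X, y = dataset.input_data, dataset.target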

   model = DKGMLVQ(
       n_prototypes_per_class=2,
       max_iter=200,
       lr=0.01,
       latent_dim=None,          # defaults to input dim
       omega_hat_scale=0.1,      # small init prevents exp overflow
       random_seed=42,
   )
   model.fit(X, y)

   preds = model.predict(X)
   print(f"Accuracy: {(preds == y).mean():.2%}")
   print(f"Omega hat shape: {model.omega_hat_matrix.shape}")
   print(f"Lambda hat shape: {model.lambda_hat_matrix.shape}")

The ``lambda_hat_matrix`` property returns the symmetric positive
semi-definite matrix :math:`\hat\Lambda = \hat\Omega \hat\Omega^T`,
which can be analyzed for feature correlations learned by the model.
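
One way such an analysis could look is sketched below with NumPy. A random
matrix stands in for a learned :math:`\hat\Omega` (for example the value of
``model.omega_hat_matrix``), and the variable names are illustrative.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   d, latent_dim = 4, 2
   # Stand-in for a learned transformation matrix of shape (d, latent_dim).
   omega_hat = 0.1 * rng.standard_normal((d, latent_dim))
   lambda_hat = omega_hat @ omega_hat.T  # shape (d, d)

   # Symmetric PSD: eigenvalues are real and non-negative up to round-off.
   eigvals, eigvecs = np.linalg.eigh(lambda_hat)
   rank = int(np.sum(eigvals > 1e-12))  # at most latent_dim

   # Diagonal entries weight single features; off-diagonals couple pairs.
   feature_weights = np.diag(lambda_hat)
   print(rank, feature_weights)

A rank below ``d`` indicates that the model effectively compares samples in a
lower-dimensional projection, which happens by construction whenever
``latent_dim`` is smaller than the input dimension.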


Unsupervised Models
-------------------

The unsupervised kernel models use the Gaussian kernel distance for prototype
ranking and BMU selection, but :math:`\sigma` is a **fixed hyperparameter**
(not learned). Prototypes live in the original data space; only the distance
metric changes.
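
The ranking step can be sketched in NumPy as follows. This is an illustrative
example with made-up prototypes, not the library's code, and the helper name
is hypothetical.

.. code-block:: python

   import numpy as np


   def gaussian_kernel_distances_sq(x, prototypes, sigma):
       """Squared Gaussian kernel distance from x to every prototype."""
       sq_euclidean = np.sum((prototypes - x) ** 2, axis=1)
       return 2.0 * (1.0 - np.exp(-sq_euclidean / (2.0 * sigma**2)))


   prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [0.0, 5.0]])
   x = np.array([0.9, 0.8])

   d_kernel = gaussian_kernel_distances_sq(x, prototypes, sigma=1.0)
   ranking = np.argsort(d_kernel)  # prototype indices, closest first
   bmu = int(ranking[0])           # best-matching unit
   print(ranking, bmu)             # [1 0 2 3] 1

With a single shared :math:`\sigma`, the kernel distance is a strictly
increasing function of the Euclidean distance, so the rank order matches the
Euclidean one. The values themselves differ, which matters wherever distances
are summed rather than only ranked, as in the Heskes criterion.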

DKNeuralGas
^^^^^^^^^^^^

Neural Gas with Gaussian kernel distance for ranking.

.. code-block:: python

   from prosemble.models import DKNeuralGas
   from prosemble.datasets import load_iris_jax

   dataset = load_iris_jax()
   X = dataset.input_data

   model = DKNeuralGas(
       n_prototypes=10,
       kernel_sigma=1.0,      # fixed kernel bandwidth (not learned)
       max_iter=100,
       lr_init=0.5,
       lr_final=0.01,
       lambda_final=0.01,
       random_seed=42,
   )
   model.fit(X)

   labels = model.predict(X)
   print(f"Energy: {model.loss_:.4f}")

DKKohonenSOM
^^^^^^^^^^^^^

Kohonen SOM with Gaussian kernel distance for BMU selection. The grid
neighborhood is unchanged; only the data-space metric changes.

.. code-block:: python

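   from prosemble.models import DKKohonenSOM
   from prosemble.datasets import load_iris_jax

   dataset = load_iris_jax()
   X = dataset.input_data

   model = DKKohonenSOM(
       grid_height=5,
       grid_width=5,
       kernel_sigma=1.0,      # fixed kernel bandwidth (not learned)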
       sigma_init=2.0,
       sigma_final=0.5,
       lr_init=0.5,
       lr_final=0.01,
       max_iter=100,
       random_seed=42,
   )
   model.fit(X)

   bmu_coords = model.bmu_map(X)
   print(f"BMU coordinates shape: {bmu_coords.shape}")

DKHeskesSOM
^^^^^^^^^^^^

Heskes SOM with Gaussian kernel distance. The Heskes BMU criterion selects
the unit whose **entire neighborhood** best represents the sample:

.. math::

   c^*(x) = \arg\min_c \sum_k h(k, c) \cdot d_\kappa^2(x, w_k)

.. code-block:: python

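   from prosemble.models import DKHeskesSOM
   from prosemble.datasets import load_iris_jax

   dataset = load_iris_jax()
   X = dataset.input_data

   model = DKHeskesSOM(
       grid_height=5,
       grid_width=5,
       kernel_sigma=1.0,      # fixed kernel bandwidth (not learned)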
       sigma_init=2.0,
       sigma_final=0.5,
       max_iter=100,
       random_seed=42,
   )
   model.fit(X)

   bmu_coords = model.bmu_map(X)
   print(f"Energy: {model.loss_:.4f}")
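
The Heskes criterion itself can be sketched independently of the library. The
NumPy example below assumes a Gaussian neighborhood :math:`h(k, c)` over grid
coordinates; the function names and the neighborhood choice are assumptions
for illustration.

.. code-block:: python

   import numpy as np

   grid_height, grid_width = 2, 3
   # Grid coordinates of each unit, shape (n_units, 2).
   coords = np.array(
       [(i, j) for i in range(grid_height) for j in range(grid_width)],
       dtype=float,
   )


   def grid_neighborhood(coords, width):
       """Assumed Gaussian neighborhood h(k, c) over grid coordinates."""
       sq = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)
       return np.exp(-sq / (2.0 * width**2))


   def heskes_bmu(x, prototypes, h, kernel_sigma):
       """argmin_c sum_k h(k, c) * d_kappa^2(x, w_k)."""
       sq_euclidean = np.sum((prototypes - x) ** 2, axis=1)
       d_kernel = 2.0 * (1.0 - np.exp(-sq_euclidean / (2.0 * kernel_sigma**2)))
       cost = h.T @ d_kernel  # cost[c] = sum_k h[k, c] * d_kernel[k]
       return int(np.argmin(cost)), cost


   rng = np.random.default_rng(0)
   prototypes = rng.standard_normal((grid_height * grid_width, 4))
   x = prototypes[4]  # a sample that coincides with unit 4

   bmu, cost = heskes_bmu(x, prototypes, grid_neighborhood(coords, 1.0), 1.0)
   # A very narrow neighborhood makes h the identity matrix.
   bmu_narrow, _ = heskes_bmu(x, prototypes, grid_neighborhood(coords, 1e-3), 1.0)
   print(bmu, bmu_narrow, coords[bmu_narrow])

With a very narrow neighborhood the criterion reduces to ordinary
nearest-prototype selection, so ``bmu_narrow`` is unit 4 here. With a wider
neighborhood the winner may be a different unit whose surrounding prototypes
fit the sample better on average.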


Choosing a Model
-----------------

.. list-table::
   :header-rows: 1
   :widths: 20 20 20 40

   * - Model
     - Kernel
     - Learned Params
     - Best For
   * - DKGLVQ
     - Gaussian
     - :math:`w_k, \sigma_k`
     - Per-prototype bandwidth adaptation
   * - DKGRLVQ
     - Gaussian (weighted)
     - :math:`w_k, \sigma_k, \lambda`
     - Feature selection + kernel adaptation
   * - DKGMLVQ
     - Exponential
     - :math:`w_k, \hat\Omega`
     - Full metric adaptation in kernel space
   * - DKNeuralGas
     - Gaussian (fixed :math:`\sigma`)
     - :math:`w_k`
     - Unsupervised clustering with kernel distance
   * - DKKohonenSOM
     - Gaussian (fixed :math:`\sigma`)
     - :math:`w_k`
     - SOM visualization with kernel distance
   * - DKHeskesSOM
     - Gaussian (fixed :math:`\sigma`)
     - :math:`w_k`
     - Principled SOM with kernel distance

References
----------

.. [1] Villmann, T., Haase, S., & Kaden, M. (2015). Kernelized vector
       quantization in gradient-descent learning. *Neurocomputing*, 147,
       83--95.