Supervised Models
=================

All supervised models in prosemble follow the same ``fit`` / ``predict`` API.
They differ in the distance metric, loss function, and representation.

GLVQ — Baseline
----------------

Generalized Learning Vector Quantization is the foundation model. It
learns prototype positions to minimize the relative distance cost
:math:`\mu = (d^+ - d^-) / (d^+ + d^-)`.

.. code-block:: python

   from prosemble.models import GLVQ
   from prosemble.datasets import load_iris_jax
   from prosemble.core.utils import train_test_split_jax

   dataset = load_iris_jax()
   X, y = dataset.input_data, dataset.labels
   X_train, X_test, y_train, y_test = train_test_split_jax(X, y)

   model = GLVQ(
       n_prototypes_per_class=2,
       max_iter=100,
       lr=0.01,
       random_seed=42,
   )
   model.fit(X_train, y_train)

   predictions = model.predict(X_test)
   probabilities = model.predict_proba(X_test)

**Unequal prototypes per class:**

.. code-block:: python

   model = GLVQ(
       n_prototypes_per_class={0: 3, 1: 2, 2: 1},
       prototypes_initializer='class_conditional_mean',
   )

**Prototype initializers** (pass as string):

- ``'stratified_random'`` — random samples per class (default)
- ``'class_mean'`` — class centroids
- ``'class_conditional_mean'`` — class centroids replicated
- ``'stratified_noise'`` — random samples + Gaussian noise
- ``'random_normal'`` — random normal initialization
- ``'uniform'`` — random uniform
- ``'zeros'`` — zero initialization
- ``'ones'`` — ones initialization
- ``'fill_value'`` — constant fill value (pass ``value=`` kwarg)

For unsupervised models, ``selection_init`` and ``mean_init`` from
``prosemble.core.initializers`` provide classless alternatives.

You can also pass a callable for custom initialization (e.g., ``literal_init(values)``
from ``prosemble.core.initializers``).

GRLVQ — Feature Relevances
---------------------------

Adds per-feature relevance weights :math:`\lambda_j` that reveal which
features matter for classification.

.. code-block:: python

   from prosemble.models import GRLVQ

   model = GRLVQ(
       n_prototypes_per_class=2,
       max_iter=100,
       lr=0.01,
   )
   model.fit(X_train, y_train)

   # Inspect learned relevances
   print(model.relevances_)  # shape: (n_features,)

The relevance vector satisfies :math:`\lambda_j \geq 0` and
:math:`\sum_j \lambda_j = 1`. Features with large :math:`\lambda_j` are
important for classification.

GMLVQ — Matrix Metric Learning
-------------------------------

Learns a full linear transformation :math:`\Omega` such that the distance
is :math:`d(x, w) = \|\Omega(x - w)\|^2`. This captures feature
correlations.

.. code-block:: python

   from prosemble.models import GMLVQ

   model = GMLVQ(
       n_prototypes_per_class=1,
       latent_dim=2,        # project to 2D
       max_iter=100,
       lr=0.001,
   )
   model.fit(X_train, y_train)

   # Learned matrices
   print(model.omega_matrix.shape)   # (latent_dim, n_features)
   print(model.lambda_matrix)        # Omega^T @ Omega — feature importance

   # Feature importance from diagonal
   import jax.numpy as jnp
   feature_importance = jnp.diag(model.lambda_matrix)

When ``latent_dim < n_features``, :math:`\Omega` also performs dimensionality
reduction optimized for classification.

LGMLVQ — Local Metrics
-----------------------

Each prototype gets its own :math:`\Omega_k` matrix, allowing different
regions of the feature space to use different metrics.

.. code-block:: python

   from prosemble.models import LGMLVQ

   model = LGMLVQ(
       n_prototypes_per_class=2,
       latent_dim=2,
       max_iter=100,
       lr=0.001,
   )
   model.fit(X_train, y_train)

GTLVQ — Tangent Distance
-------------------------

Learns per-prototype invariance subspaces. The tangent distance measures
distance only in directions orthogonal to the invariance subspace:
:math:`d(x, w_k) = \|(I - \Omega_k \Omega_k^T)(x - w_k)\|^2`.

.. code-block:: python

   from prosemble.models import GTLVQ

   model = GTLVQ(
       n_prototypes_per_class=2,
       tangent_dim=1,        # 1D invariance subspace
       max_iter=100,
       lr=0.001,
   )
   model.fit(X_train, y_train)

Wasserstein Prototype LVQ
--------------------------

The Wasserstein LVQ family uses Gaussian distributional prototypes instead
of point prototypes.  Each prototype :math:`k` is a diagonal Gaussian
:math:`\mathcal{N}(\mu_k, \text{diag}(\sigma_k^2))`.  The distance from
a data point :math:`x` (treated as a Dirac delta) to prototype :math:`k`
is the squared 2-Wasserstein distance:

.. math::

   W_2^2(\delta_x, \mathcal{N}(\mu_k, \text{diag}(\sigma_k^2)))
   = \|x - \mu_k\|^2 + \sum_j \sigma_{kj}^2

The variance term acts as a per-prototype learned "spread" — prototypes
with smaller variance are more specific and attract nearby points more
strongly.

**WGLVQ** — Wasserstein GLVQ (base variant):

.. code-block:: python

   from prosemble.models import WGLVQ

   model = WGLVQ(
       n_prototypes_per_class=2,
       max_iter=100,
       lr=0.01,
   )
   model.fit(X_train, y_train)

   # Learned prototype means and variances
   print(model.prototype_means.shape)      # (n_prototypes, n_features)
   print(model.prototype_variances.shape)   # (n_prototypes, n_features)

**WGMLVQ** — Adds a global :math:`\Omega` matrix projection:

.. math::

   W_2^2(x, k) = \|\Omega(x - \mu_k)\|^2 + \sum_j \sigma_{kj}^2

.. code-block:: python

   from prosemble.models import WGMLVQ

   model = WGMLVQ(
       n_prototypes_per_class=2,
       latent_dim=2,          # project to 2D
       max_iter=100,
       lr=0.001,
   )
   model.fit(X_train, y_train)

   print(model.omega_matrix.shape)   # (n_features, latent_dim)
   print(model.lambda_matrix)        # Omega^T @ Omega

**WGRLVQ** — Adds per-feature relevance weighting:

.. math::

   W_2^2(x, k) = \sum_j \lambda_j (x_j - \mu_{kj})^2 + \sum_j \sigma_{kj}^2

where :math:`\lambda_j \geq 0` and :math:`\sum_j \lambda_j = 1`.

.. code-block:: python

   from prosemble.models import WGRLVQ

   model = WGRLVQ(
       n_prototypes_per_class=2,
       max_iter=100,
       lr=0.01,
   )
   model.fit(X_train, y_train)

   # Learned relevance profile
   print(model.relevance_profile)   # shape: (n_features,), sums to 1

.. note::

   Variances are parameterized internally as :math:`\log(\sigma^2)` to ensure
   positivity during gradient optimization.  Access the actual variances via
   ``model.prototype_variances``.

CELVQ — Cross-Entropy
----------------------

Replaces the GLVQ :math:`\mu` cost with cross-entropy loss over class
probabilities derived from distances. Produces calibrated probabilities.

.. code-block:: python

   from prosemble.models import CELVQ

   model = CELVQ(
       n_prototypes_per_class=2,
       max_iter=100,
       lr=0.01,
   )
   model.fit(X_train, y_train)
   proba = model.predict_proba(X_test)  # calibrated class probabilities

LVQ1 and LVQ2.1 — Classic Non-Gradient
---------------------------------------

Simple competitive learning without explicit gradients. LVQ1 updates only
the nearest prototype; LVQ2.1 updates both nearest correct and incorrect.

.. code-block:: python

   from prosemble.models import LVQ1, LVQ21

   model = LVQ1(
       n_prototypes_per_class=2,
       max_iter=100,
       lr=0.01,
   )
   model.fit(X_train, y_train)

Probabilistic Models (SLVQ, RSLVQ)
-----------------------------------

Treat prototypes as Gaussian components. RSLVQ provides robust
log-likelihood training with confidence-based rejection.

.. code-block:: python

   from prosemble.models import RSLVQ

   model = RSLVQ(
       n_prototypes_per_class=2,
       sigma=1.0,
       max_iter=100,
       lr=0.01,
   )
   model.fit(X_train, y_train)

**MRSLVQ** — Matrix RSLVQ with global Omega metric learning:

.. code-block:: python

   from prosemble.models import MRSLVQ

   model = MRSLVQ(
       n_prototypes_per_class=1,
       latent_dim=2,
       sigma=1.0,
       max_iter=100,
       lr=0.001,
   )
   model.fit(X_train, y_train)
   print(model.omega_matrix.shape)   # (n_features, latent_dim)
   print(model.lambda_matrix)        # Omega^T @ Omega

**LMRSLVQ** — Local Matrix RSLVQ with per-prototype Omega matrices:

.. code-block:: python

   from prosemble.models import LMRSLVQ

   model = LMRSLVQ(
       n_prototypes_per_class=1,
       latent_dim=2,
       sigma=1.0,
       max_iter=100,
       lr=0.001,
   )
   model.fit(X_train, y_train)

Probabilistic LVQ with Neural Gas (RSLVQ-NG Family)
----------------------------------------------------

The RSLVQ-NG family combines RSLVQ's Gaussian mixture probabilistic
assignment with Neural Gas rank-based neighborhood cooperation. All
prototypes participate in the loss, weighted by both their Gaussian
probability and NG rank.

**RSLVQ_NG** — Euclidean distance (base variant):

.. code-block:: python

   from prosemble.models import RSLVQ_NG

   model = RSLVQ_NG(
       n_prototypes_per_class=3,
       sigma=1.0,
       gamma_init=5.0,
       gamma_final=0.01,
       max_iter=100,
       lr=0.01,
   )
   model.fit(X_train, y_train)

**MRSLVQ_NG** — Global Omega matrix metric learning:

.. code-block:: python

   from prosemble.models import MRSLVQ_NG

   model = MRSLVQ_NG(
       n_prototypes_per_class=2,
       latent_dim=2,
       sigma=1.0,
       gamma_init=5.0,
       gamma_final=0.01,
       max_iter=100,
       lr=0.001,
   )
   model.fit(X_train, y_train)

   # Learned metric matrices
   print(model.omega_matrix.shape)   # (n_features, latent_dim)
   print(model.lambda_matrix)        # Omega^T @ Omega

**LMRSLVQ_NG** — Per-prototype local Omega matrices:

.. code-block:: python

   from prosemble.models import LMRSLVQ_NG

   model = LMRSLVQ_NG(
       n_prototypes_per_class=2,
       latent_dim=2,
       sigma=1.0,
       gamma_init=5.0,
       gamma_final=0.01,
       max_iter=100,
       lr=0.001,
   )
   model.fit(X_train, y_train)

Median LVQ
----------

Constrains prototypes to be actual data points for maximum interpretability.

.. code-block:: python

   from prosemble.models import MedianLVQ

   model = MedianLVQ(
       n_prototypes_per_class=1,
       max_iter=50,
       random_seed=42,
   )
   model.fit(X_train, y_train)
   # model.prototypes_ are actual training examples

Deep Variants (LVQMLN, PLVQ)
-----------------------------

Add a trainable MLP backbone for nonlinear feature extraction.
Prototypes live in the latent space (not the input space).

.. code-block:: python

   from prosemble.models import LVQMLN

   model = LVQMLN(
       n_prototypes_per_class=2,
       hidden_dims=[64, 32],  # MLP architecture
       max_iter=100,
       lr=0.001,
   )
   model.fit(X_train, y_train)

Siamese Variants
----------------

Like LVQMLN, but prototypes are in the **input space** and pass through
the same backbone as input data. This makes prototypes interpretable.

.. code-block:: python

   from prosemble.models import SiameseGLVQ, SiameseGMLVQ, SiameseGTLVQ

   model = SiameseGLVQ(
       n_prototypes_per_class=2,
       hidden_dims=[64, 32],
       max_iter=100,
       lr=0.001,
   )
   model.fit(X_train, y_train)
   # model.prototypes_ is in the original input space

Image LVQ
---------

Siamese architecture with a CNN backbone for image classification.

.. code-block:: python

   from prosemble.models import ImageGLVQ, ImageGMLVQ, ImageGTLVQ

   model = ImageGLVQ(
       n_prototypes_per_class=2,
       max_iter=50,
       lr=0.001,
   )
   # X_train should be image data: (n_samples, height, width, channels)
   model.fit(X_train, y_train)

CBC — Classification-By-Components
-----------------------------------

Uses classless components with learned reasoning matrices for explainable
classification.

.. code-block:: python

   from prosemble.models import CBC

   model = CBC(
       n_components=6,
       num_classes=3,
       max_iter=100,
       lr=0.001,
   )
   model.fit(X_train, y_train)

Supervised Relevance Neural Gas (SRNG)
--------------------------------------

Combines GLVQ loss with Neural Gas neighborhood cooperation. All same-class
prototypes are updated per sample, weighted by rank. SRNG also learns
per-feature relevance weights (like GRLVQ).

.. code-block:: python

   from prosemble.models import SRNG

   model = SRNG(
       n_prototypes_per_class=3,
       lambda_init=5.0,       # initial neighborhood range
       lambda_final=0.01,     # final (narrower)
       max_iter=100,
       lr=0.01,
   )
   model.fit(X_train, y_train)

   # Inspect relevances (SRNG learns feature weights like GRLVQ)
   print(model.relevances_)

Supervised Matrix Neural Gas (SMNG, SLNG, STNG)
------------------------------------------------

The SMNG family extends SRNG with matrix metric adaptation while keeping
Neural Gas neighborhood cooperation. All same-class prototypes participate
in the loss weighted by rank.

**SMNG** — Global Omega matrix (like GMLVQ + Neural Gas):

.. code-block:: python

   from prosemble.models import SMNG

   model = SMNG(
       n_prototypes_per_class=3,
       gamma_init=5.0,
       gamma_final=0.01,
       max_iter=200,
       lr=0.01,
   )
   model.fit(X_train, y_train)
   print(model.omega_matrix.shape)   # (n_features, latent_dim)
   print(model.lambda_matrix.shape)  # (n_features, n_features)

**SLNG** — Per-prototype Omega matrices (like LGMLVQ + Neural Gas):

.. code-block:: python

   from prosemble.models import SLNG

   model = SLNG(
       n_prototypes_per_class=3,
       gamma_init=5.0,
       gamma_final=0.01,
       max_iter=200,
       lr=0.01,
   )
   model.fit(X_train, y_train)
   print(model.omegas_.shape)  # (n_protos, n_features, latent_dim)

**STNG** — Per-prototype tangent subspaces (like GTLVQ + Neural Gas):

.. code-block:: python

   from prosemble.models import STNG

   model = STNG(
       n_prototypes_per_class=3,
       subspace_dim=2,
       gamma_init=5.0,
       gamma_final=0.01,
       max_iter=200,
       lr=0.01,
   )
   model.fit(X_train, y_train)
   print(model.omegas_.shape)  # (n_protos, n_features, subspace_dim)

Cross-Entropy Neural Gas (CELVQ-NG Family)
-------------------------------------------

The CELVQ-NG family combines cross-entropy loss over all-class softmax logits
with Neural Gas rank-based neighborhood cooperation. Unlike SRNG (which uses
pairwise GLVQ :math:`\mu` cost), CELVQ-NG considers all classes simultaneously
via softmax, providing better calibrated probabilities and gradient flow to
all prototypes.

Neural Gas cooperation replaces the hard per-class ``min`` pooling in CELVQ
with NG-weighted soft pooling: for each class, prototypes are ranked by
distance and weighted by :math:`h_k = \exp(-\text{rank} / \gamma)`. When
:math:`\gamma \to 0`, CELVQ-NG recovers standard CELVQ.

**CELVQ_NG** — Euclidean distance (base variant):

.. code-block:: python

   from prosemble.models import CELVQ_NG

   model = CELVQ_NG(
       n_prototypes_per_class=3,
       gamma_init=5.0,        # initial neighborhood range
       gamma_final=0.01,      # final (narrower)
       max_iter=100,
       lr=0.01,
   )
   model.fit(X_train, y_train)
   proba = model.predict_proba(X_test)  # calibrated probabilities

**MCELVQ_NG** — Global Omega matrix metric learning:

.. code-block:: python

   from prosemble.models import MCELVQ_NG

   model = MCELVQ_NG(
       n_prototypes_per_class=3,
       latent_dim=2,          # project to 2D
       gamma_init=5.0,
       gamma_final=0.01,
       max_iter=100,
       lr=0.01,
   )
   model.fit(X_train, y_train)

   # Learned metric matrices
   print(model.omega_matrix.shape)   # (n_features, latent_dim)
   print(model.lambda_matrix)        # Omega^T @ Omega — feature importance

**LCELVQ_NG** — Per-prototype local Omega matrices:

.. code-block:: python

   from prosemble.models import LCELVQ_NG

   model = LCELVQ_NG(
       n_prototypes_per_class=2,
       latent_dim=2,
       gamma_init=5.0,
       gamma_final=0.01,
       max_iter=100,
       lr=0.01,
   )
   model.fit(X_train, y_train)

   # Each prototype has its own Omega_k
   print(model.omegas_.shape)  # (n_prototypes, n_features, latent_dim)

**TCELVQ_NG** — Tangent subspace distance:

.. code-block:: python

   from prosemble.models import TCELVQ_NG

   model = TCELVQ_NG(
       n_prototypes_per_class=2,
       subspace_dim=1,        # 1D invariance subspace
       gamma_init=5.0,
       gamma_final=0.01,
       max_iter=100,
       lr=0.01,
   )
   model.fit(X_train, y_train)

   # Learned orthogonal tangent bases
   print(model.omegas_.shape)  # (n_prototypes, n_features, subspace_dim)

The tangent variant measures distance orthogonal to learned invariance
subspaces: :math:`d(x, w_k) = \|(I - \Omega_k \Omega_k^T)(x - w_k)\|^2`.
Best suited for high-dimensional data with invariance structure (images,
spectra, signals).

.. list-table:: CELVQ-NG Family Summary
   :header-rows: 1
   :widths: 20 30 25 25

   * - Model
     - Distance Metric
     - Learnable Parameters
     - Best For
   * - CELVQ_NG
     - Euclidean
     - Prototypes only
     - General-purpose, fast training
   * - MCELVQ_NG
     - :math:`\|\Omega(x-w)\|^2`
     - Global :math:`\Omega` matrix
     - Feature selection, dimensionality reduction
   * - LCELVQ_NG
     - :math:`\|\Omega_k(x-w_k)\|^2`
     - Per-prototype :math:`\Omega_k`
     - Heterogeneous feature spaces
   * - TCELVQ_NG
     - :math:`\|(I-\Omega_k\Omega_k^T)(x-w_k)\|^2`
     - Tangent bases :math:`\Omega_k`
     - High-dimensional data with invariances

Common Patterns
---------------

**Resume training:**

.. code-block:: python

   model.fit(X_train, y_train, max_iter=50)
   model.fit(X_train, y_train, resume=True, max_iter=50)  # continue from last state

**Fitted attributes** (available after ``fit``):

- ``model.prototypes_`` — prototype positions
- ``model.prototype_labels_`` — class labels per prototype
- ``model.n_iter_`` — number of iterations run
- ``model.loss_`` — final loss value
- ``model.loss_history_`` — loss per iteration

**All models** support:

- ``predict(X)`` — hard labels
- ``predict_proba(X)`` — soft class probabilities
- ``save(path)`` / ``Model.load(path)`` — persistence