`olm.train.optim.lion`

Source: src/olm/train/optim/lion.py:1

Classes

`Lion(params: Iterable, lr: float = 0.0001, betas: Tuple[float, float] = (0.9, 0.99), weight_decay: float = 0.0, use_triton: bool = False)`

Bases: olm.train.optim.base.OptimizerBase

Source: src/olm/train/optim/lion.py:7

Lion optimizer (EvoLved Sign Momentum).

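Implements the Lion algorithm from "Symbolic Discovery of Optimization Algorithms" (Chen et al., 2023). Lion updates each parameter using only the sign of an interpolation between the momentum and the current gradient, so every coordinate moves by the same magnitude. It keeps a single momentum buffer per parameter, which makes it more memory-efficient than Adam, and the paper reports that it often matches or outperforms AdamW.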

Key differences from Adam:

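Updates with the sign of an interpolation between momentum and gradient, so every coordinate moves by the same magnitude
Keeps a single momentum buffer instead of two (m and v in Adam), which is where the memory savings come from
Typically requires a smaller learning rate (1/3 to 1/10 of the AdamW value)
Typically uses a larger weight decay (3-10x the AdamW value)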
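The update described above can be sketched in a few lines of plain Python. This is a minimal illustration of the rule as published in the paper, operating on lists of floats; the names `lion_step` and `sign` are hypothetical, and the sketch is not the code in `lion.py`, whose internals (for example the Triton path) are not shown here.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
    """One Lion step on lists of floats (hypothetical helper, for illustration only)."""
    beta1, beta2 = betas
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Interpolate momentum and gradient with beta1, then keep only the sign.
        update = sign(beta1 * m + (1 - beta1) * g)
        # Decoupled weight decay is applied alongside the sign update.
        new_params.append(p - lr * (update + weight_decay * p))
        # The single state buffer is an exponential moving average with beta2.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


params, momentum = lion_step([1.0, -2.0], [0.5, -0.25], [0.0, 0.0], lr=0.1)
```

Because the step size per coordinate is always `lr` (before weight decay), the learning rate directly controls how far each parameter moves, which is one reason Lion is usually run with a smaller learning rate than AdamW.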

Parameters

params: iterable of parameters to optimize or dicts defining parameter groups
lr: learning rate (default: 1e-4, typically 3-10x smaller than AdamW)
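betas: (beta1, beta2) coefficients; beta1 interpolates between the momentum and the current gradient to form the update, beta2 is the decay rate of the momentum's running average (default: (0.9, 0.99))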
weight_decay: weight decay coefficient (default: 0.0)
use_triton: whether to use Triton kernel for faster computation (default: False)
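When porting an existing AdamW configuration, the rules of thumb for `lr` and `weight_decay` can be applied together. The helper below is a hypothetical sketch, not part of this module: `adamw_to_lion_hparams` and its `factor` argument are assumed names, and using one shared factor for both values is an assumption made here so that the product `lr * weight_decay` stays unchanged.

```python
def adamw_to_lion_hparams(adamw_lr, adamw_weight_decay, factor=3.0):
    """Derive starting Lion hyperparameters from AdamW ones (hypothetical helper).

    The learning rate is divided by `factor` and the weight decay is
    multiplied by it, following the 3-10x rule of thumb.
    """
    if not 3.0 <= factor <= 10.0:
        raise ValueError("factor is expected to lie in the 3-10 range")
    return {
        "lr": adamw_lr / factor,
        "weight_decay": adamw_weight_decay * factor,
    }


hparams = adamw_to_lion_hparams(adamw_lr=3e-4, adamw_weight_decay=0.01)
```

The resulting dictionary could then be passed as keyword arguments, for example `Lion(model.parameters(), **hparams)`. Treat the values as a starting point for tuning rather than a guaranteed equivalent.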

Example

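import torch
import torch.nn as nn
from olm.train.optim.lion import Lion

model = nn.Linear(10, 5)
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=0.1)
x = torch.randn(8, 10)  # dummy batch of 8 samples
optimizer.zero_grad()
loss = model(x).sum()
loss.backward()
optimizer.step()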

Methods

`zero_grad(self, set_to_none: bool = True)`

Source: src/olm/train/optim/lion.py:126

Resets the gradients of all optimized tensors, either to None (the default) or to zero.

Parameters

set_to_none: if True (the default), set the grads to None instead of zeroing them. This is more memory efficient and can slightly improve performance.
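The practical difference between the two modes is whether a gradient buffer still exists afterwards. The toy sketch below mimics that behaviour with plain Python objects; `ToyParam` and this standalone `zero_grad` function are hypothetical stand-ins for illustration and do not reproduce the method in `lion.py`.

```python
class ToyParam:
    """Hypothetical stand-in for a tensor that carries a .grad attribute."""

    def __init__(self, grad):
        self.grad = grad  # a list of floats, or None


def zero_grad(params, set_to_none=True):
    """Reset gradients either by dropping them or by zero-filling in place."""
    for p in params:
        if p.grad is None:
            continue  # nothing to reset
        if set_to_none:
            p.grad = None  # release the buffer entirely
        else:
            for i in range(len(p.grad)):
                p.grad[i] = 0.0  # keep the buffer, overwrite with zeros


a, b = ToyParam([1.0, 2.0]), ToyParam([3.0])
zero_grad([a], set_to_none=False)  # a.grad stays allocated, filled with zeros
zero_grad([b])                     # b.grad becomes None
```

Code that reads gradients after `zero_grad()` with the default setting should therefore be prepared for `None` rather than a zero-filled tensor.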