olm.nn.embeddings.positional¶
class olm.nn.embeddings.positional.ALiBiPositionalBias(*args: Any, **kwargs: Any)¶
Bases: PositionalEmbeddingBase
Attention with Linear Biases (ALiBi) as described in “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation” (arXiv 2108.12409).
Instead of adding positional information to embeddings, ALiBi adds a bias to attention scores that is proportional to the distance between query and key positions. This allows the model to extrapolate to longer sequences than seen during training.
The bias is computed as: bias[i,j] = -m * |i - j|
where m is a head-specific slope.
forward(seq_len_q: int, seq_len_k: int, device: torch.device | None = None) → torch.Tensor¶
Get ALiBi bias for the given query and key sequence lengths.
- Parameters:
- seq_len_q – length of query sequence
- seq_len_k – length of key sequence (usually the same as seq_len_q)
- device – device to place the bias tensor on
- Returns: Bias tensor of shape (1, num_heads, seq_len_q, seq_len_k). Add this to attention scores before softmax.
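The bias computation above can be sketched as a standalone function. This is an illustrative sketch, not the olm implementation; the function name and the slope schedule 2^(-8h/num_heads) (the standard choice from the ALiBi paper, exact for power-of-two head counts) are assumptions.

```python
import torch

def alibi_bias(seq_len_q: int, seq_len_k: int, num_heads: int) -> torch.Tensor:
    """Sketch of ALiBi: bias[i, j] = -m_h * |i - j| per head h (illustrative, not olm internals)."""
    # Head-specific slopes: 2^(-8/n), 2^(-16/n), ... (exact for power-of-two head counts)
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # |i - j| distance matrix between query and key positions
    q_pos = torch.arange(seq_len_q).unsqueeze(1)   # (seq_len_q, 1)
    k_pos = torch.arange(seq_len_k).unsqueeze(0)   # (1, seq_len_k)
    dist = (q_pos - k_pos).abs()                   # (seq_len_q, seq_len_k)
    # Broadcast the per-head slope over the distance matrix
    bias = -slopes.view(num_heads, 1, 1) * dist    # (num_heads, seq_len_q, seq_len_k)
    return bias.unsqueeze(0)                       # (1, num_heads, seq_len_q, seq_len_k)
```

Because the bias is a fixed function of distance, it never needs retraining for longer sequences; it is simply recomputed at the new lengths.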
class olm.nn.embeddings.positional.AbsolutePositionalEmbedding(*args: Any, **kwargs: Any)¶
Bases: PositionalEmbeddingBase
Absolute (Learned) Positional Embedding.
This is the standard positional embedding used in the original Transformer paper and models like GPT-2. It learns a separate embedding vector for each position in the sequence, up to a maximum sequence length.
These embeddings are typically added to token embeddings before passing through the transformer blocks.
forward(x: torch.Tensor, seq_positions: torch.LongTensor | None = None) → torch.Tensor¶
Apply absolute positional embedding to input tensor x.
- Parameters:
- x – token embeddings of shape (batch_size, seq_len, embed_dim)
- seq_positions – optional tensor of shape (batch_size, seq_len) with position indices. If None, assumes positions are 0..seq_len-1 for each batch.
- Returns: Tensor of same shape as x, with positional embeddings added.
class olm.nn.embeddings.positional.PartialRotaryPositionalEmbedding(*args: Any, **kwargs: Any)¶
Bases: PositionalEmbeddingBase
Partial Rotary Positional Embedding (LLaMA-style RoPE).
Only applies rotary embeddings to a fraction of the head dimensions, leaving the remaining dimensions unchanged. This is the approach used in models like LLaMA, where typically 25-50% of dimensions are rotated.
For example, with head_dim=128 and rotary_percentage=0.5, only the first 64 dimensions are rotated, while the last 64 dimensions pass through unchanged.
forward(x: torch.Tensor, seq_positions: torch.LongTensor | None = None) → torch.Tensor¶
Apply partial rotary positional embedding to input tensor x.
- Parameters:
- x – shape (batch_size, seq_len, num_heads, head_dim)
- seq_positions – optional tensor of shape (batch_size, seq_len) with position indices. If None, assumes positions are 0..seq_len-1 for each batch.
- Returns: Tensor of same shape as x, with partial RoPE applied.
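The split-rotate-concatenate scheme can be sketched as below. This is an assumed functional form (names and the base frequency 10000 are illustrative), not olm's implementation; it rotates the first rot_dim dimensions as even/odd pairs and passes the rest through.

```python
import torch

def partial_rope(x: torch.Tensor, rotary_percentage: float = 0.5,
                 base: float = 10000.0) -> torch.Tensor:
    """Sketch of partial RoPE; x: (batch, seq_len, num_heads, head_dim). Illustrative, not olm internals."""
    head_dim = x.size(-1)
    rot_dim = int(head_dim * rotary_percentage)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    # Per-pair frequencies over the rotated sub-dimensions only
    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2).float() / rot_dim))
    pos = torch.arange(x.size(1)).float()
    angles = torch.outer(pos, inv_freq)            # (seq_len, rot_dim // 2)
    cos = angles.cos()[None, :, None, :]           # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    # Treat even/odd dims as 2-D coordinates and rotate each pair
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1).flatten(-2)
    return torch.cat((rotated, x_pass), dim=-1)
```

With head_dim=8 and rotary_percentage=0.5, the last 4 dimensions come out byte-identical to the input, matching the pass-through behavior described above.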
class olm.nn.embeddings.positional.PositionalEmbeddingBase(*args: Any, **kwargs: Any)¶
Bases: Module, ABC
Abstract base class for all positional embedding implementations.
Positional embeddings add information about token positions in a sequence to help the model understand order and relative positions. Different positional embedding strategies have different properties:
- Learned (Absolute): Simple, effective, but limited to max_seq_len
- Sinusoidal: Deterministic, can extrapolate to longer sequences
- RoPE: Applied to Q/K directly, enables relative position modeling
- ALiBi: Adds bias to attention scores, excellent extrapolation
All positional embedding implementations should inherit from this base class and implement the forward method.
extra_repr() → str¶
String representation of the module for debugging.
Override this in subclasses to provide useful information.
abstractmethod forward(*args, **kwargs) → torch.Tensor¶
Apply positional information to input tensor(s).
The signature and behavior of this method vary by implementation:
- Some add to embeddings (Absolute, Sinusoidal)
- Some rotate representations (RoPE)
- Some return a bias to add to attention scores (ALiBi)
- Returns: Transformed tensor(s) with positional information applied
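A trivial subclass illustrates the inheritance pattern. The base class below is a stand-in with the same shape as the documented one; the subclass name is illustrative.

```python
import torch
import torch.nn as nn
from abc import ABC, abstractmethod

class PositionalEmbeddingBase(nn.Module, ABC):
    """Stand-in mirroring the documented base class (illustrative, not olm's code)."""

    @abstractmethod
    def forward(self, *args, **kwargs) -> torch.Tensor:
        """Apply positional information to input tensor(s)."""

class NoOpPositionalEmbedding(PositionalEmbeddingBase):
    """Degenerate subclass: applies no positional information."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x

    def extra_repr(self) -> str:
        # Surfaces in repr() for debugging, as the base class docs suggest
        return "no positional information applied"
```

Subclassing both nn.Module and ABC is the standard way to get abstract-method enforcement while remaining a regular PyTorch module.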
class olm.nn.embeddings.positional.RotaryPositionalEmbedding(*args: Any, **kwargs: Any)¶
Bases: PositionalEmbeddingBase
Rotary Positional Embedding (RoPE) as described in “RoFormer: Enhanced Transformer with Rotary Position Embedding” (arXiv 2104.09864).
This module precomputes sin/cos rotation frequencies for a given head dimension, then applies them to query/key representations by interleaving real/imaginary parts (or, equivalently, pairs of dimensions).
forward(x: torch.Tensor, seq_positions: torch.LongTensor | None = None) → torch.Tensor¶
Apply rotary positional embedding to input tensor x.
- Parameters:
- x – shape (batch_size, seq_len, num_heads, head_dim)
- seq_positions – optional tensor of shape (batch_size, seq_len) with position indices. If None, assumes positions are 0..seq_len-1 for each batch.
- Returns: Tensor of same shape as x, with RoPE applied.
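The real/imaginary-part view mentioned above can be sketched compactly with complex arithmetic: adjacent dimension pairs become complex numbers and each position multiplies by a unit phase. Function name and the base frequency 10000 are illustrative assumptions, not olm internals.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Sketch of RoPE via the complex view; x: (batch, seq, heads, head_dim). Illustrative only."""
    b, s, h, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    pos = torch.arange(s).float()
    freqs = torch.outer(pos, inv_freq)                    # (s, d // 2) rotation angles
    rot = torch.polar(torch.ones_like(freqs), freqs)      # unit complex e^{i * angle}
    # Pair adjacent dims as (real, imag), rotate, then unpack
    x_c = torch.view_as_complex(x.float().reshape(b, s, h, d // 2, 2))
    out = torch.view_as_real(x_c * rot[None, :, None, :]).reshape(b, s, h, d)
    return out.type_as(x)
```

Because each pair is multiplied by a unit complex number, per-head norms are preserved, and the dot product between a rotated query and key depends only on the relative position difference.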
class olm.nn.embeddings.positional.SinusoidalPositionalEmbedding(*args: Any, **kwargs: Any)¶
Bases: PositionalEmbeddingBase
Sinusoidal Positional Embedding as described in “Attention Is All You Need” (Vaswani et al., 2017).
Uses fixed sine and cosine functions of different frequencies to encode positions. Unlike learned embeddings, these are deterministic and can extrapolate to longer sequences than seen during training.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
forward(x: torch.Tensor, seq_positions: torch.LongTensor | None = None) → torch.Tensor¶
Apply sinusoidal positional embedding to input tensor x.
- Parameters:
- x – token embeddings of shape (batch_size, seq_len, embed_dim)
- seq_positions – optional tensor of shape (batch_size, seq_len) with position indices. If None, assumes positions are 0..seq_len-1 for each batch.
- Returns: Tensor of same shape as x, with positional embeddings added.
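The two formulas above build a fixed encoding table that is simply added to the token embeddings. A sketch (function name is an illustrative assumption, not olm's API):

```python
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    """Sketch of the fixed sinusoidal table of shape (seq_len, d_model). Illustrative only."""
    pos = torch.arange(seq_len).float().unsqueeze(1)        # (seq_len, 1)
    two_i = torch.arange(0, d_model, 2).float()             # the 2i values: 0, 2, 4, ...
    div = torch.pow(torch.tensor(10000.0), two_i / d_model)  # 10000^(2i/d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)   # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(pos / div)   # PE(pos, 2i+1)
    return pe
```

Since the table is a pure function of position, it can be recomputed for any seq_len, which is what gives this scheme its extrapolation ability relative to learned embeddings.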
Modules¶
- absolute
- alibi
- base
- rope
- sinusoidal