Source: src/olm/nn/embeddings/positional/__init__.py:1
Classes
ALiBiPositionalBias(num_heads: int, max_seq_len: int = 2048)
Bases: olm.nn.embeddings.positional.base.PositionalEmbeddingBase
Source: src/olm/nn/embeddings/positional/alibi.py:9
Attention with Linear Biases (ALiBi) as described in "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (arXiv 2108.12409).
Instead of adding positional information to embeddings, ALiBi adds a bias to attention scores that is proportional to the distance between query and key positions. This allows the model to extrapolate to longer sequences than seen during training.
The bias is computed as: bias[i,j] = -m * |i - j|
where m is a head-specific slope.
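The formula above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names `alibi_slopes` and `alibi_bias` are hypothetical, nested lists stand in for torch tensors, and the slope schedule (a geometric sequence starting at 2^(-8/num_heads)) is the one from the ALiBi paper for a power-of-two head count, which this module is assumed, but not confirmed, to follow.

```python
def alibi_slopes(num_heads: int) -> list:
    """Slopes from the ALiBi paper for a power-of-two number of heads (assumed schedule)."""
    # Geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads: int, seq_len_q: int, seq_len_k: int) -> list:
    """Return nested lists with bias[h][i][j] = -m_h * |i - j|."""
    return [
        [[-m * abs(i - j) for j in range(seq_len_k)] for i in range(seq_len_q)]
        for m in alibi_slopes(num_heads)
    ]


slopes = alibi_slopes(8)          # first slope is 0.5, last is 2**-8
bias = alibi_bias(8, seq_len_q=4, seq_len_k=4)
```

Each head gets a different slope, so heads with large slopes focus on nearby tokens while heads with small slopes can still attend far back.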
Methods
forward(self, seq_len_q: int, seq_len_k: int, device: torch.device | None = None) -> torch.Tensor
Source: src/olm/nn/embeddings/positional/alibi.py:85
Get ALiBi bias for the given query and key sequence lengths.
Parameters
- seq_len_q: length of query sequence
- seq_len_k: length of key sequence (usually same as seq_len_q)
- device: device to place the bias tensor on
Returns
Bias tensor of shape (1, num_heads, seq_len_q, seq_len_k). This should be added to attention scores before softmax.
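To illustrate what "added to attention scores before softmax" does, here is a small dependency-free sketch for a single head and a single query row. The slope value and the uniform raw scores are made up for illustration; in a real model the scores come from the scaled query-key product and the bias comes from this module.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


slope = 0.5                      # hypothetical slope for one head
query_pos = 3                    # the query sits at position 3
scores = [0.0, 0.0, 0.0, 0.0]    # raw scores against keys 0..3 (all equal here)

# Add the ALiBi bias -slope * |i - j| to each score, then normalize.
biased = [s - slope * abs(query_pos - j) for j, s in enumerate(scores)]
weights = softmax(biased)
```

With equal raw scores, the biased weights increase toward the query position, which shows the recency preference the linear bias introduces.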
AbsolutePositionalEmbedding(max_seq_len: int, embed_dim: int, dropout: float = 0.0)
Bases: olm.nn.embeddings.positional.base.PositionalEmbeddingBase
Source: src/olm/nn/embeddings/positional/absolute.py:8
Absolute (Learned) Positional Embedding.
This is the learned positional embedding used in models like GPT-2 and BERT (the original Transformer paper used fixed sinusoidal encodings by default). It learns a separate embedding vector for each position in the sequence, up to max_seq_len.
These embeddings are typically added to token embeddings before passing through the transformer blocks.
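The lookup-and-add behavior described above might look like the following sketch. `LearnedPositionTable` is a hypothetical stand-in rather than the library class: it handles one sequence without a batch dimension, uses lists instead of tensors, omits dropout, and uses random values in place of trained weights.

```python
import random


class LearnedPositionTable:
    """Hypothetical stand-in for a learned position-embedding table."""

    def __init__(self, max_seq_len: int, embed_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.max_seq_len = max_seq_len
        # One vector per position; random init stands in for learned weights.
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(embed_dim)]
            for _ in range(max_seq_len)
        ]

    def add_to(self, token_embeddings):
        """Add the position vector for index p to the token embedding at index p."""
        if len(token_embeddings) > self.max_seq_len:
            raise ValueError("sequence is longer than max_seq_len")
        return [
            [t + p for t, p in zip(tok, self.table[pos])]
            for pos, tok in enumerate(token_embeddings)
        ]


table = LearnedPositionTable(max_seq_len=4, embed_dim=3)
out = table.add_to([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```

The length check makes the main limitation visible: a learned table has no vector for positions at or beyond max_seq_len.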
Methods
forward(self, x: torch.Tensor, seq_positions: torch.LongTensor | None = None) -> torch.Tensor
Source: src/olm/nn/embeddings/positional/absolute.py:34
Apply absolute positional embedding to input tensor x.
Parameters
- x: shape (batch_size, seq_len, embed_dim) - token embeddings
- seq_positions: optional tensor of shape (batch_size, seq_len) with position indices. If None, assumes positions are 0..seq_len-1 for each batch.
Returns
Tensor of same shape as x, with positional embeddings added.
PartialRotaryPositionalEmbedding(head_dim: int, rotary_percentage: float = 0.5, base: int = 10000, max_seq_len: int = 2048)
Bases: olm.nn.embeddings.positional.base.PositionalEmbeddingBase
Source: src/olm/nn/embeddings/positional/rope.py:112
Partial Rotary Positional Embedding (GPT-NeoX-style partial RoPE).
Only applies rotary embeddings to a fraction of the head dimensions, leaving the remaining dimensions unchanged. This is the approach used in models like GPT-NeoX and GPT-J, where typically 25-50% of dimensions are rotated. LLaMA, by contrast, rotates the full head dimension.
For example, with head_dim=128 and rotary_percentage=0.5, only the first 64 dimensions are rotated, while the last 64 dimensions pass through unchanged.
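The split described above can be sketched for a single head vector as follows. The function names are hypothetical, and the pairing convention (adjacent dimensions rotated together) is an assumption; the library may pair dimensions differently.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Rotate adjacent pairs (vec[2i], vec[2i+1]) by angle position * base^(-2i/d)."""
    d = len(vec)
    out = list(vec)
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out


def partial_rope(vec, position, rotary_percentage=0.5, base=10000.0):
    """Rotate only the leading fraction of dimensions; pass the rest through."""
    rot_dim = int(len(vec) * rotary_percentage)
    rot_dim -= rot_dim % 2  # rotation works on pairs, so keep rot_dim even
    return rotate_pairs(vec[:rot_dim], position, base) + list(vec[rot_dim:])


x = [1.0] * 8                      # head_dim=8, so 4 dims are rotated
y = partial_rope(x, position=5)
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and the unrotated tail carries position-independent content.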
Methods
forward(self, x: torch.Tensor, seq_positions: torch.LongTensor | None = None) -> torch.Tensor
Source: src/olm/nn/embeddings/positional/rope.py:183
Apply partial rotary positional embedding to input tensor x.
Parameters
- x: shape (batch_size, seq_len, num_heads, head_dim)
- seq_positions: optional tensor of shape (batch_size, seq_len) with position indices. If None, assumes positions are 0..seq_len-1 for each batch.
Returns
Tensor of same shape as x, with partial RoPE applied.
PartialScaledRotaryPositionalEmbedding(head_dim: int, rotary_percentage: float = 0.5, max_seq_len: int = 2048, base: int = 10000, scaling_type: Literal['linear', 'ntk', 'dynamic_ntk', 'yarn', 'xpos'] = 'linear', scaling_factor: float = 1.0, original_max_seq_len: int | None = None, yarn_alpha: float = 1.0, yarn_beta: float = 32.0, xpos_scale_base: int | None = None)
Bases: olm.nn.embeddings.positional.base.PositionalEmbeddingBase
Source: src/olm/nn/embeddings/positional/rope.py:478
Partial Rotary Positional Embedding with scaling support.
Combines partial RoPE (only rotating a fraction of dimensions) with various scaling strategies for extended context lengths.
Methods
forward(self, x: torch.Tensor, seq_positions: torch.LongTensor | None = None) -> torch.Tensor
Source: src/olm/nn/embeddings/positional/rope.py:624
Apply partial scaled rotary positional embedding to input tensor x.
Parameters
- x: shape (batch_size, seq_len, num_heads, head_dim)
- seq_positions: optional tensor of shape (batch_size, seq_len) with position indices.
Returns
Tensor of same shape as x, with partial scaled RoPE applied.
PositionalEmbeddingBase(*args: Any, **kwargs: Any) -> None
Bases: Module, ABC
Source: src/olm/nn/embeddings/positional/base.py:8
Abstract base class for all positional embedding implementations.
Positional embeddings add information about token positions in a sequence to help the model understand order and relative positions. Different positional embedding strategies have different properties:
- Learned (Absolute): Simple, effective, but limited to max_seq_len
- Sinusoidal: Deterministic, can extrapolate to longer sequences
- RoPE: Applied to Q/K directly, enables relative position modeling
- ALiBi: Adds bias to attention scores, excellent extrapolation
All positional embedding implementations should inherit from this base class and implement the forward method.
Methods
extra_repr(self) -> str
Source: src/olm/nn/embeddings/positional/base.py:40
String representation of the module for debugging.
Override this in subclasses to provide useful information.
forward(self, *args, **kwargs) -> torch.Tensor
Source: src/olm/nn/embeddings/positional/base.py:25
Apply positional information to input tensor(s).
The signature and behavior of this method varies by implementation:
- Some add to embeddings (Absolute, Sinusoidal)
- Some rotate representations (RoPE)
- Some return bias to add to attention scores (ALiBi)
Returns
Transformed tensor(s) with positional information applied
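The subclassing contract can be illustrated with a dependency-free sketch. `PositionalBase` and `ConstantOffset` are hypothetical names, the real base class also inherits from torch's Module (omitted here), and marking forward with `@abstractmethod` is an assumption about how the contract is enforced.

```python
from abc import ABC, abstractmethod


class PositionalBase(ABC):
    """Hypothetical stand-in for the abstract base (torch Module omitted)."""

    @abstractmethod
    def forward(self, *args, **kwargs):
        """Subclasses define how positional information is applied."""

    def extra_repr(self) -> str:
        return ""


class ConstantOffset(PositionalBase):
    """Toy subclass: adds the same offset to every element."""

    def __init__(self, offset: float):
        self.offset = offset

    def forward(self, x):
        return [v + self.offset for v in x]

    def extra_repr(self) -> str:
        return f"offset={self.offset}"


layer = ConstantOffset(1.0)
result = layer.forward([0.0, 2.0])
```

Under this assumption, instantiating the base class directly raises a TypeError, so a subclass that forgets to implement forward fails early.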
RotaryPositionalEmbedding(head_dim: int, max_seq_len: int, base: int = 10000)
Bases: olm.nn.embeddings.positional.base.PositionalEmbeddingBase
Source: src/olm/nn/embeddings/positional/rope.py:8
Rotary Positional Embedding (RoPE) as described in “RoFormer: Enhanced Transformer with Rotary Position Embedding” (arXiv 2104.09864).
This module precomputes the sin/cos rotation tables for a given head_dim and applies them to query/key representations by rotating interleaved pairs of dimensions (each pair acts as the real and imaginary parts of a complex number).
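A short sketch of why rotating query and key vectors encodes relative position: after rotation, the dot product between a query at position m and a key at position n depends only on the offset between them. The `rope` helper below is hypothetical, works on one vector at a time, and recomputes angles on the fly instead of using precomputed tables.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate interleaved pairs (vec[2i], vec[2i+1]) by position * base^(-2i/d)."""
    d = len(vec)
    out = list(vec)
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at two different absolute positions.
score_near = dot(rope(q, 10), rope(k, 7))
score_far = dot(rope(q, 103), rope(k, 100))
```

The two scores agree up to floating-point error, which is the relative-position property that makes RoPE suitable for applying directly to Q and K.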
Methods
forward(self, x: torch.Tensor, seq_positions: torch.LongTensor | None = None) -> torch.Tensor
Source: src/olm/nn/embeddings/positional/rope.py:54
Apply rotary positional embedding to input tensor x.
Parameters
- x: shape (batch_size, seq_len, num_heads, head_dim)
- seq_positions: optional tensor of shape (batch_size, seq_len) with position indices. If None, assumes positions are 0..seq_len-1 for each batch.
Returns
Tensor of same shape as x, with RoPE applied.
ScaledRotaryPositionalEmbedding(head_dim: int, max_seq_len: int = 2048, base: int = 10000, scaling_type: Literal['linear', 'ntk', 'dynamic_ntk', 'yarn', 'xpos'] = 'linear', scaling_factor: float = 1.0, original_max_seq_len: int | None = None, yarn_alpha: float = 1.0, yarn_beta: float = 32.0, xpos_scale_base: int | None = None)
Bases: olm.nn.embeddings.positional.base.PositionalEmbeddingBase
Source: src/olm/nn/embeddings/positional/rope.py:250
Scaled Rotary Positional Embedding with multiple scaling strategies.
Supports the following scaling methods for extending context length:
- 'linear': Linear position interpolation (Position Interpolation, arXiv:2306.15595)
- 'ntk': NTK-aware scaling (adjusts the base frequency by a fixed factor rather than interpolating positions)
- 'dynamic_ntk': Dynamic NTK (adjusts base based on current sequence length)
- 'yarn': YaRN (Yet another RoPE extensioN method, arXiv:2309.00071)
- 'xpos': XPos (exponential decay for better extrapolation, arXiv:2212.10554)
Methods
forward(self, x: torch.Tensor, seq_positions: torch.LongTensor | None = None) -> torch.Tensor
Source: src/olm/nn/embeddings/positional/rope.py:398
Apply scaled rotary positional embedding to input tensor x.
Parameters
- x: shape (batch_size, seq_len, num_heads, head_dim)
- seq_positions: optional tensor of shape (batch_size, seq_len) with position indices. If None, assumes positions are 0..seq_len-1 for each batch.
Returns
Tensor of same shape as x, with scaled RoPE applied.
SinusoidalPositionalEmbedding(embed_dim: int, max_seq_len: int = 5000, base: int = 10000, dropout: float = 0.0)
Bases: olm.nn.embeddings.positional.base.PositionalEmbeddingBase
Source: src/olm/nn/embeddings/positional/sinusoidal.py:9
Sinusoidal Positional Embedding as described in "Attention Is All You Need" (Vaswani et al., 2017).
Uses fixed sine and cosine functions of different frequencies to encode positions. Unlike learned embeddings, these are deterministic and can extrapolate to longer sequences than seen during training.
PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
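The two formulas translate directly into code. The sketch below computes the encoding for one position with a hypothetical `sinusoidal_embedding` helper; the library is assumed to precompute a table for all positions up to max_seq_len and add it to the token embeddings.

```python
import math


def sinusoidal_embedding(position: int, embed_dim: int, base: float = 10000.0) -> list:
    """Even indices hold sin, odd indices hold cos, sharing one frequency per pair."""
    pe = [0.0] * embed_dim
    for even in range(0, embed_dim, 2):   # `even` plays the role of 2i
        angle = position / base ** (even / embed_dim)
        pe[even] = math.sin(angle)
        if even + 1 < embed_dim:
            pe[even + 1] = math.cos(angle)
    return pe


first = sinusoidal_embedding(0, 4)
second = sinusoidal_embedding(1, 4)
```

Because the values are a fixed function of the position, the encoding can be evaluated for any index, including ones beyond the precomputed table.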
Methods
forward(self, x: torch.Tensor, seq_positions: torch.LongTensor | None = None) -> torch.Tensor
Source: src/olm/nn/embeddings/positional/sinusoidal.py:80
Apply sinusoidal positional embedding to input tensor x.
Parameters
- x: shape (batch_size, seq_len, embed_dim) - token embeddings
- seq_positions: optional tensor of shape (batch_size, seq_len) with position indices. If None, assumes positions are 0..seq_len-1 for each batch.
Returns
Tensor of same shape as x, with positional embeddings added.