olm.nn.blocks¶
class olm.nn.blocks.LM(*args: Any, **kwargs: Any)¶
Bases: Block
A simple Language Model (LM) architecture.
This model consists of an embedding layer, a stack of Transformer blocks, and a final output projection to the vocabulary size. It is designed for causal language modeling (next-token prediction).
Structure: Input IDs -> Embedding -> [TransformerBlock] x N -> OutputHead -> Logits
- Parameters:
- vocab_size (int) – Size of the vocabulary.
- embed_dim (int) – Dimension of the embeddings and hidden states.
- num_heads (int) – Number of attention heads in Transformer blocks.
- num_layers (int) – Number of Transformer blocks.
- max_seq_len (int) – Maximum sequence length for the model.
- dropout (float , optional) – Dropout probability. Defaults to 0.0.
- causal (bool , optional) – Whether to use causal masking. Defaults to True.
- ff_multiplier (float , optional) – Multiplier for FFN hidden dimension. Defaults to 2.5.
layers¶
The sequence of layers in the model.
- Type: nn.ModuleList
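The documented structure can be sketched in plain PyTorch. This is a hypothetical stand-in, not the `olm` implementation: it uses `nn.TransformerEncoderLayer` in place of `olm`'s `TransformerBlock` (so no RoPE or SwiGLU), and all names here (`TinyLM`, `blocks`, `head`) are illustrative.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Sketch of: Input IDs -> Embedding -> [Block] x N -> OutputHead -> Logits."""

    def __init__(self, vocab_size, embed_dim, num_heads, num_layers,
                 max_seq_len, dropout=0.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Stand-in blocks: pre-norm encoder layers (olm's real block differs).
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(embed_dim, num_heads, dropout=dropout,
                                       batch_first=True, norm_first=True)
            for _ in range(num_layers)
        ])
        # OutputHead shape: LayerNorm -> Linear(vocab_size), no bias.
        self.head = nn.Sequential(nn.LayerNorm(embed_dim),
                                  nn.Linear(embed_dim, vocab_size, bias=False))
        self.max_seq_len = max_seq_len

    def forward(self, ids):
        seq_len = ids.size(1)
        # Causal mask: position i may not attend to positions > i.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        x = self.embed(ids)
        for blk in self.blocks:
            x = blk(x, src_mask=mask)
        return self.head(x)

model = TinyLM(vocab_size=100, embed_dim=32, num_heads=4,
               num_layers=2, max_seq_len=16)
logits = model(torch.randint(0, 100, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 100])
```

The output has one logit per vocabulary entry at every position, as expected for next-token prediction.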
class olm.nn.blocks.OutputHead(*args: Any, **kwargs: Any)¶
Bases: Block
Final output projection layer for the Language Model.
Consists of a LayerNorm followed by a Linear projection to the vocabulary size. Typical structure: LayerNorm -> Linear(vocab_size).
- Parameters:
- embed_dim (int) – The dimension of the embedding space.
- vocab_size (int) – The size of the vocabulary.
- bias (bool , optional) – Whether to include bias in the linear layer. Defaults to False.
layers¶
The normalization and linear layers.
- Type: nn.ModuleList
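A minimal sketch of the documented head, assuming the constructor maps directly onto the documented parameters (the class name `Head` below is illustrative):

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    """Sketch of OutputHead: LayerNorm -> Linear(vocab_size), bias off by default."""

    def __init__(self, embed_dim, vocab_size, bias=False):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, vocab_size, bias=bias),
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

head = Head(embed_dim=32, vocab_size=100)
out = head(torch.randn(2, 8, 32))
print(out.shape)  # torch.Size([2, 8, 100])
```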
class olm.nn.blocks.QKVProjection(*args: Any, **kwargs: Any)¶
Bases: Module
Computes Query, Key, and Value projections for attention mechanisms.
Applies three separate linear transformations to the input to generate Q, K, and V tensors. Supports various weight initialization schemes.
W_q¶
Linear layer for Query projection.
- Type: Linear
W_k¶
Linear layer for Key projection.
- Type: Linear
W_v¶
Linear layer for Value projection.
- Type: Linear
forward(x: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]¶
Performs the Q, K, V projections.
- Parameters: x (torch.Tensor) – Input tensor of shape (batch, seq_len, dim_in).
- Returns: A tuple containing (Q, K, V) tensors.
- Return type: tuple[torch.Tensor, torch.Tensor, torch.Tensor]
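The three-projection pattern can be sketched as follows. The constructor parameters (`dim_in`, `dim_out`) and the bias choice are assumptions, since only the attributes `W_q`, `W_k`, `W_v` and the `forward` signature are documented:

```python
import torch
import torch.nn as nn
from typing import Tuple

class QKV(nn.Module):
    """Sketch: three independent Linear maps from the same input."""

    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.W_q = nn.Linear(dim_in, dim_out, bias=False)
        self.W_k = nn.Linear(dim_in, dim_out, bias=False)
        self.W_v = nn.Linear(dim_in, dim_out, bias=False)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # One input, three projections, returned as a (Q, K, V) tuple.
        return self.W_q(x), self.W_k(x), self.W_v(x)

q, k, v = QKV(32, 32)(torch.randn(2, 8, 32))
print(q.shape, k.shape, v.shape)  # three tensors of shape (2, 8, 32)
```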
class olm.nn.blocks.TransformerBlock(*args: Any, **kwargs: Any)¶
Bases: Block
A single Transformer block containing Multi-Head Attention and a FeedForward Network.
This block implements the standard Transformer architecture with pre-normalization, Rotary Positional Embeddings (RoPE), and SwiGLU activation in the feedforward layer. It supports causal masking for autoregressive modeling.
Structure: Input -> LayerNorm -> MHA(RoPE) -> Residual -> LayerNorm -> SwiGLU FFN -> Residual -> Output
- Parameters:
- embed_dim (int) – The dimension of the embedding space (d_model).
- num_heads (int) – Number of attention heads. Must evenly divide embed_dim (embed_dim % num_heads == 0).
- max_seq_len (int) – Maximum sequence length supported by the model (for RoPE).
- dropout (float , optional) – Dropout probability for attention and FFN. Defaults to 0.0.
- causal (bool , optional) – Whether to apply causal masking in attention. Defaults to False.
- ff_multiplier (float , optional) – Multiplier for the hidden dimension of the FFN. Commonly 4.0 (standard) or 8/3 (SwiGLU). Defaults to 2.5.
layers¶
The sequential list of layers within the block.
- Type: nn.ModuleList
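The pre-norm residual wiring and the SwiGLU FFN can be sketched as below. This is an assumption-laden stand-in: RoPE is omitted for brevity, `nn.MultiheadAttention` substitutes for `olm`'s attention, and the FFN hidden size is taken as `int(ff_multiplier * embed_dim)`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU: down(silu(gate(x)) * up(x)), hidden = int(ff_multiplier * dim)."""

    def __init__(self, embed_dim, ff_multiplier=2.5):
        super().__init__()
        hidden = int(embed_dim * ff_multiplier)
        self.w_gate = nn.Linear(embed_dim, hidden, bias=False)
        self.w_up = nn.Linear(embed_dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, embed_dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class Block(nn.Module):
    """Pre-norm sketch: x + MHA(LN(x)), then x + FFN(LN(x)). RoPE omitted."""

    def __init__(self, embed_dim, num_heads, causal=False, dropout=0.0):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = SwiGLUFFN(embed_dim)
        self.causal = causal

    def forward(self, x):
        n = x.size(1)
        mask = (torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
                if self.causal else None)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out               # first residual
        return x + self.ffn(self.norm2(x))  # second residual

y = Block(embed_dim=32, num_heads=4, causal=True)(torch.randn(2, 8, 32))
print(y.shape)  # torch.Size([2, 8, 32])
```

Note the 8/3 multiplier mentioned above: with SwiGLU's three weight matrices, a multiplier of 8/3 keeps the FFN's parameter count close to that of a standard two-matrix FFN with multiplier 4.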
Modules¶
| Module | Description |
|---|---|
| LM(*args, **kwargs) | A simple Language Model (LM) architecture. |
| linear_projections | |
| output_head | |
| transformer_block | |