
olm.nn.blocks

class olm.nn.blocks.LM(*args: Any, **kwargs: Any)

Bases: Block

A simple Language Model (LM) architecture.

This model consists of an embedding layer, a stack of Transformer blocks, and a final output projection to the vocabulary size. It is designed for causal language modeling (next-token prediction).

Structure: Input IDs -> Embedding -> [TransformerBlock] x N -> OutputHead -> Logits

  • Parameters:
  • vocab_size (int) – Size of the vocabulary.
  • embed_dim (int) – Dimension of the embeddings and hidden states.
  • num_heads (int) – Number of attention heads in Transformer blocks.
  • num_layers (int) – Number of Transformer blocks.
  • max_seq_len (int) – Maximum sequence length for the model.
  • dropout (float , optional) – Dropout probability. Defaults to 0.0.
  • causal (bool , optional) – Whether to use causal masking. Defaults to True.
  • ff_multiplier (float , optional) – Multiplier for FFN hidden dimension. Defaults to 2.5.

layers

The sequence of layers in the model.

  • Type: nn.ModuleList
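
A minimal usage sketch, assuming the model is constructed with the keyword arguments listed above and that calling it on a batch of token IDs returns per-position logits over the vocabulary (the forward signature is not documented in this section, so the call pattern below is an assumption):

    import torch
    from olm.nn.blocks import LM

    # Illustrative hyperparameters; values are not taken from the library.
    model = LM(
        vocab_size=32_000,
        embed_dim=512,
        num_heads=8,
        num_layers=6,
        max_seq_len=1024,
        dropout=0.1,
        causal=True,
    )

    # Assumed call pattern: token IDs in, next-token logits out.
    input_ids = torch.randint(0, 32_000, (2, 128))  # (batch, seq_len)
    logits = model(input_ids)                        # expected: (2, 128, 32_000)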

class olm.nn.blocks.OutputHead(*args: Any, **kwargs: Any)

Bases: Block

Final output projection layer for the Language Model.

Consists of a LayerNorm followed by a Linear projection to the vocabulary size: LayerNorm -> Linear(vocab_size).

  • Parameters:
  • embed_dim (int) – The dimension of the embedding space.
  • vocab_size (int) – The size of the vocabulary.
  • bias (bool , optional) – Whether to include bias in the linear layer. Defaults to False.

layers

The normalization and linear layers.

  • Type: nn.ModuleList
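
For orientation, the documented LayerNorm -> Linear structure corresponds roughly to the following plain-PyTorch sketch; the class name, attribute names, and forward signature here are illustrative, not the actual implementation:

    import torch
    import torch.nn as nn

    class OutputHeadSketch(nn.Module):
        """Illustrative stand-in for OutputHead: LayerNorm followed by a Linear projection."""

        def __init__(self, embed_dim: int, vocab_size: int, bias: bool = False):
            super().__init__()
            self.norm = nn.LayerNorm(embed_dim)
            self.proj = nn.Linear(embed_dim, vocab_size, bias=bias)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, embed_dim) -> logits: (batch, seq_len, vocab_size)
            return self.proj(self.norm(x))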

class olm.nn.blocks.QKVProjection(*args: Any, **kwargs: Any)

Bases: Module

Computes Query, Key, and Value projections for attention mechanisms.

Applies three separate linear transformations to the input to generate Q, K, and V tensors. Supports various weight initialization schemes.

W_q

Linear layer for Query projection.

W_k

Linear layer for Key projection.

W_v

Linear layer for Value projection.

forward(x: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Performs the Q, K, V projections.

  • Parameters: x (torch.Tensor) – Input tensor of shape (batch, seq_len, dim_in).
  • Returns: A tuple containing (Q, K, V) tensors.
  • Return type: tuple[torch.Tensor, torch.Tensor, torch.Tensor]
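
A rough plain-PyTorch equivalent of the documented behaviour, shown only to illustrate the three separate projections; the constructor arguments (dim_in, dim_out, bias) are assumptions, since they are not listed in this section:

    import torch
    import torch.nn as nn

    class QKVProjectionSketch(nn.Module):
        """Illustrative stand-in: three independent linear maps producing Q, K, V."""

        def __init__(self, dim_in: int, dim_out: int, bias: bool = False):
            super().__init__()
            self.W_q = nn.Linear(dim_in, dim_out, bias=bias)
            self.W_k = nn.Linear(dim_in, dim_out, bias=bias)
            self.W_v = nn.Linear(dim_in, dim_out, bias=bias)

        def forward(self, x: torch.Tensor):
            # x: (batch, seq_len, dim_in) -> each of Q, K, V: (batch, seq_len, dim_out)
            return self.W_q(x), self.W_k(x), self.W_v(x)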

class olm.nn.blocks.TransformerBlock(*args: Any, **kwargs: Any)

Bases: Block

A single Transformer block containing Multi-Head Attention and a FeedForward Network.

This block implements the standard Transformer architecture with pre-normalization, Rotary Positional Embeddings (RoPE), and SwiGLU activation in the feedforward layer. It supports causal masking for autoregressive modeling.

Structure: Input -> LayerNorm -> MHA(RoPE) -> Residual -> LayerNorm -> SwiGLU FFN -> Residual -> Output

  • Parameters:
  • embed_dim (int) – The dimension of the embedding space (d_model).
  • num_heads (int) – Number of attention heads. Must satisfy embed_dim % num_heads == 0.
  • max_seq_len (int) – Maximum sequence length supported by the model (for RoPE).
  • dropout (float , optional) – Dropout probability for attention and FFN. Defaults to 0.0.
  • causal (bool , optional) – Whether to apply causal masking in attention. Defaults to False.
  • ff_multiplier (float , optional) – Multiplier for the hidden dimension of the FFN. Commonly 4.0 (standard) or 8/3 (SwiGLU). Defaults to 2.5.

layers

The sequential list of layers within the block.

  • Type: nn.ModuleList
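
A minimal usage sketch, assuming the block is constructed with the keyword arguments listed above and that its forward pass maps a hidden-state tensor to one of the same shape (the forward signature is not documented in this section):

    import torch
    from olm.nn.blocks import TransformerBlock

    # Illustrative hyperparameters; embed_dim % num_heads == 0 as required.
    block = TransformerBlock(
        embed_dim=512,
        num_heads=8,
        max_seq_len=1024,
        dropout=0.0,
        causal=True,
        ff_multiplier=8 / 3,  # common SwiGLU choice; the default is 2.5
    )

    # Assumed call pattern: hidden states in, hidden states of the same shape out.
    x = torch.randn(2, 128, 512)  # (batch, seq_len, embed_dim)
    y = block(x)                  # expected: (2, 128, 512)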

Modules

  • LM(*args, **kwargs) – A simple Language Model (LM) architecture.
  • linear_projections
  • output_head
  • transformer_block