OLM API Reference

`olm.nn.blocks.transformer_block`

Source: src/olm/nn/blocks/transformer_block.py:1

Classes

TransformerBlock(embed_dim: int, num_heads: int, max_seq_len: int, dropout: float = 0.0, causal: bool = False, ff_multiplier: float = 2.5)

Bases: olm.nn.structure.block.Block

Source: src/olm/nn/blocks/transformer_block.py:9

Pre-norm transformer block for causal language modeling.

The block contains two residual branches: LayerNorm followed by FlashAttention with RoPE, then LayerNorm followed by a SwiGLU feed-forward network. It is the default repeated block used by LM.

Structure

x -> residual(LayerNorm, attention) -> residual(LayerNorm, FFN)
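The structure above can be sketched in plain PyTorch. This is an illustration of the pre-norm residual pattern only, not the OLM implementation: `PreNormResidual`, `SelfAttentionStandIn`, `SwiGLUSketch`, and `build_block_sketch` are hypothetical names, `nn.MultiheadAttention` stands in for FlashAttention with RoPE (no rotary embeddings are applied), and the FFN hidden width is assumed to be `int(embed_dim * ff_multiplier)`.

```python
import torch
import torch.nn.functional as F
from torch import nn


class PreNormResidual(nn.Module):
    """Hypothetical helper computing x + sublayer(LayerNorm(x))."""

    def __init__(self, embed_dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize first (pre-norm), then add the branch output to the input.
        return x + self.sublayer(self.norm(x))


class SelfAttentionStandIn(nn.Module):
    """Stand-in for FlashAttention with RoPE; rotary embeddings are omitted."""

    def __init__(self, embed_dim: int, num_heads: int, causal: bool = False):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.causal = causal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = None
        if self.causal:
            seq_len = x.size(1)
            # True above the diagonal marks future positions as not attendable.
            mask = torch.triu(
                torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
                diagonal=1,
            )
        out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        return out


class SwiGLUSketch(nn.Module):
    """Gated feed-forward network: down(silu(gate(x)) * up(x))."""

    def __init__(self, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(embed_dim, hidden_dim, bias=False)
        self.up = nn.Linear(embed_dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, embed_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


def build_block_sketch(embed_dim, num_heads, causal=False, ff_multiplier=2.5):
    # Assumption: the FFN hidden width is embed_dim * ff_multiplier, truncated.
    hidden_dim = int(embed_dim * ff_multiplier)
    return nn.Sequential(
        PreNormResidual(embed_dim, SelfAttentionStandIn(embed_dim, num_heads, causal)),
        PreNormResidual(embed_dim, SwiGLUSketch(embed_dim, hidden_dim)),
    )


block = build_block_sketch(embed_dim=16, num_heads=4, causal=True)
hidden = torch.randn(2, 5, 16)   # [batch, seq_len, embed_dim]
out = block(hidden)              # same shape as the input
```

Because both branches add their output back onto the input, the block preserves the [batch, seq_len, embed_dim] shape, which is what allows it to be stacked repeatedly.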

Forward

Accepts hidden states with shape [batch, seq_len, embed_dim] and returns hidden states with the same shape.

Parameters

  • embed_dim (int): The dimension of the embedding space (d_model).
  • num_heads (int): Number of attention heads. Must divide embed_dim evenly (embed_dim % num_heads == 0).
  • max_seq_len (int): Maximum sequence length supported by the model (for RoPE).
  • dropout (float, optional): Dropout probability for attention and FFN. Defaults to 0.0.
  • causal (bool, optional): Whether to apply causal masking in attention. Defaults to False.
  • ff_multiplier (float, optional): Multiplier for the hidden dimension of the FFN. Commonly 4.0 (standard) or 8/3 (SwiGLU). Defaults to 2.5.
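The constraints in the parameter list can be checked before constructing a block. The sketch below is a hypothetical helper, not part of OLM: `check_block_config` is an illustrative name, and the FFN hidden width is assumed to be `embed_dim * ff_multiplier` truncated to an integer, which the library may round differently.

```python
def check_block_config(embed_dim: int, num_heads: int, ff_multiplier: float = 2.5) -> dict:
    """Validate head divisibility and derive per-head and FFN widths."""
    if embed_dim % num_heads != 0:
        raise ValueError(
            f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})"
        )
    head_dim = embed_dim // num_heads
    # Assumption: hidden width is the truncated product; OLM may round otherwise.
    ffn_hidden_dim = int(embed_dim * ff_multiplier)
    return {"head_dim": head_dim, "ffn_hidden_dim": ffn_hidden_dim}


config = check_block_config(embed_dim=512, num_heads=8)
# config["head_dim"] is 64 and config["ffn_hidden_dim"] is 1280
```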

Attributes

  • blocks (nn.ModuleList): Sequential OLM block structure.

Methods

forward(self, x: torch.Tensor) -> torch.Tensor (inherited from Block)

Source: src/olm/nn/structure/block.py:26

Apply each block to the input in sequence.

Parameters

  • x: Input tensor.

Returns

Output tensor after all blocks have been applied.
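The inherited forward behavior, applying each entry of `blocks` in order, can be sketched as follows. `SequentialBlockSketch` is a hypothetical stand-in for `olm.nn.structure.block.Block`, assuming only what this page states: the sub-modules live in an `nn.ModuleList` named `blocks` and are applied in sequence.

```python
import torch
from torch import nn


class SequentialBlockSketch(nn.Module):
    """Hypothetical stand-in for Block: holds sub-modules and chains them."""

    def __init__(self, *modules: nn.Module):
        super().__init__()
        # ModuleList registers the sub-modules so their parameters are tracked.
        self.blocks = nn.ModuleList(modules)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Feed the output of each block into the next one.
        for block in self.blocks:
            x = block(x)
        return x


stack = SequentialBlockSketch(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 4))
result = stack(torch.randn(2, 3, 4))  # shape [2, 3, 4]
```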