OLM Docs

Building Blocks

This guide is a conceptual tour of the components in olm.nn — the layers you compose into models. For exact signatures and parameters, follow the links into the API reference; this page focuses on what each component is and when to choose it.

Every component here is a plain torch.nn.Module and follows a base-class-per-family pattern: an abstract base defines the interface, and concrete classes implement variants. To add your own variant, subclass the relevant base.

Embeddings

Embeddings map token ids to vectors and inject position information. How position is handled determines which attention layer you pair them with (see Attention below).

Token embedding: Embedding(vocab_size, embedding_dim) is the lookup table from token ids to vectors. Every model starts with one.
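As a rough illustration of the lookup, the sketch below uses PyTorch's built-in torch.nn.Embedding as a stand-in. It is not OLM's own Embedding class, whose exact signature is in the API reference, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Stand-in for a token embedding: torch.nn.Embedding, not OLM's class.
vocab_size, embedding_dim = 1000, 64          # illustrative sizes
embed = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([[5, 42, 7], [1, 0, 999]])   # (batch=2, seq_len=3)
vectors = embed(token_ids)                             # (2, 3, 64)

# Each token id simply selects one row of the weight matrix.
same_row = torch.equal(vectors[0, 1], embed.weight[42])
```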

Positional schemes:

| Component | Where position enters | Notable use |
| --- | --- | --- |
| AbsolutePositionalEmbedding | Learned vector added to token embeddings | GPT-2 |
| SinusoidalPositionalEmbedding | Fixed sin/cos added to token embeddings | Original Transformer |
| RotaryPositionalEmbedding (RoPE) | Rotation applied to queries/keys inside attention | Llama, Qwen, most modern LLMs |
| ALiBiPositionalBias | Distance-based bias added to attention scores | Length extrapolation |
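To make "fixed sin/cos added to token embeddings" concrete, here is a minimal sketch of the sinusoidal table from the original Transformer, written in plain PyTorch. The function name sinusoidal_table is hypothetical, and the code is not taken from OLM's SinusoidalPositionalEmbedding.

```python
import math
import torch

def sinusoidal_table(max_len: int, dim: int) -> torch.Tensor:
    """Return a (max_len, dim) table of fixed sin/cos position encodings."""
    assert dim % 2 == 0, "this sketch assumes an even embedding dimension"
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # Frequencies 1 / 10000^(2i/dim) for i = 0 .. dim/2 - 1
    inv_freq = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    table = torch.zeros(max_len, dim)
    table[:, 0::2] = torch.sin(position * inv_freq)   # even channels
    table[:, 1::2] = torch.cos(position * inv_freq)   # odd channels
    return table

token_vectors = torch.randn(2, 16, 64)        # (batch, seq_len, dim)
pos = sinusoidal_table(max_len=128, dim=64)
x = token_vectors + pos[:16]                  # broadcast over the batch
```

Because the table has no parameters, it can be computed once and sliced to the current sequence length.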

For long-context work, RoPE has two further variants: PartialRotaryPositionalEmbedding, which rotates only a fraction of the dimensions (as in some Llama configs), and ScaledRotaryPositionalEmbedding, which supports linear, NTK, dynamic-NTK, YaRN, and XPos scaling for extending context beyond the trained length.
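The sketch below shows the rotation itself and the simplest of the scaling schemes, linear position interpolation. It assumes the interleaved-pair convention for RoPE, and the helper names rope_angles and apply_rope are hypothetical. OLM's classes may lay out dimensions differently, and the other scaling modes (NTK, YaRN, XPos) are not covered.

```python
import torch

def rope_angles(positions, head_dim, base=10000.0, scale=1.0):
    """Rotation angles of shape (len(positions), head_dim // 2).

    scale > 1 is linear position interpolation: positions are divided by
    `scale`, so a sequence `scale` times longer stays inside the angle
    range seen during training.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    return torch.outer(positions.float() / scale, inv_freq)

def apply_rope(x, angles):
    """Rotate consecutive pairs (x[2i], x[2i+1]) of the last dim by angles[..., i]."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

head_dim = 64
trained = rope_angles(torch.arange(2048), head_dim)
extended = rope_angles(torch.arange(4096), head_dim, scale=2.0)
# Every second extended position lands exactly on a trained position.
matches = torch.allclose(extended[0::2], trained)

# The attention score between rotated q and k depends only on the offset m - n.
torch.manual_seed(0)
q, k = torch.randn(head_dim), torch.randn(head_dim)

def score(m, n):
    a = rope_angles(torch.tensor([m, n]), head_dim)
    return torch.dot(apply_rope(q, a[0]), apply_rope(k, a[1]))

same_offset = torch.allclose(score(3, 1), score(10, 8), atol=1e-3)
```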

Note

Absolute and sinusoidal embeddings are added to token embeddings at the input. RoPE and ALiBi act inside attention. This is why the choice of positional scheme is tied to the choice of attention layer.
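As an example of a scheme that acts inside attention, the following sketch builds an ALiBi-style bias and adds it to raw attention scores. It assumes a power-of-two head count and the geometric slope schedule from the ALiBi paper. The function names are hypothetical, and this is not OLM's ALiBiPositionalBias.

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (power-of-two head counts only)."""
    assert num_heads & (num_heads - 1) == 0, "sketch handles power-of-two heads only"
    return torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Bias of shape (num_heads, seq_len, seq_len): -slope * (query_pos - key_pos)."""
    pos = torch.arange(seq_len)
    # Distance to earlier keys; future keys are left at 0 here and are
    # expected to be removed by the causal mask.
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)
    return -alibi_slopes(num_heads)[:, None, None] * distance

bias = alibi_bias(num_heads=8, seq_len=4)
scores = torch.randn(8, 4, 4) + bias     # bias is added before the softmax
```

No position information touches the token embeddings here, which is why an ALiBi model pairs a plain token embedding with an ALiBi-aware attention layer.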

Attention

OLM provides a family of attention mechanisms that share a common base (AttentionBase, or AttentionwithRoPEBase for the rotary variants).

| Layer | Position handling | Backend | Use when |
| --- | --- | --- | --- |
| MultiHeadAttention | none (add a positional embedding at the input) | explicit softmax | Teaching, full control, GPT-2-style models |
| MultiHeadAttentionwithRoPE | RoPE, internal | explicit softmax | Modern decoder with readable internals |
| FlashAttention | none | scaled_dot_product_attention | Speed/memory, with an input positional embedding |
| FlashAttentionwithRoPE | RoPE, internal | scaled_dot_product_attention | Fast modern decoder (recommended default) |
| GroupedQueryAttention | RoPE, internal | scaled_dot_product_attention | Large models / inference efficiency |
| MultiHeadAttentionwithALiBi | ALiBi, internal | explicit softmax | Training short, testing long |
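The two backends listed above compute the same function. The sketch below checks this in plain PyTorch by comparing an explicit causal attention against torch.nn.functional.scaled_dot_product_attention. It operates on raw tensors and does not use OLM's layers.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(2, 4, 16, 32)   # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)

# Explicit path: matmul -> causal mask -> softmax -> matmul.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
causal = torch.ones(16, 16, dtype=torch.bool).tril()
scores = scores.masked_fill(~causal, float("-inf"))
explicit = torch.softmax(scores, dim=-1) @ v

# Fused path: PyTorch selects a kernel based on hardware and inputs.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

max_diff = (explicit - fused).abs().max().item()   # small floating-point difference
```

The explicit path materializes the full (seq_len, seq_len) score matrix, which is what makes it easy to inspect and modify but more memory-hungry on long sequences.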

A few practical notes:

  • The Flash* and GroupedQueryAttention layers call PyTorch's scaled_dot_product_attention, which automatically dispatches to fused Flash-Attention kernels when the hardware and inputs allow. The plain MultiHeadAttention* layers compute attention explicitly (matmul → softmax → matmul), which is slower but easy to read and modify.
  • Grouped-query attention uses fewer key/value heads than query heads. Setting num_kv_heads == num_heads recovers standard MHA; num_kv_heads == 1 gives multi-query attention. It also offers optional QK-normalization (as used in Qwen3).
  • For causal language modeling, pass causal=True (the layer applies a causal mask) — or let the SDPA backend handle it.
from olm.nn.attention import FlashAttentionwithRoPE, GroupedQueryAttention

# A fast, modern self-attention layer
attn = FlashAttentionwithRoPE(embed_dim=768, num_heads=12, max_seq_len=2048, causal=True)

# Grouped-query attention: 32 query heads, 8 KV heads
gqa = GroupedQueryAttention(embed_dim=4096, num_heads=32, num_kv_heads=8, max_seq_len=8192)
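To show what sharing key/value heads means in grouped-query attention, here is a tensor-level sketch with 32 query heads and 8 KV heads. It assumes the common approach of repeating each KV head across its group before calling SDPA. OLM's GroupedQueryAttention may implement the sharing differently.

```python
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 2, 16, 64
num_heads, num_kv_heads = 32, 8
group = num_heads // num_kv_heads        # 4 query heads share each KV head

q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)   # 4x fewer keys to store
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Expand KV heads so every query head has a matching key/value head.
k_rep = k.repeat_interleave(group, dim=1)                 # (2, 32, 16, 64)
v_rep = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k_rep, v_rep, is_causal=True)
```

The saving is in what must be projected and cached (k and v), not in the attention computation itself, which is why the technique matters most at inference time.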

Normalization

| Layer | Formula | Used by |
| --- | --- | --- |
| LayerNorm | normalize by mean and variance, then scale and shift | GPT-2, BERT |
| RMSNorm | normalize by root-mean-square only, then scale | Llama, Qwen, Gemma |

Both compute in float32 internally for numerical stability and cast back to the input dtype, which matters under mixed precision. RMSNorm drops the mean-centering and bias of LayerNorm, making it slightly cheaper; it is the common choice in recent LLMs.
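A minimal sketch of the float32 upcast described above, assuming a standard RMSNorm formulation. SketchRMSNorm is a hypothetical class for illustration. OLM's RMSNorm may differ in details such as the epsilon value or where the cast back happens.

```python
import torch
import torch.nn as nn

class SketchRMSNorm(nn.Module):
    """Illustrative RMSNorm; not OLM's implementation."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))    # scale only, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        in_dtype = x.dtype
        x = x.float()                                   # compute in float32
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return ((x / rms) * self.weight).to(in_dtype)   # cast back to input dtype

x = torch.randn(2, 8, 64).to(torch.bfloat16)            # low-precision activations
y = SketchRMSNorm(64)(x)
```

Unlike LayerNorm, there is no mean subtraction and no bias term, so the only learned parameter is the per-channel scale.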

Feed-forward networks

The position-wise feed-forward network (FFN) is applied to every token independently.

| Layer | Structure | Used by |
| --- | --- | --- |
| ClassicFFN | Linear → activation → Linear (default hidden = 4×) | GPT-2 (GELU) |
| SwiGLUFFN | gated: (SiLU(W₁x) ⊙ W₂x) → Linear | Llama, PaLM |
| GeGLUFFN | gated, GELU variant of the above | Gemma |

Gated FFNs (SwiGLU, GeGLU) split the up-projection into a gate and a value, multiply them elementwise, and tend to outperform a plain MLP at equal parameter count — which is why modern models use them. OLM's gated FFNs default their hidden dimension via an ff_multiplier so the gated and ungated variants land at comparable parameter counts.
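The parameter-count matching can be checked with a small sketch. It assumes bias-free linear layers and the commonly used hidden size of 8/3 × the model dimension for the gated variant. The actual default of OLM's ff_multiplier is not stated here, and SketchSwiGLUFFN is a hypothetical stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchSwiGLUFFN(nn.Module):
    """Illustrative gated FFN: (SiLU(W1 x) * W2 x) -> Linear. Not OLM's class."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_value = nn.Linear(dim, hidden, bias=False)
        self.w_out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

dim = 768
# Ungated baseline with hidden = 4 x dim: two matrices of dim x 4dim.
classic = nn.Sequential(
    nn.Linear(dim, 4 * dim, bias=False), nn.GELU(), nn.Linear(4 * dim, dim, bias=False)
)
# Gated variant has three matrices, so hidden = 8/3 x dim keeps the total equal.
gated = SketchSwiGLUFFN(dim, hidden=8 * dim // 3)   # hidden = 2048

y = gated(torch.randn(2, 16, dim))
n_classic, n_gated = count_params(classic), count_params(gated)
```

With these sizes both networks hold 8 × dim² weights, so a comparison between them isolates the effect of gating.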

Mixture-of-Experts

For sparse scaling, each FFN has a Mixture-of-Experts counterpart (ClassicMoEFFN, SwiGLUMoEFFN, and GeGLUMoEFFN) built on MoEFeedForwardBase. A MoERouter performs top-k softmax gating over num_experts experts, optionally alongside num_shared_experts shared experts that run for every token. Only the selected experts run per token, so capacity (parameter count) grows without a proportional increase in per-token compute.
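The routing step can be sketched as follows. This version applies softmax over all experts, keeps the top k, and renormalizes the kept weights. That ordering is one common convention and an assumption here, not a statement about MoERouter. Shared experts are omitted, and all class and function names are hypothetical.

```python
import torch
import torch.nn as nn

class SketchTopKRouter(nn.Module):
    """Illustrative top-k softmax router; not OLM's MoERouter."""

    def __init__(self, dim: int, num_experts: int, top_k: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        """x: (tokens, dim) -> (weights, indices), each of shape (tokens, top_k)."""
        probs = torch.softmax(self.gate(x), dim=-1)
        weights, indices = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over chosen experts
        return weights, indices

def moe_forward(x, router, experts):
    """Run each expert only on the tokens routed to it and sum the weighted outputs."""
    weights, indices = router(x)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_idx, slot = (indices == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue                                  # expert e does no work for this batch
        contribution = weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        out.index_add_(0, token_idx, contribution)
    return out

torch.manual_seed(0)
dim, num_experts, top_k = 32, 8, 2
router = SketchTopKRouter(dim, num_experts, top_k)
experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

x = torch.randn(10, dim)                              # 10 tokens, already flattened
weights, indices = router(x)
y = moe_forward(x, router, experts)
```

Each token passes through only top_k of the num_experts networks, which is where the compute saving comes from.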

Note

The current MoE layers focus on readable top-k routing and expert composition. Load-balancing auxiliary losses and expert-parallel dispatch are roadmap items, so add your own auxiliary term in a custom training loop if your experiment depends on balanced expert usage.
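If you do need such an auxiliary term, one option is a load-balancing loss in the style of the Switch Transformer, sketched below. This is a swapped-in technique, not part of OLM. The function name, the top-k generalization, and the way you obtain router logits from your model are all assumptions.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """router_logits: (tokens, num_experts). Returns a scalar that is
    close to 1.0 for balanced routing and grows as routing collapses."""
    num_experts = router_logits.size(-1)
    probs = torch.softmax(router_logits, dim=-1)
    _, indices = probs.topk(top_k, dim=-1)
    # f_i: fraction of routing assignments sent to expert i (no gradient).
    assignments = F.one_hot(indices, num_experts).float().sum(dim=1)   # (tokens, num_experts)
    f = assignments.mean(dim=0) / top_k
    # P_i: mean router probability for expert i (carries the gradient).
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)

# Uniform router: every expert gets equal probability.
balanced = load_balancing_loss(torch.zeros(16, 8), top_k=2)

# Collapsed router: almost all probability on expert 0.
skewed_logits = torch.zeros(16, 8)
skewed_logits[:, 0] = 10.0
collapsed = load_balancing_loss(skewed_logits, top_k=1)
```

In a custom training loop this would typically be added to the task loss with a small coefficient, for example loss = task_loss + aux_weight * balance_loss, where aux_weight is a hyperparameter you choose.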

from olm.nn.feedforward import SwiGLUFFN

ffn = SwiGLUFFN(embed_dim=768)         # dense gated FFN

from olm.nn.feedforward.swiglu_moe import SwiGLUMoEFFN
moe = SwiGLUMoEFFN(embed_dim=768, num_experts=8, top_k=2)   # sparse MoE FFN

Activations

All activations subclass ActivationBase and are registered in the ACTIVATIONS registry, so they can be selected by name in config-driven workflows.

  • Pointwise: ReLU, LeakyReLU, ELU, SELU, PReLU, GELU, SiLU, Mish, Softplus, Tanh, Sigmoid, Softmax, Identity.
  • Gated linear units (used inside gated FFNs): SwiGLU, GeGLU, ReGLU, LiGLU, GLU. These split their input in half along the last dimension and use one half to gate the other.
from olm.core.registry import ACTIVATIONS

act = ACTIVATIONS.get("gelu")()   # look up by name
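The split-and-gate behaviour of the gated linear units can be sketched in a few lines of plain PyTorch. The helper gated_unit is hypothetical, and which half acts as the gate is an assumption. Here it follows torch.nn.functional.glu, where the second half is the gate, and OLM's classes may use the opposite order.

```python
import torch
import torch.nn.functional as F

def gated_unit(x: torch.Tensor, activation) -> torch.Tensor:
    """Split the last dimension in two and gate the first half with the second."""
    value, gate = x.chunk(2, dim=-1)
    return value * activation(gate)

x = torch.randn(4, 10, 128)
glu = gated_unit(x, torch.sigmoid)    # sigmoid gate: plain GLU
swiglu = gated_unit(x, F.silu)        # SiLU gate: SwiGLU-style
reglu = gated_unit(x, F.relu)         # ReLU gate: ReGLU-style

# With a sigmoid gate this matches PyTorch's built-in GLU.
matches_builtin = torch.allclose(glu, F.glu(x, dim=-1))
```

The output has half the input's last dimension, which is why a gated FFN's up-projection produces twice the hidden width it ultimately uses.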

Putting it together

These components are designed to be combined with the Block system. A single modern transformer layer pairs a normalization, an attention, and a gated FFN inside residual connections:

from olm.nn.structure import Block
from olm.nn.structure.combinators import Residual
from olm.nn.norms import RMSNorm
from olm.nn.attention import FlashAttentionwithRoPE
from olm.nn.feedforward import SwiGLUFFN

def layer(d_model, n_heads, max_seq_len):
    return Block([
        Residual(Block([RMSNorm(d_model),
                        FlashAttentionwithRoPE(d_model, n_heads, max_seq_len, causal=True)])),
        Residual(Block([RMSNorm(d_model), SwiGLUFFN(d_model)])),
    ])

Next steps