olm.nn.blocks.transformer_block

Classes

TransformerBlock(*args, **kwargs): A single Transformer block containing Multi-Head Attention and a FeedForward Network.

class olm.nn.blocks.transformer_block.Block(*args: Any, **kwargs: Any)

Bases: Module

Lightweight sequential container for composable submodules.

Similar to nn.Sequential, but exposes the underlying list for inspection or dynamic manipulation by higher-level builders.

  • Parameters: blocks – Ordered list of modules applied to the input in sequence.

blocks

ModuleList storing the ordered blocks.

forward(x: torch.Tensor) → torch.Tensor

Apply each block to the input in sequence.

  • Parameters: x – Input tensor.
  • Returns: Output tensor after all blocks have been applied.
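The container behaves like nn.Sequential but keeps its ModuleList exposed. A minimal sketch in plain PyTorch (MiniBlock is a hypothetical stand-in, not the olm implementation):

```python
import torch
import torch.nn as nn

# Minimal sketch of the Block container described above: an exposed
# ModuleList whose modules are applied to the input in order.
class MiniBlock(nn.Module):
    def __init__(self, blocks):
        super().__init__()
        # Exposed for inspection or dynamic manipulation by builders.
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

block = MiniBlock([nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8)])
out = block(torch.randn(2, 8))
```

Because `blocks` is a plain ModuleList, a higher-level builder can append, remove, or replace entries before or after construction.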

class olm.nn.blocks.transformer_block.LayerNorm(*args: Any, **kwargs: Any)

Bases: NormBase

Layer Normalization layer.

Implements Layer Normalization as described in “Layer Normalization” (https://arxiv.org/abs/1607.06450). Normalizes the input across the features dimension.

  • Parameters:
  • d_model (int) – The dimension of the model to normalize.
  • eps (float, optional) – Small constant for numerical stability. Defaults to 1e-5.
  • device (torch.device, optional) – Target device.
  • dtype (torch.dtype, optional) – Target data type.

gamma

Learnable scale parameter.

  • Type: nn.Parameter

beta

Learnable shift parameter.

  • Type: nn.Parameter

forward(x: torch.Tensor) → torch.Tensor

Forward pass of LayerNorm.

  • Parameters: x (torch.Tensor) – Input tensor of shape (batch_size, sequence_length, d_model).
  • Returns: Normalized output tensor of the same shape.
  • Return type: torch.Tensor
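The normalization described above can be sketched as follows (MiniLayerNorm is a hypothetical stand-in matching the documented d_model/eps/gamma/beta interface, not the olm implementation):

```python
import torch
import torch.nn as nn

# Minimal LayerNorm sketch: normalize across the feature (last)
# dimension, then apply a learnable scale (gamma) and shift (beta).
class MiniLayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # learnable shift

    def forward(self, x):
        # Statistics are taken over d_model only, per token position.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

ln = MiniLayerNorm(16)
y = ln(torch.randn(2, 4, 16))
```

After normalization (and with beta at its zero init), each position's features have approximately zero mean.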

class olm.nn.blocks.transformer_block.MultiHeadAttentionwithRoPE(*args: Any, **kwargs: Any)

Bases: AttentionwithRoPEBase

Implements Multi-Head Attention (MHA) with Rotary Positional Embedding (RoPE).

Splits the input into multiple heads, computes scaled dot-product attention for each, and concatenates the results. Uses RoPE for positional information.

  • Parameters:
  • embed_dims (int) – Total dimension of the model.
  • num_heads (int) – Number of parallel attention heads.
  • max_seq_len (int) – Maximum sequence length.
  • dropout (float, optional) – Dropout probability on attention weights. Defaults to 0.0.
  • causal (bool, optional) – If True, applies a causal mask. Defaults to False.

scale

Scaling factor (1 / sqrt(head_dim)).

  • Type: float

causal

Whether to apply a causal mask.

  • Type: bool

compute_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, mask: torch.Tensor | None = None) → torch.Tensor

Computes scaled dot-product attention on the (RoPE-rotated) queries and keys.

  • Parameters:
  • q (torch.Tensor) – Query tensor of shape [batch, heads, seq, head_dim].
  • k (torch.Tensor) – Key tensor of shape [batch, heads, seq, head_dim].
  • v (torch.Tensor) – Value tensor of shape [batch, heads, seq, head_dim].
  • mask (torch.Tensor, optional) – Attention mask. Defaults to None.
  • Returns: The result of the attention mechanism applied to v.
  • Return type: torch.Tensor
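The scaled dot-product step that compute_attention performs can be sketched as below (the RoPE rotation of q and k is assumed to have happened already and is omitted; `sdpa` is a hypothetical helper, not the olm API):

```python
import math
import torch

# Sketch of scaled dot-product attention over [batch, heads, seq, head_dim]
# tensors, with an optional causal mask as described above.
def sdpa(q, k, v, mask=None, causal=False):
    scale = 1.0 / math.sqrt(q.size(-1))           # 1 / sqrt(head_dim)
    scores = (q @ k.transpose(-2, -1)) * scale    # [b, h, seq, seq]
    if causal:
        s = scores.size(-1)
        # Mask out positions to the right of the query position.
        future = torch.triu(torch.ones(s, s, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    attn = scores.softmax(dim=-1)
    return attn @ v

q = k = v = torch.randn(1, 2, 5, 8)
out = sdpa(q, k, v, causal=True)
```

With the causal mask, the first position can only attend to itself, so its output equals its own value vector.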

class olm.nn.blocks.transformer_block.Parallel(*args: Any, **kwargs: Any)

Bases: BaseCombinator

Apply multiple blocks to the same input and merge their outputs.

The merge function takes a list of tensors and a dimension argument.

  • Parameters:
  • blocks – Modules applied in parallel to the same input.
  • merge – Function that combines the list of outputs and a dimension.
  • dim – Dimension used by the merge function when applicable.

blocks

ModuleList storing the parallel blocks.

merge

Merge function used to combine outputs.

dim

Dimension passed to the merge function.

forward(x: torch.Tensor) → torch.Tensor

Apply all blocks in parallel and merge their outputs.

  • Parameters: x – Input tensor.
  • Returns: Merged output tensor.
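The fan-out/merge pattern can be sketched as follows (MiniParallel is a hypothetical stand-in, not the olm implementation; torch.cat is used here as an example merge function):

```python
import torch
import torch.nn as nn

# Sketch of the Parallel combinator: every block sees the same input,
# and a user-supplied merge function combines the outputs along dim.
class MiniParallel(nn.Module):
    def __init__(self, blocks, merge=torch.cat, dim=-1):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.merge = merge
        self.dim = dim

    def forward(self, x):
        outs = [blk(x) for blk in self.blocks]
        return self.merge(outs, dim=self.dim)

# Two parallel projections, concatenated on the feature dimension.
par = MiniParallel([nn.Linear(8, 4), nn.Linear(8, 4)])
out = par(torch.randn(2, 8))
```

Other merge functions (e.g., torch.stack followed by a sum) fit the same signature as long as they accept a list of tensors and a dim argument.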

class olm.nn.blocks.transformer_block.Repeat(*args: Any, **kwargs: Any)

Bases: BaseCombinator

Repeat a module a fixed number of times in sequence.

The module factory should return a new module instance on each call, so repeats do not share parameters.

  • Parameters:
  • module_func – Callable returning a new module instance.
  • num_repeat – Number of times to repeat the module.

module

Factory callable used to create new modules.

num_repeat

Number of repeats.

stack

ModuleList containing the repeated modules.

forward(x: torch.Tensor) → torch.Tensor

Apply the repeated modules in sequence.

  • Parameters: x – Input tensor.
  • Returns: Output tensor after all repeats.
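A sketch of the factory-based repetition (MiniRepeat is a hypothetical stand-in, not the olm implementation):

```python
import torch
import torch.nn as nn

# Sketch of Repeat: the factory is invoked num_repeat times, so each
# repeat owns its own parameters rather than sharing one module.
class MiniRepeat(nn.Module):
    def __init__(self, module_func, num_repeat):
        super().__init__()
        self.stack = nn.ModuleList(module_func() for _ in range(num_repeat))

    def forward(self, x):
        for m in self.stack:
            x = m(x)
        return x

rep = MiniRepeat(lambda: nn.Linear(8, 8), num_repeat=3)
out = rep(torch.randn(2, 8))
```

Passing a factory rather than a module instance is what guarantees unshared weights: `MiniRepeat(lambda: layer, 3)` with a pre-built `layer` would silently share one set of parameters.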

class olm.nn.blocks.transformer_block.Residual(*args: Any, **kwargs: Any)

Bases: BaseCombinator

Residual wrapper that adds the block output to its input.

  • Parameters: block – Module applied to the input before residual addition.

block

Module used for the residual transformation.

forward(x: torch.Tensor) → torch.Tensor

Apply the block and add the result to the input.

  • Parameters: x – Input tensor.
  • Returns: Output tensor with residual connection applied.
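The wrapper computes x + block(x), which can be sketched as (MiniResidual is a hypothetical stand-in, not the olm implementation):

```python
import torch
import torch.nn as nn

# Sketch of the Residual wrapper: the block's output is added back to
# the block's input, so the block learns a correction to identity.
class MiniResidual(nn.Module):
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        return x + self.block(x)

res = MiniResidual(nn.Linear(8, 8))
out = res(torch.randn(2, 8))
```

A block whose weights and bias are zero makes the wrapper an exact identity, which is why residual wrappers ease optimization early in training.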

class olm.nn.blocks.transformer_block.SwiGLUFFN(*args: Any, **kwargs: Any)

Bases: FeedForwardBase

SwiGLU-based feed-forward network used in modern Transformers (e.g., LLaMA, PaLM).

This layer implements the gated linear unit with Swish (SiLU) activation, which has been shown to improve performance over standard GELU/ReLU FFNs.

Structure: Input -> Linear(embed_dim -> 2 * hidden_dim) [splits into Gate and Value] -> SwiGLU (Gate * SiLU(Value)) -> Linear(hidden_dim -> embed_dim) -> Dropout

  • Parameters:
  • embed_dim (int) – The dimension of the input and output.
  • hidden_dim (int, optional) – The intermediate inner dimension. If None, defaults to int(ff_multiplier * embed_dim).
  • dropout (float, optional) – Dropout probability. Defaults to 0.0.
  • bias (bool, optional) – Whether to use bias in linear layers. Defaults to True.
  • ff_multiplier (float, optional) – Multiplier for the default hidden dimension. Defaults to 2.5 (commonly 8/3 for SwiGLU).

up_proj

Projects and splits input into gate and value parts.

act

The activation function.

down_proj

Projects back to embedding dimension.

dropout

Dropout layer.

  • Type: nn.Dropout

forward(x)

Forward pass of the feedforward network.

  • Parameters: x (torch.Tensor) – Input tensor of shape (batch, seq_len, embed_dim).
  • Returns: Output tensor of shape (batch, seq_len, embed_dim).
  • Return type: torch.Tensor
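The documented structure can be sketched as follows (MiniSwiGLUFFN is a hypothetical stand-in following the gate/value split described above, not the olm implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the SwiGLU FFN: one up-projection producing both halves,
# a gated SiLU activation, then a down-projection and dropout.
class MiniSwiGLUFFN(nn.Module):
    def __init__(self, embed_dim, hidden_dim=None, dropout=0.0,
                 bias=True, ff_multiplier=2.5):
        super().__init__()
        if hidden_dim is None:
            hidden_dim = int(ff_multiplier * embed_dim)
        self.up_proj = nn.Linear(embed_dim, 2 * hidden_dim, bias=bias)
        self.down_proj = nn.Linear(hidden_dim, embed_dim, bias=bias)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        gate, value = self.up_proj(x).chunk(2, dim=-1)  # split halves
        return self.dropout(self.down_proj(gate * F.silu(value)))

ffn = MiniSwiGLUFFN(embed_dim=16)
out = ffn(torch.randn(2, 4, 16))
```

With the default ff_multiplier of 2.5, an embed_dim of 16 gives a hidden dimension of 40, so the up-projection outputs 80 features before the split.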

class olm.nn.blocks.transformer_block.TransformerBlock(*args: Any, **kwargs: Any)

Bases: Block

A single Transformer block containing Multi-Head Attention and a FeedForward Network.

This block implements the standard Transformer architecture with pre-normalization, Rotary Positional Embeddings (RoPE), and SwiGLU activation in the feedforward layer. It supports causal masking for autoregressive modeling.

Structure: Input -> LayerNorm -> MHA(RoPE) -> Residual -> LayerNorm -> SwiGLU FFN -> Residual -> Output

  • Parameters:
  • embed_dim (int) – The dimension of the embedding space (d_model).
  • num_heads (int) – Number of attention heads. Must satisfy embed_dim % num_heads == 0.
  • max_seq_len (int) – Maximum sequence length supported by the model (for RoPE).
  • dropout (float, optional) – Dropout probability for attention and FFN. Defaults to 0.0.
  • causal (bool, optional) – Whether to apply causal masking in attention. Defaults to False.
  • ff_multiplier (float, optional) – Multiplier for the hidden dimension of the FFN. Commonly 4.0 (standard) or 8/3 (SwiGLU). Defaults to 2.5.

layers

The sequential list of layers within the block.

  • Type: nn.ModuleList
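The pre-norm wiring described above can be sketched with stock PyTorch modules standing in for olm's RoPE attention and SwiGLU FFN (MiniTransformerBlock is a hypothetical illustration, not the olm implementation, and omits RoPE):

```python
import torch
import torch.nn as nn

# Sketch of the pre-norm Transformer block: normalize before each
# sublayer, add the residual after it.
class MiniTransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.0):
        super().__init__()
        assert embed_dim % num_heads == 0, "heads must divide embed_dim"
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(          # plain FFN in place of SwiGLU
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.SiLU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, x):
        h = self.norm1(x)                                  # pre-norm
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual 1
        x = x + self.ffn(self.norm2(x))                    # residual 2
        return x

blk = MiniTransformerBlock(embed_dim=16, num_heads=4)
out = blk(torch.randn(2, 5, 16))
```

Pre-normalization (norm inside the residual branch, before the sublayer) is what keeps the residual stream an unnormalized sum, which tends to stabilize training of deep stacks.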