olm.nn.feedforward

class olm.nn.feedforward.ClassicFFN(*args: Any, **kwargs: Any)

Bases: FeedForwardBase

Standard Multi-Layer Perceptron (MLP) used in Transformer blocks.

Implements a position-wise feed-forward network consisting of two linear transformations with a non-linear activation function in between.

Structure: Input -> Linear(embed_dim -> hidden_dim) -> Activation -> Dropout -> Linear(hidden_dim -> embed_dim) -> Dropout

hidden_dim

Dimension of the inner hidden layer.

  • Type: int

up_proj

Projection from embedding dim to hidden dim.

act

Activation function.

  • Type: nn.Module

down_proj

Projection from hidden dim to embedding dim.

dropout

Dropout layer.

  • Type: nn.Dropout

forward(x)

Forward pass of the feedforward network.

  • Parameters: x (torch.Tensor) – Input tensor of shape (batch, seq_len, embed_dim).
  • Returns: Output tensor of shape (batch, seq_len, embed_dim).
  • Return type: torch.Tensor
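The documented structure can be sketched as a plain `nn.Module`. This is an illustrative re-implementation, not the `olm` source: the attribute names (`up_proj`, `act`, `down_proj`, `dropout`) follow the docs above, while the choice of `nn.GELU` as the activation and `nn.Linear` for the projections are assumptions.

```python
import torch
import torch.nn as nn

class ClassicFFNSketch(nn.Module):
    """Sketch of the documented ClassicFFN structure (not the olm implementation)."""

    def __init__(self, embed_dim: int, hidden_dim: int, dropout: float = 0.0):
        super().__init__()
        self.up_proj = nn.Linear(embed_dim, hidden_dim)    # embed_dim -> hidden_dim
        self.act = nn.GELU()                               # assumed activation
        self.down_proj = nn.Linear(hidden_dim, embed_dim)  # hidden_dim -> embed_dim
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Linear -> Activation -> Dropout -> Linear -> Dropout, per the docs
        return self.dropout(self.down_proj(self.dropout(self.act(self.up_proj(x)))))

ffn = ClassicFFNSketch(embed_dim=64, hidden_dim=256)
out = ffn(torch.randn(2, 10, 64))  # (batch, seq_len, embed_dim) in and out
```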

class olm.nn.feedforward.FeedForwardBase(*args: Any, **kwargs: Any)

Bases: Module, ABC

Abstract base class for feedforward networks in a transformer block.

Defines the interface for FFNs/MLPs. Subclasses must implement the forward method.

embed_dim

The input and output dimension.

  • Type: int

abstractmethod forward(x: torch.Tensor) → torch.Tensor

Forward pass of the feedforward network.

  • Parameters: x (torch.Tensor) – Input tensor of shape (batch, seq_len, embed_dim).
  • Returns: Output tensor of shape (batch, seq_len, embed_dim).
  • Return type: torch.Tensor
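A minimal sketch of how such an abstract base works, assuming the standard `nn.Module` + `ABC` pattern the docs describe (class names here are hypothetical, not the `olm` source): subclasses must implement `forward` and preserve the `(batch, seq_len, embed_dim)` shape.

```python
from abc import ABC, abstractmethod

import torch
import torch.nn as nn

class FeedForwardBaseSketch(nn.Module, ABC):
    """Sketch of the documented interface; instantiating it directly fails."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.embed_dim = embed_dim  # the input and output dimension

    @abstractmethod
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ...

class IdentityFFN(FeedForwardBaseSketch):
    """Trivial subclass just to exercise the interface."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x  # shape (batch, seq_len, embed_dim) preserved

ffn = IdentityFFN(embed_dim=32)
y = ffn(torch.randn(2, 5, 32))
```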

class olm.nn.feedforward.GeGLUFFN(*args: Any, **kwargs: Any)

Bases: FeedForwardBase

Feed-Forward Network using GeGLU activation.

Implements: x = DownProj(GeGLU(UpProj(x))). UpProj expands to 2 * hidden_dim to support splitting for the gate.

  • Parameters:
  • embed_dim (int) – Input dimension.
  • hidden_dim (int , optional) – Hidden dimension. Defaults to 4 * embed_dim if None.
  • dropout (float , optional) – Dropout probability. Defaults to 0.0.
  • bias (bool , optional) – Whether to use bias in linear layers. Defaults to True.
  • ff_multiplier (float , optional) – Expansion factor if hidden_dim is None. Defaults to 4.0.

forward(x)

Forward pass of the feedforward network.

  • Parameters: x (torch.Tensor) – Input tensor of shape (batch, seq_len, embed_dim).
  • Returns: Output tensor of shape (batch, seq_len, embed_dim).
  • Return type: torch.Tensor
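The `x = DownProj(GeGLU(UpProj(x)))` formula above can be sketched as follows. The chunk-and-gate split and the use of `F.gelu` are assumptions about how the documented `2 * hidden_dim` projection is consumed; the class is illustrative, not the `olm` implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFFNSketch(nn.Module):
    """Sketch of the documented GeGLU FFN (not the olm implementation)."""

    def __init__(self, embed_dim: int, hidden_dim: int = None,
                 dropout: float = 0.0, bias: bool = True,
                 ff_multiplier: float = 4.0):
        super().__init__()
        if hidden_dim is None:
            hidden_dim = int(ff_multiplier * embed_dim)  # documented default: 4 * embed_dim
        # UpProj expands to 2 * hidden_dim to support splitting for the gate
        self.up_proj = nn.Linear(embed_dim, 2 * hidden_dim, bias=bias)
        self.down_proj = nn.Linear(hidden_dim, embed_dim, bias=bias)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.up_proj(x).chunk(2, dim=-1)  # split halves (assumed order)
        return self.dropout(self.down_proj(value * F.gelu(gate)))  # GeGLU gating

ffn = GeGLUFFNSketch(embed_dim=48)
out = ffn(torch.randn(2, 7, 48))
```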

class olm.nn.feedforward.SwiGLUFFN(*args: Any, **kwargs: Any)

Bases: FeedForwardBase

SwiGLU-based feed-forward network used in modern Transformers (e.g., LLaMA, PaLM).

This layer implements the gated linear unit with Swish (SiLU) activation, which has been shown to improve performance over standard GELU/ReLU FFNs.

Structure: Input -> Linear(embed_dim -> 2 * hidden_dim) [splits into Gate and Value] -> Gate * SiLU(Value) -> Linear(hidden_dim -> embed_dim) -> Dropout

  • Parameters:
  • embed_dim (int) – The dimension of the input and output.
  • hidden_dim (int , optional) – The intermediate inner dimension. If None, defaults to int(ff_multiplier * embed_dim).
  • dropout (float , optional) – Dropout probability. Defaults to 0.0.
  • bias (bool , optional) – Whether to use bias in linear layers. Defaults to True.
  • ff_multiplier (float , optional) – Multiplier for default hidden dimension. Defaults to 2.5 (commonly 8/3 for SwiGLU).

up_proj

Projects and splits input into gate and value parts.

act

The activation function.

down_proj

Projects back to embedding dimension.

dropout

Dropout layer.

  • Type: nn.Dropout

forward(x)

Forward pass of the feedforward network.

  • Parameters: x (torch.Tensor) – Input tensor of shape (batch, seq_len, embed_dim).
  • Returns: Output tensor of shape (batch, seq_len, embed_dim).
  • Return type: torch.Tensor
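Putting the documented structure together, a sketch under the stated assumptions (the split order and `F.silu` usage follow the `Gate * SiLU(Value)` formula above; the class itself is illustrative, not the `olm` source):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFNSketch(nn.Module):
    """Sketch of the documented SwiGLU FFN (not the olm implementation)."""

    def __init__(self, embed_dim: int, hidden_dim: int = None,
                 dropout: float = 0.0, bias: bool = True,
                 ff_multiplier: float = 2.5):
        super().__init__()
        if hidden_dim is None:
            hidden_dim = int(ff_multiplier * embed_dim)  # documented default
        # single projection produces both gate and value halves
        self.up_proj = nn.Linear(embed_dim, 2 * hidden_dim, bias=bias)
        self.down_proj = nn.Linear(hidden_dim, embed_dim, bias=bias)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.up_proj(x).chunk(2, dim=-1)  # split 2*hidden_dim in half
        return self.dropout(self.down_proj(gate * F.silu(value)))  # Gate * SiLU(Value)

ffn = SwiGLUFFNSketch(embed_dim=64)
out = ffn(torch.randn(3, 8, 64))
```

With `ff_multiplier=2.5` and `embed_dim=64`, the inner dimension is `int(2.5 * 64) = 160`; the 8/3 multiplier mentioned in the docs keeps a SwiGLU layer's parameter count close to a classic 4x MLP, since the up-projection is doubled.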

Modules

base
classic_ffn
geglu_ffn
swiglu_ffn