olm.nn.feedforward.swiglu_ffn

Classes

SwiGLUFFN(*args, **kwargs) SwiGLU-based feed-forward network used in modern Transformers (e.g., LLaMA, PaLM).

class olm.nn.feedforward.swiglu_ffn.FeedForwardBase(*args: Any, **kwargs: Any)

Bases: Module, ABC

Abstract base class for feedforward networks in a transformer block.

Defines the interface for FFNs/MLPs. Subclasses must implement the forward method.

embed_dim

The input and output dimension.

  • Type: int

abstractmethod forward(x: torch.Tensor) → torch.Tensor

Forward pass of the feedforward network.

  • Parameters: x (torch.Tensor) – Input tensor of shape (batch, seq_len, embed_dim).
  • Returns: Output tensor of shape (batch, seq_len, embed_dim).
  • Return type: torch.Tensor
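The interface above can be illustrated with a hypothetical subclass. This sketch uses plain torch.nn rather than the library's own Module base, and the class name and layer choices are illustrative, not part of the package:

```python
# Hypothetical FeedForwardBase-style subclass, sketched with plain torch.nn.
# The library's actual Module base and constructor signature may differ.
import torch
import torch.nn as nn


class TinyFFN(nn.Module):
    """Minimal FFN honoring the interface: embed_dim in, embed_dim out."""

    def __init__(self, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embed_dim = embed_dim  # input and output dimension
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, embed_dim) -> (batch, seq_len, embed_dim)
        return self.net(x)
```

The key contract is shape preservation: the last dimension of the output matches embed_dim, so the block can be dropped into a residual stream.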

class olm.nn.feedforward.swiglu_ffn.Linear(*args: Any, **kwargs: Any)

Bases: Linear

forward(x)

class olm.nn.feedforward.swiglu_ffn.SwiGLU(*args: Any, **kwargs: Any)

Bases: ActivationBase

SwiGLU activation function.

Implements the SwiGLU activation as described in “GLU Variants Improve Transformer”. It applies the SiLU activation to one half of the input (the gate) and multiplies it by the other half (the value).

Equation:

SwiGLU(x, W, V) = Swish_1(xW) * (xV)

Here, the input x is assumed to be already projected and concatenated, so the activation simply chunks it:

SwiGLU(x) = x_1 * SiLU(x_2), where x = [x_1, x_2]

  • Parameters:
  • device (torch.device, optional) – Target device.
  • dtype (torch.dtype, optional) – Target data type.

forward(x: torch.Tensor) → torch.Tensor

Forward pass of SwiGLU.

  • Parameters: x (torch.Tensor) – Input tensor. Expected to have an even last dimension size.
  • Returns: Output tensor with half the last dimension of the input.
  • Return type: torch.Tensor
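The chunk-and-gate behavior described above can be sketched functionally. This is a minimal illustration of the documented semantics, not the library's implementation; it assumes only that the last dimension is even:

```python
# Functional sketch of the SwiGLU activation: split the last dimension
# into two halves x_1, x_2 and return x_1 * SiLU(x_2).
import torch
import torch.nn.functional as F


def swiglu(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)  # requires an even last dimension
    return x1 * F.silu(x2)       # gate one half with the SiLU of the other
```

As documented, the output's last dimension is half that of the input: a (batch, seq_len, 2 * hidden_dim) input yields (batch, seq_len, hidden_dim).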

class olm.nn.feedforward.swiglu_ffn.SwiGLUFFN(*args: Any, **kwargs: Any)

Bases: FeedForwardBase

SwiGLU-based feed-forward network used in modern Transformers (e.g., LLaMA, PaLM).

This layer implements the gated linear unit with Swish (SiLU) activation, which has been shown to improve performance over standard GELU/ReLU FFNs.

Structure:

Input
-> Linear(embed_dim -> 2 * hidden_dim) [splits into Gate and Value]
-> SwiGLU (Gate * SiLU(Value))
-> Linear(hidden_dim -> embed_dim)
-> Dropout

  • Parameters:
  • embed_dim (int) – The dimension of the input and output.
  • hidden_dim (int, optional) – The intermediate inner dimension. If None, defaults to int(ff_multiplier * embed_dim).
  • dropout (float, optional) – Dropout probability. Defaults to 0.0.
  • bias (bool, optional) – Whether to use bias in linear layers. Defaults to True.
  • ff_multiplier (float, optional) – Multiplier for the default hidden dimension. Defaults to 2.5; 8/3 (≈ 2.67) is the value commonly used for SwiGLU.
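The structure and parameters above can be sketched as a plain-PyTorch module. The attribute names (up_proj, down_proj, dropout) mirror those listed below, but this is an illustration of the documented design, not the package's own code:

```python
# Plain-PyTorch sketch of the SwiGLUFFN structure: one fused up-projection
# producing gate and value halves, SwiGLU gating, a down-projection, dropout.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFNSketch(nn.Module):
    def __init__(self, embed_dim: int, hidden_dim: Optional[int] = None,
                 dropout: float = 0.0, bias: bool = True,
                 ff_multiplier: float = 2.5):
        super().__init__()
        if hidden_dim is None:
            hidden_dim = int(ff_multiplier * embed_dim)
        # Single projection whose output is later chunked into gate and value.
        self.up_proj = nn.Linear(embed_dim, 2 * hidden_dim, bias=bias)
        self.down_proj = nn.Linear(hidden_dim, embed_dim, bias=bias)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = self.up_proj(x).chunk(2, dim=-1)
        return self.dropout(self.down_proj(x1 * F.silu(x2)))
```

Fusing the gate and value projections into a single Linear(embed_dim -> 2 * hidden_dim) is a common efficiency choice; it is equivalent to two separate projections whose outputs are concatenated.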

up_proj

Projects and splits input into gate and value parts.

act

The activation function.

down_proj

Projects back to embedding dimension.

dropout

Dropout layer.

  • Type: nn.Dropout

forward(x)

Forward pass of the feedforward network.

  • Parameters: x (torch.Tensor) – Input tensor of shape (batch, seq_len, embed_dim).
  • Returns: Output tensor of shape (batch, seq_len, embed_dim).
  • Return type: torch.Tensor