ModelArgs

Overview

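ModelArgs is a dataclass that defines the configuration parameters for the Llama 2 Transformer model. It controls the model dimension, layer count, attention head counts, feedforward sizing, normalization epsilon, and the inference limits (maximum batch size and sequence length) used to size pre-allocated buffers.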

Definition

from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelArgs:
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: Optional[int] = None
    vocab_size: int = -1
    multiple_of: int = 256
    ffn_dim_multiplier: Optional[float] = None
    norm_eps: float = 1e-5
    max_batch_size: int = 32
    max_seq_len: int = 2048

Parameters

dim

int

Default: 4096

Model dimension size. This is the primary dimensionality of the model’s hidden states and embeddings.

n_layers

int

Default: 32

Number of Transformer layers in the model.

n_heads

int

Default: 32

Number of attention heads for queries in the multi-head attention mechanism.

n_kv_heads

Optional[int]

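Default: None

Number of key and value heads for Grouped-Query Attention (GQA). If None, defaults to n_heads (standard multi-head attention). When set to a value less than n_heads, enables GQA: each key/value head is shared by n_heads / n_kv_heads query heads, so n_heads must be divisible by n_kv_heads. Fewer key/value heads shrink the KV cache and reduce memory traffic during inference.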
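The relationship between dim, n_heads, and n_kv_heads can be sketched as below. This is a minimal illustration for a single-device setup, not the library's code: the helper name attention_head_layout and its explicit divisibility checks are assumptions added here for clarity.

```python
from typing import Optional, Tuple


def attention_head_layout(
    dim: int, n_heads: int, n_kv_heads: Optional[int] = None
) -> Tuple[int, int, int]:
    """Return (head_dim, effective_kv_heads, n_rep). Hypothetical helper."""
    # None means standard multi-head attention: one KV head per query head.
    kv_heads = n_heads if n_kv_heads is None else n_kv_heads
    if dim % n_heads != 0:
        raise ValueError("dim must be divisible by n_heads")
    if n_heads % kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    head_dim = dim // n_heads   # size of each attention head
    n_rep = n_heads // kv_heads  # query heads sharing one KV head
    return head_dim, kv_heads, n_rep


print(attention_head_layout(4096, 32))     # (128, 32, 1)
print(attention_head_layout(4096, 32, 8))  # (128, 8, 4)
```

With n_kv_heads=8 and n_heads=32, each key/value head serves four query heads.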

vocab_size

int

Default: -1

Vocabulary size. The default of -1 is a placeholder; the value is overwritten with the tokenizer's vocabulary size before the Transformer is constructed.

multiple_of

int

Default: 256

Rounds the SwiGLU feedforward hidden layer size up to the nearest multiple of this value (a large power of 2). Aligned layer sizes improve computational efficiency on modern hardware.

ffn_dim_multiplier

Optional[float]

Default: None

Optional multiplier for the feedforward network hidden dimension. When specified, the base hidden size (two thirds of 4 * dim) is scaled by this factor before being rounded up to a multiple of multiple_of.
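How dim, multiple_of, and ffn_dim_multiplier combine into the feedforward hidden size can be sketched as follows. The function name ffn_hidden_dim is a stand-in chosen for this example; the arithmetic is intended to mirror the sizing described above, and the second call uses arbitrary illustrative values.

```python
from typing import Optional


def ffn_hidden_dim(
    dim: int,
    multiple_of: int = 256,
    ffn_dim_multiplier: Optional[float] = None,
) -> int:
    """Compute the SwiGLU hidden size. Hypothetical helper for illustration."""
    # Start from 4 * dim and take two thirds of it (SwiGLU uses three matrices).
    hidden_dim = int(2 * (4 * dim) / 3)
    # Apply the optional custom multiplier.
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # Round up to the nearest multiple of multiple_of.
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)


print(ffn_hidden_dim(4096))                                              # 11008
print(ffn_hidden_dim(8192, multiple_of=4096, ffn_dim_multiplier=1.3))    # 28672
```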

norm_eps

float

Default: 1e-5

Epsilon used by RMSNorm. It is added to the mean of squares before the reciprocal square root is taken, which prevents division by zero and improves numerical stability.
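The role of norm_eps is easiest to see in a small RMSNorm sketch. The version below uses plain Python lists rather than tensors so that it runs without dependencies; it is an illustration of the formula, not the model's module, and the function name rms_norm is an assumption.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-5) -> List[float]:
    """Scale x by the reciprocal of its root mean square, then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    # eps keeps the denominator non-zero when x is all zeros.
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]


# An all-zero input stays finite because of eps.
print(rms_norm([0.0, 0.0, 0.0, 0.0], [1.0] * 4))  # [0.0, 0.0, 0.0, 0.0]
```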

max_batch_size

int

Default: 32

Maximum batch size for inference. Used to pre-allocate KV cache tensors in the Attention module.

max_seq_len

int

Default: 2048

Maximum sequence length for inference. Used to pre-allocate KV cache tensors and precompute rotary position embeddings.
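Because max_batch_size and max_seq_len size the pre-allocated KV cache, they directly determine its memory footprint. The estimate below is a rough sketch under stated assumptions: one key tensor and one value tensor per layer, each of shape (max_batch_size, max_seq_len, n_kv_heads, head_dim), 2 bytes per element (16-bit floats), and a single device with no model parallelism. The helper name kv_cache_bytes is hypothetical.

```python
def kv_cache_bytes(
    max_batch_size: int,
    max_seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,  # assumption: 16-bit floats
) -> int:
    """Estimate total KV cache size in bytes. Hypothetical helper."""
    per_tensor = max_batch_size * max_seq_len * n_kv_heads * head_dim
    # Two tensors (keys and values) per layer.
    return 2 * n_layers * per_tensor * bytes_per_element


gib = 2 ** 30
print(kv_cache_bytes(4, 2048, 32, 8, 128) / gib)   # 1.0
print(kv_cache_bytes(4, 2048, 32, 32, 128) / gib)  # 4.0
```

Under these assumptions, dropping from 32 to 8 key/value heads cuts the cache to a quarter of its size, and lowering max_batch_size or max_seq_len shrinks it proportionally.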

Usage in Transformer

The ModelArgs configuration is used to initialize the Transformer model and all its components:

# Initialize model configuration
model_args = ModelArgs(
    dim=4096,
    n_layers=32,
    n_heads=32,
    n_kv_heads=8,  # GQA with 8 KV heads
    vocab_size=32000,
    max_batch_size=4,
    max_seq_len=2048
)

# Create Transformer model
transformer = Transformer(model_args)

The configuration is passed to:

Attention: Uses dim, n_heads, n_kv_heads, max_batch_size, and max_seq_len (see model.py:178-251)
FeedForward: Uses dim, multiple_of, and ffn_dim_multiplier (see model.py:376-381)
TransformerBlock: Uses dim, n_heads, and norm_eps (see model.py:352-384)
Transformer: Uses all parameters for layer construction and initialization (see model.py:414-454)

Notes

The Llama 2 7B model uses 32 layers, 32 attention heads, and a dimension of 4096
For Grouped-Query Attention, set n_kv_heads to a value less than n_heads (e.g., 8 KV heads with 32 query heads)
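max_seq_len is multiplied by 2 when precomputing the rotary embedding frequencies, so with the default of 2048 the table covers 4096 positions. This matches the Llama 2 token limit without hard-coding it, and the table scales with max_seq_len if that value changes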
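The rotary frequency table mentioned in the last note can be sketched in plain Python. This is a dependency-free illustration rather than the model's tensor code: the base theta of 10000.0 and the small sizes are assumptions chosen to keep the example quick, and the function returns nested lists of complex numbers.

```python
import cmath
from typing import List


def precompute_freqs_cis(
    head_dim: int, end: int, theta: float = 10000.0
) -> List[List[complex]]:
    """Build an end x (head_dim // 2) table of unit complex rotations."""
    # One frequency per pair of channels in a head.
    freqs = [1.0 / (theta ** (i / head_dim)) for i in range(0, head_dim, 2)]
    # Entry [pos][k] is exp(i * pos * freqs[k]).
    return [[cmath.rect(1.0, pos * f) for f in freqs] for pos in range(end)]


max_seq_len = 16  # small value for illustration only
table = precompute_freqs_cis(head_dim=8, end=max_seq_len * 2)
print(len(table), len(table[0]))  # 32 4
```

The table has twice as many rows as max_seq_len, which is the doubling described in the note.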
