Overview
ModelArgs is a dataclass that defines all configuration parameters for the Llama 2 Transformer model. It controls model dimensions, layer counts, attention heads, and the batch-size and sequence-length limits used at inference.
Definition
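The definition listing is missing from this page. As a sketch reconstructed from the parameter descriptions in the Parameters section, the dataclass would look roughly like the following. The field order and the default values are assumptions based on the Llama 2 7B figures in the Notes and on the reference implementation, and should be checked against model.py.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArgs:
    # Defaults below are assumptions; verify them against model.py.
    dim: int = 4096                             # width of hidden states and embeddings
    n_layers: int = 32                          # number of Transformer layers
    n_heads: int = 32                           # query heads
    n_kv_heads: Optional[int] = None            # None means "same as n_heads" (no GQA)
    vocab_size: int = -1                        # placeholder, set from the tokenizer later
    multiple_of: int = 256                      # FFN hidden size is rounded up to this
    ffn_dim_multiplier: Optional[float] = None  # optional scale on the FFN hidden size
    norm_eps: float = 1e-5                      # RMSNorm epsilon
    max_batch_size: int = 32                    # sizes the KV cache
    max_seq_len: int = 2048                     # sizes the KV cache and rotary table


# Override only the fields that differ, for example to enable GQA.
args = ModelArgs(n_kv_heads=8)
print(args)
```

Because it is a plain dataclass, any field can be overridden by keyword when the object is constructed, and the remaining fields keep their defaults.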
Parameters
dim: Model dimension size. This is the primary dimensionality of the model’s hidden states and embeddings.
n_layers: Number of Transformer layers in the model.
n_heads: Number of attention heads for queries in the multi-head attention mechanism.
n_kv_heads: Number of key and value heads for Grouped-Query Attention (GQA). If None, defaults to n_heads (standard multi-head attention). When set to a value less than n_heads, enables GQA, which shrinks the KV cache and reduces memory traffic.
vocab_size: Vocabulary size. Typically left at the placeholder -1 and overwritten with the tokenizer's vocabulary size before the model is built.
multiple_of: Ensures the SwiGLU feedforward hidden layer size is rounded up to a multiple of this value (a large power of 2). This improves computational efficiency on modern hardware.
ffn_dim_multiplier: Optional multiplier for the feedforward network hidden dimension. When specified, scales the computed hidden dimension by this factor before it is rounded up to a multiple of multiple_of.
norm_eps: Epsilon value for RMSNorm, added for numerical stability.
max_batch_size: Maximum batch size for inference. Used to pre-allocate KV cache tensors in the Attention module.
max_seq_len: Maximum sequence length for inference. Used to pre-allocate KV cache tensors and precompute rotary position embeddings.
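The interaction between dim, ffn_dim_multiplier, and multiple_of is easiest to see as arithmetic. The sketch below assumes the feedforward layer starts from 4 * dim and applies the usual SwiGLU 2/3 factor, as the reference implementation appears to do; the function name ffn_hidden_dim is hypothetical and does not exist in model.py.

```python
from typing import Optional


def ffn_hidden_dim(dim: int, multiple_of: int,
                   ffn_dim_multiplier: Optional[float] = None) -> int:
    """Hypothetical helper: size of the SwiGLU hidden layer."""
    hidden_dim = 4 * dim                  # assumed starting point
    hidden_dim = int(2 * hidden_dim / 3)  # assumed SwiGLU 2/3 factor
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # Round up to the next multiple of `multiple_of`.
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)


print(ffn_hidden_dim(dim=4096, multiple_of=256))  # 11008
```

With dim=4096 and multiple_of=256 the unrounded value is 10922, which is rounded up to 11008 (43 * 256).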
Usage in Transformer
The ModelArgs configuration is used to initialize the Transformer model and all its components:
- Attention: Uses dim, n_heads, n_kv_heads, max_batch_size, and max_seq_len (see model.py:178-251)
- FeedForward: Uses dim, multiple_of, and ffn_dim_multiplier (see model.py:376-381)
- TransformerBlock: Uses dim, n_heads, and norm_eps (see model.py:352-384)
- Transformer: Uses all parameters for layer construction and initialization (see model.py:414-454)
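To make the Attention entry concrete, the following sketch derives the per-head dimension, the GQA repetition factor, and the KV cache shape from the five parameters it uses. It is an illustration under simplifying assumptions: model parallelism is ignored (a single device holds all heads), the cache layout (batch, sequence, KV heads, head dimension) is assumed, and the names AttentionShapes and attention_shapes are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AttentionShapes:  # hypothetical container, not part of model.py
    head_dim: int
    n_kv_heads: int
    n_rep: int  # how many query heads share each KV head
    kv_cache_shape: Tuple[int, int, int, int]


def attention_shapes(dim: int, n_heads: int, n_kv_heads: Optional[int],
                     max_batch_size: int, max_seq_len: int) -> AttentionShapes:
    # n_kv_heads=None falls back to standard multi-head attention.
    n_kv = n_heads if n_kv_heads is None else n_kv_heads
    if dim % n_heads != 0 or n_heads % n_kv != 0:
        raise ValueError("dim must divide by n_heads, and n_heads by n_kv_heads")
    head_dim = dim // n_heads
    # Assumed layout of each pre-allocated cache (one for keys, one for values).
    cache_shape = (max_batch_size, max_seq_len, n_kv, head_dim)
    return AttentionShapes(head_dim, n_kv, n_heads // n_kv, cache_shape)


shapes = attention_shapes(dim=4096, n_heads=32, n_kv_heads=8,
                          max_batch_size=32, max_seq_len=2048)
print(shapes.head_dim, shapes.n_rep, shapes.kv_cache_shape)
# 128 4 (32, 2048, 8, 128)
```

With 8 KV heads instead of 32, each cache holds a quarter of the entries it would need under standard multi-head attention.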
Notes
- The Llama 2 7B model uses 32 layers, 32 attention heads, and a dimension of 4096
- For Grouped-Query Attention, set n_kv_heads to a value less than n_heads (e.g., 8 KV heads with 32 query heads)
- max_seq_len is multiplied by 2 when precomputing frequencies to support the 4096 token limit dynamically
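The doubling of max_seq_len can be sketched in pure Python. The reference implementation builds the rotary table with PyTorch; the version below uses the standard library only so it runs anywhere, and it assumes the conventional rotary base theta = 10000.0 and a max_seq_len of 2048. Treat it as an illustration of the table's shape rather than the actual code.

```python
import cmath
from typing import List


def precompute_freqs_cis(head_dim: int, end: int,
                         theta: float = 10000.0) -> List[List[complex]]:
    """Pure-Python sketch: unit complex numbers e^(i * t * f) for each
    position t < end and each of the head_dim // 2 rotary frequencies f."""
    freqs = [1.0 / (theta ** (i / head_dim)) for i in range(0, head_dim, 2)]
    return [[cmath.rect(1.0, t * f) for f in freqs] for t in range(end)]


dim, n_heads, max_seq_len = 4096, 32, 2048  # assumed 7B-style values
# The table is built for twice max_seq_len, so it covers 4096 positions.
table = precompute_freqs_cis(dim // n_heads, max_seq_len * 2)
print(len(table), len(table[0]))  # 4096 64
```

Sizing the table from max_seq_len * 2 rather than a hard-coded 4096 keeps it proportional to whatever max_seq_len is configured.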