Overview
The Transformer class implements the core neural network architecture for Llama 2 models. It consists of token embeddings, a stack of transformer blocks (each with attention and feedforward layers), a final normalization layer, and an output projection layer.
Class Definition
Methods
__init__
params (ModelArgs): Model configuration parameters containing:
- dim (int): Model dimension (default: 4096)
- n_layers (int): Number of transformer layers (default: 32)
- n_heads (int): Number of attention heads (default: 32)
- n_kv_heads (Optional[int]): Number of key-value heads for grouped-query attention
- vocab_size (int): Size of the vocabulary
- multiple_of (int): Round the SwiGLU hidden layer size up to a multiple of this value (default: 256)
- ffn_dim_multiplier (Optional[float]): Multiplier for the feedforward dimension
- norm_eps (float): Epsilon for RMSNorm (default: 1e-5)
- max_batch_size (int): Maximum batch size (default: 32)
- max_seq_len (int): Maximum sequence length (default: 2048)
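To make the configuration concrete, the following is a minimal sketch of a ModelArgs-style dataclass using the defaults listed above. The defaults shown for n_kv_heads, vocab_size, and ffn_dim_multiplier are assumptions, since this page does not state them. The helper derived_sizes is hypothetical, and it assumes the feedforward width starts at 4 * dim, is scaled by 2/3 for SwiGLU, and is then rounded up to a multiple of multiple_of.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArgs:
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: Optional[int] = None            # assumed default: fall back to n_heads
    vocab_size: int = -1                        # assumed placeholder, set from the tokenizer
    multiple_of: int = 256
    ffn_dim_multiplier: Optional[float] = None  # assumed default
    norm_eps: float = 1e-5
    max_batch_size: int = 32
    max_seq_len: int = 2048


def derived_sizes(args: ModelArgs) -> dict:
    """Hypothetical helper: sizes that follow from the configuration."""
    n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
    # Assumed rule: start from 4 * dim, then scale by 2/3 for SwiGLU.
    hidden = int(2 * (4 * args.dim) / 3)
    if args.ffn_dim_multiplier is not None:
        hidden = int(args.ffn_dim_multiplier * hidden)
    # Round up to the nearest multiple of `multiple_of`.
    hidden = args.multiple_of * ((hidden + args.multiple_of - 1) // args.multiple_of)
    return {
        "head_dim": args.dim // args.n_heads,
        "n_kv_heads": n_kv_heads,
        "queries_per_kv_head": args.n_heads // n_kv_heads,
        "ffn_hidden_dim": hidden,
    }


sizes = derived_sizes(ModelArgs())
print(sizes)
```

Under these assumptions the default configuration gives a head dimension of 128 and a feedforward hidden size of 11008. Setting n_kv_heads below n_heads makes several query heads share each key-value head, which is the grouped-query attention case.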
Attributes:
- params (ModelArgs): Model configuration parameters.
- vocab_size (int): Vocabulary size.
- n_layers (int): Number of layers in the model.
- tok_embeddings (ParallelEmbedding): Token embeddings.
- layers (torch.nn.ModuleList): List of Transformer blocks.
- norm (RMSNorm): Layer normalization for the model output.
- output (ColumnParallelLinear): Linear layer for final output.
- freqs_cis (torch.Tensor): Precomputed cosine and sine frequencies for rotary positional embeddings.
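As an illustration of what a table like freqs_cis holds, here is a pure-Python sketch that stores each (cosine, sine) pair as one unit complex number and uses it to rotate consecutive pairs of a head vector. The base theta = 10000.0, the table being sized per attention head, and both function names are assumptions made for this example; the model itself stores the table as a tensor.

```python
import cmath


def precompute_freqs_cis(head_dim: int, end: int, theta: float = 10000.0):
    """Return an end x (head_dim // 2) table of unit complex numbers.

    Entry [t][k] is cos(t * f_k) + i * sin(t * f_k), with
    f_k = theta ** (-2k / head_dim).  theta is an assumed default.
    """
    freqs = [1.0 / (theta ** (j / head_dim)) for j in range(0, head_dim, 2)]
    return [[cmath.rect(1.0, t * f) for f in freqs] for t in range(end)]


def apply_rotary(vec, freqs_row):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... of one head vector."""
    out = []
    for k, rot in enumerate(freqs_row):
        z = complex(vec[2 * k], vec[2 * k + 1]) * rot
        out.extend([z.real, z.imag])
    return out


table = precompute_freqs_cis(head_dim=8, end=16)
rotated = apply_rotary([1.0, 0.0] * 4, table[3])  # position 3
```

Because every table entry has magnitude 1, the rotation changes only the direction of each pair, not its length. Position 0 leaves the vector unchanged.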
forward
tokens (torch.Tensor): Input token indices. Shape: (batch_size, sequence_length)
start_pos (int): Starting position for attention caching. Used for efficient generation by caching key-value pairs from previous tokens.
Returns (torch.Tensor): Output logits after applying the Transformer model. Shape: (batch_size, sequence_length, vocab_size)
The method is decorated with @torch.inference_mode() for optimized inference. It processes input tokens through:
- Token embeddings
- Multiple transformer blocks (attention + feedforward)
- Final layer normalization
- Output projection to vocabulary size
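The four stages above can be traced with a toy, dependency-free sketch that tracks only the data flow and the resulting shapes. Everything here is a stand-in: toy_block replaces a real attention plus feedforward block and does not mix information across positions, the weights are random, and forward_sketch is a hypothetical name, not the model's method.

```python
import math
import random


def rms_norm(vec, eps=1e-5):
    """Scale a vector by the reciprocal of its root mean square."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def toy_block(hidden):
    """Stand-in for a transformer block: a residual update per position."""
    return [[x + 0.1 * y for x, y in zip(vec, rms_norm(vec))] for vec in hidden]


def forward_sketch(tokens, embedding, blocks, output_weight):
    """tokens: batch of token-id lists -> logits[batch][seq][vocab]."""
    logits = []
    for seq in tokens:
        hidden = [list(embedding[t]) for t in seq]   # 1. token embeddings
        for block in blocks:                         # 2. transformer blocks
            hidden = block(hidden)
        hidden = [rms_norm(vec) for vec in hidden]   # 3. final normalization
        logits.append([                              # 4. projection to vocabulary size
            [sum(w * x for w, x in zip(row, vec)) for row in output_weight]
            for vec in hidden
        ])
    return logits


random.seed(0)
vocab_size, dim = 11, 4
embedding = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
output_weight = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
logits = forward_sketch([[1, 2, 3], [4, 5, 6]], embedding, [toy_block] * 2, output_weight)
print(len(logits), len(logits[0]), len(logits[0][0]))  # 2 3 11
```

The printed shape matches (batch_size, sequence_length, vocab_size): two sequences, three positions each, and one logit per vocabulary entry.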
Architecture Details
The Transformer model implements the Llama 2 architecture with the following key components:
Token Embeddings
Converts input token IDs to dense vector representations using parallel embeddings for efficient distributed training.
Transformer Blocks
Each block contains:
- Multi-head Attention: Implements grouped-query attention with rotary positional embeddings (RoPE)
- Feedforward Network: Uses SwiGLU activation function
- RMSNorm: Root Mean Square Layer Normalization applied before attention and feedforward layers
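A minimal sketch of how these pieces might fit together for a single position, written in plain Python. It assumes the usual pre-norm residual wiring (normalize, apply the sublayer, add the result back), an RMSNorm with a learned per-channel gain, and a SwiGLU feedforward of the form w2(silu(w1 x) * w3 x). The weight names and the attention callable are placeholders for illustration; real attention also mixes information across positions.

```python
import math


def rms_norm(vec, weight, eps=1e-5):
    """RMSNorm: x / sqrt(mean(x^2) + eps), then a learned per-channel gain."""
    inv_rms = 1.0 / math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [w * x * inv_rms for w, x in zip(weight, vec)]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_ffn(vec, w1, w2, w3):
    """SwiGLU feedforward: w2 @ (silu(w1 @ x) * (w3 @ x))."""
    gate = [silu(g) for g in matvec(w1, vec)]
    up = matvec(w3, vec)
    return matvec(w2, [g * u for g, u in zip(gate, up)])


def block_sketch(vec, attention, attn_gain, ffn_gain, w1, w2, w3):
    """Pre-norm residual wiring for one position (attention is a stand-in)."""
    h = [x + a for x, a in zip(vec, attention(rms_norm(vec, attn_gain)))]
    f = swiglu_ffn(rms_norm(h, ffn_gain), w1, w2, w3)
    return [x + y for x, y in zip(h, f)]


dim, hidden_dim = 4, 6
ones = [1.0] * dim
w1 = [[0.1 * (i + j) for j in range(dim)] for i in range(hidden_dim)]
w3 = [[0.05 * (i - j) for j in range(dim)] for i in range(hidden_dim)]
w2 = [[0.02 * (i + 1) for _ in range(hidden_dim)] for i in range(dim)]
out = block_sketch([1.0, -2.0, 0.5, 3.0], lambda v: v, ones, ones, w1, w2, w3)
```

Normalizing before each sublayer keeps the residual stream itself unnormalized, so each sublayer only adds an update to it.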
Attention Caching
The model supports key-value caching for efficient autoregressive generation:
- First pass: Process entire prompt with start_pos=0
- Subsequent passes: Process one token at a time with start_pos indicating cache position
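The two-phase calling pattern can be sketched with a toy stand-in model. ToyCachedModel and greedy_generate are hypothetical names used only for this illustration. The fake forward method records how it was called and returns made-up one-hot logits, so the example shows the start_pos bookkeeping rather than real inference.

```python
class ToyCachedModel:
    """Hypothetical stand-in exposing forward(tokens, start_pos)."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.cache = []   # one entry per cached position
        self.calls = []   # (start_pos, number of new tokens), for inspection

    def forward(self, tokens, start_pos):
        self.calls.append((start_pos, len(tokens)))
        # Write the new positions into the cache from start_pos onward.
        self.cache[start_pos:start_pos + len(tokens)] = tokens
        logits = []
        for i in range(len(tokens)):
            # Fake logits that depend on everything cached up to this position.
            context = self.cache[: start_pos + i + 1]
            target = (sum(context) + 1) % self.vocab_size
            logits.append([1.0 if v == target else 0.0 for v in range(self.vocab_size)])
        return logits


def greedy_generate(model, prompt, max_new_tokens):
    tokens = list(prompt)
    prev_pos = 0
    for _ in range(max_new_tokens):
        # First pass: the whole prompt with start_pos=0.
        # Later passes: only the newest token, with start_pos at its position.
        logits = model.forward(tokens[prev_pos:], prev_pos)
        last = logits[-1]
        next_token = max(range(len(last)), key=last.__getitem__)
        prev_pos = len(tokens)
        tokens.append(next_token)
    return tokens


model = ToyCachedModel(vocab_size=10)
generated = greedy_generate(model, [1, 2, 3], max_new_tokens=3)
print(model.calls)  # [(0, 3), (3, 1), (4, 1)]
```

The recorded calls show one pass over the three prompt tokens followed by single-token passes whose start_pos advances by one each step. Earlier positions are never passed in again, because the cache already holds them.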
Output Layer
Final linear projection from model dimension to vocabulary size produces logits for next token prediction.
Usage with Llama Class
While you can use the Transformer class directly, it is typically used internally by the Llama class.