Overview
The Tokenizer class tokenizes text with SentencePiece. It provides methods to convert between text strings and token IDs for use with Llama 2 models.
Class Definition
class Tokenizer:
    """tokenizing and encoding/decoding text using SentencePiece."""
Methods
__init__
def __init__(self, model_path: str)
Initializes the Tokenizer with a SentencePiece model.
Parameters:
model_path (str): The path to the SentencePiece model file.
Raises:
AssertionError: If model_path does not point to an existing file.
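Because the path check is an assertion, it is skipped when Python runs with -O, and the resulting error carries little context. A minimal sketch of validating the path before constructing the Tokenizer follows; resolve_model_path is a hypothetical helper for illustration, not part of llama.tokenizer.

```python
from pathlib import Path


def resolve_model_path(model_path: str) -> str:
    """Return model_path if it names an existing file.

    Hypothetical helper (not part of llama.tokenizer). Raises
    FileNotFoundError with a readable message instead of relying on
    the AssertionError raised inside Tokenizer.__init__.
    """
    if not Path(model_path).is_file():
        raise FileNotFoundError(f"SentencePiece model not found: {model_path}")
    return model_path
```

The checked path can then be passed straight through, for example Tokenizer(model_path=resolve_model_path("/path/to/tokenizer.model")).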
Attributes:
After initialization, the Tokenizer instance has the following attributes:
sp_model (SentencePieceProcessor): The loaded SentencePiece processor.
n_words (int): Vocabulary size.
bos_id (int): Beginning-of-sequence token ID.
eos_id (int): End-of-sequence token ID.
pad_id (int): Padding token ID.
Example:
from llama.tokenizer import Tokenizer
# Initialize tokenizer
tokenizer = Tokenizer(model_path="/path/to/tokenizer.model")
print(f"Vocabulary size: {tokenizer.n_words}")
print(f"BOS token ID: {tokenizer.bos_id}")
print(f"EOS token ID: {tokenizer.eos_id}")
print(f"PAD token ID: {tokenizer.pad_id}")
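The pad_id attribute is typically used when several encoded prompts of different lengths are combined into one rectangular batch. The sketch below shows one way to do that in plain Python. The helper pad_batch and the token IDs are illustrative assumptions, and the pad value of -1 assumes a model file that defines no padding token (worth checking by printing tokenizer.pad_id for your own model).

```python
from typing import List


def pad_batch(sequences: List[List[int]], pad_value: int) -> List[List[int]]:
    """Right-pad every sequence with pad_value to the length of the longest one."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Stand-in token IDs; in practice these would come from tokenizer.encode(...)
# and pad_value would be tokenizer.pad_id.
batch = pad_batch([[1, 15043], [1, 15043, 29892, 3186]], pad_value=-1)
print(batch)  # [[1, 15043, -1, -1], [1, 15043, 29892, 3186]]
```

If the pad value is negative, it cannot serve as an index into an embedding table, so padded positions usually need to be masked or replaced before the batch reaches the model.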
encode
def encode(self, s: str, bos: bool, eos: bool) -> List[int]
Encodes a string into a list of token IDs.
Parameters:
s (str): The input string to be encoded.
bos (bool): Whether to prepend the beginning-of-sequence token.
eos (bool): Whether to append the end-of-sequence token.
Returns:
List[int]: A list of token IDs representing the encoded string.
Raises:
AssertionError: If the input s is not a string.
Example:
from llama.tokenizer import Tokenizer
tokenizer = Tokenizer(model_path="/path/to/tokenizer.model")
# Encode with BOS token
text = "Hello, world!"
tokens = tokenizer.encode(text, bos=True, eos=False)
print(f"Tokens with BOS: {tokens}")
# Encode with both BOS and EOS tokens
tokens = tokenizer.encode(text, bos=True, eos=True)
print(f"Tokens with BOS and EOS: {tokens}")
# Encode without special tokens
tokens = tokenizer.encode(text, bos=False, eos=False)
print(f"Tokens without special tokens: {tokens}")
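The effect of the bos and eos flags can be pictured as wrapping the plain token list with the special IDs. The following is a sketch of that behavior as described above, not the library's own code. The function add_special_tokens and the ID values used in the call are assumptions for illustration.

```python
from typing import List


def add_special_tokens(
    ids: List[int], bos: bool, eos: bool, bos_id: int, eos_id: int
) -> List[int]:
    """Return a new list with bos_id prepended and/or eos_id appended."""
    result = list(ids)  # copy so the caller's list is not modified
    if bos:
        result = [bos_id] + result
    if eos:
        result = result + [eos_id]
    return result


# Assumed IDs for illustration; read the real ones from
# tokenizer.bos_id and tokenizer.eos_id.
print(add_special_tokens([15043, 3186], bos=True, eos=True, bos_id=1, eos_id=2))
# [1, 15043, 3186, 2]
```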
decode
def decode(self, t: List[int]) -> str
Decodes a list of token IDs into a string.
Parameters:
t (List[int]): The list of token IDs to be decoded.
Returns:
str: The decoded string.
Example:
from llama.tokenizer import Tokenizer
tokenizer = Tokenizer(model_path="/path/to/tokenizer.model")
# Encode and decode
original_text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(original_text, bos=True, eos=True)
print(f"Tokens: {tokens}")
decoded_text = tokenizer.decode(tokens)
print(f"Decoded: {decoded_text}")
# Decode only the first five tokens
partial_tokens = tokens[:5]
partial_text = tokenizer.decode(partial_tokens)
print(f"Partial text: {partial_text}")
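decode takes the token list as given, so a generated sequence is often trimmed at the first end-of-sequence token before decoding. A minimal sketch of that trimming step follows. truncate_at_eos is a hypothetical helper, and the token IDs in the call are stand-ins.

```python
from typing import List


def truncate_at_eos(tokens: List[int], eos_id: int) -> List[int]:
    """Return the tokens before the first eos_id, or a copy of all tokens if none is present."""
    try:
        cut = tokens.index(eos_id)
    except ValueError:
        return list(tokens)
    return list(tokens[:cut])


# Assumed eos_id of 2 for illustration; use tokenizer.eos_id in practice.
print(truncate_at_eos([1, 15043, 3186, 2, 0, 0], eos_id=2))  # [1, 15043, 3186]
```

The trimmed list can then be passed to tokenizer.decode as shown in the example above.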
Usage Example
Here’s a complete example showing typical tokenizer usage:
from llama.tokenizer import Tokenizer
# Initialize tokenizer
tokenizer = Tokenizer(model_path="/path/to/tokenizer.model")
# Prepare text for model input
input_text = "What is the meaning of life?"
# Encode with BOS, without EOS (for prompts)
prompt_tokens = tokenizer.encode(input_text, bos=True, eos=False)
print(f"Prompt tokens: {prompt_tokens}")
# After generation, decode tokens back to text
generated_tokens = [1, 1724, 338, 278, 6593, 310, 2834, 29973]  # Example tokens
generated_text = tokenizer.decode(generated_tokens)
print(f"Generated text: {generated_text}")
# Check vocabulary information
print("\nTokenizer info:")
print(f" Vocabulary size: {tokenizer.n_words}")
print(f" BOS ID: {tokenizer.bos_id}")
print(f" EOS ID: {tokenizer.eos_id}")
print(f" PAD ID: {tokenizer.pad_id}")
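When several prompts are prepared at once, the same convention (BOS prepended, no EOS) can be applied to each one. The sketch below is written against any object that exposes the encode signature documented on this page. SupportsEncode and encode_prompts are hypothetical names introduced here for illustration.

```python
from typing import List, Protocol


class SupportsEncode(Protocol):
    """Anything with the encode(s, bos, eos) method documented above."""

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]: ...


def encode_prompts(tokenizer: SupportsEncode, prompts: List[str]) -> List[List[int]]:
    """Encode each prompt with BOS prepended and no EOS appended."""
    return [tokenizer.encode(p, bos=True, eos=False) for p in prompts]
```

With a real tokenizer this would be called as encode_prompts(tokenizer, ["What is the meaning of life?", "Hello, world!"]).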