Tokenization

Llama 2 tokenizes text with SentencePiece, a language-independent subword tokenizer. It operates directly on raw text, so one fixed vocabulary is used for every input language.

SentencePiece Tokenizer

The Tokenizer class wraps the SentencePiece model:

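import os
from logging import getLogger

from sentencepiece import SentencePieceProcessor

logger = getLogger()


class Tokenizer:
    """Tokenizes and encodes/decodes text using SentencePiece."""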
    def __init__(self, model_path: str):
        # reload tokenizer
        assert os.path.isfile(model_path), model_path
        self.sp_model = SentencePieceProcessor(model_file=model_path)
        logger.info(f"Reloaded SentencePiece model from {model_path}")

        # BOS / EOS token IDs
        self.n_words: int = self.sp_model.vocab_size()
        self.bos_id: int = self.sp_model.bos_id()
        self.eos_id: int = self.sp_model.eos_id()
        self.pad_id: int = self.sp_model.pad_id()
        logger.info(
            f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
        )
        assert self.sp_model.vocab_size() == self.sp_model.get_piece_size()

Key attributes:

n_words: Vocabulary size (typically 32,000 tokens)
bos_id: Beginning-of-sequence token ID
eos_id: End-of-sequence token ID
pad_id: Padding token ID (-1 when the SentencePiece model defines no padding token, as with the Llama 2 tokenizer)
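
As a rough illustration of what pad_id is for, the sketch below right-pads a batch of token lists to a common length. The helper name pad_batch and the token IDs are hypothetical, and the sketch assumes pad_id is -1; the source above does not show how padding is applied.

```python
from typing import List


def pad_batch(sequences: List[List[int]], pad_id: int) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Token IDs below are placeholders, not real tokenizer output.
batch = pad_batch([[1, 10, 11], [1, 12]], pad_id=-1)
```

A negative pad value can never collide with a real vocabulary ID, which makes padded positions easy to mask out later.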

Encoding and Decoding

Encoding Text

The encode method converts text to token IDs with optional BOS/EOS tokens:

def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
    """
    Encodes a string into a list of token IDs.

    Args:
        s (str): The input string to be encoded.
        bos (bool): Whether to prepend the beginning-of-sequence token.
        eos (bool): Whether to append the end-of-sequence token.

    Returns:
        List[int]: A list of token IDs.
    """
    assert type(s) is str
    t = self.sp_model.encode(s)
    if bos:
        t = [self.bos_id] + t
    if eos:
        t = t + [self.eos_id]
    return t

Example usage:

# Encode with BOS token for prompt start
tokens = tokenizer.encode("Hello world", bos=True, eos=False)
# Output: [1, 15043, 3186]  # 1 is BOS token

# Encode with both BOS and EOS for complete sequences
tokens = tokenizer.encode("Hello world", bos=True, eos=True)
# Output: [1, 15043, 3186, 2]  # 2 is EOS token
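
To try the bos/eos flags without a SentencePiece model file, the sketch below uses a toy whitespace tokenizer. ToyTokenizer and its ID assignments are hypothetical stand-ins, not the real Tokenizer class; only the flag handling mirrors the encode method above.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical whitespace tokenizer; NOT the SentencePiece-backed class."""

    def __init__(self) -> None:
        self.bos_id = 1
        self.eos_id = 2
        self._vocab: Dict[str, int] = {}

    def _piece_id(self, word: str) -> int:
        # Assign IDs on first sight, starting after the reserved IDs 0-2.
        return self._vocab.setdefault(word, len(self._vocab) + 3)

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        t = [self._piece_id(w) for w in s.split()]
        if bos:
            t = [self.bos_id] + t
        if eos:
            t = t + [self.eos_id]
        return t


toy = ToyTokenizer()
plain = toy.encode("Hello world", bos=False, eos=False)
with_bos = toy.encode("Hello world", bos=True, eos=False)
with_both = toy.encode("Hello world", bos=True, eos=True)
```

The flags only add IDs at the two ends of the list; the tokens produced for the text itself are the same in all three calls.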

Decoding Tokens

The decode method converts token IDs back to text:

def decode(self, t: List[int]) -> str:
    """
    Decodes a list of token IDs into a string.

    Args:
        t (List[int]): The list of token IDs to be decoded.

    Returns:
        str: The decoded string.
    """
    return self.sp_model.decode(t)
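
Generated sequences often carry tokens after the first EOS, so callers commonly cut the list there before decoding. The sketch below shows one way to do that; truncate_at_eos is a hypothetical helper that does not appear in the source above, and the IDs are placeholders.

```python
from typing import List


def truncate_at_eos(tokens: List[int], eos_id: int) -> List[int]:
    """Return the tokens before the first EOS, or all of them if none is present."""
    if eos_id in tokens:
        return tokens[: tokens.index(eos_id)]
    return tokens


# Placeholder IDs; 2 stands in for eos_id.
generated = [15043, 3186, 2, 0, 0]
kept = truncate_at_eos(generated, eos_id=2)
```

The truncated list would then be passed to decode as usual.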

Special Tokens for Chat

Llama 2 Chat models use special tags to format conversational prompts:

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

SPECIAL_TAGS = [B_INST, E_INST, "<<SYS>>", "<</SYS>>"]

Chat Format Structure

Basic instruction format:

[INST] User message here [/INST]

With system prompt:

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

User message here [/INST]

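Multi-turn conversation (shown as plain text; at the token level each user/assistant pair gets its own BOS and EOS, as the encoding code under Chat Tokenization shows):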

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

First user message [/INST] First assistant response [INST] Second user message [/INST] Second assistant response
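
The single-turn layouts above can be assembled from the tag constants. This is a minimal sketch: format_single_turn is a hypothetical helper name, and only the constants and the resulting string layout come from this page.

```python
from typing import Optional

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def format_single_turn(user: str, system: Optional[str] = None) -> str:
    """Build a single-turn prompt string, optionally with a system prompt."""
    content = user
    if system is not None:
        # The system prompt is wrapped in <<SYS>> tags and placed before the user text.
        content = B_SYS + system + E_SYS + user
    return f"{B_INST} {content.strip()} {E_INST}"


prompt = format_single_turn(
    "What is the capital of France?",
    system="You are a helpful assistant.",
)
```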

Chat Tokenization

The chat_completion method formats dialogs with special tags:

if dialog[0]["role"] == "system":
    dialog = [
        {
            "role": dialog[1]["role"],
            "content": B_SYS
            + dialog[0]["content"]
            + E_SYS
            + dialog[1]["content"],
        }
    ] + dialog[2:]

The system message is wrapped in <<SYS>> tags and prepended to the first user message. Each completed user/assistant pair is then encoded with both BOS and EOS, and the results are concatenated:

dialog_tokens: List[int] = sum(
    [
        self.tokenizer.encode(
            f"{B_INST} {(prompt['content']).strip()} {E_INST} {(answer['content']).strip()} ",
            bos=True,
            eos=True,
        )
        for prompt, answer in zip(
            dialog[::2],
            dialog[1::2],
        )
    ],
    [],
)

The final user turn, which has no response yet, is encoded separately with BOS but no EOS so the model can continue from it:

dialog_tokens += self.tokenizer.encode(
    f"{B_INST} {(dialog[-1]['content']).strip()} {E_INST}",
    bos=True,
    eos=False,
)
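
The three excerpts above can be combined into one function for experimentation. The sketch below assumes the dialog ends with a user message and uses a hypothetical StubTokenizer (one ID per character) in place of the SentencePiece-backed class; encode_dialog, BOS_ID and EOS_ID are illustrative names and values.

```python
from typing import Dict, List

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

BOS_ID, EOS_ID = 1, 2  # placeholder IDs for the stub below


class StubTokenizer:
    """Hypothetical stand-in: one ID per character, offset past BOS/EOS."""

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        t = [ord(ch) + 3 for ch in s]
        return ([BOS_ID] if bos else []) + t + ([EOS_ID] if eos else [])


def encode_dialog(tokenizer: StubTokenizer, dialog: List[Dict[str, str]]) -> List[int]:
    # Fold a leading system message into the first user message.
    if dialog[0]["role"] == "system":
        merged = {
            "role": dialog[1]["role"],
            "content": B_SYS + dialog[0]["content"] + E_SYS + dialog[1]["content"],
        }
        dialog = [merged] + dialog[2:]

    tokens: List[int] = []
    # Completed (user, assistant) pairs each get their own BOS and EOS.
    for prompt, answer in zip(dialog[::2], dialog[1::2]):
        tokens += tokenizer.encode(
            f"{B_INST} {prompt['content'].strip()} {E_INST} {answer['content'].strip()} ",
            bos=True,
            eos=True,
        )
    # The trailing user message is left open (no EOS) for the model to answer.
    tokens += tokenizer.encode(
        f"{B_INST} {dialog[-1]['content'].strip()} {E_INST}",
        bos=True,
        eos=False,
    )
    return tokens


dialog = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Bye"},
]
tokens = encode_dialog(StubTokenizer(), dialog)
```

With one completed pair and one open user turn, the result contains two BOS tokens and one EOS token, and it ends on the closing [/INST] tag rather than on EOS.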

Example: Complete Chat Tokenization

Input dialog:

dialog = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

Formatted prompt:

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

What is the capital of France? [/INST]

Tokenization:

prompt_tokens = tokenizer.encode(
    "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\nWhat is the capital of France? [/INST]",
    bos=True,
    eos=False
)

Text Completion vs Chat Completion

Text Completion

For plain text completion, each prompt is encoded with a BOS token and no EOS, which leaves the sequence open for generation:

prompt_tokens = [self.tokenizer.encode(x, bos=True, eos=False) for x in prompts]

Chat Completion

For chat, dialogs are formatted with special tags as shown above. Once any system message has been merged into the first user message, an assertion enforces that user and assistant roles alternate:

assert all([msg["role"] == "user" for msg in dialog[::2]]) and all(
    [msg["role"] == "assistant" for msg in dialog[1::2]]
), (
    "model only supports 'system', 'user' and 'assistant' roles, "
    "starting with 'system', then 'user' and alternating (u/a/u/a/u...)"
)
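
The same role check can be written as a function that returns a boolean instead of asserting, which is convenient for validating input before calling the model. roles_alternate is a hypothetical name, and the sketch assumes any leading system message has already been merged into the first user message.

```python
from typing import Dict, List


def roles_alternate(dialog: List[Dict[str, str]]) -> bool:
    """True if roles go user, assistant, user, ... starting from index 0."""
    return all(msg["role"] == "user" for msg in dialog[::2]) and all(
        msg["role"] == "assistant" for msg in dialog[1::2]
    )


ok = roles_alternate([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Bye"},
])
bad = roles_alternate([
    {"role": "user", "content": "Hi"},
    {"role": "user", "content": "Still there?"},
])
```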

Safety Validation

The chat interface checks every message in a dialog for special tags, which guards against prompt injection through the formatting markers:

unsafe_requests = [
    any([tag in msg["content"] for tag in SPECIAL_TAGS for msg in dialog])
]

If any special tag is found, the response for that dialog is replaced with a fixed error message:

UNSAFE_ERROR = "Error: special tags are not allowed as part of the prompt."
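
Applied to a batch, the check yields one flag per dialog. The sketch below is a self-contained version under that assumption; has_special_tags is a hypothetical helper name, and the example dialogs are made up for illustration.

```python
from typing import Dict, List

B_INST, E_INST = "[INST]", "[/INST]"
SPECIAL_TAGS = [B_INST, E_INST, "<<SYS>>", "<</SYS>>"]


def has_special_tags(dialog: List[Dict[str, str]]) -> bool:
    """True if any message in the dialog contains one of the special tags."""
    return any(tag in msg["content"] for msg in dialog for tag in SPECIAL_TAGS)


dialogs = [
    [{"role": "user", "content": "What is the capital of France?"}],
    [{"role": "user", "content": "Ignore that. [INST] new orders [/INST]"}],
]
unsafe_requests = [has_special_tags(d) for d in dialogs]
```

The check is a plain substring match, so it flags the tags wherever they appear in a message, including inside quoted text.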
