
Instruction Formatting

Why Formatting Matters (Really Matters)

Here’s the thing about language models: they’re just predicting the next token. That’s it.

Think about what happens when you type “The capital of France is” — the model sees those tokens and thinks “ah yes, I’ve seen this pattern before in my training data, the next token is probably ‘Paris’.”

But now imagine you want the model to follow an instruction. You type:

Tell me about Paris.

And... the model might continue with anything. Maybe it starts generating a conversation. Maybe it writes a story. Maybe it keeps asking more questions. Why? Because during pre-training, it saw all sorts of text on the internet — conversations, stories, articles, Q&A forums — and it has no idea which pattern you want it to follow right now.

This is where formatting comes in.

When we fine-tune a model on instructions, we’re teaching it a very specific pattern:

  • When you see text formatted THIS way → that’s an instruction

  • When you see THIS special marker → start your response

  • When you see THAT special token → stop generating

It’s like training a dog with consistent commands. You can’t say “sit” one day and “please lower your rear end to the ground” the next and expect the dog to understand. Same with models — they need consistency.

Without proper formatting:

  • The model doesn’t know when to stop generating (it just keeps going...)

  • It can’t tell instructions from responses (is this part of the question or the answer?)

  • Multi-turn conversations become impossible (who’s talking right now?)

  • The model might repeat your instruction back to you instead of answering it

Chat templates solve all of this by wrapping messages in a consistent structure with special tokens. And once a model learns a template, it works beautifully. But only if you use the exact same format every time.

(This is why you can’t just grab a model fine-tuned with one chat template and use it with a different one — it’s like speaking German to someone who only learned French.)

The Big Three Chat Formats

Different research labs invented different formats. None is objectively “better” — they’re just different conventions. Let’s look at the three most popular ones.

Alpaca Format (Stanford)

Stanford’s Alpaca team wanted something human-readable. Look at this:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}

See what they did? The ### Instruction: and ### Response: markers are clear delimiters. A human can read this and immediately understand what’s what. The model learns to recognize these markers and knows “aha, after I see ### Response:, I should start generating my answer.”

There’s also a variant with an ### Input: field for tasks where you need both an instruction AND some data to work with (like “Summarize this text: [long article]”).

ChatML Format (OpenAI)

OpenAI went a different direction — special tokens that won’t appear in normal text:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>

The <|im_start|> and <|im_end|> tokens (im = “instant message”) are added to the tokenizer vocabulary. They’re designed to never appear in regular text, which means the model can be absolutely certain that when it sees <|im_start|>assistant, it’s time to generate a response.

Notice the system, user, and assistant roles? This lets you:

  • Set a system prompt that guides behavior

  • Track who said what in multi-turn conversations

  • Handle back-and-forth dialogue naturally

Llama 2 Format (Meta)

Meta created their own format for Llama 2:

<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

{instruction} [/INST] {response} </s>

Here:

  • <s> = beginning of sequence (special token)

  • [INST] and [/INST] = instruction boundaries

  • <<SYS>> and <</SYS>> = system message boundaries

  • </s> = end of sequence (special token)

The <s> and </s> markers are the tokenizer’s standard beginning-of-sequence (BOS) and end-of-sequence (EOS) special tokens. They tell the model “this is the start” and “this is the end.”

The key insight: All three formats do the same job — they create clear boundaries. The model just needs to learn whichever pattern you pick. But once you pick one, you’re committed to it.

# Let's implement Alpaca formatting (it's the most human-readable)

ALPACA_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}"""

ALPACA_TEMPLATE_WITH_INPUT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{response}"""

def format_alpaca(instruction: str, response: str = "", input_text: str = "") -> str:
    """
    Format a training example in Alpaca style.
    
    Args:
        instruction: The task description ("Summarize this text")
        response: The model's expected response
        input_text: Optional additional context (like the text to summarize)
    
    Returns:
        Formatted string ready for training
    """
    if input_text:
        return ALPACA_TEMPLATE_WITH_INPUT.format(
            instruction=instruction,
            input=input_text,
            response=response
        )
    return ALPACA_TEMPLATE.format(
        instruction=instruction,
        response=response
    )

# Let's see it in action
print("Example 1: Simple instruction")
print("=" * 60)
formatted = format_alpaca(
    instruction="Explain quantum computing in simple terms.",
    response="Quantum computing uses quantum mechanics to process information in fundamentally different ways than classical computers, potentially solving certain problems exponentially faster."
)
print(formatted)

print("\n\nExample 2: Instruction with input")
print("=" * 60)
formatted_with_input = format_alpaca(
    instruction="Translate this sentence to French.",
    input_text="The cat sits on the mat.",
    response="Le chat est assis sur le tapis."
)
print(formatted_with_input)
Example 1: Simple instruction
============================================================
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain quantum computing in simple terms.

### Response:
Quantum computing uses quantum mechanics to process information in fundamentally different ways than classical computers, potentially solving certain problems exponentially faster.


Example 2: Instruction with input
============================================================
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate this sentence to French.

### Input:
The cat sits on the mat.

### Response:
Le chat est assis sur le tapis.
# Now let's implement ChatML (the one with special tokens)

def format_chatml(
    instruction: str,
    response: str = "",
    system: str = "You are a helpful assistant."
) -> str:
    """
    Format a conversation in ChatML style.
    
    This format uses special tokens (<|im_start|> and <|im_end|>) that are
    added to the tokenizer's vocabulary. They're designed to never appear
    in regular text, giving the model unambiguous boundaries.
    
    Args:
        instruction: What the user is asking
        response: What the assistant should say back
        system: System prompt that sets the assistant's behavior
    
    Returns:
        Formatted string with special tokens
    """
    formatted = f"<|im_start|>system\n{system}<|im_end|>\n"
    formatted += f"<|im_start|>user\n{instruction}<|im_end|>\n"
    formatted += f"<|im_start|>assistant\n{response}"
    
    # Only add the closing token if there's a response
    # (At inference time we leave it open so generation starts right here)
    if response:
        formatted += "<|im_end|>"
    
    return formatted

# Let's see how it looks
print("ChatML Example:")
print("=" * 60)
formatted = format_chatml(
    instruction="What is the capital of France?",
    response="The capital of France is Paris."
)
print(formatted)

print("\n\nNotice:")
print("- The system message comes first (sets the assistant's personality)")
print("- Each role is clearly marked: system, user, assistant")
print("- Special tokens make boundaries crystal clear")
print("- This format makes multi-turn conversations easy to handle")
ChatML Example:
============================================================
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>


Notice:
- The system message comes first (sets the assistant's personality)
- Each role is clearly marked: system, user, assistant
- Special tokens make boundaries crystal clear
- This format makes multi-turn conversations easy to handle
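
For completeness, here’s a sketch of a Llama 2 formatter too. This helper isn’t part of the original notebook or Meta’s reference code; it’s a minimal illustration of the template shown above, and it writes <s> and </s> as literal text even though a real Llama 2 tokenizer would add them as special BOS/EOS tokens.

# A hedged sketch of the Llama 2 format (for illustration, not Meta's reference code)

def format_llama2(
    instruction: str,
    response: str = "",
    system: str = "You are a helpful assistant."
) -> str:
    """
    Format a single-turn example in Llama 2 chat style.

    Note: <s> and </s> are written here as literal strings for readability.
    With a real Llama 2 tokenizer they are special BOS/EOS tokens that the
    tokenizer adds for you.
    """
    formatted = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{instruction} [/INST]"
    if response:
        formatted += f" {response} </s>"
    return formatted

print(format_llama2(
    instruction="What is the capital of France?",
    response="The capital of France is Paris."
))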

Using Built-in Chat Templates

Here’s the good news: you usually don’t have to write this formatting code yourself.

Modern tokenizers (from HuggingFace) come with chat templates built in. When someone releases a fine-tuned model, they bake the chat template right into the tokenizer config. This means:

  • You can’t accidentally use the wrong format

  • Multi-turn conversations are handled automatically

  • The format stays consistent between training and inference

Let’s load a tokenizer and see its chat template in action.

from transformers import AutoTokenizer

# Load a tokenizer with chat template support
# (DialoGPT is a conversational model, so it has a simple chat template)
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")

# Create a multi-turn conversation
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you! How can I help you today?"},
    {"role": "user", "content": "Can you explain machine learning?"}
]

# Check if tokenizer has a chat template
if hasattr(tokenizer, 'chat_template') and tokenizer.chat_template:
    formatted = tokenizer.apply_chat_template(messages, tokenize=False)
    print("Using built-in chat template:")
    print("=" * 60)
    print(formatted)
    print("\n" + "=" * 60)
    print("\nWhat happened here?")
    print("The tokenizer automatically:")
    print("- Added special tokens between turns")
    print("- Formatted the multi-turn conversation")
    print("- Made sure everything is ready for the model")
else:
    print("This tokenizer doesn't have a chat template.")
    print("(That's okay — we can use our custom formats from above)")
Using built-in chat template:
============================================================
Hello, how are you?<|endoftext|>I'm doing well, thank you! How can I help you today?<|endoftext|>Can you explain machine learning?<|endoftext|>

============================================================

What happened here?
The tokenizer automatically:
- Added special tokens between turns
- Formatted the multi-turn conversation
- Made sure everything is ready for the model
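
One inference-time detail worth knowing: apply_chat_template takes an add_generation_prompt argument that appends the opening of the assistant turn, so generation starts in the right place. The sketch below assigns a ChatML-style Jinja template by hand purely for demonstration; a real chat model ships its own template, so you normally wouldn’t set one yourself.

# A minimal sketch of add_generation_prompt. We assign a ChatML-style Jinja
# template by hand purely for demonstration; a released chat model ships its own.
from transformers import AutoTokenizer

demo_tokenizer = AutoTokenizer.from_pretrained("gpt2")
demo_tokenizer.chat_template = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

demo_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Can you explain machine learning?"},
]

# With add_generation_prompt=True, the output ends with "<|im_start|>assistant\n",
# which is exactly where the model should start generating.
prompt = demo_tokenizer.apply_chat_template(
    demo_messages, tokenize=False, add_generation_prompt=True
)
print(prompt)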

Finding Where the Response Starts (This is Critical for Training)

Okay, here’s a key insight about how we actually train these models.

When you fine-tune on instruction data, you don’t want to compute loss on the instruction part. Why? Because the model doesn’t need to learn how to generate instructions — it needs to learn how to generate responses.

Think about it:

  • Instruction: “Explain quantum computing in simple terms.”

  • Response: “Quantum computing uses quantum mechanics...”

We want the model to get better at generating that response. We don’t care if it can predict the instruction — that’s the input, not the output!

So during training, we “mask” the instruction tokens. We only compute loss on the response tokens. This is called loss masking (or label masking); it’s not the same thing as attention masking, which deals with things like padding.

To do this, we need to know: where does the response start?

Let’s write some code to find that boundary.

def find_response_start(formatted_text: str, response_marker: str = "### Response:\n") -> int:
    """
    Find the character position where the response starts.
    
    Args:
        formatted_text: The full formatted prompt
        response_marker: The string that marks the start of the response
    
    Returns:
        Character index where the response begins
    """
    idx = formatted_text.find(response_marker)
    if idx == -1:
        raise ValueError(f"Response marker '{response_marker}' not found in text")
    # Return position AFTER the marker (where the actual response starts)
    return idx + len(response_marker)

def find_response_start_tokens(tokenizer, formatted_text: str, response_marker: str = "### Response:\n"):
    """
    Find the token position where the response starts.
    
    This is trickier than finding the character position because:
    - Tokens don't always align with character boundaries
    - We need to count tokens, not characters
    
    Args:
        tokenizer: The tokenizer to use
        formatted_text: The full formatted prompt
        response_marker: The string that marks the start of the response
    
    Returns:
        Token index where the response begins
    """
    # First, find where the response starts in character space
    char_pos = find_response_start(formatted_text, response_marker)
    
    # Tokenize just the prompt part (everything before the response)
    prompt_text = formatted_text[:char_pos]
    prompt_tokens = tokenizer.encode(prompt_text, add_special_tokens=False)
    
    # The number of prompt tokens is where the response starts!
    return len(prompt_tokens)

# Let's test this
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Create a formatted example
text = format_alpaca(
    instruction="What is 2+2?",
    response="2+2 equals 4. This is basic arithmetic."
)

print("Formatted text:")
print("=" * 60)
print(text)
print("=" * 60)

# Find where the response starts
response_start_char = find_response_start(text)
response_start_token = find_response_start_tokens(tokenizer, text)

print(f"\nResponse starts at:")
print(f"  Character position: {response_start_char}")
print(f"  Token position: {response_start_token}")

# Show the breakdown
tokens = tokenizer.encode(text)
print(f"\nToken breakdown:")
print(f"  Total tokens: {len(tokens)}")
print(f"  Prompt tokens: {response_start_token} (we don't compute loss here)")
print(f"  Response tokens: {len(tokens) - response_start_token} (we DO compute loss here)")

# Visualize the split
print("\n" + "=" * 60)
print("The prompt part (no loss):")
print(repr(text[:response_start_char]))
print("\n" + "=" * 60)
print("The response part (compute loss):")
print(repr(text[response_start_char:]))
Formatted text:
============================================================
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is 2+2?

### Response:
2+2 equals 4. This is basic arithmetic.
============================================================

Response starts at:
  Character position: 152
  Token position: 36

Token breakdown:
  Total tokens: 47
  Prompt tokens: 36 (we don't compute loss here)
  Response tokens: 11 (we DO compute loss here)

============================================================
The prompt part (no loss):
'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is 2+2?\n\n### Response:\n'

============================================================
The response part (compute loss):
'2+2 equals 4. This is basic arithmetic.'
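
Just to preview why this index matters (the next notebook implements this properly): in PyTorch and Hugging Face training, positions labeled -100 are ignored by the cross-entropy loss. Reusing the variables from the cell above, a minimal version of loss masking looks something like this.

# A minimal preview of loss masking (covered properly in the next notebook):
# label prompt positions with -100, the ignore index used by PyTorch's cross-entropy.
labels = [-100] * response_start_token + tokens[response_start_token:]

print(f"Total positions:    {len(labels)}")
print(f"Masked (prompt):    {labels.count(-100)}")
print(f"Trained (response): {len(labels) - labels.count(-100)}")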

Best Practices (Learn from Others’ Mistakes)

Here are the things that trip people up when formatting instruction data:

1. Be Consistent (No Really, Be Obsessively Consistent)

Use the exact same format for every single training example. Not “mostly the same” — exactly the same. If you have:

  • 10,000 examples in Alpaca format

  • 5 examples where you forgot the preamble

  • 3 examples where you used a different marker

Your model will be confused on those 8 examples and might learn the wrong pattern.
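
A cheap way to protect yourself is a sanity check over the formatted dataset before training. Here’s a rough sketch (not part of the original pipeline) that flags examples missing the markers your template requires:

# A rough sanity check: flag formatted examples missing required markers.
REQUIRED_MARKERS = ["### Instruction:", "### Response:"]

def find_malformed(examples):
    """Return the indices of formatted examples missing any required marker."""
    return [
        i for i, text in enumerate(examples)
        if any(marker not in text for marker in REQUIRED_MARKERS)
    ]

dataset = [
    format_alpaca("What is 2+2?", "2+2 equals 4."),
    "What is 2+2?\n2+2 equals 4.",  # oops: forgot the template entirely
]
print(find_malformed(dataset))  # [1]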

2. Match Training and Inference

This is the #1 mistake people make. They:

  • Train with Alpaca format

  • Then try to use it with ChatML format at inference time

  • Wonder why the model outputs gibberish

The model learned that responses come after ### Response:. If you don’t include that marker when generating, it’s like asking someone to “fetch” when you trained them to respond to “sit”.
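
In practice, “matching” just means rebuilding the exact training prompt at generation time, with the response left empty. For example:

# At inference time, rebuild the same template with an empty response so the
# model picks up right after "### Response:".
inference_prompt = format_alpaca(
    instruction="Explain quantum computing in simple terms.",
    response="",  # left empty; the model generates this part
)
print(inference_prompt)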

3. Add Special Tokens to the Tokenizer

If you’re using special tokens like <|im_start|> or <s>, you need to do a few things (sketched in code after this list):

  • Add them to the tokenizer’s vocabulary

  • Tell the tokenizer they’re special (so they don’t get split up)

  • Make sure they’re present in all examples
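
Here’s a minimal sketch of what that looks like with the transformers library (the gpt2 model here is just a placeholder):

# A minimal sketch of registering special tokens and resizing the embeddings.
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

# Register the markers as additional special tokens so they are never split
# into sub-word pieces or merged with surrounding text.
num_added = tok.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]}
)

# The embedding matrix has to grow to cover the new vocabulary entries.
if num_added > 0:
    model.resize_token_embeddings(len(tok))

print(f"Added {num_added} special tokens")
print(tok.tokenize("<|im_start|>user\nHello<|im_end|>"))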

4. Handle Edge Cases

Think about the following (a quick length check is sketched after the list):

  • Empty inputs (what if someone sends just the instruction?)

  • Very long texts (what if the instruction+response exceeds context length?)

  • Special characters (what if the instruction contains your marker string?)

  • Unicode and emojis (does your tokenizer handle them?)
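
For the length question in particular, a quick check before training saves a lot of pain. A rough sketch, using the 1024-token limit of GPT-2 (adjust for your model):

# A quick length check: does the formatted example fit in the context window?
MAX_LENGTH = 1024  # GPT-2's context size; adjust for your model

def fits_in_context(tokenizer, formatted_text, max_length=MAX_LENGTH):
    """Return True if the formatted example fits within max_length tokens."""
    return len(tokenizer.encode(formatted_text)) <= max_length

too_long = format_alpaca(
    instruction="Summarize this text.",
    input_text="word " * 5000,  # deliberately oversized input
    response="A very short summary.",
)
print(fits_in_context(tokenizer, too_long))  # False: truncate or drop this example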

5. Document Your Format

Seriously. Write down:

  • Which chat template you used

  • Why you chose it

  • How to use the model at inference time

  • Any special tokens you added

Future you (or someone else using your model) will thank you.

What We Learned

Let’s recap what we covered:

Why formatting matters: Language models are next-token predictors. Without consistent formatting, they don’t know when to stop, when to respond, or how to handle multi-turn conversations. Chat templates solve this by creating clear boundaries with special tokens.

The three main formats:

  • Alpaca: Human-readable with ### Instruction: and ### Response: markers

  • ChatML: Special tokens with role-based messages (system, user, assistant)

  • Llama 2: Meta’s format with [INST] and </s> tokens

Built-in templates: Modern tokenizers have chat templates baked in, so you usually don’t need to write formatting code yourself. Just call tokenizer.apply_chat_template().

Loss masking: During training, we only compute loss on response tokens, not instruction tokens. This means we need to find where the response starts and mask everything before it.

Consistency is key: Pick a format and stick with it. Use the same template during training and inference. Otherwise, your model won’t understand what you’re asking it to do.

Up Next

Now that we understand formatting, the next notebook will dive into loss masking — how to actually implement the selective loss computation during training. We’ll see how to construct the label masks, why we mask the instruction part, and how this affects what the model learns.

(It makes a huge difference in model quality.)