What Decoding Is
After processing input through all transformer layers, the model must generate output.
Decoding is the process of selecting tokens one at a time to construct the response.
Each token selection is a probabilistic choice from the model's vocabulary.
The Autoregressive Loop
LLMs generate text autoregressively:
- Process entire context
- Predict probability distribution over next token
- Select one token
- Append to context
- Repeat until done
Each new token becomes part of the context for the next prediction.
This is why generation is sequential, even when input processing is parallelized.
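A minimal sketch of this loop, assuming illustrative `model` and `tokenizer` objects rather than any specific library's API:

```python
# Illustrative autoregressive loop; `model.next_token_probs`, `tokenizer.encode`,
# `tokenizer.decode`, and `eos_token_id` are assumed interfaces, not a real API.
def generate(model, tokenizer, prompt, max_new_tokens=50):
    token_ids = tokenizer.encode(prompt)           # process the entire context
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(token_ids)  # distribution over the vocabulary
        next_id = max(probs, key=probs.get)        # greedy pick; other strategies plug in here
        token_ids.append(next_id)                  # the new token joins the context
        if next_id == tokenizer.eos_token_id:      # stop token ends generation
            break
    return tokenizer.decode(token_ids)
```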
Probability Distributions
At each step, the model outputs a probability for every token in its vocabulary.
Example (simplified):
After "The capital of France is", the model might assign:
- "Paris" → 0.89
- "a" → 0.03
- "the" → 0.02
- "Lyon" → 0.01
- ... (50,000+ other tokens with tiny probabilities)
The decoding strategy determines how to convert these probabilities into a selection.
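Under the hood, the model emits a raw score (logit) for every vocabulary entry, and a softmax turns those scores into probabilities like the ones above. A toy sketch with made-up numbers:

```python
import math

# Toy logits for a tiny vocabulary (made-up values, not real model output).
logits = {"Paris": 8.2, "a": 4.8, "the": 4.4, "Lyon": 3.7}

# Softmax: exponentiate each score and normalize so they sum to 1.
exps = {tok: math.exp(score) for tok, score in logits.items()}
total = sum(exps.values())
probs = {tok: e / total for tok, e in exps.items()}
# The highest-scoring token ("Paris") ends up with most of the probability mass.
```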
Decoding Strategies
Greedy Decoding
Always pick the highest-probability token.
- Fast
- Deterministic
- Often repetitive and boring
- Can get stuck in loops
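A greedy selector is a one-liner over the `probs` dictionary from the sketch above:

```python
def greedy_select(probs):
    # Always return the single highest-probability token.
    return max(probs, key=probs.get)
```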
Temperature Sampling
Divide the raw scores (logits) by a temperature before applying the softmax:
- Temperature < 1.0 → sharper distribution → more predictable
- Temperature > 1.0 → flatter distribution → more random
- Temperature = 0 → treated as greedy in practice (the limit of an ever-sharper distribution)
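A minimal temperature-sampling sketch over raw logits (an illustrative helper, not a library function):

```python
import math
import random

def temperature_sample(logits, temperature=0.7):
    # Divide logits by the temperature, then softmax and sample.
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exps.values())
    tokens, weights = zip(*((tok, e / total) for tok, e in exps.items()))
    return random.choices(tokens, weights=weights, k=1)[0]
```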
Top-k Sampling
Only consider the k most likely tokens, then sample from those.
- k=1 → greedy
- k=50 → reasonable variety
- k=vocabulary size → pure sampling
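A sketch of top-k over the token-to-probability dictionary used earlier:

```python
import random

def top_k_sample(probs, k=50):
    # Keep only the k most probable tokens, renormalize, and sample.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    tokens, weights = zip(*((tok, p / total) for tok, p in top))
    return random.choices(tokens, weights=weights, k=1)[0]
```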
Top-p (Nucleus) Sampling
Include tokens until cumulative probability reaches p.
- p=0.9 → include tokens accounting for 90% of probability mass
- Adapts to distribution shape
- Most commonly used in production
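A sketch of nucleus sampling, again over a token-to-probability dictionary:

```python
import random

def top_p_sample(probs, p=0.9):
    # Sort tokens by probability and keep the smallest set whose
    # cumulative probability reaches p, then renormalize and sample.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    tokens, weights = zip(*((tok, prob / total) for tok, prob in nucleus))
    return random.choices(tokens, weights=weights, k=1)[0]
```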
Why Decoding Matters
Different decoding parameters produce different outputs from the same model.
| Use Case | Recommended Settings |
|---|---|
| Factual Q&A | Low temperature (0.1-0.3) |
| Creative writing | Higher temperature (0.7-0.9) |
| Code generation | Low temperature + top-p |
| Brainstorming | Higher temperature + top-k |
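As a rough illustration, these presets might be expressed as request parameters; the exact names and defaults vary by provider, so treat the values below as hypothetical starting points:

```python
# Hypothetical decoding presets; parameter names mirror the common
# temperature / top_p / top_k knobs but are not tied to any specific API.
DECODING_PRESETS = {
    "factual_qa":       {"temperature": 0.2, "top_p": 1.0},
    "creative_writing": {"temperature": 0.8, "top_p": 0.95},
    "code_generation":  {"temperature": 0.2, "top_p": 0.9},
    "brainstorming":    {"temperature": 0.9, "top_k": 50},
}
```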
Decoding and AI Visibility
Decoding is downstream from retrieval and context assembly.
By the time decoding starts:
- Your content is either in context or not
- How that content is interpreted is already fixed
- The model has committed to its understanding
Decoding determines how the response is expressed, not what information it contains.
However, decoding can affect:
- Whether your brand name gets mentioned (vs. paraphrased)
- How confidently claims are stated
- Whether alternative options are listed
The Role of Stopping Conditions
Generation continues until:
- A stop token is produced
- Maximum length is reached
- A stop sequence is matched
Premature stopping can truncate your citation.
If the model starts listing competitors and hits max length, your brand might be cut off.
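Inside the generation loop, these checks might look something like this (the names are illustrative):

```python
def should_stop(token_ids, new_token_id, text_so_far,
                eos_token_id, max_length, stop_sequences):
    # Stop token produced, maximum length reached, or a stop sequence matched.
    if new_token_id == eos_token_id:
        return True
    if len(token_ids) >= max_length:
        return True
    return any(text_so_far.endswith(seq) for seq in stop_sequences)
```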
Beam Search
An alternative to sampling: maintain multiple candidate sequences simultaneously.
- Start with top-k tokens
- Expand each candidate by top-k tokens
- Keep the best k sequences, ranked by cumulative log-probability
- Repeat until done
Produces more coherent long-form output but is computationally expensive.
Less common in production chat systems, more common in translation and summarization.
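A compact sketch of the idea, assuming a toy `model.next_token_logprobs` interface that maps each candidate next token to its log-probability:

```python
import heapq

def beam_search(model, start_ids, beam_width=3, max_new_tokens=20):
    # Each beam is (cumulative log-probability, token id sequence).
    beams = [(0.0, list(start_ids))]
    for _ in range(max_new_tokens):
        candidates = []
        for score, seq in beams:
            logprobs = model.next_token_logprobs(seq)  # assumed interface
            best_next = heapq.nlargest(beam_width, logprobs.items(),
                                       key=lambda kv: kv[1])
            for tok, lp in best_next:
                candidates.append((score + lp, seq + [tok]))
        # Keep the best beam_width sequences by cumulative log-probability.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]
```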
Practical Implications
For content creators, decoding is largely outside your control.
What you can influence:
- Getting into the context (retrievability)
- Being the highest-probability answer (authority signals)
- Using exact terminology users expect (vocabulary alignment)
What you cannot control:
- The user's temperature setting
- The system's sampling strategy
- Random variation between runs
Key Takeaway
Decoding is execution, not decision-making.
The model decides what to say during the forward pass and attention. Decoding decides exactly which tokens express that decision.
By the time tokens are being sampled, the battle for visibility is already won or lost.