What Decoding Is
After processing input through all transformer layers, the model must generate output.
Decoding is the process of selecting tokens one at a time to construct the response.
Each token selection is a probabilistic choice from the model's vocabulary.
The Autoregressive Loop
LLMs generate text autoregressively:
- Process entire context
- Predict probability distribution over next token
- Select one token
- Append to context
- Repeat until done
Each new token becomes part of the context for the next prediction.
This is why generation is sequential, even when input processing is parallelized.
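A minimal sketch of this loop, assuming illustrative `model` and `tokenizer` objects rather than any specific library's API:

```python
# Illustrative autoregressive loop; `model.next_token_probs`, `tokenizer.encode`,
# `tokenizer.decode`, and `eos_token_id` are assumed interfaces, not a real API.
def generate(model, tokenizer, prompt, max_new_tokens=50):
    token_ids = tokenizer.encode(prompt)           # process the entire context
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(token_ids)  # distribution over the vocabulary
        next_id = max(probs, key=probs.get)        # greedy pick; other strategies plug in here
        token_ids.append(next_id)                  # the new token joins the context
        if next_id == tokenizer.eos_token_id:      # stop token ends generation
            break
    return tokenizer.decode(token_ids)
```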
Probability Distributions
At each step, the model outputs a probability for every token in its vocabulary.
Example (simplified):
After "The capital of France is", the model might assign:
- "Paris" → 0.89
- "a" → 0.03
- "the" → 0.02
- "Lyon" → 0.01
- ... (50,000+ other tokens with tiny probabilities)
The decoding strategy determines how to convert these probabilities into a selection.
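Under the hood, the model emits a raw score (logit) for every vocabulary entry, and a softmax turns those scores into probabilities like the ones above. A toy sketch with made-up numbers:

```python
import math

# Toy logits for a tiny vocabulary (made-up values, not real model output).
logits = {"Paris": 8.2, "a": 4.8, "the": 4.4, "Lyon": 3.7}

# Softmax: exponentiate each score and normalize so they sum to 1.
exps = {tok: math.exp(score) for tok, score in logits.items()}
total = sum(exps.values())
probs = {tok: e / total for tok, e in exps.items()}
# The highest-scoring token ("Paris") ends up with most of the probability mass.
```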
Decoding Strategies
Greedy Decoding
Always pick the highest-probability token.
- Fast
- Deterministic
- Often repetitive and boring
- Can get stuck in loops
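A greedy selector is a one-liner over the `probs` dictionary from the sketch above:

```python
def greedy_select(probs):
    # Always return the single highest-probability token.
    return max(probs, key=probs.get)
```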
Temperature Sampling
Divide the raw scores (logits) by a temperature before applying the softmax:
- Temperature < 1.0 → sharper distribution → more predictable
- Temperature > 1.0 → flatter distribution → more random
- Temperature = 0 → treated as greedy in practice (the limit of an ever-sharper distribution)
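A minimal temperature-sampling sketch over raw logits (an illustrative helper, not a library function):

```python
import math
import random

def temperature_sample(logits, temperature=0.7):
    # Divide logits by the temperature, then softmax and sample.
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exps.values())
    tokens, weights = zip(*((tok, e / total) for tok, e in exps.items()))
    return random.choices(tokens, weights=weights, k=1)[0]
```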
Top-k Sampling
Only consider the k most likely tokens, then sample from those.
- k=1 → greedy
- k=50 → reasonable variety
- k=vocabulary size → pure sampling
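A sketch of top-k over the token-to-probability dictionary used earlier:

```python
import random

def top_k_sample(probs, k=50):
    # Keep only the k most probable tokens, renormalize, and sample.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    tokens, weights = zip(*((tok, p / total) for tok, p in top))
    return random.choices(tokens, weights=weights, k=1)[0]
```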
Top-p (Nucleus) Sampling
Include tokens until cumulative probability reaches p.
- p=0.9 → include tokens accounting for 90% of probability mass
- Adapts to distribution shape
- Most commonly used in production
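A sketch of nucleus sampling, again over a token-to-probability dictionary:

```python
import random

def top_p_sample(probs, p=0.9):
    # Sort tokens by probability and keep the smallest set whose
    # cumulative probability reaches p, then renormalize and sample.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    tokens, weights = zip(*((tok, prob / total) for tok, prob in nucleus))
    return random.choices(tokens, weights=weights, k=1)[0]
```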
Why Decoding Matters
Different decoding parameters produce different outputs from the same model.
| Use Case | Recommended Settings |
|---|---|
| Factual Q&A | Low temperature (0.1-0.3) |
| Creative writing | Higher temperature (0.7-0.9) |
| Code generation | Low temperature + top-p |
| Brainstorming | Higher temperature + top-k |
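As a rough illustration, these presets might be expressed as request parameters; the exact names and defaults vary by provider, so treat the values below as hypothetical starting points:

```python
# Hypothetical decoding presets; parameter names mirror the common
# temperature / top_p / top_k knobs but are not tied to any specific API.
DECODING_PRESETS = {
    "factual_qa":       {"temperature": 0.2, "top_p": 1.0},
    "creative_writing": {"temperature": 0.8, "top_p": 0.95},
    "code_generation":  {"temperature": 0.2, "top_p": 0.9},
    "brainstorming":    {"temperature": 0.9, "top_k": 50},
}
```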
Decoding and AI Visibility
Decoding is downstream from retrieval and context assembly.
By the time decoding starts:
- Your content is either in context or not
- How that content is interpreted is already fixed
- The model has committed to its understanding
Decoding determines how the response is expressed, not what information it contains.
However, decoding can affect:
- Whether your brand name gets mentioned (vs. paraphrased)
- How confidently claims are stated
- Whether alternative options are listed
The Role of Stopping Conditions
Generation continues until:
- A stop token is produced
- Maximum length is reached
- A stop sequence is matched
Premature stopping can truncate your citation.
If the model starts listing competitors and hits max length, your brand might be cut off.
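Inside the generation loop, these checks might look something like this (the names are illustrative):

```python
def should_stop(token_ids, new_token_id, text_so_far,
                eos_token_id, max_length, stop_sequences):
    # Stop token produced, maximum length reached, or a stop sequence matched.
    if new_token_id == eos_token_id:
        return True
    if len(token_ids) >= max_length:
        return True
    return any(text_so_far.endswith(seq) for seq in stop_sequences)
```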
Beam Search
An alternative to sampling: maintain multiple candidate sequences simultaneously.
- Start with top-k tokens
- Expand each candidate by top-k tokens
- Keep the best k sequences, ranked by cumulative log-probability
- Repeat until done
Produces more coherent long-form output but is computationally expensive.
Less common in production chat systems, more common in translation and summarization.
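A compact sketch of the idea, assuming a toy `model.next_token_logprobs` interface that maps each candidate next token to its log-probability:

```python
import heapq

def beam_search(model, start_ids, beam_width=3, max_new_tokens=20):
    # Each beam is (cumulative log-probability, token id sequence).
    beams = [(0.0, list(start_ids))]
    for _ in range(max_new_tokens):
        candidates = []
        for score, seq in beams:
            logprobs = model.next_token_logprobs(seq)  # assumed interface
            best_next = heapq.nlargest(beam_width, logprobs.items(),
                                       key=lambda kv: kv[1])
            for tok, lp in best_next:
                candidates.append((score + lp, seq + [tok]))
        # Keep the best beam_width sequences by cumulative log-probability.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]
```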
Practical Implications
For content creators, decoding is largely outside your control.
What you can influence:
- Getting into the context (retrievability)
- Being the highest-probability answer (authority signals)
- Using exact terminology users expect (vocabulary alignment)
What you cannot control:
- The user's temperature setting
- The system's sampling strategy
- Random variation between runs
Key Takeaway
Decoding is execution, not decision-making.
The model decides what to say during the forward pass and attention. Decoding decides exactly which tokens express that decision.
By the time tokens are being sampled, the battle for visibility is already won or lost.