Decoding: How AI Generates Text

Youssef El Ramy · 4 min read

What Decoding Is

After processing input through all transformer layers, the model must generate output.

Decoding is the process of selecting tokens one at a time to construct the response.

Each token selection is a probabilistic choice from the model's vocabulary.


The Autoregressive Loop

LLMs generate text autoregressively:

  1. Process entire context
  2. Predict probability distribution over next token
  3. Select one token
  4. Append to context
  5. Repeat until done

Each new token becomes part of the context for the next prediction.

This is why generation is sequential, even when input processing is parallelized.
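
In code, the loop looks something like this minimal sketch, assuming a hypothetical model() function that maps a token sequence to next-token logits (not any particular library's API):

  import numpy as np

  def generate(model, tokens, eos_id, max_new_tokens=100):
      # Autoregressive generation: one token per iteration.
      for _ in range(max_new_tokens):
          logits = model(tokens)                 # steps 1-2: process context, score every vocab token
          probs = np.exp(logits - logits.max())  # numerically stable softmax
          probs /= probs.sum()
          next_id = int(np.argmax(probs))        # step 3: select one token (greedy here; see strategies below)
          tokens.append(next_id)                 # step 4: append to context
          if next_id == eos_id:                  # step 5: stop at end-of-sequence
              break
      return tokens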


Probability Distributions

At each step, the model outputs a probability for every token in its vocabulary.

Example (simplified):

After "The capital of France is", the model might assign:

  • "Paris" → 0.89
  • "a" → 0.03
  • "the" → 0.02
  • "Lyon" → 0.01
  • ... (50,000+ other tokens with tiny probabilities)

The decoding strategy determines how to convert these probabilities into a selection.
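
To make that concrete, here is the simplified example as runnable code, with the long tail of remaining tokens lumped into a single placeholder (the numbers are illustrative, not real model output):

  import numpy as np

  # Illustrative next-token distribution after "The capital of France is"
  vocab = ["Paris", "a", "the", "Lyon", "<other>"]
  probs = np.array([0.89, 0.03, 0.02, 0.01, 0.05])  # <other> stands in for ~50,000 tiny-probability tokens

  # A decoding strategy turns this distribution into one choice:
  greedy_pick = vocab[int(np.argmax(probs))]       # always "Paris"
  sampled_pick = np.random.choice(vocab, p=probs)  # usually "Paris", occasionally something else
  print(greedy_pick, sampled_pick)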


Decoding Strategies

Greedy Decoding

Always pick the highest-probability token.

  • Fast
  • Deterministic
  • Often repetitive and boring
  • Can get stuck in loops
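
In code, greedy decoding is a single argmax over the logits, as in this minimal sketch:

  import numpy as np

  def greedy(logits):
      # Deterministic: always the single highest-scoring token.
      # Because the same context always yields the same pick,
      # a repeated context can lock the model into a repetition loop.
      return int(np.argmax(logits))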

Temperature Sampling

Scale probabilities before sampling:

  • Temperature < 1.0 → sharper distribution → more predictable
  • Temperature > 1.0 → flatter distribution → more random
  • Temperature = 0 → treated as greedy (the scaling is undefined at zero, so implementations special-case it)
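
A minimal sketch of temperature sampling, again assuming logits from a model as above:

  import numpy as np

  def sample_with_temperature(logits, temperature=1.0):
      # Scale logits by 1/temperature before softmax, then sample.
      # temperature < 1 sharpens the distribution; > 1 flattens it.
      if temperature == 0:
          return int(np.argmax(logits))      # zero is conventionally special-cased as greedy
      scaled = logits / temperature
      probs = np.exp(scaled - scaled.max())  # numerically stable softmax
      probs /= probs.sum()
      return int(np.random.choice(len(probs), p=probs))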

Top-k Sampling

Only consider the k most likely tokens, then sample from those.

  • k=1 → greedy
  • k=50 → reasonable variety
  • k=vocabulary size → pure sampling
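
A sketch of top-k, under the same assumptions:

  import numpy as np

  def top_k_sample(logits, k=50):
      # Keep only the k highest-scoring tokens, renormalize, and sample.
      top_ids = np.argsort(logits)[-k:]      # indices of the k most likely tokens
      top_logits = logits[top_ids]
      probs = np.exp(top_logits - top_logits.max())
      probs /= probs.sum()
      return int(np.random.choice(top_ids, p=probs))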

Top-p (Nucleus) Sampling

Include tokens until cumulative probability reaches p.

  • p=0.9 → include tokens accounting for 90% of probability mass
  • Adapts to distribution shape
  • Most commonly used in production
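
And a sketch of top-p, where the candidate set grows or shrinks with the shape of the distribution:

  import numpy as np

  def top_p_sample(logits, p=0.9):
      # Keep the smallest set of tokens whose cumulative probability reaches p.
      probs = np.exp(logits - logits.max())
      probs /= probs.sum()
      order = np.argsort(probs)[::-1]                    # most likely first
      cumulative = np.cumsum(probs[order])
      cutoff = int(np.searchsorted(cumulative, p)) + 1   # nucleus size adapts to the distribution
      nucleus = order[:cutoff]
      nucleus_probs = probs[nucleus] / probs[nucleus].sum()
      return int(np.random.choice(nucleus, p=nucleus_probs))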

Why Decoding Matters

Different decoding parameters produce different outputs from the same model.

  Use case              Recommended settings
  Factual Q&A           Low temperature (0.1-0.3)
  Creative writing      Higher temperature (0.7-0.9)
  Code generation       Low temperature + top-p
  Brainstorming         Higher temperature + top-k
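
Expressed with the sampling helpers sketched above, those presets might look like the following; the names and exact values are illustrative, not prescriptive:

  # Hypothetical presets built on the helpers above.
  PRESETS = {
      "factual_qa":       {"temperature": 0.2},
      "creative_writing": {"temperature": 0.8},
      "code_generation":  {"temperature": 0.2, "top_p": 0.95},
      "brainstorming":    {"temperature": 0.9, "top_k": 50},
  }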

Decoding and AI Visibility

Decoding is downstream from retrieval and context assembly.

By the time decoding starts:

  • Your content is either in context or not
  • Meaning interpretation is locked
  • The model has committed to its understanding

Decoding determines how the response is expressed, not what information it contains.

However, decoding can affect:

  • Whether your brand name gets mentioned (vs. paraphrased)
  • How confidently claims are stated
  • Whether alternative options are listed

The Role of Stopping Conditions

Generation continues until:

  • A stop token is produced
  • Maximum length is reached
  • A stop sequence is matched

Premature stopping can truncate your citation.

If the model starts listing competitors and hits max length, your brand might be cut off.
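
A minimal sketch of the check a generation loop runs after each new token; the parameter names here are illustrative, not any specific API:

  def should_stop(tokens, text, eos_id, max_tokens, stop_sequences):
      # Check the three stopping conditions after each generated token.
      if tokens[-1] == eos_id:           # a stop token was produced
          return True
      if len(tokens) >= max_tokens:      # maximum length reached (may truncate mid-list)
          return True
      return any(text.endswith(s) for s in stop_sequences)  # a stop sequence matched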


Beam Search

An alternative to sampling: maintain multiple candidate sequences simultaneously.

  1. Start with the k most likely first tokens
  2. Expand each candidate with its k most likely continuations
  3. Keep the best k sequences by total probability
  4. Repeat until done

Produces more coherent long-form output but is computationally expensive.

Less common in production chat systems, more common in translation and summarization.
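
A toy sketch of the procedure, reusing the hypothetical model() from the autoregressive loop above:

  import numpy as np

  def log_softmax(logits):
      shifted = logits - logits.max()
      return shifted - np.log(np.exp(shifted).sum())

  def beam_search(model, tokens, eos_id, beam_width=3, max_new_tokens=20):
      # Toy beam search: track the beam_width best partial sequences
      # by cumulative log-probability.
      beams = [(tokens, 0.0)]
      for _ in range(max_new_tokens):
          candidates = []
          for seq, score in beams:
              if seq and seq[-1] == eos_id:      # finished beams carry over unchanged
                  candidates.append((seq, score))
                  continue
              log_probs = log_softmax(model(seq))
              for tok in np.argsort(log_probs)[-beam_width:]:  # expand by top continuations
                  candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
          beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
          if all(seq[-1] == eos_id for seq, _ in beams):
              break
      return beams[0][0]                         # best sequence by total probability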


Practical Implications

For content creators, decoding is largely outside your control.

What you can influence:

  • Getting into the context (retrievability)
  • Being the highest-probability answer (authority signals)
  • Using exact terminology users expect (vocabulary alignment)

What you cannot control:

  • The user's temperature setting
  • The system's sampling strategy
  • Random variation between runs

Key Takeaway

Decoding is execution, not decision-making.

The model decides what to say during the forward pass, as attention builds its interpretation of the context. Decoding decides exactly which tokens express that decision.

By the time tokens are being sampled, the battle for visibility is already won or lost.

About the author
Youssef El Ramy

Founder of VisibilityLens. Analyzes how AI models interpret and cite website content, publishing independent research on companies like Gong, Loom, and Basecamp.
