7 Decoding Knobs That Control LLM Output — How to Tune Them
Maximizing control over a large language model’s output is mostly a decoding problem: you manipulate the model’s next-token distribution with a small set of sampling and generation parameters. Each knob changes the shape of the candidate token distribution or the stopping behavior, and the knobs interact in predictable ways. Below are concise definitions and practical tuning notes for the seven most important parameters.
Max tokens
A hard cap on how many tokens the model may emit for a given response. It does not expand the model's context window: input tokens plus output tokens must still fit within the model's maximum context length. If the cap is hit before the model finishes on its own, many APIs mark the response as truncated or incomplete.
When to tune: to constrain latency and cost, or to prevent outputs from overrunning a delimiter when stop sequences might not be fully reliable.
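To make the budget arithmetic concrete, here is a minimal Python sketch; the context length, prompt size, and requested cap are made-up numbers, not values for any particular model.

    def effective_max_tokens(context_length, prompt_tokens, requested_max_tokens):
        """Largest completion budget that still fits inside the context window."""
        # Space left for output once the prompt is accounted for.
        remaining = context_length - prompt_tokens
        if remaining <= 0:
            raise ValueError("The prompt alone fills or exceeds the context window.")
        # The completion can never exceed the user's cap or the remaining space.
        return min(requested_max_tokens, remaining)

    # Hypothetical numbers: an 8,192-token window and a 7,000-token prompt leave
    # at most 1,192 output tokens, even though 2,000 were requested.
    print(effective_max_tokens(8192, 7000, 2000))  # -> 1192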
Temperature
Temperature rescales logits before softmax: softmax(z/T)_i = e^{z_i/T} / Σ_j e^{z_j/T}. Lower temperature sharpens the distribution and makes sampling more deterministic; higher temperature flattens the distribution and increases randomness. Public APIs typically expose a range around 0–2. Use low temperatures for analytical or factual tasks and higher temperatures for creative or exploratory generation.
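A minimal NumPy sketch of the rescaling above, using made-up logits, shows how T below 1 sharpens the distribution and T above 1 flattens it:

    import numpy as np

    def softmax_with_temperature(logits, temperature):
        """softmax(z / T): lower T sharpens the distribution, higher T flattens it."""
        scaled = np.asarray(logits, dtype=float) / temperature
        scaled -= scaled.max()              # subtract the max for numerical stability
        exps = np.exp(scaled)
        return exps / exps.sum()

    logits = [2.0, 1.0, 0.5, -1.0]          # made-up next-token logits
    for t in (0.2, 1.0, 2.0):
        print(t, np.round(softmax_with_temperature(logits, t), 3))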
Nucleus sampling (top_p)
Nucleus sampling selects from the smallest set of candidate tokens whose cumulative probability mass is at least p. Trimming the long low-probability tail avoids the incoherent continuations that tail tokens tend to produce, while the remaining randomness helps avoid the repetition and blandness of pure likelihood maximization. Typical open-ended text settings use top_p in the 0.9–0.95 range. A common vendor recommendation is to tune either temperature or top_p, but not both at once, so the effective randomness stays interpretable.
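The following sketch filters an illustrative probability vector down to the nucleus whose cumulative mass reaches p and renormalizes it; the probabilities are invented for the example.

    import numpy as np

    def nucleus_filter(probs, top_p):
        """Keep the smallest set of tokens whose cumulative probability >= top_p."""
        probs = np.asarray(probs, dtype=float)
        order = np.argsort(probs)[::-1]                   # most likely tokens first
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1   # first index reaching top_p
        keep = order[:cutoff]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        return filtered / filtered.sum()                  # renormalize over the nucleus

    probs = [0.50, 0.25, 0.10, 0.08, 0.05, 0.02]    # made-up next-token probabilities
    print(nucleus_filter(probs, 0.9))               # the 0.05 and 0.02 tail tokens drop out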
Top-k sampling
Top-k restricts candidates at each step to the k most likely tokens, renormalizes their probabilities, and samples. Historically used to improve novelty relative to beam search, top-k is often combined with temperature or top_p in modern pipelines. Reasonable k ranges for balanced diversity are roughly 5–50. When both top_k and top_p are set, many libraries apply k-filtering first and then p-filtering as an implementation detail.
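A matching top-k sketch, again with invented probabilities; in a pipeline that also uses top_p, a nucleus filter like the one above would typically run on this result afterwards.

    import numpy as np

    def top_k_filter(probs, k):
        """Zero out everything outside the k most likely tokens, then renormalize."""
        probs = np.asarray(probs, dtype=float)
        keep = np.argsort(probs)[::-1][:k]          # indices of the k largest entries
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        return filtered / filtered.sum()

    probs = [0.40, 0.30, 0.15, 0.10, 0.04, 0.01]    # made-up next-token probabilities
    print(top_k_filter(probs, 3))                   # only the three most likely tokens survive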
Frequency penalty
A frequency penalty reduces the probability of tokens in proportion to how often they have already appeared in the generated context, discouraging verbatim repetition. Vendors such as Azure and OpenAI document ranges around -2.0 to +2.0, with positive values reducing repetition and negative values encouraging it. Apply it when long generations loop or echo the same phrasing, which is common in list-heavy or poetic output.
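One way to picture the mechanism is a count-proportional subtraction on the raw logits before softmax; the token ids and penalty strength below are illustrative, and vendor implementations differ in detail.

    from collections import Counter
    import numpy as np

    def apply_frequency_penalty(logits, generated_ids, alpha_frequency):
        """Subtract alpha * count from the logit of every token already generated."""
        logits = np.asarray(logits, dtype=float).copy()
        for token_id, count in Counter(generated_ids).items():
            logits[token_id] -= alpha_frequency * count   # scales with repetitions
        return logits

    logits = np.array([3.0, 2.5, 1.0, 0.5])    # made-up logits for a 4-token vocabulary
    generated = [0, 0, 0, 1]                   # token 0 emitted three times, token 1 once
    print(apply_frequency_penalty(logits, generated, alpha_frequency=0.8))
    # token 0 falls from 3.0 to 0.6; token 1 falls from 2.5 to 1.7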
Presence penalty
The presence penalty is a one-time, per-token penalty: any token that has already appeared at least once is penalized by a flat amount, regardless of how many times it occurred, which encourages the model to introduce new tokens or topics. Documented ranges are similar to the frequency penalty (about -2.0 to +2.0). Start near zero and nudge upward if the model remains too on-rails and resists exploring alternatives.
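Continuing the illustrative sketch from the frequency penalty, a presence penalty subtracts a flat amount from any token seen at least once, no matter how often:

    import numpy as np

    def apply_presence_penalty(logits, generated_ids, alpha_presence):
        """Subtract a flat alpha from every token that has appeared at least once."""
        logits = np.asarray(logits, dtype=float).copy()
        for token_id in set(generated_ids):            # once per distinct token
            logits[token_id] -= alpha_presence
        return logits

    logits = np.array([3.0, 2.5, 1.0, 0.5])    # same made-up 4-token vocabulary
    generated = [0, 0, 0, 1]                   # token 0 seen three times, token 1 once
    print(apply_presence_penalty(logits, generated, alpha_presence=0.8))
    # tokens 0 and 1 both drop by exactly 0.8, regardless of how often they appeared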
Stop sequences
Stop sequences are strings that force the decoder to halt when they appear, without emitting the stop text itself. They are valuable for bounding structured outputs like JSON objects or distinct sections of text. Many APIs allow multiple stop strings. Choose unambiguous delimiters that are unlikely to appear in normal text and pair stop sequences with a max_tokens cap for robust termination.
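The sketch below expresses the stop-handling logic in client-side terms: scan the accumulated text for any stop string, cut at the earliest match, and exclude the stop text from the result. The "###" delimiter and the sample output are arbitrary examples.

    def apply_stop_sequences(text, stop_sequences):
        """Truncate text at the earliest stop sequence, excluding the stop text itself."""
        earliest = len(text)
        for stop in stop_sequences:
            index = text.find(stop)
            if index != -1:
                earliest = min(earliest, index)
        return text[:earliest]

    raw = '{"name": "Ada"}\n###\nUnwanted trailing commentary...'
    print(apply_stop_sequences(raw, ["###", "END"]))   # prints only the JSON object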
How these parameters interact
Temperature versus top_p and top_k: raising temperature spreads probability mass into the tail; top_p and top_k then crop that tail. Because of this coupling, many providers advise changing one randomness control at a time to keep behavior predictable; the sketch after these notes shows one way the full pipeline fits together.
Degeneration control: cutting the unreliable low-probability tail with nucleus sampling prevents incoherent continuations, and sampling rather than pure likelihood maximization avoids repetition and blandness; combining it with a light frequency penalty works well for longer outputs.
Latency and cost: max_tokens is the most direct lever for controlling cost and latency. Streaming responses can improve perceived latency but do not reduce token-based cost.
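To make the ordering concrete, here is a single-step sampler that strings the knobs above together: penalties on the raw logits, then temperature, then top-k, then top-p, then sampling. This ordering is an assumption chosen for illustration; real libraries document their own pipelines and may differ.

    from collections import Counter
    import numpy as np

    def sample_next_token(logits, generated_ids, temperature=1.0, top_k=50,
                          top_p=0.95, freq_penalty=0.0, presence_penalty=0.0, rng=None):
        """One decoding step combining the knobs above; the ordering is one common choice."""
        if rng is None:
            rng = np.random.default_rng()
        logits = np.asarray(logits, dtype=float).copy()

        # 1. Repetition penalties on the raw logits.
        for token_id, count in Counter(generated_ids).items():
            logits[token_id] -= freq_penalty * count + presence_penalty

        # 2. Temperature rescaling, then softmax.
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()

        # 3. Top-k: keep only the k most likely tokens.
        keep = np.argsort(probs)[::-1][:top_k]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()

        # 4. Top-p: keep the smallest nucleus reaching cumulative mass top_p.
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        nucleus = np.zeros_like(probs)
        nucleus[order[:cutoff]] = probs[order[:cutoff]]
        nucleus /= nucleus.sum()

        # 5. Sample from the filtered, renormalized distribution.
        return int(rng.choice(len(nucleus), p=nucleus))

    # Made-up 6-token vocabulary; token 2 has already been generated twice.
    print(sample_next_token(np.array([2.0, 1.5, 1.2, 0.3, -0.5, -1.0]),
                            generated_ids=[2, 2], temperature=0.7,
                            top_k=4, top_p=0.9, freq_penalty=0.5))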
Model-specific differences
Some reasoning-oriented endpoints expose fewer or different knobs, and some models may ignore particular parameters. Always check model-specific documentation before porting generation settings between providers.
References
The guidance above is grounded in the decoding and sampling literature and in vendor documentation for public generation APIs. Representative references include Holtzman et al. on nucleus sampling, early top-k and beam search work, and documentation from Hugging Face, OpenAI, Anthropic, Google Vertex AI, and Microsoft Azure.