Master LLM Behavior: 5 Key Parameters with Practical Examples
A practical walkthrough of five key LLM parameters with code examples showing how each influences output behavior and diversity.
Large language models expose several parameters that let you steer their behavior and control how they generate responses. When outputs aren’t what you expect, the culprit is often one or more of these settings. Below are five commonly used parameters — max_completion_tokens, temperature, top_p, presence_penalty, and frequency_penalty — explained with hands-on examples.
Installing dependencies
Use pip to install the required packages before running the examples.
pip install openai pandas matplotlib

Loading your OpenAI API key
Set the API key as an environment variable so the client can authenticate.
import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')

Initializing the model
Create a client instance and specify the model you want to use.
from openai import OpenAI
model="gpt-4.1"
client = OpenAI()

Max Tokens
Max Tokens (max_completion_tokens in the API) caps how many tokens the model can generate in a single response. If the model hits this limit, generation stops and the answer is cut off mid-thought. Smaller values (e.g., 16) force very short answers; larger values (e.g., 80) give the model room to elaborate, explain, or format its response.
Example: vary max_completion_tokens to see how answer length changes.
prompt = "What is the most popular French cheese?"
for tokens in [16, 30, 80]:
    print(f"\n--- max_completion_tokens = {tokens} ---")
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "developer", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        max_completion_tokens=tokens
    )
    print(response.choices[0].message.content)
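To detect truncation programmatically, each choice in the chat completions response carries a finish_reason, which is reported as "length" when the token limit cut the answer off. A minimal check on the last response from the loop above (the same lines can also go inside the loop):

# Inspect why generation stopped: "length" means the token limit was hit
finish_reason = response.choices[0].finish_reason
if finish_reason == "length":
    print("Answer was cut off by max_completion_tokens.")
else:
    print(f"Generation finished normally (finish_reason = {finish_reason}).")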
Temperature
Temperature controls randomness and diversity. Low temperature values make the output more deterministic and focused on high-probability tokens. Higher values encourage creativity by flattening the token probability distribution and letting the model explore less likely tokens.
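To build intuition for what "flattening the distribution" means, here is a small standalone sketch (plain NumPy, independent of the API, with made-up logit values) that applies different temperatures to the same candidate-token scores:

import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide the logits by the temperature before applying softmax
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exp / exp.sum()

logits = [4.0, 3.0, 2.0, 1.0]  # hypothetical scores for four candidate tokens
for t in [0.2, 1.0, 1.5]:
    print(t, np.round(softmax_with_temperature(logits, t), 3))
# Low temperature concentrates probability on the top token;
# high temperature spreads it more evenly across the alternatives.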
The example below asks for 10 different single-word responses to the same prompt across a range of temperatures, so you can observe how diversity increases as temperature rises.
prompt = "What is one intriguing place worth visiting? Give a single-word answer and think globally."
temperatures = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]
n_choices = 10
results = {}
for temp in temperatures:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=temp,
        n=n_choices
    )
    # Collect all n responses in a list
    results[temp] = [response.choices[i].message.content.strip() for i in range(n_choices)]
# Display results
for temp, responses in results.items():
    print(f"\n--- temperature = {temp} ---")
    print(responses)

At moderate temperatures (around 0.6) responses become noticeably more varied. At very high temperatures (e.g., 1.5) you may see a broader spread of answers like Kyoto or Machu Picchu.
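Since pandas and matplotlib were installed earlier, one way to make the trend visible (a sketch that assumes the results dictionary was populated by the loop above) is to count the unique answers per temperature and plot them:

import pandas as pd
import matplotlib.pyplot as plt

# Number of distinct answers out of the 10 samples at each temperature
diversity = pd.Series({temp: len(set(responses)) for temp, responses in results.items()})
print(diversity)

diversity.plot(kind="bar", xlabel="temperature", ylabel="unique answers out of 10")
plt.tight_layout()
plt.show()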
Top P (nucleus sampling)
Top P constrains sampling to the smallest set of tokens whose cumulative probability reaches a threshold p. This lets the model focus on the most likely tokens while still allowing some variability. Note that when temperature is 0 (deterministic), Top P has no effect.
The process: apply temperature, keep only tokens up to cumulative probability p, renormalize, and sample.
In this example Top P is set to 0.5, meaning only tokens that together account for 50% of the probability mass remain. If one token dominates that mass, Top P can force the model to repeatedly pick that token.
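Here is a small standalone sketch of that filtering step, using made-up probabilities rather than real model output, to show how top_p = 0.5 can shrink the candidate pool to a single dominant token:

import numpy as np

def top_p_filter(probs, p):
    # Sort tokens by probability and keep the smallest prefix whose cumulative mass reaches p
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(np.array(probs)[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    kept = order[:cutoff]
    kept_probs = np.array(probs)[kept]
    return kept, kept_probs / kept_probs.sum()  # renormalize over the surviving tokens

probs = [0.55, 0.20, 0.15, 0.10]  # hypothetical probabilities for four candidate answers
kept, renormalized = top_p_filter(probs, 0.5)
print(kept, renormalized)  # only the dominant token survives, so it is picked every time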
prompt = "What is one intriguing place worth visiting? Give a single-word answer and think globally."
temperatures = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.5]
n_choices = 10
results_ = {}
for temp in temperatures:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=temp,
        n=n_choices,
        top_p=0.5
    )
    # Collect all n responses in a list
    results_[temp] = [response.choices[i].message.content.strip() for i in range(n_choices)]
# Display results
for temp, responses in results_.items():
    print(f"\n--- temperature = {temp} ---")
    print(responses)

If a single answer like "Petra" accounts for more than 50% of the probability mass, Top P = 0.5 will filter out alternatives and produce the same answer repeatedly.
Frequency Penalty
Frequency Penalty discourages the model from repeating the same tokens multiple times. It is typically in the range -2 to 2, with the default 0. Higher positive values reduce repetition and encourage variety. Negative values can increase repetition.
Example: ask for 10 fantasy book titles and vary frequency_penalty to see how repetition changes.
prompt = "List 10 possible titles for a fantasy book. Give the titles only and each title on a new line."
frequency_penalties = [-2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0]
results = {}
for fp in frequency_penalties:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        frequency_penalty=fp,
        temperature=0.2
    )
    text = response.choices[0].message.content
    items = [line.strip("- ").strip() for line in text.split("\n") if line.strip()]
    results[fp] = items
# Display results
for fp, items in results.items():
    print(f"\n--- frequency_penalty = {fp} ---")
    print(items)

Lower penalties tend to produce repeated patterns and familiar titles. As you increase the penalty, the model generates more varied and creative names.
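A quick way to quantify the repetition (a small sketch assuming the results dictionary holds the title lists from the loop above) is to count how often individual words recur across the ten titles:

from collections import Counter

for fp, items in results.items():
    words = [word.lower() for title in items for word in title.split()]
    repeated = {word: count for word, count in Counter(words).items() if count > 1}
    print(f"frequency_penalty = {fp}: repeated words -> {repeated}")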
Presence Penalty
Presence Penalty discourages the reuse of tokens that have already appeared anywhere in the generated output. It also ranges from -2 to 2 with a default of 0. Unlike the frequency penalty, which grows with how often a token has been used, the presence penalty is applied once: any token that has appeared at least once gets the same fixed reduction to its chance of appearing again.
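OpenAI's documentation describes the underlying adjustment roughly as subtracting count × frequency_penalty, plus a one-time presence_penalty if the token has appeared at all, from a token's logit before sampling. Here is a toy sketch of that idea, not the actual implementation:

def penalized_logit(logit, count, frequency_penalty=0.0, presence_penalty=0.0):
    # count = how many times this token already appears in the generated text
    return logit - count * frequency_penalty - (1.0 if count > 0 else 0.0) * presence_penalty

# A token already used 3 times is pushed down much harder by a frequency penalty
print(penalized_logit(5.0, count=3, frequency_penalty=1.0))  # 2.0
print(penalized_logit(5.0, count=3, presence_penalty=1.0))   # 4.0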
The same book-title prompt can be used to compare presence penalties.
prompt = "List 10 possible titles for a fantasy book. Give the titles only and each title on a new line."
presence_penalties = [-2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0]
results = {}
for pp in presence_penalties:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        presence_penalty=pp,
        temperature=0.2
    )
    text = response.choices[0].message.content
    items = [line.strip("- ").strip() for line in text.split("\n") if line.strip()]
    results[pp] = items
# Display results
for pp, items in results.items():
    print(f"\n--- presence_penalty = {pp} ---")
    print(items)

With higher presence penalties, the output becomes more diverse and avoids repeating the same motifs. At extreme values you may still see the most dominant options in the first few items, but the remainder will be far more varied.
Where to find the full code
Check the original post or the project GitHub for full code, notebooks, and additional tutorials. Follow the project on social media or join the community channels if you want updates and examples.