Run a Local GPT-Style Chatbot with Hugging Face Transformers
A practical walkthrough to run a GPT-style conversational agent locally using Hugging Face Transformers, complete with code, prompt patterns, and a demo.
Setup and dependencies
To run a GPT-style conversational agent locally you need the Hugging Face Transformers stack, PyTorch, and a few utility libraries. Install the essentials and import the modules you will use:
!pip install transformers accelerate sentencepiece --quiet
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List, Tuple, Optional
import textwrap, json, os
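Before configuring anything, it can help to confirm which device the model will actually land on. This quick check is optional and not part of the core walkthrough; it only reports whether a CUDA GPU is visible, which determines whether the model loads in float16 on GPU or float32 on CPU later on.
# Optional: report the available hardware before loading the model.
if torch.cuda.is_available():
    gpu = torch.cuda.get_device_properties(0)
    print(f"CUDA GPU detected: {gpu.name} ({gpu.total_memory / 1e9:.1f} GB)")
else:
    print("No CUDA GPU detected; the model will run on CPU in float32 (slower).")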
Model configuration and system prompt
Choose a lightweight instruction-tuned model that understands conversational prompts and define a system prompt to steer behavior. Set token generation limits.
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
BASE_SYSTEM_PROMPT = (
    "You are a custom GPT running locally. "
    "Follow user instructions carefully. "
    "Be concise and structured. "
    "If something is unclear, say it is unclear. "
    "Prefer practical examples over corporate examples unless explicitly asked. "
    "When asked for code, give runnable code."
)
MAX_NEW_TOKENS = 256
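If Phi-3-mini is too heavy for your machine, any small instruction-tuned causal LM can be substituted, as long as the prompt format matches what that model expects. As an optional convenience (the LOCAL_GPT_* environment variable names below are made up for illustration), the model name and token budget can also be made overridable without editing the script:
# Optional: allow overriding the defaults through environment variables.
MODEL_NAME = os.environ.get("LOCAL_GPT_MODEL", MODEL_NAME)
MAX_NEW_TOKENS = int(os.environ.get("LOCAL_GPT_MAX_NEW_TOKENS", MAX_NEW_TOKENS))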
Loading the model into memory
Load the tokenizer and model from Hugging Face. The device and dtype are chosen automatically so the model takes advantage of a GPU when one is available.
print("Loading model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)
model.eval()
print("Model loaded.")Conversation format and prompt builder
Conversation format and prompt builder
Keep a structured conversation history that includes the system prompt and alternating user/assistant turns. Serialize the history into the model's expected format with clear role markers.
ConversationHistory = List[Tuple[str, str]]
history: ConversationHistory = [("system", BASE_SYSTEM_PROMPT)]
def wrap_text(s: str, w: int = 100) -> str:
    return "\n".join(textwrap.wrap(s, width=w))

def build_chat_prompt(history: ConversationHistory, user_msg: str) -> str:
    prompt_parts = []
    for role, content in history:
        if role == "system":
            prompt_parts.append(f"<|system|>\n{content}\n")
        elif role == "user":
            prompt_parts.append(f"<|user|>\n{content}\n")
        elif role == "assistant":
            prompt_parts.append(f"<|assistant|>\n{content}\n")
    prompt_parts.append(f"<|user|>\n{user_msg}\n")
    prompt_parts.append("<|assistant|>\n")
return "".join(prompt_parts)Lightweight local tools and routing
Lightweight local tools and routing
Add small built-in tools to simulate searches or documentation extracts. This router checks for specific prefixes and returns short contextual responses that the model can consume.
def local_tool_router(user_msg: str) -> Optional[str]:
    msg = user_msg.strip().lower()
    if msg.startswith("search:"):
        query = user_msg.split(":", 1)[-1].strip()
        return f"Search results about '{query}':\n- Key point 1\n- Key point 2\n- Key point 3"
    if msg.startswith("docs:"):
        topic = user_msg.split(":", 1)[-1].strip()
        return f"Documentation extract on '{topic}':\n1. The agent orchestrates tools.\n2. The model consumes output.\n3. Responses become memory."
    return None
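A quick sanity check of the router, runnable without the model, shows which prefixes trigger which stubbed tool:
# The router is a simple prefix dispatcher; unprefixed messages return None and add no tool context.
print(local_tool_router("search: local llm agents"))  # stubbed search results
print(local_tool_router("docs: prompt builders"))      # stubbed documentation extract
print(local_tool_router("hello there"))                # None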
Generating replies and persistence
Compose the final prompt by combining the conversation history and any tool-generated context. Tokenize and generate with the model, decode the output, extract the assistant reply, and append both turns to the history. Include utilities to save and load chat history.
def generate_reply(history: ConversationHistory, user_msg: str) -> str:
    tool_context = local_tool_router(user_msg)
    if tool_context:
        user_msg = user_msg + "\n\nUseful context:\n" + tool_context
    prompt = build_chat_prompt(history, user_msg)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=True,
            top_p=0.9,
            temperature=0.6,
            pad_token_id=tokenizer.eos_token_id
        )
    # Decode only the newly generated tokens; slicing the decoded string is fragile
    # because skip_special_tokens can strip the role markers used in the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    reply = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    history.append(("user", user_msg))
    history.append(("assistant", reply))
    return reply

def save_history(history: ConversationHistory, path: str = "chat_history.json") -> None:
    data = [{"role": r, "content": c} for (r, c) in history]
    with open(path, "w") as f:
        json.dump(data, f, indent=2)

def load_history(path: str = "chat_history.json") -> ConversationHistory:
    if not os.path.exists(path):
        return [("system", BASE_SYSTEM_PROMPT)]
    with open(path, "r") as f:
        data = json.load(f)
    return [(item["role"], item["content"]) for item in data]
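One practical caveat: Phi-3-mini-4k-instruct has a 4k-token context window, and the history above grows without bound. The helper below is a hedged sketch (the name trim_history and the 3000-token budget are arbitrary choices for illustration) that drops the oldest user/assistant pair until the serialized prompt fits; one way to use it is to reassign history = trim_history(history) before each generate_reply call.
# Keep the system prompt, drop the oldest user/assistant pair while the prompt is too long.
def trim_history(history: ConversationHistory, max_prompt_tokens: int = 3000) -> ConversationHistory:
    trimmed = list(history)
    while len(trimmed) > 3:  # always keep the system prompt plus at least one exchange
        prompt_len = len(tokenizer(build_chat_prompt(trimmed, ""))["input_ids"])
        if prompt_len <= max_prompt_tokens:
            break
        trimmed = [trimmed[0]] + trimmed[3:]  # remove the oldest user/assistant pair
    return trimmed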
Demo and interactive loop
Run a couple of demo prompts to validate behavior, and optionally start an interactive input loop to chat with the assistant.
print("\n--- Demo turn 1 ---")
demo_reply_1 = generate_reply(history, "Explain what this custom GPT setup is doing in 5 bullet points.")
print(wrap_text(demo_reply_1))
print("\n--- Demo turn 2 ---")
demo_reply_2 = generate_reply(history, "search: agentic ai with local models")
print(wrap_text(demo_reply_2))
def interactive_chat():
    print("\nChat ready. Type 'exit' to stop.")
    while True:
        try:
            user_msg = input("\nUser: ").strip()
        except EOFError:
            break
        if user_msg.lower() in ("exit", "quit", "q"):
            break
        reply = generate_reply(history, user_msg)
        print("\nAssistant:\n" + wrap_text(reply))
# interactive_chat()
print("\nCustom GPT initialized successfully.")What this setup gives you
What this setup gives you
This tutorial shows how to orchestrate a local instruction-tuned model into a conversational agent with a clear system role, conversation history, lightweight tool routing, and persistence. You get a runnable pattern to experiment with different prompts, tools, and local data integration, while keeping inference entirely on your own machine after the initial model download.