Build a Brain‑Inspired Hierarchical Reasoning Agent Locally with Hugging Face
Overview
This tutorial demonstrates how to recreate the spirit of a Hierarchical Reasoning Model (HRM) using a free Hugging Face model that can run locally. The workflow structures reasoning into planning, code-based solving, critique, and synthesis. By breaking tasks into subgoals and executing short deterministic Python snippets for each subgoal, a small model can deliver robust, inspectable reasoning without expensive APIs.
Setup and model selection
Install the required libraries and pick a compact instruction-tuned model. The example uses Qwen2.5-1.5B-Instruct and adjusts numeric precision depending on whether a GPU is available.
!pip -q install -U transformers accelerate bitsandbytes rich
import os, re, json, textwrap, traceback
from typing import Dict, Any, List
from rich import print as rprint
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
DTYPE = torch.bfloat16 if torch.cuda.is_available() else torch.float32
Load the tokenizer, model, and pipeline
Load the tokenizer and model, enable 4-bit loading for efficiency (this requires a CUDA GPU with bitsandbytes; drop the flag to run in full precision on CPU), and create a text-generation pipeline to interact with the model in Colab or a similar environment.
tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=DTYPE,
    load_in_4bit=True  # 4-bit quantization via bitsandbytes; needs a CUDA GPU
)
gen = pipeline(
    "text-generation",
    model=model,
    tokenizer=tok,
    return_full_text=False
)
Interaction and JSON parsing helpers
Define a chat wrapper to call the pipeline with optional system instructions and a helper to extract JSON reliably from model outputs, including responses wrapped in code fences.
def chat(prompt: str, system: str = "", max_new_tokens: int = 512, temperature: float = 0.3) -> str:
    msgs = []
    if system:
        msgs.append({"role":"system","content":system})
    msgs.append({"role":"user","content":prompt})
    inputs = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    out = gen(inputs, max_new_tokens=max_new_tokens, do_sample=(temperature>0), temperature=temperature, top_p=0.9)
    return out[0]["generated_text"].strip()
def extract_json(txt: str) -> Dict[str, Any]:
    m = re.search(r"\{[\s\S]*\}$", txt.strip())
    if not m:
        m = re.search(r"\{[\s\S]*?\}", txt)
    try:
        return json.loads(m.group(0)) if m else {}
    except Exception:
        # fallback: strip code fences and retry
        s = re.sub(r"^```.*?\n|\n```$", "", txt, flags=re.S)
        try:
            return json.loads(s)
        except Exception:
            return {}
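As a quick check, the helper handles both a bare JSON object and one wrapped in a code fence (hypothetical model outputs shown here):
print(extract_json('{"subgoals": ["a", "b"], "final_format": "Answer: <value>"}'))
print(extract_json('```json\n{"action": "submit", "critique": "ok"}\n```'))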
Execution helpers and safe Python runner
Define helpers that extract code blocks from model output and execute short, deterministic Python snippets while capturing stdout and a RESULT variable. These utilities let the agent run self-contained computations for each subgoal; note that exec runs in-process rather than in a sandbox, so it is only suitable for trusted snippets.
def extract_code(txt: str) -> str:
    m = re.search(r"```(?:python)?\s*([\s\S]*?)```", txt, flags=re.I)
    return (m.group(1) if m else txt).strip()
def run_python(code: str, env: Dict[str, Any] | None = None) -> Dict[str, Any]:
    import io, contextlib
    g = {"__name__": "__main__"}; l = {}
    if env: g.update(env)
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, g, l)
        out = l.get("RESULT", g.get("RESULT"))
        return {"ok": True, "result": out, "stdout": buf.getvalue()}
    except Exception as e:
        return {"ok": False, "error": str(e), "trace": traceback.format_exc(), "stdout": buf.getvalue()}
Role prompts: Planner, Solver, Critic, Synthesizer
The HRM is driven by four role-specific system prompts. Planner splits the task into 2–4 atomic subgoals. Solver returns a single short Python snippet that computes a RESULT. Critic inspects subgoal outcomes and either accepts or asks for refinement. Synthesizer produces the final formatted answer.
PLANNER_SYS = """You are the HRM Planner.
Decompose the TASK into 2–4 atomic, code-solvable subgoals.
Return compact JSON only: {"subgoals":[...], "final_format":"<one-line answer format>"}."""
SOLVER_SYS = """You are the HRM Solver.
Given SUBGOAL and CONTEXT vars, output a single Python snippet.
Rules:
- Compute deterministically.
- Set a variable RESULT to the answer.
- Keep code short; stdlib only.
Return only a Python code block."""
CRITIC_SYS = """You are the HRM Critic.
Given TASK and LOGS (subgoal results), decide if final answer is ready.
Return JSON only: {"action":"submit"|"revise","critique":"...", "fix_hint":"<if revise>"}."""
SYNTH_SYS = """You are the HRM Synthesizer.
Given TASK, LOGS, and final_format, output only the final answer (no steps).
Follow final_format exactly."""
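For reference, the Planner and Critic are expected to return JSON shaped roughly like this (values are hypothetical):
example_plan = {"subgoals": ["Infer the rule from TRAIN pairs", "Apply the rule to TEST"],
                "final_format": "Answer: <grid>"}
example_verdict = {"action": "submit", "critique": "Subgoal results are consistent."}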
Planner/Solver/Critic loop and orchestration
Implement the HRM functions to plan, solve subgoals by generating and running Python, critique the results, refine if necessary, and synthesize the final answer. hrm_agent executes several rounds up to a budget and carries intermediate results forward as context.
def plan(task: str) -> Dict[str, Any]:
    p = f"TASK:\n{task}\nReturn JSON only."
    return extract_json(chat(p, PLANNER_SYS, temperature=0.2, max_new_tokens=300))
def solve_subgoal(subgoal: str, context: Dict[str, Any]) -> Dict[str, Any]:
    prompt = f"SUBGOAL:\n{subgoal}\nCONTEXT vars: {list(context.keys())}\nReturn Python code only."
    code = extract_code(chat(prompt, SOLVER_SYS, temperature=0.2, max_new_tokens=400))
    res = run_python(code, env=context)
    return {"subgoal": subgoal, "code": code, "run": res}
def critic(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    pl = [{"subgoal": L["subgoal"], "result": L["run"].get("result"), "ok": L["run"]["ok"]} for L in logs]
    out = chat("TASK:\n"+task+"\nLOGS:\n"+json.dumps(pl, ensure_ascii=False, indent=2)+"\nReturn JSON only.",
               CRITIC_SYS, temperature=0.1, max_new_tokens=250)
    return extract_json(out)
def refine(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    sys = "Refine subgoals minimally to fix issues. Return same JSON schema as planner."
    out = chat("TASK:\n"+task+"\nLOGS:\n"+json.dumps(logs, ensure_ascii=False)+"\nReturn JSON only.",
               sys, temperature=0.2, max_new_tokens=250)
    j = extract_json(out)
    return j if j.get("subgoals") else {}
def synthesize(task: str, logs: List[Dict[str, Any]], final_format: str) -> str:
    packed = [{"subgoal": L["subgoal"], "result": L["run"].get("result")} for L in logs]
    return chat("TASK:\n"+task+"\nLOGS:\n"+json.dumps(packed, ensure_ascii=False)+
                f"\nfinal_format: {final_format}\nOnly the final answer.",
                SYNTH_SYS, temperature=0.0, max_new_tokens=120).strip()
def hrm_agent(task: str, context: Dict[str, Any] | None = None, budget: int = 2) -> Dict[str, Any]:
    ctx = dict(context or {})
    trace, plan_json = [], plan(task)
    for round_id in range(1, budget+1):
        logs = [solve_subgoal(sg, ctx) for sg in plan_json.get("subgoals", [])]
        for L in logs:
            ctx_key = f"g{len(trace)}_{abs(hash(L['subgoal']))%9999}"
            ctx[ctx_key] = L["run"].get("result")
        verdict = critic(task, logs)
        trace.append({"round": round_id, "plan": plan_json, "logs": logs, "verdict": verdict})
        if verdict.get("action") == "submit":
            break
        plan_json = refine(task, logs) or plan_json
    final = synthesize(task, trace[-1]["logs"], plan_json.get("final_format", "Answer: <value>"))
    return {"final": final, "trace": trace}
Demos: ARC-like transformation and word math
Two example tasks validate the pipeline: an ARC-like grid transformation that requires inferring a rule from train pairs and applying it to test data, and a word-math problem involving percent decay and refill.
ARC_TASK = textwrap.dedent("""
Infer the transformation rule from train examples and apply to test.
Return exactly: "Answer: <grid>", where <grid> is a Python list of lists of ints.
""").strip()
ARC_DATA = {
    "train": [
        {"inp": [[0,0],[1,0]], "out": [[1,1],[0,1]]},
        {"inp": [[0,1],[0,0]], "out": [[1,0],[1,1]]}
    ],
    "test": [[0,0],[0,1]]
}
res1 = hrm_agent(ARC_TASK, context={"TRAIN": ARC_DATA["train"], "TEST": ARC_DATA["test"]}, budget=2)
rprint("\n[bold]Demo 1 — ARC-like Toy[/bold]")
rprint(res1["final"])
WM_TASK = "A tank holds 1200 L. It leaks 2% per hour for 3 hours, then is refilled by 150 L. Return exactly: 'Answer: <liters>'."
res2 = hrm_agent(WM_TASK, context={}, budget=2)
rprint("\n[bold]Demo 2 — Word Math[/bold]")
rprint(res2["final"])
rprint("\n[dim]Rounds executed (Demo 1):[/dim]", len(res1["trace"]))
Practical notes and takeaways
This pattern demonstrates how hierarchical planning, targeted small Python solvers, and an internal critic enable a compact model to handle structured reasoning tasks. The approach produces an auditable trace of plans and intermediate results, lets you iterate locally, and can be adapted to many tasks that are decomposable into code-executable subgoals. The pipeline encourages experimentation with different models, system prompts, and solver constraints to balance correctness, speed, and interpretability.