Teach a Local AI to Think and Act: Build a Virtual Desktop Agent with Flan-T5
Overview
This tutorial shows how to build a lightweight computer-use agent that reasons, plans, and performs virtual actions using a local open-weight language model. We simulate a miniature desktop environment, expose a tool interface for interactions, and create an agent that inspects its environment, decides on actions (click, type, screenshot), and executes them step by step. The demo uses a local Flan-T5 model and simple Python components to illustrate the architecture and control flow of such an agent.
Environment setup
Install the required packages and prepare the runtime for running local models and asynchronous tasks. The example assumes a Colab-like notebook environment, but it works anywhere the dependencies and the model weights are available.
!pip install -q transformers accelerate sentencepiece nest_asyncio
import torch, asyncio, uuid
from transformers import pipeline
import nest_asyncio
nest_asyncio.apply()
Local LLM and virtual desktop
We create a minimal LocalLLM wrapper around a Transformers pipeline and a VirtualComputer class that simulates a tiny desktop with apps, focus, and a screen representation. The virtual desktop supports screenshot, click, and type operations and logs actions.
class LocalLLM:
    def __init__(self, model_name="google/flan-t5-small", max_new_tokens=128):
        # Text-to-text pipeline; runs on GPU when available, otherwise CPU.
        self.pipe = pipeline("text2text-generation", model=model_name, device=0 if torch.cuda.is_available() else -1)
        self.max_new_tokens = max_new_tokens
    def generate(self, prompt: str) -> str:
        # Greedy decoding (do_sample=False) keeps the agent's replies deterministic.
        out = self.pipe(prompt, max_new_tokens=self.max_new_tokens, do_sample=False)[0]["generated_text"]
        return out.strip()
class VirtualComputer:
    def __init__(self):
        self.apps = {"browser": "https://example.com", "notes": "", "mail": ["Welcome to CUA", "Invoice #221", "Weekly Report"]}
        self.focus = "browser"
        self.screen = "Browser open at https://example.com\nSearch bar focused."
        self.action_log = []
    def screenshot(self):
        # Serialize the current desktop state into a text "screenshot".
        return f"FOCUS:{self.focus}\nSCREEN:\n{self.screen}\nAPPS:{list(self.apps.keys())}"
    def click(self, target:str):
        # Clicking a known app switches focus and redraws the screen.
        if target in self.apps:
            self.focus = target
            if target=="browser":
                self.screen = f"Browser tab: {self.apps['browser']}\nAddress bar focused."
            elif target=="notes":
                self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
            elif target=="mail":
                inbox = "\n".join(f"- {s}" for s in self.apps['mail'])
                self.screen = f"Mail App Inbox:\n{inbox}\n(Read-only preview)"
        else:
            self.screen += f"\nClicked '{target}'."
        self.action_log.append({"type":"click","target":target})
    def type(self, text:str):
        # Typing is routed to whichever app currently has focus.
        if self.focus=="browser":
            self.apps["browser"] = text
            self.screen = f"Browser tab now at {text}\nPage headline: Example Domain"
        elif self.focus=="notes":
            self.apps["notes"] += ("\n"+text)
            self.screen = f"Notes App\nCurrent notes:\n{self.apps['notes']}"
        else:
            self.screen += f"\nTyped '{text}' but no editable field."
        self.action_log.append({"type":"type","text":text})
This pairing provides a reasoning engine (LocalLLM) and a controlled environment (VirtualComputer) in which the agent can reason about UI state and execute actions.
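Before wiring these into an agent, it helps to poke at the VirtualComputer directly. The short sketch below assumes the two classes above have already been defined in the session.

computer = VirtualComputer()
computer.click("notes")        # focus switches from browser to notes
computer.type("Buy milk")      # text is appended to the notes app
print(computer.screenshot())   # FOCUS:notes plus the updated notes content
print(computer.action_log)     # [{'type': 'click', ...}, {'type': 'type', ...}]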
Tool interface
The ComputerTool wraps the VirtualComputer and exposes a consistent run(command, argument) interface that the agent uses to interact with the simulated desktop.
class ComputerTool:
    def __init__(self, computer:VirtualComputer):
        self.computer = computer
    def run(self, command:str, argument:str=""):
        if command=="click":
            self.computer.click(argument)
            return {"status":"completed","result":f"clicked {argument}"}
        if command=="type":
            self.computer.type(argument)
            return {"status":"completed","result":f"typed {argument}"}
        if command=="screenshot":
            snap = self.computer.screenshot()
            return {"status":"completed","result":snap}
        return {"status":"error","result":f"unknown command {command}"}
This structure separates reasoning (LLM) from execution (tool), making it easy to swap implementations or add new capabilities.
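Each call returns a small status dictionary, and unsupported commands fail gracefully rather than raising. A quick illustration, assuming the classes above are defined:

tool = ComputerTool(VirtualComputer())
print(tool.run("click", "mail"))         # {'status': 'completed', 'result': 'clicked mail'}
print(tool.run("screenshot")["result"])  # serialized screen: mail app focused, inbox subjects listed
print(tool.run("scroll", "down"))        # {'status': 'error', 'result': 'unknown command scroll'}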
Agent logic and control loop
The ComputerAgent orchestrates the interaction: it prompts the LLM with the user goal and current screen, parses the LLM’s step-by-step reply to extract actions and assistant messages, calls the tool, captures outputs, and repeats until the goal is reached or the step budget is exhausted.
class ComputerAgent:
    def __init__(self, llm:LocalLLM, tool:ComputerTool, max_trajectory_budget:float=5.0):
        self.llm = llm
        self.tool = tool
        self.max_trajectory_budget = max_trajectory_budget
    async def run(self, messages):
        user_goal = messages[-1]["content"]
        steps_remaining = int(self.max_trajectory_budget)
        output_events = []
        total_prompt_tokens = 0
        total_completion_tokens = 0
        while steps_remaining>0:
            # Build a prompt from the user goal and the serialized screen.
            screen = self.tool.computer.screenshot()
            prompt = (
                "You are a computer-use agent.\n"
                f"User goal: {user_goal}\n"
                f"Current screen:\n{screen}\n\n"
                "Think step-by-step.\n"
                "Reply with: ACTION <click/type/screenshot> ARG <target or text> THEN <assistant message>.\n"
            )
            thought = self.llm.generate(prompt)
            # Rough token accounting based on whitespace word counts.
            total_prompt_tokens += len(prompt.split())
            total_completion_tokens += len(thought.split())
            # Defaults in case the model's reply cannot be parsed.
            action="screenshot"; arg=""; assistant_msg="Working..."
            for line in thought.splitlines():
                if line.strip().startswith("ACTION "):
                    after = line.split("ACTION ",1)[1]
                    action = after.split()[0].strip()
                if "ARG " in line:
                    part = line.split("ARG ",1)[1]
                    if " THEN " in part:
                        arg = part.split(" THEN ")[0].strip()
                    else:
                        arg = part.strip()
                if "THEN " in line:
                    assistant_msg = line.split("THEN ",1)[1].strip()
            output_events.append({"summary":[{"text":assistant_msg,"type":"summary_text"}],"type":"reasoning"})
            call_id = "call_"+uuid.uuid4().hex[:16]
            tool_res = self.tool.run(action, arg)
            output_events.append({"action":{"type":action,"text":arg},"call_id":call_id,"status":tool_res["status"],"type":"computer_call"})
            snap = self.tool.computer.screenshot()
            output_events.append({"type":"computer_call_output","call_id":call_id,"output":{"type":"input_image","image_url":snap}})
            output_events.append({"type":"message","role":"assistant","content":[{"type":"output_text","text":assistant_msg}]})
            # Stop early when the reply looks like a final answer.
            if "done" in assistant_msg.lower() or "here is" in assistant_msg.lower():
                break
            steps_remaining -= 1
        usage = {"prompt_tokens": total_prompt_tokens,"completion_tokens": total_completion_tokens,"total_tokens": total_prompt_tokens + total_completion_tokens,"response_cost": 0.0}
        yield {"output": output_events, "usage": usage}
The agent works within a small step budget and records detailed events for each reasoning step and tool call, including a snapshot of the virtual screen after every action.
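To make the parsing convention concrete, here is a standalone trace of a hand-written reply through the same ACTION/ARG/THEN rules the loop above applies (the reply string is illustrative, not actual model output):

reply = "ACTION click ARG mail THEN Opening the mail app."
action, arg, assistant_msg = "screenshot", "", "Working..."
for line in reply.splitlines():
    if line.strip().startswith("ACTION "):
        action = line.split("ACTION ", 1)[1].split()[0].strip()
    if "ARG " in line:
        part = line.split("ARG ", 1)[1]
        arg = part.split(" THEN ")[0].strip() if " THEN " in part else part.strip()
    if "THEN " in line:
        assistant_msg = line.split("THEN ", 1)[1].strip()
print(action, "|", arg, "|", assistant_msg)   # click | mail | Opening the mail app.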
Running the demo
The demo ties the components together: it initializes the virtual computer, the tool, the local LLM, and the agent, then asks the agent to open mail, read subjects, and summarize.
async def main_demo():
    computer = VirtualComputer()
    tool = ComputerTool(computer)
    llm = LocalLLM()
    agent = ComputerAgent(llm, tool, max_trajectory_budget=4)
    messages=[{"role":"user","content":"Open mail, read inbox subjects, and summarize."}]
    async for result in agent.run(messages):
        print("==== STREAM RESULT ====")
        for event in result["output"]:
            if event["type"]=="computer_call":
                a = event.get("action",{})
                print(f"[TOOL CALL] {a.get('type')} -> {a.get('text')} [{event.get('status')}]")
            if event["type"]=="computer_call_output":
                snap = event["output"]["image_url"]
                print("SCREEN AFTER ACTION:\n", snap[:400],"...\n")
            if event["type"]=="message":
                print("ASSISTANT:", event["content"][0]["text"], "\n")
        print("USAGE:", result["usage"])

loop = asyncio.get_event_loop()
loop.run_until_complete(main_demo())
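Inside Colab or Jupyter, nest_asyncio lets run_until_complete execute even though the notebook already owns a running event loop. If you instead run this as a plain Python script, where no loop is running yet, a simpler entry point should work:

asyncio.run(main_demo())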
What to observe
- The agent reasons using text prompts that include the user goal and a serialized view of the current screen.
- The LLM is prompted to output explicit ACTION/ARG/THEN lines that the controller parses into tool calls.
- Each cycle logs the reasoning summary, the tool call, the call output (a screen snapshot), and the assistant message, which makes every decision traceable; the virtual desktop's action log adds a second audit trail, as sketched below.
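Because VirtualComputer records every click and type in action_log, you can audit tool-level behavior independently of the streamed events. A minimal illustration, assuming the classes above are defined:

computer = VirtualComputer()
tool = ComputerTool(computer)
tool.run("click", "mail")
tool.run("screenshot")        # screenshots are not logged; only click and type are
for entry in computer.action_log:
    print(entry)              # e.g. {'type': 'click', 'target': 'mail'}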
Extensions and considerations
This minimal example demonstrates the core architecture: a local LLM for reasoning, a structured tool interface for actions, and an agent loop that composes these pieces. From here you can:
- Swap to a larger local model or quantized weights for better reasoning.
- Extend the virtual desktop with richer app state and modalities (images, DOM-like structure).
- Add safety checks, action validation, and longer-term memory for multi-step tasks; a minimal validation sketch follows this list.
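As a sketch of the safety-check idea (hypothetical names, not part of the tutorial code), an allowlist wrapper around ComputerTool could reject actions before they reach the desktop:

ALLOWED_COMMANDS = {"click", "type", "screenshot"}     # hypothetical policy
ALLOWED_CLICK_TARGETS = {"browser", "notes", "mail"}   # hypothetical policy

class GuardedComputerTool(ComputerTool):               # hypothetical subclass
    def run(self, command: str, argument: str = ""):
        if command not in ALLOWED_COMMANDS:
            return {"status": "error", "result": f"blocked command {command}"}
        if command == "click" and argument not in ALLOWED_CLICK_TARGETS:
            return {"status": "error", "result": f"blocked click target {argument}"}
        return super().run(command, argument)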
The provided code gives a practical starting point to explore how local language models can power interactive agents that plan and act inside controlled environments.