
CMU Trains LLM Agents to Be Productive, Proactive and Personalized with PPP and UserVille

CMU researchers present PPP and UserVille, a multiobjective RL framework and interactive environment that teach LLM agents when to ask clarifying questions and how to adapt to user preferences, significantly improving productivity, proactivity, and personalization on benchmark tasks.

Why current LLM agents fall short

Most large language model agents are optimized to maximize task success. They can fix GitHub issues or answer deep research questions, but they often fail to reason about when to ask clarifying questions or how to adapt to a user's interaction preferences. This leads to either too few questions and lower task quality, or too many intrusive questions that violate user expectations.

What PPP and UserVille introduce

Researchers from Carnegie Mellon University (CMU) and OpenHands formalize three joint objectives for interactive agents: Productivity, Proactivity, and Personalization. They implement a multiobjective reinforcement learning framework called PPP inside an interaction-centric environment named UserVille. The goal is to train agents that not only complete tasks, but also know when to ask useful questions and how to tailor their behavior to individual users.

UserVille: an environment focused on interaction

UserVille converts existing agent benchmarks into an RL environment populated by LLM-based user simulators. Its key components are:

  • Prompt vaguenization: Precise task prompts are rewritten into vague versions that preserve intent but remove details. The simulator still sees the precise prompt while the agent receives only the vague one, creating information asymmetry.
  • Preference-aware user simulation: Each simulator is parameterized by one of 20 interaction preferences covering brevity, the allowed number of questions per turn, answer format, timing, language constraints, and requirements such as JSON-formatted replies. Training uses 12 preferences and holds out 8 for generalization tests.
  • User-centric evaluation: After a session, the simulator labels each question as low, medium, or high effort. The proactivity score rewards sessions whose questions are low effort. Personalization is scored on whether the agent follows the user preference, averaged over sessions where the agent asked at least one question.
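
To make this setup concrete, here is a minimal Python sketch of how such a preference-aware simulator could be wired up. The names (UserSimulator, call_llm, rate_effort) and the example prompts are illustrative assumptions, not the paper's actual API, and the effort rating, which the paper applies after the session, is folded into each call here for brevity.

```python
from dataclasses import dataclass, field


def call_llm(system: str, message: str) -> str:
    # Placeholder for a chat-model call; the paper uses GPT-5 Nano as the simulator.
    return "stubbed simulator reply"


@dataclass
class UserSimulator:
    """Holds the precise prompt and one interaction preference; the agent sees neither."""
    precise_prompt: str
    preference: str                          # e.g. "reply in at most one short sentence"
    effort_labels: list[str] = field(default_factory=list)

    def answer(self, question: str) -> str:
        # Answer the agent's clarifying question from the precise prompt, under the preference.
        system = (
            f"You are the user. Your real task is: {self.precise_prompt}\n"
            f"Interaction preference: {self.preference}\n"
            "Answer the agent's clarifying question accordingly."
        )
        self.effort_labels.append(self.rate_effort(question))
        return call_llm(system, question)

    def rate_effort(self, question: str) -> str:
        # These labels later feed the proactivity score (low / medium / high effort).
        judge = (
            f"The user's real task is: {self.precise_prompt}\n"
            "Label the effort needed to answer this question: low, medium, or high."
        )
        return call_llm(judge, question).strip().lower()


# The agent only ever receives the vague prompt, which creates the information asymmetry.
sim = UserSimulator(
    precise_prompt="Locate the function causing the off-by-one pagination bug in utils/paging.py",
    preference="reply in at most one short sentence",
)
vague_prompt = "Something is wrong with pagination, please look into it."
```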

UserVille is applied to two domains: software engineering (SWE-Gym for training; SWE-Bench Verified and SWE-Bench Full for evaluation) and deep research (BrowseComp-Plus with search and open_page tools).

PPP: reward design and training

Agents are implemented in a ReAct-style loop with Seed-OSS-36B-Instruct policies. They can call domain tools and an ask_user tool that queries the user simulator. PPP defines a trajectory-level reward R as the sum of three components:

R = R_Prod + R_Proact + R_Pers

  • Productivity reward (R_Prod) uses the task metric, for example F1 on SWE-Func-Loc or exact match on BrowseComp-Plus.
  • Proactivity reward (R_Proact) gives a bonus of +0.05 if all questions in a session are low effort, and applies penalties of −0.1 per medium-effort question and −0.5 per high-effort question.
  • Personalization reward (R_Pers) adds +0.05 when the agent follows the preference and applies nonpositive, preference-specific penalties for violations.
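
A minimal sketch of how these three terms could be combined per session, using the constants above. The function name, the single preference_penalty magnitude (the paper defines preference-specific rules), and the requirement of at least one question for the proactivity bonus are simplifying assumptions.

```python
def session_reward(task_score: float,
                   effort_labels: list[str],
                   followed_preference: bool,
                   preference_penalty: float = 0.1) -> float:
    # Trajectory-level reward R = R_Prod + R_Proact + R_Pers.

    # Productivity: the task metric itself, e.g. F1 on SWE-Func-Loc or exact match on BrowseComp-Plus.
    r_prod = task_score

    # Proactivity: +0.05 only if every question was low effort (assumed here to need
    # at least one question), −0.1 per medium-effort and −0.5 per high-effort question.
    r_proact = 0.0
    if effort_labels and all(label == "low" for label in effort_labels):
        r_proact += 0.05
    r_proact -= 0.1 * effort_labels.count("medium")
    r_proact -= 0.5 * effort_labels.count("high")

    # Personalization: +0.05 for following the preference, otherwise a nonpositive,
    # preference-specific penalty (collapsed to a single magnitude here).
    r_pers = 0.05 if followed_preference else -abs(preference_penalty)

    return r_prod + r_proact + r_pers


# Example: perfect func-loc F1, two low-effort questions, preference followed.
print(session_reward(1.0, ["low", "low"], True))   # 1.0 + 0.05 + 0.05 ≈ 1.10
```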

Training uses a GRPO-based RL algorithm with the Clip-Higher strategy and a token-level policy-gradient loss inspired by DAPO. Only LLM-generated tokens are optimized. Training is implemented with Verl and OpenHands scaffolds. Seed-OSS-36B-Instruct is trained for 200 steps with batch size 64 and group size 8. Maximum output lengths vary by task (for example, 32k tokens for SWE-Func-Loc). GPT-5 Nano is used as the user simulator in experiments.
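
For illustration, below is a small PyTorch sketch of a token-level policy-gradient loss with asymmetric Clip-Higher clipping and GRPO-style group-normalized advantages. The tensor layout, the epsilon values (DAPO's commonly cited 0.2/0.28 defaults), and the helper names are assumptions for illustration, not the paper's exact implementation, which is built on Verl and the OpenHands scaffold.

```python
import torch


def clip_higher_pg_loss(logp_new: torch.Tensor,    # [B, T] log-probs under the current policy
                        logp_old: torch.Tensor,    # [B, T] log-probs recorded at rollout time
                        advantages: torch.Tensor,  # [B] one scalar per trajectory (group-normalized)
                        gen_mask: torch.Tensor,    # [B, T] 1.0 for LLM-generated tokens, 0.0 otherwise
                        eps_low: float = 0.2,
                        eps_high: float = 0.28) -> torch.Tensor:
    """Token-level PPO-style loss with asymmetric (Clip-Higher) clipping.

    eps_high > eps_low loosens the upper clip so low-probability tokens can still be
    reinforced; only LLM-generated tokens contribute to the loss.
    """
    ratio = torch.exp(logp_new - logp_old)                             # per-token importance ratio
    adv = advantages.unsqueeze(-1)                                     # broadcast trajectory advantage to tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    per_token = -torch.minimum(unclipped, clipped)
    # Token-level (DAPO-style) aggregation: average over all generated tokens in the batch.
    return (per_token * gen_mask).sum() / gen_mask.sum().clamp(min=1)


def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO-style advantages: normalize trajectory rewards within one group of rollouts.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```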

Experimental results

The paper evaluates productivity, proactivity, and personalization on SWE-Bench Verified Func-Loc and BrowseComp-Plus using vague prompts and averaging over 20 preferences.

For the Seed-OSS-36B-Instruct base model:

  • SWE-Func-Loc: productivity 38.59, proactivity 43.70, personalization 69.07
  • BrowseComp-Plus: productivity 18.20, proactivity 37.60, personalization 64.76

After PPP RL training, the model reaches:

  • SWE-Func-Loc: productivity 56.26, proactivity 75.55, personalization 89.26
  • BrowseComp-Plus: productivity 26.63, proactivity 47.69, personalization 76.85

The average gain across all three dimensions and both datasets is 16.72 points relative to Seed-OSS-36B-Instruct. PPP also outperforms GPT-5 and other GPT-series baselines on the combined metric.

Interaction matters: on SWE-Func-Loc, F1 with precise prompts and no interaction is 64.50. With vague prompts and no interaction it drops to 44.11. Adding interaction without RL does not recover the gap. With PPP training and interaction, F1 under vague prompts improves by 21.66 points.

PPP changes agent behavior as well. The ask ratio on SWE-Func-Loc rises from 50% to 100% under vague prompts and from 51% to 85% on deep research tasks, while remaining low for precise prompts. The number of questions per session increases early in training, then stabilizes with a high proportion of low-effort questions and very few high-effort ones.

Key takeaways

  • PPP reframes agent training as a multiobjective RL problem that jointly optimizes Productivity, Proactivity, and Personalization.
  • UserVille enforces interaction modeling by creating vague prompts and pairing them with preference-aware simulators covering 20 interaction preferences.
  • The reward design incentivizes low-effort clarifying questions and penalizes unnecessary or costly queries as well as preference violations.
  • PPP training substantially improves all three metrics on SWE-Bench Func-Loc and BrowseComp-Plus under vague prompts, and agents learn to ask fewer but more targeted, low-effort questions.

The work highlights interaction awareness as a core capability for future LLM agents and provides code and data references in the paper repository.
