Sensible Agent: How Google Couples 'what+how' to Make AR Less Awkward
What Sensible Agent aims to solve
Sensible Agent is a Google research framework and prototype that treats assisted augmented reality as a single decision: not only what the agent should do, but how it should present that suggestion. The system conditions this joint choice on real-time multimodal context such as whether the user’s hands are busy, ambient noise, and social setting. The goal is to reduce friction and social awkwardness by preventing otherwise useful suggestions from being delivered through infeasible or socially unacceptable channels.
Why joint decisions matter
Traditional AR assistants separate content selection from interaction modality. Sensible Agent’s core insight is that a good suggestion delivered through the wrong channel is effectively noise. The framework explicitly models the combined decision of (a) the action the agent proposes, such as recommend, guide, remind, or automate, and (b) the presentation and confirmation modality, such as visual, audio, or combined, with inputs like head nods, gaze dwell, finger poses, short-vocabulary speech, or non-lexical sounds. By binding content to modality feasibility and social acceptability, the system aims to lower perceived effort while preserving utility.
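To make the coupling concrete, here is a minimal sketch of how such a joint decision could be represented as data. The enum members mirror the actions, query structures, and modalities named above, but the type names are illustrative, not the paper's API.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):           # the "what": proactive help the agent proposes
    RECOMMEND = "recommend"
    GUIDE = "guide"
    REMIND = "remind"
    AUTOMATE = "automate"

class QueryType(Enum):        # structure of the query put to the user
    BINARY = "binary"         # single yes/no confirmation
    MULTI_CHOICE = "multi_choice"
    ICON_CUE = "icon_cue"

class Modality(Enum):         # the "how": presentation channel
    VISUAL = "visual"
    AUDIO = "audio"
    COMBINED = "combined"

@dataclass
class JointDecision:
    """The coupled 'what + how' chosen in a single step."""
    action: Action
    query_type: QueryType
    modality: Modality
```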
Runtime architecture
A prototype running on an Android-class XR headset implements a three-stage pipeline (a minimal code sketch follows the list):
- Context parsing fuses egocentric imagery using vision-language inference for scene, activity, and familiarity with an ambient audio classifier based on YAMNet to detect conditions like noise or conversation.
- A proactive query generator prompts a large multimodal model with few-shot exemplars to select the action, the query structure (binary, multi-choice, icon-cue), and the presentation modality.
- The interaction layer exposes only input methods compatible with current I/O availability. For example, it offers head nod for binary confirmation when whispering is unacceptable, or gaze dwell when hands are occupied.
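As a rough illustration of how the three stages hang together, the following self-contained sketch stubs each stage with canned values. None of the function names or return shapes come from the Sensible Agent codebase; a real implementation would call a VLM, YAMNet, and an LMM where these stubs sit.

```python
# Minimal sketch of the three-stage pipeline; all values are illustrative stand-ins.

def parse_context(egocentric_frame, audio_window):
    """Stage 1: fuse visual scene parsing with ambient-audio tags into a compact state."""
    scene = {"scene": "grocery", "activity": "shopping", "hands_busy": True}  # stand-in for VLM output
    audio = {"speech": 0.82, "crowd": 0.55}                                   # stand-in for YAMNet scores
    return {**scene, "audio": audio}

def generate_proactive_query(context, exemplars):
    """Stage 2: few-shot prompt an LMM to choose action, query structure, and modality jointly."""
    # A real implementation serializes `exemplars` + `context` into a prompt and
    # parses the model's structured reply; here we return a fixed example.
    return {"action": "recommend", "query_type": "binary", "modality": "visual"}

def present(decision, io_state):
    """Stage 3: expose only the input methods compatible with current I/O availability."""
    inputs = ["head_nod"] if io_state.get("hands_busy") else ["head_nod", "finger_pose"]
    return {"decision": decision, "allowed_inputs": inputs}

if __name__ == "__main__":
    ctx = parse_context(egocentric_frame=None, audio_window=None)
    decision = generate_proactive_query(ctx, exemplars=[])
    print(present(decision, io_state={"hands_busy": ctx["hands_busy"]}))
```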
How the few-shot policies are grounded
The team seeded the policy space using two studies. An expert workshop with 12 participants enumerated useful proactive help scenarios and socially acceptable micro-inputs. A context mapping study with 40 participants produced 960 context entries across everyday scenarios such as gym, grocery, museum, commuting, and cooking. Participants specified desired agent actions and a preferred query type and modality for each context. These mappings become the few-shot exemplars used at runtime, shifting choices from ad-hoc rules to data-derived patterns, for example preferring multi-choice in unfamiliar environments, binary confirmations under time pressure, and icon-plus-visual cues in socially sensitive settings.
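As a hedged example, this is roughly what such exemplars might look like once serialized for prompting. The field names and concrete entries paraphrase the patterns above (multi-choice when unfamiliar, binary under time pressure, icon-plus-visual cues when socially sensitive); they are not rows from the published dataset.

```python
# Illustrative few-shot exemplars in the spirit of the context-mapping study.
EXEMPLARS = [
    {
        "context": {"scene": "museum", "familiarity": "unfamiliar", "time_pressure": "low"},
        "action": "recommend",
        "query_type": "multi_choice",   # unfamiliar environment -> offer options
        "modality": "visual",
    },
    {
        "context": {"scene": "commuting", "familiarity": "familiar", "time_pressure": "high"},
        "action": "remind",
        "query_type": "binary",         # time pressure -> single yes/no confirmation
        "modality": "combined",
    },
    {
        "context": {"scene": "library", "familiarity": "familiar", "social_sensitivity": "high"},
        "action": "guide",
        "query_type": "icon_cue",       # socially sensitive -> icon-plus-visual cue
        "modality": "visual",
    },
]

def exemplars_to_prompt(exemplars):
    """Serialize exemplars as few-shot lines for the LMM prompt."""
    return "\n".join(
        f"Context: {e['context']} -> action={e['action']}, "
        f"query={e['query_type']}, modality={e['modality']}"
        for e in exemplars
    )
```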
Supported interaction techniques
The prototype supports a range of low-effort input primitives mapped to the chosen query types:
- Binary confirmations via head nod or shake
- Multi-choice via head-tilt mapping left/right/back to options 1/2/3
- Finger-pose gestures for numeric selection and thumbs up/down
- Gaze dwell to trigger visual buttons when raycast pointing would be fussy
- Short-vocabulary speech such as ‘yes’, ‘no’, ‘one’, ‘two’, ‘three’ for minimal dictation
- Non-lexical conversational sounds like ‘mm-hm’ for noisy or whisper-only contexts
Crucially, the pipeline only offers modalities that are feasible in the sensed context, suppressing audio prompts in quiet spaces and avoiding gaze dwell when the user isn’t looking at the HUD.
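Below is a minimal sketch of that feasibility gating, assuming a simple dictionary of sensed flags. The keys and rules are illustrative assumptions, not the prototype's actual logic.

```python
# Filter the input primitives listed above down to those feasible in the sensed context.
ALL_INPUTS = ["head_nod", "head_tilt", "finger_pose", "gaze_dwell",
              "short_speech", "non_lexical_sound"]

def feasible_inputs(state):
    """Return only the input primitives compatible with the sensed context."""
    allowed = set(ALL_INPUTS)
    if state.get("hands_busy"):
        allowed.discard("finger_pose")
    if state.get("socially_quiet"):                 # e.g. library: audible speech is unacceptable
        allowed.discard("short_speech")
    if state.get("noisy"):                          # loud environment: speech recognition unreliable
        allowed -= {"short_speech", "non_lexical_sound"}
    if not state.get("looking_at_hud", True):       # gaze dwell needs the HUD in view
        allowed.discard("gaze_dwell")
    return sorted(allowed)

print(feasible_inputs({"hands_busy": True, "socially_quiet": True, "looking_at_hud": False}))
# -> ['head_nod', 'head_tilt', 'non_lexical_sound']
```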
Evaluation and results
A preliminary within-subjects study with 10 participants compared Sensible Agent to a voice-prompt baseline across AR and 360° VR. Results showed lower perceived interaction effort and lower intrusiveness for Sensible Agent, with usability maintained and participants preferring it over the baseline. The sample is small and the results directional, but they align with the hypothesis that coupling intent and modality reduces overhead.
Why YAMNet for audio
YAMNet is a lightweight MobileNet-v1 audio event classifier trained on AudioSet that predicts 521 classes. It is a practical on-device choice for detecting coarse ambient conditions such as speech presence, music, or crowd noise, and those tags quickly gate whether to favor audio prompts or bias toward visual and gesture interactions. YAMNet’s availability on TensorFlow Hub, along with on-device deployment guides, makes it straightforward to run at the edge.
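For reference, a minimal sketch of pulling coarse ambient tags from YAMNet via TensorFlow Hub; how those tags are thresholded and turned into modality-gating decisions is left to the integrator.

```python
import csv
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load YAMNet from TensorFlow Hub and read the 521-class label map it ships with.
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")
class_map_path = yamnet.class_map_path().numpy().decode("utf-8")
with open(class_map_path) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

def ambient_tags(waveform_16k_mono: np.ndarray, top_k: int = 5):
    """Return the top-k audio event labels for a mono 16 kHz float32 waveform in [-1, 1]."""
    scores, embeddings, spectrogram = yamnet(waveform_16k_mono)
    mean_scores = tf.reduce_mean(scores, axis=0).numpy()   # average scores over time frames
    top = mean_scores.argsort()[-top_k:][::-1]
    return [(class_names[i], float(mean_scores[i])) for i in top]

# Example: one second of silence as a placeholder for a real microphone buffer.
print(ambient_tags(np.zeros(16000, dtype=np.float32)))
```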
How to integrate Sensible Agent concepts
A minimal adoption plan includes:
- Instrument a lightweight context parser that produces a compact state from a VLM run on egocentric frames plus ambient audio tags.
- Build a few-shot table mapping context to (action, query type, modality) from internal pilots or user studies.
- Prompt an LMM to emit both the ‘what’ and the ‘how’ together (see the prompt sketch after this list).
- Expose only feasible input methods per state and keep confirmations binary by default.
- Log choices and outcomes for offline policy learning.
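To illustrate the prompting step, here is a hedged sketch of asking a model to return the ‘what’ and the ‘how’ in one structured reply. `call_lmm` is a placeholder for whatever model client you use, and the prompt wording and JSON schema are illustrative rather than the paper's.

```python
import json

def build_prompt(exemplar_lines: str, context: dict) -> str:
    """Combine few-shot exemplars with the current context into a single prompt."""
    return (
        "You are a proactive AR agent. Given the user's context, choose the action, "
        "query structure, and presentation modality together.\n"
        'Reply with JSON: {"action": ..., "query_type": ..., "modality": ...}.\n\n'
        f"Examples:\n{exemplar_lines}\n\n"
        f"Context: {json.dumps(context)}\nAnswer:"
    )

def call_lmm(prompt: str) -> str:
    # Placeholder: swap in your multimodal model endpoint here.
    return '{"action": "remind", "query_type": "binary", "modality": "visual"}'

def decide(exemplar_lines: str, context: dict) -> dict:
    """Get the coupled 'what + how' as one parsed, validated decision."""
    reply = call_lmm(build_prompt(exemplar_lines, context))
    decision = json.loads(reply)
    assert {"action", "query_type", "modality"} <= decision.keys()
    return decision

print(decide("Context: {...} -> action=remind, query=binary, modality=visual",
             {"scene": "kitchen", "hands_busy": True, "noisy": False}))
```

Keeping the reply as a single JSON object makes it easy to log each (context, decision, outcome) triple for the offline policy learning mentioned above.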
The Sensible Agent artifacts demonstrate feasibility in WebXR/Chrome on Android-class hardware, so migrating to native HMD runtimes or phone-based HUDs is primarily an engineering exercise.
Bottom line
Sensible Agent operationalizes proactive AR as a coupled policy problem, providing a reproducible recipe: a dataset of context-to-(what/how) mappings, few-shot prompts that bind them at runtime, and low-effort input primitives that respect social and I/O constraints. The prototype and small user validation suggest the approach can reduce interaction cost compared to voice-first prompts.