NVIDIA's NitroGen: A Game-Changing AI Model
NVIDIA releases NitroGen, an AI model revolutionizing generalist gaming agents.
Overview
The NVIDIA AI research team has released NitroGen, an open vision-action foundation model designed for generalist gaming agents. The model learns to play commercial games directly from pixels and gamepad actions, leveraging internet video at scale. NitroGen was trained on 40,000 hours of gameplay spanning more than 1,000 games, and the release includes an open dataset, a universal simulator, and a pre-trained policy.
Internet-Scale Video-Action Dataset
The NitroGen pipeline begins with publicly available gameplay videos that feature input overlays, such as the gamepad visualizations used by streamers. The research team gathered 71,000 hours of raw video with overlays and applied quality filters to retain 40,000 hours of data spanning more than 1,000 games.
The curated dataset comprises 38,739 videos from 818 creators, covering diverse titles across many genres: action RPGs account for 34.9% of the footage, platformers 18.4%, and action-adventure games 9.2%, with additional coverage of racing, sports, and roguelike titles.
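The article does not spell out the filtering criteria, but a quality pass of this kind typically checks duration, resolution, and whether an input overlay is reliably visible. The sketch below is purely illustrative: the field names and thresholds (`min_height`, `overlay_confidence`, the 0.25-hour minimum) are assumptions, not NitroGen's published pipeline.

```python
from dataclasses import dataclass

@dataclass
class VideoMeta:
    video_id: str
    duration_hours: float
    height: int                # vertical resolution in pixels
    overlay_confidence: float  # detector score for a visible gamepad overlay

def passes_quality_filter(v: VideoMeta,
                          min_height: int = 720,
                          min_overlay_conf: float = 0.8) -> bool:
    """Keep only videos that are long enough, high-resolution, and clearly
    show a controller overlay. Thresholds are illustrative assumptions."""
    return (v.duration_hours >= 0.25
            and v.height >= min_height
            and v.overlay_confidence >= min_overlay_conf)

# Example: filter a raw corpus down to a curated subset.
raw_corpus = [
    VideoMeta("a1", 1.5, 1080, 0.95),
    VideoMeta("b2", 0.1, 480, 0.30),
]
curated = [v for v in raw_corpus if passes_quality_filter(v)]
print(len(curated))  # -> 1
```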
Action Extraction from Controller Overlays
NitroGen employs a three-stage action extraction pipeline:
- Template Matching: Localizes controller overlays using ~300 templates, sampling 25 frames to match features.
- SegFormer Classification: Parses controller crops and identifies joystick locations using a hybrid model trained on 8 million synthetic images.
- Joystick Position Refinement: Normalizes joystick coordinates and filters low-activity segments, avoiding over-prediction of null actions during training.
Benchmarks show that joystick predictions achieve an average R² of 0.84 and button presses a per-frame accuracy of 0.96 across Xbox and PlayStation controllers, validating the quality of the automatic annotations.
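To make the third stage concrete, here is a minimal sketch of joystick normalization and low-activity filtering in NumPy. The overlay geometry (center, radius) and the 5% activity threshold are assumptions chosen for the example, not NitroGen's published parameters.

```python
import numpy as np

def normalize_joystick(xy_pixels: np.ndarray, center: np.ndarray, radius: float) -> np.ndarray:
    """Map detected joystick pixel coordinates into [-1, 1] per axis.

    xy_pixels: (T, 2) joystick tip positions inside the overlay crop.
    center, radius: geometry of the joystick widget, assumed known from
    the template-matching / SegFormer stages.
    """
    normed = (xy_pixels - center) / radius
    return np.clip(normed, -1.0, 1.0)

def is_active_segment(buttons: np.ndarray, sticks: np.ndarray,
                      min_active_frac: float = 0.05) -> bool:
    """Discard segments where almost every frame is a null action, so the
    policy is not trained to over-predict 'do nothing'.

    buttons: (T, B) binary button states; sticks: (T, 4) normalized stick axes.
    The 5% activity threshold is an illustrative assumption.
    """
    frame_active = buttons.any(axis=1) | (np.abs(sticks) > 0.1).any(axis=1)
    return frame_active.mean() >= min_active_frac

# Example usage on synthetic data
T = 100
sticks = normalize_joystick(np.random.randint(0, 64, size=(T, 2)).astype(float),
                            center=np.array([32.0, 32.0]), radius=28.0)
buttons = np.zeros((T, 12), dtype=bool)
print(is_active_segment(buttons, np.concatenate([sticks, sticks], axis=1)))
```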
Universal Simulator and Multi-Game Benchmark
NitroGen ships with a universal simulator that exposes a Gymnasium-compatible interface, allowing cross-title interaction without modifying game code. Observations are single RGB frames, and actions are encoded in a unified 16-dimensional vector covering binary gamepad button states and joystick positions.
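The simulator's API is not reproduced here, but a minimal sketch of what a Gymnasium-compatible environment with this observation/action contract could look like is shown below. The class name `NitroGenEnv`, the 12-buttons-plus-4-axes split, and the space definitions are assumptions for illustration, not NVIDIA's actual simulator code.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class NitroGenEnv(gym.Env):
    """Illustrative Gymnasium-style wrapper: single RGB frame in,
    one unified 16-dimensional gamepad action out."""

    def __init__(self, frame_shape=(256, 256, 3)):
        super().__init__()
        self.observation_space = spaces.Box(0, 255, shape=frame_shape, dtype=np.uint8)
        # Assumed split: 12 binary buttons + 2 axes per stick, packed into one vector.
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(16,), dtype=np.float32)
        self._frame = np.zeros(frame_shape, dtype=np.uint8)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self._frame, {}  # observation, info

    def step(self, action):
        assert action.shape == (16,)
        # A real backend would inject the gamepad state into the running game
        # and grab the next rendered frame; here we return a dummy frame.
        obs = self._frame
        reward, terminated, truncated, info = 0.0, False, False, {}
        return obs, reward, terminated, truncated, info

# Usage
env = NitroGenEnv()
obs, _ = env.reset()
obs, r, term, trunc, info = env.step(env.action_space.sample())
```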
The benchmark includes 10 commercial games and 30 tasks, incorporating both 2D and 3D gameplay, with dedicated evaluations across combat and navigation tasks.
NitroGen Model Architecture
NitroGen's foundation policy follows the GR00T N1 architecture, pairing a vision encoder with an action head and processing 256x256 RGB frames. The model uses a diffusion transformer (DiT) to generate future actions and is trained with conditional flow matching over action chunks.
The model totals 4.93 billion parameters and outputs a 21x16 action tensor, i.e., a chunk of 21 future timesteps in the 16-dimensional gamepad action space.
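For intuition, here is a hedged sketch of a conditional flow-matching objective over an action chunk, assuming a 21x16 chunk and a policy network with the hypothetical signature `policy(frames, noisy_actions, t)`. It is a generic flow-matching loss, not NVIDIA's released training code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(policy, frames, actions):
    """Generic conditional flow-matching loss over action chunks.

    frames:  (B, 3, 256, 256) conditioning observations.
    actions: (B, 21, 16) ground-truth action chunks (targets x1).
    The policy predicts a velocity field v(x_t, t | frame) and is trained
    to match the straight-line velocity (x1 - x0) between noise and data.
    """
    x1 = actions
    x0 = torch.randn_like(x1)                             # noise sample
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)   # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1                            # point on the straight path
    target_velocity = x1 - x0
    pred_velocity = policy(frames, xt, t.squeeze(-1).squeeze(-1))
    return F.mse_loss(pred_velocity, target_velocity)
```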
Training Outcomes and Transfer Gains
NitroGen is trained purely via behavior cloning on the internet video dataset, without reinforcement learning. Image augmentations are applied during training, and zero-shot evaluations show task completion rates between 45% and 60% across various game types.
Fine-tuning from NitroGen yields relative improvements of 10% to 25% on new games, and as much as 52% in low-data scenarios, demonstrating meaningful performance gains on unseen titles.
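As a rough picture of how such transfer could be run, the snippet below fine-tunes a pretrained policy on a small set of demonstrations from a new game using the same behavior-cloning objective. It reuses the `flow_matching_loss` sketch above; the learning rate and epoch count are illustrative assumptions, not reported values.

```python
import torch

def finetune(policy, dataloader, epochs=5, lr=1e-5):
    """Behavior-cloning fine-tune of a pretrained policy on a small
    new-game dataset. Hyperparameters here are illustrative only."""
    optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, action_chunks in dataloader:
            loss = flow_matching_loss(policy, frames, action_chunks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```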
Key Takeaways
- Generalist Model: NitroGen maps game frames to a standardized gamepad action space without reinforcement learning.
- Large Dataset: Trained on 40,000 hours of gameplay automatically labeled from controller overlays.
- Cross-Game Transfer: A unified controller action space lets a single policy be deployed across multiple games.
- Advanced Architecture: Diffusion transformers handle noisy internet-scale action labels, yielding robust control.
- Boosted Performance: Pretraining with NitroGen improves completion rates across a variety of tasks in new games.