
Nous Research Unveils NousCoder-14B: Competitive AI Model

NousCoder-14B shows high accuracy in competitive programming evaluations.

Overview of NousCoder-14B

Nous Research has introduced NousCoder-14B, a competitive Olympiad programming model post-trained on Qwen3-14B using reinforcement learning (RL) with verifiable rewards. On LiveCodeBench v6, which covers problems released between 08/01/2024 and 05/01/2025, the model achieves a Pass@1 accuracy of 67.87%, surpassing the Qwen3-14B baseline of 60.79% by 7.08 percentage points. Training took 4 days on 48 B200 GPUs over 24,000 verifiable coding problems, and the weights have been released under the Apache 2.0 license on Hugging Face.

Benchmark Focus and the Significance of Pass@1

LiveCodeBench v6 is designed specifically to evaluate competitive programming ability; its test set comprises 454 problems released between 08/01/2024 and 05/01/2025. The training methodology follows the DeepCoder-14B project from Agentica and Together AI, whose data draws on TACO Verified, PrimeIntellect SYNTHETIC 1, and LiveCodeBench problems released before 07/31/2024, so the training pool does not overlap with the evaluation window. Pass@1 measures the fraction of problems for which the model's first generated program passes every test case within the time and memory limits.
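As a concrete illustration, the sketch below scores Pass@1 as that fraction over per-problem pass/fail outcomes; the function name and input layout are assumptions for illustration, not the NousCoder evaluation harness.

```python
def pass_at_1(first_attempt_correct: list[bool]) -> float:
    """Fraction of problems whose first generated program passes all tests."""
    return sum(first_attempt_correct) / len(first_attempt_correct)

# Toy example: 3 of 4 first attempts pass -> 0.75
print(pass_at_1([True, False, True, True]))
```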

Dataset Construction for Reinforcement Learning

The training datasets comprise verifiable code generation problems, with each problem having a reference implementation and multiple test cases. The training set includes:

  • TACO Verified
  • PrimeIntellect SYNTHETIC 1
  • Pre-07/31/2024 LiveCodeBench problems

The evaluation set is LiveCodeBench v6, whose 454 problems all fall within the 08/01/2024 to 05/01/2025 window, keeping it disjoint from the training data listed above.
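A verifiable problem, as described above, pairs a statement with a reference implementation and test cases. The record below is a minimal sketch of that structure; the field names and schema are assumptions, not the exact format used by Nous Research.

```python
from dataclasses import dataclass

@dataclass
class VerifiableProblem:
    statement: str                 # natural-language problem description
    reference_solution: str        # known-correct Python implementation
    tests: list[tuple[str, str]]   # (stdin, expected stdout) pairs

example = VerifiableProblem(
    statement="Read two integers and print their sum.",
    reference_solution="a, b = map(int, input().split())\nprint(a + b)",
    tests=[("1 2", "3"), ("10 -4", "6")],
)
```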

RL Environment Based on Atropos and Modal

The RL environment is built on the Atropos framework: for each problem, the model generates Python code from a standard prompt and receives a scalar reward based on test-case outcomes:

  • Reward +1 if the generated code passes all test cases.
  • Reward -1 for any wrong output, time-limit breach, or memory-limit violation.

To run untrusted code safely, Modal serves as an autoscaling sandbox: each rollout executes in its own container. This separation of training from verification keeps the RL loop stable.
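The sketch below shows the binary reward described above: +1 only if the program passes every test case within limits, otherwise -1. It is a local, hedged approximation; the real system executes each rollout in a Modal container rather than a local subprocess, and the function signature is an assumption.

```python
import subprocess

def reward(program: str, tests: list[tuple[str, str]], time_limit_s: float = 2.0) -> float:
    """Return +1.0 if the program passes all (stdin, expected stdout) tests, else -1.0."""
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                ["python", "-c", program],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit_s,  # a time-limit breach counts as failure
            )
        except subprocess.TimeoutExpired:
            return -1.0
        if proc.returncode != 0 or proc.stdout.strip() != expected.strip():
            return -1.0
    return 1.0
```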

Objectives Employed: GRPO, DAPO, GSPO, and GSPO+

NousCoder-14B is trained with the Group Relative Policy Optimization (GRPO) family, which requires no separate value model, and compares three objectives:

  • Dynamic Sampling Policy Optimization (DAPO)
  • Group Sequence Policy Optimization (GSPO)
  • Modified GSPO variant (GSPO+)

These objectives operate on group-normalized rewards, and while their performance differences on LiveCodeBench v6 are modest, DAPO achieves the highest Pass@1 at 67.87% for the longest context length of 81,920 tokens.
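The group normalization these objectives share can be illustrated as below: each prompt's group of rollouts is normalized by its own mean and standard deviation, which is what removes the need for a value model. This is a minimal sketch of the idea, not the exact DAPO/GSPO/GSPO+ formulations.

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (num_rollouts,) for a single prompt's group of samples."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Two passing and two failing rollouts in one group.
print(group_advantages(np.array([1.0, -1.0, -1.0, 1.0])))  # -> [ 1. -1. -1.  1.]
```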

Iterative Context Extension and Overlong Filtering

Training starts at a 32k-token context and is later extended to 40k tokens; at evaluation time, YaRN context extension pushes the usable context of Qwen3-14B to 81,920 tokens. Overlong filtering zeroes the advantage of any rollout that exceeds the maximum context, so truncated generations are neither rewarded nor penalized and solution quality is preserved.
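A minimal sketch of that overlong filtering follows, assuming per-rollout advantages and generation lengths are available; the variable names and masking approach are illustrative, not the exact implementation.

```python
import numpy as np

def filter_overlong(advantages: np.ndarray, lengths: np.ndarray, max_len: int) -> np.ndarray:
    """Zero the advantage of any rollout that hit the maximum context length."""
    finished_in_budget = lengths < max_len
    return np.where(finished_in_budget, advantages, 0.0)

adv = np.array([1.0, -1.0, 1.0])
lens = np.array([5000, 81920, 30000])  # second rollout hit the 81,920-token cap
print(filter_overlong(adv, lens, max_len=81920))  # -> [ 1.  0.  1.]
```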

Key Takeaways

  • NousCoder-14B, based on Qwen3-14B, shines in competitive programming tasks with a 67.87% Pass@1 on LiveCodeBench v6.
  • The model utilizes 24,000 verified problems, achieving strong performance on a separate 454-problem test set.
  • The RL environment rewards only fully correct solutions: +1 when every test case passes within the time and memory limits, -1 otherwise.
  • The GRPO-family objectives (DAPO, GSPO, GSPO+) perform similarly on this benchmark, with DAPO posting the best Pass@1 at the longest context.
  • The training methodology, including context extension and filtering, results in a robust, reproducible model.

For further detail, see the released model weights and accompanying technical details.
