IndQA: OpenAI’s Culture-Aware Benchmark for Indian Language Understanding
OpenAI's IndQA benchmark measures AI understanding and reasoning in 12 Indian languages across culturally relevant domains, using expert-written prompts and rubric-based scoring.
Why IndQA Exists
OpenAI created IndQA to answer a practical question: can large language models truly understand Indian languages and the cultural contexts that shape them? Many existing multilingual evaluations focus on translation or multiple-choice formats and often cluster top models at near-ceiling scores. That makes it hard to tell whether models grasp local context, history, and everyday life beyond surface-level token matching.
Dataset, Languages, and Cultural Domains
IndQA contains 2,278 questions written natively across 12 languages: Bengali, English, Gujarati, Hindi, Hinglish, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu. The benchmark spans 10 cultural domains: Architecture and Design, Arts and Culture, Everyday Life, Food and Cuisine, History, Law and Ethics, Literature and Linguistics, Media and Entertainment, Religion and Spirituality, and Sports and Recreation. Hinglish is included to reflect the code-switching common in everyday conversation.
Expert-Curated Examples and Structure
Each datapoint bundles four components: a culturally grounded prompt in an Indian language, an English translation for auditability, a detailed rubric that lists grading criteria and weights, and an ideal answer that encodes expert expectations. The dataset was created with contributions from 261 domain experts across India, ensuring regionally anchored, reasoning-heavy prompts.
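As a rough illustration, a single datapoint could be represented as a small record like the Python sketch below. The class and field names (IndQAItem, RubricCriterion, and so on) are assumptions made for clarity, not IndQA's published schema; the language and domain fields are extra metadata beyond the four core components.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One expert-defined grading criterion and its relative weight."""
    description: str  # what a strong answer must include or avoid
    weight: float     # relative importance assigned by the expert

@dataclass
class IndQAItem:
    """Illustrative container for the four components of a datapoint."""
    prompt_native: str             # culturally grounded prompt in the source language
    prompt_english: str            # English translation kept for auditability
    rubric: list[RubricCriterion]  # weighted grading criteria
    ideal_answer: str              # expert-written reference answer
    language: str                  # assumed metadata, e.g. "Hindi"
    domain: str                    # assumed metadata, e.g. "Food and Cuisine"
```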
Rubric-Based Evaluation Instead of Exact Match
Rather than relying on exact-match accuracy, IndQA uses a rubric-based grading pipeline. Domain experts define multiple weighted criteria for each question describing what strong answers must include or avoid. A model-based grader evaluates model responses against those criteria and awards partial credit based on satisfied criteria. This approach captures nuance and cultural correctness instead of surface token overlap.
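A minimal sketch of weighted, partial-credit scoring under this scheme is shown below. The rubric_score helper and the example criteria are illustrative assumptions; in the actual pipeline the per-criterion verdicts would come from a model-based grader rather than being supplied by hand.

```python
def rubric_score(criteria: list[tuple[str, float]],
                 satisfied: list[bool]) -> float:
    """Weighted partial credit: fraction of total rubric weight earned.

    criteria: (description, weight) pairs defined by a domain expert.
    satisfied: per-criterion verdicts, which in practice a model-based
    grader would produce by judging the response against each criterion.
    """
    total = sum(w for _, w in criteria)
    earned = sum(w for (_, w), ok in zip(criteria, satisfied) if ok)
    return earned / total if total else 0.0

# A response that meets two of three weighted criteria earns partial credit.
criteria = [
    ("Names the correct regional dish", 2.0),
    ("Explains its festival context", 1.5),
    ("Avoids conflating it with another cuisine", 1.0),
]
print(round(rubric_score(criteria, [True, True, False]), 3))  # 0.778
```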
Adversarial Filtering and Construction Pipeline
OpenAI followed a four-step pipeline: recruit native-speaking domain experts; apply adversarial filtering by testing draft questions against strong models (GPT-4o, OpenAI o3, GPT-4.5 and partially GPT-5) and keep only items where most models failed; reuse the expert-defined rubrics for consistent grading; and finalize prompts with ideal answers, translations, peer review, and iterative revisions. Adversarial filtering preserves evaluation headroom so progress remains measurable.
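The filtering step might look roughly like the following sketch, which keeps a draft question only when few of the reference models reach a passing score on it. The adversarial_filter helper, its thresholds, and the example scores are assumptions for illustration, not OpenAI's actual cutoffs.

```python
def adversarial_filter(draft_items, model_scores, max_pass_fraction=0.25,
                       pass_score=0.5):
    """Keep only draft questions that most strong models fail.

    model_scores[item_id] is an assumed mapping of model name -> rubric
    score in [0, 1] for that item; thresholds here are illustrative.
    """
    kept = []
    for item in draft_items:
        scores = model_scores[item["id"]].values()
        pass_fraction = sum(s >= pass_score for s in scores) / len(scores)
        if pass_fraction <= max_pass_fraction:  # most models failed: headroom preserved
            kept.append(item)
    return kept

# Example with two draft questions and three reference models.
drafts = [{"id": "q1"}, {"id": "q2"}]
scores = {
    "q1": {"gpt-4o": 0.2, "o3": 0.4, "gpt-4.5": 0.1},  # hard: kept
    "q2": {"gpt-4o": 0.9, "o3": 0.8, "gpt-4.5": 0.7},  # easy: dropped
}
print([q["id"] for q in adversarial_filter(drafts, scores)])  # ['q1']
```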
Tracking Progress and Results
OpenAI reports that frontier models have improved on IndQA over recent years but that substantial room for improvement remains. Results are stratified by language and domain and include comparisons such as GPT-5 Thinking High against other systems. Because questions were adversarially filtered and graded with nuanced rubrics, IndQA highlights meaningful gains in cultural and contextual reasoning rather than superficial improvements.
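As an illustration of stratified reporting, per-language and per-domain averages can be computed with a simple aggregation like the one below; the field names and scores are hypothetical, not IndQA's reporting format.

```python
from collections import defaultdict

def stratified_means(results):
    """Average rubric scores per (language, domain) stratum.

    results: iterable of dicts with 'language', 'domain', and 'score'
    keys; these field names are illustrative assumptions.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for r in results:
        key = (r["language"], r["domain"])
        sums[key] += r["score"]
        counts[key] += 1
    return {k: round(sums[k] / counts[k], 3) for k in sums}

print(stratified_means([
    {"language": "Hindi", "domain": "History", "score": 0.6},
    {"language": "Hindi", "domain": "History", "score": 0.8},
    {"language": "Tamil", "domain": "Food and Cuisine", "score": 0.5},
]))
# {('Hindi', 'History'): 0.7, ('Tamil', 'Food and Cuisine'): 0.5}
```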
Why It Matters
IndQA addresses a gap in multilingual evaluation by centering culturally relevant reasoning in Indian languages. With expert curation, rubric-based scoring, and adversarial filtering, it offers a practical north star for assessing how well modern AI systems understand region-specific knowledge and everyday cultural contexts in India.