AmbiGraph-Eval: Benchmarking LLMs for Ambiguity in Graph Query Generation
AmbiGraph-Eval introduces a 560-example benchmark that measures LLM performance on ambiguous natural-language-to-Cypher generation. Evaluations of nine models highlight challenges in ambiguity resolution and syntax generation.
Semantic parsing turns natural language into precise graph queries such as Cypher, enabling intuitive database interaction. Natural language, however, is often ambiguous, while query languages demand exact semantics. Graph databases introduce extra complexity because nodes and relationships create many plausible interpretations for the same phrase.
Why ambiguity in graph queries matters
A phrase like 'best evaluated restaurant' can map to different graph queries depending on whether you compare individual ratings, aggregate scores, or ratings tied to specific visits or reviewers. A wrong interpretation yields incorrect results and wastes computation, and in time-sensitive or high-stakes environments those mistakes reduce effectiveness or drive up costs.
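To make this concrete, here is a minimal sketch assuming a hypothetical schema with Restaurant and Review nodes linked by a HAS_REVIEW relationship and a rating property on both (these names are illustrative, not taken from the benchmark); the same phrase supports at least two valid Cypher readings:

```cypher
// Reading 1: "best evaluated" = highest rating stored on the restaurant node itself
MATCH (r:Restaurant)
RETURN r.name
ORDER BY r.rating DESC
LIMIT 1;

// Reading 2: "best evaluated" = highest average rating across its reviews
MATCH (r:Restaurant)-[:HAS_REVIEW]->(rev:Review)
RETURN r.name, avg(rev.rating) AS avgRating
ORDER BY avgRating DESC
LIMIT 1;
```

Both queries are syntactically correct; only the user's intent decides which one is right.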
Types of ambiguity identified
The researchers categorize ambiguity in graph query generation into three concrete types:
- Attribute ambiguity: uncertainty about which property or attribute on a node is meant.
- Relationship ambiguity: uncertainty about which edge type or path connects entities.
- Attribute-relationship ambiguity: combinations of the two that create multi-dimensional uncertainty.
These distinctions help focus evaluation and error analysis for models that must both understand intent and produce syntactically valid queries.
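As one illustration of relationship ambiguity, consider a hypothetical movie schema (again, not drawn from the benchmark): a request like 'films by a given person' does not say which edge type to traverse.

```cypher
// Reading 1: films the person directed
MATCH (p:Person {name: $name})-[:DIRECTED]->(m:Movie)
RETURN m.title;

// Reading 2: films the person acted in
MATCH (p:Person {name: $name})-[:ACTED_IN]->(m:Movie)
RETURN m.title;
```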
Building AmbiGraph-Eval
AmbiGraph-Eval is a benchmark of 560 ambiguous natural language queries paired with graph database samples, crafted to probe LLMs' ability to generate syntactically correct and semantically appropriate Cypher statements. The dataset was created in two phases: initial data collection and human review. Ambiguous prompts were obtained by three methods: extraction from existing graph databases, synthesis from unambiguous data via LLMs, and full generation by prompting LLMs to create new ambiguous cases. Human review ensured the cases were realistic and genuinely ambiguous.
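The exact data format is documented in the paper and repository; purely as a hypothetical illustration of the kind of pairing described, an entry couples a small graph sample with an ambiguous request that admits more than one defensible answer:

```cypher
// Tiny graph sample (hypothetical, not an actual benchmark item)
CREATE (a:Restaurant {name: 'Alma', rating: 4.2}),
       (b:Restaurant {name: 'Brio', rating: 4.7}),
       (a)-[:HAS_REVIEW]->(:Review {rating: 5}),
       (b)-[:HAS_REVIEW]->(:Review {rating: 3});

// Ambiguous request: "Show the best evaluated restaurant"
// -> the stored restaurant rating favors Brio, while the review-based
//    average favors Alma, so the entry has two defensible answers.
```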
Researchers evaluated nine LLMs, including closed-source models such as GPT-4 and Claude-3.5-Sonnet and open-source models such as Qwen-2.5 and LLaMA-3.1. Evaluations ran via API calls or on local infrastructure using GPUs, measuring models on zero-shot generation of Cypher queries and their ability to resolve ambiguity across the three categories.
Key findings from model evaluation
Performance varies widely by ambiguity type and task setup. Highlights include:
- Attribute ambiguity: models differ between same-entity scenarios and cross-entity comparisons. O1-mini performed strongly on same-entity tasks, while GPT-4o and LLaMA-3.1 also showed good results. GPT-4o led on cross-entity tasks.
- Relationship ambiguity: LLaMA-3.1 led overall, while GPT-4o had mixed results, weaker on same-entity cases but strong on cross-entity reasoning.
- Attribute-relationship ambiguity: this combined category proved hardest. LLaMA-3.1 did best on same-entity tasks and GPT-4o on cross-entity tasks, but overall scores dropped compared to single-dimension ambiguities.
The evaluation shows that strong general reasoning does not automatically translate into reliable ambiguity resolution for graph queries. Models struggle with identifying ambiguous intent, producing correct syntax, interpreting graph schemas, and performing aggregations.
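The aggregation difficulty in particular is easy to reproduce. Assuming the same hypothetical review schema as above (with an added User node and WROTE relationship), the grain of the count changes the answer:

```cypher
// "How many reviews does each restaurant have?"
// Counting review nodes reachable over HAS_REVIEW:
MATCH (r:Restaurant)-[:HAS_REVIEW]->(rev:Review)
RETURN r.name, count(rev) AS reviewCount;

// Counting distinct reviewers is a different question with a different result:
MATCH (r:Restaurant)-[:HAS_REVIEW]->(:Review)<-[:WROTE]-(u:User)
RETURN r.name, count(DISTINCT u) AS reviewerCount;
```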
Bottlenecks and directions for improvement
Two major bottlenecks emerged: ambiguity detection and syntax generation. Even when a model reasons correctly about entities and attributes, it may fail to render that reasoning in valid Cypher or to flag that several plausible interpretations exist. Suggested directions include syntax-aware prompting, explicit signaling of ambiguity to the model, and tighter integration of schema information or interactive clarification strategies. The benchmark provides a diagnostic tool for future work aimed at aligning LLM outputs with true user intent in graph query settings.
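On the schema-integration point, one low-effort option (our assumption, not a recommendation from the paper) is to retrieve the schema from the database itself and include it in the prompt. In Neo4j, built-in procedures expose this, though availability varies by version:

```cypher
// Node labels present in the database
CALL db.labels();

// Relationship types present in the database
CALL db.relationshipTypes();

// Property keys in use
CALL db.propertyKeys();

// Compact schema graph (labels plus relationship structure)
CALL db.schema.visualization();
```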
Resources
The researchers provide a technical paper and a GitHub page with tutorials, code, and notebooks for those who want to reproduce or extend the benchmark. The paper and repository are useful starting points for teams looking to improve LLM-driven graph query generation.