How to Build a Reliable Groundedness Verification Tool with Upstage API and LangChain
This tutorial demonstrates how to build a groundedness verification tool with the Upstage API and LangChain, covering single and batch checks, multi-domain testing, and result analysis.
Introduction to Upstage's Groundedness Check API
Upstage offers a Groundedness Check API designed to verify whether AI-generated responses are actually supported by reliable source material. By sending context-answer pairs to the Upstage endpoint, developers can instantly determine whether the context supports the answer and obtain a verdict from which a coarse confidence level can be derived.
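The integration used in this tutorial returns a short verdict string rather than a numeric score. As a rough sketch of the shapes involved (the values here are illustrative, and the verdict strings follow the langchain-upstage documentation):

# One context-answer pair per request.
request = {
    "context": "Mauna Kea is an inactive volcano on the island of Hawai'i.",
    "answer": "Mauna Kea is in Hawai'i.",
}

# The response is a verdict string, typically one of:
possible_verdicts = {"grounded", "notGrounded", "notSure"}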
Setting Up the Environment
To get started, you need to install the necessary packages and configure your API key:
!pip install -qU langchain-core langchain-upstage

import os
import json
from typing import List, Dict, Any

from langchain_upstage import UpstageGroundednessCheck

os.environ["UPSTAGE_API_KEY"] = "Use Your API Key Here"

This setup installs the LangChain core and Upstage integration packages, imports the required Python modules, and sets the API key used to authenticate requests.
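Hard-coding credentials is convenient in a notebook but easy to leak. As a sketch of a safer alternative, you can prompt for the key at runtime using the standard-library getpass module:

import os
from getpass import getpass

# Prompt for the key only when it is not already set in the environment.
if not os.environ.get("UPSTAGE_API_KEY"):
    os.environ["UPSTAGE_API_KEY"] = getpass("Enter your Upstage API key: ")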
Creating an Advanced Groundedness Checker
The AdvancedGroundednessChecker class wraps the UpstageGroundednessCheck client with methods for single and batch checks, coarse confidence extraction, and result analysis:
class AdvancedGroundednessChecker:
    """Advanced wrapper for Upstage Groundedness Check with batch processing and analysis"""

    def __init__(self):
        self.checker = UpstageGroundednessCheck()
        self.results = []

    def check_single(self, context: str, answer: str) -> Dict[str, Any]:
        """Check groundedness for a single context-answer pair"""
        request = {"context": context, "answer": answer}
        response = self.checker.invoke(request)
        result = {
            "context": context,
            "answer": answer,
            "grounded": response,
            "confidence": self._extract_confidence(response),
        }
        self.results.append(result)
        return result

    def batch_check(self, test_cases: List[Dict[str, str]]) -> List[Dict[str, Any]]:
        """Process multiple test cases"""
        batch_results = []
        for case in test_cases:
            result = self.check_single(case["context"], case["answer"])
            batch_results.append(result)
        return batch_results

    def _extract_confidence(self, response) -> str:
        """Map the verdict string to a coarse confidence level"""
        verdict = str(response).lower()
        # Check the negative verdict first: "notGrounded" also contains the
        # substring "grounded", so the order of these tests matters.
        if "notgrounded" in verdict or "not grounded" in verdict:
            return "low"
        if "grounded" in verdict:
            return "high"
        return "medium"  # e.g. "notSure" or an unrecognized response

    def analyze_results(self) -> Dict[str, Any]:
        """Analyze batch results"""
        total = len(self.results)
        # Count via the derived confidence rather than a substring match,
        # which would wrongly count "notGrounded" as grounded.
        grounded = sum(1 for r in self.results if r["confidence"] == "high")
        return {
            "total_checks": total,
            "grounded_count": grounded,
            "not_grounded_count": total - grounded,
            "accuracy_rate": grounded / total if total > 0 else 0,
        }

checker = AdvancedGroundednessChecker()

This class simplifies running checks on single or multiple pairs and helps interpret results with coarse confidence levels and aggregate metrics.
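Because every check is a network call, transient failures are possible. Here is a minimal retry sketch around check_single (check_with_retry is a hypothetical helper, not part of the Upstage SDK, and the exact exception types may vary by client version):

import time

def check_with_retry(checker_instance, context, answer, retries=3, delay=2.0):
    """Retry check_single with linear backoff on transient errors (hypothetical helper)."""
    for attempt in range(retries):
        try:
            return checker_instance.check_single(context, answer)
        except Exception:  # the SDK's exact exception types may vary
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(delay * (attempt + 1))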
Running Single Context-Answer Checks
Here are examples demonstrating how the checker flags grounded and ungrounded answers:
print("=== Test Case 1: Height Discrepancy ===")
result1 = checker.check_single(
context="Mauna Kea is an inactive volcano on the island of Hawai'i.",
answer="Mauna Kea is 5,207.3 meters tall."
)
print(f"Result: {result1['grounded']}")
print("\n=== Test Case 2: Correct Information ===")
result2 = checker.check_single(
context="Python is a high-level programming language created by Guido van Rossum in 1991. It emphasizes code readability and simplicity.",
answer="Python was made by Guido van Rossum & focuses on code readability."
)
print(f"Result: {result2['grounded']}")
print("\n=== Test Case 3: Partial Information ===")
result3 = checker.check_single(
context="The Great Wall of China is approximately 13,000 miles long and took over 2,000 years to build.",
answer="The Great Wall of China is very long."
)
print(f"Result: {result3['grounded']}")
print("\n=== Test Case 4: Contradictory Information ===")
result4 = checker.check_single(
context="Water boils at 100 degrees Celsius at sea level atmospheric pressure.",
answer="Water boils at 90 degrees Celsius at sea level."
)
print(f"Result: {result4['grounded']}")These tests show how the service discriminates between accurate, partial, and incorrect answers.
Batch Processing and Multi-Domain Testing
The checker also supports batch verification and multi-domain validation:
print("\n=== Batch Processing Example ===")
test_cases = [
{
"context": "Shakespeare wrote Romeo and Juliet in the late 16th century.",
"answer": "Romeo and Juliet was written by Shakespeare."
},
{
"context": "The speed of light is approximately 299,792,458 meters per second.",
"answer": "Light travels at about 300,000 kilometers per second."
},
{
"context": "Earth has one natural satellite called the Moon.",
"answer": "Earth has two moons."
}
]
batch_results = checker.batch_check(test_cases)
for i, result in enumerate(batch_results, 1):
print(f"Batch Test {i}: {result['grounded']}")
print("\n=== Results Analysis ===")
analysis = checker.analyze_results()
print(f"Total checks performed: {analysis['total_checks']}")
print(f"Grounded responses: {analysis['grounded_count']}")
print(f"Not grounded responses: {analysis['not_grounded_count']}")
print(f"Groundedness rate: {analysis['accuracy_rate']:.2%}")
print("\n=== Multi-domain Testing ===")
domains = {
"Science": {
"context": "Photosynthesis is the process by which plants convert sunlight, carbon dioxide, & water into glucose and oxygen.",
"answer": "Plants use photosynthesis to make food from sunlight and CO2."
},
"History": {
"context": "World War II ended in 1945 after the surrender of Japan following the atomic bombings.",
"answer": "WWII ended in 1944 with Germany's surrender."
},
"Geography": {
"context": "Mount Everest is the highest mountain on Earth, located in the Himalayas at 8,848.86 meters.",
"answer": "Mount Everest is the tallest mountain and is located in the Himalayas."
}
}
for domain, test_case in domains.items():
result = checker.check_single(test_case["context"], test_case["answer"])
print(f"{domain}: {result['grounded']}")This approach allows comprehensive testing across different topics and assessment of overall groundedness.
Generating Detailed Test Reports
A helper function compiles results into a summary report with recommendations:
def create_test_report(checker_instance):
    """Generate a detailed test report"""
    report = {
        "summary": checker_instance.analyze_results(),
        "detailed_results": checker_instance.results,
        "recommendations": []
    }

    accuracy = report["summary"]["accuracy_rate"]
    if accuracy < 0.7:
        report["recommendations"].append("Consider reviewing answer generation process")
    if accuracy > 0.9:
        report["recommendations"].append("High accuracy - system performing well")
    return report

print("\n=== Final Test Report ===")
report = create_test_report(checker)
print(f"Overall Performance: {report['summary']['accuracy_rate']:.2%}")
print("Recommendations:", report["recommendations"])

This report helps identify areas for improvement. Keep in mind that several of the test cases above are deliberately ungrounded, so a rate below the 0.7 threshold here reflects the test design rather than a fault in the checker.
Summary
Upstage’s Groundedness Check provides a scalable, domain-agnostic way to verify in real time whether generated answers are supported by their source context. Integrating this service strengthens the reliability and factual integrity of AI-generated content across diverse applications.