Google AI Launches LangExtract: Python Library for Traceable Data Extraction from Unstructured Text

Unlocking Insights from Unstructured Text

Critical information often sits inside unstructured text such as clinical notes, legal contracts, or customer feedback, and extracting structured, traceable data from these documents remains technically difficult. LangExtract, a new open-source Python library from Google AI, addresses this problem by using large language models (LLMs) such as Gemini to produce structured extractions in which every result can be traced back to the source text.

Key Features of LangExtract

Declarative and Traceable Extraction: Users define extraction tasks using natural language prompts and few-shot examples, specifying the entities, relationships, or facts to extract. Each extracted data point is linked back to its original location in the source text, so results can be validated and audited.
Versatile Domain Application: LangExtract supports diverse fields including healthcare (extracting medications and dosages), finance (summarizing risk documents), law (contract analysis), research, and even the arts (analyzing Shakespearean literature).
Schema Enforcement with LLMs: Powered by Gemini and compatible with other LLMs, LangExtract enforces custom output schemas such as JSON, and reduces hallucination and schema drift by anchoring outputs to the instructions, the few-shot examples, and the source text.
Scalability and Visualization: The library efficiently processes lengthy documents by chunking and parallelizing, aggregates results, and offers interactive HTML reports that highlight extracted entities within the original text for easy review.
Seamless Integration: LangExtract runs in Google Colab and Jupyter notebooks, and its visualizations can be saved as standalone HTML files, enabling rapid development and iteration.
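
The source linking described in the feature list can be illustrated with a minimal sketch. This is not LangExtract's implementation: the `Span` class and the `ground` and `filter_grounded` helpers are hypothetical names introduced here, and the sketch assumes extractions are exact substrings of the source, as the prompt in the workflow below requests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical record tying an extraction to character offsets."""
    label: str
    text: str
    start: int  # inclusive offset into the source
    end: int    # exclusive offset into the source


def ground(source: str, label: str, extracted_text: str) -> Optional[Span]:
    """Locate the first exact match of extracted_text; None if it is absent."""
    start = source.find(extracted_text)
    if start == -1:
        return None
    return Span(label, extracted_text, start, start + len(extracted_text))


def filter_grounded(source, candidates):
    """Split (label, text) candidates into grounded spans and rejected pairs."""
    kept, rejected = [], []
    for label, text in candidates:
        span = ground(source, label, text)
        if span is None:
            rejected.append((label, text))
        else:
            kept.append(span)
    return kept, rejected


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
# Hand-written candidates for illustration, not real model output
candidates = [("character", "Lady Juliet"), ("character", "Hamlet")]
kept, rejected = filter_grounded(source, candidates)
for span in kept:
    print(span.label, repr(source[span.start:span.end]), span.start, span.end)
```

A candidate that cannot be found in the source, such as "Hamlet" here, is rejected rather than silently accepted, which is the basic idea behind auditing extractions against their source.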

Installation

Install LangExtract easily using pip:

pip install langextract
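
Cloud-hosted models such as Gemini also need an API key before the workflow below will run. The following is a minimal sketch of a pre-flight check; the environment variable name `LANGEXTRACT_API_KEY` is an assumption to confirm against the library's documentation, and `resolve_api_key` is a hypothetical helper, not part of LangExtract.

```python
import os


def resolve_api_key(env=None, var_name="LANGEXTRACT_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    var_name is an assumed default; check the library's docs for the exact name.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before calling the extraction API.")
    return key
```

Failing early with a clear message avoids a less obvious authentication error in the middle of an extraction run.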

Example Workflow: Extracting Character Information from Shakespeare

import langextract as lx
import textwrap
 
# 1. Define your prompt
prompt = textwrap.dedent("""
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context.
""")
 
# 2. Provide a high-quality example
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
            lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"}),
        ],
    )
]
 
# 3. Extract from new text
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
 
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro"
)
 
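# 4. Save and visualize results
# output_dir="." writes the JSONL file to the working directory so visualize() can find it
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    # In notebooks, visualize() may return an HTML display object instead of a string
    f.write(html_content.data if hasattr(html_content, "data") else html_content)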

This workflow writes the structured extractions to a JSONL file, with each extraction linked to its position in the source text, and generates an interactive HTML visualization for easy auditing.
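
Before opening the HTML report, it can help to group the extractions by class for a quick check. The sketch below assumes each extraction object exposes `extraction_class`, `extraction_text`, and `attributes`, mirroring the fields passed to `lx.data.Extraction` in the example. The sample objects are hand-written stand-ins so the code runs without an API call; they are not real model output, and `group_by_class` is a hypothetical helper.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_class(extractions):
    """Group extraction-like objects by their extraction_class label."""
    grouped = defaultdict(list)
    for ext in extractions:
        attrs = dict(ext.attributes or {})
        grouped[ext.extraction_class].append((ext.extraction_text, attrs))
    return dict(grouped)


# Stand-in objects for illustration only (not produced by a model)
sample = [
    SimpleNamespace(extraction_class="character", extraction_text="Lady Juliet",
                    attributes={"emotional_state": "longing"}),
    SimpleNamespace(extraction_class="emotion", extraction_text="aching",
                    attributes=None),
    SimpleNamespace(extraction_class="character", extraction_text="Romeo",
                    attributes={}),
]

for label, items in group_by_class(sample).items():
    print(label, [text for text, _ in items])
```

With a real run, the call would be `group_by_class(result.extractions)`, assuming the object returned by `lx.extract` exposes an `extractions` list.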

Real-World Applications

Healthcare: Extracts medications, dosages, and timings from clinical documents, improving clarity and interoperability.
Finance & Law: Automatically identifies relevant clauses or risks from dense legal and financial texts with traceability.
Research: Facilitates large-scale extraction from scientific literature.

The LangExtract project also showcases RadExtract, a demonstration tool that structures radiology reports while linking each extracted item precisely to its source text.

Advantages Over Traditional Methods

| Feature | Traditional Approaches | LangExtract Approach |
|------------------------|--------------------------------|-------------------------------------------------------|
| Schema Consistency | Manual and error-prone | Enforced by instructions and few-shot examples |
| Result Traceability | Minimal | All outputs linked to source text |
| Scaling to Long Texts | Windowed, lossy | Chunked and parallel extraction with aggregation |
| Visualization | Custom or absent | Built-in interactive HTML reports |
| Deployment | Rigid, model-specific | Gemini-first, open to other LLMs and on-premises |
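
The "chunked and parallel extraction with aggregation" row can be sketched in a few lines. This is an illustration under stated assumptions, not the library's algorithm: `chunk_text`, `fake_extract`, and `extract_long` are hypothetical names, and `fake_extract` is a regex stand-in for the LLM call so the example runs offline.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_chars):
    """Split text into (start_offset, chunk) pairs, cutting at whitespace."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so a word is not split across chunks
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def fake_extract(chunk):
    """Stand-in for a model call: return (text, start, end) for capitalized words."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\b[A-Z][a-z]+\b", chunk)]


def extract_long(text, max_chars=25, max_workers=4):
    """Extract from each chunk in parallel, then rebase offsets to the full text."""
    chunks = chunk_text(text, max_chars)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves chunk order, so results line up with chunks
        per_chunk = list(pool.map(lambda c: fake_extract(c[1]), chunks))
    merged = []
    for (offset, _), found in zip(chunks, per_chunk):
        merged.extend((word, offset + s, offset + e) for word, s, e in found)
    return merged


text = "Romeo loves Juliet. Juliet loves Romeo. Tybalt hates Romeo."
results = extract_long(text)
```

Rebasing each chunk-local offset by the chunk's start position keeps every aggregated result aligned with the original document, which is what makes traceability survive chunking. A production chunker would also need to handle entities that straddle a chunk boundary, which this sketch does not attempt.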

LangExtract marks a significant advancement in automated, reliable extraction of structured data from unstructured text sources.