
Google AI Launches LangExtract: Python Library for Traceable Data Extraction from Unstructured Text

Google AI introduces LangExtract, a powerful open-source Python library that extracts structured and traceable data from unstructured text using LLMs like Gemini.

Unlocking Insights from Unstructured Text

In a world flooded with data, crucial information often resides within unstructured text such as clinical notes, legal contracts, or customer feedback. Extracting structured, traceable data from these documents has remained a technical challenge. Google AI’s new open-source Python library, LangExtract, tackles this problem head-on by leveraging large language models (LLMs) like Gemini to deliver accurate and transparent extraction.

Key Features of LangExtract

  • Declarative and Traceable Extraction: Users define extraction tasks using natural language prompts and few-shot examples, specifying the entities, relationships, or facts to extract. Each extracted data point is linked back to its original location in the source text, so results can be validated and audited.

  • Versatile Domain Application: LangExtract supports diverse fields including healthcare (extracting medications and dosages), finance (summarizing risk documents), law (contract analysis), research, and even the arts (analyzing Shakespearean literature).

  • Schema Enforcement with LLMs: Powered by Gemini and compatible with other LLMs, LangExtract enforces custom output schemas such as JSON. Anchoring outputs to the instructions, the few-shot examples, and the source text reduces hallucination and schema drift.

  • Scalability and Visualization: The library efficiently processes lengthy documents by chunking and parallelizing, aggregates results, and offers interactive HTML reports that highlight extracted entities within the original text for easy review.

  • Seamless Integration: LangExtract runs in Google Colab and Jupyter notebooks, and its visualizations can be saved as standalone HTML files, enabling rapid development and iteration.
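To make the traceability idea concrete, here is a minimal sketch of source grounding in plain Python. It is not LangExtract's implementation; the `GroundedSpan` type and the `ground` helper are hypothetical names used only for illustration. The sketch assumes extractions are exact substrings of the source, as the example prompt later in this article requests, and records the character offsets of each one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedSpan:
    """An extracted string plus its character offsets in the source (hypothetical type)."""
    text: str
    start: Optional[int]  # None when the string does not occur verbatim in the source
    end: Optional[int]


def ground(source, extracted):
    """Locate each extracted string in the source, scanning left to right."""
    spans = []
    cursor = 0
    for item in extracted:
        pos = source.find(item, cursor)
        if pos == -1:
            pos = source.find(item)  # fall back to searching the whole text
        if pos == -1:
            # No verbatim match: likely a paraphrase or an invented value.
            spans.append(GroundedSpan(item, None, None))
            continue
        spans.append(GroundedSpan(item, pos, pos + len(item)))
        cursor = pos + len(item)
    return spans


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
spans = ground(source, ["Lady Juliet", "aching", "Romeo", "Mercutio"])
for span in spans:
    print(span)
```

A span with no offsets is the useful signal here: a value that cannot be located in the source is one a reviewer should check first.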

Installation

Install LangExtract easily using pip:

pip install langextract

Example Workflow: Extracting Character Information from Shakespeare

import langextract as lx
import textwrap
 
# 1. Define your prompt
prompt = textwrap.dedent("""
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context.
""")
 
# 2. Provide a high-quality example
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
            lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"}),
        ],
    )
]
 
# 3. Extract from new text
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
 
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro"
)
 
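# 4. Save and visualize results
# output_dir is set explicitly so the JSONL file lands where visualize() looks for it.
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    # In Jupyter or Colab the result may be a display object that holds the HTML in .data.
    f.write(html_content.data if hasattr(html_content, "data") else html_content)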

This workflow writes the extractions to a JSONL file, with each extraction linked to its position in the source text, and generates an interactive HTML visualization for auditing.
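Because the saved results are line-delimited JSON, they can also be audited with the standard library alone. The sketch below assumes a record layout with `text`, `extractions`, `extraction_class`, `extraction_text`, and a `char_interval` holding `start_pos` and `end_pos`. These field names are assumptions and should be checked against an actual output file. The sample record is hand-written for illustration, not real model output.

```python
import json
from collections import defaultdict

# One record in the ASSUMED saved layout (hypothetical sample, not real model output).
sample_line = json.dumps({
    "text": "Lady Juliet gazed longingly at the stars, her heart aching for Romeo",
    "extractions": [
        {"extraction_class": "character", "extraction_text": "Lady Juliet",
         "char_interval": {"start_pos": 0, "end_pos": 11}},
        {"extraction_class": "emotion", "extraction_text": "aching",
         "char_interval": {"start_pos": 52, "end_pos": 58}},
        {"extraction_class": "character", "extraction_text": "Romeo",
         "char_interval": {"start_pos": 63, "end_pos": 68}},
    ],
})


def audit(jsonl_lines):
    """Group extractions by class and collect spans whose offsets do not match the text."""
    by_class = defaultdict(list)
    mismatches = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        text = record["text"]
        for ext in record.get("extractions", []):
            interval = ext.get("char_interval") or {}
            start, end = interval.get("start_pos"), interval.get("end_pos")
            by_class[ext["extraction_class"]].append(ext["extraction_text"])
            if start is None or end is None or text[start:end] != ext["extraction_text"]:
                mismatches.append(ext["extraction_text"])
    return dict(by_class), mismatches


by_class, mismatches = audit([sample_line])
print(by_class)
print(mismatches)
```

To run the same check on a real file, pass an open file handle instead of the sample list, for example `audit(open("extraction_results.jsonl", encoding="utf-8"))`. A strict equality check may flag spans that the library aligned approximately, so treat mismatches as items to review rather than as errors.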

Real-World Applications

  • Healthcare: Extracts medications, dosages, and timings from clinical documents, improving clarity and interoperability.
  • Finance & Law: Automatically identifies relevant clauses or risks from dense legal and financial texts with traceability.
  • Research: Facilitates large-scale extraction from scientific literature.

A companion demonstration tool, RadExtract, shows LangExtract structuring radiology reports with precise source linking.

Advantages Over Traditional Methods

| Feature | Traditional Approaches | LangExtract Approach |
|---|---|---|
| Schema Consistency | Manual and error-prone | Enforced by instructions and few-shot examples |
| Result Traceability | Minimal | All outputs linked to source text |
| Scaling to Long Texts | Windowed, lossy | Chunked and parallel extraction with aggregation |
| Visualization | Custom or absent | Built-in interactive HTML reports |
| Deployment | Rigid, model-specific | Gemini-first, open to other LLMs and on-premises |
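The "chunked and parallel extraction with aggregation" row can be sketched in a few lines of standard-library Python. This is an illustration of the general pattern, not LangExtract's internals: `chunk_text`, `extract_chunk`, and `extract_long` are hypothetical helpers, and a simple dosage regex stands in for the LLM call. The key detail is that each chunk carries its starting offset, so spans found inside a chunk can be mapped back to positions in the full document.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a model call: matches dosages such as "500 mg".
DOSE = re.compile(r"\b\d+\s?mg\b")


def chunk_text(text, max_chars):
    """Split text into (offset, chunk) pairs, breaking at whitespace where possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            split = text.rfind(" ", start, end)
            if split > start:
                end = split + 1  # keep the space with the left-hand chunk
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def extract_chunk(offset_and_chunk):
    """Extract spans from one chunk and shift offsets into document coordinates."""
    offset, chunk = offset_and_chunk
    return [(m.group(), offset + m.start(), offset + m.end())
            for m in DOSE.finditer(chunk)]


def extract_long(text, max_chars=40, workers=4):
    """Process chunks in parallel, then flatten the results in document order."""
    chunks = chunk_text(text, max_chars)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(extract_chunk, chunks))  # map preserves input order
    return [hit for hits in per_chunk for hit in hits]


note = ("Start amoxicillin 500 mg three times daily. "
        "Add ibuprofen 200 mg as needed. Stop aspirin 81 mg.")
hits = extract_long(note)
for text, start, end in hits:
    print(text, start, end)
```

This naive splitter can cut a span in two when a boundary falls inside it. A production pipeline would typically split on sentence boundaries or overlap adjacent chunks to avoid that.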

LangExtract marks a significant advancement in automated, reliable extraction of structured data from unstructured text sources.
