Google AI Launches LangExtract: Python Library for Traceable Data Extraction from Unstructured Text
Google AI introduces LangExtract, a powerful open-source Python library that extracts structured and traceable data from unstructured text using LLMs like Gemini.
Unlocking Insights from Unstructured Text
In a world flooded with data, crucial information often resides within unstructured text such as clinical notes, legal contracts, or customer feedback. Extracting structured, traceable data from these documents has remained a technical challenge. Google AI’s new open-source Python library, LangExtract, tackles this problem head-on by leveraging large language models (LLMs) like Gemini to deliver accurate and transparent extraction.
Key Features of LangExtract
- Declarative and Traceable Extraction: Users define extraction tasks using natural language prompts and few-shot examples, specifying the entities, relationships, or facts to extract. Each extracted data point is linked back to its exact location in the source text, so results can be validated and audited.
- Versatile Domain Application: LangExtract supports diverse fields including healthcare (extracting medications and dosages), finance (summarizing risk documents), law (contract analysis), research, and even the arts (analyzing Shakespearean literature).
- Schema Enforcement with LLMs: Powered by Gemini and compatible with other LLMs, LangExtract enforces custom output schemas (for example, structured JSON), reducing hallucination and schema drift by anchoring outputs to the instructions and the source text.
- Scalability and Visualization: The library processes lengthy documents by chunking them, running extraction on the chunks in parallel, and aggregating the results. It also generates interactive HTML reports that highlight extracted entities within the original text for easy review.
- Seamless Integration: LangExtract runs in Google Colab and Jupyter notebooks, and its visualizations can be saved as standalone HTML files, enabling rapid development and iteration.
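The traceability feature can be pictured with a small standalone sketch. This is not LangExtract's implementation; it only illustrates the idea of tying an extracted string to character offsets in the source. The `GroundedSpan` type and the `ground` helper are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedSpan:
    """An extracted string plus where it sits in the source (hypothetical type)."""
    text: str
    start: Optional[int]  # inclusive character offset, None if not found
    end: Optional[int]    # exclusive character offset, None if not found


def ground(source: str, extracted: str, search_from: int = 0) -> GroundedSpan:
    """Locate `extracted` verbatim in `source`, starting at `search_from`."""
    start = source.find(extracted, search_from)
    if start == -1:
        # No verbatim match: flag it for review instead of guessing a location.
        return GroundedSpan(extracted, None, None)
    return GroundedSpan(extracted, start, start + len(extracted))


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
span = ground(source, "Juliet")
missing = ground(source, "Mercutio")
print(span)
print(missing)
```

An extraction that cannot be located verbatim comes back with empty offsets, which is the kind of signal a reviewer can use to spot paraphrased or invented output.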
Installation
Install LangExtract easily using pip:
pip install langextract
Example Workflow: Extracting Character Information from Shakespeare
import textwrap

import langextract as lx

# 1. Define your prompt
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# 2. Provide a high-quality example
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
            lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"}),
        ],
    )
]

# 3. Extract from new text
# Cloud-hosted Gemini models need an API key, for example via the
# LANGEXTRACT_API_KEY environment variable.
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro",
)

# 4. Save and visualize results
# output_dir="." writes the JSONL file to the current directory, so the
# relative path passed to lx.visualize() below resolves to the same file.
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    # In Jupyter or Colab the visualization may come back as an HTML display
    # object rather than a plain string; its markup is then in `.data`.
    f.write(html_content.data if hasattr(html_content, "data") else html_content)

This workflow outputs structured JSON data linked to the source text, along with an interactive HTML visualization for easy auditing.
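To audit the saved results outside the HTML view, the JSONL file can be read back line by line. The sketch below assumes each record carries a `text` field and an `extractions` list whose items hold `extraction_class`, `extraction_text`, and a `char_interval` with `start_pos` and `end_pos`. Those field names are assumptions about the file layout, and the sample record is hand-written for illustration, not real model output. The `iter_grounded` helper is a hypothetical name.

```python
import json


def iter_grounded(jsonl_line: str):
    """Yield (class, text, start, end) for each extraction in one JSONL record."""
    record = json.loads(jsonl_line)
    for item in record.get("extractions", []):
        interval = item.get("char_interval") or {}
        yield (
            item.get("extraction_class"),
            item.get("extraction_text"),
            interval.get("start_pos"),
            interval.get("end_pos"),
        )


# Hand-written sample record standing in for one line of extraction_results.jsonl.
sample_line = json.dumps({
    "text": "Lady Juliet gazed longingly at the stars, her heart aching for Romeo",
    "extractions": [
        {
            "extraction_class": "character",
            "extraction_text": "Lady Juliet",
            "char_interval": {"start_pos": 0, "end_pos": 11},
            "attributes": {"emotional_state": "longing"},
        }
    ],
})

rows = list(iter_grounded(sample_line))
source_text = json.loads(sample_line)["text"]
for cls, text, start, end in rows:
    # Audit check: the recorded offsets should slice out the extracted text.
    print(cls, repr(text), start, end, source_text[start:end] == text)
```

If the field names in your installed version differ, only the keys inside `iter_grounded` need to change.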
Real-World Applications
- Healthcare: Extracts medications, dosages, and timings from clinical documents, improving clarity and interoperability.
- Finance & Law: Automatically identifies relevant clauses or risks from dense legal and financial texts with traceability.
- Research: Facilitates large-scale extraction from scientific literature.
LangExtract also offers RadExtract, a demonstration tool for structuring radiology reports with precise source linking.
Advantages Over Traditional Methods
| Feature | Traditional Approaches | LangExtract Approach |
|------------------------|--------------------------------|-------------------------------------------------------|
| Schema Consistency | Manual and error-prone | Enforced by instructions and few-shot examples |
| Result Traceability | Minimal | All outputs linked to source text |
| Scaling to Long Texts | Windowed, lossy | Chunked and parallel extraction with aggregation |
| Visualization | Custom or absent | Built-in interactive HTML reports |
| Deployment | Rigid, model-specific | Gemini-first, open to other LLMs and on-premises |
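
The "chunked and parallel extraction with aggregation" row can be illustrated with a minimal standalone sketch. It is not LangExtract's internal pipeline: the chunk size, the overlap, and the `find_keyword` stand-in (a plain string search where a real system would call an LLM) are all assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, size: int, overlap: int = 0):
    """Split text into (offset, chunk) pairs of at most `size` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [(i, text[i:i + size]) for i in range(0, len(text), step)]


def find_keyword(chunk: str, keyword: str = "Romeo"):
    """Stand-in extractor: a real pipeline would call an LLM here."""
    hits, pos = [], chunk.find(keyword)
    while pos != -1:
        hits.append((pos, pos + len(keyword)))
        pos = chunk.find(keyword, pos + 1)
    return hits


def extract_long(text: str, size: int = 40, overlap: int = 10, workers: int = 4):
    """Run the extractor on chunks in parallel and merge into document offsets."""
    chunks = chunk_text(text, size, overlap)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(lambda c: find_keyword(c[1]), chunks))
    merged = set()
    for (offset, _), hits in zip(chunks, per_chunk):
        for start, end in hits:
            # Shift chunk-local offsets back into whole-document coordinates.
            merged.add((offset + start, offset + end))
    return sorted(merged)  # the set removes duplicates from overlapping chunks


doc = "Romeo loved Juliet. " * 3 + "Juliet loved Romeo."
spans = extract_long(doc)
print(spans)
```

Overlapping the chunks keeps a match from being lost at a chunk boundary, and translating offsets back to the full document is what lets the aggregated results stay linked to the source text.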
LangExtract marks a significant advancement in automated, reliable extraction of structured data from unstructured text sources.