Mastering Microsoft Presidio: Detect and Anonymize PII in Text with Practical Examples
Explore Microsoft Presidio’s capabilities for detecting and anonymizing PII in text with practical Python examples, including custom recognizers and hash-based anonymization.
Introduction to Microsoft Presidio
Presidio is an open-source framework from Microsoft for detecting and anonymizing personally identifiable information (PII) in free-form text. It uses spaCy as its default NLP engine and has a modular architecture that fits into real-time applications and data pipelines.
Installing Presidio and Dependencies
To begin using Presidio, install the key libraries presidio-analyzer and presidio-anonymizer along with the recommended spaCy model for English:
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg
Restart your environment if necessary, especially when working in Jupyter or Colab.
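After a restart, a quick way to confirm that the installs are visible to the interpreter is to look the modules up by name. The snippet below is a minimal sketch using only the standard library; the helper name `check_installed` is made up for illustration, and it assumes the spaCy model is installed as an importable package named `en_core_web_lg`.

```python
import importlib.util

# Top-level module names expected after the two install commands.
REQUIRED_MODULES = ["presidio_analyzer", "presidio_anonymizer", "en_core_web_lg"]


def check_installed(module_names):
    """Return {module_name: bool} by looking up each module's import spec."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


status = check_installed(REQUIRED_MODULES)
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If any entry prints MISSING, re-run the corresponding install command in the same environment the notebook or script uses.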
Basic PII Detection with Presidio Analyzer
The AnalyzerEngine loads spaCy’s NLP pipeline and built-in recognizers to scan text for sensitive entities like phone numbers. Here is how to detect a U.S. phone number in a sample sentence while suppressing verbose logging:
import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)
from presidio_analyzer import AnalyzerEngine
analyzer = AnalyzerEngine()
results = analyzer.analyze(
    text="My phone number is 212-555-5555",
    entities=["PHONE_NUMBER"],
    language="en",
)
print(results)
Creating Custom Recognizers Using Deny Lists
Presidio supports custom recognizers. For example, you can create a recognizer to detect academic titles such as “Dr.” and “Prof.” using a deny list:
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry
academic_title_recognizer = PatternRecognizer(
    supported_entity="ACADEMIC_TITLE",
    deny_list=["Dr.", "Dr", "Professor", "Prof."],
)
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(academic_title_recognizer)
analyzer = AnalyzerEngine(registry=registry)
text = "Prof. John Smith is meeting with Dr. Alice Brown."
results = analyzer.analyze(text=text, language="en")
for result in results:
    print(result)
Custom recognizers can also be built from regex patterns or external models, as described in the official documentation.
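To build intuition for what a deny list does, the sketch below approximates it in plain Python: the terms are escaped, joined into one alternation, and matched as whole tokens. This is an illustration only, not Presidio's internal implementation; the function name `find_deny_list_terms` and the boundary rule are assumptions made for the example.

```python
import re


def find_deny_list_terms(text, deny_list):
    """Return (term, start, end) for each deny-list term found in text."""
    # Try longer terms first so "Dr." is preferred over "Dr".
    terms = sorted(deny_list, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    # Require that the match is not glued to a word character on either side.
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]


text = "Prof. John Smith is meeting with Dr. Alice Brown."
matches = find_deny_list_terms(text, ["Dr.", "Dr", "Professor", "Prof."])
for term, start, end in matches:
    print(term, start, end)
```

Each tuple carries the same kind of information a recognizer result does: what was found and the character offsets where it sits in the text.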
Anonymizing PII with Presidio Anonymizer
Once PII is detected, you can anonymize it. Here is an example replacing detected person names with a placeholder:
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig
engine = AnonymizerEngine()
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)
print(result)
Other built-in operators include redact, mask, hash, and encrypt; pseudonymization can be implemented with a custom operator.
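The start and end values in the example above are character offsets into the text, with the end offset exclusive. When writing results by hand, it is easy to be off by one, so a small helper can compute them instead. This is a sketch in plain Python; `find_spans` is a hypothetical helper, not part of Presidio.

```python
def find_spans(text, fragments):
    """Locate each fragment in order and return (start, end) offsets."""
    spans = []
    cursor = 0
    for fragment in fragments:
        # str.index raises ValueError if the fragment is absent.
        start = text.index(fragment, cursor)
        end = start + len(fragment)
        spans.append((start, end))
        cursor = end  # continue searching after the previous match
    return spans


text = "My name is Bond, James Bond"
spans = find_spans(text, ["Bond", "James Bond"])
print(spans)  # [(11, 15), (17, 27)]
```

The printed offsets match the hand-written values passed to RecognizerResult in the example.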
Advanced Usage: Custom Entities and Consistent Re-Anonymization
You can define custom PII entities using regex-based PatternRecognizers and implement a custom anonymizer that hashes entities consistently across texts.
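The core idea behind consistent hashing can be seen in isolation before wiring it into Presidio. The following is a minimal standard-library sketch under the same assumptions the article uses later (a truncated SHA-256 digest and a shared dict keyed by entity type); the function name `pseudonymize` is made up for illustration.

```python
import hashlib


def pseudonymize(value, entity_type, mapping):
    """Return a stable token for value, recording it in the shared mapping."""
    known = mapping.setdefault(entity_type, {})
    if value not in known:
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
        # Keep only the first 10 hex characters to get a short token.
        known[value] = "<HASH_" + digest[:10] + ">"
    return known[value]


mapping = {}
first = pseudonymize("ABCDE1234F", "IND_PAN", mapping)
second = pseudonymize("ABCDE1234F", "IND_PAN", mapping)
other = pseudonymize("1234-5678-9123", "AADHAAR", mapping)

print(first == second)  # True
print(sorted(mapping))  # ['AADHAAR', 'IND_PAN']
```

Because SHA-256 is deterministic, the same input would produce the same token even without the dict; the mapping mainly serves as a record of which original value produced which token.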
Defining a Custom Hash-Based Anonymizer
This operator uses SHA-256 hashing and maintains a mapping to ensure consistent anonymization:
import hashlib
from typing import Dict

from presidio_anonymizer.operators import Operator, OperatorType


class ReAnonymizer(Operator):
    """
    Anonymizer that replaces text with a reusable SHA-256 hash,
    stored in a shared mapping dict.
    """

    def operate(self, text: str, params: Dict = None) -> str:
        params = params or {}
        entity_type = params.get("entity_type", "DEFAULT")
        mapping = params.get("entity_mapping")
        if mapping is None:
            raise ValueError("Missing `entity_mapping` in params")
        # Reuse the stored hash if this value was seen before
        if entity_type in mapping and text in mapping[entity_type]:
            return mapping[entity_type][text]
        # Otherwise build a short token from the SHA-256 digest and store it
        hashed = "<HASH_" + hashlib.sha256(text.encode()).hexdigest()[:10] + ">"
        mapping.setdefault(entity_type, {})[text] = hashed
        return hashed

    def validate(self, params: Dict = None) -> None:
        if not params or "entity_mapping" not in params:
            raise ValueError("You must pass an 'entity_mapping' dictionary.")

    def operator_name(self) -> str:
        return "reanonymizer"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize
Custom PII Recognizers for PAN and Aadhaar Numbers
Here are regex-based recognizers for Indian PAN and Aadhaar numbers:
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

pan_recognizer = PatternRecognizer(
    supported_entity="IND_PAN",
    name="PAN Recognizer",
    patterns=[Pattern(name="pan", regex=r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", score=0.8)],
    supported_language="en",
)
aadhaar_recognizer = PatternRecognizer(
    supported_entity="AADHAAR",
    name="Aadhaar Recognizer",
    patterns=[Pattern(name="aadhaar", regex=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b", score=0.8)],
    supported_language="en",
)

# Register the custom recognizers alongside the predefined ones
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(pan_recognizer)
analyzer.registry.add_recognizer(aadhaar_recognizer)

# Register the custom operator and create the mapping shared across calls
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(ReAnonymizer)
entity_mapping = {}
Analyzing and Anonymizing Multiple Texts Consistently
from pprint import pprint
text1 = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."
results1 = analyzer.analyze(text=text1, language="en")
anon1 = anonymizer.anonymize(
    text1,
    results1,
    {"DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})},
)
results2 = analyzer.analyze(text=text2, language="en")
anon2 = anonymizer.anonymize(
    text2,
    results2,
    {"DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})},
)
print("Original 1:", text1)
print("Anonymized 1:", anon1.text)
print("Original 2:", text2)
print("Anonymized 2:", anon2.text)
print("\nMapping used:")
pprint(entity_mapping)
This approach keeps pseudonymization consistent across multiple documents: the same identifier receives the same token wherever it appears, which matters for both data analysis and privacy compliance.
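To see why both texts end up with the same tokens, the two regexes from the recognizers can be run on the sample texts with plain `re`. This sketch only exercises the patterns; it does not call Presidio, whose predefined recognizers may report additional entities on the same text. The helper name `extract_ids` is made up for illustration.

```python
import re

# Same patterns as in the PAN and Aadhaar recognizers.
PAN_PATTERN = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
AADHAAR_PATTERN = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b")

text1 = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."


def extract_ids(text):
    """Return the raw strings each pattern matches, keyed by entity name."""
    return {
        "IND_PAN": PAN_PATTERN.findall(text),
        "AADHAAR": AADHAAR_PATTERN.findall(text),
    }


ids1 = extract_ids(text1)
ids2 = extract_ids(text2)
print(ids1)
print(ids1 == ids2)  # True
```

Both texts yield identical strings for each entity type, so the second anonymization pass finds those strings already present in the shared mapping and reuses their tokens.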
Presidio offers a highly flexible and powerful toolkit for PII detection and anonymization, enabling developers to secure sensitive data with customizable workflows.