
Mastering Microsoft Presidio: Detect and Anonymize PII in Text with Practical Examples

Explore Microsoft Presidio’s capabilities for detecting and anonymizing PII in text with practical Python examples, including custom recognizers and hash-based anonymization.

Introduction to Microsoft Presidio

Microsoft Presidio is an open-source framework for detecting and anonymizing personally identifiable information (PII) in free-form text. It uses spaCy as its default NLP engine and is split into two modular packages, an analyzer and an anonymizer, which makes it practical to integrate into applications and data pipelines.

Installing Presidio and Dependencies

To begin using Presidio, install the key libraries presidio-analyzer and presidio-anonymizer along with the recommended spaCy model for English:

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

Restart your environment if necessary, especially when working in Jupyter or Colab.

Basic PII Detection with Presidio Analyzer

The AnalyzerEngine loads spaCy’s NLP pipeline and built-in recognizers to scan text for sensitive entities like phone numbers. Here is how to detect a U.S. phone number in a sample sentence while suppressing verbose logging:

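import logging
# Only show errors from Presidio's analyzer logger
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)

from presidio_analyzer import AnalyzerEngine

# Loads the default NLP pipeline and the predefined recognizers
analyzer = AnalyzerEngine()

# Restrict the scan to phone numbers in English text
results = analyzer.analyze(
    text="My phone number is 212-555-5555",
    entities=["PHONE_NUMBER"],
    language="en",
)
print(results)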

Creating Custom Recognizers Using Deny Lists

Presidio supports custom recognizers. The simplest kind is a deny list: a fixed set of strings that should always be flagged. For example, you can create a recognizer that detects academic titles such as “Dr.” and “Prof.”:

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry
 
academic_title_recognizer = PatternRecognizer(
    supported_entity="ACADEMIC_TITLE",
    deny_list=["Dr.", "Dr", "Professor", "Prof."]
)
 
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(academic_title_recognizer)
 
analyzer = AnalyzerEngine(registry=registry)
 
text = "Prof. John Smith is meeting with Dr. Alice Brown."
results = analyzer.analyze(text=text, language="en")
 
for result in results:
    print(result)

Custom recognizers can also be built from regular expressions or external models, as described in the official documentation.
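
To make the deny-list idea concrete, here is a minimal plain-Python sketch of one way a list of terms can be turned into a single regular expression and matched with character offsets. It is an illustration only, not Presidio's internal implementation, and the helper names build_deny_list_regex and find_terms are hypothetical.

```python
import re
from typing import List, Tuple


def build_deny_list_regex(terms: List[str]) -> re.Pattern:
    """Compile deny-list terms into one alternation (hypothetical helper)."""
    # Try longer terms first so "Dr." is preferred over its prefix "Dr".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Lookarounds keep a term from matching inside a longer word.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")


def find_terms(text: str, terms: List[str]) -> List[Tuple[str, int, int]]:
    """Return (matched_text, start, end) for every deny-list hit."""
    pattern = build_deny_list_regex(terms)
    return [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]


hits = find_terms(
    "Prof. John Smith is meeting with Dr. Alice Brown.",
    ["Dr.", "Dr", "Professor", "Prof."],
)
print(hits)  # [('Prof.', 0, 5), ('Dr.', 33, 36)]
```

Listing both "Dr." and "Dr" is harmless in this sketch because the longer term is tried first, so each title produces a single hit.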

Anonymizing PII with Presidio Anonymizer

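Once PII is detected, you can anonymize it. The example below replaces person names with a placeholder. The RecognizerResult objects are written by hand here to keep the example short; in practice you would pass the results returned by the analyzer: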

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig
 
engine = AnonymizerEngine()
 
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)
 
print(result)

Other built-in anonymization operators include redact, mask, hash, and encrypt.
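
The replace step above can be pictured as plain string slicing. The sketch below is not Presidio's code; replace_spans is a hypothetical helper that applies the same (start, end) offsets from right to left, so that earlier offsets stay valid while the text changes length.

```python
from typing import List, Tuple


def replace_spans(text: str, spans: List[Tuple[int, int]], new_value: str) -> str:
    """Replace each (start, end) span with new_value (hypothetical helper)."""
    # Work from the last span to the first: replacing a later span never
    # shifts the offsets of spans that come before it.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + new_value + text[end:]
    return text


masked = replace_spans("My name is Bond, James Bond", [(11, 15), (17, 27)], "BIP")
print(masked)  # My name is BIP, BIP
```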

Advanced Usage: Custom Entities and Consistent Re-Anonymization

You can define custom PII entities using regex-based PatternRecognizers and implement a custom anonymizer that hashes entities consistently across texts.

Defining a Custom Hash-Based Anonymizer

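This operator replaces each value with a token built from the first 10 hex characters of its SHA-256 digest and caches the result in a shared mapping, so the same value always receives the same token: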

from presidio_anonymizer.operators import Operator, OperatorType
import hashlib
from typing import Dict
 
class ReAnonymizer(Operator):
    """
    Anonymizer that replaces text with a reusable SHA-256 hash,
    stored in a shared mapping dict.
    """
 
    def operate(self, text: str, params: Dict = None) -> str:
        entity_type = params.get("entity_type", "DEFAULT")
        mapping = params.get("entity_mapping")
 
        if mapping is None:
            raise ValueError("Missing `entity_mapping` in params")
 
        if entity_type in mapping and text in mapping[entity_type]:
            return mapping[entity_type][text]
 
        hashed = "<HASH_" + hashlib.sha256(text.encode()).hexdigest()[:10] + ">"
        mapping.setdefault(entity_type, {})[text] = hashed
        return hashed
 
    def validate(self, params: Dict = None) -> None:
        if "entity_mapping" not in params:
            raise ValueError("You must pass an 'entity_mapping' dictionary.")
 
    def operator_name(self) -> str:
        return "reanonymizer"
 
    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize
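
The lookup-then-hash logic in operate can be exercised without Presidio. The following sketch copies that logic into a standalone function (the name consistent_token is hypothetical) to show that a repeated value reuses its stored token.

```python
import hashlib
from typing import Dict


def consistent_token(text: str, entity_type: str, mapping: Dict[str, Dict[str, str]]) -> str:
    """Standalone copy of the lookup-then-hash logic (hypothetical helper)."""
    by_type = mapping.setdefault(entity_type, {})
    if text not in by_type:
        # First sighting: derive a short token from the SHA-256 digest.
        digest = hashlib.sha256(text.encode()).hexdigest()[:10]
        by_type[text] = f"<HASH_{digest}>"
    return by_type[text]


shared: Dict[str, Dict[str, str]] = {}
first = consistent_token("ABCDE1234F", "IND_PAN", shared)
second = consistent_token("ABCDE1234F", "IND_PAN", shared)
consistent_token("1234-5678-9123", "AADHAAR", shared)

print(first == second)  # True: the cached token is reused
print(len(first))       # 17: "<HASH_" + 10 hex characters + ">"
print(sorted(shared))   # ['AADHAAR', 'IND_PAN']
```

Truncating the digest to 10 hex characters keeps 40 bits of the hash. Tokens stay short, but collisions become possible on very large datasets; keeping more characters is a simple adjustment.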

Custom PII Recognizers for PAN and Aadhaar Numbers

Here are regex-based recognizers for Indian PAN and Aadhaar numbers:

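from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig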
 
pan_recognizer = PatternRecognizer(
    supported_entity="IND_PAN",
    name="PAN Recognizer",
    patterns=[Pattern(name="pan", regex=r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", score=0.8)],
    supported_language="en"
)
 
aadhaar_recognizer = PatternRecognizer(
    supported_entity="AADHAAR",
    name="Aadhaar Recognizer",
    patterns=[Pattern(name="aadhaar", regex=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b", score=0.8)],
    supported_language="en"
)
 
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(pan_recognizer)
analyzer.registry.add_recognizer(aadhaar_recognizer)
 
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(ReAnonymizer)
 
entity_mapping = {}
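
Before registering patterns like these, it can help to try them with Python's re module. The sketch below runs the same two expressions on made-up sample strings using plain re with default flags, which may differ from the flags Presidio applies. It also shows one limitation: the Aadhaar pattern accepts any 12 digits in 4-4-4 groups, including the first 12 digits of a longer, space-separated number.

```python
import re

PAN_REGEX = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
AADHAAR_REGEX = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b")

sample = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
pan_hits = PAN_REGEX.findall(sample)
aadhaar_hits = AADHAAR_REGEX.findall(sample)
print(pan_hits)      # ['ABCDE1234F']
print(aadhaar_hits)  # ['1234-5678-9123']

# A 16-digit, space-separated number (made-up value) still yields a
# 12-digit match, so this pattern alone can produce false positives.
card_hits = AADHAAR_REGEX.findall("Card: 1234 5678 9123 4567")
print(card_hits)     # ['1234 5678 9123']
```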

Analyzing and Anonymizing Multiple Texts Consistently

from pprint import pprint
 
text1 = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."
 
results1 = analyzer.analyze(text=text1, language="en")
anon1 = anonymizer.anonymize(
    text1,
    results1,
    {"DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})}
)
 
results2 = analyzer.analyze(text=text2, language="en")
anon2 = anonymizer.anonymize(
    text2,
    results2,
    {"DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})}
)
 
print("Original 1:", text1)
print("Anonymized 1:", anon1.text)
print("Original 2:", text2)
print("Anonymized 2:", anon2.text)

print("\nMapping used:")
pprint(entity_mapping)

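Because both calls share the same entity_mapping, a given PAN or Aadhaar number receives the same token in both texts. Records that refer to the same person can therefore still be linked or counted after anonymization, without exposing the original values.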
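
Since entity_mapping stores each original value next to its token, it can also be inverted to restore the originals where that is permitted. The sketch below is a plain-Python illustration; restore is a hypothetical helper, and the tokens in the sample mapping are invented placeholders rather than real hash output.

```python
from typing import Dict


def restore(text: str, mapping: Dict[str, Dict[str, str]]) -> str:
    """Swap tokens back to their original values (hypothetical helper)."""
    for originals in mapping.values():
        for original, token in originals.items():
            text = text.replace(token, original)
    return text


# Same shape as entity_mapping: {entity_type: {original: token}}.
sample_mapping = {
    "IND_PAN": {"ABCDE1234F": "<HASH_aaaaaaaaaa>"},
    "AADHAAR": {"1234-5678-9123": "<HASH_bbbbbbbbbb>"},
}

anonymized = "His Aadhaar is <HASH_bbbbbbbbbb> and PAN is <HASH_aaaaaaaaaa>."
restored = restore(anonymized, sample_mapping)
print(restored)
# His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F.
```

This also means the mapping is as sensitive as the original data and should be stored with the same care.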

Presidio's analyzer, custom recognizers, and pluggable anonymization operators can be combined into workflows tailored to the sensitive data you need to protect.
