Azure-Language-Document-PII-redaction
Version: 1
Azure Language
Azure Language adds advanced natural language processing to your apps using task-optimized AI models. It helps you extract key information from text, transcripts, and files as well as detect language to build multilingual, conversational experiences, all with enterprise-grade security and flexible customization.
Key capabilities
About this model
The Document PII Redaction model in Azure Language automatically detects and masks sensitive information in native documents (PDF, Word, and plain text files), ensuring privacy and compliance. It is designed for batch processing, making it ideal for enterprise workflows that require secure handling of personal data in documents without requiring text preprocessing.
Key model capabilities
- Native Document Support: Processes PDF (.pdf), Microsoft Word (.docx), and plain text (.txt) files directly, eliminating the need for text preprocessing.
- Comprehensive PII Detection: Identifies a wide range of sensitive entities like names, addresses, IDs, and financial data.
- Automatic Redaction: Replaces detected PII with placeholders or entity type masks to prevent exposure in downstream systems.
- Multilingual Support: Detects PII across multiple languages for global applications.
- Seamless Integration: Works with REST APIs, SDKs, and Azure AI Foundry Tools for easy deployment and scaling.
Use cases
See Responsible Use of AI for additional considerations for responsible use.
Key use cases
- Generative AI Preprocessing: Remove PII before sending documents to LLMs for summarization or content generation.
- Customer Support Logs: Redact sensitive details in document archives for compliance.
- Healthcare & Finance: Protect patient or client data in PDF reports and Word documents.
- Data Sharing & Analytics: Anonymize document datasets for safe sharing and analysis.
Out of scope use cases
The model is not intended for:
- Detecting non-textual PII (for example, video or audio inputs).
- Guaranteeing compliance without human oversight.
- Any use that violates Microsoft's Responsible Use of AI.
Pricing
Pricing is based on the number of text records processed and the selected tier. See the Azure pricing page for more details.
Technical specs
PII Redaction for Documents is a cloud-based service that uses advanced machine learning and Named Entity Recognition (NER) models to identify and redact sensitive information in native file formats. It accepts documents via Azure Blob Storage and returns redacted output files to a target container. The model supports multiple entity categories (e.g., financial, medical, personal identifiers) and works across a wide range of languages. It integrates with other Azure Language services and Generative AI workflows so that sensitive data is protected before downstream processing.
Input formats
The Document PII Redaction model accepts native document formats uploaded to Azure Blob Storage. Supported formats are:
- PDF (.pdf), including scanned PDFs
- Microsoft Word (.docx)
- Plain text (.txt)
Input guidelines
- Total number of documents per request: ≤ 40
- Total content size per request: ≤ 10 MB
- Digital images with embedded text and tables in scanned documents are not supported.
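The limits above can be checked before a request is submitted. The following is a minimal sketch, not part of the service or its SDKs: `validate_batch` is a hypothetical helper, and it assumes that "10 MB" means 10 × 1,048,576 bytes and that each document is described by its blob name and size in bytes.

```python
from pathlib import PurePosixPath

# Limits and formats taken from the input guidelines above.
MAX_DOCUMENTS = 40
MAX_TOTAL_BYTES = 10 * 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def validate_batch(documents):
    """Return a list of problems for a batch of (blob_name, size_in_bytes) pairs.

    An empty list means the batch fits the documented limits.
    """
    problems = []
    if len(documents) > MAX_DOCUMENTS:
        problems.append(
            f"{len(documents)} documents exceeds the limit of {MAX_DOCUMENTS}"
        )
    total = sum(size for _, size in documents)
    if total > MAX_TOTAL_BYTES:
        problems.append(f"total size {total} bytes exceeds {MAX_TOTAL_BYTES} bytes")
    for name, _ in documents:
        extension = PurePosixPath(name).suffix.lower()
        if extension not in SUPPORTED_EXTENSIONS:
            problems.append(f"{name}: unsupported format {extension or '(none)'}")
    return problems
```

A check like this cannot tell whether a PDF is text-selectable or image-based, so it complements rather than replaces a review of the documents themselves.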
Supported languages
The feature supports multiple languages for PII detection and redaction. Each detected entity is returned with its type and confidence score, alongside the redacted output document. See the full list of supported languages linked here.
Supported Azure regions
See the full list of supported Azure regions for Azure Language linked here.
Sample JSON request
{
  "displayName": "Document PII Redaction example",
  "analysisInput": {
    "documents": [
      {
        "language": "en-US",
        "id": "Output-1",
        "source": {
          "location": "{your-source-blob-with-SAS-URL}"
        },
        "target": {
          "location": "{your-target-container-with-SAS-URL}"
        }
      }
    ]
  },
  "tasks": [
    {
      "kind": "PiiEntityRecognition",
      "taskName": "Redact PII Task 1",
      "parameters": {
        "redactionPolicy": {
          "policyKind": "entityMask"
        },
        "piiCategories": [
          "Person",
          "Organization"
        ],
        "excludeExtractionData": false
      }
    }
  ]
}
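When requests are generated programmatically, the body above can be assembled from a few inputs. This is an illustrative sketch only: `build_redaction_request` is a hypothetical helper, the URLs are placeholders, and the field names are copied from the sample request rather than from a schema.

```python
import json


def build_redaction_request(source_url, target_url,
                            categories=("Person", "Organization"),
                            language="en-US",
                            exclude_extraction_data=False,
                            document_id="Output-1"):
    """Build a request body with the same shape as the sample request above."""
    return {
        "displayName": "Document PII Redaction example",
        "analysisInput": {
            "documents": [
                {
                    "language": language,
                    "id": document_id,
                    "source": {"location": source_url},
                    "target": {"location": target_url},
                }
            ]
        },
        "tasks": [
            {
                "kind": "PiiEntityRecognition",
                "taskName": "Redact PII Task 1",
                "parameters": {
                    "redactionPolicy": {"policyKind": "entityMask"},
                    "piiCategories": list(categories),
                    "excludeExtractionData": exclude_extraction_data,
                },
            }
        ],
    }


# Placeholder SAS URLs; substitute your own source blob and target container.
body = build_redaction_request(
    "https://example.invalid/source/report.pdf?sas-token",
    "https://example.invalid/target?sas-token",
)
print(json.dumps(body, indent=2))
```

Keeping the category list and the `excludeExtractionData` flag as parameters makes it easy to vary them per workflow without editing the JSON by hand.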
Sample response
Upon successful submission, the API returns a 202 Accepted response with an operation-location header containing a job ID. Poll the job endpoint to retrieve results. On completion, redacted documents are written to your target Azure Blob Storage container.
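The submit-then-poll flow described above might look like the following sketch, which uses only the Python standard library. Several details are assumptions not stated in this document and should be checked against the REST reference: the `/language/analyze-documents/jobs` path, the `Ocp-Apim-Subscription-Key` header, the use of the operation-location value as the polling URL, and the terminal status names. The function names are hypothetical.

```python
import json
import time
import urllib.request

# Assumed terminal status values; confirm against the REST reference.
TERMINAL_STATES = {"succeeded", "failed", "cancelled", "partiallyCompleted"}


def submit_job(endpoint, key, body, api_version):
    """POST the job and return the value of the operation-location header."""
    # Assumed path and auth header; neither is given in this document.
    url = f"{endpoint.rstrip('/')}/language/analyze-documents/jobs?api-version={api_version}"
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": key,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        if response.status != 202:
            raise RuntimeError(f"expected 202 Accepted, got {response.status}")
        return response.headers["operation-location"]


def get_job(job_url, key):
    """GET the job document from the polling URL and parse it as JSON."""
    request = urllib.request.Request(
        job_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def poll_job(fetch, interval_seconds=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch() until the job reports a terminal status or attempts run out."""
    for _ in range(max_attempts):
        job = fetch()
        if job.get("status") in TERMINAL_STATES:
            return job
        sleep(interval_seconds)
    raise TimeoutError("job did not finish within the polling budget")
```

A caller would typically chain these as `job_url = submit_job(...)` followed by `poll_job(lambda: get_job(job_url, key))`. Passing `fetch` and `sleep` as parameters keeps the polling loop testable without network access.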
Model architecture
Transformer-based multilingual NER architecture optimized for entity detection and redaction, adapted for native document formats with layout-aware preprocessing for PDF and Word files.
Optimizing model performance
Efficiency
- Batch Processing: Submit up to 40 documents per request to reduce API call overhead.
- Selective Redaction: Use entity category filters (e.g., only redact financial or personal identifiers) to minimize unnecessary processing.
- Blob Storage Organization: Use dedicated source and target containers to simplify management of high-volume document workflows.
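Batching a large document set so that each request stays within the documented limits can be sketched as a simple greedy grouping. `plan_batches` is a hypothetical helper, and it assumes "10 MB" means 10 × 1,048,576 bytes; it takes (name, size in bytes) pairs and preserves their order.

```python
def plan_batches(documents, max_documents=40, max_bytes=10 * 1024 * 1024):
    """Greedily group (name, size_in_bytes) pairs into batches within both limits."""
    batches, current, current_bytes = [], [], 0
    for name, size in documents:
        if size > max_bytes:
            raise ValueError(f"{name} alone exceeds the per-request size limit")
        # Start a new batch when adding this document would break either limit.
        batch_full = len(current) == max_documents
        too_large = current_bytes + size > max_bytes
        if current and (batch_full or too_large):
            batches.append(current)
            current, current_bytes = [], 0
        current.append((name, size))
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Greedy grouping is not guaranteed to produce the fewest possible batches when sizes vary widely, but it is predictable and keeps documents in their original order.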
Accuracy
- Full Document Context: Submit complete documents rather than fragmented excerpts to ensure the model has sufficient context for accurate detection.
- Pre-cleaning Documents: Remove password protection and ensure documents are text-selectable (not purely image-based scans) for best detection quality.
- Confidence Thresholding: Apply thresholds to handle ambiguous detections—e.g., flag low-confidence entities for human review.
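Confidence thresholding on the returned entity data could be sketched as follows. The field names `category` and `confidenceScore`, the default threshold of 0.8, and the `route_entities` helper are illustrative assumptions; inspect the actual extraction JSON and tune the values on your own data.

```python
def route_entities(entities, default_threshold=0.8, per_category=None):
    """Split detections into (accepted, needs_review) by confidence score.

    per_category optionally maps an entity category to its own threshold.
    """
    per_category = per_category or {}
    accepted, needs_review = [], []
    for entity in entities:
        threshold = per_category.get(entity["category"], default_threshold)
        if entity["confidenceScore"] >= threshold:
            accepted.append(entity)
        else:
            needs_review.append(entity)
    return accepted, needs_review


# Hypothetical detections for illustration only.
detections = [
    {"text": "Jane Doe", "category": "Person", "confidenceScore": 0.97},
    {"text": "Contoso", "category": "Organization", "confidenceScore": 0.55},
]
accepted, needs_review = route_entities(detections, per_category={"Organization": 0.6})
```

In a redaction workflow, low-confidence detections would usually remain redacted while they wait for review, since dropping them risks exposing real PII.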
Cost-Effectiveness
- Selective Redaction: Use piiCategories to redact only the entity types relevant to your scenario.
- excludeExtractionData: Set to true when you only need the redacted document and do not need the JSON entity extraction data, reducing response payload size.
- Autoscaling & Rate Limiting: Configure autoscaling for peak loads and apply throttling to avoid unnecessary compute costs.
Additional assets
List of additional assets (e.g. training data, technical reports, data processing code, model training code, model inference code, model evaluation code), if any, that are made available with a link, description of how each can be accessed and what licenses, if any, relate to their use.
Distribution
More information
Responsible AI considerations
Safety techniques
N/A
Safety evaluations
N/A
Known limitations
Depending on your scenario, input data, and the entities you wish to extract, you could experience different levels of performance. The following sections are designed to help you understand key concepts about performance as they apply to using the Azure Language Document PII service.
Understand and measure performance
Because both false positive and false negative errors can occur, it is important to understand how each type of error might affect your overall system. In redaction scenarios, false negatives could lead to personal information leakage, so consider a human review process to account for this type of error. In sensitivity label scenarios, both false positives and false negatives could lead to misclassification of documents: the audience may be unnecessarily limited for documents labeled as confidential where a false positive occurred, and PII could be leaked where a false negative occurred and a public label was applied.
You can tune your system by adjusting the confidence score threshold it uses. If it is more important to identify all potential instances of PII, use a lower threshold. You may then get more false positives (non-PII data recognized as PII entities) but fewer false negatives (PII entities not recognized as PII). If it is more important for your system to recognize only true PII data, use a higher threshold. Threshold values may not behave consistently across individual categories of PII entities, so it is critical that you test your system with the real data it will process in production.
System limitations and best practices for enhancing performance
- Make sure you understand all the entity categories that can be recognized by the system. Depending on your scenario, your data may include other information that could be considered personal but is not covered by the categories the service currently supports.
- Context is important for all entity categories to be correctly recognized by the system. Always submit complete documents rather than fragmented excerpts to ensure the model has sufficient context for accurate detection.
- Person names in particular require linguistic context. Send as much context as possible for better person name detection.
- The Document PII service accepts native document formats (PDF, Word, plain text). Images with embedded text and tables in scanned documents are not supported. Ensure documents are text-selectable for best detection quality.
- The service currently does not support password-protected or encrypted documents. Remove protection before submitting for analysis.
- Although many international entities are supported, currently the service only supports English text. Consider verifying the language the input text is in if you're not sure it will be all in English.
- Make sure to carefully test your redaction workflow to ensure identified entities are not accidentally leaked when processing documents in bulk.
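The testing advice above, together with the threshold trade-off described under "Understand and measure performance", can be turned into a small evaluation harness. This is a sketch under stated assumptions: you have a labeled set of true PII strings for each test document, detections are available as (text, confidence score) pairs, and the redacted output has already been extracted to plain text. Both helper names are hypothetical.

```python
def sweep_thresholds(detections, ground_truth, thresholds):
    """Measure precision and recall at each confidence threshold.

    detections: list of (text, score) pairs; ground_truth: set of true PII strings.
    Returns {threshold: (precision, recall, missed_pii)}.
    """
    results = {}
    for threshold in thresholds:
        kept = {text for text, score in detections if score >= threshold}
        true_positives = kept & ground_truth
        precision = len(true_positives) / len(kept) if kept else 1.0
        recall = len(true_positives) / len(ground_truth) if ground_truth else 1.0
        # Missed items are the false negatives, which would leak if trusted as-is.
        results[threshold] = (precision, recall, sorted(ground_truth - kept))
    return results


def find_leaks(redacted_text, known_pii):
    """Return the known PII strings that still appear in redacted output."""
    lowered = redacted_text.lower()
    return [value for value in known_pii if value.lower() in lowered]


# Hypothetical evaluation data for illustration only.
detections = [("Jane Doe", 0.95), ("Contoso", 0.60), ("Monday", 0.40)]
truth = {"Jane Doe", "Contoso"}
report = sweep_thresholds(detections, truth, [0.3, 0.5, 0.9])
```

With this data, a threshold of 0.3 keeps the false positive "Monday", while 0.9 misses "Contoso". Running `find_leaks` over every redacted file in a bulk job gives a simple regression check that known test values never survive redaction.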
Acceptable use
Acceptable use policy
Microsoft wants to help you responsibly develop and deploy solutions that use Azure Language. We are taking a principled approach to upholding personal agency and dignity by considering the fairness, reliability & safety, privacy & security, inclusiveness, transparency, and human accountability of our AI systems. These considerations are in line with our commitment to developing Responsible AI. This article discusses Azure Language features and the key considerations for making use of this technology responsibly. Consider the following factors when you decide how to use and implement AI-powered products and features.
General guidelines
When you're getting ready to deploy AI-powered products or features, the following activities help to set you up for success:
- Understand what it can do: Fully assess any AI model you are using to understand its capabilities and limitations. Understand how it will perform in your particular scenario and context.
- Test with real, diverse data: Understand how your system will perform in your scenario by thoroughly testing it with real life conditions and data that reflects the diversity in your users, geography and deployment contexts. Small datasets, synthetic data and tests that don't reflect your end-to-end scenario are unlikely to sufficiently represent your production performance.
- Respect an individual's right to privacy: Only collect data and information from individuals for lawful and justifiable purposes. Only use data and information that you have consent to use for this purpose.
- Legal review: Obtain appropriate legal advice to review your solution, particularly if you will use it in sensitive or high-risk applications. Understand what restrictions you might need to work within and your responsibility to resolve any issues that might come up in the future. Do not provide any legal advice or guidance.
- System review: If you're planning to integrate and responsibly use an AI-powered product or feature into an existing system of software, customers or organizational processes, take the time to understand how each part of your system will be affected. Consider how your AI solution aligns with Microsoft's Responsible AI principles.
- Human in the loop: Keep a human in the loop, and include human oversight as a consistent pattern area to explore. This means constant human oversight of the AI-powered product or feature and maintaining the role of humans in decision-making. Ensure you can have real-time human intervention in the solution to prevent harm. This enables you to manage where the AI model doesn't perform as required.
- Security: Ensure your solution is secure and has adequate controls to preserve the integrity of your content and prevent unauthorized access.
- Customer feedback loop: Provide a feedback channel that allows users and individuals to report issues with the service once it's been deployed. Once you've deployed an AI-powered product or feature it requires ongoing monitoring and improvement – be ready to implement any feedback and suggestions for improvement.
Terms of Service
Terms of Service Link
Your use of the Azure service is governed by the terms and conditions of the agreement under which you obtained the services.
- For customers who purchase or renew a subscription (including free trials) online from Microsoft, your use is governed by either the Microsoft Customer Agreement ("MCA"), or the Microsoft Online Subscription Agreement ("MOSA"). Your use is governed by the latter if the MCA is not available in your geography. Visit the MCA page for availability details.
- For customers who purchase through another Microsoft Commercial Licensing Program, such as an Enterprise Agreement, your use is governed by the licensing agreement under which you purchased the services. You can obtain a copy of your licensing agreement by contacting your Microsoft account representative or Commercial Licensing.
- If you do not have an Azure subscription, the Microsoft Terms of Use will govern your use of the limited Azure services which can be used without a subscription.
Model Specifications
Last Updated: April 2026
Input Type: Text
Output Type: Text
Provider: Microsoft