Here's the README content in plain text format with the requested changes:
Detector-Redactor is a Python-based tool for analyzing and redacting Personally Identifiable Information (PII) from text. It uses the Presidio Analyzer and Anonymizer libraries to detect and anonymize sensitive information.
-
Clone the repository:
git clone https://github.com/yourusername/Detector-Redactor.git cd Detector-Redactor -
Install the required packages:
pip install -r requirements.txtIf you encounter any issues or don't have pip installed:
a. Ensure you have Python 3.7+ installed:
python --versionb. Check if pip is installed and get its version:
pip --versionc. If pip is not installed or you need to upgrade it:
python -m ensurepip --upgraded. If you still face problems, refer to the official Python documentation for pip installation instructions specific to your operating system.
Note: It's recommended to use a virtual environment for this project. You can create one using:
python -m venv venv source venv/bin/activate # On Windows, use `venv\Scripts\activate`Then proceed with the pip install command.
To run the Detector-Redactor script:
python analyzer/PIIAnalyzer.py --config <path/to/config.yml> --recognizers <path/to/recognizers-config.yml>
Command-line arguments:
--config: Path to the main configuration file (default: "./analyzer/config.yml")--recognizers: Path to the custom recognizers configuration file (default: "./analyzer/recognizers-config.yml")
The Detector-Redactor uses two main configuration files:
config.yml: Contains general settings and PII type mappings.recognizers-config.yml: Defines custom recognizers for specific PII types.
The main configuration file includes:
- PII type mappings
- Recognized and unrecognized PII types
- Dataset filepath
Example:
pii_mapping:
FIRSTNAME: PERSON
LASTNAME: PERSON
MIDDLENAME: PERSON
PHONENUMBER: PHONE_NUMBER
EMAIL: EMAIL_ADDRESS
IPV4: IP_ADDRESS
IPV6: IP_ADDRESS
URL: URL
IBAN: IBAN_CODE
CREDITCARDNUMBER: CREDIT_CARD
BITCOINADDRESS: CRYPTO
DOB: DATE_TIME
DATE: DATE_TIME
TIME: DATE_TIME
SSN: US_SSN
STREET: ADDRESS
CITY: ADDRESS
STATE: ADDRESS
ZIP: ADDRESS
COUNTRY: ADDRESS
pii_types:
recognized:
- PERSON
- PHONE_NUMBER
- EMAIL_ADDRESS
- IP_ADDRESS
- URL
- IBAN_CODE
- CREDIT_CARD
- CRYPTO
- DATE_TIME
- US_SSN
- ADDRESS
unrecognized:
- UNKNOWN_TYPE_1
- UNKNOWN_TYPE_2
dataset_filepath: "path/to/dataset.csv"This file defines custom recognizers for specific PII types, including:
- Regex patterns
- Supported languages
- Context words
- Scoring
Example:
global_regex_flags: 26
supported_languages:
- en
- es
recognizers:
- name: CreditCardCVVRecognizer
patterns:
- name: "CVV"
regex: "\b\d{3,4}\b"
score: 0.7
supported_languages:
- language: en
context: [cvv, card verification, security code]
- language: es
context: [cvv, verificación de tarjeta, código de seguridad]
supported_entity: "CREDIT_CARD_CVV"
type: custom
- name: AmexAccountNumberRecognizer
patterns:
- name: "AMEX"
regex: "\b3[47]\d{2}[-\s]?\d{6}[-\s]?\d{5}\b"
score: 0.9
supported_languages:
- language: en
context: [amex, american express, account number]
- language: es
context: [amex, american express, número de cuenta]
supported_entity: "AMEX_ACCOUNT_NUMBER"
type: custom
- name: PasswordRecognizer
patterns:
- name: "PASSWORD"
regex: "\b(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{8,}\b"
score: 0.6
supported_languages:
- language: en
context: [password, pwd, passcode]
- language: es
context: [contraseña, clave]
supported_entity: "PASSWORD"
type: custom
- name: UserIDRecognizer
patterns:
- name: "USERID"
regex: "\b[A-Za-z0-9_]{3,20}\b"
score: 0.5
supported_languages:
- language: en
context: [user, username, userid, login]
- language: es
context: [usuario, nombre de usuario, identificación de usuario, inicio de sesión]
supported_entity: "USERID"
type: custom
- name: VehicleVINRecognizer
patterns:
- name: "VIN"
regex: "\b[A-HJ-NPR-Z0-9]{17}\b"
score: 0.8
supported_languages:
- language: en
context: [vin, vehicle identification number]
- language: es
context: [vin, número de identificación del vehículo]
supported_entity: "VEHICLE_VIN"
type: customThe general pipeline for analyzing and redacting text using Detector-Redactor is as follows:
- Load configurations
- Initialize NLP engine and recognizer registry
- Create analyzer engine with loaded configurations
- Analyze input text using the analyzer engine
- (TODO) Anonymize detected PII using the anonymizer engine
The PIIAnalyzer class is the core component of the Detector-Redactor tool. It provides methods for:
- Loading configurations
- Creating recognizer registries
- Analyzing text for PII
- Managing supported languages
For more details on the PIIAnalyzer class, refer to:
class PIIAnalyzer:
"""
A class for analyzing and detecting Personally Identifiable Information (PII) in text.
This class provides methods for loading configurations, creating recognizer registries,
analyzing text for PII, and managing supported languages.
Attributes:
config (Dict): Configuration dictionary for the analyzer.
nlp_engine_provider (NlpEngineProvider): Provider for NLP engine.
recognizer_registry (RecognizerRegistry): Registry of PII recognizers.
analyzer_engine (AnalyzerEngine): Engine for analyzing text.
"""The Detector-Redactor includes several default recognizers defined in the recognizers-config.yml file:
- CreditCardCVVRecognizer
- AmexAccountNumberRecognizer
- PasswordRecognizer
- UserIDRecognizer
- VehicleVINRecognizer
These recognizers use regex patterns and context words to identify specific types of PII in the input text.
Listed in order of priority:
Implement the Anonymizer class to redact or replace detected PIICreate a user-friendly interface for configuring recognizers and anonymizers- Make encryption more format-preserving in a way that is easy for an LLM to still understand the context of the message
- Build pipeline to feed redacted text into a LLM for further processing (fun, optional, would look cool but needs encryption fixed)
- Add more custom recognizers for additional PII types
- Implement support for more languages
- Write test cases and improve error handling
- Fine-tune recognizer patterns and scoring
- Implement a mechanism to easily add new recognizers without modifying the main code
- Implement logging and reporting features (for console, but also metadata download for user would be nice)
- Add support for different output formats (e.g., JSON, CSV)
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.