Skip to content

AMEX-2A/Detector-Redactor

Repository files navigation

Here's the README content in plain text format with the requested changes:

Detector-Redactor

Demo

Dectector-Redactor Demo
Detector-Redactor is a Python-based tool for analyzing and redacting Personally Identifiable Information (PII) from text. It uses the Presidio Analyzer and Anonymizer libraries to detect and anonymize sensitive information.

Table of Contents

  1. Installation
  2. Usage
  3. Configuration
  4. Pipeline
  5. PIIAnalyzer
  6. Recognizers
  7. TODOs

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/Detector-Redactor.git
    cd Detector-Redactor
    
  2. Install the required packages:

    pip install -r requirements.txt
    

    If you encounter any issues or don't have pip installed:

    a. Ensure you have Python 3.7+ installed:

    python --version
    

    b. Check if pip is installed and get its version:

    pip --version
    

    c. If pip is not installed or you need to upgrade it:

    python -m ensurepip --upgrade
    

    d. If you still face problems, refer to the official Python documentation for pip installation instructions specific to your operating system.

    Note: It's recommended to use a virtual environment for this project. You can create one using:

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    

    Then proceed with the pip install command.

Usage

To run the Detector-Redactor script:

python analyzer/PIIAnalyzer.py --config <path/to/config.yml> --recognizers <path/to/recognizers-config.yml>

Command-line arguments:

  • --config: Path to the main configuration file (default: "./analyzer/config.yml")
  • --recognizers: Path to the custom recognizers configuration file (default: "./analyzer/recognizers-config.yml")

Configuration

The Detector-Redactor uses two main configuration files:

  1. config.yml: Contains general settings and PII type mappings.
  2. recognizers-config.yml: Defines custom recognizers for specific PII types.

Main Configuration (config.yml)

The main configuration file includes:

  • PII type mappings
  • Recognized and unrecognized PII types
  • Dataset filepath

Example:

pii_mapping:
  FIRSTNAME: PERSON
  LASTNAME: PERSON
  MIDDLENAME: PERSON
  PHONENUMBER: PHONE_NUMBER
  EMAIL: EMAIL_ADDRESS
  IPV4: IP_ADDRESS
  IPV6: IP_ADDRESS
  URL: URL
  IBAN: IBAN_CODE
  CREDITCARDNUMBER: CREDIT_CARD
  BITCOINADDRESS: CRYPTO
  DOB: DATE_TIME
  DATE: DATE_TIME
  TIME: DATE_TIME
  SSN: US_SSN
  STREET: ADDRESS
  CITY: ADDRESS
  STATE: ADDRESS
  ZIP: ADDRESS
  COUNTRY: ADDRESS

pii_types:
  recognized:
    - PERSON
    - PHONE_NUMBER
    - EMAIL_ADDRESS
    - IP_ADDRESS
    - URL
    - IBAN_CODE
    - CREDIT_CARD
    - CRYPTO
    - DATE_TIME
    - US_SSN
    - ADDRESS

  unrecognized:
    - UNKNOWN_TYPE_1
    - UNKNOWN_TYPE_2

dataset_filepath: "path/to/dataset.csv"

Recognizers Configuration (recognizers-config.yml)

This file defines custom recognizers for specific PII types, including:

  • Regex patterns
  • Supported languages
  • Context words
  • Scoring

Example:

global_regex_flags: 26

supported_languages: 
  - en
  - es

recognizers:
  - name: CreditCardCVVRecognizer
    patterns:
      - name: "CVV"
        regex: "\b\d{3,4}\b"
        score: 0.7
    supported_languages:
      - language: en
        context: [cvv, card verification, security code]
      - language: es
        context: [cvv, verificación de tarjeta, código de seguridad]
    supported_entity: "CREDIT_CARD_CVV"
    type: custom

  - name: AmexAccountNumberRecognizer
    patterns:
      - name: "AMEX"
        regex: "\b3[47]\d{2}[-\s]?\d{6}[-\s]?\d{5}\b"
        score: 0.9
    supported_languages:
      - language: en
        context: [amex, american express, account number]
      - language: es
        context: [amex, american express, número de cuenta]
    supported_entity: "AMEX_ACCOUNT_NUMBER"
    type: custom

  - name: PasswordRecognizer
    patterns:
      - name: "PASSWORD"
        regex: "\b(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{8,}\b"
        score: 0.6
    supported_languages:
      - language: en
        context: [password, pwd, passcode]
      - language: es
        context: [contraseña, clave]
    supported_entity: "PASSWORD"
    type: custom

  - name: UserIDRecognizer
    patterns:
      - name: "USERID"
        regex: "\b[A-Za-z0-9_]{3,20}\b"
        score: 0.5
    supported_languages:
      - language: en
        context: [user, username, userid, login]
      - language: es
        context: [usuario, nombre de usuario, identificación de usuario, inicio de sesión]
    supported_entity: "USERID"
    type: custom

  - name: VehicleVINRecognizer
    patterns:
      - name: "VIN"
        regex: "\b[A-HJ-NPR-Z0-9]{17}\b"
        score: 0.8
    supported_languages:
      - language: en
        context: [vin, vehicle identification number]
      - language: es
        context: [vin, número de identificación del vehículo]
    supported_entity: "VEHICLE_VIN"
    type: custom

Pipeline

The general pipeline for analyzing and redacting text using Detector-Redactor is as follows:

  1. Load configurations
  2. Initialize NLP engine and recognizer registry
  3. Create analyzer engine with loaded configurations
  4. Analyze input text using the analyzer engine
  5. (TODO) Anonymize detected PII using the anonymizer engine

PIIAnalyzer

The PIIAnalyzer class is the core component of the Detector-Redactor tool. It provides methods for:

  • Loading configurations
  • Creating recognizer registries
  • Analyzing text for PII
  • Managing supported languages

For more details on the PIIAnalyzer class, refer to:

class PIIAnalyzer:
    """
    A class for analyzing and detecting Personally Identifiable Information (PII) in text.

    This class provides methods for loading configurations, creating recognizer registries,
    analyzing text for PII, and managing supported languages.

    Attributes:
        config (Dict): Configuration dictionary for the analyzer.
        nlp_engine_provider (NlpEngineProvider): Provider for NLP engine.
        recognizer_registry (RecognizerRegistry): Registry of PII recognizers.
        analyzer_engine (AnalyzerEngine): Engine for analyzing text.
    """

Recognizers

The Detector-Redactor includes several default recognizers defined in the recognizers-config.yml file:

  1. CreditCardCVVRecognizer
  2. AmexAccountNumberRecognizer
  3. PasswordRecognizer
  4. UserIDRecognizer
  5. VehicleVINRecognizer

These recognizers use regex patterns and context words to identify specific types of PII in the input text.

TODOs

Listed in order of priority:

  1. Implement the Anonymizer class to redact or replace detected PII
  2. Create a user-friendly interface for configuring recognizers and anonymizers
  3. Make encryption more format-preserving in a way that is easy for an LLM to still understand the context of the message
  4. Build pipeline to feed redacted text into a LLM for further processing (fun, optional, would look cool but needs encryption fixed)

Low Priority

  1. Add more custom recognizers for additional PII types
  2. Implement support for more languages
  3. Write test cases and improve error handling
  4. Fine-tune recognizer patterns and scoring
  5. Implement a mechanism to easily add new recognizers without modifying the main code
  6. Implement logging and reporting features (for console, but also metadata download for user would be nice)
  7. Add support for different output formats (e.g., JSON, CSV)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages