Data deidentifier

Overview

A comprehensive solution for anonymizing and pseudonymizing personally identifiable information (PII) in both textual and structured data. Built on Microsoft Presidio, this API provides enterprise-grade data deidentification with configurable operators and methods.

Key Features

Text & Structured Data Processing - Support for plain text and JSON
Flexible Anonymization - Multiple operators: replace, redact, mask, hash, encrypt
Consistent Pseudonymization - Random number, counter, and cryptographic hash methods. Maintains consistency within a single request.
Production Ready - Thread-safe, scalable FastAPI service with comprehensive error handling

Supported Entity Types

PERSON, EMAIL_ADDRESS, PHONE_NUMBER, CREDIT_CARD, IP_ADDRESS, LOCATION, DATE_TIME, URL.

Available Methods and Operators

Anonymization Operators:

replace - Replace with generic placeholders
redact - Remove entirely
mask - Mask with a configurable character
hash - Cryptographic hash
encrypt - Encryption
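
As a rough illustration of what each operator does to a detected value, here is a minimal pure-Python sketch. It does not use Presidio or this project's code: the function name `apply_operator`, the `<ENTITY_TYPE>` placeholder format, and the choice of SHA-256 for `hash` are assumptions made for the example. `encrypt` is left out because it needs a key and a cipher library.

```python
import hashlib


def apply_operator(value: str, entity_type: str, operator: str, masking_char: str = "*") -> str:
    """Illustrative stand-in for the anonymization operators (not the service's code)."""
    if operator == "replace":
        # Swap the value for a generic placeholder naming the entity type.
        return f"<{entity_type}>"
    if operator == "redact":
        # Remove the value entirely.
        return ""
    if operator == "mask":
        # Overwrite every character with the masking character.
        return masking_char * len(value)
    if operator == "hash":
        # One-way digest; SHA-256 is an assumption made for this sketch.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    raise ValueError(f"unsupported operator: {operator}")


print(apply_operator("John Doe", "PERSON", "replace"))
print(apply_operator("Alice Johnson", "PERSON", "mask"))
```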

Pseudonymization Methods:

random_number - Cryptographically secure random pseudonyms
counter - Sequential numbering
crypto_hash - BLAKE2b-based pseudonyms
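
The three methods differ only in how the pseudonym suffix is produced. The sketch below shows one way to get consistent pseudonyms within a single request by caching each (entity type, value) pair. The `Pseudonymizer` class, the `<TYPE_suffix>` format, the demo key, and the digest size are assumptions for illustration, not the project's implementation.

```python
import hashlib
import secrets
from collections import defaultdict


class Pseudonymizer:
    """Maps each (entity type, value) pair to one pseudonym for the lifetime of the object."""

    def __init__(self, method: str, key: bytes = b"demo-key") -> None:
        self.method = method
        self.key = key  # hypothetical secret, used only by crypto_hash
        self._seen: dict = {}
        self._counters: defaultdict = defaultdict(int)

    def pseudonym(self, entity_type: str, value: str) -> str:
        cache_key = (entity_type, value)
        if cache_key not in self._seen:
            suffix = self._suffix(entity_type, value)
            self._seen[cache_key] = f"<{entity_type}_{suffix}>"
        return self._seen[cache_key]

    def _suffix(self, entity_type: str, value: str) -> str:
        if self.method == "counter":
            # Sequential numbering, kept separately for each entity type.
            self._counters[entity_type] += 1
            return str(self._counters[entity_type])
        if self.method == "random_number":
            # Cryptographically secure random number; this sketch does not
            # guard against two distinct values drawing the same number.
            return str(secrets.randbelow(10**6))
        if self.method == "crypto_hash":
            # Keyed BLAKE2b digest, truncated to 8 bytes (16 hex characters).
            digest = hashlib.blake2b(value.encode("utf-8"), key=self.key, digest_size=8)
            return digest.hexdigest()
        raise ValueError(f"unsupported method: {self.method}")
```

Creating one `Pseudonymizer` per request gives the "consistent within a single request" behaviour: repeated values reuse the cached pseudonym, while a new request starts with an empty cache.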

Setup and installation

You can run the application either directly with uv or using Docker.

Clone the repository
Set up environment variables: Create a .env file in the project root by copying .env.default:
```
cp .env.default .env
```
You can then modify the variables in .env as needed.

With Docker

The application is containerized using Docker, with a robust and flexible deployment strategy that leverages:

Docker for containerization, with multi-environment support (dev and prod) through Docker Compose profiles
Traefik as a reverse proxy and load balancer, with built-in SSL/TLS support via Let's Encrypt, and a dashboard in dev environment.
Gunicorn as the production-grade WSGI HTTP server, with configurable worker processes and threads, and dynamic scaling based on system resources.
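
To make the worker settings concrete, here is a hedged sketch of what a Gunicorn configuration module could look like. `bind`, `workers`, and `threads` are real Gunicorn settings, and the environment variable names come from the table in this README. The `resolve_workers` helper and the fallback to a CPU-based value are assumptions; the project's actual `gunicorn.conf.py` may be organised differently.

```python
import multiprocessing
import os
from typing import Mapping


def resolve_workers(env: Mapping[str, str]) -> int:
    """Use WORKERS_COUNT when set, otherwise derive a value from the CPU count."""
    configured = env.get("WORKERS_COUNT")
    if configured:
        return int(configured)
    # (2 x cores) + 1 is the rule of thumb suggested in the Gunicorn documentation.
    return multiprocessing.cpu_count() * 2 + 1


host = os.environ.get("APP_INTERNAL_HOST", "0.0.0.0")
port = os.environ.get("APP_INTERNAL_PORT", "8005")

# Module-level names that Gunicorn reads from a config file.
bind = f"{host}:{port}"
workers = resolve_workers(os.environ)
threads = int(os.environ.get("THREADS_PER_WORKER", "2"))
```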

Prerequisites

Docker and Docker Compose installed on your machine.

Development Environment

Build and run the development environment:

docker compose --profile dev up --build

The API will be available at: http://ddi.localhost

The Traefik dashboard will be available at: http://traefik.ddi.localhost

Quick Start (Without volumes or Traefik)

For a quick test without full stack:

docker build --target dev-standalone -t ddi:dev-standalone .
docker run --env-file .env -p 8005:8005 ddi:dev-standalone

Note: This version won't reflect source code changes in real-time.

Production Environment

Configure production-specific settings, then build and run the production environment:

docker compose --profile prod up --build

With uv

Prerequisites

Python 3.13 or higher
uv for dependency management

Installation

Install uv, see https://docs.astral.sh/uv/getting-started/installation/
Install dependencies
```
make init
```
Start the server
```
make start
```

The API will be available at http://localhost:8005.

Usage

Text Anonymization

curl -X POST "http://localhost:8005/anonymize/text" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "John Doe lives in New York and his email is [email protected]",
    "operator": "replace"
  }'

Response:

{
  "anonymized_text": "<PERSON> lives in <LOCATION> and his email is <EMAIL_ADDRESS>",
  "detected_entities": [
    {
      "type": "PERSON",
      "start": 0,
      "end": 8,
      "score": 0.85,
      "text": "John Doe"
    },
    {
      "type": "LOCATION",
      "start": 18,
      "end": 26,
      "score": 0.85,
      "text": "New York"
    },
    {
      "type": "EMAIL_ADDRESS",
      "start": 44,
      "end": 60,
      "score": 1.0,
      "text": "[email protected]"
    }
  ]
}
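
The same call can be made from Python. The sketch below uses only the standard library and assumes the server is reachable at `http://localhost:8005`, as in the uv setup; the helper names `build_request` and `anonymize_text` are hypothetical and not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8005"  # assumed address of a locally running server


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Prepare a JSON POST request without sending it."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def anonymize_text(text: str, operator: str = "replace") -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    request = build_request("/anonymize/text", {"text": text, "operator": operator})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

With the server running, `anonymize_text("John Doe lives in New York")["anonymized_text"]` would return the anonymized string from the response body.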

Structured Data Anonymization

curl -X POST "http://localhost:8005/anonymize/structured" \
  -H "Content-Type: application/json" \
  -d '{
    "data": {
      "user": {
        "name": "Alice Johnson",
        "email": "[email protected]",
        "address": "123 Main St, Boston"
      }
    },
    "operator": "mask",
    "operator_params": {
      "masking_char":"*",
      "chars_to_mask":999
    }
  }'

Response:

{
  "anonymized_data": {
    "user": {
      "name": "*************",
      "email": "*****************",
      "address": "*******************"
    }
  },
  "detected_fields": {
    "user.name": "PERSON",
    "user.email": "EMAIL_ADDRESS",
    "user.address": "LOCATION"
  }
}
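
The keys of `detected_fields` are dotted paths into the nested input. A minimal sketch of how such paths can be produced by walking a JSON document is shown below; the function name `iter_leaf_paths` and the handling of lists by index are assumptions, since the README only shows nested objects.

```python
from typing import Any, Iterator, Tuple


def iter_leaf_paths(data: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, value) for every leaf of a nested JSON-like structure."""
    if isinstance(data, dict):
        for key, value in data.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from iter_leaf_paths(value, path)
    elif isinstance(data, list):
        # Assumption: list items are addressed by their index.
        for index, value in enumerate(data):
            path = f"{prefix}.{index}" if prefix else str(index)
            yield from iter_leaf_paths(value, path)
    else:
        yield prefix, data


record = {"user": {"name": "Alice Johnson", "address": "123 Main St, Boston"}}
for path, value in iter_leaf_paths(record):
    print(path, "->", value)
```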

Text Pseudonymization

curl -X POST "http://localhost:8005/pseudonymize/text" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "John Doe was born on January 24, 1985, and Jane Smith lives in 221B Baker Street London",
    "method": "counter"
  }'

Response:

{
  "pseudonymized_text": "<PERSON_2> was born on <DATE_TIME_1>, and <PERSON_1> lives in <LOCATION_1>",
  "detected_entities": [
    {
      "type": "PERSON",
      "start": 0,
      "end": 8,
      "score": 0.85,
      "text": "John Doe"
    },
    {
      "type": "DATE_TIME",
      "start": 21,
      "end": 37,
      "score": 0.85,
      "text": "January 24, 1985"
    },
    {
      "type": "PERSON",
      "start": 43,
      "end": 53,
      "score": 0.85,
      "text": "Jane Smith"
    },
    {
      "type": "LOCATION",
      "start": 63,
      "end": 87,
      "score": 0.85,
      "text": "221B Baker Street London"
    }
  ]
}
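
The offsets in `detected_entities` refer to the original text, so they can be used to rebuild the output locally. The sketch below replaces spans from last to first, which keeps earlier offsets valid, and numbers each entity type with its own counter. Numbering in that right-to-left order happens to reproduce the sample response above (`<PERSON_2>` for "John Doe"), but whether the service works this way internally is an assumption.

```python
from collections import defaultdict

text = "John Doe was born on January 24, 1985, and Jane Smith lives in 221B Baker Street London"
entities = [
    {"type": "PERSON", "start": 0, "end": 8},
    {"type": "DATE_TIME", "start": 21, "end": 37},
    {"type": "PERSON", "start": 43, "end": 53},
    {"type": "LOCATION", "start": 63, "end": 87},
]


def pseudonymize_with_counter(text: str, entities: list) -> str:
    """Replace each span with <TYPE_n>, processing spans from the end of the text."""
    counters = defaultdict(int)
    seen = {}
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        # Text to the left of already-replaced spans is untouched, so the
        # original offsets still point at the right characters.
        value = text[entity["start"]:entity["end"]]
        key = (entity["type"], value)
        if key not in seen:
            counters[entity["type"]] += 1
            seen[key] = f"<{entity['type']}_{counters[entity['type']]}>"
        text = text[:entity["start"]] + seen[key] + text[entity["end"]:]
    return text


print(pseudonymize_with_counter(text, entities))
```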

Entity Enrichment

Configure external services to add contextual information to pseudonyms, by entity type:

# In .env file
ENRICHMENT_CONFIGURATIONS='{
  "LOCATION": {
    "type": "http",
    "url": "http://your-geo-service.example.com/enrich"
  }
}'

For example, this transforms <LOCATION_123> into <LOCATION_123> (United Kingdom) when the service returns country information.
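
A hedged sketch of how such a configuration might be consumed is shown below. It only parses the JSON value and formats the enriched pseudonym; the function names are hypothetical, and the request and response format of the enrichment service is not described in this README, so no HTTP call is made.

```python
import json
from typing import Optional


def load_enrichment_config(raw: str) -> dict:
    """Parse ENRICHMENT_CONFIGURATIONS and check that each entry has 'type' and 'url'."""
    config = json.loads(raw or "{}")
    for entity_type, settings in config.items():
        missing = {"type", "url"} - set(settings)
        if missing:
            raise ValueError(f"{entity_type}: missing keys {sorted(missing)}")
    return config


def enrich(pseudonym: str, context: Optional[str]) -> str:
    """Append contextual information to a pseudonym when the service returned some."""
    return f"{pseudonym} ({context})" if context else pseudonym


raw = '{"LOCATION": {"type": "http", "url": "http://your-geo-service.example.com/enrich"}}'
config = load_enrichment_config(raw)
print(enrich("<LOCATION_123>", "United Kingdom"))
```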

Development

API Documentation

Once the server is running, you can access the interactive API documentation:

Swagger UI: Available at /docs
ReDoc: Available at /redoc

These interfaces provide detailed information about all available endpoints, request/response schemas, and allow you to test the API directly from your browser.

Code Formatting and Linting

The project uses Ruff for linting and formatting, with pre-commit hooks for automated quality checks. Code documentation is built with MkDocs and the Material theme.

Development Commands

Key commands for development:

make help              # Display all available commands
make init              # Initialize project (first installation)
make start             # Start application
make check             # Run all checks (precommit + test)
make format            # Format code
make lint              # Run linting checks
make docs-serve        # Serve project documentation locally

Environment Variables

| Variable | Description | Required | Default Value | Possible Values |
|---|---|---|---|---|
| **Application Configuration** | | | | |
| `DEFAULT_LANGUAGE` | Default language for text analysis | No | `en` | `en`, `fr` |
| `DEFAULT_MINIMUM_SCORE` | Default confidence threshold | No | `0.5` | `0.0` to `1.0` |
| `DEFAULT_ANONYMIZATION_OPERATOR` | Default anonymization method | No | `replace` | `replace`, `redact`, `mask`, `hash`, `encrypt` |
| `DEFAULT_PSEUDONYMIZATION_METHOD` | Default pseudonymization method | No | `random_number` | `random_number`, `counter`, `crypto_hash` |
| `ENRICHMENT_CONFIGURATIONS` | Entity enrichment service configs | No | `{}` | JSON object |
| **Environment Configuration** | | | | |
| `ENVIRONMENT` | Affects error handling and logging throughout the application | No | `development` | `development`, `production` |
| `LOG_LEVEL` | Minimum logging level | No | `info` | `debug`, `info`, `warning`, `error`, `critical` |
| **Internal Application Configuration** | | | | |
| `APP_INTERNAL_HOST` | Host for internal application binding | No | `0.0.0.0` | Valid host/IP |
| `APP_INTERNAL_PORT` | Port for internal application binding | No | `8005` | Any valid port |
| **External Routing Configuration** | | | | |
| `APP_EXTERNAL_HOST` | External hostname for the application | Yes | `ddi.localhost` | Valid hostname |
| `APP_EXTERNAL_PORT` | External port for routing (dev env only) | No | `80` | Any valid port |
| **Traefik Configuration** | | | | |
| `TRAEFIK_RELEASE` | Traefik image version | No | `v3.4.4` | Valid Traefik version |
| `LETS_ENCRYPT_EMAIL` | Email for Let's Encrypt certificate | Yes | `[email protected]` | Valid email |
| **Performance Configuration** | | | | |
| `WORKERS_COUNT` | Number of worker processes | No | `4` | Positive integer |
| `THREADS_PER_WORKER` | Number of threads per worker | No | `2` | Positive integer |

Refer to .env.default for a complete list of configurable environment variables and their default values.
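
As an illustration of how the application-level variables and their defaults fit together, the sketch below reads a few of them from a mapping and validates the documented ranges. The `Settings` dataclass and `load_settings` function are hypothetical; the project's own configuration code may use a different mechanism.

```python
import os
from dataclasses import dataclass
from typing import Mapping

OPERATORS = {"replace", "redact", "mask", "hash", "encrypt"}
METHODS = {"random_number", "counter", "crypto_hash"}


@dataclass(frozen=True)
class Settings:
    default_language: str
    default_minimum_score: float
    default_anonymization_operator: str
    default_pseudonymization_method: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, falling back to the documented defaults."""
    score = float(env.get("DEFAULT_MINIMUM_SCORE", "0.5"))
    if not 0.0 <= score <= 1.0:
        raise ValueError("DEFAULT_MINIMUM_SCORE must be between 0.0 and 1.0")
    operator = env.get("DEFAULT_ANONYMIZATION_OPERATOR", "replace")
    if operator not in OPERATORS:
        raise ValueError(f"unknown operator: {operator}")
    method = env.get("DEFAULT_PSEUDONYMIZATION_METHOD", "random_number")
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    return Settings(
        default_language=env.get("DEFAULT_LANGUAGE", "en"),
        default_minimum_score=score,
        default_anonymization_operator=operator,
        default_pseudonymization_method=method,
    )
```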

Architecture

The project follows Domain-Driven Design principles with clean separation of concerns:

├── domain/                # Core business logic
│   ├── contracts/         # Abstract interfaces
│   ├── services/          # Application services
│   ├── types/             # Domain models and enums
│   └── exceptions.py      # Domain exceptions
├── adapters/              # External integrations
│   ├── api/               # FastAPI routes and schemas
│   ├── presidio/          # Microsoft Presidio integration
│   └── infrastructure/    # Config, HTTP client, enrichment
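
The split between `domain/contracts/` and `adapters/` can be pictured with a small sketch: the domain service depends only on an abstract interface, and an adapter supplies the implementation. All class names here are hypothetical, and a toy regex analyzer stands in for the Presidio adapter so the example stays self-contained.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    type: str
    start: int
    end: int


class TextAnalyzerContract(ABC):
    """Abstract interface, in the spirit of domain/contracts/."""

    @abstractmethod
    def analyze(self, text: str) -> list:
        ...


class RegexEmailAnalyzer(TextAnalyzerContract):
    """Toy adapter; the real project integrates Microsoft Presidio instead."""

    _pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def analyze(self, text: str) -> list:
        return [Entity("EMAIL_ADDRESS", m.start(), m.end()) for m in self._pattern.finditer(text)]


class AnonymizationService:
    """Domain service that only knows the contract, not the concrete adapter."""

    def __init__(self, analyzer: TextAnalyzerContract) -> None:
        self._analyzer = analyzer

    def anonymize(self, text: str) -> str:
        # Replace from the last entity to the first so offsets stay valid.
        for entity in sorted(self._analyzer.analyze(text), key=lambda e: e.start, reverse=True):
            text = text[:entity.start] + f"<{entity.type}>" + text[entity.end:]
        return text


service = AnonymizationService(RegexEmailAnalyzer())
print(service.anonymize("Contact jane@example.org today"))
```

Swapping the analyzer for another implementation of the contract leaves `AnonymizationService` unchanged, which is the separation of concerns the layout above aims for.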

Contributing

We welcome contributions to this project! Please see the CONTRIBUTING.md file for guidelines on how to contribute, including:

How to set up your development environment
Coding standards and style guidelines
Pull request process
Testing requirements

License

This project is licensed under the MIT License - see the LICENSE file for details.
