Fast, Minimalist Text Deduplication Library for Python
Developers frequently face duplicate textual data when dealing with:
- User-generated inputs (reviews, comments, feedback)
- Product catalogs (e-commerce)
- Web-scraping (news articles, blogs, products)
- CRM data (customer profiles, leads)
- NLP/AI training datasets (duplicate records skew training)
Existing Solutions and their Shortcomings:
- Manual Deduplication: Slow, error-prone, impractical at scale.
- Pandas built-in methods: Only exact matches; ineffective for slight differences (typos, synonyms).
- Fuzzywuzzy / RapidFuzz: Powerful but require boilerplate setup for large-scale deduplication.
Solution:
A simple, intuitive, ready-to-use deduplication wrapper around RapidFuzz, minimizing setup effort while providing great speed and accuracy out-of-the-box.
pip install fast-dedupe

import fastdedupe
data = ["Apple iPhone 12", "Apple iPhone12", "Samsung Galaxy", "Samsng Galaxy", "MacBook Pro", "Macbook-Pro"]
# One-liner deduplication
clean_data, duplicates = fastdedupe.dedupe(data, threshold=85)
print(clean_data)
# Output: ['Apple iPhone 12', 'Samsung Galaxy', 'MacBook Pro']
print(duplicates)
# Output: {'Apple iPhone 12': ['Apple iPhone12'],
# 'Samsung Galaxy': ['Samsng Galaxy'],
# 'MacBook Pro': ['Macbook-Pro']}

- High performance: Powered by RapidFuzz for sub-millisecond matching
- Simple API: Single method call (fastdedupe.dedupe())
- Flexible Matching: Handles minor spelling differences, hyphens, abbreviations
- Configurable Sensitivity: Adjust matching threshold easily
- Detailed Output: Cleaned records and clear mapping of detected duplicates
- Command-line Interface: Deduplicate files directly from the terminal
- High Test Coverage: 93%+ code coverage ensures reliability
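The matching threshold corresponds to a 0-100 similarity score. To get a feel for what a given threshold means on your data, you can score a few pairs directly with RapidFuzz, the library fast-dedupe wraps. The snippet below uses rapidfuzz.fuzz.ratio purely to illustrate the 0-100 scale; which scorer fast-dedupe uses internally is an implementation detail:

```python
# Probing similarity scores with RapidFuzz to choose a sensible threshold.
# fuzz.ratio is shown only to illustrate the 0-100 scale; it is not
# necessarily the exact scorer fast-dedupe uses internally.
from rapidfuzz import fuzz

pairs = [
    ("Apple iPhone 12", "Apple iPhone12"),   # near-duplicate
    ("Samsung Galaxy", "Samsng Galaxy"),     # typo variant
    ("MacBook Pro", "Dell XPS 13"),          # unrelated
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {fuzz.ratio(a, b):.1f}")
# Near-duplicates typically score in the high 80s or 90s, while unrelated
# strings score far lower, so a threshold around 85 separates them well.
```

A fuller example on a small product catalog: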
products = [
"Apple iPhone 15 Pro Max (128GB)",
"Apple iPhone-12",
"apple iPhone12",
"Samsung Galaxy S24",
"Samsung Galaxy-S24",
]
cleaned_products, duplicates = fastdedupe.dedupe(products, threshold=90)
# cleaned_products:
# ['Apple iPhone 15 Pro Max (128GB)', 'Apple iPhone-12', 'Samsung Galaxy S24']
# duplicates identified clearly:
# {
# 'Apple iPhone-12': ['apple iPhone12'],
# 'Samsung Galaxy S24': ['Samsung Galaxy-S24']
# }

emails = ["[email protected]", "[email protected]", "[email protected]"]
clean, dupes = fastdedupe.dedupe(emails, threshold=95)
# clean → ["[email protected]", "[email protected]"]
# dupes → {"[email protected]": ["[email protected]"]}

fastdedupe.dedupe() deduplicates a list of strings using fuzzy matching.
Parameters:
- data (list): List of strings to deduplicate
- threshold (int, optional): Similarity threshold (0-100). Default is 85.
- keep_first (bool, optional): If True, keeps the first occurrence of a duplicate. If False, keeps the longest string. Default is True.
Returns:
- tuple: (clean_data, duplicates)
  - clean_data (list): List of deduplicated strings
  - duplicates (dict): Dictionary mapping each kept string to its duplicates
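The keep_first flag decides which member of a duplicate group survives. A small illustration (the expected outputs follow from the parameter descriptions above; whether two strings actually group together depends on the similarity score they receive):

```python
import fastdedupe

names = ["Jon Smith", "John Smith"]  # near-duplicates; the second string is longer

# keep_first=True (default): the first occurrence is kept
clean_a, dupes_a = fastdedupe.dedupe(names, threshold=85, keep_first=True)
# expected: clean_a -> ["Jon Smith"], dupes_a -> {"Jon Smith": ["John Smith"]}

# keep_first=False: the longest string in the group is kept instead
clean_b, dupes_b = fastdedupe.dedupe(names, threshold=85, keep_first=False)
# expected: clean_b -> ["John Smith"], dupes_b -> {"John Smith": ["Jon Smith"]}
```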
fast-dedupe also provides a command-line interface for deduplicating files:
# Basic usage
fastdedupe input.txt
# Save output to a file
fastdedupe input.txt -o deduplicated.txt
# Save duplicates mapping to a file
fastdedupe input.txt -o deduplicated.txt -d duplicates.json
# Adjust threshold
fastdedupe input.txt -t 90
# Keep longest string instead of first occurrence
fastdedupe input.txt --keep-longest
# Work with CSV files
fastdedupe data.csv -f csv --csv-column name
# Work with JSON files
fastdedupe data.json -f json --json-key text

Who it's for:
- Data Engineers / Analysts: Cleaning large datasets before ETL, BI tasks, and dashboards
- ML Engineers & Data Scientists: Cleaning datasets before training models to avoid bias and data leakage
- Software Developers (CRM & ERP systems): Implementing deduplication logic without overhead
- Analysts (E-commerce, Marketing): Cleaning and deduplicating product catalogs, customer databases
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
For more details, see CONTRIBUTING.md.
Fast-dedupe is designed for high-performance fuzzy string matching and deduplication. It leverages the RapidFuzz library for efficient string comparisons and adds several optimizations:
- Efficient data structures: Uses sets and dictionaries for O(1) lookups
- Parallel processing: Automatically uses multiple CPU cores for large datasets
- Early termination: Optimized algorithms that avoid unnecessary comparisons
- Memory efficiency: Processes data in chunks to reduce memory usage
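As a rough mental model, the core greedy pass looks something like the sketch below. This is illustrative only, not fast-dedupe's actual implementation; it uses rapidfuzz.fuzz.ratio as an example scorer:

```python
# Simplified greedy fuzzy-dedup pass (illustration only, not the library's code).
from rapidfuzz import fuzz

def dedupe_sketch(data, threshold=85):
    clean = []
    duplicates = {}
    claimed = set()                      # indices already absorbed: O(1) lookups
    for i, item in enumerate(data):
        if i in claimed:
            continue                     # early termination: skip matched items
        clean.append(item)
        group = []
        for j in range(i + 1, len(data)):
            if j in claimed:
                continue
            if fuzz.ratio(item, data[j]) >= threshold:
                group.append(data[j])
                claimed.add(j)
        if group:
            duplicates[item] = group
    return clean, duplicates
```

The real library layers chunking, parallelism, and RapidFuzz's optimized scorers on top of this basic idea.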
Fast-dedupe automatically switches to parallel processing for datasets larger than 1,000 items. Here's how the multiprocessing implementation works:
- Data Chunking: The input dataset is divided into smaller chunks based on the number of available CPU cores
- Parallel Processing: Each chunk is processed by a separate worker process
- Result Aggregation: Results from all workers are combined into a final deduplicated dataset
┌─────────────────┐
│  Input Dataset  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Split Dataset  │
│   into Chunks   │
└────────┬────────┘
         │
         ▼
┌───────────────────────────────────────────────┐
│                Process Chunks                 │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   │
│  │ Worker 1 │   │ Worker 2 │   │ Worker n │   │
│  └─────┬────┘   └─────┬────┘   └─────┬────┘   │
│        │              │              │        │
│        ▼              ▼              ▼        │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   │
│  │Results 1 │   │Results 2 │   │Results n │   │
│  └─────┬────┘   └─────┬────┘   └─────┬────┘   │
└────────┼──────────────┼──────────────┼────────┘
         │              │              │
         └──────────────┼──────────────┘
                        │
                        ▼
               ┌─────────────────┐
               │ Combine Results │
               └────────┬────────┘
                        │
                        ▼
               ┌─────────────────┐
               │  Final Output   │
               │ (clean, dupes)  │
               └─────────────────┘
The parallel implementation provides near-linear speedup with the number of CPU cores, making it especially effective for large datasets. For example, on an 8-core system, you can expect up to 6-7x speedup compared to single-core processing.
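A minimal sketch of this chunk → process → combine flow, using Python's multiprocessing.Pool and the dedupe_sketch helper from the earlier example (again illustrative, not fast-dedupe's actual worker code):

```python
# Chunk -> parallel dedupe -> combine (illustration only).
from functools import partial
from multiprocessing import Pool, cpu_count

def parallel_dedupe_sketch(data, threshold=85, n_jobs=None):
    n_jobs = n_jobs or cpu_count()
    chunk_size = max(1, len(data) // n_jobs)
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # 1. Each chunk is deduplicated by a separate worker process.
    with Pool(n_jobs) as pool:
        results = pool.map(partial(dedupe_sketch, threshold=threshold), chunks)

    # 2. Combine per-chunk survivors and duplicate mappings.
    survivors, dupes = [], {}
    for chunk_clean, chunk_dupes in results:
        survivors.extend(chunk_clean)
        dupes.update(chunk_dupes)

    # 3. A final pass over the survivors catches duplicates that were
    #    split across different chunks.
    clean, cross_dupes = dedupe_sketch(survivors, threshold=threshold)
    for kept, merged_away in cross_dupes.items():
        for item in merged_away:
            dupes.setdefault(kept, []).append(item)
            dupes[kept].extend(dupes.pop(item, []))  # re-attach its own duplicates
    return clean, dupes
```

On platforms that spawn rather than fork worker processes (e.g., Windows), code like this needs to live in an importable module and run under an if __name__ == "__main__": guard.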
You can fine-tune the parallel processing behavior with the n_jobs parameter:
from fastdedupe import dedupe
# Automatic (uses all available cores)
clean_data, duplicates = dedupe(data, threshold=85, n_jobs=None)
# Specify exact number of cores to use
clean_data, duplicates = dedupe(data, threshold=85, n_jobs=4)
# Disable parallel processing
clean_data, duplicates = dedupe(data, threshold=85, n_jobs=1)

For optimal performance:
- Use n_jobs=None (default) to let fast-dedupe automatically determine the best configuration
- For very large datasets (100,000+ items), consider using a specific number of cores (e.g., n_jobs=4) to avoid excessive memory usage
- For small datasets (<1,000 items), parallel processing adds overhead and may be slower than single-core processing
We've benchmarked fast-dedupe against other popular libraries for string deduplication:
- pandas: Using TF-IDF vectorization and cosine similarity (a rough version of this baseline is sketched after this list)
- fuzzywuzzy: A popular fuzzy string matching library
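For context, a TF-IDF + cosine-similarity baseline typically looks something like the sketch below. This is an assumed illustration of that approach, not necessarily the exact code in benchmarks/benchmark.py; it uses scikit-learn, which is listed in the benchmark dependencies:

```python
# Rough TF-IDF + cosine-similarity dedup baseline (assumed illustration,
# not necessarily what benchmarks/benchmark.py does).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_dedupe_baseline(data, threshold=0.85):    # note: 0-1 scale, not 0-100
    # Character n-grams tolerate typos better than whole-word tokens.
    vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(data)
    sims = cosine_similarity(vectors)                # dense n x n matrix: the costly part

    clean, duplicates, claimed = [], {}, set()
    for i, item in enumerate(data):
        if i in claimed:
            continue
        clean.append(item)
        for j in range(i + 1, len(data)):
            if j not in claimed and sims[i, j] >= threshold:
                duplicates.setdefault(item, []).append(data[j])
                claimed.add(j)
    return clean, duplicates
```

Materializing the dense n x n similarity matrix is memory- and time-intensive for large inputs, which is consistent with the slowdowns in the table below.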
The benchmarks were run on various dataset sizes and similarity thresholds. Here's a summary of the results:
| Dataset Size | fast-dedupe (s) | pandas (s) | fuzzywuzzy (s) | Speedup vs pandas | Speedup vs fuzzywuzzy |
|---|---|---|---|---|---|
| 1,000 | 0.0521 | 0.3245 | 0.4872 | 6.23x | 9.35x |
| 5,000 | 0.2873 | 2.8541 | 3.9872 | 9.93x | 13.88x |
| 10,000 | 0.6124 | 7.9872 | 11.2451 | 13.04x | 18.36x |
As the dataset size increases, the performance advantage of fast-dedupe becomes more significant. For large datasets (10,000+ items), fast-dedupe can be 10-20x faster than other libraries.
You can run your own benchmarks to compare performance on your specific data:
# Install dependencies
pip install pandas scikit-learn fuzzywuzzy matplotlib tqdm
# Run benchmarks
python benchmarks/benchmark.py --sizes 100 500 1000 5000 --thresholds 70 80 90

The benchmark script will generate detailed reports and visualizations in the benchmark_results directory.
This project is licensed under the MIT License - see the LICENSE file for details.