FastDedup (FDedup) is a fast and memory-efficient FASTX PCR deduplication tool written in Rust. It utilizes needletail for high-performance sequence parsing, xxh3 for rapid hashing, and fxhash for a low-overhead memory cache.
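The core idea — deduplicating by comparing compact sequence hashes instead of full sequences — can be sketched in a few lines of Rust. This illustration uses the standard library's `DefaultHasher` and `HashSet` in place of the xxh3/fxhash combination the tool actually uses; it is a simplified sketch, not FDedup's implementation:

```rust
use std::collections::HashSet;
use std::hash::{DefaultHasher, Hash, Hasher};

/// Keep only the first occurrence of each sequence, comparing 64-bit
/// hashes instead of full strings to keep memory usage low.
fn dedup(seqs: &[&str]) -> Vec<String> {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut kept = Vec::new();
    for seq in seqs {
        let mut h = DefaultHasher::new();
        seq.hash(&mut h);
        // insert() returns false when the hash was already present
        if seen.insert(h.finish()) {
            kept.push(seq.to_string());
        }
    }
    kept
}

fn main() {
    let reads = ["ACGT", "TTGA", "ACGT", "GGGC"];
    let unique = dedup(&reads);
    assert_eq!(unique, vec!["ACGT", "TTGA", "GGGC"]);
    println!("{} unique reads", unique.len());
}
```

Storing only the 64-bit hash of each read, rather than the read itself, is what keeps the memory footprint roughly constant per sequence regardless of read length.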
A paper is in preparation; you can check it here.

- Fast & Memory Efficient: Uses zero-allocation sequence parsing and a non-cryptographic high-speed hashing cache, which automatically scales based on the estimated input file size.
- Supports Compressed Formats: Transparently reads both uncompressed and GZIP-compressed (`.gz`) FASTQ/FASTA files, and writes to both uncompressed and GZIP-compressed formats.
- Incremental Deduplication & Auto-Recovery: By default, FDedup appends new sequences to an existing uncompressed output file, safely pre-loading its existing hashes to prevent duplicates. If an uncompressed output file is corrupted due to a previous crash, FDedup automatically truncates it to the last valid sequence and resumes safely.
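The auto-recovery step amounts to finding the end of the last complete record and truncating everything after it. Below is a minimal sketch of that idea for 4-line FASTQ records; the record-validation logic here is deliberately simplified and hypothetical, not FDedup's actual implementation:

```rust
use std::fs::OpenOptions;
use std::io::Read;

/// Truncate the file at `path` to the end of its last complete 4-line
/// FASTQ record, discarding any partial record left behind by a crash.
/// Returns the new file length in bytes.
fn truncate_to_last_record(path: &str) -> std::io::Result<u64> {
    let mut f = OpenOptions::new().read(true).write(true).open(path)?;
    let mut buf = String::new();
    f.read_to_string(&mut buf)?;

    let mut valid_end = 0u64; // byte offset just past the last complete record
    let mut lines_in_record = 0;
    for (i, b) in buf.bytes().enumerate() {
        if b == b'\n' {
            lines_in_record += 1;
            if lines_in_record == 4 {
                valid_end = (i + 1) as u64;
                lines_in_record = 0;
            }
        }
    }
    f.set_len(valid_end)?; // drop the trailing partial record, if any
    Ok(valid_end)
}

fn main() -> std::io::Result<()> {
    // One complete record followed by a record truncated mid-sequence.
    let path = "demo.fastq";
    std::fs::write(path, "@r1\nACGT\n+\nIIII\n@r2\nAC")?;
    let end = truncate_to_last_record(path)?;
    assert_eq!(end, 16); // only the first, complete record survives
    std::fs::remove_file(path)?;
    Ok(())
}
```

This kind of recovery is only possible on uncompressed output, which is why the auto-recovery feature does not apply to `.gz` files.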
You can download the latest pre-compiled binaries from the releases page.
The recommended way to install FastDedup is with pixi through bioconda:
```bash
pixi add bioconda::fdedup
```

You can install FastDedup directly from Cargo:

```bash
cargo install fastdedup
```

Basic usage:

```bash
fdedup [OPTIONS] --input <INPUT>
```

- `-1, --input <INPUT>`: Path to the input FASTA/FASTQ/GZ file (R1 or Single-End).
- `-2, --input-r2 <INPUT_R2>`: Path to the input R2 file (optional; enables Paired-End mode).
- `-o, --output <OUTPUT>`: Path to the output file (R1 or Single-End). Defaults to `output_R1.fastq.gz`.
- `-p, --output-r2 <OUTPUT_R2>`: Path to the output R2 file (required if `-2` is provided).
- `-f, --force`: Overwrite the output file if it exists (instead of pre-loading hashes and appending).
- `-v, --verbose`: Print processing stats, such as execution time, number of sequences, and duplication rates.
- `-s, --dry-run`: Calculate the duplication rate without creating an output file.
- `-t, --threshold <THRESHOLD>`: Threshold for automatic hash size selection¹ (default: 0.001).
- `-H, --hash <HASH>`: Manually specify the hash size (64 or 128 bits).
- `-c, --compression <LEVEL>`: GZIP compression level, 1–9 (default: 6).
- `-P, --read-length <LENGTH>`: Expected read length in base pairs, used to tune I/O buffers (default: 150).
1: The maximum acceptable probability of a hash collision; FDedup selects a 64-bit or 128-bit hash accordingly.
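As an illustration of how such a threshold could drive hash-size selection, the sketch below applies the birthday bound: with n reads and a b-bit hash, the expected collision probability is roughly n²/2^(b+1). The formula and selection rule here are illustrative assumptions, not necessarily FDedup's exact logic:

```rust
/// Approximate birthday-bound collision probability for `n` items
/// hashed into `bits`-bit values: p ≈ n^2 / 2^(bits + 1).
fn collision_prob(n: f64, bits: u32) -> f64 {
    n * n / 2f64.powi(bits as i32 + 1)
}

/// Pick the smallest hash size whose estimated collision
/// probability stays below `threshold`.
fn pick_hash_bits(n_reads: f64, threshold: f64) -> u32 {
    if collision_prob(n_reads, 64) <= threshold { 64 } else { 128 }
}

fn main() {
    // ~100 million reads: the 64-bit estimate is still below 0.001.
    assert_eq!(pick_hash_bits(1e8, 0.001), 64);
    // ~10 billion reads: the 64-bit estimate exceeds the threshold.
    assert_eq!(pick_hash_bits(1e10, 0.001), 128);
}
```

Under this model, a lower threshold makes the tool switch to 128-bit hashes at smaller input sizes, trading memory for a smaller collision risk.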
You can run it directly from Cargo:

```bash
cargo run --release -- --input <INPUT> [OPTIONS]
```

You can also rely on Pixi to run:

```bash
pixi run cargo build --release
pixi run fdedup --input <INPUT> [OPTIONS]
```

You can download the latest release and run the containerized version of FDedup:
Using Apptainer:

```bash
apptainer run fdedup.sif fdedup --input <INPUT> [OPTIONS]
```

Using Singularity:

```bash
singularity run fdedup.sif fdedup --input <INPUT> [OPTIONS]
```

Note: `--force` is very slow when used in a Singularity container. If you want to start from scratch, we recommend simply deleting the output file before running the container.
After downloading the latest release, you can run the binary directly:

```bash
./fdedup --input <INPUT> [OPTIONS]
```

If you want to build it from source, you need to have the following dependencies installed:
- Rust (>= 1.85)
- Pixi and pixitainer to build the container
You can build and run FastDedup directly with Cargo:

```bash
cargo build --release
cargo run --release -- --input <INPUT> [OPTIONS]
```

You can also build and run FastDedup with pixitainer:

```bash
pixi containerize
apptainer run fdedup.sif fdedup --input <INPUT> [OPTIONS]
```

If you are using FDedup in a pre-processing step, we recommend not exporting your file to a `.gz` format.
Incremental/resumable deduplication only works with uncompressed output files. If you output to a compressed format, FDedup requires `--force` to restart from scratch on any subsequent run. If you output to an uncompressed format, however, FDedup will automatically detect any crash-induced corruption, safely truncate the file to the last valid sequence, and seamlessly resume deduplication.
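Put together, resuming amounts to re-hashing the records already present in the output and then appending only reads whose hash is new. A simplified sketch of that flow, again using the standard library's hasher rather than xxh3 and plain strings rather than real FASTQ parsing:

```rust
use std::collections::HashSet;
use std::hash::{DefaultHasher, Hash, Hasher};

/// Hash a sequence to a 64-bit value (std hasher stands in for xxh3 here).
fn hash64(seq: &str) -> u64 {
    let mut h = DefaultHasher::new();
    seq.hash(&mut h);
    h.finish()
}

/// Pre-load hashes of sequences already written to the output, then
/// return only the incoming sequences that are not yet present.
fn resume_dedup(existing: &[&str], incoming: &[&str]) -> Vec<String> {
    let mut seen: HashSet<u64> = existing.iter().map(|s| hash64(s)).collect();
    incoming
        .iter()
        .filter(|s| seen.insert(hash64(s))) // insert() is false for duplicates
        .map(|s| s.to_string())
        .collect()
}

fn main() {
    let already_written = ["ACGT", "TTGA"];
    let new_reads = ["ACGT", "GGGC", "TTGA", "GGGC"];
    let appended = resume_dedup(&already_written, &new_reads);
    assert_eq!(appended, vec!["GGGC"]);
    println!("appending {} new reads", appended.len());
}
```

Because only hashes are pre-loaded, resuming costs one read pass over the existing output rather than keeping the old sequences themselves in memory.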
- Support for Paired-End read deduplication.
- Add multithreading to parallelize sequence hashing and processing.
- Support tracking sequence abundances (counts) instead of naively discarding duplicates.
- Add an option to export sequences as FASTA.
- Improve error handling.
This project is licensed under the MIT License. See the LICENSE file for details.
Raphaël Ribes (coding and design)
Céline Mandier (design)
Computations were performed on the ISDM-MESO HPC platform, funded in the framework of State-region planning contracts (Contrat de plan État-région – CPER) by the French Government, the Occitanie/Pyrénées-Méditerranée Region, Montpellier Méditerranée Métropole, and the University of Montpellier.