SnakeEUK

Database formatting pipeline for eukaryotic taxonomy FASTA files

A pipeline for processing taxonomic FASTA files and converting them into formats suitable for multiple downstream tools, including General taxonomy, DADA2, Mothur, QIIME2, and SINTAX.

A pipeline for processing taxonomic FASTA files and converting them into formats suitable for multiple downstream tool,s including General taxonomy, DADA2, Mothur, QIIME2, and SINTAX.

Overview

This pipeline is designed to process and clean FASTA files containing taxonomic information. It performs robust encoding conversion (from latin‑1 to ASCII) on the fly, filters and cleans taxonomy headers, and generates outputs tailored for various bioinformatics tools. The workflow is using Snakemake workflow to ensure reproducibility and parallel processing. Each tool-specific script is modular and can be adjusted to meet specific requirements.

Repository structure

.
├── Snakefile                 # Snakemake workflow file
├── config.yaml               # Pipeline configuration (versioning, etc.)
├── scripts
│   ├── utils.py              # Utility module for robust file handling (encoding conversion)
│   ├── general.py            # Generates General taxonomy formatted FASTA files
│   ├── dada2.py              # Converts FASTA headers for DADA2 (removes accession numbers)
│   ├── mothur.py             # Generates Mothur-compatible FASTA and TAX files
│   ├── qiime.py              # Generates QIIME2-compatible FASTA and TSV files with cleaned taxonomy fields
│   └── sintax.py             # Converts FASTA files to SINTAX format
└── README.md                 # This README file

Installation & dependencies

Requirements

Python 3.6+
Biopython
Install via: pip install biopython
Snakemake
Install via: pip install snakemake
Seqkit
Install via Conda: conda install -c bioconda seqkit or follow instructions on the Seqkit website

Installation

Clone the repository and install Python dependencies:

git clone https://github.com/alihkz94/metagenomics_metabarcoding.git
cd metagenomics_metabarcoding/Snake_EUK
pip install biopython snakemake

Ensure that seqkit is installed and available in your system PATH.

Usage

Input file format and requirements

Expected input structure

This pipeline is designed to process taxonomic FASTA files with a specific header format. Each sequence header must contain:

Accession ID: A unique sequence identifier (e.g., database accession number)
Taxonomic hierarchy: Semicolon-separated taxonomic levels from Kingdom to Species
Optional metadata: Additional information after the main taxonomy

Example header format:

>EUK1157284;Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;Chromulinaceae;Spumella;.;UPD1.8i
>KF849527;Fungi;Glomeromycota;Glomeromycetes;Diversisporales;Acaulosporaceae;Acaulospora;.;Põlme

Taxonomic hierarchy structure

The pipeline expects up to 7 main taxonomic levels in the following order:

Kingdom (e.g., Straminipila, Fungi)
Phylum (e.g., Chrysophyta, Glomeromycota)
Class (e.g., Chrysophyceae, Glomeromycetes)
Order (e.g., Chromulinales, Diversisporales)
Family (e.g., Chromulinaceae, Acaulosporaceae)
Genus (e.g., Spumella, Acaulospora)
Species (e.g., species epithet or identifier)

Placeholder Handling

Dot placeholders (.): Used to indicate unclassified or missing taxonomic levels
Empty fields: Gaps in taxonomic assignment
"Unused" designations: Alternative placeholder text

The pipeline intelligently handles these placeholders by:

Removing trailing dots and empty fields
Converting placeholders to "unclassified" labels where appropriate
Truncating taxonomy at the first meaningful placeholder (DADA2 format)

Character encoding and cleaning

The pipeline includes robust encoding conversion capabilities:

Input encoding: Reads files using latin-1 encoding to handle special characters
Output encoding: Converts all content to ASCII, replacing problematic characters
Quote removal: Strips quotation marks from headers
Special character handling: Replaces non-ASCII characters with ASCII equivalents

File naming convention

Input files should be named according to the target region or dataset:

ITS.fasta - Internal Transcribed Spacer sequences
LSU.fasta - Large Subunit rRNA sequences
SSU.fasta - Small Subunit rRNA sequences
longread.fasta - Long-read sequencing datasets

Input files

Place your input FASTA files (e.g., ITS.fasta, LSU.fasta, SSU.fasta, longread.fasta) in the repository root (or designated input folder).

How to prepare your database

If you have a custom taxonomic database, format it as follows:

Header structure: >AccessionID;Kingdom;Phylum;Class;Order;Family;Genus;Species
Separator: Use semicolons (;) between taxonomic levels
Missing data: Use dots (.) or leave empty for unclassified levels
Encoding: Save file with UTF-8 or latin-1 encoding (pipeline handles conversion)
Sequence format: Standard FASTA format with sequences on separate lines

Example of properly formatted input:

>ABC123;Fungi;Ascomycota;Eurotiomycetes;Eurotiales;Aspergillaceae;Aspergillus;niger
ATCGATCGATCG...
>DEF456;Plantae;Streptophyta;.;.;.;.;.
GCTAGCTAGCTA...

Understanding your database and construction guidelines

Assessing your current database

Before using this pipeline, examine your existing database to understand its structure:

Check taxonomic depth: Count the number of semicolon-separated fields

head -5 your_database.fasta | grep ">" | sed 's/;/\n/g' | wc -l

Identify placeholder patterns: Look for common placeholder formats

grep -o ";[^;]*;" your_database.fasta | sort | uniq -c | head -20

Check for encoding issues: Look for non-ASCII characters

file your_database.fasta  # Should show ASCII or UTF-8

Database construction best practices

For Maximum Compatibility:

Standard Format: Use the format >AccessionID;Kingdom;Phylum;Class;Order;Family;Genus;Species
Consistent Separators: Always use semicolons (;) between taxonomic levels
Handle Missing Data: Use dots (.) for unknown/unclassified levels
Avoid Special Characters: Stick to alphanumeric characters and common punctuation
Quality Control: Ensure sequences are properly formatted and contain valid nucleotide codes

Example database construction:

# Well-formatted entries
>AB123456;Fungi;Ascomycota;Eurotiomycetes;Eurotiales;Aspergillaceae;Aspergillus;niger
ATCGATCGATCGATCG...

>CD789012;Plantae;Streptophyta;Magnoliopsida;Rosales;Rosaceae;Rosa;.
GCTAGCTAGCTAGCTA...

# Entries with missing higher-level taxonomy (acceptable)
>EF345678;Bacteria;.;.;.;.;Escherichia;coli
TACGTACGTACGTACG...

# Avoid these problematic formats:
# >BadExample1,Fungi,Ascomycota  # Wrong separator
# >BadExample2;Fungi;Ascomycota;  # Trailing separator
# >BadExample3 Fungi Ascomycota   # No separators

Taxonomic considerations

Taxonomic Rank Expectations:

The pipeline is optimized for standard Linnaean hierarchy
Supports 7 main taxonomic levels: Kingdom → Phylum → Class → Order → Family → Genus → Species
Can handle fewer levels but will pad with "unclassified" in General format
Additional levels beyond 7 will be preserved in some formats, truncated in others

Kingdom-Level Guidelines:

Use standard kingdom names (e.g., Fungi, Plantae, Bacteria, Protista)
For eukaryotic microorganisms, use appropriate supergroup names (e.g., Straminipila, Alveolata)
Maintain consistency across your database

Species-Level Considerations:

Species names can be binomial (Genus species) or just species epithet
Use underscores instead of spaces in species names if needed
Placeholder dots (.) are acceptable for unidentified species

Quality Assurance Steps

Validate Headers: Ensure all headers follow the expected format
Check taxonomy completeness: Verify taxonomic assignments are reasonable
Test pipeline: Run on a small subset first to verify output format
Sequence quality: Ensure sequences contain only valid nucleotide codes (A, T, G, C, N)
Duplicate detection: Check for and remove duplicate sequences if needed

By following these guidelines, you can construct a high-quality database that will work seamlessly with this pipeline and produce reliable outputs for downstream analysis tools.

Configuration

Edit config.yaml to set the desired version string, for example:

version: "1.9.4"

Running the pipeline

Run the entire pipeline with:

snakemake --cores <number_of_cores> --rerun-incomplete --keep-going

To run only for instance the DADA2 conversion over your original FASTA files, execute:

snakemake dada2/DADA2_EUK_ITS_v1.9.4.fasta dada2/DADA2_EUK_LSU_v1.9.4.fasta dada2/DADA2_EUK_SSU_v1.9.4.fasta dada2/DADA2_EUK_longread_v1.9.4.fasta --rerun-incomplete --keep-going --cores 8

Tip: If you suspect output files are outdated or incorrect, you can force re-run of jobs using --forceall or --forcerun <target>.

Output files

General: general/General_EUK_{base}_v{version}.fasta
DADA2: dada2/DADA2_EUK_{base}_v{version}.fasta
Note: The DADA2 script removes the accession number and truncates the header at the first placeholder.
Mothur: mothur/mothur_EUK_{base}_v{version}.fasta and mothur/mothur_EUK_{base}_v{version}.tax
QIIME2: qiime2/QIIME2_EUK_{base}_v{version}.fasta and qiime2/QIIME2_EUK_{base}_v{version}.tsv
Note: Taxonomy in the TSV files is cleaned to remove trailing dot placeholders.
SINTAX: sintax/SINTAX_EUK_{base}_v{version}.fasta

Pipeline processing details

Input transformation by tool

The pipeline processes the same input file differently for each downstream tool, optimizing the format for specific requirements:

1. General format (`general/`)

Purpose: Standardized format with complete taxonomic prefixes
Processing:
- Adds rank prefixes: k__, p__, c__, o__, f__, g__, s__
- Converts placeholders to "unclassified" labels
- Ensures all 7 taxonomic ranks are present

Example transformation:

Input:  >EUK1157284;Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;Chromulinaceae;Spumella;.
Output: >EUK1157284;k__Straminipila;p__Chrysophyta;c__Chrysophyceae;o__Chromulinales;f__Chromulinaceae;g__Spumella;s__unclassified

2. DADA2 format (`dada2/`)

Purpose: Simplified taxonomy without accession numbers
Processing:
- Removes accession ID (first field)
- Truncates at first placeholder - stops processing when encountering . or empty field
- Maintains only valid taxonomic levels

Example transformation:

Input:  >EUK1157284;Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;Chromulinaceae;Spumella;.;UPD1.8i
Output: >Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;Chromulinaceae;Spumella;

3. Mothur format (`mothur/`)

Purpose: Separate FASTA and taxonomy files
Processing:
- FASTA file: Contains only accession IDs as headers
- TAX file: Tab-separated file with accession ID and cleaned taxonomy
- Removes trailing dot placeholders

Example transformation:

FASTA: >EUK1157284
TAX:   EUK1157284	Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;Chromulinaceae;Spumella

4. QIIME2 format (`qiime2/`)

Purpose: QIIME2-compatible FASTA and TSV files
Processing:
- FASTA file: Contains only accession IDs as headers
- TSV file: Two-column format with "Feature ID" and "Taxon" headers
- Removes trailing dot placeholders and empty fields

Example transformation:

FASTA: >EUK1157284
TSV:   Feature ID	Taxon
       EUK1157284	Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;Chromulinaceae;Spumella

5. SINTAX format (`sintax/`)

Purpose: SINTAX/USEARCH compatible format with rank prefixes
Processing:
- Uses General format as input (requires rank prefixes)
- Converts rank prefixes: k__ → d:, p__ → p:, c__ → c:, etc.
- Excludes "unclassified" entries
- Formats as tax=d:Kingdom,p:Phylum,c:Class...

Example transformation:

Input:  >EUK1157284;k__Straminipila;p__Chrysophyta;c__Chrysophyceae;o__Chromulinales;f__Chromulinaceae;g__Spumella;s__unclassified
Output: >EUK1157284;tax=d:Straminipila,p:Chrysophyta,c:Chrysophyceae,o:Chromulinales,f:Chromulinaceae,g:Spumella;

Data Quality improvements

The pipeline automatically handles several data quality issues:

Encoding problems: Converts non-ASCII characters to ASCII equivalents
Quote removal: Strips quotation marks that may interfere with parsing
Whitespace normalization: Trims excess whitespace from taxonomic fields
Sequence formatting:
- Converts sequences to uppercase (via seqkit)
- Removes line wrapping for consistent formatting
Placeholder standardization: Handles various placeholder formats (., empty, "unused")

Taxonomic completeness

The pipeline intelligently handles incomplete taxonomic assignments:

Missing higher ranks: Fills with "unclassified" labels in General format
Partial taxonomy: Preserves available levels, handles missing ones appropriately
Inconsistent depth: Normalizes taxonomy depth across different sequences

This ensures compatibility with downstream analysis tools that expect consistent taxonomic formatting.

Pipeline details

Robust Encoding Conversion:
All scripts utilize a custom file-like wrapper (implemented in utils.py) that reads files using latin-1 decoding and converts them on the fly to ASCII. This ensures that all non-ASCII characters are handled gracefully.
Taxonomy header cleaning:
The scripts are designed to remove unwanted placeholders (.) and extra taxonomic levels.
- For DADA2, the script removes the accession number and retains taxonomy fields only until the first placeholder.
- For QIIME2 and Mothur, the scripts clean the TSV taxonomy output by stripping trailing dot placeholders.
Reproducible workflow:
The entire process is managed by Snakemake, ensuring that jobs run in the correct order with proper dependencies, even when running in parallel.

Customization & troubleshooting

Modifying header formatting:
The header processing logic is contained within each script (e.g., dada2.py, qiime.py). You can modify the functions transform_header or clean_taxonomy as needed.
Resource management:
If you encounter memory or process issues (e.g., SIGKILLs), try reducing the number of cores with --cores 2 or increasing system resources.
Debugging:
Use Snakemake's verbose and print shell command options (--printshellcmds) for detailed execution logs.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements

This pipeline was developed to streamline taxonomic FASTA file processing for downstream bioinformatics applications. Contributions, suggestions, and bug reports are welcome.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
scripts		scripts
LICENSE		LICENSE
config.yaml		config.yaml
example_input.fasta		example_input.fasta
readme.md		readme.md
snakeeuk.png		snakeeuk.png
snakefile		snakefile

Folders and files

Latest commit

History

Repository files navigation

SnakeEUK

Overview

Repository structure

Installation & dependencies

Requirements

Installation

Usage

Input file format and requirements

Expected input structure

Taxonomic hierarchy structure

Placeholder Handling

Character encoding and cleaning

File naming convention

Input files

How to prepare your database

Understanding your database and construction guidelines

Assessing your current database

Database construction best practices

Taxonomic considerations

Quality Assurance Steps

Configuration

Running the pipeline

Output files

Pipeline processing details

Input transformation by tool

1. General format (general/)

2. DADA2 format (dada2/)

3. Mothur format (mothur/)

4. QIIME2 format (qiime2/)

5. SINTAX format (sintax/)

Data Quality improvements

Taxonomic completeness

Pipeline details

Customization & troubleshooting

License

Acknowledgements

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. General format (`general/`)

2. DADA2 format (`dada2/`)

3. Mothur format (`mothur/`)

4. QIIME2 format (`qiime2/`)

5. SINTAX format (`sintax/`)

Packages