alihkz94/SnakeEUK

SnakeEUK

Database formatting pipeline for eukaryotic taxonomy FASTA files

A pipeline for processing taxonomic FASTA files and converting them into formats suitable for multiple downstream tools, including General taxonomy, DADA2, Mothur, QIIME2, and SINTAX.

Overview

This pipeline processes and cleans FASTA files containing taxonomic information. It converts encoding on the fly (from latin-1 to ASCII), filters and cleans taxonomy headers, and generates outputs tailored to various bioinformatics tools. The workflow uses Snakemake to ensure reproducibility and parallel processing. Each tool-specific script is modular and can be adjusted to meet specific requirements.

Repository structure

.
├── Snakefile                 # Snakemake workflow file
├── config.yaml               # Pipeline configuration (versioning, etc.)
├── scripts
│   ├── utils.py              # Utility module for robust file handling (encoding conversion)
│   ├── general.py            # Generates General taxonomy formatted FASTA files
│   ├── dada2.py              # Converts FASTA headers for DADA2 (removes accession numbers)
│   ├── mothur.py             # Generates Mothur-compatible FASTA and TAX files
│   ├── qiime.py              # Generates QIIME2-compatible FASTA and TSV files with cleaned taxonomy fields
│   └── sintax.py             # Converts FASTA files to SINTAX format
└── README.md                 # This README file

Installation & dependencies

Requirements

  • Python 3.6+
  • Biopython
    Install via: pip install biopython
  • Snakemake
    Install via: pip install snakemake
  • Seqkit
    Install via Conda: conda install -c bioconda seqkit or follow instructions on the Seqkit website

Installation

Clone the repository and install Python dependencies:

git clone https://github.com/alihkz94/metagenomics_metabarcoding.git
cd metagenomics_metabarcoding/Snake_EUK
pip install biopython snakemake

Ensure that seqkit is installed and available in your system PATH.

Usage

Input file format and requirements

Expected input structure

This pipeline processes taxonomic FASTA files with a specific header format. Each sequence header contains the following fields, in this order (the first two are required):

  1. Accession ID: A unique sequence identifier (e.g., database accession number)
  2. Taxonomic hierarchy: Semicolon-separated taxonomic levels from Kingdom to Species
  3. Optional metadata: Additional information after the main taxonomy

Example header format:

>EUK1157284;Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;Chromulinaceae;Spumella;.;UPD1.8i
>KF849527;Fungi;Glomeromycota;Glomeromycetes;Diversisporales;Acaulosporaceae;Acaulospora;.;Põlme

Taxonomic hierarchy structure

The pipeline expects up to 7 main taxonomic levels in the following order:

  1. Kingdom (e.g., Straminipila, Fungi)
  2. Phylum (e.g., Chrysophyta, Glomeromycota)
  3. Class (e.g., Chrysophyceae, Glomeromycetes)
  4. Order (e.g., Chromulinales, Diversisporales)
  5. Family (e.g., Chromulinaceae, Acaulosporaceae)
  6. Genus (e.g., Spumella, Acaulospora)
  7. Species (e.g., species epithet or identifier)
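
The header layout described above can be split into its parts with a few lines of Python. This is a minimal sketch rather than the pipeline's own parser: `parse_header` and `RANKS` are hypothetical names, and treating every field after the seventh rank as metadata is an assumption based on the example headers.

```python
# Hypothetical helper; the pipeline's scripts may parse headers differently.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

def parse_header(header):
    """Split a FASTA header into (accession, {rank: name}, metadata list)."""
    fields = [field.strip() for field in header.strip().lstrip(">").split(";")]
    accession, rest = fields[0], fields[1:]
    taxonomy = dict(zip(RANKS, rest))   # zip stops after the seven ranks
    metadata = rest[len(RANKS):]        # assumption: extra fields are metadata
    return accession, taxonomy, metadata

accession, taxonomy, metadata = parse_header(
    ">EUK1157284;Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;"
    "Chromulinaceae;Spumella;.;UPD1.8i"
)
print(accession, taxonomy["genus"], taxonomy["species"], metadata)
```

Headers with fewer than seven ranks simply produce a shorter dictionary, so callers should use `taxonomy.get(...)` when a rank may be absent.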

Placeholder handling

  • Dot placeholders (.): Used to indicate unclassified or missing taxonomic levels
  • Empty fields: Gaps in taxonomic assignment
  • "Unused" designations: Alternative placeholder text

The pipeline handles these placeholders by:

  • Removing trailing dots and empty fields
  • Converting placeholders to "unclassified" labels where appropriate
  • Truncating taxonomy at the first meaningful placeholder (DADA2 format)
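
The difference between removing trailing placeholders and truncating at the first placeholder is easiest to see side by side. The sketch below is illustrative only: the function names and the `PLACEHOLDERS` set are assumptions built from the placeholder list above, not code taken from the repository.

```python
# Assumed placeholder spellings, taken from the list above.
PLACEHOLDERS = {"", ".", "unused"}

def is_placeholder(field):
    """True for a dot, an empty field, or an 'unused' designation."""
    return field.strip().lower() in PLACEHOLDERS

def strip_trailing_placeholders(ranks):
    """Drop placeholders from the end only; gaps in the middle are kept."""
    trimmed = list(ranks)
    while trimmed and is_placeholder(trimmed[-1]):
        trimmed.pop()
    return trimmed

def truncate_at_first_placeholder(ranks):
    """Keep ranks up to, but not including, the first placeholder."""
    kept = []
    for rank in ranks:
        if is_placeholder(rank):
            break
        kept.append(rank)
    return kept

ranks = ["Bacteria", ".", ".", ".", ".", "Escherichia", "coli"]
print(strip_trailing_placeholders(ranks))    # nothing trailing to remove
print(truncate_at_first_placeholder(ranks))  # stops right after the kingdom
```

For a taxonomy whose only placeholders are at the end, both functions return the same list; they diverge only when a gap sits between named ranks.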

Character encoding and cleaning

The pipeline includes robust encoding conversion capabilities:

  • Input encoding: Reads files using latin-1 encoding to handle special characters
  • Output encoding: Converts all content to ASCII, replacing problematic characters
  • Quote removal: Strips quotation marks from headers
  • Special character handling: Replaces non-ASCII characters with ASCII equivalents
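
One way to carry out the conversion described above is sketched here. It is not the wrapper from utils.py: `to_ascii` and `clean_header_bytes` are hypothetical names, and approximating accented letters through Unicode decomposition (with `?` as a fallback) is an assumption about what "ASCII equivalents" means.

```python
import unicodedata

def to_ascii(text):
    """Approximate non-ASCII characters with ASCII; use '?' when none exists."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.encode("ascii", errors="replace").decode("ascii")

def clean_header_bytes(raw):
    """Decode latin-1 bytes, drop double quotes, and return an ASCII header."""
    text = raw.decode("latin-1").replace('"', "")
    return to_ascii(text)

# 0xF5 is the latin-1 byte for the letter o with a tilde.
print(clean_header_bytes(b'>KF849527;Fungi;"Glomeromycota";P\xf5lme'))
```

Decoding as latin-1 never raises an error because every byte value maps to a character, which is why it is a safe first step for files of uncertain origin.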

File naming convention

Input files should be named according to the target region or dataset:

  • ITS.fasta - Internal Transcribed Spacer sequences
  • LSU.fasta - Large Subunit rRNA sequences
  • SSU.fasta - Small Subunit rRNA sequences
  • longread.fasta - Long-read sequencing datasets

Input files

Place your input FASTA files (e.g., ITS.fasta, LSU.fasta, SSU.fasta, longread.fasta) in the repository root (or designated input folder).

How to prepare your database

If you have a custom taxonomic database, format it as follows:

  1. Header structure: >AccessionID;Kingdom;Phylum;Class;Order;Family;Genus;Species
  2. Separator: Use semicolons (;) between taxonomic levels
  3. Missing data: Use dots (.) or leave empty for unclassified levels
  4. Encoding: Save file with UTF-8 or latin-1 encoding (pipeline handles conversion)
  5. Sequence format: Standard FASTA format with sequences on separate lines

Example of properly formatted input:

>ABC123;Fungi;Ascomycota;Eurotiomycetes;Eurotiales;Aspergillaceae;Aspergillus;niger
ATCGATCGATCG...
>DEF456;Plantae;Streptophyta;.;.;.;.;.
GCTAGCTAGCTA...

Understanding your database and construction guidelines

Assessing your current database

Before using this pipeline, examine your existing database to understand its structure:

  1. Check taxonomic depth: Count the number of semicolon-separated fields

    grep ">" your_database.fasta | head -5 | awk -F';' '{print NF}'
  2. Identify placeholder patterns: Look for common placeholder formats

    grep ">" your_database.fasta | tr ';' '\n' | sort | uniq -c | sort -rn | head -20
  3. Check for encoding issues: Look for non-ASCII characters

    file your_database.fasta  # Should show ASCII or UTF-8
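
The same checks can be run from Python when a per-header summary is more convenient than shell output. This is a rough sketch; `survey_headers` is a hypothetical helper, not part of the pipeline.

```python
from collections import Counter

def survey_headers(lines):
    """Return (histogram of fields per header, headers with non-ASCII text)."""
    depth = Counter()
    non_ascii = []
    for line in lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        header = line.rstrip("\n")
        depth[header.count(";") + 1] += 1
        if any(ord(ch) > 127 for ch in header):
            non_ascii.append(header)
    return depth, non_ascii

example = [
    ">ABC123;Fungi;Ascomycota;Eurotiomycetes;Eurotiales;Aspergillaceae;Aspergillus;niger\n",
    "ATCGATCGATCG\n",
    ">KF849527;Fungi;Põlme\n",
]
depth, flagged = survey_headers(example)
print(dict(depth), flagged)
```

To survey a real file, pass an open handle instead of a list, for example `open("your_database.fasta", encoding="latin-1")`. A histogram with more than one key means headers differ in depth.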

Database construction best practices

For Maximum Compatibility:

  1. Standard Format: Use the format >AccessionID;Kingdom;Phylum;Class;Order;Family;Genus;Species
  2. Consistent Separators: Always use semicolons (;) between taxonomic levels
  3. Handle Missing Data: Use dots (.) for unknown/unclassified levels
  4. Avoid Special Characters: Stick to alphanumeric characters and common punctuation
  5. Quality Control: Ensure sequences are properly formatted and contain valid nucleotide codes

Example database construction:

# Well-formatted entries
>AB123456;Fungi;Ascomycota;Eurotiomycetes;Eurotiales;Aspergillaceae;Aspergillus;niger
ATCGATCGATCGATCG...

>CD789012;Plantae;Streptophyta;Magnoliopsida;Rosales;Rosaceae;Rosa;.
GCTAGCTAGCTAGCTA...

# Entries with missing intermediate ranks (acceptable)
>EF345678;Bacteria;.;.;.;.;Escherichia;coli
TACGTACGTACGTACG...

# Avoid these problematic formats:
# >BadExample1,Fungi,Ascomycota  # Wrong separator
# >BadExample2;Fungi;Ascomycota;  # Trailing separator
# >BadExample3 Fungi Ascomycota   # No separators

Taxonomic considerations

Taxonomic Rank Expectations:

  • The pipeline is optimized for standard Linnaean hierarchy
  • Supports 7 main taxonomic levels: Kingdom → Phylum → Class → Order → Family → Genus → Species
  • Can handle fewer levels but will pad with "unclassified" in General format
  • Additional levels beyond 7 will be preserved in some formats, truncated in others

Kingdom-Level Guidelines:

  • Use standard kingdom names (e.g., Fungi, Plantae, Bacteria, Protista)
  • For eukaryotic microorganisms, use appropriate supergroup names (e.g., Straminipila, Alveolata)
  • Maintain consistency across your database

Species-Level Considerations:

  • Species names can be binomial (Genus species) or just species epithet
  • Use underscores instead of spaces in species names if needed
  • Placeholder dots (.) are acceptable for unidentified species

Quality assurance steps

  1. Validate Headers: Ensure all headers follow the expected format
  2. Check taxonomy completeness: Verify taxonomic assignments are reasonable
  3. Test pipeline: Run on a small subset first to verify output format
  4. Sequence quality: Ensure sequences contain only valid nucleotide codes (A, T, G, C, N)
  5. Duplicate detection: Check for and remove duplicate sequences if needed
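
Several of these checks can be automated before the pipeline is run. The following is a sketch under stated assumptions: `validate_records` is a hypothetical helper, it takes records that have already been read into (header, sequence) pairs, and it accepts only the nucleotide codes listed in step 4.

```python
import re

VALID_SEQUENCE = re.compile(r"^[ACGTN]+$", re.IGNORECASE)
EXPECTED_FIELDS = 8  # accession ID plus seven taxonomic ranks

def validate_records(records):
    """Yield (accession, problem) pairs for an iterable of (header, sequence)."""
    seen = set()
    for header, sequence in records:
        fields = header.lstrip(">").split(";")
        accession = fields[0]
        if len(fields) < EXPECTED_FIELDS:
            yield accession, "header has %d fields, expected at least %d" % (
                len(fields), EXPECTED_FIELDS)
        if not VALID_SEQUENCE.match(sequence):
            yield accession, "sequence has characters other than A, T, G, C, N"
        if sequence.upper() in seen:
            yield accession, "duplicate sequence"
        seen.add(sequence.upper())

records = [
    (">ABC123;Fungi;Ascomycota;Eurotiomycetes;Eurotiales;Aspergillaceae;Aspergillus;niger", "ATCGATCG"),
    (">BadExample1,Fungi,Ascomycota", "ATCGATCG"),
    (">DEF456;Plantae;Streptophyta;.;.;.;.;.", "GCTAXCTA"),
]
for accession, problem in validate_records(records):
    print(accession, "->", problem)
```

The pipeline tolerates headers with fewer ranks, so the field-count message is best read as a warning. A header that uses the wrong separator shows up here because it collapses into a single field.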

By following these guidelines, you can construct a high-quality database that will work seamlessly with this pipeline and produce reliable outputs for downstream analysis tools.

Configuration

Edit config.yaml to set the desired version string, for example:

version: "1.9.4"

Running the pipeline

Run the entire pipeline with:

snakemake --cores <number_of_cores> --rerun-incomplete --keep-going

To run only one conversion, for instance DADA2, over your original FASTA files, request its output files as targets:

snakemake dada2/DADA2_EUK_ITS_v1.9.4.fasta dada2/DADA2_EUK_LSU_v1.9.4.fasta dada2/DADA2_EUK_SSU_v1.9.4.fasta dada2/DADA2_EUK_longread_v1.9.4.fasta --rerun-incomplete --keep-going --cores 8

Tip: If you suspect that output files are outdated or incorrect, you can force a re-run of jobs with --forceall or --forcerun <target>.

Output files

  • General: general/General_EUK_{base}_v{version}.fasta
  • DADA2: dada2/DADA2_EUK_{base}_v{version}.fasta
    Note: The DADA2 script removes the accession number and truncates the header at the first placeholder.
  • Mothur: mothur/mothur_EUK_{base}_v{version}.fasta and mothur/mothur_EUK_{base}_v{version}.tax
  • QIIME2: qiime2/QIIME2_EUK_{base}_v{version}.fasta and qiime2/QIIME2_EUK_{base}_v{version}.tsv
    Note: Taxonomy in the TSV files is cleaned to remove trailing dot placeholders.
  • SINTAX: sintax/SINTAX_EUK_{base}_v{version}.fasta
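
When scripting around the pipeline, it can help to generate these paths instead of typing them. The sketch below fills in the patterns listed above; `OUTPUT_PATTERNS` and `expected_outputs` are hypothetical names and are not read from the Snakefile.

```python
# Path patterns copied from the output list above.
OUTPUT_PATTERNS = {
    "general": ["general/General_EUK_{base}_v{version}.fasta"],
    "dada2": ["dada2/DADA2_EUK_{base}_v{version}.fasta"],
    "mothur": [
        "mothur/mothur_EUK_{base}_v{version}.fasta",
        "mothur/mothur_EUK_{base}_v{version}.tax",
    ],
    "qiime2": [
        "qiime2/QIIME2_EUK_{base}_v{version}.fasta",
        "qiime2/QIIME2_EUK_{base}_v{version}.tsv",
    ],
    "sintax": ["sintax/SINTAX_EUK_{base}_v{version}.fasta"],
}

def expected_outputs(base, version, tools=None):
    """List the output paths for one input base name and version string."""
    selected = tools if tools is not None else list(OUTPUT_PATTERNS)
    return [
        pattern.format(base=base, version=version)
        for tool in selected
        for pattern in OUTPUT_PATTERNS[tool]
    ]

for base in ("ITS", "LSU", "SSU", "longread"):
    print(" ".join(expected_outputs(base, "1.9.4", tools=["dada2"])))
```

The printed paths can be passed to snakemake as targets to build a single format.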

Pipeline processing details

Input transformation by tool

The pipeline processes the same input file differently for each downstream tool, optimizing the format for specific requirements:

1. General format (general/)

  • Purpose: Standardized format with complete taxonomic prefixes
  • Processing:
    • Adds rank prefixes: k__, p__, c__, o__, f__, g__, s__
    • Converts placeholders to "unclassified" labels
    • Ensures all 7 taxonomic ranks are present
  • Example transformation:
    Input:  >EUK1157284;Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;Chromulinaceae;Spumella;.
    Output: >EUK1157284;k__Straminipila;p__Chrysophyta;c__Chrysophyceae;o__Chromulinales;f__Chromulinaceae;g__Spumella;s__unclassified
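
    A minimal sketch of this transformation follows. It reproduces the example above but is not the code in general.py; `to_general` is a hypothetical name, and dropping any fields after the seventh rank is an assumption.

```python
PREFIXES = ("k__", "p__", "c__", "o__", "f__", "g__", "s__")
PLACEHOLDERS = {"", ".", "unused"}  # assumed placeholder spellings

def to_general(header):
    """Add rank prefixes and label placeholders or missing ranks."""
    fields = [field.strip() for field in header.lstrip(">").split(";")]
    accession, ranks = fields[0], fields[1:8]
    ranks += [""] * (len(PREFIXES) - len(ranks))  # pad short taxonomies
    labelled = [
        prefix + ("unclassified" if name.lower() in PLACEHOLDERS else name)
        for prefix, name in zip(PREFIXES, ranks)
    ]
    return ">" + ";".join([accession] + labelled)

print(to_general(
    ">EUK1157284;Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;"
    "Chromulinaceae;Spumella;."
))
```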
    

2. DADA2 format (dada2/)

  • Purpose: Simplified taxonomy without accession numbers
  • Processing:
    • Removes accession ID (first field)
    • Truncates at first placeholder - stops processing when encountering . or empty field
    • Maintains only valid taxonomic levels
  • Example transformation:
    Input:  >EUK1157284;Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;Chromulinaceae;Spumella;.;UPD1.8i
    Output: >Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;Chromulinaceae;Spumella;
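
    The same steps can be sketched in Python. This is an illustration that matches the example above, not the contents of dada2.py; `to_dada2` is a hypothetical name, and the trailing semicolon simply follows the example output.

```python
PLACEHOLDERS = {"", "."}  # a dot or an empty field ends the taxonomy

def to_dada2(header):
    """Drop the accession ID and keep ranks up to the first placeholder."""
    fields = [field.strip() for field in header.lstrip(">").split(";")]
    kept = []
    for name in fields[1:]:  # fields[0] is the accession ID
        if name in PLACEHOLDERS:
            break
        kept.append(name)
    return ">" + ";".join(kept) + ";"

print(to_dada2(
    ">EUK1157284;Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;"
    "Chromulinaceae;Spumella;.;UPD1.8i"
))
```

    Because processing stops at the first placeholder, the metadata field after the dot never reaches the output.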
    

3. Mothur format (mothur/)

  • Purpose: Separate FASTA and taxonomy files
  • Processing:
    • FASTA file: Contains only accession IDs as headers
    • TAX file: Tab-separated file with accession ID and cleaned taxonomy
    • Removes trailing dot placeholders
  • Example transformation:
    FASTA: >EUK1157284
    TAX:   EUK1157284	Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;Chromulinaceae;Spumella
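
    Splitting one record into the two outputs might look like the sketch below. It is not taken from mothur.py: `to_mothur` is a hypothetical name, and limiting the taxonomy to the seven main ranks (which discards trailing metadata) is an assumption.

```python
def to_mothur(header, sequence):
    """Return (FASTA record, TAX line) for one input record."""
    fields = [field.strip() for field in header.lstrip(">").split(";")]
    accession, ranks = fields[0], fields[1:8]  # assumption: seven ranks only
    while ranks and ranks[-1] in ("", "."):    # remove trailing placeholders
        ranks.pop()
    fasta_record = ">{}\n{}\n".format(accession, sequence)
    tax_line = "{}\t{}\n".format(accession, ";".join(ranks))
    return fasta_record, tax_line

fasta_record, tax_line = to_mothur(
    ">EUK1157284;Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;"
    "Chromulinaceae;Spumella;.;UPD1.8i",
    "ATCGATCG",
)
print(fasta_record, end="")
print(tax_line, end="")
```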
    

4. QIIME2 format (qiime2/)

  • Purpose: QIIME2-compatible FASTA and TSV files
  • Processing:
    • FASTA file: Contains only accession IDs as headers
    • TSV file: Two-column format with "Feature ID" and "Taxon" headers
    • Removes trailing dot placeholders and empty fields
  • Example transformation:
    FASTA: >EUK1157284
    TSV:   Feature ID	Taxon
           EUK1157284	Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;Chromulinaceae;Spumella
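
    Writing the TSV file can be sketched with the standard csv module. This is an assumption-laden illustration rather than the code in qiime.py: `write_qiime2_taxonomy` is a hypothetical name, and empty fields are dropped wherever they occur.

```python
import csv
import io

def write_qiime2_taxonomy(headers, handle):
    """Write a two-column taxonomy table for the given FASTA headers."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Feature ID", "Taxon"])
    for header in headers:
        fields = [field.strip() for field in header.lstrip(">").split(";")]
        accession = fields[0]
        ranks = [name for name in fields[1:8] if name]  # drop empty fields
        while ranks and ranks[-1] == ".":               # drop trailing dots
            ranks.pop()
        writer.writerow([accession, ";".join(ranks)])

buffer = io.StringIO()
write_qiime2_taxonomy(
    [">EUK1157284;Straminipila;Chrysophyta;Chrysophyceae;Chromulinales;"
     "Chromulinaceae;Spumella;."],
    buffer,
)
print(buffer.getvalue(), end="")
```

    Passing a file opened with `newline=""` instead of the `StringIO` buffer writes the same table to disk.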
    

5. SINTAX format (sintax/)

  • Purpose: SINTAX/USEARCH compatible format with rank prefixes
  • Processing:
    • Uses General format as input (requires rank prefixes)
    • Converts rank prefixes: k__ → d:, p__ → p:, c__ → c:, etc.
    • Excludes "unclassified" entries
    • Formats as tax=d:Kingdom,p:Phylum,c:Class...
  • Example transformation:
    Input:  >EUK1157284;k__Straminipila;p__Chrysophyta;c__Chrysophyceae;o__Chromulinales;f__Chromulinaceae;g__Spumella;s__unclassified
    Output: >EUK1157284;tax=d:Straminipila,p:Chrysophyta,c:Chrysophyceae,o:Chromulinales,f:Chromulinaceae,g:Spumella;
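
    The prefix conversion can be sketched as follows. It reproduces the example above but is not the code in sintax.py; `to_sintax` and `SINTAX_PREFIX` are hypothetical names.

```python
# General-format prefix -> SINTAX prefix (kingdom becomes "d:").
SINTAX_PREFIX = {
    "k__": "d:", "p__": "p:", "c__": "c:", "o__": "o:",
    "f__": "f:", "g__": "g:", "s__": "s:",
}

def to_sintax(general_header):
    """Convert a General-format header to a SINTAX tax= annotation."""
    fields = general_header.lstrip(">").split(";")
    accession, ranks = fields[0], fields[1:]
    parts = []
    for rank in ranks:
        prefix, name = rank[:3], rank[3:]
        # Skip unknown prefixes, empty names, and "unclassified" entries.
        if prefix in SINTAX_PREFIX and name and name != "unclassified":
            parts.append(SINTAX_PREFIX[prefix] + name)
    return ">{};tax={};".format(accession, ",".join(parts))

print(to_sintax(
    ">EUK1157284;k__Straminipila;p__Chrysophyta;c__Chrysophyceae;"
    "o__Chromulinales;f__Chromulinaceae;g__Spumella;s__unclassified"
))
```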
    

Data quality improvements

The pipeline automatically handles several data quality issues:

  1. Encoding problems: Converts non-ASCII characters to ASCII equivalents
  2. Quote removal: Strips quotation marks that may interfere with parsing
  3. Whitespace normalization: Trims excess whitespace from taxonomic fields
  4. Sequence formatting:
    • Converts sequences to uppercase (via seqkit)
    • Removes line wrapping for consistent formatting
  5. Placeholder standardization: Handles various placeholder formats (., empty, "unused")
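
The pipeline delegates sequence formatting to seqkit. For readers who want to see what unwrapping and upper-casing amount to, here is a pure-Python illustration of the same effect; `normalize_fasta` is a hypothetical helper and is not part of the workflow.

```python
def normalize_fasta(lines):
    """Yield (header, sequence) with the sequence unwrapped and upper-cased."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()  # also trims stray whitespace
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks).upper()
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:  # emit the final record
        yield header, "".join(chunks).upper()

wrapped = [">ABC123;Fungi\n", "atcgatcg\n", "ggtaccaa\n", ">DEF456;Plantae\n", "GCta\n"]
for header, sequence in normalize_fasta(wrapped):
    print(header)
    print(sequence)
```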

Taxonomic completeness

The pipeline handles incomplete taxonomic assignments as follows:

  • Missing higher ranks: Fills with "unclassified" labels in General format
  • Partial taxonomy: Preserves available levels, handles missing ones appropriately
  • Inconsistent depth: Normalizes taxonomy depth across different sequences

This ensures compatibility with downstream analysis tools that expect consistent taxonomic formatting.

Pipeline details

  • Robust Encoding Conversion:
    All scripts utilize a custom file-like wrapper (implemented in utils.py) that reads files using latin-1 decoding and converts them on the fly to ASCII. This ensures that all non-ASCII characters are handled gracefully.

  • Taxonomy header cleaning:
    The scripts are designed to remove unwanted placeholders (.) and extra taxonomic levels.

    • For DADA2, the script removes the accession number and retains taxonomy fields only until the first placeholder.
    • For QIIME2 and Mothur, the scripts clean the taxonomy output (the QIIME2 TSV and Mothur TAX files) by stripping trailing dot placeholders.
  • Reproducible workflow:
    The entire process is managed by Snakemake, ensuring that jobs run in the correct order with proper dependencies, even when running in parallel.

Customization & troubleshooting

  • Modifying header formatting:
    The header processing logic is contained within each script (e.g., dada2.py, qiime.py). You can modify the functions transform_header or clean_taxonomy as needed.

  • Resource management:
    If you encounter memory or process issues (e.g., SIGKILLs), try reducing the number of cores with --cores 2 or increasing system resources.

  • Debugging:
    Use Snakemake's verbose (--verbose) and print-shell-commands (--printshellcmds) options for detailed execution logs.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements

This pipeline was developed to streamline taxonomic FASTA file processing for downstream bioinformatics applications. Contributions, suggestions, and bug reports are welcome.
