🧬 ORFEX

ORF Extraction & Annotation Pipeline

Automated Open Reading Frame prediction and functional annotation for genomic sequences

📖 Overview

ORFEX is an end-to-end pipeline that predicts open reading frames (ORFs) in genomic or metagenomic FASTA sequences and annotates them against a protein reference database. A single command chains Prodigal for gene prediction, DIAMOND or BLASTP for homology search, and a custom Python annotation engine, and writes the results to an organized output directory.

✨ Features

Automated ORF Prediction - Uses Prodigal with support for both single-genome and metagenomic modes
Flexible Homology Search - Uses DIAMOND (fast) when it is installed, otherwise falls back to NCBI BLAST+ (comprehensive)
Rich Annotation Output - Generates annotated protein FASTA, gene FASTA, and a detailed TSV annotation table
Organized Directory Structure - All outputs neatly arranged in numbered subdirectories
Full Logging & Reproducibility - Saves run parameters, software versions, and a complete pipeline log
Summary Report - Auto-generates a human-readable summary with annotation statistics

🛠️ Prerequisites

Tool	Purpose	Install
Prodigal	ORF prediction	`conda install -c bioconda prodigal`
DIAMOND or BLAST+	Homology search	`conda install -c bioconda diamond` or `conda install -c bioconda blast`
Python 3.6+	Annotation script	Usually pre-installed
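
The pipeline's own detection logic lives in `run_orf_analysis.sh` and is not reproduced here. As a rough illustration of the prerequisite check and the DIAMOND-first fallback, the following Python sketch looks for the tools on `PATH` with `shutil.which`. The function names `pick_search_tool` and `missing_tools` are hypothetical and not part of ORFEX.

```python
import shutil


def pick_search_tool(which=shutil.which):
    """Prefer DIAMOND; fall back to BLASTP; fail if neither is on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    for tool in ("diamond", "blastp"):
        if which(tool):
            return tool
    raise RuntimeError("Neither diamond nor blastp was found on PATH")


def missing_tools(required=("prodigal",), which=shutil.which):
    """Return the required executables that are not on PATH."""
    return [tool for tool in required if which(tool) is None]
```

Calling `missing_tools()` before a run gives an early, readable error instead of a failure partway through the pipeline.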

🚀 Quick Start

# Basic usage
./run_orf_analysis.sh -i genome.fasta -o results/ -d uniprot_sprot.fasta

# Metagenomic mode with custom e-value and threads
./run_orf_analysis.sh -i metagenome.fasta -o meta_results/ -d uniprot_sprot.fasta -m meta -e 1e-10 -t 8

📋 Usage

./run_orf_analysis.sh -i INPUT.fasta -o OUTPUT_DIR -d DATABASE.fasta [OPTIONS]

Required Arguments

Flag	Description
`-i`	Input FASTA file (genome or contigs)
`-o`	Output directory (created automatically)
`-d`	Protein database in FASTA format (e.g., UniProt/Swiss-Prot)

Optional Arguments

Flag	Description	Default
`-m`	Prodigal mode: `single` or `meta`	`meta`
`-e`	E-value threshold for homology search	`0.001`
`-t`	Number of CPU threads	`20`
`-h`	Show help message	—
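
For batch runs over many genomes it can help to assemble the command line programmatically. The sketch below builds an argument list from the flags and defaults documented above; `build_orfex_command` is a hypothetical helper, not something shipped with ORFEX, and it assumes the wrapper script sits in the current directory.

```python
def build_orfex_command(input_fasta, output_dir, database,
                        mode="meta", evalue=0.001, threads=20,
                        script="./run_orf_analysis.sh"):
    """Build an argument list for the ORFEX wrapper script.

    Defaults mirror the documented ones: -m meta, -e 0.001, -t 20.
    """
    if mode not in ("single", "meta"):
        raise ValueError("mode must be 'single' or 'meta'")
    return [
        script,
        "-i", str(input_fasta),
        "-o", str(output_dir),
        "-d", str(database),
        "-m", mode,
        "-e", str(evalue),
        "-t", str(threads),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with unusual file names.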

🔬 How ORFEX Works

Input FASTA
    │
    ▼
┌─────────────────────────┐
│  Step 1: Prodigal        │  ORF prediction (single/meta mode)
│  Gene Calling            │  → proteins.faa, genes.fna, predictions.gff
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  Step 2: DIAMOND/BLASTP  │  Homology search against protein database
│  Similarity Search       │  → blast_results.txt
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  Step 3: Python Annotator│  Merge predictions + BLAST hits
│  annotate_orfs_new.py    │  → annotated FASTAs + TSV table
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│  Step 4: Summary Report  │  Statistics & organized output
└─────────────────────────┘
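
The exact commands issued by `run_orf_analysis.sh` are not listed in this README. As an approximation of Steps 1 and 2, the sketch below assembles Prodigal, DIAMOND, and BLAST+ invocations using their standard command-line options. The helper names, the `tmp_db` prefix, and the choice of tabular output format 6 are assumptions made for illustration.

```python
def prodigal_cmd(input_fasta, out_dir, mode="meta"):
    """Step 1: gene calling. -p selects 'single' or 'meta' mode."""
    return [
        "prodigal", "-i", input_fasta,
        "-a", f"{out_dir}/proteins.faa",     # protein translations
        "-d", f"{out_dir}/genes.fna",        # nucleotide gene sequences
        "-o", f"{out_dir}/predictions.gff",  # coordinates
        "-f", "gff", "-p", mode,
    ]


def search_cmds(tool, proteins, database, out_file,
                evalue=0.001, threads=20, db_prefix="tmp_db"):
    """Step 2: build the database, then search it (tabular output)."""
    if tool == "diamond":
        return [
            ["diamond", "makedb", "--in", database, "-d", db_prefix],
            ["diamond", "blastp", "-q", proteins, "-d", db_prefix,
             "-o", out_file, "--outfmt", "6",
             "--evalue", str(evalue), "--threads", str(threads)],
        ]
    if tool == "blastp":
        return [
            ["makeblastdb", "-in", database, "-dbtype", "prot",
             "-out", db_prefix],
            ["blastp", "-query", proteins, "-db", db_prefix,
             "-out", out_file, "-outfmt", "6",
             "-evalue", str(evalue), "-num_threads", str(threads)],
        ]
    raise ValueError("tool must be 'diamond' or 'blastp'")
```

Each list can be run in order with `subprocess.run(cmd, check=True)`, so a failed step stops the run before the next one starts.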

📂 Output Structure

OUTPUT_DIR/
├── input.fasta                          # Copy of input file
├── 01_prodigal/
│   ├── predictions.gff                  # ORF predictions (GFF3)
│   ├── proteins.faa                     # Predicted protein sequences
│   └── genes.fna                        # Predicted gene sequences
├── 02_blast/
│   └── blast_results.txt                # Tabular homology search results
├── 03_annotated/
│   ├── annotated_proteins.faa           # Proteins with functional annotations
│   ├── annotated_genes.fna              # Genes with functional annotations
│   └── annotation_table.tsv             # Comprehensive annotation table
├── run_info.txt                         # Pipeline parameters & software versions
├── pipeline.log                         # Complete execution log
└── SUMMARY.txt                          # Human-readable results summary
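
A quick way to confirm that a run completed is to check for the files in the layout above. The following sketch does that with `pathlib`; `EXPECTED_OUTPUTS` and `missing_outputs` are hypothetical names, and the list simply transcribes the tree shown here.

```python
from pathlib import Path

# Relative paths copied from the documented output tree.
EXPECTED_OUTPUTS = [
    "input.fasta",
    "01_prodigal/predictions.gff",
    "01_prodigal/proteins.faa",
    "01_prodigal/genes.fna",
    "02_blast/blast_results.txt",
    "03_annotated/annotated_proteins.faa",
    "03_annotated/annotated_genes.fna",
    "03_annotated/annotation_table.tsv",
    "run_info.txt",
    "pipeline.log",
    "SUMMARY.txt",
]


def missing_outputs(output_dir):
    """Return the expected output files that are absent from output_dir."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

An empty list from `missing_outputs("results/")` means every documented file is present; it does not check that the files are non-empty or well-formed.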

Annotation Table Columns (`annotation_table.tsv`)

Column	Description
`ORF_ID`	Unique ORF identifier from Prodigal
`Contig`	Source contig/scaffold name
`Start` / `End`	Genomic coordinates
`Strand`	`+` (forward) or `-` (reverse)
`Length_bp`	Nucleotide length
`Protein_Length_aa`	Amino acid length
`Best_Hit`	Top database match accession
`Identity_%`	Percent identity to best hit
`E-value`	Statistical significance
`Bit_Score`	Alignment quality score
`Annotation`	Functional description
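
To show where these columns can come from, here is a minimal sketch that parses a Prodigal FASTA header and picks the best hit per ORF from 12-column tabular search output. It is not the code in `annotate_orfs_new.py`. It assumes the standard Prodigal header layout (`>ID # start # end # strand # ...`), the default tabular column order, and selection by highest bit score; all function names are hypothetical. The `Protein_Length_aa` and `Annotation` columns are not covered.

```python
def parse_prodigal_header(header):
    """Extract coordinates from a header such as
    '>contig_1_3 # 2 # 181 # -1 # ID=1_3;partial=00'."""
    fields = [f.strip() for f in header.lstrip(">").split(" # ")]
    orf_id, start, end = fields[0], int(fields[1]), int(fields[2])
    return {
        "ORF_ID": orf_id,
        # Prodigal appends _<gene number> to the contig name.
        "Contig": orf_id.rsplit("_", 1)[0],
        "Start": start,
        "End": end,
        "Strand": "+" if fields[3] == "1" else "-",
        "Length_bp": end - start + 1,
    }


def best_hits(lines):
    """Keep the highest bit-score hit per query from tabular lines."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        hit = {
            "Best_Hit": cols[1],
            "Identity_%": float(cols[2]),
            "E-value": float(cols[10]),
            "Bit_Score": float(cols[11]),
        }
        query = cols[0]
        if query not in best or hit["Bit_Score"] > best[query]["Bit_Score"]:
            best[query] = hit
    return best


def annotation_row(header, hits):
    """Merge ORF coordinates with its best hit, or 'NA' when there is none."""
    row = parse_prodigal_header(header)
    no_hit = {"Best_Hit": "NA", "Identity_%": "NA",
              "E-value": "NA", "Bit_Score": "NA"}
    row.update(hits.get(row["ORF_ID"], no_hit))
    return row
```

Keeping ORFs without hits in the table (as `NA` rows here) makes it possible to count unannotated predictions in the summary.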

📄 Pipeline Files

File	Description
`run_orf_analysis.sh`	Main ORFEX pipeline wrapper script
`annotate_orfs_new.py`	Python annotation engine - parses BLAST results, annotates FASTA headers, and generates the TSV table

💡 Example

# Annotate extracted sequences against UniProt/Swiss-Prot
./run_orf_analysis.sh \
    -i Clado_extracted.fasta \
    -o Clado_ORF_results \
    -d uniprot_sprot.fasta \
    -m meta \
    -e 1e-5 \
    -t 20

📝 Notes

DIAMOND is preferred over BLASTP for large datasets because it runs significantly faster.
Use `-m single` for complete, single-organism genomes; use `-m meta` (default) for contigs, draft genomes, or metagenomic assemblies.
The pipeline automatically cleans up the temporary search database files after the homology search step.
All other intermediate files (Prodigal predictions and raw search results) are preserved for inspection and reproducibility.

ORFEX -- From raw sequence to functional annotation, in one command.
