aleponce4/akodon-genome-assembly-workflow

Akodon Genome Assembly and Annotation Workflow

Slurm workflow for assembly, repeat annotation, structural annotation, and functional annotation of an Akodon genome. This repository reorganizes previously separate HPC job scripts into one reproducible workflow.

Workflow

Assembly:

  1. Supernova
  2. supernova mkoutput
  3. scaffold filtering with seqkit
  4. QUAST
  5. BUSCO
  6. BUSCO plot
  7. MultiQC
  8. RepeatModeler
  9. repeat library merge and CD-HIT filtering
  10. RepeatMasker

Annotation:

  1. simplify masked genome headers
  2. download reference proteins from NCBI
  3. prepare combined protein FASTA
  4. GALBA
  5. BRAKER2
  6. BRAKER3
  7. TSEBRA
  8. longest-isoform filtering
  9. restore original headers
  10. InterProScan
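
Steps 1 and 9 bracket the predictors with a header round trip: sequence names are simplified before prediction and mapped back afterwards. The sketch below shows one way such a round trip could work on a FASTA file with awk. The function names, the `seqN` naming scheme, and the two-column map file are illustrative assumptions, not the repository's own scripts, and the actual workflow may restore names in annotation files rather than in FASTA.

```shell
# Sketch only: names and file layout are hypothetical.

# Rename FASTA headers to seq1, seq2, ... and record each original header
# in a tab-separated map (new_id<TAB>original header text).
# usage: simplify_headers in.fa out.fa map.tsv
simplify_headers() {
    awk -v map="$3" '
        /^>/ { n++; print "seq" n "\t" substr($0, 2) > map; print ">seq" n; next }
        { print }
    ' "$1" > "$2"
}

# Replace simplified headers with the originals recorded in the map.
# Headers that are not in the map are passed through unchanged.
# usage: restore_headers map.tsv in.fa out.fa
restore_headers() {
    awk '
        FILENAME == ARGV[1] {
            tab = index($0, "\t")
            orig[substr($0, 1, tab - 1)] = substr($0, tab + 1)
            next
        }
        /^>/ {
            id = substr($0, 2)
            if (id in orig) print ">" orig[id]; else print
            next
        }
        { print }
    ' "$1" "$2" > "$3"
}
```

Keeping the map as a plain TSV means the same file can later drive renaming in other outputs, for example the sequence column of a GFF, with a similar lookup.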

Notes:

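  • stages 14, 15, and 16 (GALBA, BRAKER2, and BRAKER3, counting the annotation steps as a continuation of the assembly numbering) are alternative prediction tracks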
  • TSEBRA combines selected predictor outputs
  • the annotation branch starts after RepeatMasker
  • Supernova should use raw, untrimmed 10x linked-read FASTQs
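
These notes describe a fan-out and fan-in shape: the prediction tracks all start from the RepeatMasker output, and TSEBRA waits for whichever tracks were selected. One common way to express that in Slurm is job dependencies. The sketch below uses `sbatch --parsable` (prints the job ID) and `--dependency=afterok:...`. The job script names and the `SBATCH` override variable are hypothetical, and `run_pipeline.sh` in the repository may wire the stages differently.

```shell
# Sketch only: job script names and the SBATCH override are assumptions.

# Submit a job script that starts only after the listed job IDs finish successfully.
# Prints the new job ID. Set SBATCH=echo to print the arguments without submitting.
# usage: submit_after script.sbatch [jobid ...]
submit_after() {
    local script deps
    script=$1
    shift
    if [ "$#" -gt 0 ]; then
        # Join the remaining arguments with ':' as sbatch expects.
        deps=$(IFS=:; echo "$*")
        "${SBATCH:-sbatch}" --parsable --dependency="afterok:$deps" "$script"
    else
        "${SBATCH:-sbatch}" --parsable "$script"
    fi
}

# Fan out to three prediction tracks after RepeatMasker, then fan in to TSEBRA.
submit_prediction_tracks() {
    local rm_id galba_id braker2_id braker3_id
    rm_id=$(submit_after jobs/repeatmasker.sbatch)
    galba_id=$(submit_after jobs/galba.sbatch "$rm_id")
    braker2_id=$(submit_after jobs/braker2.sbatch "$rm_id")
    braker3_id=$(submit_after jobs/braker3.sbatch "$rm_id")
    submit_after jobs/tsebra.sbatch "$galba_id" "$braker2_id" "$braker3_id"
}
```

Because the tracks are alternatives, dropping one from the final `submit_after` call is enough to exclude it from the TSEBRA dependency list.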

Main Tools

  • Supernova
  • seqkit
  • QUAST
  • BUSCO
  • MultiQC
  • RepeatModeler / RepeatMasker via dfam/tetools
  • GALBA
  • BRAKER2 / BRAKER3
  • TSEBRA
  • InterProScan

Files

Inputs

  • raw 10x FASTQs in data/
  • sample table in config/samples.tsv
  • BUSCO lineage data
  • container images for RepeatModeler/RepeatMasker, BRAKER, GALBA, and InterProScan
  • annotation inputs such as ncbi_dataset.tsv, protein FASTA, TSEBRA configs, and RNA BAMs for BRAKER3

Setup

Review:

Common settings:

  • DATA_DIR
  • SUPERNOVA_BIN
  • REPEATMODELER_IMAGE
  • BUSCO_LINEAGE_DIR
  • BRAKER_SIF
  • GALBA_SIF
  • INTERPROSCAN_SIF
  • INTERPROSCAN_DATA_DIR
  • Slurm account, partition, QoS, memory, and walltime
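
A quick check that these settings are filled in can catch an empty path before any job is queued. The sketch below is a hypothetical helper, not part of the repository. It assumes the settings are plain shell variables, for instance after sourcing an env file such as `config/pipeline.env`, and it only tests that each one is non-empty, not that the path exists.

```shell
# Sketch only: missing_settings is a hypothetical helper.

# Print the name of every listed variable that is unset or empty.
# Returns 0 when all are set, 1 otherwise.
# usage: missing_settings VAR_NAME [VAR_NAME ...]
missing_settings() {
    local name value missing
    missing=0
    for name in "$@"; do
        # Portable indirect lookup of the variable called $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call, after sourcing the env file, would be `missing_settings DATA_DIR SUPERNOVA_BIN REPEATMODELER_IMAGE BUSCO_LINEAGE_DIR BRAKER_SIF GALBA_SIF INTERPROSCAN_SIF INTERPROSCAN_DATA_DIR`, which prints one line per setting that still needs a value.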

Bootstrap

bash scripts/hpc/probe_node_capabilities.sh
bash scripts/hpc/bootstrap_dependencies.sh install config/bootstrap.env
bash scripts/hpc/bootstrap_dependencies.sh verify config/bootstrap.env

Automated by default:

  • repo-local Conda environments
  • BUSCO lineage download
  • tetools_latest.sif
  • InterProScan image and data
  • get_longest_isoform.py
  • default TSEBRA config files

Still manual by default:

  • Supernova if the legacy path is unavailable
  • BRAKER and GALBA SIFs unless source paths are provided
  • biological inputs such as Vertebrata.fa, ncbi_dataset.tsv, and RNA BAMs

Run

Preflight:

bash scripts/check_pipeline_connections.sh config/pipeline.env

Submit:

bash run_pipeline.sh config/pipeline.env

Smoke test:

bash run_smoke_test.sh config/smoke_test.env

About

Genome assembly and annotation pipeline for Akodon using 10x Genomics Supernova and BRAKER-based gene prediction on HPC.
