Slurm workflow for assembly, repeat annotation, structural annotation, and functional annotation of an Akodon genome. This repository reorganizes previously separate HPC job scripts into one reproducible workflow.
Assembly:
- Supernova
- `supernova mkoutput`
- scaffold filtering with `seqkit`
- QUAST
- BUSCO
- BUSCO plot
- MultiQC
- RepeatModeler
- repeat library merge and CD-HIT filtering
- RepeatMasker
Annotation:
- simplify masked genome headers
- download reference proteins from NCBI
- prepare combined protein FASTA
- GALBA
- BRAKER2
- BRAKER3
- TSEBRA
- longest-isoform filtering
- restore original headers
- InterProScan
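The header simplification and restoration steps can be sketched as a pair of shell helpers. This is a minimal illustration for FASTA input, not the repository's own scripts: the function names `simplify_headers` and `restore_headers` and the two-column map file are hypothetical, and it assumes the first whitespace-delimited token of each header is unique.

```shell
# Hypothetical helpers; the repository's own stage scripts may work differently.

# simplify_headers IN.fa OUT.fa MAP.tsv
# Keeps only the first whitespace-delimited token of each FASTA header and
# writes "short_id<TAB>original_header" pairs to MAP.tsv for later restoration.
simplify_headers() {
  awk -v map="$3" '
    /^>/ {
      full = substr($0, 2)            # header without the leading ">"
      split(full, parts, /[ \t]+/)    # parts[1] is the short ID
      print parts[1] "\t" full > map
      print ">" parts[1]
      next
    }
    { print }                         # sequence lines pass through unchanged
  ' "$1" > "$2"
}

# restore_headers IN.fa MAP.tsv OUT.fa
# Replaces each short ID with the original header recorded in MAP.tsv.
restore_headers() {
  awk -F '\t' '
    NR == FNR { orig[$1] = $2; next } # first file: load the ID map
    /^>/ {
      id = substr($0, 2)
      if (id in orig) print ">" orig[id]; else print
      next
    }
    { print }
  ' "$2" "$1" > "$3"
}
```

Restoring names in a GFF file would need the same lookup applied to the sequence ID column rather than to `>` lines.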
Notes:
- stages `14`, `15`, and `16` are alternative prediction tracks
- TSEBRA combines selected predictor outputs
- the annotation branch starts after RepeatMasker
- Supernova should use raw, untrimmed 10x linked-read FASTQs
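The branching described in these notes maps naturally onto Slurm job dependencies. The sketch below shows one way to fan out from RepeatMasker into alternative prediction tracks and fan back in at TSEBRA. The helper `submit_after` and the stage script names are hypothetical placeholders (the real scripts are the numbered files under `slurm/`), and the intermediate input preparation stages are left out for brevity. Only the `sbatch` options `--parsable` and `--dependency=afterok:` are relied on.

```shell
# submit_after DEPENDENCY_JOB_IDS SCRIPT
# DEPENDENCY_JOB_IDS is a colon-separated list of job IDs, or "" for none.
# Prints the new job ID on stdout. With DRY_RUN=1 the sbatch command is
# printed to stderr instead of being run, and a fake ID is returned.
submit_after() {
  deps=$1
  script=$2
  if [ -n "$deps" ]; then
    set -- sbatch --parsable "--dependency=afterok:$deps" "$script"
  else
    set -- sbatch --parsable "$script"
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*" >&2
    echo "dry_$(basename "$script" .sh)"
  else
    "$@"
  fi
}

DRY_RUN=1   # print commands only; unset this to submit for real

# Hypothetical script names, for illustration only.
masker=$(submit_after "" slurm/10_repeatmasker.sh)
galba=$(submit_after "$masker" slurm/14_galba.sh)
braker3=$(submit_after "$masker" slurm/16_braker3.sh)
# TSEBRA waits for every selected predictor track to finish successfully.
tsebra=$(submit_after "$galba:$braker3" slurm/17_tsebra.sh)
```

Because `afterok` only releases a job when its parents exit with status 0, a failed predictor track keeps the TSEBRA job from starting rather than letting it run on incomplete input.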
- Supernova
- `seqkit`
- QUAST
- BUSCO
- MultiQC
- RepeatModeler / RepeatMasker via `dfam/tetools`
- GALBA
- BRAKER2 / BRAKER3
- TSEBRA
- InterProScan
- `config/pipeline.env`: pipeline paths and Slurm settings
- `config/bootstrap.env`: dependency bootstrap settings
- `config/samples.tsv`: sample metadata
- `run_pipeline.sh`: full workflow submission
- `run_smoke_test.sh`: smoke test
- `slurm/`: numbered stage scripts
- `scripts/check_pipeline_connections.sh`: preflight path check
- `scripts/hpc/bootstrap_dependencies.sh`: HPC bootstrap
- raw 10x FASTQs in `data/`
- sample table in `config/samples.tsv`
- BUSCO lineage data
- container images for RepeatModeler/RepeatMasker, BRAKER, GALBA, and InterProScan
- annotation inputs such as `ncbi_dataset.tsv`, protein FASTA, TSEBRA configs, and RNA BAMs for BRAKER3
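Before submitting, it can help to confirm that the raw FASTQs are complete. The sketch below checks that every R1 file in a directory has a matching R2 file. It assumes bcl2fastq-style names containing `_R1_` and `_R2_` and a `.fastq.gz` suffix, which is an assumption about the input files, and `check_fastq_pairs` is a hypothetical helper rather than part of the repository.

```shell
# check_fastq_pairs DIR
# Returns non-zero if DIR has no *_R1_*.fastq.gz files or if any R1 file
# lacks its R2 mate. The "_R1_"/"_R2_" naming is an assumption.
check_fastq_pairs() {
  dir=$1
  found=0
  status=0
  for r1 in "$dir"/*_R1_*.fastq.gz; do
    [ -e "$r1" ] || continue          # an unmatched glob stays literal; skip it
    found=1
    base=${r1##*/}                    # file name without the directory
    r2="$dir/${base%%_R1_*}_R2_${base#*_R1_}"
    if [ ! -e "$r2" ]; then
      echo "missing mate for $base" >&2
      status=1
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "no R1 FASTQ files in $dir" >&2
    status=1
  fi
  return "$status"
}
```

A call such as `check_fastq_pairs data` reports each unpaired file on stderr and exits non-zero, so it can gate a submission script.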
Review:
Common settings:
- `DATA_DIR`
- `SUPERNOVA_BIN`
- `REPEATMODELER_IMAGE`
- `BUSCO_LINEAGE_DIR`
- `BRAKER_SIF`
- `GALBA_SIF`
- `INTERPROSCAN_SIF`
- `INTERPROSCAN_DATA_DIR`
- Slurm account, partition, QoS, memory, and walltime
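A quick way to catch an incomplete configuration is to source the env file and confirm that the settings listed above are non-empty. The helper below is a minimal sketch: `require_vars` is a hypothetical name, and it only checks that each variable has a value, not that the path it points to exists.

```shell
# require_vars NAME...
# Returns non-zero and lists every named shell variable that is unset or empty.
# The names are expected to be fixed identifiers, not untrusted input.
require_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"          # indirect lookup of the variable named $name
    if [ -z "$value" ]; then
      echo "missing setting: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (assumes config/pipeline.env uses plain VAR=value assignments):
#   . config/pipeline.env
#   require_vars DATA_DIR SUPERNOVA_BIN REPEATMODELER_IMAGE BUSCO_LINEAGE_DIR \
#                BRAKER_SIF GALBA_SIF INTERPROSCAN_SIF INTERPROSCAN_DATA_DIR
```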
bash scripts/hpc/probe_node_capabilities.sh
bash scripts/hpc/bootstrap_dependencies.sh install config/bootstrap.env
bash scripts/hpc/bootstrap_dependencies.sh verify config/bootstrap.env
Automated by default:
- repo-local Conda environments
- BUSCO lineage download
- `tetools_latest.sif`
- InterProScan image and data
- `get_longest_isoform.py`
- default TSEBRA config files
Still manual by default:
- Supernova if the legacy path is unavailable
- BRAKER and GALBA SIFs unless source paths are provided
- biological inputs such as `Vertebrata.fa`, `ncbi_dataset.tsv`, and RNA BAMs
Preflight:
bash scripts/check_pipeline_connections.sh config/pipeline.env
Submit:
bash run_pipeline.sh config/pipeline.env
Smoke test:
bash run_smoke_test.sh config/smoke_test.env