Akodon Genome Assembly and Annotation Workflow

Slurm workflow for assembly, repeat annotation, structural annotation, and functional annotation of an Akodon genome. This repository reorganizes previously separate HPC job scripts into one reproducible workflow.

Workflow

Assembly:

1. Supernova
2. supernova mkoutput
3. scaffold filtering with seqkit
4. QUAST
5. BUSCO
6. BUSCO plot
7. MultiQC
8. RepeatModeler
9. repeat library merge and CD-HIT filtering
10. RepeatMasker
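
The scaffold filtering stage can be pictured as a minimum-length filter. The sketch below is not the repository's script: it uses awk as a dependency-free stand-in for seqkit, and the function name filter_min_len and the cutoff passed to it are assumptions chosen for illustration. With seqkit installed, `seqkit seq -m 1000 in.fasta` applies the same kind of cutoff.

```shell
# Keep only FASTA records whose sequence length is >= min_len.
# Sequences are written unwrapped (one line per record).
# Usage: filter_min_len <min_len> <in.fasta>   (writes to stdout)
filter_min_len() {
    local min_len="$1" fasta="$2"
    awk -v min="$min_len" '
        function emit() {
            if (header != "" && length(seq) >= min) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }          # join wrapped sequence lines
        END { emit() }
    ' "$fasta"
}
```

The cutoff actually used by the workflow is not stated here, so treat any value passed to this function as a placeholder.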

Annotation:

11. simplify masked genome headers
12. download reference proteins from NCBI
13. prepare combined protein FASTA
14. GALBA
15. BRAKER2
16. BRAKER3
17. TSEBRA
18. longest-isoform filtering
19. restore original headers
20. InterProScan
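
The header simplification and restoration stages amount to a reversible renaming. The bash sketch below shows one way this could work on a FASTA file, keeping a two-column mapping table. The function names, the seqN naming scheme, and the map file layout are assumptions for illustration. The repository's own scripts may use a different scheme and may restore headers in annotation outputs rather than in FASTA.

```shell
# Rename FASTA headers to seq1, seq2, ... and record the mapping.
# Usage: simplify_headers <in.fasta> <map.tsv>   (simplified FASTA on stdout)
simplify_headers() {
    local fasta="$1" map="$2"
    awk -v map="$map" '
        BEGIN { printf "" > map }             # create or truncate the map file
        /^>/ {
            n++
            printf "seq%d\t%s\n", n, substr($0, 2) > map
            print ">seq" n
            next
        }
        { print }
    ' "$fasta"
}

# Put the original headers back using the mapping table.
# Headers without an entry in the table are left unchanged.
# Usage: restore_headers <simplified.fasta> <map.tsv>   (writes to stdout)
restore_headers() {
    local fasta="$1" map="$2"
    awk -F '\t' '
        FILENAME == ARGV[1] { orig[$1] = $2; next }   # first file: mapping table
        /^>/ {
            id = substr($0, 2)
            print ">" (id in orig ? orig[id] : id)
            next
        }
        { print }
    ' "$map" "$fasta"
}
```

Short, whitespace-free sequence names of this kind tend to be safer for gene predictors, and the mapping table lets the final outputs carry the original scaffold names again.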

Notes:

stages 14, 15, and 16 (GALBA, BRAKER2, and BRAKER3) are alternative prediction tracks
TSEBRA combines selected predictor outputs
the annotation branch starts after RepeatMasker
Supernova should use raw, untrimmed 10x linked-read FASTQs
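
Because the prediction tracks are alternatives and TSEBRA combines only the selected ones, the TSEBRA job has to wait for whichever tracks were submitted. The bash sketch below builds a Slurm `afterok` dependency string from a list of job IDs. The helper name afterok_deps and the script placeholders in the usage comment are hypothetical, and run_pipeline.sh may wire its dependencies differently.

```shell
# Build a Slurm dependency string such as "afterok:101:102" from job IDs.
# Prints nothing and returns 1 if no job IDs are given.
afterok_deps() {
    [ "$#" -gt 0 ] || return 1
    local IFS=':'                 # "$*" joins arguments with the first IFS character
    printf 'afterok:%s\n' "$*"
}

# Hypothetical usage (stage script names are placeholders, not real paths):
#   galba_id=$(sbatch --parsable slurm/<GALBA stage script>)
#   braker3_id=$(sbatch --parsable slurm/<BRAKER3 stage script>)
#   sbatch --dependency="$(afterok_deps "$galba_id" "$braker3_id")" slurm/<TSEBRA stage script>
```

With `afterok`, the dependent job starts only once every listed job has finished successfully, so a failed predictor track holds TSEBRA back rather than letting it run on incomplete input.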

Main Tools

Supernova
seqkit
QUAST
BUSCO
MultiQC
RepeatModeler / RepeatMasker via dfam/tetools
GALBA
BRAKER2 / BRAKER3
TSEBRA
InterProScan

Files

config/pipeline.env: pipeline paths and Slurm settings
config/bootstrap.env: dependency bootstrap settings
config/samples.tsv: sample metadata
run_pipeline.sh: full workflow submission
run_smoke_test.sh: smoke test
slurm/: numbered stage scripts
scripts/check_pipeline_connections.sh: preflight path check
scripts/hpc/bootstrap_dependencies.sh: HPC bootstrap

Inputs

raw 10x FASTQs in data/
sample table in config/samples.tsv
BUSCO lineage data
container images for RepeatModeler/RepeatMasker, BRAKER, GALBA, and InterProScan
annotation inputs such as ncbi_dataset.tsv, protein FASTA, TSEBRA configs, and RNA BAMs for BRAKER3
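
Before a long assembly job is submitted, it can help to confirm that the raw FASTQs in data/ are complete. The bash sketch below checks that every R1 file has a matching R2 file. It assumes Illumina-style names such as `<sample>_S1_L001_R1_001.fastq.gz`, and the function name check_fastq_pairs is hypothetical. It is not a substitute for the repository's own preflight script.

```shell
# Report R1 FASTQ files in a directory that lack a matching R2 file.
# Assumes names like <sample>_S1_L001_R1_001.fastq.gz.
# Returns 0 when at least one R1 exists and every R1 has its R2, 1 otherwise.
check_fastq_pairs() {
    local dir="$1" r1 base r2 found=0 missing=0
    for r1 in "$dir"/*_R1_*.fastq.gz; do
        [ -e "$r1" ] || continue                      # glob matched nothing
        found=1
        base="${r1##*/}"
        # Swap the last "_R1_" in the file name for "_R2_".
        r2="$dir/${base%_R1_*}_R2_${base##*_R1_}"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $base" >&2
            missing=1
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "no R1 FASTQ files found in $dir" >&2
        return 1
    fi
    return "$missing"
}
```

For example, `check_fastq_pairs data` returns non-zero and names the unpaired file if a lane was only partially copied.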

Setup

Review the files in config/ before running.

Common settings:

DATA_DIR
SUPERNOVA_BIN
REPEATMODELER_IMAGE
BUSCO_LINEAGE_DIR
BRAKER_SIF
GALBA_SIF
INTERPROSCAN_SIF
INTERPROSCAN_DATA_DIR
Slurm account, partition, QoS, memory, and walltime
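
A quick way to catch an incomplete configuration is to check that the settings above are defined before anything is submitted. The bash sketch below assumes that config/pipeline.env is a file of shell variable assignments that can be loaded with `source`. The helper name require_vars is hypothetical and is not part of the repository.

```shell
# Check that each named variable is set and non-empty in the current shell.
# Prints the missing names to stderr; returns 1 if any are missing.
require_vars() {
    local name missing=0
    for name in "$@"; do
        if [ -z "${!name:-}" ]; then      # ${!name} is bash indirect expansion
            echo "missing setting: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical usage after loading the config:
#   source config/pipeline.env
#   require_vars DATA_DIR SUPERNOVA_BIN REPEATMODELER_IMAGE BUSCO_LINEAGE_DIR \
#       BRAKER_SIF GALBA_SIF INTERPROSCAN_SIF INTERPROSCAN_DATA_DIR
```

This only confirms that the variables are set. Whether the paths they point to exist is the job of scripts/check_pipeline_connections.sh.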

Bootstrap

bash scripts/hpc/probe_node_capabilities.sh
bash scripts/hpc/bootstrap_dependencies.sh install config/bootstrap.env
bash scripts/hpc/bootstrap_dependencies.sh verify config/bootstrap.env

Automated by default:

repo-local Conda environments
BUSCO lineage download
tetools_latest.sif
InterProScan image and data
get_longest_isoform.py
default TSEBRA config files

Still manual by default:

Supernova if the legacy path is unavailable
BRAKER and GALBA SIFs unless source paths are provided
biological inputs such as Vertebrata.fa, ncbi_dataset.tsv, and RNA BAMs

Run

Preflight:

bash scripts/check_pipeline_connections.sh config/pipeline.env

Submit:

bash run_pipeline.sh config/pipeline.env

Smoke test:

bash run_smoke_test.sh config/smoke_test.env
