This repository contains code for training and fine-tuning Enformer-based models for genomic track prediction.
Commands to set up the environment:
module load python3
module load pytorch
module load tensorflow
[ ! -d "env" ] && virtualenv --system-site-packages env
source env/bin/activate
pip install -r requirements.txt
To use SCC interactive sessions:
qrsh -P aclab -l gpus=1 -l gpu_c=3.5 -pe omp 1
You can submit bash files on the SCC using generic_submit.sh with the following syntax:
bash generic_submit.sh job_script.sh arg1 arg2 arg3
This is roughly equivalent to qsub job_script.sh arg1 arg2 arg3, but it first copies job_script.sh into the submitted_jobs directory under a unique name, job_script.sh.TIMESTAMP, and then submits that copy instead. This means you can modify and resubmit job_script.sh repeatedly, even while jobs are queued, without having to worry about which job ID corresponds to which submission.
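The copy-then-submit behavior can be sketched as follows (a hypothetical Python equivalent; the actual generic_submit.sh is a bash script, and the exact timestamp format is an assumption):

```python
import datetime
import shutil
import subprocess
from pathlib import Path

def submit_with_timestamp(job_script, args, submit_cmd="qsub"):
    """Copy job_script into submitted_jobs/ under a unique timestamped
    name, then submit the copy instead of the original script."""
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    dest_dir = Path("submitted_jobs")
    dest_dir.mkdir(exist_ok=True)
    stamped = dest_dir / f"{Path(job_script).name}.{stamp}"
    shutil.copy(job_script, stamped)
    # Submit the frozen copy, forwarding any extra job arguments.
    subprocess.run([submit_cmd, str(stamped), *args], check=True)
    return stamped
```

Because the scheduler only ever sees the frozen copy, editing the original job_script.sh afterwards cannot affect jobs already in the queue.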
The main entry point for training is submit_basenji_combined.py. You can either run it directly with Python or use basenji_combined.sh which sets up the environment and passes parameters to the Python script.
There are three main experiment types:
Train an Enformer model from scratch on human genomic data.
Key flags:
- model_architecture=enformer_pytorch - use the Enformer architecture
- task_type=basenji - genomic track prediction task
- ro_only_human=true - train on human data
- ro_pretrained_pretraining=false - train from scratch (not fine-tuning)
- efa_enformer_n_layer=11 - number of transformer layers
- learning_rate=5e-5 - learning rate for training from scratch
- epochs=40 - number of training epochs
Example command:
python submit_basenji_combined.py \
--model_architecture enformer_pytorch \
--task_type basenji \
--ro_only_human true \
--ro_pretrained_pretraining false \
--efa_enformer_n_layer 11 \
--learning_rate 5e-5 \
--epochs 40 \
--batch_size 1 \
--loss_type manual_poisson \
--lr_type cosine \
--weight_decay 1e-4 \
--grad_norm_clip 0.2 \
--wandb_project my_pretraining_project
Fine-tune a pretrained model on data from a different species or on specific histone modification tracks.
Key flags:
- ro_pretrained_pretraining=true - enable fine-tuning from a checkpoint
- ro_pretrained_path=<path> - path to the pretrained model checkpoint
- Species/track flag set to true (see lists below)
- learning_rate=3e-5 - learning rate for fine-tuning
- epochs=10 - fewer epochs needed for fine-tuning
Available species flags:
ro_only_human, ro_only_mouse, ro_only_cattle, ro_only_pig, ro_only_chicken, ro_only_dog, ro_only_mole_rat, ro_only_rhesus, ro_only_mouse_12
Available histone track flags:
ro_only_mouse_h3k27ac, ro_only_mouse_h3k27me3, ro_only_rhesus_h3k27ac, ro_only_rhesus_h3k27me3, ro_only_chicken_h3k27ac, ro_only_chicken_h3k27me3
Example: Fine-tuning on cattle data:
python submit_basenji_combined.py \
--model_architecture enformer_pytorch \
--task_type basenji \
--ro_only_cattle true \
--ro_pretrained_pretraining true \
--ro_pretrained_path /path/to/pretrained/model \
--efa_enformer_n_layer 11 \
--learning_rate 3e-5 \
--epochs 10 \
--batch_size 1 \
--loss_type manual_poisson \
--wandb_project cattle_finetuning
Example: Fine-tuning on the chicken H3K27me3 track:
python submit_basenji_combined.py \
--model_architecture enformer_pytorch \
--task_type basenji \
--ro_only_chicken_h3k27me3 true \
--ro_pretrained_pretraining true \
--ro_pretrained_path /path/to/pretrained/model \
--efa_enformer_n_layer 11 \
--learning_rate 3e-5 \
--epochs 10 \
--batch_size 1 \
--wandb_project chicken_h3k27me3_finetuning
Study the effect of varying the number of training tracks on individual track performance.
Key flags:
- num_tracks=<N> - number of tracks to use (e.g., 50, 100, 200, 500, 800, 1000, 1500, 1642)
- track_of_interest=<track_id> - specific track to evaluate
- seed=<seed> - random seed for reproducibility
- ro_only_mouse=true - typically run on mouse data
- learning_rate=3e-7 - lower learning rate when fine-tuning with specific tracks
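The way num_tracks, track_of_interest, and seed might interact can be sketched like this (a hypothetical illustration, not the repository's actual loader code): sample N distinct track indices with the given seed, making sure the track of interest is always among them.

```python
import random

def choose_tracks(total_tracks, num_tracks, track_of_interest, seed):
    """Pick `num_tracks` distinct track indices out of `total_tracks`,
    seeded for reproducibility, always including `track_of_interest`."""
    rng = random.Random(seed)
    # Sample the remaining slots from all other tracks, then re-add
    # the track of interest so it is guaranteed to be evaluated.
    others = [t for t in range(total_tracks) if t != track_of_interest]
    chosen = rng.sample(others, num_tracks - 1)
    return sorted(chosen + [track_of_interest])

# e.g., 800 tracks out of 1642, always keeping track 280
tracks = choose_tracks(1642, 800, 280, seed=800280)
```

Seeding with a fixed value makes the subset reproducible across runs, so results for a given (num_tracks, track_of_interest) pair are comparable.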
Example:
python submit_basenji_combined.py \
--model_architecture enformer_pytorch \
--task_type basenji \
--ro_only_mouse true \
--num_tracks 800 \
--track_of_interest 280 \
--seed 800280 \
--ro_pretrained_pretraining true \
--ro_pretrained_path /path/to/checkpoint \
--efa_enformer_n_layer 11 \
--learning_rate 3e-7 \
--epochs 10 \
--batch_size 1 \
--wandb_project num_tracks_experiment
There are 5 groups of files used in each training job:
- Model files - e.g., model/model_basenji_rewrite.py
- Training loop file - e.g., model/training_loop_reg.py
- Dataset loader file - loader/expanded_basenji.py
- Main script - submit_basenji_combined.py (combines all components)
- Bash wrapper - basenji_combined.sh (sets up environment and parameters)
The model_architecture parameter controls which model is used:
- enformer_pytorch - Pure Enformer implementation. Parameters use the efa_ prefix (e.g., efa_enformer_n_layer).
- enformer_rewrite - Custom model with more options. Parameters use the ro_ prefix. Sub-prefixes include:
  - ro_enformer_* - Enformer block parameters
  - ro_performer_* - Performer block parameters
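The prefix convention can be mimicked as follows (a hypothetical sketch of how prefixed flags could be routed to the right model component; the real dispatch lives in submit_basenji_combined.py, and the example argument names besides efa_enformer_n_layer are made up):

```python
def split_by_prefix(args, prefix):
    """Collect arguments whose names start with `prefix`, stripping it,
    so e.g. efa_enformer_n_layer becomes enformer_n_layer."""
    return {k[len(prefix):]: v for k, v in args.items() if k.startswith(prefix)}

args = {"efa_enformer_n_layer": 11, "ro_performer_heads": 8, "learning_rate": 5e-5}
efa_kwargs = split_by_prefix(args, "efa_")  # {'enformer_n_layer': 11}
ro_kwargs = split_by_prefix(args, "ro_")    # {'performer_heads': 8}
```

Prefixing keeps the two architectures' hyperparameters from colliding in a single flat argument namespace.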
WandB is used for experiment tracking. It stores data online for easy access and provides ways to organize runs by parameters.
- Create an account on the WandB website and ask to be added to the optimizedlearning team.
- Install: pip install wandb (or pip install -r requirements.txt)
- Initialize: run wandb init from the project directory and choose the optimizedlearning team
- Login: run wandb login and follow the prompts for your API key
In the code, experiments are logged with wandb.log calls, and hyperparameters are tracked via wandb.config.
To generate a new dataset, use the scripts in genomics_debug/script/. Example:
/genomicsML/genomics_debug/script/debug_again_cattle.sh
Variables to configure for new datasets:
- OUTPUT_BASE
- ORGANISM_ORIGINAL (e.g., cattle_nooverlap)
- ORGANISM (e.g., ${ORGANISM_ORIGINAL}_reassigned)
- GAP_FILE - path to gap BED file
- FASTA_FILE - path to genome FASTA
- TXT_FILE - path to targets file
- CHROM_SIZES - path to chromosome sizes file
- ALIGNMENT - path to chain file for liftover
To add a new species:

1. Update basenji_combined.sh: add a new variable in the 'SPECIE TYPE' section (e.g., ro_only_cat=true) and add the corresponding argument (e.g., --ro_only_cat $ro_only_cat).

2. Update submit_basenji_combined.py: add the dataset path for the new species. Search for an existing species like mouse_12 and replicate the pattern:

   if args.ro_only_cat:
       train_dataset = BasenjiDataset('cat_nooverlap_reassigned', 'train', data_dir="/path/to/data/")
       validation_dataset = BasenjiDataset('cat_nooverlap_reassigned', 'valid', data_dir="/path/to/data/")

3. Update output features: adjust the number of output features for the new species in the model configuration.
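Step 3 might look something like the following (a hypothetical sketch; the flag names follow the pattern above, but all track counts are placeholder values to be read off each dataset's targets file):

```python
# Hypothetical mapping from species flag to number of output tracks;
# the real counts come from each species' targets file.
OUTPUT_FEATURES = {
    "ro_only_human": 5313,   # placeholder value
    "ro_only_mouse": 1643,   # placeholder value
    "ro_only_cat": 42,       # fill in from the new targets file
}

def num_output_features(species_flag):
    """Look up the output head size for a species, failing loudly if
    the new species was never registered."""
    try:
        return OUTPUT_FEATURES[species_flag]
    except KeyError:
        raise ValueError(f"No output-feature count registered for {species_flag}")
```

Failing loudly on an unregistered species catches the common mistake of adding the dataset path (step 2) but forgetting the model head size (step 3).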
- Todo document: Google Doc
- Google Drive: Shared Folder