
GenomicsML

This repository contains code for training and fine-tuning Enformer-based models for genomic track prediction.

Getting Started

Commands to set up the environment:

module load python3
module load pytorch
module load tensorflow
[ ! -d "env" ] && virtualenv --system-site-packages env
source env/bin/activate
pip install -r requirements.txt

To start an SCC interactive GPU session:

qrsh -P aclab -l gpus=1 -l gpu_c=3.5 -pe omp 1

Job Submission

You can submit job scripts on the SCC using generic_submit.sh with the following syntax:

bash generic_submit.sh job_script.sh arg1 arg2 arg3

This is roughly equivalent to qsub job_script.sh arg1 arg2 arg3, but it first copies job_script.sh into the submitted_jobs directory under a unique name, job_script.sh.TIMESTAMP, and submits that copy instead. You can therefore modify and resubmit job_script.sh repeatedly, even while jobs are queued, without worrying about which job ID corresponds to which submission.
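The copy-with-timestamp-then-submit pattern can be sketched as follows. This is a hypothetical Python equivalent of what generic_submit.sh does, not the script itself; the `dry_run` flag is added purely for illustration:

```python
import shutil
import subprocess
import time
from pathlib import Path

def submit(job_script, extra_args=(), dry_run=False):
    """Copy job_script into submitted_jobs/ under a unique timestamped
    name, then submit the copy, so later edits to the original script
    cannot affect jobs that are already queued."""
    submitted = Path("submitted_jobs")
    submitted.mkdir(exist_ok=True)
    stamped = submitted / f"{Path(job_script).name}.{int(time.time())}"
    shutil.copy(job_script, stamped)  # freeze the script contents
    cmd = ["qsub", str(stamped), *extra_args]
    if dry_run:
        return cmd  # let callers inspect what would be submitted
    subprocess.run(cmd, check=True)
    return cmd
```

Because each submission gets its own frozen copy, the qstat job name also tells you exactly which version of the script a queued job will run.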

Running Experiments

The main entry point for training is submit_basenji_combined.py. You can either run it directly with Python or use basenji_combined.sh, which sets up the environment and passes parameters through to the Python script.

There are three main experiment types:

1. Pretraining (Training from Scratch)

Train an Enformer model from scratch on human genomic data.

Key flags:

  • model_architecture=enformer_pytorch - use the Enformer architecture
  • task_type=basenji - genomic track prediction task
  • ro_only_human=true - train on human data
  • ro_pretrained_pretraining=false - train from scratch (not fine-tuning)
  • efa_enformer_n_layer=11 - number of transformer layers
  • learning_rate=5e-5 - learning rate for training from scratch
  • epochs=40 - number of training epochs

Example command:

python submit_basenji_combined.py \
    --model_architecture enformer_pytorch \
    --task_type basenji \
    --ro_only_human true \
    --ro_pretrained_pretraining false \
    --efa_enformer_n_layer 11 \
    --learning_rate 5e-5 \
    --epochs 40 \
    --batch_size 1 \
    --loss_type manual_poisson \
    --lr_type cosine \
    --weight_decay 1e-4 \
    --grad_norm_clip 0.2 \
    --wandb_project my_pretraining_project

2. Fine-tuning on Different Species/Tracks

Fine-tune a pretrained model on data from a different species or specific histone modification tracks.

Key flags:

  • ro_pretrained_pretraining=true - enable fine-tuning from checkpoint
  • ro_pretrained_path=<path> - path to the pretrained model checkpoint
  • Species/track flag set to true (see list below)
  • learning_rate=3e-5 - learning rate for fine-tuning
  • epochs=10 - fewer epochs needed for fine-tuning

Available species flags:

  • ro_only_human, ro_only_mouse, ro_only_cattle, ro_only_pig, ro_only_chicken, ro_only_dog, ro_only_mole_rat, ro_only_rhesus, ro_only_mouse_12

Available histone track flags:

  • ro_only_mouse_h3k27ac, ro_only_mouse_h3k27me3
  • ro_only_rhesus_h3k27ac, ro_only_rhesus_h3k27me3
  • ro_only_chicken_h3k27ac, ro_only_chicken_h3k27me3

Example: Fine-tuning on cattle data:

python submit_basenji_combined.py \
    --model_architecture enformer_pytorch \
    --task_type basenji \
    --ro_only_cattle true \
    --ro_pretrained_pretraining true \
    --ro_pretrained_path /path/to/pretrained/model \
    --efa_enformer_n_layer 11 \
    --learning_rate 3e-5 \
    --epochs 10 \
    --batch_size 1 \
    --loss_type manual_poisson \
    --wandb_project cattle_finetuning

Example: Fine-tuning on chicken H3K27me3 track:

python submit_basenji_combined.py \
    --model_architecture enformer_pytorch \
    --task_type basenji \
    --ro_only_chicken_h3k27me3 true \
    --ro_pretrained_pretraining true \
    --ro_pretrained_path /path/to/pretrained/model \
    --efa_enformer_n_layer 11 \
    --learning_rate 3e-5 \
    --epochs 10 \
    --batch_size 1 \
    --wandb_project chicken_h3k27me3_finetuning

3. Number of Tracks Experiments

Study the effect of varying the number of training tracks on individual track performance.

Key flags:

  • num_tracks=<N> - number of tracks to use (e.g., 50, 100, 200, 500, 800, 1000, 1500, 1642)
  • track_of_interest=<track_id> - specific track to evaluate
  • seed=<seed> - random seed for reproducibility
  • ro_only_mouse=true - typically run on mouse data
  • learning_rate=3e-7 - lower learning rate when fine-tuning with specific tracks

Example:

python submit_basenji_combined.py \
    --model_architecture enformer_pytorch \
    --task_type basenji \
    --ro_only_mouse true \
    --num_tracks 800 \
    --track_of_interest 280 \
    --seed 800280 \
    --ro_pretrained_pretraining true \
    --ro_pretrained_path /path/to/checkpoint \
    --efa_enformer_n_layer 11 \
    --learning_rate 3e-7 \
    --epochs 10 \
    --batch_size 1 \
    --wandb_project num_tracks_experiment

File Organization

Each training job uses five groups of files:

  1. Model files - e.g., model/model_basenji_rewrite.py
  2. Training loop file - e.g., model/training_loop_reg.py
  3. Dataset loader file - loader/expanded_basenji.py
  4. Main script - submit_basenji_combined.py (combines all components)
  5. Bash wrapper - basenji_combined.sh (sets up environment and parameters)

Model Architecture Options

The model_architecture parameter controls which model is used:

  • enformer_pytorch - Pure Enformer implementation. Parameters use the efa_ prefix (e.g., efa_enformer_n_layer).
  • enformer_rewrite - Custom model with more options. Parameters use the ro_ prefix. Sub-prefixes include:
    • ro_enformer_* - Enformer block parameters
    • ro_performer_* - Performer block parameters
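The prefix convention can be illustrated with a minimal argparse sketch. This is hypothetical (the real submit_basenji_combined.py defines many more flags), but it shows how efa_* and ro_* parameters coexist and how the boolean flags in the examples above are passed as the strings true/false:

```python
import argparse

def str2bool(v):
    # Flags such as --ro_only_human are passed as the strings "true"/"false".
    return str(v).lower() in ("true", "1", "yes")

parser = argparse.ArgumentParser()
parser.add_argument("--model_architecture",
                    choices=["enformer_pytorch", "enformer_rewrite"])
# efa_* flags configure the pure enformer_pytorch model.
parser.add_argument("--efa_enformer_n_layer", type=int, default=11)
# ro_* flags configure the enformer_rewrite model and dataset selection.
parser.add_argument("--ro_only_human", type=str2bool, default=False)
parser.add_argument("--ro_pretrained_pretraining", type=str2bool, default=False)

args = parser.parse_args(
    ["--model_architecture", "enformer_pytorch",
     "--efa_enformer_n_layer", "11",
     "--ro_only_human", "true"])
```

The prefix makes it easy to grep for all parameters belonging to one architecture, and flags for the unused architecture are simply ignored at model-construction time.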

WandB Setup

WandB is used for experiment tracking. It stores run data online for easy access and lets you organize and compare runs by their hyperparameters.

  1. Create an account on the WandB website and ask to be added to the optimizedlearning team.
  2. Install: pip install wandb (or pip install -r requirements.txt)
  3. Initialize: run wandb init from the project directory and choose the optimizedlearning team
  4. Login: run wandb login and follow the prompts for your API key

In the code, experiments are logged with wandb.log calls, and hyperparameters are tracked via wandb.config.

Data Generation

To generate a new dataset, use the scripts in genomics_debug/script/. Example:

/genomicsML/genomics_debug/script/debug_again_cattle.sh

Variables to configure for new datasets:

  • OUTPUT_BASE
  • ORGANISM_ORIGINAL (e.g., cattle_nooverlap)
  • ORGANISM (e.g., ${ORGANISM_ORIGINAL}_reassigned)
  • GAP_FILE - path to gap BED file
  • FASTA_FILE - path to genome FASTA
  • TXT_FILE - path to targets file
  • CHROM_SIZES - path to chromosome sizes file
  • ALIGNMENT - path to chain file for liftover

Adding a New Species

  1. Update basenji_combined.sh: Add a new variable in the 'SPECIE TYPE' section (e.g., ro_only_cat=true) and add the corresponding argument (e.g., --ro_only_cat $ro_only_cat).

  2. Update submit_basenji_combined.py: Add the dataset path for the new species. Search for an existing species like mouse_12 and replicate the pattern:

    if args.ro_only_cat:
        train_dataset = BasenjiDataset('cat_nooverlap_reassigned', 'train', data_dir="/path/to/data/")
        validation_dataset = BasenjiDataset('cat_nooverlap_reassigned', 'valid', data_dir="/path/to/data/")
  3. Update output features: Adjust the number of output features for the new species in the model configuration.
