A systematic delineation of 3′ UTR regulatory elements and their contextual associations.
This repository accompanies the SEERS manuscript (bioRxiv):
https://www.biorxiv.org/content/10.1101/2025.06.09.658412v2
This repo is focused on two practical parts:
- SEERS count/enrichment processing and k-mer analysis (R scripts)
  Folder: SEERS_data_process_LiangN/
- TALE model training/evaluation scripts and a pretrained weight (Python scripts)
  Folder: TALE_models_LiJY_260128/
- Nn_pp.R
  Extracts and counts N45 sequences from merged FASTQ files.
- Nn_pp_pool.R
  Pools counts across biological/technical replicates.
- combine_dna_cyt_nuc.R
  Computes enrichment-style values using DNA/Cytoplasm/Nucleus counts.
- kmer_profiling.R
  Performs k-mer level association/correlation analyses.
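To make the pooling and enrichment steps concrete, here is a minimal Python sketch. It is not a port of the R scripts: the exact formula used by combine_dna_cyt_nuc.R is not stated here, so the log2 ratio of depth-normalized counts with a pseudocount is an assumption, and the function names `pool_counts` and `log2_enrichment` are hypothetical.

```python
import math
from collections import Counter


def pool_counts(replicates):
    """Sum per-sequence counts across replicate count tables."""
    pooled = Counter()
    for rep in replicates:
        pooled.update(rep)  # Counter.update adds counts for mappings
    return pooled


def log2_enrichment(sample, dna, pseudocount=1.0):
    """Assumed enrichment: log2(sample CPM / DNA CPM) per sequence.

    Only sequences present in the DNA library are scored. The
    pseudocount keeps sequences with zero sample reads finite.
    """
    sample_total = sum(sample.values())
    dna_total = sum(dna.values())
    if sample_total == 0 or dna_total == 0:
        raise ValueError("both count tables must contain reads")
    scores = {}
    for seq, dna_n in dna.items():
        sample_cpm = (sample.get(seq, 0) + pseudocount) / sample_total * 1e6
        dna_cpm = (dna_n + pseudocount) / dna_total * 1e6
        scores[seq] = math.log2(sample_cpm / dna_cpm)
    return scores


# Toy data (made up for illustration only).
dna_counts = Counter({"AAA": 10, "CCC": 10})
cyt_counts = pool_counts([{"AAA": 15, "CCC": 5}, {"AAA": 15, "CCC": 5}])
cyt_enrichment = log2_enrichment(cyt_counts, dna_counts)
```

The same function could be applied to nucleus counts, or to cytoplasm versus nucleus, depending on which ratio is of interest.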
- Re_trained_seerr.py
  Main retraining script for SEERS/TALE style modeling.
- eval_seers_lstm_on_3pL6_A549.py
  Evaluation script for LSTM model on A549-related set.
- external_test_3pL6_HCT116.py
  External test script for HCT116-related set.
- eval_external_models_on_3pL6_A549.py
  Compare/evaluate external models on A549-related set.
- eval_clinvar_snps_pytorch.py
  ClinVar SNP effect evaluation utility.
- eval_cnn1_kernel_sweep.py
  CNN kernel sweep evaluation script.
- Randomseq_vs3utr.py
  Utility for random-sequence vs 3′UTR comparisons.
- model/final_model.pth
  Included pretrained model weight.
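As a rough illustration of what a SNP effect evaluation involves, the sketch below scores a variant as the difference between the predicted value for the alternate and reference sequences. This is a common convention and an assumption here, not a description of eval_clinvar_snps_pytorch.py. `apply_snp`, `snp_effect`, and `toy_predict` are hypothetical names, and `toy_predict` is a stand-in for a trained model, not the TALE model.

```python
def apply_snp(seq, pos, ref, alt):
    """Return seq with the base at 0-based position pos changed from ref to alt."""
    if seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {seq[pos]!r}")
    return seq[:pos] + alt + seq[pos + 1:]


def snp_effect(predict, seq, pos, ref, alt):
    """Assumed effect size: prediction(alt sequence) - prediction(ref sequence)."""
    return predict(apply_snp(seq, pos, ref, alt)) - predict(seq)


def toy_predict(seq):
    """Stand-in scorer (A/T fraction); replace with a real model's prediction."""
    return sum(base in "AT" for base in seq) / len(seq)


# Hypothetical variant: C>A at position 1 of a 4-nt toy sequence.
effect = snp_effect(toy_predict, "ACGT", 1, "C", "A")
```

Passing the predictor as a callable keeps the variant handling independent of any particular model or framework.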
Use NGmerge as in the manuscript workflow:
./NGmerge -d -1 read1.fq.gz -2 read2.fq.gz -o merged.fq.gz

cd SEERS_data_process_LiangN
Rscript Nn_pp.R
Rscript Nn_pp_pool.R
Rscript combine_dna_cyt_nuc.R
Rscript kmer_profiling.R

Note: these scripts assume your input paths/filenames are configured inside the scripts.
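For orientation, a k-mer association analysis of the kind kmer_profiling.R performs can be pictured as follows. This Python sketch correlates the presence of each k-mer with a per-sequence score using Pearson's r. The statistic, the presence/absence encoding, and all function names are assumptions for illustration, and the R script may differ.

```python
from collections import defaultdict
from math import sqrt


def kmers(seq, k):
    """Set of distinct k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pearson(xs, ys):
    """Pearson correlation, or None if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def kmer_score_correlation(scores, k):
    """Map each k-mer to the correlation between its presence (0/1) and the score.

    scores: dict of sequence -> numeric score (for example, an enrichment value).
    """
    seqs = list(scores)
    ys = [scores[s] for s in seqs]
    carriers = defaultdict(set)  # k-mer -> indices of sequences containing it
    for idx, seq in enumerate(seqs):
        for kmer in kmers(seq, k):
            carriers[kmer].add(idx)
    result = {}
    for kmer, idxs in carriers.items():
        xs = [1.0 if i in idxs else 0.0 for i in range(len(seqs))]
        r = pearson(xs, ys)
        if r is not None:
            result[kmer] = r
    return result


# Toy scores (made up): TTT-containing sequences score high, GGG-containing low.
toy_scores = {"TTTA": 1.0, "TTTC": 1.0, "GGGA": -1.0, "GGGC": -1.0}
toy_result = kmer_score_correlation(toy_scores, k=3)
```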
cd TALE_models_LiJY_260128
python Re_trained_seerr.py --train_csv ./train_data_260128/TALE-train-data-260128.csv --save_path ./models/seerr_torch_260128 --seeds 42
python eval_seers_lstm_on_3pL6_A549.py --models_root ./models/seerr_torch_260128 --test_csv ./train_data_260128/3pL6-A549-T1.csv
python external_test_3pL6_HCT116.py
python eval_external_models_on_3pL6_A549.py --test_csv ./train_data_260128/3pL6-A549-T1.csv
python eval_clinvar_snps_pytorch.py
python eval_cnn1_kernel_sweep.py --train_csv ./train_data_260128/TALE-train-data-260128.csv

Full training dataset (Zenodo):
https://doi.org/10.5281/zenodo.18737939
For current TALE scripts, prefer the latest split-aware files from Zenodo:
- TALE-train-data-260128.csv (single file with group column: train/val/test)
- 3pL6-A549-T1.csv (independent high-quality A549 external test set)
The training/evaluation scripts in TALE_models_LiJY_260128/ now support direct use of this group column format for easier standalone reproduction.
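If you want to inspect the splits outside the provided scripts, the group column can be used to partition the rows directly. The sketch below uses only the standard library. The helper name `split_by_group` is hypothetical, and the only column it relies on is `group`. The other column names in the toy example are made up.

```python
import csv
from collections import defaultdict


def split_by_group(lines, group_col="group"):
    """Partition CSV rows by the value of group_col.

    lines: any iterable of CSV text lines, such as an open file handle.
    Returns a dict mapping each group value (e.g. train/val/test)
    to a list of row dicts.
    """
    splits = defaultdict(list)
    for row in csv.DictReader(lines):
        splits[row[group_col]].append(row)
    return dict(splits)


# Toy input with hypothetical columns "seq" and "score".
toy_lines = [
    "seq,score,group",
    "ACGT,0.5,train",
    "TTTT,1.2,train",
    "GGCC,-0.3,val",
    "CATG,0.1,test",
]
toy_splits = split_by_group(toy_lines)
```

With the real file, open it with `open(path, newline="")` and pass the handle to `split_by_group` in place of `toy_lines`.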
- R >= 4.0
- Python >= 3.9
- NGmerge
If you encounter package errors, please install dependencies required by each script in your local environment.
| Date | Version/Update | Description |
|---|---|---|
| 2026-04-08 | v2.1 | Added split-aware usage examples in README and synchronized quick-start commands with current Zenodo v260128 workflow. |
| 2026-01-28 | v2.0 | Updated scripts for revised manuscript and new datasets. |
| 2025-04-21 | v1.2 | Added kmer_motif.ipynb and N45_dissect.ipynb. |
| 2024-12-08 | v1.1 | Added TALE_SNP_effect.ipynb. |
| 2024-08-14 | v1.0 | TF 2.16 compatibility fix & Refactored Notebooks. |