GitHub - gao-lab/SEERS

SEERS: Selective Enrichment of Episomes with Random Sequences

A systematic delineation of 3′ UTR regulatory elements and their contextual associations.

📄 Paper

This repository accompanies the SEERS manuscript (bioRxiv):
https://www.biorxiv.org/content/10.1101/2025.06.09.658412v2

📌 What this repository currently contains

This repo is focused on two practical parts:

- SEERS count/enrichment processing and k-mer analysis (R scripts), in `SEERS_data_process_LiangN/`
- TALE model training/evaluation scripts and a pretrained weight (Python scripts), in `TALE_models_LiJY_260128/`

📂 Current folder guide (based on existing files)

`SEERS_data_process_LiangN/` (R)

- `Nn_pp.R`: Extracts and counts N45 sequences from merged FASTQ files.
- `Nn_pp_pool.R`: Pools counts across biological/technical replicates.
- `combine_dna_cyt_nuc.R`: Computes enrichment-style values using DNA/Cytoplasm/Nucleus counts.
- `kmer_profiling.R`: Performs k-mer level association/correlation analyses.
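To make the first step concrete, here is a minimal Python sketch of extracting and counting N45 inserts from merged reads. It is not a port of `Nn_pp.R`: the flanking sequences `LEFT_FLANK` and `RIGHT_FLANK` are hypothetical placeholders, and the assumption that the random region sits between two constant flanks is ours, not something stated in this README.

```python
import re
from collections import Counter

# Hypothetical constant flanks around the random region (assumption);
# replace them with the real sequences used in your library design.
LEFT_FLANK = "ACGTAC"
RIGHT_FLANK = "TGCATG"

# Capture exactly 45 unambiguous bases between the two flanks.
N45_PATTERN = re.compile(f"{LEFT_FLANK}([ACGT]{{45}}){RIGHT_FLANK}")


def count_n45(fastq_lines):
    """Count N45 inserts in an iterable of FASTQ lines (4 lines per record)."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the second line of each record
            continue
        match = N45_PATTERN.search(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    insert = "ACGTTGCAA" * 5  # toy 45-nt insert
    read = "GG" + LEFT_FLANK + insert + RIGHT_FLANK + "CC"
    toy_fastq = ["@r1", read, "+", "I" * len(read)] * 2
    for seq, n in count_n45(toy_fastq).items():
        print(seq, n)
```

For a real file, the same function can be fed a text-mode handle such as `gzip.open("merged.fq.gz", "rt")`.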

`TALE_models_LiJY_260128/` (Python)

- `Re_trained_seerr.py`: Main retraining script for SEERS/TALE-style modeling.
- `eval_seers_lstm_on_3pL6_A549.py`: Evaluates the LSTM model on the A549-related set.
- `external_test_3pL6_HCT116.py`: External test script for the HCT116-related set.
- `eval_external_models_on_3pL6_A549.py`: Compares and evaluates external models on the A549-related set.
- `eval_clinvar_snps_pytorch.py`: ClinVar SNP effect evaluation utility.
- `eval_cnn1_kernel_sweep.py`: CNN kernel sweep evaluation script.
- `Randomseq_vs3utr.py`: Utility for random-sequence vs 3′ UTR comparisons.
- `model/final_model.pth`: Included pretrained model weight.
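As an illustration of what a SNP effect evaluation generally involves, the sketch below scores a reference sequence and its single-nucleotide variant with the same predictor and reports the difference. This is a generic pattern, not the logic of `eval_clinvar_snps_pytorch.py`; `apply_snp`, `snp_effect`, and the toy predictor `gc_fraction` are hypothetical names, and a trained model's prediction function would take the place of `gc_fraction`.

```python
def apply_snp(sequence, position, ref, alt):
    """Return `sequence` with the 0-based `position` changed from `ref` to `alt`."""
    if sequence[position] != ref:
        raise ValueError(
            f"expected {ref!r} at position {position}, found {sequence[position]!r}"
        )
    return sequence[:position] + alt + sequence[position + 1:]


def snp_effect(predict, sequence, position, ref, alt):
    """Predicted score of the alternate sequence minus that of the reference."""
    return predict(apply_snp(sequence, position, ref, alt)) - predict(sequence)


def gc_fraction(sequence):
    """Toy stand-in for a trained model (assumption): fraction of G/C bases."""
    return sum(base in "GC" for base in sequence) / len(sequence)


if __name__ == "__main__":
    print(snp_effect(gc_fraction, "AATTAATT", 0, "A", "G"))
```

Checking the reference base before substitution catches coordinate or strand mismatches early, which is a common source of silent errors when variants come from an external database.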

🚀 Quick start (current scripts)

1) Prepare merged FASTQ (if starting from paired-end reads)

Use NGmerge as in the manuscript workflow:

./NGmerge -d -1 read1.fq.gz -2 read2.fq.gz -o merged.fq.gz

2) Run R-based processing

cd SEERS_data_process_LiangN
Rscript Nn_pp.R
Rscript Nn_pp_pool.R
Rscript combine_dna_cyt_nuc.R
Rscript kmer_profiling.R

Note: these scripts expect input paths and filenames to be configured inside each script, so edit them to match your data before running.
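For readers who want a feel for what an "enrichment-style value" can look like, here is a minimal Python sketch that compares each sequence's share of reads in one fraction (for example cytoplasm) against its share in the DNA input, on a log2 scale. The formula, the pseudocount, and the function name `log2_enrichment` are assumptions for illustration; `combine_dna_cyt_nuc.R` may define its values differently.

```python
import math


def log2_enrichment(rna_counts, dna_counts, pseudocount=1.0):
    """Return {sequence: log2(RNA read share / DNA read share)}.

    Only sequences present in `dna_counts` are scored. The pseudocount
    keeps sequences with zero RNA reads finite (assumed convention).
    """
    rna_total = sum(rna_counts.values())
    dna_total = sum(dna_counts.values())
    if rna_total == 0 or dna_total == 0:
        raise ValueError("both count tables must contain at least one read")
    scores = {}
    for seq, dna in dna_counts.items():
        rna_share = (rna_counts.get(seq, 0) + pseudocount) / rna_total
        dna_share = (dna + pseudocount) / dna_total
        scores[seq] = math.log2(rna_share / dna_share)
    return scores


if __name__ == "__main__":
    dna = {"seqA": 10, "seqB": 10}
    cyt = {"seqA": 30, "seqB": 10}
    for seq, score in log2_enrichment(cyt, dna).items():
        print(seq, round(score, 3))
```

The same function can compare two RNA fractions (for example cytoplasm against nucleus) by passing the nuclear counts as the second argument.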

3) Run Python model scripts

cd TALE_models_LiJY_260128
python Re_trained_seerr.py --train_csv ./train_data_260128/TALE-train-data-260128.csv --save_path ./models/seerr_torch_260128 --seeds 42
python eval_seers_lstm_on_3pL6_A549.py --models_root ./models/seerr_torch_260128 --test_csv ./train_data_260128/3pL6-A549-T1.csv
python external_test_3pL6_HCT116.py
python eval_external_models_on_3pL6_A549.py --test_csv ./train_data_260128/3pL6-A549-T1.csv
python eval_clinvar_snps_pytorch.py
python eval_cnn1_kernel_sweep.py --train_csv ./train_data_260128/TALE-train-data-260128.csv

📦 Data

Full training dataset (Zenodo):
https://doi.org/10.5281/zenodo.18737939

For current TALE scripts, prefer the latest split-aware files from Zenodo:

- `TALE-train-data-260128.csv` (single file with group column: train / val / test)
- `3pL6-A549-T1.csv` (independent high-quality A549 external test set)

The training/evaluation scripts in TALE_models_LiJY_260128/ now support direct use of this group column format for easier standalone reproduction.
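If you want to inspect the splits yourself, a single file with a group column can be partitioned in a few lines. The sketch below uses only the standard library; the column name `group` and the other column names in the toy data (`sequence`, `label`) are assumptions, so check the header of the Zenodo file before relying on them.

```python
import csv
import io
from collections import defaultdict


def split_by_group(csv_file, group_column="group"):
    """Partition CSV rows into {group value: [row dict, ...]}."""
    splits = defaultdict(list)
    for row in csv.DictReader(csv_file):
        splits[row[group_column]].append(row)
    return dict(splits)


if __name__ == "__main__":
    # Toy data standing in for the real training file (hypothetical columns).
    toy = io.StringIO(
        "sequence,label,group\n"
        "ACGT,0.5,train\n"
        "TTGA,1.2,train\n"
        "GGCA,0.1,val\n"
        "CATG,0.9,test\n"
    )
    for name, rows in split_by_group(toy).items():
        print(name, len(rows))
```

For the real file, pass an open handle instead, for example `open("TALE-train-data-260128.csv", newline="")`.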

🧰 Environment (minimal)

- R >= 4.0
- Python >= 3.9
- NGmerge

If you encounter package errors, please install dependencies required by each script in your local environment.
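A quick pre-flight check can save a failed run. The following sketch verifies the Python version and looks for `Rscript` and `NGmerge` on `PATH`; the function name `check_environment` is hypothetical, and it does not check the R version or any R/Python packages.

```python
import shutil
import sys


def check_environment(tools=("Rscript", "NGmerge"), min_python=(3, 9)):
    """Return a dict of booleans describing the local setup."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Note that the quick start invokes NGmerge as `./NGmerge` from the current directory, so a "MISSING" result for NGmerge only means it is not on `PATH`, not that it is unavailable.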

📝 Changelog

| Date | Version/Update | Description |
| --- | --- | --- |
| 2026-04-08 | v2.1 | Added split-aware usage examples in README and synchronized quick-start commands with current Zenodo v260128 workflow. |
| 2026-01-28 | v2.0 | Updated scripts for revised manuscript and new datasets. |
| 2025-04-21 | v1.2 | Added `kmer_motif.ipynb` and `N45_dissect.ipynb`. |
| 2024-12-08 | v1.1 | Added `TALE_SNP_effect.ipynb`. |
| 2024-08-14 | v1.0 | TF 2.16 compatibility fix and refactored notebooks. |
