kentsislab/proteomegenerator3 is a bioinformatics pipeline that creates sample-specific proteogenomics search databases from long-read RNA-seq data. It takes a samplesheet and aligned long-read RNA-seq data as input, performs guided de novo transcript assembly and ORF prediction, and produces a protein FASTA file suitable for use with computational proteomics search platforms (e.g., FragPipe, DIA-NN).
1. Pre-processing of aligned reads to create transcript read classes with Bambu, which can be re-used in future analyses. Optional filtering:
   - Filtering on MAPQ and read length with samtools
2. Transcript assembly, quantification, and filtering with Bambu, with the option to merge multiple samples into a unified transcriptome.
3. ORF prediction with TransDecoder.
4. Formatting of ORFs into a UniProt-style FASTA file suitable for computational proteomics searches with FragPipe, DIA-NN, or Spectronaut.
5. Concatenation of the sample-specific proteome FASTA produced in step 4 with a UniProt proteome of the user's choice, allowing spectra to compete between non-canonical and canonical proteoforms.
6. Deduplication of sequences and basic statistics with seqkit.
7. Collation of package versions with MultiQC.
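The concatenation and deduplication steps amount to appending the canonical UniProt FASTA to the sample-specific one before removing duplicate sequences. A minimal sketch with toy files (file names and sequences are illustrative; the pipeline handles this automatically):

```bash
# Toy stand-ins for the sample-specific and canonical proteomes (illustrative only)
printf '>ORF_1 novel proteoform\nMKVLAT\n' > sample_proteome.fasta
printf '>sp|P12345|EXMPL_HUMAN canonical\nMKVLAT\n>sp|P67890|EXMP2_HUMAN canonical\nMLAQRS\n' > uniprot_subset.fasta

# Concatenate into a single search database
cat sample_proteome.fasta uniprot_subset.fasta > combined_proteome.fasta

# Count entries in the combined database
grep -c '^>' combined_proteome.fasta   # 3 entries
```

In the pipeline itself, seqkit then removes duplicate sequences from the combined file.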
> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with `-profile test` before running the workflow on actual data. This profile runs the pipeline on a minimal test dataset that completes in 5-10 minutes on most modern laptops.
First, prepare a samplesheet with your input data that looks as follows:
`samplesheet.csv`:

```csv
subject_id,sample_id,sequence_type,filetype,filepath
PATIENT1,SAMPLE1,long_read,bam,/path/to/sample1.bam
PATIENT1,SAMPLE1,long_read,rc_file,/path/to/sample1.rds
PATIENT1,SAMPLE1,fusion,tsv,/path/to/sample1_fusions.tsv
PATIENT1,SAMPLE2,long_read,bam,/path/to/sample2.bam
```

Each row represents a single file associated with a sample. The columns are as follows:
| Column | Required | Values | Description |
|---|---|---|---|
| `subject_id` | Yes | String (no spaces) | Subject/patient identifier |
| `sample_id` | Yes | String (no spaces) | Sample identifier |
| `sequence_type` | Yes | `long_read`, `short_read`, `fusion` | Data modality |
| `filetype` | Yes | `bam`, `rc_file`, `tsv` | File format |
| `filepath` | Yes | File path | Path to the file |
Requirements:

- Every sample MUST have at least one `long_read` + `bam` entry
- `rc_file` entries are optional; use with the `--skip_preprocessing` flag to speed up runtime by reusing Bambu read classes from previous runs
- `fusion` entries require the `--fusions` flag to be processed
- `short_read` entries require the `--short_reads` flag to be processed
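Before launching, the `long_read` + `bam` requirement can be sanity-checked with a short awk one-liner. This is a sketch independent of the pipeline; the toy samplesheet below is illustrative:

```bash
# Toy samplesheet with the same columns as above (illustrative only)
cat > toy_samplesheet.csv <<'EOF'
subject_id,sample_id,sequence_type,filetype,filepath
PATIENT1,SAMPLE1,long_read,bam,/path/to/sample1.bam
PATIENT1,SAMPLE1,fusion,tsv,/path/to/sample1_fusions.tsv
PATIENT1,SAMPLE2,long_read,bam,/path/to/sample2.bam
EOF

# Print any sample_id lacking a long_read + bam row (prints nothing when all samples are valid)
awk -F, 'NR > 1 { seen[$2]; if ($3 == "long_read" && $4 == "bam") ok[$2] }
         END { for (s in seen) if (!(s in ok)) print s " is missing a long_read+bam entry" }' toy_samplesheet.csv
```

Every sample in the toy sheet has a `long_read` + `bam` row, so the check prints nothing.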
To produce the necessary files, we recommend using the nf-core/nanoseq pipeline for alignment, or ctat-lr-fusion for fusion calling.
Now, you can run the pipeline using:
```bash
nextflow run kentsislab/proteomegenerator3 -r 1.2.2 \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --fasta <REF_GENOME> \
   --gtf <REF_GTF> \
   --outdir <OUTDIR>
```

Here `REF_GENOME` and `REF_GTF` are the reference genome and annotation (GTF), respectively. These can be from GENCODE or Ensembl, but should match the reference used to align the data.
> [!WARNING]
> Please provide pipeline parameters via the CLI or the Nextflow `-params-file` option. Custom config files, including those provided via the `-c` Nextflow option, can be used to provide any configuration except for parameters; see docs.
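For example, the parameters from the command above can be collected in a YAML file and passed with `-params-file` (a sketch; the paths are placeholders):

```yaml
# params.yaml — pipeline parameters only; process/executor settings belong in -c config files
input: samplesheet.csv
fasta: /path/to/genome.fa
gtf: /path/to/annotation.gtf
outdir: results
```

Then launch with `nextflow run kentsislab/proteomegenerator3 -r 1.2.2 -profile docker -params-file params.yaml`.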
To see all optional parameters that could be used with the pipeline and their explanations, use the help menu:
```bash
nextflow run kentsislab/proteomegenerator3 -r 1.2.2 --help
```

These options can be enabled with flags. For example:
```bash
nextflow run kentsislab/proteomegenerator3 -r 1.2.2 \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --fasta <REF_GENOME> \
   --gtf <REF_GTF> \
   --outdir <OUTDIR> \
   --filter_reads
```

This will pre-filter the BAM file on MAPQ and read length before transcript assembly is performed.
As another example, you can skip multi-sample transcript merging and process each sample independently:
```bash
nextflow run kentsislab/proteomegenerator3 -r 1.2.2 \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --fasta <REF_GENOME> \
   --gtf <REF_GTF> \
   --outdir <OUTDIR> \
   --skip_multisample
```

To include fusion predictions from ctat-lr-fusion in your proteome database, use the `--fusions` flag:
```bash
nextflow run kentsislab/proteomegenerator3 -r 1.2.2 \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --fasta <REF_GENOME> \
   --gtf <REF_GTF> \
   --outdir <OUTDIR> \
   --fusions
```

Note that when using `--fusions`, your samplesheet must include `fusion` + `tsv` entries with paths to ctat-lr-fusion output files. If fusion files are not provided for a sample, the pipeline will automatically skip fusion processing for that sample.
To include short-read RNA-seq data for complementary transcript assembly, use the --short_reads flag:
```bash
nextflow run kentsislab/proteomegenerator3 -r 1.2.2 \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --fasta <REF_GENOME> \
   --gtf <REF_GTF> \
   --outdir <OUTDIR> \
   --short_reads
```

When using `--short_reads`, your samplesheet should include `short_read` + `bam` entries:
```csv
subject_id,sample_id,sequence_type,filetype,filepath
PATIENT1,SAMPLE1,long_read,bam,/path/to/sample1_longread.bam
PATIENT1,SAMPLE1,short_read,bam,/path/to/sample1_shortread.bam
```

Short-read transcripts are assembled using StringTie, and the resulting ORF predictions are merged with long-read (Bambu) predictions in the final proteome database.
To run the latest development version, which may not be stable, use the `-r dev -latest` flags:

```bash
nextflow run kentsislab/proteomegenerator3 -r dev -latest --help
```

I have highlighted the following options here:
- `filter_reads`: pre-filter reads on MAPQ and read length
- `mapq`: minimum MAPQ for read filtering [default: 20]
- `read_len`: minimum read length for read filtering [default: 500]
- `filter_acc_reads`: filter out reads on accessory chromosomes, which sometimes cause issues for Bambu
- `skip_preprocessing`: use previously generated Bambu read classes
- `NDR`: modulate Bambu's novel discovery rate [default: 0.1]
- `recommended_NDR`: run Bambu with the recommended NDR (as determined by Bambu's algorithm)
- `skip_multisample`: skip multi-sample transcript merging and process samples individually
- `single_best_only`: select only the single best ORF per transcript [default: false]
- `fusions`: enable processing of fusion predictions from ctat-lr-fusion [default: false]. When enabled, fusion ORFs will be included in the final proteome database. Requires `fusion` + `tsv` entries in the samplesheet pointing to ctat-lr-fusion output files.
- `short_reads`: enable short-read RNA-seq assembly and quantification with StringTie [default: false]. Requires `short_read` + `bam` entries in the samplesheet pointing to aligned short-read BAM files. Short-read transcripts are assembled independently and merged with long-read predictions in the final proteome database.
- `uniprot_proteome`: local path to a UniProt proteome used for (i) BLAST-based ORF validation in the TransDecoder subworkflow and (ii) concatenation with the final proteome FASTA file
- `UPID`: UniProt proteome ID (UPID) for automated download (if no local path was provided with `uniprot_proteome`) [default: UP000005640]
- `min_orf_len`: minimum ORF length (in amino acids) for TransDecoder [default: 100]
- `min_lr_cts`: minimum full-length read counts for Bambu transcript filtering [default: 1.0]
- `min_stringtie_tpm`: minimum TPM for StringTie transcript merging [default: 1.0]
kentsislab/proteomegenerator3 was originally written by Asher Preska Steinberg.
We thank the following people for their extensive assistance in the development of this pipeline:
If you would like to contribute to this pipeline, please see the contributing guidelines.
If you use kentsislab/proteomegenerator3 for your analysis, please cite our manuscript:
End-to-end proteogenomics for discovery of cryptic and non-canonical cancer proteoforms using long-read transcriptomics and multi-dimensional proteomics
Katarzyna Kulej, Asher Preska Steinberg, Jinxin Zhang, Gabriella Casalena, Eli Havasov, Sohrab P. Shah, Andrew McPherson, Alex Kentsis.
bioRxiv. 2025 Aug 28. doi: 10.1101/2025.08.23.671943.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.