Home
Announcement: TD2 aka TransDecoder2 is newly available. Please give it a try - access it here.
TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly or reconstructed from genome-guided transcript annotations.
TransDecoder is meant to be applied to the entire transcriptome of a single organism, typically thousands of transcript sequences supplied together as input.
TransDecoder is unlikely to work if you provide a small number of sequences as input, as it requires training a species-specific model based on hundreds of candidates derived from the inputs.
TransDecoder identifies likely coding sequences based on the following criteria:
- a minimum length open reading frame (ORF) is found in a transcript sequence
- a log-likelihood score similar to what is computed by the GeneID software is > 0
- the above coding score is greatest when the ORF is scored in the 1st reading frame as compared to scores in the other 5 reading frames
- if a candidate ORF is found fully encapsulated by the coordinates of another candidate ORF, the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc.)
- a PSSM is built, trained, and used to refine the start codon prediction
- optionally, the putative peptide has a match in a protein homology search or a Pfam domain search
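The first criterion can be illustrated with a minimal Python sketch. It is not TransDecoder's own Perl code: the function names are hypothetical, it only reports complete ORFs (ATG through an in-frame stop codon, standard genetic code assumed), and it ignores partial ORFs, coding scores, and start codon refinement.

```python
# Illustrative sketch only; names and simplifications are assumptions,
# not TransDecoder's implementation.
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_complete_orfs(seq, min_aa=100):
    """Scan all six reading frames for ATG..stop ORFs of at least min_aa codons.

    Returns (strand, start, end, n_aa) tuples. Coordinates are 0-based,
    half-open, and relative to the strand that was scanned; n_aa excludes
    the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_aa = (i - start) // 3
                    if n_aa >= min_aa:
                        orfs.append((strand, start, i + 3, n_aa))
                    start = None  # look for the next ORF in this frame
    return orfs
```

Because each frame is scanned from the first in-frame ATG, shorter ORFs nested in the same frame are never reported separately, which loosely mirrors the "longer one is reported" rule above.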
The latest release of TransDecoder can be found here.
No compilation required - written in Perl, the favored scripting language of bioinformaticians back in the day.
The following Perl modules are needed and can be installed using cpanm like so:
curl -L https://cpanmin.us | perl - App::cpanminus
cpanm install URI::Escape
TransDecoder is also available for running via Docker or Singularity, and the images include BLAST+ and HMMER executables.
For most transcriptome use cases, run the full wrapper on the transcript FASTA:
TransDecoder --transcripts target_transcripts.fasta
The equivalent short form is:
TransDecoder -t target_transcripts.fasta
By default, TransDecoder extracts ORFs that are at least 100 amino acids long. You can lower this via -m, but the false positive rate increases substantially as the minimum length drops.
Common options:
TransDecoder -t target_transcripts.fasta \
-m 100 \
--single_best_only \
-O transdecoder_outdir
If the transcripts are oriented according to the sense strand, include -S to search only the top strand.
Include --complete_orfs_only if you want to exclude partials, but note that in the case of a 5' partial ORF the true start codon may lie further upstream.
The final set of candidate coding regions is written as *.transdecoder.* files, including .pep, .cds, .gff3, and .bed.
The wrapper can now perform the cDNA extraction and genome-coordinate propagation for you.
Provide both the genome FASTA and transcript GTF:
TransDecoder \
--genome test.genome.fasta \
--gtf transcripts.gtf
If -t/--transcripts is omitted in genome mode, a cDNA FASTA is created automatically in the output directory using the GTF basename.
This workflow performs the following steps internally:
- extracts transcript cDNA sequences from the genome
- predicts ORFs on transcripts
- propagates final ORF coordinates back to genome coordinates
The genome-based coding annotation is written as:
transcripts.fasta.transdecoder.genome.gff3
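The coordinate propagation step can be pictured with a small Python sketch. It is an illustration under stated assumptions rather than the wrapper's actual code: `transcript_to_genome` is a hypothetical helper, exons are given as 1-based inclusive genome intervals, and positions are 1-based along the transcript in its 5' to 3' direction.

```python
# Illustrative sketch only; the helper name and input layout are assumptions.
def transcript_to_genome(pos, exons, strand):
    """Map a 1-based transcript position to a 1-based genome position.

    exons  : list of (start, end) genome intervals, 1-based inclusive
    strand : "+" or "-" (orientation of the transcript on the genome)
    """
    if pos < 1:
        raise ValueError("transcript positions are 1-based")
    # Walk the exons in transcript order: ascending for "+", descending for "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    remaining = pos
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            if strand == "+":
                return start + remaining - 1
            return end - remaining + 1
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")
```

For a transcript with exons (100, 199) and (300, 399) on the plus strand, transcript position 101 falls on the first base of the second exon, so it maps to genome position 300. An ORF's start and end can be mapped the same way, with the CDS split wherever it crosses an exon boundary.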
The sample_data/ directory includes runnable examples covering transcript-only, genome+GTF, PASA, StringTie, and supertranscript workflows.
Final outputs are written to the chosen output directory:
transcripts.fasta.transdecoder.pep : peptide sequences for final candidate ORFs
transcripts.fasta.transdecoder.cds : nucleotide coding sequences for final candidate ORFs
transcripts.fasta.transdecoder.gff3 : ORF coordinates on the target transcripts
transcripts.fasta.transdecoder.bed : BED file describing ORF positions on the transcripts
transcripts.fasta.transdecoder.genome.gff3 : genome-coordinate ORF annotations when genome/GTF mode is used
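As a quick sanity check on the peptide output, the sketch below counts sequences and reports their length range. It assumes only that the .pep file is ordinary FASTA and that a trailing `*` (if present) marks a stop codon; the function names are hypothetical and nothing here depends on the header layout.

```python
# Illustrative sketch only; generic FASTA handling, not a TransDecoder API.
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_peptides(handle):
    """Return the count, shortest, and longest peptide length (stop '*' excluded)."""
    lengths = [len(seq.rstrip("*")) for _, seq in read_fasta(handle)]
    return {
        "count": len(lengths),
        "min": min(lengths, default=0),
        "max": max(lengths, default=0),
    }


# Usage with an in-memory example; for real data, pass
# open("transcripts.fasta.transdecoder.pep") instead.
example = io.StringIO(">orf1 example\nMKV\nLL*\n>orf2\nMAAA\n")
summary = summarize_peptides(example)
```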
A working directory named <basename(transcripts)>.transdecoder_dir/ is created under the selected output directory and contains intermediate files such as:
longest_orfs.pep : all ORFs meeting the minimum length criteria, regardless of coding potential
longest_orfs.gff3 : positions of all ORFs found in the target transcripts
longest_orfs.cds : nucleotide coding sequences for all detected ORFs
longest_orfs.cds.top_500_longest : longest CDS entries selected for model training
hexamer.scores : log-likelihood score for each k-mer (coding/random)
longest_orfs.cds.scores : log-likelihood scores for candidate ORFs
longest_orfs.cds.scores.selected : ORFs selected based on the coding criteria
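The idea behind the score files can be sketched as a log-likelihood ratio over k-mers. The Python below is a deliberately simplified illustration, not TransDecoder's actual model: it assumes hexamers (k = 6), counts in-frame hexamers from training CDSs against hexamers at every offset of background sequences, and adds a pseudocount; all function names are hypothetical.

```python
# Illustrative sketch only; a simplified stand-in for the real scoring model.
import math
from collections import Counter

K = 6  # hexamers, an assumption based on the hexamer.scores file name


def kmer_counts(seqs, step):
    """Count k-mers in each sequence, sampling every `step` bases."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - K + 1, step):
            counts[s[i:i + K]] += 1
    return counts


def train_scores(coding_seqs, background_seqs, pseudocount=1.0):
    """Return a function mapping a k-mer to log(P_coding / P_background)."""
    coding = kmer_counts(coding_seqs, step=3)          # in-frame k-mers only
    background = kmer_counts(background_seqs, step=1)  # every offset
    n_kmers = 4 ** K
    c_total = sum(coding.values()) + pseudocount * n_kmers
    b_total = sum(background.values()) + pseudocount * n_kmers

    def score(kmer):
        p_coding = (coding[kmer] + pseudocount) / c_total
        p_background = (background[kmer] + pseudocount) / b_total
        return math.log(p_coding / p_background)

    return score


def score_orf(seq, score, frame=0):
    """Sum k-mer scores over codon-aligned positions starting at `frame`."""
    s = seq.upper()
    return sum(score(s[i:i + K]) for i in range(frame, len(s) - K + 1, 3))


def passes_coding_criteria(seq, score):
    """True if the frame-1 score is positive and the best of all six frames."""
    fwd = seq.upper()
    rev = fwd.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    scores = [score_orf(s, score, f) for s in (fwd, rev) for f in range(3)]
    return scores[0] > 0 and scores[0] == max(scores)
```

In this sketch the training CDSs would play the role of the entries in longest_orfs.cds.top_500_longest, and `passes_coding_criteria` combines the "score > 0" and "best in the 1st reading frame" criteria described earlier.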