
Announcement: TD2 aka TransDecoder2 is newly available. Please give it a try - access it here.

TransDecoder (Find Coding Regions Within Transcripts)

TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly or reconstructed from genome-guided transcript annotations.

TransDecoder is meant to be applied to an entire transcriptome from a single organism, with thousands of transcript sequences as input.

TransDecoder is unlikely to work if you provide a small number of sequences as input, as it requires training a species-specific model based on hundreds of candidates derived from the inputs.

TransDecoder identifies likely coding sequences based on the following criteria:

a minimum length open reading frame (ORF) is found in a transcript sequence
a log-likelihood score similar to what is computed by the GeneID software is > 0
the above coding score is greatest when the ORF is scored in the 1st reading frame as compared to scores in the other 5 reading frames
if a candidate ORF is found fully encapsulated by the coordinates of another candidate ORF, the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc.)
a PSSM is built, trained, and used to refine the start codon prediction
optionally, the putative peptide has a match in a protein homology search or a Pfam domain search
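
The first and fourth criteria above (minimum ORF length, and dropping ORFs encapsulated by a longer one) can be illustrated with a minimal Python sketch. It is not TransDecoder's Perl implementation: it only considers complete ATG-to-stop ORFs, the function names `find_orfs` and `drop_contained` are hypothetical, and comparing containment only within the same strand is an assumption made here for simplicity. Coordinates are 0-based, half-open, and for the "-" strand they refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_aa=100):
    """Return (strand, start, end) for complete ORFs in all six frames.

    An ORF runs from an ATG to the first in-frame stop codon (inclusive)
    and must encode at least min_aa amino acids (stop not counted).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_aa:
                        orfs.append((strand, start, i + 3))
                    start = None
    return orfs


def drop_contained(orfs):
    """Keep the longer ORF when one lies fully inside another (same strand)."""
    kept = []
    for strand, start, end in sorted(orfs, key=lambda o: o[2] - o[1], reverse=True):
        inside = any(
            k_strand == strand and k_start <= start and end <= k_end
            for k_strand, k_start, k_end in kept
        )
        if not inside:
            kept.append((strand, start, end))
    return kept


if __name__ == "__main__":
    toy = "ATG" + "GCC" * 5 + "TAA"
    print(drop_contained(find_orfs(toy, min_aa=3)))
```

The default of 100 amino acids mirrors the documented default minimum ORF length; the toy call lowers it only so that a short example sequence yields a result.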

Obtaining TransDecoder

The latest release of TransDecoder can be found here.

No compilation required - written in Perl, the favored scripting language of bioinformaticians back in the day.

The following Perl module is needed and can be installed using cpanm like so:

curl -L https://cpanmin.us | perl - App::cpanminus
cpanm URI::Escape

TransDecoder is also available for running via Docker or Singularity and includes BLAST+ and HMMER executables.

Running TransDecoder

Predicting coding regions from a transcript FASTA file

For most transcriptome use cases, run the full wrapper on the transcript FASTA:

TransDecoder --transcripts target_transcripts.fasta

The equivalent short form is:

TransDecoder -t target_transcripts.fasta

By default, TransDecoder extracts ORFs that are at least 100 amino acids long. You can lower this via -m, but the false positive rate increases substantially as the minimum length drops.

Common options:

 TransDecoder -t target_transcripts.fasta \
    -m 100 \
    --single_best_only \
    -O transdecoder_outdir

If the transcripts are oriented according to the sense strand, include -S to search only the top strand.

Include --complete_orfs_only if you want to exclude partials, but note that in the case of a 5' partial ORF the true start codon may lie further upstream.
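
To make the distinction between complete and partial ORFs concrete, here is a small Python sketch that labels a translated ORF by whether it begins with a methionine and ends with a stop. The function name `classify_orf` and the label strings are illustrative assumptions, not an interface provided by TransDecoder, and the rule is deliberately simplistic: an ORF that starts with M at the very 5' end of a transcript may still be missing its true upstream start.

```python
def classify_orf(peptide):
    """Label an ORF translation; '*' marks a translated stop codon."""
    has_start = peptide.startswith("M")
    has_stop = peptide.endswith("*")
    if has_start and has_stop:
        return "complete"
    if has_stop:
        return "5prime_partial"   # stop present, start codon missing
    if has_start:
        return "3prime_partial"   # start present, stop codon missing
    return "internal"             # neither end is present


if __name__ == "__main__":
    for pep in ("MKTAY*", "KTAY*", "MKTAY", "KTAY"):
        print(pep, classify_orf(pep))
```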

The final set of candidate coding regions is written as *.transdecoder.* files, including .pep, .cds, .gff3, and .bed.

Starting from a genome FASTA and transcript-structure GTF file

The wrapper can now perform the cDNA extraction and genome-coordinate propagation for you.

Provide both the genome FASTA and transcript GTF:

 TransDecoder \
    --genome test.genome.fasta \
    --gtf transcripts.gtf

If -t/--transcripts is omitted in genome mode, a cDNA FASTA is created automatically in the output directory using the GTF basename.

This workflow performs the following steps internally:

extracts transcript cDNA sequences from the genome
predicts ORFs on transcripts
propagates final ORF coordinates back to genome coordinates
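
The last step, propagating ORF coordinates from transcript space back to the genome, can be sketched in Python as follows. This is an illustration of the general idea under stated assumptions, not the wrapper's actual code: the function name `transcript_to_genome` is hypothetical, coordinates are 1-based and inclusive, and exons are given as genomic (start, end) pairs for a single transcript.

```python
def transcript_to_genome(exons, strand, t_start, t_end):
    """Map a transcript interval onto genomic segments, one per exon touched.

    exons   : list of (g_start, g_end) genomic coordinates, 1-based inclusive
    strand  : "+" or "-"
    t_start : first transcript position of the ORF (1-based)
    t_end   : last transcript position of the ORF (inclusive)
    """
    # Walk exons in transcript order: ascending on "+", descending on "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    segments = []
    offset = 0  # transcript bases consumed by the exons already visited
    for g_start, g_end in ordered:
        length = g_end - g_start + 1
        ex_t_start, ex_t_end = offset + 1, offset + length
        lo, hi = max(t_start, ex_t_start), min(t_end, ex_t_end)
        if lo <= hi:  # the ORF overlaps this exon
            if strand == "+":
                segments.append((g_start + (lo - ex_t_start),
                                 g_start + (hi - ex_t_start)))
            else:
                segments.append((g_end - (hi - ex_t_start),
                                 g_end - (lo - ex_t_start)))
        offset += length
    return sorted(segments)


if __name__ == "__main__":
    exons = [(101, 200), (301, 400)]
    print(transcript_to_genome(exons, "+", 51, 150))
    print(transcript_to_genome(exons, "-", 1, 120))
```

An ORF that spans an exon junction comes back as several genomic segments, which is why a genome-coordinate annotation can hold multiple CDS rows for one ORF.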

The genome-based coding annotation is written as:

transcripts.fasta.transdecoder.genome.gff3

Sample data and execution

The sample_data/ directory includes runnable examples covering transcript-only, genome+GTF, PASA, StringTie, and supertranscript workflows.

Output files explained

Final outputs are written to the chosen output directory:

transcripts.fasta.transdecoder.pep         : peptide sequences for final candidate ORFs
transcripts.fasta.transdecoder.cds         : nucleotide coding sequences for final candidate ORFs
transcripts.fasta.transdecoder.gff3        : ORF coordinates on the target transcripts
transcripts.fasta.transdecoder.bed         : BED file describing ORF positions on the transcripts
transcripts.fasta.transdecoder.genome.gff3 : genome-coordinate ORF annotations when genome/GTF mode is used
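
As a quick sanity check on the peptide output, the following Python sketch reads a FASTA file and reports the length of each peptide. It assumes only that the .pep file is standard FASTA with the record identifier as the first whitespace-delimited token of the header; the records used in the demo are made up, and the helper names `read_fasta` and `peptide_lengths` are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def peptide_lengths(handle):
    """Map each identifier to its peptide length, ignoring a trailing '*'."""
    return {name: len(seq.rstrip("*")) for name, seq in read_fasta(handle)}


if __name__ == "__main__":
    # Made-up records standing in for a real .pep file.
    demo = io.StringIO(">orf1 example header\nMKT\nAA*\n>orf2\nMM\n")
    print(peptide_lengths(demo))
```

For a real run, replace the StringIO demo with `open("transcripts.fasta.transdecoder.pep")` or whichever path matches your output directory.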

A working directory named <basename(transcripts)>.transdecoder_dir/ is created under the selected output directory and contains intermediate files such as:

longest_orfs.pep                 : all ORFs meeting the minimum length criteria, regardless of coding potential
longest_orfs.gff3                : positions of all ORFs found in the target transcripts
longest_orfs.cds                 : nucleotide coding sequences for all detected ORFs
longest_orfs.cds.top_500_longest : longest CDS entries selected for model training
hexamer.scores                   : log-likelihood score for each k-mer (coding/random)
longest_orfs.cds.scores          : log-likelihood scores for candidate ORFs
longest_orfs.cds.scores.selected : ORFs selected based on the coding criteria
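
The k-mer log-likelihood scores listed above can be illustrated with a simplified Python sketch. It is a toy model under stated assumptions, not the scoring TransDecoder actually performs: it uses hexamers (suggested by the hexamer.scores file name), counts in-frame hexamers in the training coding sequences against all hexamers in a background set, adds a pseudocount, and sums log ratios along a chosen reading frame. All function names are hypothetical.

```python
import math
from collections import Counter

K = 6  # hexamers


def kmer_counts(seqs, step):
    """Count k-mers in each sequence, sampling every `step` bases."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - K + 1, step):
            counts[s[i:i + K]] += 1
    return counts


def train_scores(coding_seqs, background_seqs, pseudo=1.0):
    """Return a function giving log(P(kmer | coding) / P(kmer | background))."""
    coding = kmer_counts(coding_seqs, step=3)          # in-frame k-mers only
    background = kmer_counts(background_seqs, step=1)  # every position
    c_total = sum(coding.values()) + pseudo * 4 ** K
    b_total = sum(background.values()) + pseudo * 4 ** K

    def score(kmer):
        p_coding = (coding[kmer] + pseudo) / c_total
        p_background = (background[kmer] + pseudo) / b_total
        return math.log(p_coding / p_background)

    return score


def score_frame(seq, score, frame=0):
    """Sum the k-mer scores along one forward reading frame of seq."""
    return sum(score(seq[i:i + K]) for i in range(frame, len(seq) - K + 1, 3))


if __name__ == "__main__":
    score = train_scores(["ATGGCC" * 20], ["ACGTTGCA" * 20])
    candidate = "ATGGCC" * 5
    for frame in range(3):
        print(frame, round(score_frame(candidate, score, frame), 3))
```

In the toy run the candidate scores highest in frame 0, which matches the criterion that a coding ORF should score best in its own reading frame; scoring the reverse-complement frames would follow the same pattern.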
