Commit 5ab48d2

authored
Update README.md
1 parent 77b56ce commit 5ab48d2

1 file changed

+76
-41
lines changed

README.md

Lines changed: 76 additions & 41 deletions
Original file line number | Diff line number | Diff line change
@@ -1,9 +1,16 @@
1-
For StringTie's manual and prepared source and binary packages, please refer to the official website: <https://ccb.jhu.edu/software/stringtie>
1+
![alt text](https://img.shields.io/badge/License-MIT-blue.svg "MIT License")
2+
3+
## StringTie: efficient transcript assembly and quantitation of RNA-Seq data
4+
5+
StringTie employs efficient algorithms for transcript structure recovery and abundance estimation from bulk RNA-Seq reads aligned to a reference genome.
6+
It takes as input spliced alignments in coordinate-sorted SAM/BAM/CRAM format and produces a GTF output which consists of assembled
7+
transcript structures and their estimated expression levels (FPKM/TPM and base coverage values).
8+
9+
For additional StringTie documentation and the latest official source and binary packages, please refer to the official website: <https://ccb.jhu.edu/software/stringtie>
210

311
## Obtaining and installing StringTie
412

5-
Source and binary packages for this software, along with a small test data set
6-
can be directly downloaded from the [Releases](https://github.com/gpertea/stringtie/releases) page for this repository.
13+
Source and binary packages for this software can be directly downloaded from the [Releases](https://github.com/gpertea/stringtie/releases) page for this repository.
714
StringTie is compatible with a wide range of Linux and Apple OS systems.
815
The main program (StringTie) does not have any other library dependencies (besides zlib) and in order to compile it from source it requires
916
a C++ compiler which supports the C++ 11 standard (GCC 4.8 or newer).
@@ -16,38 +23,52 @@ git clone https://github.com/gpertea/stringtie
1623
cd stringtie
1724
make release
1825
```
26+
During the first run of the above `make` command, a few library dependencies will be downloaded and compiled, but any subsequent StringTie updates (using `git pull`)
27+
should rebuild much faster.
1928

20-
If the compilation is successful, the resulting `stringtie` binary can then be copied to
21-
a programs directory of choice.
29+
To complete the installation, the resulting `stringtie` binary can then be copied to a programs directory of choice (preferably one that is in the current shell's PATH).
2230

23-
Installation of StringTie this way should take less than a minute on a regular Linux or Apple MacOS
31+
Building and installing StringTie this way should take less than a minute on a regular Linux or Apple MacOS
2432
desktop.
2533

26-
Note that simply running `make` would produce an executable which is more suitable for debugging
27-
and runtime checking but which can be significantly slower than the optimized version which
28-
is obtained by using `make release` as instructed above.
34+
Note that simply running `make` would produce a less optimized executable, which is suitable for debugging
35+
and runtime checking but is significantly slower than the optimized version
36+
built with the `make release` command as instructed above.
2937

3038
### Using pre-compiled (binary) releases
3139
Instead of compiling from source, some users may prefer to download an already compiled binary for Linux
32-
and Apple OS X, ready to run. These binary package releases are compiled on older versions of these
33-
operating systems in order to provide compatibility with a wide range of (older) OS versions, not just the most recent distributions.
40+
and Apple MacOS, ready to run. These binary package releases are compiled on older versions of these
41+
operating systems in order to provide compatibility with a wide range of OS versions, not just the most recent distributions.
3442
These precompiled packages are made available on the <a href="https://github.com/gpertea/stringtie/releases">Releases</a> page for this repository.
3543
Please note that these binary packages do not include the optional [super-reads module](#the-super-reads-module),
36-
which currently can only be built on Linux machines, from the source made available in this repository.
44+
which currently can only be built on Linux machines from the source made available in this repository.
3745

3846
## Running StringTie
3947

40-
Run stringtie from the command line like this:
48+
The generic command line for the default usage has this format:
4149
```
42-
stringtie [options] <aligned_reads.bam>
50+
stringtie [-o <output.gtf>] [other_options] <read_alignments.bam>
4351
```
44-
The main input of the program is a SAMTools BAM file with RNA-Seq mappings
45-
sorted by genomic location (for example the accepted_hits.bam file produced
46-
by TopHat).
52+
The main output is a GTF file containing the structural definitions of the transcripts assembled by StringTie from the read alignment data. The name of the output file should be specified with the `-o` option. If this `-o` option is not used, the output GTF with the assembled transcripts will be printed to the standard
53+
output (and can be captured into a file using the `>` output redirect operator).
4754

48-
The main output of the program is a GTF file containing the structural definitions of the transcripts assembled by StringTie from the read alignment data. The name of the output file should be specified by with the `-o` option.
55+
The main input of the program (_<read_alignments.bam>_) must be a SAM, BAM or CRAM file with RNA-Seq read
56+
alignments sorted by their genomic location (for example the `accepted_hits.bam` file produced
57+
by TopHat, or HISAT2 output sorted with `samtools sort` etc.).
58+
59+
__Note__: if the `--mix` option is used, StringTie expects two alignment files to be given as positional parameters, in a specific order: the short read alignments must be the first file given while the long read alignments must be the second input file. Both alignment files must be sorted by genomic location.
60+
```
61+
stringtie [-o <output.gtf>] --mix [other_options] <short_read_alns.bam> <long_read_alns.bam>
62+
```
63+
64+
Note that the command line parser in StringTie allows arbitrary order and mixing of the positional parameters with the other options of the program, so the input alignment files can also precede or be given in between the other options -- the following command line is equivalent to the one above:
65+
66+
```
67+
stringtie <short_read_alns.bam> <long_read_alns.bam> --mix [other_options] [-o <output.gtf>]
68+
```
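
As a minimal sketch of the `--mix` ordering rule described above, the helper below only prints a command line with the short-read file placed before the long-read file; it does not run StringTie. The function name `mix_cmd` and the file names in the usage note are hypothetical, and the sketch assumes paths without whitespace.

```shell
# Hypothetical helper: print a StringTie --mix command line with the
# short-read alignments first and the long-read alignments second.
# Usage: mix_cmd <short_read_alns.bam> <long_read_alns.bam> <output.gtf>
mix_cmd() {
  if [ "$#" -ne 3 ]; then
    echo "usage: mix_cmd <short_read_alns.bam> <long_read_alns.bam> <output.gtf>" >&2
    return 1
  fi
  # Only the options named in this README are used: --mix and -o.
  printf 'stringtie --mix -o %s %s %s\n' "$3" "$1" "$2"
}
```

For example, `mix_cmd short.bam long.bam mixed.gtf` prints a command that can be reviewed before running it. Both input files still need to be sorted by genomic location.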
4969

5070
### Running StringTie on the provided test/demo data
71+
5172
When building from this source repository, after the program was compiled with `make release` as instructed above, the generated binary can be tested on a small data set with a command like this:
5273
```
5374
make test
@@ -89,26 +110,28 @@ stringtie -L -o long_reads.out.gtf long_reads.bam
89110
stringtie -L -G human-chr19_P.gff -o long_reads_guided.out.gtf long_reads.bam
90111
```
91112

92-
The above runs should take around one second each on a regular Linux or MacOS desktop.
93-
(see also <a href="https://github.com/gpertea/stringtie/blob/master/test_data/README.md">test_data/README.md</a>).
94-
95113
For very large data sets one can expect up to one hour of processing time. A minimum of 8 GB of RAM is recommended for running StringTie on regular size RNA-Seq samples, with 16 GB or more being strongly advised for larger data sets.
96114

97115

98116
### StringTie options
99117

100-
The following optional parameters can be specified (use -h/--help to get the
101-
usage message):
118+
The following optional parameters can be specified (use `-h` or `--help` to get the complete usage message):
102119
```
120+
Options:
103121
--version : print just the version at stdout and exit
104-
--conservative : conservative transcriptome assembly, same as -t -c 1.5 -f 0.05
105-
--rf assume stranded library fr-firststrand
106-
--fr assume stranded library fr-secondstrand
107-
-G reference annotation to use for guiding the assembly process (GTF/GFF3)
122+
--conservative : conservative transcript assembly, same as -t -c 1.5 -f 0.05
123+
--mix : both short and long read data alignments are provided
124+
(long read alignments must be the 2nd BAM/CRAM input file)
125+
--rf : assume stranded library fr-firststrand
126+
--fr : assume stranded library fr-secondstrand
127+
-G reference annotation to use for guiding the assembly process (GTF/GFF)
128+
--ptf : load point-features from a given 4 column feature file <f_tab>
108129
-o output path/file name for the assembled transcripts GTF (default: stdout)
109130
-l name prefix for output transcripts (default: STRG)
110131
-f minimum isoform fraction (default: 0.01)
111-
-L use long reads settings (default:false)
132+
-L long reads processing; also enforces -s 1.5 -g 0 (default:false)
133+
-R if long reads are provided, just clean and collapse the reads but
134+
do not assemble
112135
-m minimum assembled transcript length (default: 200)
113136
-a minimum anchor length for junctions (default: 10)
114137
-j minimum junction coverage (default: 1)
@@ -123,14 +146,19 @@ usage message):
123146
-M fraction of bundle allowed to be covered by multi-hit reads (default:1)
124147
-p number of threads (CPUs) to use (default: 1)
125148
-A gene abundance estimation output file
149+
-E define window around possibly erroneous splice sites from long reads to
150+
look out for correct splice sites (default: 25)
126151
-B enable output of Ballgown table files which will be created in the
127152
same directory as the output GTF (requires -G, -o recommended)
128153
-b enable output of Ballgown table files but these files will be
129154
created under the directory path given as <dir_path>
130155
-e only estimate the abundance of given reference transcripts (requires -G)
156+
--viral : only relevant for long reads from viral data where splice sites
157+
do not follow consensus (default:false)
131158
-x do not assemble any transcripts on the given reference sequence(s)
132159
-u no multi-mapping correction (default: correction enabled)
133160
-h print this usage message and exit
161+
--ref/--cram-ref reference genome FASTA file for CRAM input
134162
135163
Transcript merge usage mode:
136164
stringtie --merge [Options] { gtf_list | strg1.gtf ...}
@@ -153,12 +181,13 @@ the following options are available:
153181
-i keep merged transcripts with retained introns; by default
154182
these are not kept unless there is strong evidence for them
155183
-l <label> name prefix for output transcripts (default: MSTRG)
184+
156185
```
157186

158187
## Input files
159188

160-
StringTie takes as input a binary SAM (BAM) file sorted by reference position.
161-
This file contains spliced read alignments such as the ones produced by TopHat or HISAT2.
189+
StringTie takes as input a SAM, BAM or CRAM file sorted by coordinate (genomic location).
190+
This file should contain spliced RNA-seq read alignments such as the ones produced by TopHat or HISAT2.
162191
A text file in SAM format should be converted to BAM and sorted using the
163192
samtools program:
164193
```
@@ -167,18 +196,24 @@ samtools view -Su alns.sam | samtools sort - alns.sorted
167196
The file resulting from the above command (`alns.sorted.bam`) can be used
168197
directly as input to StringTie.
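
One quick way to check the sort order before running StringTie is to look at the `SO` field of the `@HD` header line. The sketch below assumes the SAM header is supplied as text on standard input; the function name `is_coordinate_sorted` is hypothetical. It only reports what the header declares and does not verify the actual order of the records.

```shell
# Hypothetical helper: succeed (exit status 0) if the SAM header read from
# stdin declares coordinate sorting (an @HD line with SO:coordinate).
is_coordinate_sorted() {
  awk -F'\t' '
    $1 == "@HD" {
      for (i = 2; i <= NF; i++)
        if ($i == "SO:coordinate") ok = 1
    }
    END { exit(ok ? 0 : 1) }
  '
}
```

A possible usage is `samtools view -H alns.sorted.bam | is_coordinate_sorted && echo "coordinate-sorted"`. With samtools 1.3 or newer, a SAM file can also be sorted and converted in one step with `samtools sort -o alns.sorted.bam alns.sam`.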
169198

170-
Any SAM spliced read alignment (a read alignment across at least one junction)
171-
needs to contain the XS tag to indicate the strand from which the RNA that produced
172-
this read originated. TopHat alignments already include this tag, but if you use
199+
Any SAM record with a spliced alignment (i.e. having a read alignment across at least one junction)
200+
should have the `XS` tag to indicate the transcription strand - the genomic strand from which the RNA that produced
201+
this read originated. TopHat and HISAT2 alignments already include this tag, but if you use
173202
a different read mapper you should check that this tag is also included for spliced alignment
174-
records. For example HISAT2 should be run with the `--dta` option in order to tag spliced
175-
alignments this way. As explained above, the alignments in SAM format should be sorted and
176-
preferrably converted to BAM.
177-
178-
Optionally, a reference annotation file in GTF/GFF3 format can be provided to StringTie
179-
using the `-G` option. In this case, StringTie will check to see if the reference transcripts
180-
are expressed in the RNA-Seq data, and for the ones that are expressed it will compute coverage
181-
and FPKM values.
203+
records. The STAR aligner should be run with the option `--outSAMstrandField intronMotif` in order to generate this tag.
204+
205+
The `XS` tags are not necessary for long RNA-seq reads aligned with `minimap2`
206+
using the `-ax splice` option. minimap2 adds the `ts` tag to spliced alignments to indicate the transcription strand
207+
(though in a different manner than the `XS` tag), and StringTie recognizes the `ts` tag if the `XS` tag is missing.
208+
Thus the long read spliced alignments produced by `minimap2` can also be assembled by StringTie (with the option `-L` or
209+
as the 2nd input file for the `--mix` option).
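
To check whether a given mapper's output carries the strand tags discussed above, the following sketch counts spliced records that have neither an `XS` nor a `ts` tag. It assumes SAM text on standard input and treats any record whose CIGAR string contains an `N` operation as spliced; the function name `count_unstranded_spliced` is hypothetical.

```shell
# Hypothetical helper: count spliced SAM records (CIGAR contains N) that
# carry neither an XS:A nor a ts:A strand tag. Header lines are skipped.
count_unstranded_spliced() {
  awk -F'\t' '
    /^@/ { next }                      # skip SAM header lines
    $6 ~ /N/ {                         # column 6 is the CIGAR string
      tagged = 0
      for (i = 12; i <= NF; i++)       # optional tags start at column 12
        if ($i ~ /^XS:A:[+-]$/ || $i ~ /^ts:A:[+-]$/) tagged = 1
      if (!tagged) missing++
    }
    END { print missing + 0 }
  '
}
```

A possible usage is `samtools view alns.sorted.bam | count_unstranded_spliced`; a result of 0 suggests that every spliced alignment has one of the two tags.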
210+
211+
As explained above, the alignments must be sorted by coordinate before they can be used as input for StringTie.
212+
213+
Optionally, a reference annotation file in GTF or GFF3 format can be provided to StringTie
214+
using the `-G` option. The reference transcripts in this file can be used as 'guides' for the assembly process, or their expression levels
215+
can be directly estimated (without any assembly) when the `-e` option is given.
216+
182217
Note that the reference transcripts should be fully covered by reads in order to be included
183218
in StringTie's output with the original ID of the reference transcript shown in the
184219
_`reference_id`_ GTF attribute in the output file. Other transcripts assembled from
