forked from skovaka/stringtie2
Obtaining and installing StringTie
----------------------------------
To build StringTie from this GitHub repository, run the following commands:
  git clone https://github.com/gpertea/stringtie
  cd stringtie
  make release
Note that running plain "make" produces an executable intended for debugging
and runtime checking, which is significantly slower than the optimized
version built with "make release".
Running StringTie
-----------------
Run stringtie from the command line like this:
  stringtie [options] <aligned_reads.bam>
The main input of the program is a SAMtools BAM file with RNA-Seq mappings
sorted by genomic location (for example the accepted_hits.bam file produced
by TopHat).
The following optional parameters can be specified (use -h/--help to get the
usage message):
 --version : print just the version at stdout and exit
 --conservative : conservative transcriptome assembly, same as -t -c 1.5 -f 0.05
 --rf assume stranded library fr-firststrand
 --fr assume stranded library fr-secondstrand
 -G reference annotation to use for guiding the assembly process (GTF/GFF3)
 -o output path/file name for the assembled transcripts GTF (default: stdout)
 -l name prefix for output transcripts (default: STRG)
 -f minimum isoform fraction (default: 0.01)
 -L use long reads settings (default: false)
 -m minimum assembled transcript length (default: 200)
 -a minimum anchor length for junctions (default: 10)
 -j minimum junction coverage (default: 1)
 -t disable trimming of predicted transcripts based on coverage
    (default: coverage trimming is enabled)
 -c minimum reads per bp coverage to consider for multi-exon transcript
    (default: 1)
 -s minimum reads per bp coverage to consider for single-exon transcript
    (default: 4.75)
 -v verbose (log bundle processing details)
 -g maximum gap allowed between read mappings (default: 50)
 -M fraction of bundle allowed to be covered by multi-hit reads (default: 1)
 -p number of threads (CPUs) to use (default: 1)
 -A gene abundance estimation output file
 -B enable output of Ballgown table files which will be created in the
    same directory as the output GTF (requires -G, -o recommended)
 -b <dir_path> enable output of Ballgown table files but these files will be
    created under the directory path given as <dir_path>
 -e only estimate the abundance of given reference transcripts (requires -G)
 -x do not assemble any transcripts on the given reference sequence(s)
 -u no multi-mapping correction (default: correction enabled)
 -h print this usage message and exit
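
As an illustration of how these options combine, the sketch below wraps a
typical assembly run in a small shell function. The function name
assemble_sample and all file names are hypothetical placeholders; only
options from the list above are used.

```shell
# Sketch: assemble transcripts for one sample.
# assemble_sample and the file names are placeholders, not part of StringTie.
assemble_sample() {
    bam="$1"      # coordinate-sorted BAM with RNA-Seq alignments
    out_gtf="$2"  # where to write the assembled transcripts
    label="$3"    # name prefix for output transcripts (replaces STRG)

    # -p 4   : use 4 threads
    # -m 200 : minimum transcript length (the default, spelled out for clarity)
    stringtie -p 4 -m 200 -l "$label" -o "$out_gtf" "$bam"
}

# Example call, assuming stringtie is on your PATH:
# assemble_sample alns.sorted.bam sample1.gtf SAMPLE1
```
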
Transcript merge usage mode:
  stringtie --merge [Options] { gtf_list | strg1.gtf ...}
With this option StringTie will assemble transcripts from multiple
input files generating a unified non-redundant set of isoforms. In this mode
the following options are available:
  -G <guide_gff>  reference annotation to include in the merging (GTF/GFF3)
  -o <out_gtf>    output file name for the merged transcripts GTF
                  (default: stdout)
  -m <min_len>    minimum input transcript length to include in the merge
                  (default: 50)
  -c <min_cov>    minimum input transcript coverage to include in the merge
                  (default: 0)
  -F <min_fpkm>   minimum input transcript FPKM to include in the merge
                  (default: 1.0)
  -T <min_tpm>    minimum input transcript TPM to include in the merge
                  (default: 1.0)
  -f <min_iso>    minimum isoform fraction (default: 0.01)
  -g <gap_len>    gap between transcripts to merge together (default: 250)
  -i              keep merged transcripts with retained introns; by default
                  these are not kept unless there is strong evidence for them
  -l <label>      name prefix for output transcripts (default: MSTRG)
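
A merge run could look like the following sketch. The function name
merge_assemblies and the file names are hypothetical; the per-sample GTF files
are passed directly on the command line, as allowed by the usage line above.

```shell
# Sketch: merge per-sample assemblies into one non-redundant transcript set.
# merge_assemblies and the file names are placeholders, not part of StringTie.
merge_assemblies() {
    guide="$1"; shift    # reference annotation (GTF/GFF3)
    merged="$1"; shift   # output GTF for the merged transcripts
    # All remaining arguments are per-sample GTF files.
    stringtie --merge -G "$guide" -o "$merged" "$@"
}

# Example call, assuming stringtie is on your PATH:
# merge_assemblies ref.gtf merged.gtf sample1.gtf sample2.gtf
```
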
Input files
===========
StringTie takes as input a binary SAM (BAM) file sorted by reference position.
This file contains spliced read alignments such as the ones produced by TopHat or HISAT2.
A text file in SAM format should be converted to BAM and sorted using the
samtools program:
  samtools view -Su alns.sam | samtools sort - alns.sorted
The file resulting from the above command (alns.sorted.bam) can be used
directly as input to StringTie.
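
The command above uses the syntax of older samtools releases, where the last
argument to "sort" is an output prefix. Recent releases (1.3 and later, as far
as we are aware) take the output file via -o and can read SAM input directly.
A sketch of the equivalent step, with a hypothetical function name and
placeholder file names:

```shell
# Sketch: sort a SAM file by coordinate and write BAM with recent samtools.
# sam_to_sorted_bam is a placeholder name, not part of samtools or StringTie.
sam_to_sorted_bam() {
    sam="$1"         # input alignments in SAM format
    sorted_bam="$2"  # output file, e.g. alns.sorted.bam

    # -o names the output file; sort order defaults to genomic coordinate.
    samtools sort -o "$sorted_bam" "$sam"
}

# Example call, assuming samtools is on your PATH:
# sam_to_sorted_bam alns.sam alns.sorted.bam
```
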
Any SAM spliced read alignment (a read alignment across at least one junction)
needs to contain the XS tag to indicate the strand from which the RNA that produced
this read originated. TopHat alignments already include this tag, but if you use
a different read mapper you should check that this tag is also included for spliced alignment
records. For example, HISAT2 should be run with the `--dta` option in order to tag spliced
alignments this way. As explained above, the alignments in SAM format should be sorted and
preferably converted to BAM.
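
One way to check for the XS tag is to scan the SAM records and count spliced
alignments (CIGAR string containing N) that lack an XS:A:+ or XS:A:- field.
The sketch below does this with awk; the function name
count_spliced_missing_xs is hypothetical, and it reads SAM text on standard
input.

```shell
# Sketch: count spliced alignments that carry no XS strand tag.
# Reads SAM text on stdin and prints the count.
count_spliced_missing_xs() {
    awk -F '\t' '
        /^@/ { next }                  # skip SAM header lines
        $6 ~ /N/ {                     # N in the CIGAR marks a spliced alignment
            has_xs = 0
            for (i = 12; i <= NF; i++) # optional tags start at field 12
                if ($i ~ /^XS:A:[+-]$/) has_xs = 1
            if (!has_xs) missing++
        }
        END { print missing + 0 }
    '
}

# Example call, assuming samtools is on your PATH:
# samtools view -h alns.sorted.bam | count_spliced_missing_xs
```

A result of 0 suggests every spliced record is tagged; any other value means
the aligner settings should be reviewed before running StringTie.
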
Optionally, a reference annotation file in GTF/GFF3 format can be provided to StringTie.
In this case, StringTie will check to see if the reference transcripts are expressed in the
RNA-Seq data, and for the ones that are expressed it will compute coverage and FPKM values.
Note that the reference transcripts need to be fully covered by reads in order to be included
in StringTie's output. Other transcripts assembled from the data by StringTie and not present
in the reference file will be printed as well ("novel" transcripts).
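
The two ways of using a reference annotation described in this README can be
sketched as follows. The function names and file names are hypothetical; -G
and -e are the options documented in the list above.

```shell
# Sketch: two ways to run StringTie with a reference annotation.
# Function names and file names are placeholders, not part of StringTie.

# Guided assembly: reports expressed reference transcripts plus novel ones.
guided_assembly() {
    stringtie -G "$1" -o "$2" "$3"
}

# Abundance estimation restricted to the reference transcripts
# (-e requires -G).
estimate_reference_only() {
    stringtie -e -G "$1" -o "$2" "$3"
}

# Example calls, assuming stringtie is on your PATH:
# guided_assembly ref.gtf sample1.gtf alns.sorted.bam
# estimate_reference_only ref.gtf sample1.ref_only.gtf alns.sorted.bam
```
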