forked from gpertea/stringtie
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathREADME
More file actions
96 lines (74 loc) · 4.08 KB
/
README
File metadata and controls
96 lines (74 loc) · 4.08 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
This document provides basic installation and usage instructions for StringTie
Obtaining and installing StringTie
----------------------------------
The current version of StringTie can be downloaded from
http://ccb.jhu.edu/software/stringtie/
In order to build StringTie from the source package,
the following steps should be taken:
1. Unpack the downloaded StringTie source archive
in a directory of your choice, e.g.:
cd ~/src/
tar xvfz ~/Downloads/stringtie-N.NN.tar.gz
A directory called stringtie-N.NN (where N.NN is the current
numeric version of the program) will be created in the current directory.
2. Change to that directory and build the stringtie executable:
cd stringtie-N.NN
make release
3. Optionally, the stringtie executable can be copied to one of the
shell's PATH directories for easy access, e.g.:
cp stringtie ~/bin/
Running StringTie
-----------------
Run stringtie from the command line like this:
stringtie <aligned_reads.bam> [options]
The main input of the program is a SAMtools BAM file with RNA-Seq mappings
sorted by genomic location (for example the accepted_hits.bam file produced
by TopHat).
The following optional parameters can be specified (use -h/--help to get the
usage message):
-G reference annotation to use for guiding the assembly process (GTF/GFF3)
-l name prefix for output transcripts (default: STRG)
-f minimum isoform fraction (default: 0.1)
-m minimum assembled transcript length to report (default 200bp)
-o output path/file name for the assembled transcripts GTF (default: stdout)
-a minimum anchor length for junctions (default: 10)
-j minimum junction coverage (default: 1)
-t disable trimming of predicted transcripts based on coverage
(default: coverage trimming is enabled)
-c minimum reads per bp coverage to consider for transcript assembly (default: 2.5)
-s coverage saturation threshold; further read alignments will be
ignored in a region where a local coverage depth of <maxcov>
is reached (default: 1,000,000);
-v verbose (log bundle processing details)
-g gap between read mappings triggering a new bundle (default: 50)
-C output file with reference transcripts that are covered by reads
-M fraction of bundle allowed to be covered by multi-hit reads (default:0.95)
-p number of threads (CPUs) to use (default: 1)
-B enable output of Ballgown table files which will be created in the
same directory as the output GTF (requires -G, -o recommended)
-b enable output of Ballgown table files but these files will be
created under the directory path given as <dir_path>
-e only estimates the abundance of given reference transcripts (requires -G)
Input files
-----------
StringTie takes as input a binary SAM (BAM) file which must be sorted by
reference position. This file contains spliced read alignments such as the
ones produced by TopHat. If you have a text file in SAM format you should
first convert it to the BAM format using the samtools view command:
samtools view -S -b input.sam > input.bam
Any SAM spliced read alignment (a read alignment across at least one junction)
needs to contain the tag XS to indicate which strand of the RNA that produced
this read came from. TopHat alignments already include this tag, but if you use
a different read mapper you should check that this tag is also included for all
spliced alignment records. You would also need to check that the reads are
properly sorted. For a SAM file, this can be done with the command:
sort -k 3,3 -k 4,4n input.sam > input.sorted.sam
For BAM files, the samtools program can be used to sort the alignments.
Optionally, a reference annotation file in GTF/GFF3 format
can be provided to StringTie. In this case, StringTie will check
to see if the reference transcripts are expressed in the RNA-Seq data,
and for the ones that are expressed it will compute coverage and FPKM values.
Note that the reference transcripts need to be fully covered by reads
in order to be included in StringTie's output. Other transcripts
assembled from the data by StringTie and not present in the reference
file will be printed as well.