Update README.md

gpertea · web-flow · commit 35f1f0433a3f · 2021-11-04T21:37:14.000-04:00
diff --git a/README.md b/README.md
@@ -2,16 +2,15 @@
 
 ## StringTie - efficient transcript assembly and quantitation of RNA-Seq data
 
-This software employs efficient algorithms for transcript structure recovery and abundance estimation from bulk RNA-Seq reads aligned to a reference genome. 
-StringTie takes as input RNA-seq read alignments in coordinate-sorted SAM/BAM/CRAM format and produces a GTF output which consists of assembled 
+StringTie employs efficient algorithms for transcript structure recovery and abundance estimation from bulk RNA-Seq reads aligned to a reference genome. 
+It takes as input RNA-seq read alignments in coordinate-sorted SAM/BAM/CRAM format and produces a GTF output which consists of assembled 
 transcript structures and their estimated expression levels (FPKM/TPM and base coverage values).
 
 For additional StringTie documentation and the latest official source and binary packages please refer to the official website: <https://ccb.jhu.edu/software/stringtie>
 
 ## Obtaining and installing StringTie
 
-Source and binary packages for this software, along with a small test data set 
-can be directly downloaded from the [Releases](https://github.com/gpertea/stringtie/releases) page for this repository. 
+Source and binary packages for this software can be directly downloaded from the [Releases](https://github.com/gpertea/stringtie/releases) page for this repository. 
 StringTie is compatible with a wide range of Linux and Apple OS systems.
 The main program (StringTie) does not have any other library dependencies (besides zlib) and in order to compile it from source it requires
 a C++ compiler which supports the C++ 11 standard (GCC 4.8 or newer).
@@ -62,13 +61,14 @@ __Note__: if the `--mix` option is used, StringTie expects two alignment files t
 stringtie [-o <output.gtf>] --mix [other_options] <short_read_alns.bam> <long_read_alns.bam> 
 ```
 
-Note that the command line parser in StringTie allows arbitrary order and mixing of the input positional parameters with the other options of the program, so the input alignment files can precede or be given in between the other options, so the following command line if equivalent to the one above:
+Note that the command line parser in StringTie allows arbitrary order and mixing of the positional parameters with the other options of the program, so the input alignment files can also precede or be given in between the other options -- the following command line is equivalent to the one above:
 
 ```
 stringtie <short_read_alns.bam> <long_read_alns.bam> --mix [other_options] [-o <output.gtf>] 
 ```
 
 ### Running StringTie on the provided test/demo data
+
 When building from this source repository, after the program was compiled with `make release` as instructed above, the generated binary can be tested on a small data set with a command like this:
 ```
 make test
@@ -110,26 +110,28 @@ stringtie -L -o long_reads.out.gtf long_reads.bam
 stringtie -L -G human-chr19_P.gff -o long_reads_guided.out.gtf long_reads.bam
 ```
 
-The above runs should take around one second each on a regular Linux or MacOS desktop. 
-(see also <a href="https://github.com/gpertea/stringtie/blob/master/test_data/README.md">test_data/README.md</a>).
-
 For very large data sets one can expect up to one hour of processing time. A minimum of 8GB of RAM is recommended for running StringTie on regular size RNA-Seq samples, with 16 GB or more being strongly advised for larger data sets.
 
 
 ### StringTie options
 
-The following optional parameters can be specified (use -h/--help to get the
-usage message):
+The following optional parameters can be specified (use `-h` or `--help` to get the complete usage message):
 ```
+Options:
  --version : print just the version at stdout and exit
- --conservative : conservative transcriptome assembly, same as -t -c 1.5 -f 0.05
- --rf assume stranded library fr-firststrand
- --fr assume stranded library fr-secondstrand
- -G reference annotation to use for guiding the assembly process (GTF/GFF3)
+ --conservative : conservative transcript assembly, same as -t -c 1.5 -f 0.05
+ --mix : both short and long read data alignments are provided
+        (long read alignments must be the 2nd BAM/CRAM input file)
+ --rf : assume stranded library fr-firststrand
+ --fr : assume stranded library fr-secondstrand
+ -G reference annotation to use for guiding the assembly process (GTF/GFF)
+ --ptf : load point-features from a given 4 column feature file <f_tab>
  -o output path/file name for the assembled transcripts GTF (default: stdout)
  -l name prefix for output transcripts (default: STRG)
  -f minimum isoform fraction (default: 0.01)
- -L use long reads settings (default:false)
+ -L long reads processing; also enforces -s 1.5 -g 0 (default:false)
+ -R if long reads are provided, just clean and collapse the reads but
+    do not assemble
  -m minimum assembled transcript length (default: 200)
  -a minimum anchor length for junctions (default: 10)
  -j minimum junction coverage (default: 1)
@@ -144,14 +146,19 @@ usage message):
  -M fraction of bundle allowed to be covered by multi-hit reads (default:1)
  -p number of threads (CPUs) to use (default: 1)
  -A gene abundance estimation output file
+ -E define window around possibly erroneous splice sites from long reads to
+    look out for correct splice sites (default: 25)
  -B enable output of Ballgown table files which will be created in the
     same directory as the output GTF (requires -G, -o recommended)
  -b enable output of Ballgown table files but these files will be 
     created under the directory path given as <dir_path>
  -e only estimate the abundance of given reference transcripts (requires -G)
+ --viral : only relevant for long reads from viral data where splice sites
+    do not follow consensus (default:false)
  -x do not assemble any transcripts on the given reference sequence(s)
  -u no multi-mapping correction (default: correction enabled)
  -h print this usage message and exit
+ --ref/--cram-ref reference genome FASTA file for CRAM input
 
 Transcript merge usage mode: 
   stringtie --merge [Options] { gtf_list | strg1.gtf ...}
@@ -174,12 +181,13 @@ the following options are available:
   -i               keep merged transcripts with retained introns; by default
                    these are not kept unless there is strong evidence for them
   -l <label>       name prefix for output transcripts (default: MSTRG)
+
 ```
 
 ## Input files
 
-StringTie takes as input a binary SAM (BAM) file sorted by reference position. 
-This file contains spliced read alignments such as the ones produced by TopHat or HISAT2.
+StringTie takes as input a SAM, BAM or CRAM file sorted by coordinate (genomic location). 
+This file should contain spliced RNA-seq read alignments such as the ones produced by TopHat or HISAT2.
 A text file in SAM format should be converted to BAM and sorted using the 
 samtools program:
 ```
@@ -188,18 +196,24 @@ samtools view -Su alns.sam | samtools sort - alns.sorted
 The file resulted from the above command (alns.sorted.bam) can be used 
 directly as input to StringTie. 
 
-Any SAM spliced read alignment (a read alignment across at least one junction)
-needs to contain the XS tag to indicate the strand from which the RNA that produced
-this read originated. TopHat alignments already include this tag, but if you use
+Any SAM record with a spliced alignment (i.e. having a read alignment across at least one junction)
+should have the `XS` tag to indicate the transcription strand, that is, the genomic strand from which the RNA that produced
+this read originated. TopHat and HISAT2 alignments already include this tag, but if you use
 a different read mapper you should check that this tag is also included for spliced alignment
-records. For example HISAT2 should be run with the `--dta` option in order to tag spliced 
-alignments this way. As explained above, the alignments in SAM format should be sorted and
-preferrably converted to BAM.
-
-Optionally, a reference annotation file in GTF/GFF3 format can be provided to StringTie 
-using the `-G` option. In this case, StringTie will check to see if the reference transcripts 
-are expressed in the RNA-Seq data, and for the ones that are expressed it will compute coverage
-and FPKM values.
+records. STAR aligner should be run with the option `--outSAMstrandField intronMotif` in order to generate this tag.
+
+The `XS` tag is not required for long RNA-seq reads aligned with `minimap2` using the `-ax splice` option. 
+In that case minimap2 adds the `ts` tag to spliced alignments to indicate the transcription strand 
+(though in a different manner than the `XS` tag), and StringTie recognizes the `ts` tag when the `XS` tag is missing. 
+Thus the long read spliced alignments produced by `minimap2` can also be assembled by StringTie (with the `-L` option or 
+as the 2nd input file for the `--mix` option).
+
+As explained above, the alignments must be sorted by coordinate before they can be used as input for StringTie.
+
+Optionally, a reference annotation file in GTF or GFF3 format can be provided to StringTie 
+using the `-G` option. The reference transcripts can be used as 'guides' for the assembly process, or their expression levels
+can be directly estimated (without any assembly) when the `-e` option is given.
+
 Note that the reference transcripts should be fully covered by reads in order to be included
 in StringTie's output with the original ID of the reference transcript shown in the 
 _`reference_id`_ GTF attribute in the output file . Other transcripts assembled from