rMATS v4.0.2 (turbo)

Xing Lab, Children's Hospital of Philadelphia

Which version to use:

rMATS v4.0.2 (turbo) is distributed in two builds, one for each Unicode setting of the Python interpreter. To find out which build to use, check which Unicode type your Python was built with. Open a Python console and type:

>>> import sys
>>> print sys.maxunicode
1114111

This output indicates that your Python was built with --enable-unicode=ucs4, and you should use rMATS-turbo-xxx-UCS4.

>>> import sys
>>> print sys.maxunicode
65535

This output indicates that your Python was built with --enable-unicode=ucs2, and you should use rMATS-turbo-xxx-UCS2.
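
The same check can be scripted. The following is a minimal sketch and is not part of the rMATS distribution; the helper name ucs_build is hypothetical, and the mapping simply restates the two cases above.

```python
import sys


def ucs_build(maxunicode):
    """Map a sys.maxunicode value to the matching rMATS build suffix."""
    if maxunicode == 1114111:
        return "UCS4"
    if maxunicode == 65535:
        return "UCS2"
    raise ValueError("unexpected sys.maxunicode: %r" % (maxunicode,))


# Report which build directory matches the running interpreter.
print("Use rMATS-turbo-xxx-" + ucs_build(sys.maxunicode))
```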

Installation:

Install Python 2.7.x and the corresponding versions of NumPy, BLAS, LAPACK, GSL and gfortran (the Fortran 77 library is needed).
At the time this software was compiled, the default C++ compiler on Mac OS X did not fully support OpenMP, so GCC 5 is needed to use multi-threading on Mac OS X. The following is the installation procedure we used in testing:

# For CentOS 6:
pip install numpy
yum install lapack-devel blas-devel
yum install gsl-devel.x86_64
yum install gcc-gfortran

# For Ubuntu 14:
pip install numpy
sudo apt-get install libblas-dev liblapack-dev
sudo apt-get install libgsl0ldbl
sudo apt-get install gfortran

# For Mac OS X Yosemite 10.10.5 (Using Homebrew for package management):
brew install gcc@5
brew install gsl
pip install numpy

Download and install STAR (version 2.5 or later) for FASTQ input.

Download rMATS v4.0.2 (turbo)

Obtain a STAR genome index in either of the following two ways:
Download pre-built STAR indexes if using Human (hg38, hg19) or Mouse (mm10).
Build your own STAR index from the genome FASTA sequence, following the STAR manual.

Untar rMATS and the STAR indexes. For example, assuming you use pre-built STAR genome indexes, unpack rMATS.4.0.2.tgz and STARindex.tgz in your working directory:

tar -xzf rMATS.4.0.2.tgz
cd rMATS.4.0.2/
... # move/copy/download data to this folder.
tar -xzf gtf.tgz
tar -xzf testData.tgz

Test rMATS v4.0.2 (turbo) on PC3E & GS689 dataset:

Run rmats.py as shown below to test that rMATS runs properly.

cd rMATS.4.0.2/
python rMATS-turbo-xxx-UCSx/rmats.py --b1 b1.txt --b2 b2.txt --gtf gtf/Homo_sapiens.Ensembl.GRCh37.75.gtf --od bam_test -t paired --readLength 50 --cstat 0.0001 --libType fr-unstranded

Output can be found in the bam_test directory. The test run output should look like the rMATS output description.

Using rMATS v4.0.2 (turbo):

The following is a detailed description of the options used with rMATS v4.0.2 (turbo).

Usage:

Running with fastq

python rmats.py --s1 s1.txt --s2 s2.txt --gtf gtfFile --bi STARindexFolder --od outDir -t readType --readLength readLength [options]*

Running with bam

python rmats.py --b1 b1.txt --b2 b2.txt --gtf gtfFile --od outDir -t readType --nthread nthread --readLength readLength --tstat tstat [options]*

Required Parameters:

--s1 s1.txt	A text file that lists the FASTQ file(s) for sample_1. (Only if using fastq)
--s2 s2.txt	A text file that lists the FASTQ file(s) for sample_2. (Only if using fastq)
--b1 b1.txt	A text file that lists the mapping results for sample_1 in bam format. (Only if using bam)
--b2 b2.txt	A text file that lists the mapping results for sample_2 in bam format. (Only if using bam)
-t readType	Type of read used in the analysis. readType is either 'paired' or 'single'. 'paired' is for paired-end data and 'single' is for single-end data
--readLength <int>	The length of each read
--gtf gtfFile	An annotation of genes and transcripts in GTF format
--bi STARIndexFolder	The folder name of the STAR binary indexes (i.e., the name of the folder that contains SA file). For example, use ~/STARindex/hg19 for hg19. (Only if using fastq)
--od outDir	The output directory

Optional:

--tophatAnchor <int>	The "anchor length" or "overhang length" used in the aligner. At least "anchor length" nt must be mapped to each end of a given junction. The default is 6. (This parameter applies only if using fastq)
--nthread <int>	The number of threads. For best performance, set this to the number of CPU cores.
--cstat <float>	The cutoff splicing difference. The cutoff used in the null hypothesis test for differential splicing. The default is 0.0001 for 0.01% difference. Valid: 0 ≤ cutoff < 1
--tstat <int>	The number of threads for the statistical model.
--statoff	Turn statistics part off.
--libType libraryType	Library type. Default is unstranded (fr-unstranded). Use fr-firststrand or fr-secondstrand for strand-specific data.
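
To illustrate how the required and optional parameters fit together for BAM input, the following sketch assembles an rmats.py argument list and checks the two constraints stated above (the allowed readType values and the valid --cstat range). The function name build_rmats_command and its keyword arguments are hypothetical; only the flags themselves come from this README.

```python
def build_rmats_command(b1, b2, gtf, out_dir, read_type, read_length,
                        nthread=None, cstat=None, lib_type=None,
                        script="rmats.py"):
    """Return an argument list for a BAM-input rMATS run (illustrative helper)."""
    if read_type not in ("paired", "single"):
        raise ValueError("readType must be 'paired' or 'single'")
    if cstat is not None and not (0 <= cstat < 1):
        raise ValueError("cstat must satisfy 0 <= cutoff < 1")

    # Required parameters for a BAM run.
    cmd = ["python", script,
           "--b1", b1, "--b2", b2,
           "--gtf", gtf, "--od", out_dir,
           "-t", read_type, "--readLength", str(read_length)]

    # Optional parameters are appended only when given.
    if nthread is not None:
        cmd += ["--nthread", str(int(nthread))]
    if cstat is not None:
        cmd += ["--cstat", str(cstat)]
    if lib_type is not None:
        cmd += ["--libType", lib_type]
    return cmd
```

The returned list can be passed to subprocess.call() from the standard library, which avoids shell quoting problems with file names.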

Examples:

Example using fastq.

$cat s1.txt:
231ESRP.25K.rep-1.R1.fastq:231ESRP.25K.rep-1.R2.fastq,231ESRP.25K.rep-2.R1.fastq:231ESRP.25K.rep-2.R2.fastq
$cat s2.txt:
231EV.25K.rep-1.R1.fastq:231EV.25K.rep-1.R2.fastq,231EV.25K.rep-2.R1.fastq:231EV.25K.rep-2.R2.fastq
$

python rMATS-turbo-xxx-UCSx/rmats.py --s1 s1.txt --s2 s2.txt --gtf gtf/Homo_sapiens.Ensembl.GRCh37.72.gtf --bi ~/STARindex/hg19 --od out_test -t paired --nthread 6 --readLength 50 --tophatAnchor 8 --cstat 0.0001 --tstat 6
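
The s1.txt and s2.txt files shown above list replicates separated by commas, with the two mates of a paired-end replicate joined by a colon. A small sketch that produces this one-line format from a list of (R1, R2) pairs follows; the function name format_fastq_list is hypothetical.

```python
def format_fastq_list(replicates):
    """Join (R1, R2) FASTQ pairs into the one-line s1.txt / s2.txt format.

    Mates of a replicate are joined with ':' and replicates with ','.
    """
    return ",".join(r1 + ":" + r2 for r1, r2 in replicates)


pairs = [
    ("231ESRP.25K.rep-1.R1.fastq", "231ESRP.25K.rep-1.R2.fastq"),
    ("231ESRP.25K.rep-2.R1.fastq", "231ESRP.25K.rep-2.R2.fastq"),
]
# Prints the same line as the s1.txt example above.
print(format_fastq_list(pairs))
```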

Example using bam.

$cat b1.txt:
231ESRP.25K.rep-1.bam,231ESRP.25K.rep-2.bam
$cat b2.txt:
231EV.25K.rep-1.bam,231EV.25K.rep-2.bam
$

python rMATS-turbo-xxx-UCSx/rmats.py --b1 b1.txt --b2 b2.txt --gtf gtf/Homo_sapiens.Ensembl.GRCh37.75.gtf --od bam_test -t paired --readLength 50 --cstat 0.0001 --libType fr-unstranded

Output:

All output files are written to the --od directory, which contains the rMATS output for AS events and all possible alternative splicing (AS) events derived from the GTF and the RNA-Seq data.

AS_Event.MATS.JC.txt evaluates splicing with only reads that span splicing junctions
IJC_SAMPLE_1: inclusion junction counts for SAMPLE_1, replicates are separated by comma
SJC_SAMPLE_1: skipping junction counts for SAMPLE_1, replicates are separated by comma
IJC_SAMPLE_2: inclusion junction counts for SAMPLE_2, replicates are separated by comma
SJC_SAMPLE_2: skipping junction counts for SAMPLE_2, replicates are separated by comma

AS_Event.MATS.JCEC.txt evaluates splicing with reads that span splicing junctions and reads on target (striped regions on home page figure)
IC_SAMPLE_1: inclusion counts for SAMPLE_1, replicates are separated by comma
SC_SAMPLE_1: skipping counts for SAMPLE_1, replicates are separated by comma
IC_SAMPLE_2: inclusion counts for SAMPLE_2, replicates are separated by comma
SC_SAMPLE_2: skipping counts for SAMPLE_2, replicates are separated by comma

Important columns contained in output files above.
IncFormLen: length of inclusion form, used for normalization
SkipFormLen: length of skipping form, used for normalization
IncLevel1: inclusion level for SAMPLE_1 replicates (comma separated) calculated from normalized counts
IncLevel2: inclusion level for SAMPLE_2 replicates (comma separated) calculated from normalized counts
IncLevelDifference: average(IncLevel1) - average(IncLevel2)
P-Value: Significance of splicing difference between two sample groups. (Only available if statistical model is on)
FDR: False Discovery Rate calculated from p-value. (Only available if statistical model is on)
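
As a worked illustration of how these columns relate, the sketch below computes an inclusion level from length-normalized counts and then the difference of the group averages. This README states only that the levels are "calculated from normalized counts"; the formula used here, (I/IncFormLen) / (I/IncFormLen + S/SkipFormLen), is an assumption based on the usual definition of inclusion level, and both function names are hypothetical. Returning None for a replicate with no reads is also an assumption.

```python
def inclusion_level(inc_count, skip_count, inc_form_len, skip_form_len):
    """Inclusion level of one replicate from length-normalized counts."""
    inc_norm = float(inc_count) / inc_form_len
    skip_norm = float(skip_count) / skip_form_len
    total = inc_norm + skip_norm
    if total == 0:
        return None  # no supporting reads; level is undefined
    return inc_norm / total


def inc_level_difference(levels1, levels2):
    """average(IncLevel1) - average(IncLevel2), skipping undefined replicates."""
    l1 = [x for x in levels1 if x is not None]
    l2 = [x for x in levels2 if x is not None]
    if not l1 or not l2:
        return None
    return sum(l1) / len(l1) - sum(l2) / len(l2)
```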

fromGTF.AS_Event.txt lists all possible alternative splicing (AS) events derived from the GTF and the RNA-Seq data.
ID: event id
GeneID: Ensembl gene id
geneSymbol: gene symbol
chr: chromosome
strand: strand of the gene
event coordinates: coordinates of the exons in the events, with multiple columns, different for each event type

JC.raw.input.AS_Event.txt evaluates splicing with only reads that span splicing junctions
IJC_SAMPLE_1: inclusion junction counts for SAMPLE_1, replicates are separated by comma
SJC_SAMPLE_1: skipping junction counts for SAMPLE_1, replicates are separated by comma
IJC_SAMPLE_2: inclusion junction counts for SAMPLE_2, replicates are separated by comma
SJC_SAMPLE_2: skipping junction counts for SAMPLE_2, replicates are separated by comma
IncFormLen: length of inclusion form, used for normalization
SkipFormLen: length of skipping form, used for normalization

JCEC.raw.input.AS_Event.txt evaluates splicing with reads that span splicing junctions and reads on target (striped regions on home page figure)
IC_SAMPLE_1: inclusion counts for SAMPLE_1, replicates are separated by comma
SC_SAMPLE_1: skipping counts for SAMPLE_1, replicates are separated by comma
IC_SAMPLE_2: inclusion counts for SAMPLE_2, replicates are separated by comma
SC_SAMPLE_2: skipping counts for SAMPLE_2, replicates are separated by comma
IncFormLen: length of inclusion form, used for normalization
SkipFormLen: length of skipping form, used for normalization
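
Because the count columns hold one comma-separated value per replicate, they need to be split before use. The following sketch reads such a table into dictionaries; it assumes the files are tab-delimited with a header row, which this README does not state explicitly, and the function name read_count_table is hypothetical.

```python
def read_count_table(lines, count_columns):
    """Parse header + data lines; split the named count columns per replicate.

    `lines` is any iterable of text lines, such as an open file.
    Columns not listed in `count_columns` are left as strings.
    """
    rows = iter(lines)
    header = next(rows).rstrip("\n").split("\t")
    records = []
    for line in rows:
        if not line.strip():
            continue  # skip blank lines
        record = dict(zip(header, line.rstrip("\n").split("\t")))
        for name in count_columns:
            record[name] = [int(v) for v in record[name].split(",")]
        records.append(record)
    return records
```

For a JC.raw.input file, count_columns would be the four IJC/SJC column names listed above.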

Alternative Splicing Events

rMATS analyzes skipped exon (SE), alternative 5' splice site (A5SS), alternative 3' splice site (A3SS), mutually exclusive exons (MXE), and retained intron (RI) events. Possible alternative splicing events are identified from the RNA-Seq data and annotation of transcripts in GTF format. The following is a list of provided GTF files:

Human, Homo sapiens (Ensembl or UCSC Known Genes)

Alternatively, you can download your own transcript annotation in GTF format. However, the first column (chromosome/contig name) in the GTF must match the sequence names in your STARindex.
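
One way to check this requirement before a run is to compare the first column of the GTF against the sequence names in the index. The sketch below is illustrative only; the function names are hypothetical, and it assumes the index sequence names are available as a plain list (STAR index folders typically record them in a chrName.txt file, one name per line).

```python
def gtf_sequence_names(gtf_lines):
    """Collect the distinct values of the first GTF column."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def missing_from_index(gtf_lines, index_names):
    """Return GTF sequence names that do not appear in the index, sorted."""
    return sorted(gtf_sequence_names(gtf_lines) - set(index_names))
```

An empty result means every chromosome/contig name in the GTF has a match in the index; a non-empty result such as names with and without a "chr" prefix points to a naming mismatch.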