This tutorial will serve as a guideline for RNA sequence analysis of a non-model organism. We will be using a needle transcriptome from red spruce as our example. The RNA-Seq data consists of Illumina short reads as well as 454 reads.
We will be going through:
- Quality control of the libraries
- De novo assembly of the reads
- Clustering of the assembled transcripts from each assembly
- Mapping the reads back to assembled/clustered transcriptome
- Generating raw counts
- GFOLD to analyze differential gene expression between two treatments (elevated and ambient CO2).
The packages we’ll be using can be found here:
Sickle
SolexaQA
Trinity
MIRA
Vsearch
TransDecoder
Bowtie2
eXpress
GFold
The data consists of three libraries under three different treatments;
- ambient CO2
- elevated CO2
- one that had been cotreated for both conditions
De Novo non model species directory
/UCHC/PublicShare/RNASeq_Workshop/RedSpruce
Each step in this work flow is composed into a directory
RedSpruce
|-- Reads
|-- QualityControl
|-- Assembly
|-- Clustered
|-- CodingRegions
|-- Index
|-- Align
|-- Count
|-- gfold
Raw Reads
Two types of sequencing methods were used to collect the RNA-Seq data, namely Illumina and 454. Where Illumina reads are in the form of fastq, where as 454 reads are in the form of fasta and quality score (qual) files.
Reads/
|-- 454/
| |-- control_454.fasta
| |-- control_454.qual
| |-- cotreated_454.fasta
| |-- cotreated_454.qual
| |-- elevated_454.fasta
| `-- elevated_454.qual
`-- Illumina/
|-- ambient.fastq
|-- cotreated.fastq
`-- elevated.fastq
Quality control
Quality control of Illumina reads using Sickle:
Step one is to perform quality control on the reads, and we will be using Sickle for the Illumina reads. Using the Sickle program we will trim our unpaired raw reads, to keep a minimum quality score value of 30 and a minimum length of 35.
sickle se -f ../../Reads/Illumina/ambient.fastq -t illumina -o ambient.trimmed.fastq -q 30 -l 35 sickle se -f ../../Reads/Illumina/elevated.fastq -t illumina -o elevated.trimmed.fastq -q 30 -l 35 sickle se -f ../../Reads/Illumina/cotreated.fastq -t illumina -o cotreated.trimmed.fastq -q 30 -l 35
The trimmed output files are what we will be using for the next steps of our analysis.
se unpaired reads / single-end reads
-f input file
-o output file
-q minimum quality score
-l minimum read length
-t Type of quality score value, e.g. (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)) (required)
The full script for trimming of raw reads using sickle can be found at: /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/QualityControl/Illumina/sickle.sh
Once the run is completed the directory will contain the following files that has been trimmed, which will be used in the further analysis:
QualityControl/
|--Illumina/
|-- ambient.trimmed.fastq
|-- cotreated.trimmed.fastq
|-- elevated.trimmed.fastq
`-- sickle_267830.out
De novo assembly
In De novo assembly section, we will be woking in the /Assembly directory. Where Illumina quality controlled reads will be assembled using Trinity and the 454 quality controlled reads will be assembled using the MIRA. For each assembly there will be separate directories called /Trinity and /MIRA.
De novo assembly of Illumina reads using Trinity:
Now we assemble all three quality controlled Illumina libraries with Trinity which is a de novo transcriptome assembler. This is most appropriate to use on a read set where no reference genome exists or where the reference genome is not high quality. Assembly requires a great deal of memory (RAM) and can take a few days to run if the read set is large. The BBC cluster has a high memory node you can connect to access this. In this example, we are assembling short reads using the highmem.q which has a 514 GB RAM. The script trinity.sh can be found at /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/Assembly/Trinity.
module load trinity/2.4.0
Trinity --seqType fq \
--max_memory 256G \
--single ../../QualityControl/Illumina/ambient.trimmed.fastq,../../QualityControl/Illumina/cotreated.trimmed.fastq,../../QualityControl/Illumina/elevated.trimmed.fastq \
--min_contig_length 300 \
--CPU 16 \
--normalize_reads
If you do not specify a output directory, the out put files will be directed to the trinity default folder called trinity_out_dir/.
Assembly/
|--trinity_out_dir/
|-- chrysalis/
|-- inchworm.K25.L25.DS.fa
|-- inchworm.K25.L25.DS.fa.finished
|-- inchworm.kmer_count
|-- insilico_read_normalization/
|-- jellyfish.kmers.fa
|-- jellyfish.kmers.fa.histo
|-- output_tutorial.txt
|-- partitioned_reads.files.list
|-- partitioned_reads.files.list.ok
|-- pipeliner.35650.cmds
|-- read_partitions/
|-- recursive_trinity.cmds
|-- recursive_trinity.cmds.completed
|-- recursive_trinity.cmds.ok
|-- single.fa
|-- single.fa.ok
|-- single.fa.read_count
`-- Trinity.fasta
There will be number of files created and the assembled file will be called Trinity.fasta. We will be using this for our next step.
To check the quality of your Trinity assembly we can use a utility program provided with the trinity software called TrinityStats.pl. To look at the contig summery we use a script called “trinity_stats.sh” located at, /common/RNASeq_Workshop/RedSpruce/Assembly/Trinity in the cluster.
perl $TRINITY_HOME/util/TrinityStats.pl trinity_out_dir/Trinity.fasta > TrinitySummary.txt
If you submit the using a script, the result will be piped in to the output file call “TrinitySummary.txt” and it will look like:
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 63856
Total trinity transcripts: 114946
Percent GC: 42.14
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 3636
Contig N20: 2760
Contig N30: 2230
Contig N40: 1835
Contig N50: 1492
Median contig length: 662
Average contig: 1022.21
Total assembled bases: 117498506
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 3583
Contig N20: 2710
Contig N30: 2170
Contig N40: 1761
Contig N50: 1389
Median contig length: 582
Average contig: 940.82
Total assembled bases: 60076924
Clustering
Clustering of Illumina and 454 reads using vsearch:
To obtain a set of unique genes from both runs, we will cluster the two resulting assemblies together. First, the two assembies will be combined into one file using the Unix command cat, which refers to concatanate.
cat ../Assembly/Trinity/trinity_out_dir/Trinity.fasta ../Assembly/Mira/Mira_454_assembly/Mira_454_d_results/Mira_454_out.unpadded.fasta > combine.fasta
Once the files are combined, we will use vsearch to find redundancy between the assembled transcripts and create a single output known as a centroids file. The threshold for clustering in this example is set to 80% identity. The following script combine.sh will be located at /common/RNASeq_Workshop/RedSpruce/Clustered
vsearch --threads 8 --log LOGFile \
--cluster_fast combine.fasta \
--id 0.80 --centroids centroids.fasta --uc clusters.uc
Usage: vsearch [OPTIONS]
--threads INT number of threads to use, zero for all cores (0)
--log FILENAME write messages, timing and memory info to file
--cluster_fast FILENAME cluster sequences after sorting by length
--id REAL reject if identity lower, accepted values: 0-1.0
--centroids FILENAME output centroid sequences to FASTA file
--uc FILENAME specify filename for UCLUST-like output
After the run, it will contain the following files; where the centroids.fasta file will contain the unique genes from the 454 and Illumina assemblies.
Clustered/
|-- centroids.fasta
|-- clusters.uc
|-- combine.fasta
|-- combine.sh
`-- LOGFile
For the next step we will be using the centroids.fasta file.
Identifying the Coding Regions
Identifying coding regions using TransDecoder:
Now that we have our reads assembled and clustered together into the single centroids file, we can use TransDecoder to determine optimal open reading frames from the assembly (ORFs). Assembled RNA-Seq transcripts may have 5′ or 3′ UTR sequence attached and this can make it difficult to determine the CDS in non-model spieces.
First we will use the TransDecoder utility to run on the centroid fasta file, as follows:
TransDecoder.LongOrfs -t ../Clustered/centroids.fasta
Usage: Transdecoder.LongOrfs [Options]
Options explained:
-t transcripts.fasta
By default it will identify ORFs that are at least 100 amino acids long. (you can change this by using -m parameter). It will produce a folder called centroids.fasta.transdecoder_dir, and will include all the temporary files in that directory.
CodingRegions/
|-- centroids.fasta.transdecoder_dir/
|-- longest_orfs.pep
|-- longest_orfs.gff3
|-- longest_orfs.cds
|-- base_freqs.dat.ok
`-- base_freqs.dat
Next step is to, identify ORFs with homology to known proteins via blast or pfam searches. This will maximize the sensitivity for capturing the ORFs that have functional significance. It done using the following code:
hmmscan --cpu 8 --domtblout pfam.domtblout /common/Pfam/Pfam-B.hmm centroids.fasta.transdecoder_dir/longest_orfs.pep
It will create a file called pfam.domtblout in the main directory.
The final step is to predict the likely coding regions using the Pfam (or Blast) search results, using the following:
TransDecoder.Predict -t ../Clustered/centroids.fasta --retain_pfam_hits pfam.domtblout --cpu 8
The full script (Transdecoder.sh) for identifying the coding regions using the TransDecoder is located at: /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/CodingRegions/Transdecoder.sh
After the run is completed it will create the following files:
CodingRegions/
|-- centroids.fasta.transdecoder_dir/
| |-- longest_orfs.pep
| |-- longest_orfs.gff3.inx
| |-- longest_orfs.gff3
| |-- longest_orfs.cds.top_longest_5000.nr80.clstr
| |-- longest_orfs.cds.top_longest_5000.nr80
| |-- longest_orfs.cds.top_longest_5000
| |-- longest_orfs.cds.top_500_longest.ok
| |-- longest_orfs.cds.top_500_longest
| |-- longest_orfs.cds.scores.selected
| |-- longest_orfs.cds.scores.ok
| |-- longest_orfs.cds.scores
| |-- longest_orfs.cds.eclipsed_removed.gff3
| |-- longest_orfs.cds.best_candidates.gff3
| |-- longest_orfs.cds
| |-- hexamer.scores.ok
| |-- hexamer.scores
| |-- base_freqs.dat.ok
| `-- base_freqs.dat
|-- pfam.domtblout
|-- centroids.fasta.transdecoder.pep
|-- centroids.fasta.transdecoder.gff3
|-- centroids.fasta.transdecoder.cds
`-- centroids.fasta.transdecoder.bed
In the next step, for creating an index, we will be using the centroids.fasta.transdecoder.cds file.
Creating A Index
Indexing the reads using bowtie2:
Before aligning the reads we need to create a index for the created transcoder assembly. We will be using Bowtie2 to build the index using the following command:
bowtie2-build ../CodingRegions/centroids.fasta.transdecoder.cds centroids_transdecoder_Index
Usage: bowtie2-build [options] [reference_in] [bt2_index_base]
main arguments
reference_in comma-separated list of files with ref sequences
bt2_index_base base name for the index files to write
Options
--threads # of threads, By default bowtie2-build is using only one thread. Increasing the number of threads will speed up the index building considerably in most cases.
The script (bowtie2.sh) for creating the index can be found at /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/Index/bowtie2.sh
Index/
|-- centroids_transdecoder_Index.1.bt2
|-- centroids_transdecoder_Index.2.bt2
|-- centroids_transdecoder_Index.3.bt2
|-- centroids_transdecoder_Index.4.bt2
|-- centroids_transdecoder_Index.rev.1.bt2
`-- centroids_transdecoder_Index.rev.2.bt2
Aligning the Reads to the Index
Aligning the reads using bowtie2:
Once the index has been created, we will use bowtie2 to align our three trimmed reads, using the following command:
bowtie2 --threads 8 -x ../Index/centroids_transdecoder_Index ../QualityControl/Illumina/elevated.trimmed.fastq -S elevated.sam
bowtie2 --threads 8 -x ../Index/centroids_transdecoder_Index ../QualityControl/Illumina/ambient.trimmed.fastq -S ambient.sam
bowtie2 --threads 8 -x ../Index/centroids_transdecoder_Index ../QualityControl/Illumina/cotreated.trimmed.fastq -S cotreated.sam
Usage: bowtie2 [Options] -x [bt2_base_index] {reads / options}
Main arguments:
-x [bt2_base_index] The basename of the index for the reference genome (i.e. with out the *.bt2 extension)
-S Output file name of the SAM alignment
Options:
--threads Number of processors
The script bowtie2_Align.sh used to align using the bowtie2 can be found at /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/Align/. Once aligned it will produce three SAM files.
Align/
|-- ambient.sam
|-- cotreated.sam
`-- elevated.sam
Converting SAM to BAM format:
SAM format file is a large human readable file. In a computational point of view, accessing this data for processing will take time. For faster processing this need to be converted into computer friendly binary format called the BAM files. We use samtools to convert the SAM file in to BAM files. Using the view command the it will be converted into BAM format and will be sorted using the sort command as shown bellow:
samtools view -uhS elevated.sam | samtools sort - elevated_sorted
samtools view -uhS ambient.sam | samtools sort - ambient_sorted
samtools view -uhS cotreated.sam | samtools sort - cotreated_sorted
Usage: samtools [command] [options] in.sam
Command:
view prints all alignments in the specified input alignment file (in SAM, BAM, or CRAM format) to standard output in SAM format
Options:
-h Include the header in the output
-S Indicate the input was in SAM format
-u Output uncompressed BAM. This option saves time spent on compression/decompression and is thus preferred when the output is piped to another samtools command
Usage: samtools [command] [-o out.bam]
Command:
sort Sort alignments by leftmost coordinates
-o Write the final sorted output to FILE, rather than to standard output.
The script (sam2bam.sh) can be found at /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/Align. Once the conversion is done it will produce the following files which will be binary format.
Align/
|-- elevated_sorted.bam
|-- cotreated_sorted.bam
`-- ambient_sorted.bam
Aligned Read Counts
Extracting Raw Read Counts with eXpress:
Once it has been aligned, we will use the eXpress software, to get the counts of differential expression.
express ../CodingRegions2/centroids.fasta.transdecoder.cds ../Align/elevated_sorted.bam -o elevated_express
express ../CodingRegions2/centroids.fasta.transdecoder.cds ../Align/ambient_sorted.bam -o ambient_express
express ../CodingRegions2/centroids.fasta.transdecoder.cds ../Align/cotreated_sorted.bam -o cotreated_express
Usage: express [options] [target_seqs.fa] [hits.(sam/bam)]
Required arguments:
target_seqs.fa target sequence file in fasta format
hits.(sam/bam) read alignment file in SAM or BAM format
Options:
-o [ --output-dir ] arg (=.) write all output files to this directory
The script (express.sh) can be found at /common/RNASeq_Workshop/RedSpruce/Count/express.sh directory. The program will run and will create a directory for each run and will include the output files inside the directories.
Count\
|-- elevated_express\
| |-- results.xprs
| `-- params.xprs
|-- cotreated_express\
| |-- results.xprs
| `-- params.xprs
`-- ambient_express\
|-- results.xprs
`-- params.xprs
Analysis
Converting counts to Gfold format:
Before analyzing for the differential expression using the gfold package, we need to convert our counts to Gfold count format. The program expects, 5 data columns, which are “Gene Symbol”, “Gene Name”, “Read Count”, “Gene exon length”, “FPKM”. To do this we will read the results.xprs file and grab the columns which have the required fields, and will rewrite them to a new file, where it will be used as the input file for the Gfold program.
require("plyr")
express_file_1 <- data.frame(read.table("../Count/ambient_express/results.xprs",sep="\t", header = TRUE))
express_file_2 <- data.frame(read.table("../Count/cotreated_express/results.xprs",sep="\t", header = TRUE))
express_file_3 <- data.frame(read.table("../Count/elevated_express/results.xprs",sep="\t", header = TRUE))
subset1 <- data.frame(express_file_1$target_id, express_file_1$target_id, express_file_1$tot_counts, express_file_1$length,express_file_1$fpkm)
subset2 <- data.frame(express_file_2$target_id, express_file_2$target_id, express_file_2$tot_counts, express_file_2$length,express_file_2$fpkm)
subset3 <- data.frame(express_file_3$target_id, express_file_3$target_id, express_file_3$tot_counts, express_file_3$length,express_file_3$fpkm)
sorted1 <- subset1[order(express_file_1$target_id),]
sorted2 <- subset2[order(express_file_2$target_id),]
sorted3 <- subset3[order(express_file_3$target_id),]
subset1 <- rename(subset1,c("express_file_1.target_id"="GeneSymbol","express_file_1.target_id.1"="GeneName","express_file_1.tot_counts"="Read Count","express_file_1.length" ="Gene_exon_length","express_file_1.fpkm"="RPKM"))
subset2 <- rename(subset2,c("express_file_2.target_id"="GeneSymbol","express_file_2.target_id.1"="GeneName","express_file_2.tot_counts"="Read Count","express_file_2.length" ="Gene_exon_length","express_file_2.fpkm"="RPKM"))
subset3 <- rename(subset3,c("express_file_3.target_id"="GeneSymbol","express_file_3.target_id.1"="GeneName","express_file_3.tot_counts"="Read Count","express_file_3.length" ="Gene_exon_length","express_file_3.fpkm"="RPKM"))
write.table(sorted1, "ambient.read_cnt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(sorted2, "cotreated.read_cnt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(sorted3, "elevated.read_cnt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
We have created a R script (eXpress2gfold.R), which will be called, through eXpress2gfold.sh script. The script is located at /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/gfold/eXpress2gfold.sh .
The above script will produce the following gfold read count files as the output, which will be used in analyzing the differential expression.
gfold/
|-- ambient.read_cnt
|-- cotreated.read_cnt
`-- elevated.read_cnt
Differential expression using Gfold:
Using the formatted count files, we will use Gfold to analyze the differential expression of the three samples. Gfold program is very useful when no replicates are available. Gfold generalizes the fold change by, considering the log fold change value for each gene, and assigns it as a Gfold value. It overcomes the short comping of the p-value, that measure the expression under different conditions, by measuring a relative expression value. It also overcomes the deficiency of low read counts. The following code finds the differentially expressed genes between the two groups of samples.
gfold diff -s1 cotreated -s2 elevated -suf .read_cnt -o cotreated_VS_elevated.diff
gfold diff -s1 cotreated -s2 ambient -suf .read_cnt -o cotreated_VS_ambient.diff
gfold diff -s1 elevated -s2 ambient -suf .read_cnt -o elevated_VS_ambient.diff
Usage: gfold [jobs] [options]
Jobs:
diff Calculate GFOLD value and other statics. It accepts the count values as input.
Options explained:
-s1 prefix for read_count of 1st group separated by comma.
-s2 prefix for read_count of 2nd group separated by comma.
-suf suffix of count file
-o output file and for diff job, there will be two output files, where second will have .ext extension.
-norm normalization method. 'Count' stands for normalization by total number of mapped reads.
The prepared script for running Gfold is called, gfold.sh and it is located at: /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/gfold. The job will create following files:
gfold/
|-- cotreated_VS_ambient.diff
|-- cotreated_VS_ambient.diff.ext
|-- cotreated_VS_elevated.diff
|-- cotreated_VS_elevated.diff.ext
|-- elevated_VS_ambient.diff
`-- elevated_VS_ambient.diff.ext
The first few lines of the output file of cotreated_VS_ambient.diff will look like:
# This file is generated by gfold V1.1.4 on Wed Jul 19 13:51:02 2017
# Normalization constants :
# cotreated 52838582 1.0984
# elevated 55466744 1
# The GFOLD value could be considered as a reliable log2 fold change.
# It is positive/negative if the gene is up/down regulated.
# A gene with zero GFOLD value should never be considered as
# differentially expressed. For a comprehensive description of
# GFOLD, please refer to the manual.
#GeneSymbol GeneName GFOLD(0.01) E-FDR log2fdc 1stRPKM 2ndRPKM
Gene.10065 Gene.10065 -0.15334 1 -0.2063 84.977 63.876
Gene.10096 Gene.10096 2.46607 1 2.5501 25.495 129.567
Gene.10097 Gene.10097 1.90102 1 2.072 27.710 101.209
Gene.10138 Gene.10138 1.38044 1 1.4319 55.704 130.366
Gene.10198 Gene.10198 -0.24752 1 -0.2928 121.876 86.279
.
.
.
Gene.10423 Gene.10423 2.83385 1 2.91202 26.908 175.725
Gene.10440 Gene.10440 0 1 -0.05898 55.542 46.239
Gene.10487 Gene.10487 2.39166 1 2.56462 7.236 37.198
The output file contains 7 columns, namely “GeneSymbol”, “GeneName”, “GFOLD(0.01)”,
“E-FDR”, “log2fdc”, “1stRPKM”, “2ndRPKM”. From that, we will focus on the GFOLD column for each gene. The GFOLD value, can be considered as a reliable log2 fold value indicating the change. The positive value indicate a up regulated and a negative value indicate a down regulated gene.