RNA-seq for Non-model Species (Red Spruce)

This tutorial will serve as a guideline for RNA sequence analysis of a non-model organism. We will be using a needle transcriptome from red spruce as our example.  The RNA-Seq data  consists of Illumina short reads as well as 454 reads.

We will be going through:

  • Quality control of the libraries
  • De novo assembly of the reads
  • Clustering of the assembled transcripts from each assembly
  • Mapping the reads back to assembled/clustered transcriptome
  • Generating raw counts
  • GFOLD to analyze differential gene expression between two treatments (elevated and ambient CO2).

The packages we’ll be using can be found here:
Sickle
SolexaQA
Trinity
MIRA
Vsearch
TransDecoder
Bowtie2
eXpress
GFold

The data consists of three libraries under three different treatments;

  1. ambient CO
  2. elevated CO2
  3. one that had been cotreated for both conditions

De Novo non model species directory

/UCHC/PublicShare/RNASeq_Workshop/RedSpruce

Each step in this work flow is composed into a directory

RedSpruce
|-- Reads
|-- QualityControl
|-- Assembly
|-- Clustered
|-- CodingRegions
|-- Index
|-- Align
|-- Count
|-- gfold

 

Raw Reads 

Two types of sequencing methods were used to collect the RNA-Seq data, namely Illumina and 454. Where Illumina reads are in the form of fastq, where as 454 reads are in the form of fasta and quality score (qual) files.

Reads/
|-- 454/
|   |-- control_454.fasta
|   |-- control_454.qual
|   |-- cotreated_454.fasta
|   |-- cotreated_454.qual
|   |-- elevated_454.fasta
|   `-- elevated_454.qual
`-- Illumina/
    |-- ambient.fastq
    |-- cotreated.fastq
    `-- elevated.fastq

 

Quality control 

Quality control of Illumina reads using Sickle:

Step one is to perform quality control on the reads, and we will be using Sickle for the Illumina reads. Using the Sickle program we will trim our unpaired raw reads, to keep a minimum quality score value of 30 and a minimum length of 35.

sickle se -f ../../Reads/Illumina/ambient.fastq -t illumina -o ambient.trimmed.fastq -q 30 -l 35
sickle se -f ../../Reads/Illumina/elevated.fastq -t illumina -o elevated.trimmed.fastq -q 30 -l 35
sickle se -f ../../Reads/Illumina/cotreated.fastq -t illumina -o cotreated.trimmed.fastq -q 30 -l 35

The trimmed output files are what we will be using for the next steps of our analysis.

se   unpaired reads / single-end reads
-f   input file 
-o   output file
-q   minimum quality score
-l   minimum read length
-t   Type of quality score value, e.g. (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)) (required)

The full script for trimming of raw reads using sickle can be found at: /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/QualityControl/Illumina/sickle.sh
Once the run is completed the directory will contain the following files that has been trimmed, which will be used in the further analysis:

QualityControl/
  |--Illumina/
     |-- ambient.trimmed.fastq
     |-- cotreated.trimmed.fastq
     |-- elevated.trimmed.fastq
     `-- sickle_267830.out

 

 

De novo assembly 

In De novo assembly section, we will be woking in the /Assembly directory. Where Illumina quality controlled reads will be assembled using Trinity and the 454 quality controlled reads will be assembled using the MIRA. For each assembly there will be separate directories called /Trinity and /MIRA.
De novo assembly of Illumina reads using Trinity:
Now we assemble all three quality controlled Illumina libraries with Trinity which is a de novo transcriptome assembler. This is most appropriate to use on a read set where no reference genome exists or where the reference genome is not high quality. Assembly requires a great deal of memory (RAM) and can take a few days to run if the read set is large. The BBC cluster has a high memory node you can connect to access this. In this example, we are assembling short reads using the highmem.q which has a 514 GB RAM. The script trinity.sh can be found at /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/Assembly/Trinity.

module load trinity/2.4.0

Trinity --seqType fq \
        --max_memory 256G \
        --single ../../QualityControl/Illumina/ambient.trimmed.fastq,../../QualityControl/Illumina/cotreated.trimmed.fastq,../../QualityControl/Illumina/elevated.trimmed.fastq \
        --min_contig_length 300 \
        --CPU 16 \
        --normalize_reads

If you do not specify a output directory, the out put files will be directed to the trinity default folder called trinity_out_dir/.

Assembly/
  |--trinity_out_dir/
        |-- chrysalis/
        |-- inchworm.K25.L25.DS.fa
        |-- inchworm.K25.L25.DS.fa.finished
        |-- inchworm.kmer_count
        |-- insilico_read_normalization/
        |-- jellyfish.kmers.fa
        |-- jellyfish.kmers.fa.histo
        |-- output_tutorial.txt
        |-- partitioned_reads.files.list
        |-- partitioned_reads.files.list.ok
        |-- pipeliner.35650.cmds
        |-- read_partitions/
        |-- recursive_trinity.cmds
        |-- recursive_trinity.cmds.completed
        |-- recursive_trinity.cmds.ok
        |-- single.fa
        |-- single.fa.ok
        |-- single.fa.read_count
        `-- Trinity.fasta

 

There will be number of files created and the assembled file will be called Trinity.fasta. We will be using this for our next step.

To check the quality of your Trinity assembly we can use a utility program provided with the trinity software called TrinityStats.pl. To look at the contig summery we use a script called “trinity_stats.sh” located at, /common/RNASeq_Workshop/RedSpruce/Assembly/Trinity in the cluster.

perl $TRINITY_HOME/util/TrinityStats.pl trinity_out_dir/Trinity.fasta > TrinitySummary.txt

If you submit the using a script, the result will be piped in to the output file call “TrinitySummary.txt” and it will look like:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':	63856
Total trinity transcripts:	114946
Percent GC: 42.14

########################################
Stats based on ALL transcript contigs:
########################################

	Contig N10: 3636
	Contig N20: 2760
	Contig N30: 2230
	Contig N40: 1835
	Contig N50: 1492

	Median contig length: 662
	Average contig: 1022.21
	Total assembled bases: 117498506


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

	Contig N10: 3583
	Contig N20: 2710
	Contig N30: 2170
	Contig N40: 1761
	Contig N50: 1389

	Median contig length: 582
	Average contig: 940.82
	Total assembled bases: 60076924

 

Clustering 

Clustering of Illumina and 454 reads using vsearch:

To obtain a set of unique genes from both runs, we will cluster the two resulting assemblies together. First, the two assembies will be combined into one file using the Unix command cat, which refers to concatanate.

cat ../Assembly/Trinity/trinity_out_dir/Trinity.fasta ../Assembly/Mira/Mira_454_assembly/Mira_454_d_results/Mira_454_out.unpadded.fasta > combine.fasta 

Once the files are combined, we will use vsearch to find redundancy between the assembled transcripts and create a single output known as a centroids file. The threshold for clustering in this example is set to 80% identity. The following script combine.sh will be located at /common/RNASeq_Workshop/RedSpruce/Clustered

vsearch --threads 8 --log LOGFile \
	--cluster_fast combine.fasta \
	--id 0.80 --centroids centroids.fasta --uc clusters.uc

 

Usage: vsearch [OPTIONS]
--threads INT               number of threads to use, zero for all cores (0)
--log FILENAME              write messages, timing and memory info to file
--cluster_fast FILENAME     cluster sequences after sorting by length
--id REAL                   reject if identity lower, accepted values: 0-1.0
--centroids FILENAME        output centroid sequences to FASTA file
--uc FILENAME               specify filename for UCLUST-like output

 

After the run, it will contain the following files; where the centroids.fasta file will contain the unique genes from the 454 and Illumina assemblies.

Clustered/
|-- centroids.fasta
|-- clusters.uc
|-- combine.fasta
|-- combine.sh
`-- LOGFile

 

For the next step we will be using the centroids.fasta file.

 

Identifying the Coding Regions 

Identifying coding regions using TransDecoder:

Now that we have our reads assembled and clustered together into the single centroids file, we can use TransDecoder to determine optimal open reading frames from the assembly (ORFs). Assembled RNA-Seq transcripts may have 5′ or 3′ UTR sequence attached and this can make it difficult to determine the CDS in non-model spieces.
First we will use the TransDecoder utility to run on the centroid fasta file, as follows:

TransDecoder.LongOrfs -t ../Clustered/centroids.fasta

 

Usage: Transdecoder.LongOrfs [Options]

Options explained:
-t                             transcripts.fasta

By default it will identify ORFs that are at least 100 amino acids long. (you can change this by using -m parameter). It will produce a folder called centroids.fasta.transdecoder_dir, and will include all the temporary files in that directory.

CodingRegions/
|-- centroids.fasta.transdecoder_dir/
    |-- longest_orfs.pep
    |-- longest_orfs.gff3
    |-- longest_orfs.cds
    |-- base_freqs.dat.ok
    `-- base_freqs.dat

 

Next step is to, identify ORFs with homology to known proteins via blast or pfam searches. This will maximize the sensitivity for capturing the ORFs that have functional significance. It done using the following code:

hmmscan --cpu 8 --domtblout pfam.domtblout /common/Pfam/Pfam-B.hmm centroids.fasta.transdecoder_dir/longest_orfs.pep

 

It will create a file called pfam.domtblout in the main directory.
The final step is to predict the likely coding regions using the Pfam (or Blast) search results, using the following:

TransDecoder.Predict -t ../Clustered/centroids.fasta --retain_pfam_hits pfam.domtblout --cpu 8

 

The full script (Transdecoder.sh) for identifying the coding regions using the TransDecoder is located at: /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/CodingRegions/Transdecoder.sh

After the run is completed it will create the following files:

CodingRegions/
|-- centroids.fasta.transdecoder_dir/
|   |-- longest_orfs.pep
|   |-- longest_orfs.gff3.inx
|   |-- longest_orfs.gff3
|   |-- longest_orfs.cds.top_longest_5000.nr80.clstr
|   |-- longest_orfs.cds.top_longest_5000.nr80
|   |-- longest_orfs.cds.top_longest_5000
|   |-- longest_orfs.cds.top_500_longest.ok
|   |-- longest_orfs.cds.top_500_longest
|   |-- longest_orfs.cds.scores.selected
|   |-- longest_orfs.cds.scores.ok
|   |-- longest_orfs.cds.scores
|   |-- longest_orfs.cds.eclipsed_removed.gff3
|   |-- longest_orfs.cds.best_candidates.gff3
|   |-- longest_orfs.cds
|   |-- hexamer.scores.ok
|   |-- hexamer.scores
|   |-- base_freqs.dat.ok
|   `-- base_freqs.dat
|-- pfam.domtblout
|-- centroids.fasta.transdecoder.pep
|-- centroids.fasta.transdecoder.gff3
|-- centroids.fasta.transdecoder.cds
`-- centroids.fasta.transdecoder.bed

In the next step, for creating an index, we will be using the centroids.fasta.transdecoder.cds file.

Creating A Index 

Indexing the reads using bowtie2:

Before aligning the reads we need to create a index for the created transcoder assembly. We will be using Bowtie2 to build the index using the following command:

bowtie2-build ../CodingRegions/centroids.fasta.transdecoder.cds centroids_transdecoder_Index

 

Usage: bowtie2-build [options] [reference_in] [bt2_index_base]  
main arguments
reference_in            comma-separated list of files with ref sequences
bt2_index_base          base name for the index files to write

Options
--threads          # of threads, By default bowtie2-build is using only one thread. Increasing the number of threads will speed up the index building considerably in most cases.

The script (bowtie2.sh) for creating the index can be found at /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/Index/bowtie2.sh

Index/
|-- centroids_transdecoder_Index.1.bt2
|-- centroids_transdecoder_Index.2.bt2
|-- centroids_transdecoder_Index.3.bt2
|-- centroids_transdecoder_Index.4.bt2
|-- centroids_transdecoder_Index.rev.1.bt2
`-- centroids_transdecoder_Index.rev.2.bt2

 

Aligning the Reads to the Index 

Aligning the reads using bowtie2:

Once the index has been created, we will use bowtie2 to align our three trimmed reads, using the following command:

bowtie2 --threads 8 -x ../Index/centroids_transdecoder_Index ../QualityControl/Illumina/elevated.trimmed.fastq -S elevated.sam
bowtie2 --threads 8 -x ../Index/centroids_transdecoder_Index ../QualityControl/Illumina/ambient.trimmed.fastq -S ambient.sam
bowtie2 --threads 8 -x ../Index/centroids_transdecoder_Index ../QualityControl/Illumina/cotreated.trimmed.fastq -S cotreated.sam

 

Usage: bowtie2 [Options] -x [bt2_base_index] {reads / options}
Main arguments:
-x [bt2_base_index]     The basename of the index for the reference genome (i.e. with out the *.bt2 extension)
-S                      Output file name of the SAM alignment

Options:
--threads               Number of processors 

The script bowtie2_Align.sh used to align using the bowtie2 can be found at /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/Align/. Once aligned it will produce three SAM files.

Align/
|-- ambient.sam
|-- cotreated.sam
`-- elevated.sam

 

Converting SAM to BAM format:
SAM format file is a large human readable file. In a computational point of view, accessing this data for processing will take time. For faster processing this need to be converted into computer friendly binary format called the BAM files. We use samtools to convert the SAM file in to BAM files. Using the view command the it will be converted into BAM format and will be sorted using the sort command as shown bellow:

samtools view -uhS elevated.sam | samtools sort - elevated_sorted
samtools view -uhS ambient.sam | samtools sort - ambient_sorted
samtools view -uhS cotreated.sam | samtools sort - cotreated_sorted

 

Usage: samtools [command] [options] in.sam
Command:
view     prints all alignments in the specified input alignment file (in SAM, BAM, or CRAM format) to standard output in SAM format 

Options:
-h      Include the header in the output
-S      Indicate the input was in SAM format
-u      Output uncompressed BAM. This option saves time spent on compression/decompression and is thus preferred when the output is piped to another samtools command


Usage: samtools [command] [-o out.bam]
Command:
sort    Sort alignments by leftmost coordinates

-o      Write the final sorted output to FILE, rather than to standard output.

 

The script (sam2bam.sh) can be found at /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/Align. Once the conversion is done it will produce the following files which will be binary format.

Align/
|-- elevated_sorted.bam
|-- cotreated_sorted.bam
`-- ambient_sorted.bam

 

Aligned Read Counts 

Extracting Raw Read Counts with eXpress:

Once it has been aligned, we will use the eXpress software, to get the counts of differential expression.

express ../CodingRegions2/centroids.fasta.transdecoder.cds ../Align/elevated_sorted.bam -o elevated_express
express ../CodingRegions2/centroids.fasta.transdecoder.cds ../Align/ambient_sorted.bam -o ambient_express
express ../CodingRegions2/centroids.fasta.transdecoder.cds ../Align/cotreated_sorted.bam -o cotreated_express

 

Usage:  express [options] [target_seqs.fa] [hits.(sam/bam)]

Required arguments:
 target_seqs.fa     target sequence file in fasta format
 hits.(sam/bam)     read alignment file in SAM or BAM format

Options:
-o [ --output-dir ] arg (=.)        write all output files to this directory

 

The script (express.sh) can be found at /common/RNASeq_Workshop/RedSpruce/Count/express.sh directory. The program will run and will create a directory for each run and will include the output files inside the directories.

Count\
|-- elevated_express\
|   |-- results.xprs
|   `-- params.xprs
|-- cotreated_express\
|   |-- results.xprs
|   `-- params.xprs
`-- ambient_express\
    |-- results.xprs
    `-- params.xprs

 

Analysis 

Converting counts to Gfold format:

Before analyzing for the differential expression using the gfold package, we need to convert our counts to Gfold count format. The program expects, 5 data columns, which are “Gene Symbol”, “Gene Name”, “Read Count”, “Gene exon length”, “FPKM”. To do this we will read the results.xprs file and grab the columns which have the required fields, and will rewrite them to a new file, where it will be used as the input file for the Gfold program.

require("plyr")

express_file_1 <- data.frame(read.table("../Count/ambient_express/results.xprs",sep="\t", header = TRUE))
express_file_2 <- data.frame(read.table("../Count/cotreated_express/results.xprs",sep="\t", header = TRUE))
express_file_3 <- data.frame(read.table("../Count/elevated_express/results.xprs",sep="\t", header = TRUE))

subset1 <- data.frame(express_file_1$target_id, express_file_1$target_id, express_file_1$tot_counts, express_file_1$length,express_file_1$fpkm)
subset2 <- data.frame(express_file_2$target_id, express_file_2$target_id, express_file_2$tot_counts, express_file_2$length,express_file_2$fpkm)
subset3 <- data.frame(express_file_3$target_id, express_file_3$target_id, express_file_3$tot_counts, express_file_3$length,express_file_3$fpkm)

sorted1 <- subset1[order(express_file_1$target_id),]
sorted2 <- subset2[order(express_file_2$target_id),]
sorted3 <- subset3[order(express_file_3$target_id),]

subset1 <- rename(subset1,c("express_file_1.target_id"="GeneSymbol","express_file_1.target_id.1"="GeneName","express_file_1.tot_counts"="Read Count","express_file_1.length" ="Gene_exon_length","express_file_1.fpkm"="RPKM"))
subset2 <- rename(subset2,c("express_file_2.target_id"="GeneSymbol","express_file_2.target_id.1"="GeneName","express_file_2.tot_counts"="Read Count","express_file_2.length" ="Gene_exon_length","express_file_2.fpkm"="RPKM"))
subset3 <- rename(subset3,c("express_file_3.target_id"="GeneSymbol","express_file_3.target_id.1"="GeneName","express_file_3.tot_counts"="Read Count","express_file_3.length" ="Gene_exon_length","express_file_3.fpkm"="RPKM"))

write.table(sorted1, "ambient.read_cnt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(sorted2, "cotreated.read_cnt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(sorted3, "elevated.read_cnt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

 

We have created a R script (eXpress2gfold.R), which will be called, through eXpress2gfold.sh script. The script is located at /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/gfold/eXpress2gfold.sh .
The above script will produce the following gfold read count files as the output, which will be used in analyzing the differential expression.

gfold/
|-- ambient.read_cnt
|-- cotreated.read_cnt
`-- elevated.read_cnt

 

Differential expression using Gfold:

Using the formatted count files, we will use Gfold to analyze the differential expression of the three samples. Gfold program is very useful when no replicates are available. Gfold generalizes the fold change by, considering the log fold change value for each gene, and assigns it as a Gfold value. It overcomes the short comping of the p-value, that measure the expression under different conditions, by measuring a relative expression value. It also overcomes the deficiency of low read counts. The following code finds the differentially expressed genes between the two groups of samples.


gfold diff -s1 cotreated -s2 elevated -suf .read_cnt -o cotreated_VS_elevated.diff
gfold diff -s1 cotreated -s2 ambient -suf .read_cnt -o cotreated_VS_ambient.diff
gfold diff -s1 elevated -s2 ambient -suf .read_cnt -o elevated_VS_ambient.diff
Usage: gfold [jobs] [options]
Jobs:
diff     Calculate GFOLD value and other statics. It accepts the count values as input. 

Options explained:
-s1       prefix for read_count of 1st group separated by comma.
-s2       prefix for read_count of 2nd group separated by comma.
-suf      suffix of count file
-o        output file and for diff job, there will be two output files, where second will have .ext extension.
-norm     normalization method. 'Count' stands for normalization by total number of mapped reads.

 

The prepared script for running Gfold is called, gfold.sh and it is located at: /UCHC/PublicShare/RNASeq_Workshop/RedSpruce/gfold. The job will create following files:

gfold/
|-- cotreated_VS_ambient.diff
|-- cotreated_VS_ambient.diff.ext
|-- cotreated_VS_elevated.diff
|-- cotreated_VS_elevated.diff.ext
|-- elevated_VS_ambient.diff
`-- elevated_VS_ambient.diff.ext

 

The first few lines of the output file of cotreated_VS_ambient.diff will look like:

# This file is generated by gfold V1.1.4 on Wed Jul 19 13:51:02 2017
# Normalization constants :
#    cotreated  52838582        1.0984
#    elevated   55466744        1
# The GFOLD value could be considered as a reliable log2 fold change.
# It is positive/negative if the gene is up/down regulated.
# A gene with zero GFOLD value should never be considered as 
# differentially expressed. For a comprehensive description of 
# GFOLD, please refer to the manual.
#GeneSymbol   GeneName   GFOLD(0.01)  E-FDR   log2fdc     1stRPKM   2ndRPKM
Gene.10065    Gene.10065   -0.15334      1     -0.2063     84.977    63.876
Gene.10096    Gene.10096    2.46607      1      2.5501     25.495   129.567
Gene.10097    Gene.10097    1.90102      1      2.072      27.710   101.209
Gene.10138    Gene.10138    1.38044      1      1.4319     55.704   130.366
Gene.10198    Gene.10198   -0.24752      1     -0.2928    121.876    86.279
.
.
.
Gene.10423   Gene.10423     2.83385      1      2.91202     26.908   175.725
Gene.10440   Gene.10440     0            1     -0.05898     55.542    46.239
Gene.10487   Gene.10487     2.39166      1      2.56462      7.236    37.198

 

The output file contains 7 columns, namely “GeneSymbol”, “GeneName”, “GFOLD(0.01)”,
“E-FDR”, “log2fdc”, “1stRPKM”, “2ndRPKM”. From that, we will focus on the GFOLD column for each gene. The GFOLD value, can be considered as a reliable log2 fold value indicating the change. The positive value indicate a up regulated and a negative value indicate a down regulated gene.