RNA-seq for Non-model Species (Red Spruce)_V1

This tutorial will serve as a guideline for RNA sequence analysis of a non-model organism. We will be using a needle transcriptome from red spruce as our example.  The RNA-Seq data  consists of Illumina short reads as well as 454 reads.

We will be going through:

  • Quality control of the libraries
  • De novo assembly of the reads
  • Clustering of the assembled transcripts from each assembly
  • Mapping the reads back to assembled/clustered transcriptome
  • Generating raw counts
  • DESeq2 to analyze differential gene expression between two treatments (elevated and ambient CO2).

The packages we’ll be using can be found here:

Sickle

SolexaQA

Trinity

MIRA

usearch

TransDecoder

Bowtie2

eXpress

DESeq2

 

The data consists of three libraries under three different treatments;

  1. ambient CO
  2. elevated CO2
  3. one that had been cotreated for both conditions

 

Two types of sequencing methods were used;

  1. 454 reads were sequenced for three libraries and are located on the BBC cluster at /common/RNASeq_Workshop/RedSpruce/454
  2. Illumina reads were sequenced for three libraries and are located on the BBC cluster at /common/RNASeq_Workshop/RedSpruce/Illumina

 

Quality control 

Quality control of Illumina reads using Sickle:

Step one is to perform quality control on the reads, and we will be using Sickle for the Illumina reads. We are using unpaired reads, as indicated by the “se” flag in the script below. The -f flag designates the input file, -o is the output file, -q is our minimum quality score and -l is the minimum read length. We are using a minimum quality of 30 and a minimum length of 35. The trimmed output files are what we will be using for the next steps of our analysis. The script can be found at /common/RNASeq_Workshop/RedSpruce/Illumina/RS_sickle.sh

sickle se -f RS_ambient.fastq -t illumina -o RS_ambient_trimmed.fastq -q 30 -l 35
sickle se -f RS_elevated.fastq -t illumina -o RS_elevated_trimmed.fastq -q 30 -l 35
sickle se -f RS_cotreated.fastq -t illumina -o RS_cotreated_trimmed.fastq -q 30 -l 35

Quality control of 454 Reads using SolexaQA:

Prior to running the quality control on the 454 reads we need to create a fastq file using our raw reads. We do have the sequence  file in ‘fasta’ format and the quality score in the ‘qual’ format file. Using the 454_to_fastq.py script we will combine the sequence data and the quality score into a single file which will be a ‘fastq’ file.

python 454_to_fastq.py [fasta file] [qual file]

The script  will be calling a python script (/common/RNASeq_Workshop/RedSpruce/454/454_to_fastq.py) that converts 454 raw reads to Illumina (known as Solexa in the past).

python /common/RNASeq_Workshop/RedSpruce/454/454_to_fastq.py RS_elevated_CO2_11302010.sff.fasta RS_elevated_CO2_11302010.sff.qual
python /common/RNASeq_Workshop/RedSpruce/454/454_to_fastq.py RS_control_11302010.sff.fasta RS_control_11302010.sff.qual
python /common/RNASeq_Workshop/RedSpruce/454/454_to_fastq.py RS_co-treated_11302010.sff.fasta RS_co-treated_11302010.sff.qual

Next we use SolexaQA to perform quality control of the 454 reads. We run by p-value using the --torrent option, with a minimum length of 50.  Two scripts will be run from the package.  The first is (dynamictrim) responsible for trimming based on quality.  The second (lengthsort) will remove very short sequences from the set.  The script can be found at: /common/RNASeq_Workshop/RedSpruce/454/454_pvalue.sh

SolexaQA dynamictrim RS_elevated_CO2_11302010.fastq --torrent -d elevated
SolexaQA dynamictrim RS_control_11302010.fastq --torrent -d control
SolexaQA dynamictrim RS_co-treated_11302010.fastq --torrent -d cotreated
SolexaQA lengthsort elevated/RS_elevated_CO2_11302010.fastq.trimmed -l 50
SolexaQA lengthsort control/RS_control_11302010.fastq.trimmed -l 50
SolexaQA lengthsort cotreated/RS_co-treated_11302010.fastq.trimmed -l 50

 

De novo assembly

De novo assembly of Illumina reads using Trinity:

Now we assemble all three quality controlled Illumina libraries with Trinity which is a de novo transcriptome assembler.  This is most appropriate to use on a read set where no reference genome exists or where the reference genome is not high quality.  Assembly requires a great deal of memory (RAM) and can take a few days to run if the read set is large.  The BBC cluster has a high memory node you can connect to access this.   In this example, we are assembling short reads  The script can be found at: /common/RNASeq_Workshop/RedSpruce/Illumina/Trinity/RS_illumina_trinity.sh

module load trinity/2.0.2

Trinity --seqType fq --max_memory 50G --single /common/RNASeq_Workshop/RedSpruce/Illumina/RS_ambient_trimmed.fastq,/common/RNASeq_Workshop/RedSpruce/Illumina/RS_cotreated_trimmed.fastq,/common/RNASeq_Workshop/RedSpruce/Illumina/RS_elevated_trimmed.fastq --min_contig_length 300 --CPU 16 --normalize_reads

De novo assembly of 454 reads using MIRA:

Trinity is specialized for short read data so different assemblers will be used for the Roche 454.  In this case, we will use MIRA [MIRA documentation] which also requires substantial memory and potentially time to complete the assemly.  All three quality controlled 454 libraries will be assembled togehter.

In order to run mira, one must first construct a configuration file, which is given bellow for this exercise. The configuration file will contain the location of trimmed fastq files,  and the number of processes to be used in the genome assembly. The script can be found in the following path :

/common/RNASeq_Workshop/RedSpruce/454/mira_est/RS_454.conf

job = est,denovo,accurate
parameters = -GE:not=8

readgroup = SE_Control_454
data = /common/RNASeq_Workshop/RedSpruce/454/control/RS_control_trimmed.fastq
technology = 454

readgroup = SE_Elevated_454
data = /common/RNASeq_Workshop/RedSpruce/454/elevated/RS_elevated_trimmed.fastq
technology = 454

readgroup = SE_Cotreated_454
data = /common/RNASeq_Workshop/Red_Spruce/454/cotreated/RS_co-treated_trimmed.fastq
technology = 454

When mira is running the files need to be in a local disk space in the cluster, so we will create a temporary location for our run which is named as the WORKING_DIR in the scratch space. Once the run is completed we will copy all its contents back to our destination directory (DESTINATION_DIR) using the following script.

This script is located at /common/RNASeq_Workshop/RedSpruce/454/mira_est/mira_RS_454.sh

WORKING_DIR="/state/partition1/scratch/RedSpruce/est"

if [ ! -d "$WORKING_DIR" ]; then
mkdir -p $WORKING_DIR
fi

DESTINATION_DIR="/common/RNASeq_Workshop/RedSpruce/454/mira_est"

cd $WORKING_DIR

cp /common/RNASeq_Workshop/RedSpruce/454/mira_est/* .

mira RS_454.conf > rs_log.txt

cp -r * $DESTINATION_DIR
rm -rf $WORKING_DIR

 

Mira assembly run will create four directories, and the assembled transcriptome file (‘RS_Mira_454_out.padded.fasta’) will be located in the ‘RS_Mira_454_assembly/RS_Mira_454_d_results’ folder.

Clustering

Clustering of Illumina and 454 reads using usearch:

To obtain a set of unique genes from both runs, we will cluster the two resulting assemblies together.  First, the two assembies will be combined into one file using the Unix command cat, which refers to concatanate.  Once the files are combined, we will use USEARCH‘s command --cluster_fast to find redundancy between the assembled transcripts and create a single output known as a centroids file.  The threshold for clustering in this example is set to 80% identity using --id. The script can be found at /common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/combine.sh

cat /common/RNASeq_Workshop/RedSpruce/Illumina/Trinity/trinity_out_dir/Trinity.fasta /common/RNASeq_Workshop/RedSpruce/454/mira_est/RS_454_assembly/RS_454_d_results/RS_454_out.unpadded.fasta > temp.fasta
usearch --cluster_fast temp.fasta \
        --id 0.80 --centroids RS_454_Illumina_centroids.fasta

Identifying coding regions using TransDecoder:

Now that we have our reads assembled and clustered together into the single centroids file, we can use TransDecoder to determine optimal open reading frames from the assembly (ORFs).  Assembled RNA-Seq transcripts may have 5′ or 3′ UTR sequence attached and this can make it difficult to determine the CDS in non-model spieces.  Transdecoder references the Pfam library which contains information on annotated protein domains.  At least one try domain must be found to result in an ORF identification as this assists with frame determination.

The transdecoder  command options used:

-t <string>                    transcripts.fasta 
--search_pfam <string>         /path/to/pfam_db.hmm  
--CPU <int>                    number of threads to use (passed on to hmmscan) (default: 2)

 

/opt/bioinformatics/trinity2/trinity-plugins/TransDecoder_r20131110/TransDecoder \
           -t /common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/RS_clustered_filtered_centroids.fasta \
           --CPU 1 \
           --search_pfam /common/RNASeq_Workshop/RedSpruce/Pfam-B.hmm

Indexing reads and aligning with Bowtie2:

Next, we will be using Bowtie2 to align the transcripts of our three treatment groups to our transdecoded assembly. First we need to create a Bowtie2 index of our transdecoded assembly using the following script, which can be found at:

/common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/Variants/TransdecodedAssembly/bowtieRStransdecodeclusteredindex.sh

bowtie2-build /common/RNASeq_Workshop/Red_Spruce/Trinity_combined/Variants/TransdecodeAssembly/longest_orfs.cds.fasta RStransdecodeindex

Now that we have indexed our transdecoded assembly, we can use Bowtie2 to align our three treatments to the assembly using the script below, found at /common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/bowtieRStrandecalign.sh

bowtie2 -x RStransdecodeclusteredindex /common/RNASeq_Workshop/RedSpruce/Illumina/RS-elevated_trimmed.fastq -S RSelevated.sam

bowtie2 -x RStransdecodeclusteredindex /common/RNASeq_Workshop/RedSpruce/Illumina/RS-ambient_trimmed.fastq -S RSambient.sam

bowtie2 -x RStransdecodeclusteredindex /common/RNASeq_Workshop/RedSpruce/Illumina/RS-cotreated_trimmed.fastq -S RScotreated.sam

Counting gene expression with eXpress:

First we convert the  .sam output files from bowtie2 to the binary .bam form using samtools. Script found at /common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/SAMtoBAM.sh

samtools view -bhS RSambient.sam | samtools sort - RSambient_sorted
samtools view -bhS RScotreated.sam | samtools sort - RScotreated_sorted
samtools view -bhS RSelevated.sam | samtools sort - RSelevated_sorted

Now we can get counts of differential expression using eXpress.

/common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/express.sh

express /common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/Variants/TransdecodedAssembly/RS_longest_orfs.cds.clustered.fasta RSambient_sorted.bam

express /common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/Variants/TransdecodedAssembly/RS_longest_orfs.cds.clustered.fasta RScotreated_sorted.bam

express /common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/Variants/TransdecodedAssembly/RS_longest_orfs.cds.clustered.fasta RSelevated_sorted.bam

 

Analysis of counts with DESeq2:

For the remaining steps I find it easier to to work from a desktop rather than the server. So you can download the .xprs files you just created with eXpress from the server onto your computer. You will also need to download R to run DESeq2, and I’d also recommend installing RStudio, which provides a graphical interface that makes working with R scripts much easier. They can be found here:

R Download

RStudio

The R DESeq2 library also must be installed. To install this package, start the R console and enter:

source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")

The R code below is long and slightly complicated, but I will highlight major points. This script was adapted from here and here, and much credit goes to those authors. Some important notes:

  • The most important information comes out as “-replaceoutliers-results.csv” there we can see adjusted and normal p-values, as well as log2foldchange for all of the genes.
  • par(mar) manipulation is used to make the most appealing figures, but these values are not the same for every display or system or figure. Much documentation is available online on how to manipulate and best use par() and ggplot2 graphing parameters: DESeq2 Manual
directory <- "~/Documents/redspruce_samples/mira+trinity_clustered"
outputPrefix <- "Spruce_deseq2"
setwd(directory)

#Imoport eXpress files
express_file_1 <- data.frame(read.table("results_RSambient.xprs",sep="\t", header = TRUE))
express_file_2 <- data.frame(read.table("results_RScotreated.xprs",sep="\t", header = TRUE))
express_file_3 <- data.frame(read.table("results_RSelevated.xprs",sep="\t", header = TRUE))

#Grab the target_id and tot_count columns from eXpress files
subset1 <- data.frame(express_file_1$target_id, express_file_1$tot_counts)
subset2 <- data.frame(express_file_2$target_id, express_file_2$tot_counts)
subset3 <- data.frame(express_file_3$target_id, express_file_3$tot_counts)

#Sort the files by target_id
sorted1 <- subset1[order(express_file_1$target_id),]
sorted2 <- subset2[order(express_file_2$target_id),]
sorted3 <- subset3[order(express_file_3$target_id),]


write.table(sorted1, "results_RSambient.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(sorted2, "results_RScotreated.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(sorted3, "results_RSelevated.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)


sampleFiles<- c("results_RSambient.txt",
                "results_RScotreated.txt",
                "results_RSelevated.txt")
sampleNames <- c("Spruce_ambient","Spruce_cotreated","Spruce_elevated")
sampleCondition <- c("ambient","cotreated","elevated")
sampleTable <- data.frame(sampleName = sampleNames, fileName = sampleFiles, condition = sampleCondition)

treatments = c("ambient","cotreated","elevated")

library("DESeq2")
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                       directory = directory,
                                       design = ~ condition)
colData(ddsHTSeq)$condition <- factor(colData(ddsHTSeq)$condition,
                                      levels = treatments)

#guts
dds <- DESeq(ddsHTSeq)
res <- results(dds)

# copied from: https://benchtobioinformatics.wordpress.com/category/dexseq/
# order results by pvalue value (most significant to least)
res= subset(res, pvalue<0.1)
res <- res[order(res$pvalue),]

# should see DataFrame of baseMean, log2Foldchange, stat, pval, padj

# save data results and normalized reads to csv
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds,normalized =TRUE)), by = 'row.names', sort = FALSE)
names(resdata)[1] <- 'target_id'
write.csv(resdata, file = paste0(outputPrefix, "-results-with-normalized.csv"))

# send normalized counts to tab delimited file for GSEA, etc.
write.table(as.data.frame(counts(dds),normalized=T), file = paste0(outputPrefix, "_normalized_counts.txt"), sep = '\t')

# produce DataFrame of results of statistical tests
mcols(res, use.names = T)
write.csv(as.data.frame(mcols(res, use.name = T)),file = paste0(outputPrefix, "-test-conditions.csv"))

# replacing outlier value with estimated value as predicted by distrubution using
# "trimmed mean" approach. recommended if you have several replicates per treatment
# DESeq2 will automatically do this if you have 7 or more replicates

ddsClean <- replaceOutliersWithTrimmedMean(dds)
ddsClean <- DESeq(ddsClean)
tab <- table(initial = results(dds)$pvalue < 0.1,
             cleaned = results(ddsClean)$pvalue < 0.1)
addmargins(tab)
write.csv(as.data.frame(tab),file = paste0(outputPrefix, "-replaceoutliers.csv"))
resClean <- results(ddsClean)
resClean = subset(res, pvalue<0.1)
resClean <- resClean[order(resClean$padj),]
write.csv(as.data.frame(resClean),file = paste0(outputPrefix, "-replaceoutliers-results.csv"))

####################################################################################
# Exploritory data analysis of RNAseq data with DESeq2
#
# these next R scripts are for a variety of visualization, QC and other plots to
# get a sense of what the RNAseq data looks like based on DESEq2 analysis
#
# 1) MA plot
# 2) rlog stabilization and variance stabiliazation
# 3) variance stabilization plot
# 4) heatmap of clustering analysis
# 5) PCA plot
#
#
####################################################################################

# MA plot of RNAseq data for entire dataset
# http://en.wikipedia.org/wiki/MA_plot
# genes with padj < 0.1 are colored Red
plotMA(dds, ylim=c(-8,8),main = "RNAseq experiment")
dev.copy(png, paste0(outputPrefix, "-MAplot_initial_analysis.png"))
dev.off()

# transform raw counts into normalized values
# DESeq2 has two options:  1) rlog transformed and 2) variance stabilization
# variance stabilization is very good for heatmaps, etc.
rld <- rlogTransformation(dds, blind=T)
vsd <- varianceStabilizingTransformation(dds, blind=T)

# save normalized values
write.table(as.data.frame(assay(rld),file = paste0(outputPrefix, "-rlog-transformed-counts.txt"), sep = '\t'))
write.table(as.data.frame(assay(vsd),file = paste0(outputPrefix, "-vst-transformed-counts.txt"), sep = '\t'))

# plot to show effect of transformation
# axis is square root of variance over the mean for all samples
par(mai = ifelse(1:4 <= 2, par('mai'),0))
px <- counts(dds)[,1] / sizeFactors(dds)[1]
ord <- order(px)
ord <- ord[px[ord] < 150]
ord <- ord[seq(1,length(ord),length=50)]
last <- ord[length(ord)]
vstcol <- c('blue','black')
matplot(px[ord], cbind(assay(vsd)[,1], log2(px))[ord, ],type='l', lty = 1, col=vstcol, xlab = 'n', ylab = 'f(n)')
legend('bottomright',legend=c(expression('variance stabilizing transformation'), expression(log[2](n/s[1]))), fill=vstcol)
dev.copy(png,paste0(outputPrefix, "-variance_stabilizing.png"))
dev.off()

# clustering analysis
# excerpts from http://dwheelerau.com/2014/02/17/how-to-use-deseq2-to-analyse-rnaseq-data/
library("RColorBrewer")
library("gplots")
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
rownames(mat) <- colnames(mat) <- with(colData(dds), paste(condition, sampleNames, sep=" : "))
#Or if you want conditions use:
#rownames(mat) <- colnames(mat) <- with(colData(dds),condition)
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
dev.copy(png, paste0(outputPrefix, "-clustering.png"))
heatmap.2(mat, trace = "none", col = rev(hmcol), margin = c(18,18))
dev.off()

#Principal components plot shows additional but rough clustering of samples
library("genefilter")
library("ggplot2")
library("grDevices")

rv <- rowVars(assay(rld))
select <- order(rv, decreasing=T)[seq_len(min(500,length(rv)))]
pc <- prcomp(t(assay(vsd)[select,]))

# set condition
condition <- treatments
scores <- data.frame(pc$x, condition)

(pcaplot <- ggplot(scores, aes(x = PC1, y = PC2, col = (factor(condition))))
+ geom_point(size = 5)
+ ggtitle("Principal Components")
+ scale_colour_brewer(name = " ", palette = "Set1")
+ theme(
  plot.title = element_text(face = 'bold'),
  legend.position = c(.9,.2),
  legend.key = element_rect(fill = 'NA'),
  legend.text = element_text(size = 10, face = "bold"),
  axis.text.y = element_text(colour = "Black"),
  axis.text.x = element_text(colour = "Black"),
  axis.title.x = element_text(face = "bold"),
  axis.title.y = element_text(face = 'bold'),
  panel.grid.major.x = element_blank(),
  panel.grid.major.y = element_blank(),
  panel.grid.minor.x = element_blank(),
  panel.grid.minor.y = element_blank(),
  panel.background = element_rect(color = 'black',fill = NA)
))

ggsave(pcaplot,file=paste0(outputPrefix, "-ggplot2.pdf"))

# scatter plot of rlog transformations between Sample conditions
# nice way to compare control and experimental samples
head(assay(rld))
# plot(log2(1+counts(dds,normalized=T)[,1:2]),col='black',pch=20,cex=0.3, main='Log2 transformed')
plot(assay(rld)[,1:3],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
plot(assay(rld)[,2:4],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
plot(assay(rld)[,6:5],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
# plot(assay(rld)[,7:8],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
# plot(assay(rld)[,9:10],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
# plot(assay(rld)[,11:12],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
# plot(assay(rld)[,13:14],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
# plot(assay(rld)[,15:16],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")

# heatmap of data
library("RColorBrewer")
library("gplots")
# 1000 top expressed genes with heatmap.2
select <- order(rowMeans(counts(ddsClean,normalized=T)),decreasing=T)[1:1000]
my_palette <- colorRampPalette(c("blue",'white','red'))(n=1000)
heatmap.2(assay(vsd)[select,], col=my_palette,
          scale="row", key=T, keysize=1, symkey=T,
          density.info="none", trace="none",
          cexCol=0.6, labRow=F,
          main="1000 Top Expressed Genes Heatmap")
dev.copy(png, paste0(outputPrefix, "-HEATMAP.png"))
dev.off()

# top 2000 genes based on row variance with heatmap3
#library(heatmap3)
#colsidecolors = c("darkgreen","darkgreen",
#  "mediumpurple2","mediumpurple2",
#  "mediumpurple2","mediumpurple2",
#  "darkgreen","darkgreen",
#  "mediumpurple2","mediumpurple2",
#  "darkgreen","darkgreen",
#  "mediumpurple2","mediumpurple2",
#  "mediumpurple2","mediumpurple2")
#rv <- rowVars(assay(vsd))
#select <- order(rv, decreasing=T)[seq_len(min(2000,length(rv)))]
#my_palette <- colorRampPalette(c("blue", "white", "red"))(1024)
#png(paste0(outputPrefix, "-heatmap3.png"))
#heatmap3(assay(vsd)[select,],col=my_palette,
#         labRow = F,cexCol = 0.8,margins=c(6,6))
#dev.off()

mar.old <- par('mar')
print(mar.old)
par(mar=c(5.1, 4.1, 2.0, 1.0))
png(paste0(outputPrefix, "-Dispersion-plot.png"))
plotDispEsts(dds, main="Dispersion Estimates")
par(mar=mar.old)
dev.off()

sessionInfo()