This tutorial serves as a guide for RNA-Seq analysis of a non-model organism, using a needle transcriptome from red spruce as the example. The RNA-Seq data consist of Illumina short reads as well as 454 reads.
We will be going through:
- Quality control of the libraries
- De novo assembly of the reads
- Clustering of the assembled transcripts from each assembly
- Mapping the reads back to assembled/clustered transcriptome
- Generating raw counts
- DESeq2 to analyze differential gene expression between two treatments (elevated and ambient CO2).
The packages we’ll be using can be found here:
The data consist of three libraries, one per treatment:
- ambient CO2
- elevated CO2
- co-treated (exposed to both conditions)
Two types of sequencing were used:
- 454 reads were sequenced for three libraries and are located on the BBC cluster at
- Illumina reads were sequenced for three libraries and are located on the BBC cluster at
Quality control of Illumina reads using Sickle:
Step one is to perform quality control on the reads; for the Illumina reads we will use Sickle. We are working with unpaired reads, as indicated by the "se" (single-end) mode in the script below. The -f flag designates the input file, -o the output file, -q the minimum quality score, and -l the minimum read length. Here we use a minimum quality of 30 and a minimum length of 35. The trimmed output files are what we will use in the next steps of our analysis. The script can be found at
sickle se -f RS_ambient.fastq -t illumina -o RS_ambient_trimmed.fastq -q 30 -l 35
sickle se -f RS_elevated.fastq -t illumina -o RS_elevated_trimmed.fastq -q 30 -l 35
sickle se -f RS_cotreated.fastq -t illumina -o RS_cotreated_trimmed.fastq -q 30 -l 35
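A quick sanity check after trimming is to compare read counts before and after (a FASTQ record is four lines, so reads = lines / 4). A minimal sketch for the ambient library, assuming the filenames from the commands above; repeat for the other two libraries:

```shell
# Compare read counts before and after trimming (a FASTQ record is 4 lines).
# Filenames assume the sickle commands above have already been run.
for f in RS_ambient.fastq RS_ambient_trimmed.fastq; do
  echo "$f: $(( $(wc -l < "$f") / 4 )) reads"
done
```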
Quality control of 454 Reads using SolexaQA:
Prior to running quality control on the 454 reads, we need to build a fastq file from the raw output: the sequences are in a 'fasta' file and the quality scores in a matching 'qual' file. Using the 454_to_fastq.py script, we combine the sequence data and the quality scores into a single 'fastq' file.
python 454_to_fastq.py [fasta file] [qual file]
The script calls a python script (/common/RNASeq_Workshop/RedSpruce/454/454_to_fastq.py) that converts the raw 454 reads to Illumina-style fastq (Illumina was known as Solexa in the past).
python /common/RNASeq_Workshop/RedSpruce/454/454_to_fastq.py RS_elevated_CO2_11302010.sff.fasta RS_elevated_CO2_11302010.sff.qual
python /common/RNASeq_Workshop/RedSpruce/454/454_to_fastq.py RS_control_11302010.sff.fasta RS_control_11302010.sff.qual
python /common/RNASeq_Workshop/RedSpruce/454/454_to_fastq.py RS_co-treated_11302010.sff.fasta RS_co-treated_11302010.sff.qual
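The conversion the workshop script performs is straightforward: pair each FASTA sequence with its quality scores and encode the scores as Phred+33 characters. A minimal sketch of the idea (this is NOT the actual 454_to_fastq.py; all names here are illustrative):

```python
# Minimal sketch of a fasta + qual -> fastq conversion (NOT the workshop's
# 454_to_fastq.py; function names here are illustrative only).
def read_records(path):
    """Parse a FASTA-style file into (header, payload-lines) pairs."""
    records, header, lines = [], None, []
    for line in open(path):
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, lines))
            header, lines = line[1:], []
        elif line:
            lines.append(line)
    if header is not None:
        records.append((header, lines))
    return records

def to_fastq(fasta_path, qual_path, out_path):
    seqs = read_records(fasta_path)
    quals = read_records(qual_path)
    with open(out_path, "w") as out:
        for (name, seq_lines), (_, qual_lines) in zip(seqs, quals):
            seq = "".join(seq_lines)
            scores = [int(q) for line in qual_lines for q in line.split()]
            # Phred+33 ("Sanger") encoding, which downstream tools expect
            qual = "".join(chr(q + 33) for q in scores)
            out.write("@%s\n%s\n+\n%s\n" % (name, seq, qual))
```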
Next we use SolexaQA to perform quality control on the 454 reads. Trimming is done by p-value using the
--torrent option, with a minimum length of 50. Two programs from the package are run: the first (dynamictrim) trims reads based on quality; the second (lengthsort) removes very short sequences from the set. The script can be found at:
SolexaQA dynamictrim RS_elevated_CO2_11302010.fastq --torrent -d elevated
SolexaQA dynamictrim RS_control_11302010.fastq --torrent -d control
SolexaQA dynamictrim RS_co-treated_11302010.fastq --torrent -d cotreated
SolexaQA lengthsort elevated/RS_elevated_CO2_11302010.fastq.trimmed -l 50
SolexaQA lengthsort control/RS_control_11302010.fastq.trimmed -l 50
SolexaQA lengthsort cotreated/RS_co-treated_11302010.fastq.trimmed -l 50
De novo assembly
De novo assembly of Illumina reads using Trinity:
Now we assemble all three quality-controlled Illumina libraries with Trinity, a de novo transcriptome assembler. Trinity is most appropriate when no reference genome exists or when the reference genome is of low quality. Assembly requires a great deal of memory (RAM) and can take days to run if the read set is large; the BBC cluster has a high-memory node you can connect to for this. In this example, we are assembling short reads. The script can be found at:
module load trinity/2.0.2
Trinity --seqType fq --max_memory 50G \
    --single /common/RNASeq_Workshop/RedSpruce/Illumina/RS_ambient_trimmed.fastq,/common/RNASeq_Workshop/RedSpruce/Illumina/RS_cotreated_trimmed.fastq,/common/RNASeq_Workshop/RedSpruce/Illumina/RS_elevated_trimmed.fastq \
    --min_contig_length 300 --CPU 16 --normalize_reads
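Once Trinity finishes, a quick way to gauge the assembly is to count how many transcripts were produced; each FASTA record begins with a ">" header line (the output path below is the Trinity default assumed from the run above):

```shell
# Count assembled transcripts: one ">" header line per FASTA record.
# Output path assumed from the Trinity run above.
grep -c ">" trinity_out_dir/Trinity.fasta
```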
De novo assembly of 454 reads using MIRA:
Trinity is specialized for short-read data, so a different assembler is used for the Roche 454 reads. In this case, we will use MIRA [MIRA documentation], which also requires substantial memory and, potentially, considerable time to complete the assembly. All three quality-controlled 454 libraries will be assembled together.
In order to run MIRA, one must first construct a configuration file, which is given below for this exercise. The configuration file contains the locations of the trimmed fastq files and the number of processes to use in the assembly. The script can be found in the following path:
job = est,denovo,accurate
parameters = -GE:not=8

readgroup = SE_Control_454
data = /common/RNASeq_Workshop/RedSpruce/454/control/RS_control_trimmed.fastq
technology = 454

readgroup = SE_Elevated_454
data = /common/RNASeq_Workshop/RedSpruce/454/elevated/RS_elevated_trimmed.fastq
technology = 454

readgroup = SE_Cotreated_454
data = /common/RNASeq_Workshop/RedSpruce/454/cotreated/RS_co-treated_trimmed.fastq
technology = 454
While MIRA is running, its files need to be on local disk space on the cluster, so we create a temporary location for the run, named WORKING_DIR, in the scratch space. Once the run completes, the script copies all of its contents back to our destination directory (DESTINATION_DIR).
This script is located at
WORKING_DIR="/state/partition1/scratch/RedSpruce/est"
if [ ! -d "$WORKING_DIR" ]; then
    mkdir -p "$WORKING_DIR"
fi
DESTINATION_DIR="/common/RNASeq_Workshop/RedSpruce/454/mira_est"
cd "$WORKING_DIR"
cp /common/RNASeq_Workshop/RedSpruce/454/mira_est/* .
mira RS_454.conf > rs_log.txt
cp -r * "$DESTINATION_DIR"
rm -rf "$WORKING_DIR"
The MIRA assembly run will create four directories, and the assembled transcriptome file ('RS_Mira_454_out.padded.fasta') will be located in the 'RS_Mira_454_assembly/RS_Mira_454_d_results' folder.
Clustering of the Illumina and 454 assemblies using USEARCH:
To obtain a set of unique genes from both runs, we cluster the two resulting assemblies together. First, the two assemblies are combined into one file using the Unix command cat (short for concatenate). Once the files are combined, we use USEARCH's
--cluster_fast command to remove redundancy between the assembled transcripts and create a single output known as a centroids file. The identity threshold for clustering in this example is set to 80% using
--id. The script can be found at
cat /common/RNASeq_Workshop/RedSpruce/Illumina/Trinity/trinity_out_dir/Trinity.fasta /common/RNASeq_Workshop/RedSpruce/454/mira_est/RS_454_assembly/RS_454_d_results/RS_454_out.unpadded.fasta > temp.fasta
usearch --cluster_fast temp.fasta \
    --id 0.80 --centroids RS_454_Illumina_centroids.fasta
Identifying coding regions using TransDecoder:
Now that we have our reads assembled and clustered together into the single centroids file, we can use TransDecoder to identify likely open reading frames (ORFs) in the assembly. Assembled RNA-Seq transcripts may have 5′ or 3′ UTR sequence attached, which can make it difficult to determine the CDS in non-model species. TransDecoder can search the Pfam library, which contains information on annotated protein domains; requiring at least one Pfam domain hit to report an ORF assists with frame determination.
The transdecoder command options used:
-t <string>             transcripts.fasta
--search_pfam <string>  /path/to/pfam_db.hmm
--CPU <int>             number of threads to use (passed on to hmmscan) (default: 2)
/opt/bioinformatics/trinity2/trinity-plugins/TransDecoder_r20131110/TransDecoder \
    -t /common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/RS_clustered_filtered_centroids.fasta \
    --CPU 1 \
    --search_pfam /common/RNASeq_Workshop/RedSpruce/Pfam-B.hmm
Indexing reads and aligning with Bowtie2:
Next, we will use Bowtie2 to align the reads from our three treatment groups to our TransDecoder-derived assembly. First we need to create a Bowtie2 index of that assembly using the following script, which can be found at:
bowtie2-build /common/RNASeq_Workshop/Red_Spruce/Trinity_combined/Variants/TransdecodeAssembly/longest_orfs.cds.fasta RStransdecodeindex
Now that we have indexed our transdecoded assembly, we can use Bowtie2 to align our three treatments to the assembly using the script below, found at
bowtie2 -x RStransdecodeindex /common/RNASeq_Workshop/RedSpruce/Illumina/RS_elevated_trimmed.fastq -S RSelevated.sam
bowtie2 -x RStransdecodeindex /common/RNASeq_Workshop/RedSpruce/Illumina/RS_ambient_trimmed.fastq -S RSambient.sam
bowtie2 -x RStransdecodeindex /common/RNASeq_Workshop/RedSpruce/Illumina/RS_cotreated_trimmed.fastq -S RScotreated.sam
Counting gene expression with eXpress:
First we convert the .sam output files from Bowtie2 to the binary .bam form, sorted, using samtools. Script found at
samtools view -bhS RSambient.sam | samtools sort - RSambient_sorted
samtools view -bhS RScotreated.sam | samtools sort - RScotreated_sorted
samtools view -bhS RSelevated.sam | samtools sort - RSelevated_sorted
Now we can get counts of differential expression using eXpress.
express /common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/Variants/TransdecodedAssembly/RS_longest_orfs.cds.clustered.fasta RSambient_sorted.bam
express /common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/Variants/TransdecodedAssembly/RS_longest_orfs.cds.clustered.fasta RScotreated_sorted.bam
express /common/RNASeq_Workshop/RedSpruce/mira+trinity_clustered/Variants/TransdecodedAssembly/RS_longest_orfs.cds.clustered.fasta RSelevated_sorted.bam
Analysis of counts with DESeq2:
For the remaining steps I find it easier to work from a desktop rather than the server, so you can download the .xprs files you just created with eXpress from the server onto your computer. You will also need to download R to run DESeq2, and I'd also recommend installing RStudio, which provides a graphical interface that makes working with R scripts much easier. They can be found here:
The R DESeq2 library also must be installed. To install this package, start the R console and enter:
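For example (hedged: biocLite was the Bioconductor installer current when this tutorial was written; on modern R versions the equivalent is BiocManager::install("DESeq2")):

```r
# Bioconductor installer from the era of this tutorial; on newer R use:
#   install.packages("BiocManager"); BiocManager::install("DESeq2")
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
```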
- The most important output is “-replaceoutliers-results.csv”; there we can see adjusted and raw p-values, as well as the log2 fold change, for all of the genes.
- par(mar) manipulation is used to make the most appealing figures, but the best values are not the same for every display, system, or figure. Much documentation is available online on how to manipulate and best use par() and ggplot2 graphing parameters: DESeq2 Manual
directory <- "~/Documents/redspruce_samples/mira+trinity_clustered"
outputPrefix <- "Spruce_deseq2"
setwd(directory)

# Import eXpress files
express_file_1 <- data.frame(read.table("results_RSambient.xprs", sep = "\t", header = TRUE))
express_file_2 <- data.frame(read.table("results_RScotreated.xprs", sep = "\t", header = TRUE))
express_file_3 <- data.frame(read.table("results_RSelevated.xprs", sep = "\t", header = TRUE))

# Grab the target_id and tot_counts columns from the eXpress files
subset1 <- data.frame(express_file_1$target_id, express_file_1$tot_counts)
subset2 <- data.frame(express_file_2$target_id, express_file_2$tot_counts)
subset3 <- data.frame(express_file_3$target_id, express_file_3$tot_counts)

# Sort the files by target_id
sorted1 <- subset1[order(express_file_1$target_id),]
sorted2 <- subset2[order(express_file_2$target_id),]
sorted3 <- subset3[order(express_file_3$target_id),]

write.table(sorted1, "results_RSambient.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(sorted2, "results_RScotreated.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(sorted3, "results_RSelevated.txt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

sampleFiles <- c("results_RSambient.txt", "results_RScotreated.txt", "results_RSelevated.txt")
sampleNames <- c("Spruce_ambient", "Spruce_cotreated", "Spruce_elevated")
sampleCondition <- c("ambient", "cotreated", "elevated")
sampleTable <- data.frame(sampleName = sampleNames,
                          fileName = sampleFiles,
                          condition = sampleCondition)
treatments <- c("ambient", "cotreated", "elevated")

library("DESeq2")
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                       directory = directory,
                                       design = ~ condition)
colData(ddsHTSeq)$condition <- factor(colData(ddsHTSeq)$condition, levels = treatments)

# guts
dds <- DESeq(ddsHTSeq)
res <- results(dds)

# copied from: https://benchtobioinformatics.wordpress.com/category/dexseq/
# order results by p-value (most significant to least)
res <- subset(res, pvalue < 0.1)
res <- res[order(res$pvalue),]
# should see a DataFrame of baseMean, log2FoldChange, stat, pvalue, padj

# save results and normalized read counts to csv
resdata <- merge(as.data.frame(res),
                 as.data.frame(counts(dds, normalized = TRUE)),
                 by = 'row.names', sort = FALSE)
names(resdata)[1] <- 'target_id'
write.csv(resdata, file = paste0(outputPrefix, "-results-with-normalized.csv"))

# send normalized counts to tab-delimited file for GSEA, etc.
write.table(as.data.frame(counts(dds, normalized = TRUE)),
            file = paste0(outputPrefix, "_normalized_counts.txt"), sep = '\t')

# produce DataFrame of results of statistical tests
mcols(res, use.names = TRUE)
write.csv(as.data.frame(mcols(res, use.names = TRUE)),
          file = paste0(outputPrefix, "-test-conditions.csv"))

# replacing outlier values with estimated values as predicted by the distribution,
# using the "trimmed mean" approach; recommended if you have several replicates
# per treatment. DESeq2 will do this automatically if you have 7 or more replicates.
ddsClean <- replaceOutliersWithTrimmedMean(dds)
ddsClean <- DESeq(ddsClean)
tab <- table(initial = results(dds)$pvalue < 0.1,
             cleaned = results(ddsClean)$pvalue < 0.1)
addmargins(tab)
write.csv(as.data.frame(tab), file = paste0(outputPrefix, "-replaceoutliers.csv"))
resClean <- results(ddsClean)
resClean <- subset(resClean, pvalue < 0.1)
resClean <- resClean[order(resClean$padj),]
write.csv(as.data.frame(resClean), file = paste0(outputPrefix, "-replaceoutliers-results.csv"))

####################################################################################
# Exploratory data analysis of RNA-Seq data with DESeq2
#
# these next R scripts are for a variety of visualization, QC and other plots to
# get a sense of what the RNA-Seq data look like based on the DESeq2 analysis
#
# 1) MA plot
# 2) rlog stabilization and variance stabilization
# 3) variance stabilization plot
# 4) heatmap of clustering analysis
# 5) PCA plot
#
####################################################################################
# MA plot of RNA-Seq data for the entire dataset
# http://en.wikipedia.org/wiki/MA_plot
# genes with padj < 0.1 are colored red
plotMA(dds, ylim = c(-8,8), main = "RNAseq experiment")
dev.copy(png, paste0(outputPrefix, "-MAplot_initial_analysis.png"))
dev.off()

# transform raw counts into normalized values
# DESeq2 has two options: 1) rlog transformation and 2) variance stabilization
# variance stabilization is very good for heatmaps, etc.
rld <- rlogTransformation(dds, blind = TRUE)
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)

# save normalized values
write.table(as.data.frame(assay(rld)),
            file = paste0(outputPrefix, "-rlog-transformed-counts.txt"), sep = '\t')
write.table(as.data.frame(assay(vsd)),
            file = paste0(outputPrefix, "-vst-transformed-counts.txt"), sep = '\t')

# plot to show effect of transformation
# axis is square root of variance over the mean for all samples
par(mai = ifelse(1:4 <= 2, par('mai'), 0))
px <- counts(dds)[,1] / sizeFactors(dds)[1]
ord <- order(px)
ord <- ord[px[ord] < 150]
ord <- ord[seq(1, length(ord), length = 50)]
last <- ord[length(ord)]
vstcol <- c('blue', 'black')
matplot(px[ord], cbind(assay(vsd)[,1], log2(px))[ord,], type = 'l', lty = 1,
        col = vstcol, xlab = 'n', ylab = 'f(n)')
legend('bottomright', legend = c(expression('variance stabilizing transformation'),
                                 expression(log(n/s))), fill = vstcol)
dev.copy(png, paste0(outputPrefix, "-variance_stabilizing.png"))
dev.off()

# clustering analysis
# excerpts from http://dwheelerau.com/2014/02/17/how-to-use-deseq2-to-analyse-rnaseq-data/
library("RColorBrewer")
library("gplots")
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
rownames(mat) <- colnames(mat) <- with(colData(dds), paste(condition, sampleNames, sep = " : "))
# or, if you want conditions only, use:
# rownames(mat) <- colnames(mat) <- with(colData(dds), condition)
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
dev.copy(png, paste0(outputPrefix, "-clustering.png"))
heatmap.2(mat, trace = "none", col = rev(hmcol), margin = c(18,18))
dev.off()

# Principal components plot shows additional but rough clustering of samples
library("genefilter")
library("ggplot2")
library("grDevices")

rv <- rowVars(assay(rld))
select <- order(rv, decreasing = TRUE)[seq_len(min(500, length(rv)))]
pc <- prcomp(t(assay(vsd)[select,]))

# set condition
condition <- treatments
scores <- data.frame(pc$x, condition)

(pcaplot <- ggplot(scores, aes(x = PC1, y = PC2, col = factor(condition)))
  + geom_point(size = 5)
  + ggtitle("Principal Components")
  + scale_colour_brewer(name = " ", palette = "Set1")
  + theme(
      plot.title = element_text(face = 'bold'),
      legend.position = c(.9, .2),
      legend.key = element_rect(fill = 'NA'),
      legend.text = element_text(size = 10, face = "bold"),
      axis.text.y = element_text(colour = "Black"),
      axis.text.x = element_text(colour = "Black"),
      axis.title.x = element_text(face = "bold"),
      axis.title.y = element_text(face = 'bold'),
      panel.grid.major.x = element_blank(),
      panel.grid.major.y = element_blank(),
      panel.grid.minor.x = element_blank(),
      panel.grid.minor.y = element_blank(),
      panel.background = element_rect(color = 'black', fill = NA)
  ))
ggsave(pcaplot, file = paste0(outputPrefix, "-ggplot2.pdf"))

# scatter plots of rlog transformations between sample conditions
# nice way to compare control and experimental samples
head(assay(rld))
# plot(log2(1 + counts(dds, normalized = TRUE)[,1:2]), col = 'black', pch = 20, cex = 0.3, main = 'Log2 transformed')
# note: with three samples there are only three pairwise comparisons
plot(assay(rld)[,1:2], col = '#00000020', pch = 20, cex = 0.3, main = "rlog transformed")
plot(assay(rld)[,2:3], col = '#00000020', pch = 20, cex = 0.3, main = "rlog transformed")
plot(assay(rld)[,c(1,3)], col = '#00000020', pch = 20, cex = 0.3, main = "rlog transformed")

# heatmap of data
library("RColorBrewer")
library("gplots")

# 1000 top expressed genes with heatmap.2
select <- order(rowMeans(counts(ddsClean, normalized = TRUE)), decreasing = TRUE)[1:1000]
my_palette <- colorRampPalette(c("blue", 'white', 'red'))(n = 1000)
heatmap.2(assay(vsd)[select,], col = my_palette,
          scale = "row", key = TRUE, keysize = 1,
          symkey = TRUE, density.info = "none", trace = "none",
          cexCol = 0.6, labRow = FALSE,
          main = "1000 Top Expressed Genes Heatmap")
dev.copy(png, paste0(outputPrefix, "-HEATMAP.png"))
dev.off()

# top 2000 genes based on row variance with heatmap3
#library(heatmap3)
#rv <- rowVars(assay(vsd))
#select <- order(rv, decreasing=T)[seq_len(min(2000,length(rv)))]
#my_palette <- colorRampPalette(c("blue", "white", "red"))(1024)
#png(paste0(outputPrefix, "-heatmap3.png"))
#heatmap3(assay(vsd)[select,], col=my_palette,
#         labRow = F, cexCol = 0.8, margins=c(6,6))
#dev.off()

mar.old <- par('mar')
print(mar.old)
par(mar = c(5.1, 4.1, 2.0, 1.0))
png(paste0(outputPrefix, "-Dispersion-plot.png"))
plotDispEsts(dds, main = "Dispersion Estimates")
par(mar = mar.old)
dev.off()

sessionInfo()
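As noted above, the headline output is the "-replaceoutliers-results.csv" table. A quick way to skim the strongest hits outside of R is a short script like the following sketch (hedged: `top_hits` is a hypothetical helper; the column names log2FoldChange and padj follow DESeq2's results() output, and write.csv() puts the gene IDs in an unnamed first column):

```python
# Skim the top differentially expressed genes from the DESeq2 results CSV.
# write.csv() leaves the gene-ID column unnamed, so its DictReader key is "".
import csv

def top_hits(path, n=10):
    with open(path) as fh:
        # drop rows where padj is missing ("NA")
        rows = [r for r in csv.DictReader(fh) if r["padj"] not in ("", "NA")]
    rows.sort(key=lambda r: float(r["padj"]))
    return [(r[""], float(r["log2FoldChange"]), float(r["padj"]))
            for r in rows[:n]]

# e.g. top_hits("Spruce_deseq2-replaceoutliers-results.csv", n=20)
```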