RNA-seq Tutorial (with Reference Genome)

This tutorial will serve as a guideline for how to go about analyzing RNA sequencing data when a reference genome is available. We will be going through quality control of the reads, alignment of the reads to the reference genome, conversion of the files to raw counts, analysis of the counts with DeSeq2, and finally annotation of the reads using Biomart. Most of this will be done on the BBC server unless otherwise stated.

The packages we’ll be using can be found here:    Page by Dister Deoss

Sickle

STAR

HTSeq

DESeq2

biomaRt

 

The data we will be using are comparative transcriptomes of soybeans grown at either ambient or elevated Olevels. Each condition was done in triplicate, giving us a total of six samples we will be working with. The paper that these samples come from (which also serves as a great background reading on RNA-seq) can be found here:

The Bench Scientist’s Guide to statistical Analysis of RNA-Seq Data

The samples we will be using are described by the following accession numbers; SRR391535, SRR391536, SRR391537, SRR391538, SRR391539, and SRR391541. They can be found in results 13 through 18 of the following NCBI search:

http://www.ncbi.nlm.nih.gov/sra/?term=SRP009826

The script for downloading these .SRA files and converting them to fastq can be found in

/common/RNASeq_Workshop/Soybean/Quality_Control as the file fastq-dump.sh. The fastq files themselves are also already saved to this same directory. 

module load sratoolkit/2.8.1
fastq-dump SRR391535

 

Quality Control on the Reads Using Sickle:

Step one is to perform quality control on the reads using Sickle. We are using unpaired reads, as indicated by the “se” flag in the script below. The -f flag designates the input file, -o is the output file, -q is our minimum quality score and -l is the minimum read length.  The trimmed output files are what we will be using for the next steps of our analysis.

sickle se -f SRR391535.fastq -t sanger -o trimmed_SRR391535.fastq -q 35 -l 45

The script for running quality control on all six of our samples can be found in

/common/RNASeq_Workshop/Soybean/Quality_Control as the file sickle_soybean.sh. The output trimmed fastq files are also stored in this directory.

 

Alignment of Trimmed Reads Using STAR:

For this next step, you will first need to download the reference genome and annotation file for Glycine max (soybean). The files I used can be found at the following link:

Phytozome – Glycine max

You will need to create a user name and password for this database before you download the files. Once you’ve done that, you can download the assembly file Gmax_275_v2 and the annotation file Gmax_275_Wm82.a2.v1.gene_exons. Having the correct files is important for annotating the genes with Biomart later on.

Now that you have the genome and annotation files, you will create a genome index using the following script:

STAR --runMode genomeGenerate --genomeDir /common/RNASeq_Workshop/Soybean/gmax_genome/ --genomeFastaFiles /common/RNASeq_Workshop/Soybean/gmax_genome/Gmax_275_v2 --sjdbGTFfile /common/RNASeq_Workshop/Soybean/gmax_genome/Gmax_275_Wm82.a2.v1.gene_exons --sjdbGTFtagExonParentTranscript Parent --sjdbOverhang 100 --runThreadN 8

You will likely have to alter this script slightly to reflect the directory that you are working in and the specific names you gave your files, but the general idea is there. Indexing the genome allows for more efficient mapping of the reads to the genome.

The assembly file, annotation file, as well as all of the files created from indexing the genome can be found in

/common/RNASeq_Workshop/Soybean/gmax_genome

Now that you have your genome indexed, you can begin mapping your trimmed reads with the following script:

STAR --genomeDir /common/RNASeq_Workshop/Soybean/gmax_genome/ --readFilesIn trimmed_SRR391535.fastq --runThreadN 8 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix SRR391535

The –genomeDir flag refers to the directory in which your indexed genome is located. The output we get from this are .BAM files; binary files that will be converted to raw counts in our next step.

The script for mapping all six of our trimmed reads to .bam files can be found in

/common/RNASeq_Workshop/Soybean/STAR_HTSEQ_mapping as the file star_soybean.sh. The .bam output files are also stored in this directory. 

 

Convert BAM Files to Raw Counts with HTSeq:

Finally, we will use HTSeq to transform these mapped reads into counts that we can analyze with R. “-s” indicates we do not have strand specific counts. “-r” indicates the order that the reads were generated, for us it was by alignment position. “-t” indicates the feature from the annotation file we will be using, which in our case will be exons. “-i” indicates what attribute we will be using from the annotation file, here it is the PAC transcript ID. We identify that we are pulling in a .bam file (“-f bam”) and proceed to identify, and say where it will go.

htseq-count -s no -r pos —t exon -i pacid -f bam SRR391535Aligned.sortedByCoord.out.bam /common/RNASeq_Workshop/Soybean/gmax_genome/Gmax_275_Wm82.a2.v1.gene_exons > SRR391535-output_basename.counts

The script for converting all six .bam files to .count files is located in

/common/RNASeq_Workshop/Soybean/STAR_HTSEQ_mapping as the file htseq_soybean.sh. The .count output files are saved in 

/common/RNASeq_Workshop/Soybean/STAR_HTSEQ_mapping/counts

 

Analysis of Counts with DESeq2:

For the remaining steps I find it easier to to work from a desktop rather than the server. So you can download the .count files you just created from the server onto your computer. You will also need to download R to run DESeq2, and I’d also recommend installing RStudio, which provides a graphical interface that makes working with R scripts much easier. They can be found here:

R Download

RStudio

The R DESeq2 library also must be installed. To install this package, start the R console and enter:

source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")

The R code below is long and slightly complicated, but I will highlight major points. This script was adapted from here and here, and much credit goes to those authors. Some important notes:

  • The most important information comes out as “-replaceoutliers-results.csv” there we can see adjusted and normal p-values, as well as log2foldchange for all of the genes.
  • par(mar) manipulation is used to make the most appealing figures, but these values are not the same for every display or system or figure. Much documentation is available online on how to manipulate and best use par() and ggplot2 graphing parameters: DESeq2 Manual
# Load DESeq2 library
library("DESeq2")
# Set the working directory
directory <- "~/Documents/counts2"
setwd(directory)

# Set the prefix for each output file name
outputPrefix <- "Gmax_DESeq2"

sampleFiles<- c("SRR391535-output_basename.counts",
                "SRR391536-output_basename.counts",
                "SRR391537-output_basename.counts",
                "SRR391538-output_basename.counts",
                "SRR391539-output_basename.counts",
                "SRR391541-output_basename.counts")
sampleNames <- c("Leaf tissue 1","Leaf tissue 2","Leaf tissue 3","Leaf tissue 4","Leaf tissue 5","Leaf tissue 6")
sampleCondition <- c("ambient","ambient","elevated","elevated","elevated","ambient")
sampleTable <- data.frame(sampleName = sampleNames, fileName = sampleFiles, condition = sampleCondition)

treatments = c("ambient","elevated")

library("DESeq2")
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                       directory = directory,
                                       design = ~ condition)
colData(ddsHTSeq)$condition <- factor(colData(ddsHTSeq)$condition,
                                      levels = treatments)

#guts
dds <- DESeq(ddsHTSeq)
res <- results(dds)

# copied from: https://benchtobioinformatics.wordpress.com/category/dexseq/
# order results by padj value (most significant to least)
res= subset(res, padj<0.05)
res <- res[order(res$padj),]
# should see DataFrame of baseMean, log2Foldchange, stat, pval, padj

# save data results and normalized reads to csv
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds,normalized =TRUE)), by = 'row.names', sort = FALSE)
names(resdata)[1] <- 'gene'
write.csv(resdata, file = paste0(outputPrefix, "-results-with-normalized.csv"))

# send normalized counts to tab delimited file for GSEA, etc.
write.table(as.data.frame(counts(dds),normalized=T), file = paste0(outputPrefix, "_normalized_counts.txt"), sep = '\t')

# produce DataFrame of results of statistical tests
mcols(res, use.names = T)
write.csv(as.data.frame(mcols(res, use.name = T)),file = paste0(outputPrefix, "-test-conditions.csv"))

# replacing outlier value with estimated value as predicted by distrubution using
# "trimmed mean" approach. recommended if you have several replicates per treatment
# DESeq2 will automatically do this if you have 7 or more replicates

ddsClean <- replaceOutliersWithTrimmedMean(dds)
ddsClean <- DESeq(ddsClean)
tab <- table(initial = results(dds)$padj < 0.05,
             cleaned = results(ddsClean)$padj < 0.05)
addmargins(tab)
write.csv(as.data.frame(tab),file = paste0(outputPrefix, "-replaceoutliers.csv"))
resClean <- results(ddsClean)
resClean = subset(res, padj<0.05)
resClean <- resClean[order(resClean$padj),]
write.csv(as.data.frame(resClean),file = paste0(outputPrefix, "-replaceoutliers-results.csv"))

####################################################################################
# Exploratory data analysis of RNAseq data with DESeq2
#
# these next R scripts are for a variety of visualization, QC and other plots to
# get a sense of what the RNAseq data looks like based on DESEq2 analysis
#
# 1) MA plot
# 2) rlog stabilization and variance stabiliazation
# 3) variance stabilization plot
# 4) heatmap of clustering analysis
# 5) PCA plot
#
#
####################################################################################

# MA plot of RNAseq data for entire dataset
# http://en.wikipedia.org/wiki/MA_plot
# genes with padj < 0.1 are colored Red
plotMA(dds, ylim=c(-8,8),main = "RNAseq experiment")
dev.copy(png, paste0(outputPrefix, "-MAplot_initial_analysis.png"))
dev.off()

# transform raw counts into normalized values
# DESeq2 has two options:  1) rlog transformed and 2) variance stabilization
# variance stabilization is very good for heatmaps, etc.
rld <- rlogTransformation(dds, blind=T)
vsd <- varianceStabilizingTransformation(dds, blind=T)

# save normalized values
write.table(as.data.frame(assay(rld),file = paste0(outputPrefix, "-rlog-transformed-counts.txt"), sep = '\t'))
write.table(as.data.frame(assay(vsd),file = paste0(outputPrefix, "-vst-transformed-counts.txt"), sep = '\t'))

# plot to show effect of transformation
# axis is square root of variance over the mean for all samples
par(mai = ifelse(1:4 <= 2, par('mai'),0))
px <- counts(dds)[,1] / sizeFactors(dds)[1]
ord <- order(px)
ord <- ord[px[ord] < 150]
ord <- ord[seq(1,length(ord),length=50)]
last <- ord[length(ord)]
vstcol <- c('blue','black')
matplot(px[ord], cbind(assay(vsd)[,1], log2(px))[ord, ],type='l', lty = 1, col=vstcol, xlab = 'n', ylab = 'f(n)')
legend('bottomright',legend=c(expression('variance stabilizing transformation'), expression(log[2](n/s[1]))), fill=vstcol)
dev.copy(png,paste0(outputPrefix, "-variance_stabilizing.png"))
dev.off()

# clustering analysis
# excerpts from http://dwheelerau.com/2014/02/17/how-to-use-deseq2-to-analyse-rnaseq-data/
library("RColorBrewer")
library("gplots")
distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
rownames(mat) <- colnames(mat) <- with(colData(dds), paste(condition, sampleNames, sep=" : "))
#Or if you want conditions use:
#rownames(mat) <- colnames(mat) <- with(colData(dds),condition)
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
dev.copy(png, paste0(outputPrefix, "-clustering.png"))
heatmap.2(mat, trace = "none", col = rev(hmcol), margin = c(13,13))
dev.off()

#Principal components plot shows additional but rough clustering of samples
library("genefilter")
library("ggplot2")
library("grDevices")

rv <- rowVars(assay(rld))
select <- order(rv, decreasing=T)[seq_len(min(500,length(rv)))]
pc <- prcomp(t(assay(vsd)[select,]))

# set condition
condition <- treatments
scores <- data.frame(pc$x, condition)

(pcaplot <- ggplot(scores, aes(x = PC1, y = PC2, col = (factor(condition))))
+ geom_point(size = 5)
+ ggtitle("Principal Components")
+ scale_colour_brewer(name = " ", palette = "Set1")
+ theme(
  plot.title = element_text(face = 'bold'),
  legend.position = c(.9,.2),
  legend.key = element_rect(fill = 'NA'),
  legend.text = element_text(size = 10, face = "bold"),
  axis.text.y = element_text(colour = "Black"),
  axis.text.x = element_text(colour = "Black"),
  axis.title.x = element_text(face = "bold"),
  axis.title.y = element_text(face = 'bold'),
  panel.grid.major.x = element_blank(),
  panel.grid.major.y = element_blank(),
  panel.grid.minor.x = element_blank(),
  panel.grid.minor.y = element_blank(),
  panel.background = element_rect(color = 'black',fill = NA)
))

ggsave(pcaplot,file=paste0(outputPrefix, "-ggplot2.pdf"))

# scatter plot of rlog transformations between Sample conditions
# nice way to compare control and experimental samples
head(assay(rld))
# plot(log2(1+counts(dds,normalized=T)[,1:2]),col='black',pch=20,cex=0.3, main='Log2 transformed')
plot(assay(rld)[,1:3],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
plot(assay(rld)[,2:4],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
plot(assay(rld)[,6:5],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")

# heatmap of data
library("RColorBrewer")
library("gplots")
# 1000 top expressed genes with heatmap.2
select <- order(rowMeans(counts(ddsClean,normalized=T)),decreasing=T)[1:1000]
my_palette <- colorRampPalette(c("blue",'white','red'))(n=1000)
heatmap.2(assay(vsd)[select,], col=my_palette,
          scale="row", key=T, keysize=1, symkey=T,
          density.info="none", trace="none",
          cexCol=0.6, labRow=F,
          main="1000 Top Expressed Genes Heatmap")
dev.copy(png, paste0(outputPrefix, "-HEATMAP.png"))
dev.off()

The .csv output file that you get from this R code should look something like this:

Screen Shot 2015-07-13 at 1.11.41 PM

Below are some examples of the types of plots you can generate from RNAseq data using DESeq2:

Rplot-logfoldchangeGmax_DESeq2-Dispersion-plot

Rplot-heatmap1Rplot-pca

 

Merging Data and Using Biomart:

To continue with analysis, we can use the .csv files we generated from the DeSEQ2 analysis and find gene ontology. This next script contains the actual biomaRt calls, and uses the .csv files to search through the Phytozome database. If you are trying to search through other datsets, simply replace the “useMart()”  command with the dataset of your choice. Again, the biomaRt call is relatively simple, and this script is customizable in which values you want to use and retrieve.

Biomart Manual

The biomaRt user’s guide

# Install biomaRt package
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")

# Load biomaRt library
library("biomaRt")

# Convert final results .csv file into .txt file
results_csv <- "Gmax_DESeq2-replaceoutliers-results.csv"
write.table(read.csv(results_csv), gsub(".csv",".txt",results_csv))
results_txt <- "Gmax_DESeq2-replaceoutliers-results.txt"

# List available biomaRt databases
# Then select a database to use
listMarts("phytozome_mart", host="phytozome.jgi.doe.gov",path="/biomart/martservice")
ensembl=useMart("phytozome_mart", host="phytozome.jgi.doe.gov", path="/biomart/martservice")

# List available datasets in database
# Then select dataset to use
listDatasets(ensembl)
ensembl = useMart("phytozome_mart", host="phytozome.jgi.doe.gov",path="/biomart/martservice", dataset="phytozome")

# Check the database for entries that match the IDs of the differentially expressed genes from the results file
a <- read.table(results_txt, head=TRUE)

b <- getBM(attributes=c('transcript_id',
 'chr_name1',
 'gene_name1',
 'gene_chrom_start',
 'gene_chrom_end',
 'kog_desc'),
 filters=c('pac_transcript_id'),
 values=a$X,
 mart=ensembl)
biomart_results <- "Gmax_DESeq2-Biomart.txt"
write.table(b,file=biomart_results)

After fetching data from the Phytozome database based on the PAC transcript IDs of the genes in our samples, a .txt file is generated that should look something like this:

Screen Shot 2015-07-20 at 6.54.17 PM

Finally, we want to merge the deseq2 and biomart output.

# Merge the deseq2 and biomart output
m <- merge(a, b, by="X")
write.csv(as.data.frame(m),file = "Gmax_DEseq2-Biomart-results_merged.csv")

We get a merged .csv file with our original output from DESeq2 and the Biomart data:

Screen Shot 2015-07-20 at 7.05.21 PM

Visualizing Differential Expression with IGV:

To visualize how genes are differently expressed between treatments, we can use the Broad Institute’s Interactive Genomics Viewer (IGV), which can be downloaded from here: IGV

We will be using the .bam files we created previously, as well as the reference genome file in order to view the genes in IGV. IGV requires that .bam files be indexed before being loaded into IGV. There is a script file located in

/common/RNASeq_Workshop/Soybean/STAR_HTSEQ_mapping/bam_files called bam_index.sh that will accomplish this. The .bam files themselves as well as all of their corresponding index files (.bai) are located here as well. The reference genome file is located at 

/common/RNASeq_Workshop/Soybean/gmax_genome/Gmax_275_v2

You will need to download the .bam files, the .bai files, and the reference genome to your computer. Once you have IGV up and running, you can load the reference genome file by going to Genomes -> Load Genome From File… in the top menu. Now you can load each of your six .bam files onto IGV by going to File -> Load from File… in the top menu. Be sure that your .bam files are saved in the same folder as their corresponding index (.bai) files. Once you have everything loaded onto IGV, you should be able to zoom in and out and scroll around on the reference genome to see differentially expressed regions between our six samples.

Screen Shot 2015-07-13 at 2.04.59 PM

The differentially expressed gene shown is located on chromosome 10, starts at position 11,454,208, and codes for a transferrin receptor and related proteins containing the protease-associated (PA) domain. This information can be found on line 142 of our merged csv file. You can search this file for information on other differentially expressed genes that can be visualized in IGV!