EDGE-pro tutorial

This tutorial will serve as an introduction to analysis of prokaryote RNASeq data with an associated reference genome. The tools used for this analysis strictly consist of freeware and open-source software – any bioinformatician can perform the following analysis without any licenses!

Software download links:

SRA Toolkit
FastQC
Sickle
EDGE-pro
R
RStudio
DESeq2
Blast2Go
Cytoscape

RNASeq Workshop directory

/common/RNASeq_Workshop/Listeria

SRA Toolkit: Download raw reads in fastq format

The first step is to retrieve the biological sequence data from the Sequence Read Archive. The data that we will be using is from this experiment, which analyzes two strains of Listeria monocytogenes (10403S and DsigB) with two replicates per strain, resulting in a total of four raw read files. We will use the fastq-dump utility from the SRA toolkit to download the raw files by accession number into fastq format. This format is necessary because the software used to perform quality control, Sickle, requires fastq files as input.

/opt/bioinformatics/sratoolkit.2.3.5-2-centos_linux64/bin/fastq-dump.2.3.5.2 SRR034450
# repeat for SRR034451, SRR034452, SRR034453

The converted reads are located in the /common/RNASeq_Workshop/Listeria/Quality_Control/ directory. The script to convert the reads is located at: /common/RNASeq_Workshop/Listeria/Quality_Control/SRA_to_fastq.sh. Please note that this directory is write protected, meaning that the script will need to be run from a different directory. This is also the case for all scripts referenced throughout this tutorial – copy the scripts to a different directory before running them!

Since the fastqdump operation can be resource intensive, we will be submitting this script as a job to the cluster. Before doing so, replace the email in the line: #$ -M firstname.lastname@uconn.edu with your own email. Then submit the script with the qsub command:

qsub SRA-to-fastq.sh

For more information on using the BBC cluster and interacting with SGE please refer to this tutorial: http://bioinformatics.uconn.edu/understanding-the-bbc-cluster-and-sge/

FastQC

FastQC can be used to give an impression of the quality of the data before any further analysis such as quality control. We will run FastQC over the command line on just one of the .fastq files for demonstration purposes.

fastqc SRR034450.fastq

Once the program finishes running, the output will include an html formatted statistics file with the name SRR034450_fastqc.html. Copy this file to your desktop and open it with a web browser to view the contents, which will contain summary graphs and data such as the ‘Per base sequence quality graph’ below. 

Screen Shot 2015-07-23 at 11.21.22 AM

Sickle: Quality Control on raw reads

The next step is to perform quality control on the reads using sickle. Since our reads are all unpaired reads, we indicate this with the se option in the sickle command. The -f flag designates the input file, -o the output file, and if desired -q the minimum quality (sickle defaults to 20) and -l the minimum read length. The -t flag designates the type of read. Unfortunately, despite the reads being Illumina reads, the average quality did not meet sickle’s minimum for Illumina reads, hence the sanger option.

sickle se -f SRR034450.fastq -t sanger -o SRR034450_trimmed.fastq
# repeat for SRR034451.fastq, SRR034452.fastq, SRR034453.fastq

The trimmed quality control fastq files are located in the /common/RNASeq_Workshop/Listeria/EDGE-pro directory. The files were moved there after running sickle for organization purposes — EDGE-pro requires the trimmed files as input. The script to perform the quality control is located at /common/RNASeq_Workshop/Listeria/Quality_Control/sickle_QC.sh.

EDGE-pro: Gene expression

Before we get started with EDGE-pro, we need to retrieve the Listeria reference genome and its protein and rna tables. By searching the NCBI genome database, we learn that the EGD-e strain is the reference genome. We will use NCBI’s ftp website: ftp://ftp.ncbi.nih.gov/ to download the files. Since Listeria is a bacterial genome, navigate to the genome directory then bacteria directory. Note that there are multiple genomes for Listeria — navigate to the EGD-e assembly: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Listeria_monocytogenes_EGD_e_uid61583/

The reference genome is stored in the .fna file, protein table in the .ptt file, and rna table in the .rnt file. We will use the wget command line utility to download these files.

wget http://genome2d.molgenrug.nl/Bacteria/Listeria_monocytogenes_EGD_e_uid61583/NC_003210.fna
wget http://genome2d.molgenrug.nl/Bacteria/Listeria_monocytogenes_EGD_e_uid61583/NC_003210.ptt
wget http://genome2d.molgenrug.nl/Bacteria/Listeria_monocytogenes_EGD_e_uid61583/NC_003210.rnt

Now we can run EDGE-pro to generate gene expression levels. The -g option takes a reference genome, -p flag a protein table, -r flag an rnt table, -u flag a fastq file (the trimmed file from sickle). The -o flag will be the prefix before each of the EDGE-pro output files, and the -s flag is the directory where the EDGE-pro executables are located.

/opt/bioinformatics/EDGE_pro_v1.3.1/edge.pl -g NC_003210.fna -p NC_003210.ptt -r NC_003210.rnt -u SRR034450_trimmed.fastq -o SRR034450.out -s /opt/bioinformatics/EDGE_pro_v1.3.1
# repeat for SRR034451_trimmed.fastq, SRR034452_trimmed.fastq, SRR034453_trimmed.fastq

The output files are located in the /common/RNASeq_Workshop/Listeria/EDGE-pro directory. The script to run EDGE-pro is located at /common/RNASeq_Workshop/Listeria/EDGE-pro/EDGE-pro.sh

EDGE-pro to DESeq

The output we are interested in are the SRR03445X.out.rpkm_0 files. EDGE-pro comes with an accessory script to convert the rpkm files to a count table that DESeq2, the differential expression analysis R package, can take as input.

/opt/bioinformatics/EDGE_pro_v1.3.1/additionalScripts/edgeToDeseq.perl SRR034450.out.rpkm_0 SRR034451.out.rpkm_0 SRR034452.out.rpkm_0 SRR034453.out.rpkm_0

Please note that EDGE-pro may sometimes create a second row for a gene with different count data than the first row. For our analysis with DESeq we desire each of the rows to be labeled with a gene ID, and this means that the list of genes must be unique. A python script was written to remove the duplicate row with the smallest total count. Writing this script is beyond the scope of this tutorial, but if you are interested, the script with comments is located at /common/RNASeq_Workshop/Listeria/EDGE-pro_to_DESeq2/trim_epro2deseq.py. This python script is executable globally so that you do not need to copy it to your own directory.  We will run the script as:

python /common/RNASeq_Workshop/Listeria/EDGE-pro_to_DESeq2/trim_epro2deseq.py deseqFile

where deseqFile is the intermediate output from edgeToDeseq.perl. The final, trimmed output table (a tab delimited file) is located at /common/RNASeq_Workshop/Listeria/EDGE-pro_to_DESeq2/Listeria_deseqFile. The script to generate the file is located at /common/RNASeq_Workshop/Listeria/EDGE-pro_to_DESeq2/to_DESeq2.sh and contains the commands to generate both the deseqFile and final Listeria_deseqFile.

Analysis with DESeq

This step requires the R language and an IDE such as RStudio installed on a local machine. The R DESeq2 library also must be installed. To install this package, start the R console and enter:

source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")

If any dependencies fail, install them using the command: install.packages(PackageName, repos='http://cran.rstudio.com/')

Before running this script make sure to set the working directory and path to your Listeria_deseqFile (after copying it from the server to your local machine). The script was adapted slightly from Dave Wheeler’s comprehensive tutorial on analysis with DESeq2 located here. The only changes were a few bug fixes, adding an outputPrefix variable to allow easy modification of the output file names in the code for future use, and adding filtering by adjusted p value.

The most important information comes out as -replaceoutliers-results.csv. This file only contains the genes that have adjusted p values less than 0.05. These genes are the differentially expressed genes we are interested in. Depending on the experiment, this file can be adjusted to include p values less than 0.10 or a different value.

# Load DESeq2 library
library("DESeq2")

# Set the working directory
directory <- "~/Documents/R/DESeq2/"
setwd(directory)

# Set the prefix for each output file name
outputPrefix <- "Listeria_DESeq2"

# Location of deseq ready count table (EDGE-pro output)
deseqFile <- "~/Documents/R/DESeq2/Listeria_deseqFile"

# Read the table
countData <- read.table(deseqFile, header = T)

# Replace accession numbers with meaningful names
names(countData) <- c("10403S Rep1","DsigB Rep1","10403S Rep2","DsigB Rep2")

# Create table with treatment information
sampleNames <- colnames(countData)
sampleCondition <- c("10403S","DsigB","10403S","DsigB")
colData <- data.frame(condition = sampleCondition)
row.names(colData) = sampleNames
treatments = c("10403S","DsigB")

# Create DESeqDataSet: countData is the count table, colData is the table with treatment information
# One experimental condition
ddsFromMatrix <- DESeqDataSetFromMatrix(countData = countData,
 colData = colData,
 design = ~ condition)
colData(ddsFromMatrix)$condition <- factor(colData(ddsFromMatrix)$condition, levels = treatments)
dds <- DESeq(ddsFromMatrix)
res <- results(dds)

# filter results by p value
# order results by padj value (most significant to least)
res= subset(res, padj<0.05)
res <- res[order(res$padj),]
# should see DataFrame of baseMean, log2Foldchange, stat, pval, padj

# save data results and normalized reads to csv
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds,normalized =TRUE)), by = 'row.names', sort = FALSE)
names(resdata)[1] <- 'gene'
write.csv(resdata, file = paste0(outputPrefix, "-results-with-normalized.csv"))

# send normalized counts to tab delimited file for GSEA, etc.
write.table(as.data.frame(counts(dds),normalized=T), file = paste0(outputPrefix, "_normalized_counts.txt"), sep = '\t')

# produce DataFrame of results of statistical tests
mcols(res, use.names = T)
write.csv(as.data.frame(mcols(res, use.name = T)),file = paste0(outputPrefix, "-test-conditions.csv"))

# replacing outlier value with estimated value as predicted by distrubution using
# "trimmed mean" approach. recommended if you have several replicates per treatment
# DESeq2 will automatically do this if you have 7 or more replicates
ddsClean <- replaceOutliersWithTrimmedMean(dds)
ddsClean <- DESeq(ddsClean)
tab <- table(initial = results(dds)$padj < 0.05,
 cleaned = results(ddsClean)$padj < 0.05)
addmargins(tab)
write.csv(as.data.frame(tab),file = paste0(outputPrefix, "-replaceoutliers.csv"))
resClean <- results(ddsClean)

# filter results by p value
resClean = subset(res, padj<0.05)
resClean <- resClean[order(resClean$padj),]
write.csv(as.data.frame(resClean),file = paste0(outputPrefix, "-replaceoutliers-results.csv"))

Here’s how your results file should look:

Screen Shot 2015-07-11 at 3.57.53 PM

####################################################################################
# Exploritory data analysis of RNAseq data with DESeq2
#
# these next R scripts are for a variety of visualization, QC and other plots to
# get a sense of what the RNAseq data looks like based on DESEq2 analysis
#
# 1) MA plot
# 2) rlog stabilization and variance stabiliazation
# 3) variance stabilization plot
# 4) heatmap of clustering analysis
# 5) PCA plot
#
#
####################################################################################
# MA plot of RNAseq data for entire dataset
# http://en.wikipedia.org/wiki/MA_plot
# genes with padj < 0.1 are colored Red
plotMA(dds, ylim=c(-8,8),main = "RNAseq experiment")
dev.copy(png, paste0(outputPrefix, "-MAplot_initial_analysis.png"))
dev.off()

# transform raw counts into normalized values
# DESeq2 has two options: 1) rlog transformed and 2) variance stabilization
# variance stabilization is very good for heatmaps, etc.
rld <- rlogTransformation(dds, blind=T)
vsd <- varianceStabilizingTransformation(dds, blind=T)

# save normalized values
write.csv(as.data.frame(assay(rld)),file = paste0(outputPrefix, "-rlog-transformed-counts.txt"))
write.csv(as.data.frame(assay(vsd)),file = paste0(outputPrefix, "-vst-transformed-counts.txt"))
 
# plot to show effect of transformation
# axis is square root of variance over the mean for all samples
par(mai = ifelse(1:4 <= 2, par('mai'),0))
px <- counts(dds)[,1] / sizeFactors(dds)[1]
ord <- order(px)
ord <- ord[px[ord] < 150]
ord <- ord[seq(1,length(ord),length=50)]
last <- ord[length(ord)]
vstcol <- c('blue','black')
matplot(px[ord], cbind(assay(vsd)[,1], log2(px))[ord, ],type='l', lty = 1, col=vstcol, xlab = 'n', ylab = 'f(n)')
legend('bottomright',legend=c(expression('variance stabilizing transformation'), expression(log[2](n/s[1]))), fill=vstcol)
dev.copy(png,paste0(outputPrefix, "-variance_stabilizing.png"))
dev.off()

#Principal components plot shows clustering of samples
library("genefilter")
library("ggplot2")
library("grDevices")

rv <- rowVars(assay(rld))
select <- order(rv, decreasing=T)[seq_len(min(500,length(rv)))]
pc <- prcomp(t(assay(vsd)[select,]))

# set condition
condition <- treatments
scores <- data.frame(pc$x, condition)

(pcaplot <- ggplot(scores, aes(x = PC1, y = PC2, col = (factor(condition)))) 
+ geom_point(size = 5)
+ ggtitle("Principal Components")
+ scale_colour_brewer(name = " ", palette = "Set1")
+ theme(
 plot.title = element_text(face = 'bold'),
 legend.position = c(.85,.87),
 legend.key = element_rect(fill = 'NA'),
 legend.text = element_text(size = 10, face = "bold"),
 axis.text.y = element_text(colour = "Black"),
 axis.text.x = element_text(colour = "Black"),
 axis.title.x = element_text(face = "bold"),
 axis.title.y = element_text(face = 'bold'),
 panel.grid.major.x = element_blank(),
 panel.grid.major.y = element_blank(),
 panel.grid.minor.x = element_blank(),
 panel.grid.minor.y = element_blank(),
 panel.background = element_rect(color = 'black',fill = NA)
))

ggsave(pcaplot,file=paste0(outputPrefix, "-ggplot2.png"))

# scatter plot of rlog transformations between Sample conditions
# nice way to compare control and experimental samples
# uncomment plots depending on size of array
# head(assay(rld))
plot(log2(1+counts(dds,normalized=T)[,1:2]),col='black',pch=20,cex=0.3, main='Log2 transformed')
plot(assay(rld)[,1:2],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")
plot(assay(rld)[,3:4],col='#00000020',pch=20,cex=0.3, main = "rlog transformed")

# heatmap of data
library("RColorBrewer")
library("gplots")
# 30 top expressed genes with heatmap.2
select <- order(rowMeans(counts(ddsClean,normalized=T)),decreasing=T)[1:30]
my_palette <- colorRampPalette(c("blue",'white','red'))(n=30)
heatmap.2(assay(vsd)[select,], col=my_palette,
 scale="row", key=T, keysize=1, symkey=T,
 density.info="none", trace="none",
 cexCol=0.6, labRow=F,
 main="TITLE")
dev.copy(png, paste0(outputPrefix, "-HEATMAP.png"))
dev.off()

Selected tutorial results:

Listeria_DESeq2-MAplot_initial_analysis Listeria_DESeq2-ggplot2Listeria_DESeq2-clusteringListeria_DESeq2-HEATMAPListeria_DESeq2-Dispersion-plot

The R script and program output is available in the /common/RNASeq_Workshop/Listeria/R_DESeq2_output directory.

Blast2Go Analysis

Now that we have found the differentially expressed genes in this experiment, it’s time to perform analysis with BLAST on the genes. To be able to use the Blast2Go, we need a fasta file of gene sequences. To accomplish this efficiently we’ll be using the BioPython python module which interfaces with the NCBI Entrez search. Python and BioPython are installed on the server so you don’t have to worry about installing them locally.

This python script takes the filtered -replaceoutliers-results.csv file and parses the gene ids into a list, then queries the NCBI protein database for those gene ids, and if a sequence exists, downloads the sequence and writes it to “seqs_to_blast.fasta”. Note that EDGE-pro finds gene expression of both amino acids and rna genes — for this example, we will only be concerned with proteins. The script will print an error message if a gene is not found in the database.

'''
download DE expressed genes with biopython interface to entrez
usage: python listeria_entrez.py someone@somewhere.someaddress -replaceoutliers-results.csv out.fasta
'''

from Bio import Entrez, SeqIO
import sys

# command line parameters
email = sys.argv[1]
csv = sys.argv[2]
out = sys.argv[3]

# initialize variables
Entrez.email = email
gene_ids = []
database = 'protein'
gene_sequences = []

# extract gene ids from table
for line in open(csv):
    csv_row = line.replace('\"', '').split(',')[0]
    if csv_row.strip() != '':
        gene_ids.append(csv_row)

for gene in gene_ids:
    # use the esearch utility to query the ncbi protein database with gene id
    # return the first sequence
    handle = Entrez.esearch(db = database, retmax = 1, term = gene)
    record = Entrez.read(handle)
    handle.close()
    # try to read search results, throw exception if no results
    try:
        # use the efetch utility to get a record of the esearch results
        fetch_handle = Entrez.efetch(db = database, id = record['IdList'][0], rettype = 'fasta', retmode='text')
        fetch_record = SeqIO.read(fetch_handle, 'fasta')
        fetch_handle.close()
        # add sequence to list
        gene_sequences.append(fetch_record)
        print('Gene: ' + gene + ' was successfully downloaded from the NCBI protein database')
    except IndexError:
        print('Gene: ' + gene + ' was not found in the NCBI protein database')

# open out file and write sequences to file
output_handle = open(out, 'w')
SeqIO.write(gene_sequences, output_handle, 'fasta')
output_handle.close()

Before running the script, copy your replaceoutliers-results.csv file to the server and note its path. You can then run the code as python listeria_entrez.py <email> <path-to-csv> <outputfile> over the command line. The script and output are located in the /common/RNASeq_Workshop/Listeria/download_DE_Genes/ directory.

Copy seqs_to_blast.fasta over to your local machine and open Blast2Go. Navigate to the File menu then –> Load –> Load Sequences –> Create a new project and select your fasta file. Blast2Go should create a nice table from the file.

Screen Shot 2015-07-11 at 3.15.03 PM

We are going to run 3 analyses on the sequences: blastp, mapping, and interproscan (only on the PFam database). Since blast takes several hours to run, it’s advisable to run it first.

To run blasting,  select only the first 10 sequences with the checkbox. Click the orange blast button and a window will pop-up. Make sure that the blast program is blastp and the blast DB is Bacteria. Other parameters are shown in figures below. Hit next to proceed after parameters are all set as shown in figure above.

In Advanced Configuration, make sure word size is set to 3 before you hit next

In the interProScan Configuration, enter your email address and note that only HMMPfam is checked. Click next if parameters are all set, then hit run and let Blast2Go do its thing!

After blast finishes, the next longest step is mapping. Again, select only the first 10 sequences. Click the green mapping button, then hit run.

Finally, we’re going to run interproscan. Click the purple interpro button, select ONLY the HMMPFam database, then hit run. This step should take only a minute. Now let’s export the final sequence table, which should look something like the screenshot below. Orange indicates that blast finished successfully, and green indicates mapping was finished successfully. Sequences 4 and 5 were blasted but we can conclude no mapping was available.

Screen Shot 2015-07-11 at 3.20.24 PM

 

Navigate to the File menu then –> Export –> Export Sequence Table and choose a location! The Blast2Go project file, blast XML results, and sequence table are all available in /common/RNASeq_Workshop/Listeria/Blast2Go_out

Cytoscape

Unfortunately, there are no free networks available for download for Listeria genes. Therefore this tutorial will use the data available in the /Cytoscape_v3.2.1/sampleData directory.

  • Loading a Simple Network

Navigate to the File menu then –> Import –> Network –> File –> sampleData/galFiltered.sif and load the network. There’s a lot you can play around with, refer to this tutorial for more information!

2015-07-12 23_38_05-Session_ New Session

  • Importing your data

Navigate to the File menu then –> Import –> Table –> File –> sampleData/galExpData.csv. Ensure that the “Import Data as: Node Table Columns” option is selected. Columns can be selected/deselected for import by clicking on them. Data will be loaded automatically based on matching records — in this case, when the gene column matches the currently loaded network. To finish importing the data, hit OK. Now the network has gene expression data!

2015-07-12 23_56_56-Session_ New Session

Normally, before loading expression data a network would be imported from public databases. To search the public databases from within Cytoscape, navigate to the File menu then –> Import –> Network –> Public Databases.