RNA-seq for Non-model Species (Red Spruce)

This tutorial will serve as a guideline for RNA sequence analysis of a non-model organism. We will be using a needle transcriptome from red spruce as our example.  The RNA-Seq data  consists of Illumina short reads as well as 454 reads.

We will be going through:

  • Quality control of the libraries
  • De novo assembly of the reads
  • Clustering of the assembled transcripts from each assembly
  • Mapping the reads back to assembled/clustered transcriptome
  • Generating raw counts
  • GFOLD to analyze differential gene expression between two treatments (elevated and ambient CO2).

The packages we’ll be using can be found here:

The data consists of three libraries under three different treatments;

  1. ambient CO
  2. elevated CO2
  3. one that had been cotreated for both conditions

De Novo non model species directory


Each step in this work flow is composed into a directory

|-- Reads
|-- QualityControl
|-- Assembly
|-- Clustered
|-- CodingRegions
|-- Index
|-- Align
|-- Count
|-- gfold
`-- blast2go


Raw Reads 

Two types of sequencing methods were used to collect the RNA-Seq data, namely Illumina and 454. Where Illumina reads are in the form of fastq, where as 454 reads are in the form of fasta and quality score (qual) files.

|-- 454/
|   |-- control_454.fasta
|   |-- control_454.qual
|   |-- cotreated_454.fasta
|   |-- cotreated_454.qual
|   |-- elevated_454.fasta
|   `-- elevated_454.qual
`-- Illumina/
    |-- ambient.fastq
    |-- cotreated.fastq
    `-- elevated.fastq


Quality control 

Quality control of Illumina reads using Sickle:

Step one is to perform quality control on the reads, and we will be using Sickle for the Illumina reads. Using the Sickle program we will trim our unpaired raw reads, to keep a minimum quality score value of 30 and a minimum length of 35.

sickle se -f ../../Reads/Illumina/ambient.fastq -t illumina -o ambient.trimmed.fastq -q 30 -l 35
sickle se -f ../../Reads/Illumina/elevated.fastq -t illumina -o elevated.trimmed.fastq -q 30 -l 35
sickle se -f ../../Reads/Illumina/cotreated.fastq -t illumina -o cotreated.trimmed.fastq -q 30 -l 35

The trimmed output files are what we will be using for the next steps of our analysis.

se   unpaired reads / single-end reads
-f   input file 
-o   output file
-q   minimum quality score
-l   minimum read length
-t   Type of quality score value, e.g. (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)) (required)

The full script for trimming of raw reads using sickle can be found at: /common/RNASeq_Workshop/RedSpruce/QualityControl/Illumina/sickle.sh
Once the run is completed the directory will contain the following files that has been trimmed, which will be used in the further analysis:

     |-- ambient.trimmed.fastq
     |-- cotreated.trimmed.fastq
     |-- elevated.trimmed.fastq
     `-- sickle_267830.out


Quality control of 454 Reads using SolexaQA:

Prior to running the quality control on the 454 reads we need to create a fastq file using our raw reads. We do have the sequence  file in ‘fasta’ format and the quality score in the ‘qual’ format file. Using the python scripts (454_to_fastq.py), we will combine the sequence data and the quality score into a single file which will be a ‘fastq’ file. The newly created fastq files will be placed in the Reads directory, with the fasta and the qual files.

python 454_to_fastq_2.py ../../Reads/454/elevated_454.fasta ../../Reads/454/elevated_454.qual
python 454_to_fastq_v2.py ../../Reads/454/control_454.fasta ../../Reads/454/control_454.qual
python 454_to_fastq_v2.py ../../Reads/454/cotreated_454.fasta ../../Reads/454/cotreated_454.qual


The python script, will convert the 454 raw reads to Illumina reads (known as Solexa in the past). Where it will create fastq files (elevated_454.fastq, control_454.fastq, cotreated_454.fastq) and the new files will be written in to the /RedSpruce/Reads/454  directory folder.  We have prepared a script that can be submitted using the qsub command, and it can be found at /common/RNASeq_Workshop/RedSpruce/QualityControl/454/make454_fastq.sh
Once it executes, it will create the following files in the /Reads/454 directory with the rest of the raw files:

|-- 454/
    |-- elevated_454.qual
    |-- elevated_454.fastq
    |-- elevated_454.fasta
    |-- cotreated_454.qual
    |-- cotreated_454.fastq
    |-- cotreated_454.fasta
    |-- control_454.qual
    |-- control_454.fastq
    `-- control_454.fasta


Next we use SolexaQA to perform quality control of the 454 reads. Initially we will be using the dynamictrim to trim the reads using a chosen threshold for the quality (if not defined the default value for p = 0.05). Before running the command, we will create three directories, which the trimmed reads will be copied into using the following;

#Will create the following directories in the current folder
DIR_LIST="elevated control cotreated"

for dir in $DIR_LIST; do
        if [ ! -d "$dir" ]; then
                mkdir -p $dir

SolexaQA dynamictrim ../../Reads/454/elevated_454.fastq --torrent -d elevated
SolexaQA dynamictrim ../../Reads/454/control_454.fastq --torrent -d control
SolexaQA dynamictrim ../../Reads/454/cotreated_454.fastq --torrent -d cotreated
Usage: SolexaQA dynamictrim  [Options]
Options Explained:
-p|--probcutoff     probability value (between 0 and 1) at which base-calling error is considered too high (default; p = 0.05)
-d|--directory      path to directory where output files are saved
-t|--torrent        Ion Torrent fastq file

Three output files are created per input FASTQ file. These have the name of the original FASTQ file with the following suffixes appended:

*.trimmed — a FASTQ file containing the trimmed reads.
*.segments — This tab delimited text file displays the proportion of all reads that contain a segment of a given length.
*.segments.hist.pdf — a histogram showing the distribution of the longest contiguous segment of each read in which the quality of each base is greater than the user-defined quality cutoff (defaults to P = 0.05).


Next lengthsort will be use to sort the reads based on the length. Here we use length cutoff value as 50 bp.

SolexaQA lengthsort elevated/elevated_454.fastq.trimmed -l 50
SolexaQA lengthsort elevated/elevated_454.fastq.trimmed -l 50
SolexaQA lengthsort cotreated/cotreated_454.fastq.trimmed -l 50
Usage: SolexaQA lengthsort  [options]

Options explained:
-l|--length       length cutoff [defaults to 25 nucleotides]


The lengthsort program will generate *.single *.discarded *.summary.txt *.summary.txt.pdf files, where the *.single file will contain the reads which passed the length cut off value we have set. If you have paired-end reads it will be sorted in to two paired read files. (more information can be found in the SolexaQA website)

Once both SolexaQA commands have been executed, it will produce the following files in the corresponding directories:

      |-- control/
      |   |-- control_454.fastq.trimmed
      |   |-- control_454.fastq.trimmed.discard
      |   |-- control_454.fastq_trimmed.segments
      |   |-- control_454.fastq_trimmed.segments_hist.pdf
      |   |-- control_454.fastq.trimmed.single
      |   |-- control_454.fastq.trimmed.summary.txt
      |   `-- control_454.fastq.trimmed.summary.txt.pdf
      |-- cotreated/
      |   |-- cotreated_454.fastq.trimmed
      |   |-- cotreated_454.fastq.trimmed.discard
      |   |-- cotreated_454.fastq_trimmed.segments
      |   |-- cotreated_454.fastq_trimmed.segments_hist.pdf
      |   |-- cotreated_454.fastq.trimmed.single
      |   |-- cotreated_454.fastq.trimmed.summary.txt
      |   `-- cotreated_454.fastq.trimmed.summary.txt.pdf
      `-- elevated/
          |-- elevated_454.fastq.trimmed
          |-- elevated_454.fastq.trimmed.discard
          |-- elevated_454.fastq_trimmed.segments
          |-- elevated_454.fastq_trimmed.segments_hist.pdf
          |-- elevated_454.fastq.trimmed.single
          |-- elevated_454.fastq.trimmed.summary.txt
          `-- elevated_454.fastq.trimmed.summary.txt.pdf


The resulting *.single files will be used for the next step.
We are using MIRA for the next step with the 454 trimmed reads, and the program expect the input files to have a proper extension where it can identify the data type, for that we are going to rename the *.single file to *.single.fastq files. We have included the copying step into the script, SolexaQA.sh and it is located at /common/RNASeq_Workshop/RedSpruce/QualityControl/454. Where;

for dir in $DIR_LIST; do
        for f in $dir/*.single; do
                cp $f "$dir/$(basename "$f").fastq"

Once the extension has been corrected, we will be using the *.single.fastq files for the next step.

De novo assembly 

In De novo assembly section, we will be woking in the /Assembly directory. Where Illumina quality controlled reads will be assembled using Trinity and the 454 quality controlled reads will be assembled using the MIRA. For each assembly there will be separate directories called /Trinity and /MIRA.
De novo assembly of Illumina reads using Trinity:
Now we assemble all three quality controlled Illumina libraries with Trinity which is a de novo transcriptome assembler. This is most appropriate to use on a read set where no reference genome exists or where the reference genome is not high quality. Assembly requires a great deal of memory (RAM) and can take a few days to run if the read set is large. The BBC cluster has a high memory node you can connect to access this. In this example, we are assembling short reads using the highmem.q which has a 514 GB RAM. The script trinity.sh can be found at /common/RNASeq_Workshop/RedSpruce/Assembly/Trinity.

module load trinity/2.4.0

Trinity --seqType fq \
        --max_memory 256G \
        --single ../../QualityControl/Illumina/ambient.trimmed.fastq,../../QualityControl/Illumina/cotreated.trimmed.fastq,../../QualityControl/Illumina/elevated.trimmed.fastq \
        --min_contig_length 300 \
        --CPU 16 \

If you do not specify a output directory, the out put files will be directed to the trinity default folder called trinity_out_dir/.

        |-- chrysalis/
        |-- inchworm.K25.L25.DS.fa
        |-- inchworm.K25.L25.DS.fa.finished
        |-- inchworm.kmer_count
        |-- insilico_read_normalization/
        |-- jellyfish.kmers.fa
        |-- jellyfish.kmers.fa.histo
        |-- output_tutorial.txt
        |-- partitioned_reads.files.list
        |-- partitioned_reads.files.list.ok
        |-- pipeliner.35650.cmds
        |-- read_partitions/
        |-- recursive_trinity.cmds
        |-- recursive_trinity.cmds.completed
        |-- recursive_trinity.cmds.ok
        |-- single.fa
        |-- single.fa.ok
        |-- single.fa.read_count
        `-- Trinity.fasta


There will be number of files created and the assembled file will be called Trinity.fasta. We will be using this for our next step.

To check the quality of your Trinity assembly we can use a utility program provided with the trinity software called TrinityStats.pl. To look at the contig summery we use a script called “trinity_stats.sh” located at, /common/RNASeq_Workshop/RedSpruce/Assembly/Trinity in the cluster.

perl $TRINITY_HOME/util/TrinityStats.pl trinity_out_dir/Trinity.fasta > TrinitySummary.txt

If you submit the using a script, the result will be piped in to the output file call “TrinitySummary.txt” and it will look like:

## Counts of transcripts, etc.
Total trinity 'genes':	63856
Total trinity transcripts:	114946
Percent GC: 42.14

Stats based on ALL transcript contigs:

	Contig N10: 3636
	Contig N20: 2760
	Contig N30: 2230
	Contig N40: 1835
	Contig N50: 1492

	Median contig length: 662
	Average contig: 1022.21
	Total assembled bases: 117498506

## Stats based on ONLY LONGEST ISOFORM per 'GENE':

	Contig N10: 3583
	Contig N20: 2710
	Contig N30: 2170
	Contig N40: 1761
	Contig N50: 1389

	Median contig length: 582
	Average contig: 940.82
	Total assembled bases: 60076924


De novo assembly of 454 reads using MIRA:

Trinity is specialized for short read data so different assemblers will be used for the Roche 454. In this case, we will use MIRA [MIRA documentation] which also requires substantial memory and potentially time to complete the assemly. All three quality controlled 454 libraries will be assembled togehter.

In order to run mira, one must first construct a configuration file or a so called manifest file (Mira_454.conf), which is given bellow for this exercise. The configuration file contains two parts; The First part contains the general information, and the Second part contains information about the sequencing data.
The script can be found in the following path : /common/RNASeq_Workshop/RedSpruce/Assembly/Mira

# First part: defining some basic things
# We just give a name to the assembly : Mira_454
# and tell MIRA it should assemble a EST/RNASeq de-novo in accurate mode
# As special parameters, we want to use 8 threads (not : number of threads) in parallel
project = Mira_454
job = est,denovo,accurate
parameters = -GE:not=8

#The second part defines the sequencing data MIRA should load and assemble,
#and the data is logically divided into readgroups
readgroup = Control_454
data = /Common/RNASeq_Workshop/RedSpruce/QualityControl/454/control/control_454.fastq.trimmed.single.fastq
technology = 454

readgroup = Elevated_454
data = /Common/RNASeq_Workshop/RedSpruce/QualityControl/454/elevated/elevated_454.fastq.trimmed.single.fastq
technology = 454

readgroup = Cotreated_454
data = /Common/RNASeq_Workshop/RedSpruce/QualityControl/454/cotreated/cotreated_454.fastq.trimmed.single.fastq
technology = 454

For the MIRA run, the files need to be loaded into a local disk space in the cluster. Depending on the queue you are using in BBC, you can assign two different locations for the data to be loaded into.

all.q             /scratch
highmem.q         /scratch OR /scratch2

For the purpose of this run we will define a temporary space, called WORKING_DIR pointing to local location in the cluster. So we will load the data to this space, and once the run is completed we are going to copy the data back to our destination directory, which will be defined by the DESTINATION_DIR.

Following script mira_454.sh will be located at: /common/RNASeq_Workshop/RedSpruce/Assembly/Mira

module load mira/4.9.5


if [ ! -d "$WORKING_DIR" ]; then
        mkdir -p $WORKING_DIR




mira Mira_454.conf > Mira_454_log.txt


MIRA assembly run will create a directory called Mira_454_assembly and will contain unto four sub-directories.

Mira_454_d_results: This will have the output files of the assembly in different formats
Mira_454_d_info: This will contain the information and statics files of the final assembly
Mira_454_d_tmp: Contains tmp files and temporary assembly files
Mira_454_d_chkpt: This contain the checkpoint files needed to resume the job if it crashed or stopped


The Mira_454_d_info directory will contain the assembled files in fasta format.

   |-- Mira/
       |-- Mira_454.conf
       |-- mira_454.sh
       |-- Mira_454_assembly/
       |    |-- Mira_454_d_chkpt/
       |    |-- Mira_454_d_info/
       |    |-- Mira_454_d_results/
       |    |   |-- Mira_454_out.caf
       |    |   |-- Mira_454_out.maf
       |    |   |-- Mira_454_out.padded.fasta
       |    |   |-- Mira_454_out.padded.fasta.qual
       |    |   |-- Mira_454_out.tcs
       |    |   |-- Mira_454_out.txt
       |    |   |-- Mira_454_out.unpadded.fasta
       |    |    `-- Mira_454_out.unpadded.fasta.qual
       |     `-- Mira_454_d_tmp/
       `-- Mira_454_log.txt

MIRA will write two kind to fasta files: padded and unpadded. The padded format will contain the gap information in the form of asterisks and in the unpadded version these gaps will be removed. For our next step we will be using the unpadded version (Mira_454_out.unpadded.fasta).



Clustering of Illumina and 454 reads using vsearch:

To obtain a set of unique genes from both runs, we will cluster the two resulting assemblies together. First, the two assembies will be combined into one file using the Unix command cat, which refers to concatanate.

cat ../Assembly/Trinity/trinity_out_dir/Trinity.fasta ../Assembly/Mira/Mira_454_assembly/Mira_454_d_results/Mira_454_out.unpadded.fasta > combine.fasta 

Once the files are combined, we will use vsearch to find redundancy between the assembled transcripts and create a single output known as a centroids file. The threshold for clustering in this example is set to 80% identity. The following script combine.sh will be located at /common/RNASeq_Workshop/RedSpruce/Clustered

vsearch --threads 8 --log LOGFile \
	--cluster_fast combine.fasta \
	--id 0.80 --centroids centroids.fasta --uc clusters.uc


Usage: vsearch [OPTIONS]
--threads INT               number of threads to use, zero for all cores (0)
--log FILENAME              write messages, timing and memory info to file
--cluster_fast FILENAME     cluster sequences after sorting by length
--id REAL                   reject if identity lower, accepted values: 0-1.0
--centroids FILENAME        output centroid sequences to FASTA file
--uc FILENAME               specify filename for UCLUST-like output


After the run, it will contain the following files; where the centroids.fasta file will contain the unique genes from the 454 and Illumina assemblies.

|-- centroids.fasta
|-- clusters.uc
|-- combine.fasta
|-- combine.sh
`-- LOGFile


For the next step we will be using the centroids.fasta file.


Identifying the Coding Regions 

Identifying coding regions using TransDecoder:

Now that we have our reads assembled and clustered together into the single centroids file, we can use TransDecoder to determine optimal open reading frames from the assembly (ORFs). Assembled RNA-Seq transcripts may have 5′ or 3′ UTR sequence attached and this can make it difficult to determine the CDS in non-model spieces.
First we will use the TransDecoder utility to run on the centroid fasta file, as follows:

TransDecoder.LongOrfs -t ../Clustered/centroids.fasta


Usage: Transdecoder.LongOrfs [Options]

Options explained:
-t                             transcripts.fasta

By default it will identify ORFs that are at least 100 amino acids long. (you can change this by using -m parameter). It will produce a folder called centroids.fasta.transdecoder_dir, and will include all the temporary files in that directory.

|-- centroids.fasta.transdecoder_dir/
    |-- longest_orfs.pep
    |-- longest_orfs.gff3
    |-- longest_orfs.cds
    |-- base_freqs.dat.ok
    `-- base_freqs.dat


Next step is to, identify ORFs with homology to known proteins via blast or pfam searches. This will maximize the sensitivity for capturing the ORFs that have functional significance. It done using the following code:

hmmscan --cpu 8 --domtblout pfam.domtblout /common/Pfam/Pfam-B.hmm centroids.fasta.transdecoder_dir/longest_orfs.pep


It will create a file called pfam.domtblout in the main directory.
The final step is to predict the likely coding regions using the Pfam (or Blast) search results, using the following:

TransDecoder.Predict -t ../Clustered/centroids.fasta --retain_pfam_hits pfam.domtblout --cpu 8


The full script (Transdecoder.sh) for identifying the coding regions using the TransDecoder is located at: /common/RNASeq_Workshop/RedSpruce/CodingRegions/Transdecoder.sh

After the run is completed it will create the following files:

|-- centroids.fasta.transdecoder_dir/
|   |-- longest_orfs.pep
|   |-- longest_orfs.gff3.inx
|   |-- longest_orfs.gff3
|   |-- longest_orfs.cds.top_longest_5000.nr80.clstr
|   |-- longest_orfs.cds.top_longest_5000.nr80
|   |-- longest_orfs.cds.top_longest_5000
|   |-- longest_orfs.cds.top_500_longest.ok
|   |-- longest_orfs.cds.top_500_longest
|   |-- longest_orfs.cds.scores.selected
|   |-- longest_orfs.cds.scores.ok
|   |-- longest_orfs.cds.scores
|   |-- longest_orfs.cds.eclipsed_removed.gff3
|   |-- longest_orfs.cds.best_candidates.gff3
|   |-- longest_orfs.cds
|   |-- hexamer.scores.ok
|   |-- hexamer.scores
|   |-- base_freqs.dat.ok
|   `-- base_freqs.dat
|-- pfam.domtblout
|-- centroids.fasta.transdecoder.pep
|-- centroids.fasta.transdecoder.gff3
|-- centroids.fasta.transdecoder.cds
`-- centroids.fasta.transdecoder.bed

In the next step, for creating an index, we will be using the centroids.fasta.transdecoder.cds file.

Creating A Index 

Indexing the reads using bowtie2:

Before aligning the reads we need to create a index for the created transcoder assembly. We will be using Bowtie2 to build the index using the following command:

bowtie2-build ../CodingRegions/centroids.fasta.transdecoder.cds centroids_transdecoder_Index


Usage: bowtie2-build [options] [reference_in] [bt2_index_base]  
main arguments
reference_in            comma-separated list of files with ref sequences
bt2_index_base          base name for the index files to write

--threads          # of threads, By default bowtie2-build is using only one thread. Increasing the number of threads will speed up the index building considerably in most cases.

The script (bowtie2.sh) for creating the index can be found at /common/RNASeq_Workshop/RedSpruce/Index/bowtie2.sh

|-- centroids_transdecoder_Index.1.bt2
|-- centroids_transdecoder_Index.2.bt2
|-- centroids_transdecoder_Index.3.bt2
|-- centroids_transdecoder_Index.4.bt2
|-- centroids_transdecoder_Index.rev.1.bt2
`-- centroids_transdecoder_Index.rev.2.bt2


Aligning the Reads to the Index 

Aligning the reads using bowtie2:

Once the index has been created, we will use bowtie2 to align our three trimmed reads, using the following command:

bowtie2 --threads 8 -x ../Index/centroids_transdecoder_Index ../QualityControl/Illumina/elevated.trimmed.fastq -S elevated.sam
bowtie2 --threads 8 -x ../Index/centroids_transdecoder_Index ../QualityControl/Illumina/ambient.trimmed.fastq -S ambient.sam
bowtie2 --threads 8 -x ../Index/centroids_transdecoder_Index ../QualityControl/Illumina/cotreated.trimmed.fastq -S cotreated.sam


Usage: bowtie2 [Options] -x [bt2_base_index] {reads / options}
Main arguments:
-x [bt2_base_index]     The basename of the index for the reference genome (i.e. with out the *.bt2 extension)
-S                      Output file name of the SAM alignment

--threads               Number of processors 

The script bowtie2_Align.sh used to align using the bowtie2 can be found at /common/RNASeq_Workshop/RedSpruce/Align/. Once aligned it will produce three SAM files.

|-- ambient.sam
|-- cotreated.sam
`-- elevated.sam


Converting SAM to BAM format:
SAM format file is a large human readable file. In a computational point of view, accessing this data for processing will take time. For faster processing this need to be converted into computer friendly binary format called the BAM files. We use samtools to convert the SAM file in to BAM files. Using the view command the it will be converted into BAM format and will be sorted using the sort command as shown bellow:

samtools view -uhS elevated.sam | samtools sort - elevated_sorted
samtools view -uhS ambient.sam | samtools sort - ambient_sorted
samtools view -uhS cotreated.sam | samtools sort - cotreated_sorted


Usage: samtools [command] [options] in.sam
view     prints all alignments in the specified input alignment file (in SAM, BAM, or CRAM format) to standard output in SAM format 

-h      Include the header in the output
-S      Indicate the input was in SAM format
-u      Output uncompressed BAM. This option saves time spent on compression/decompression and is thus preferred when the output is piped to another samtools command

Usage: samtools [command] [-o out.bam]
sort    Sort alignments by leftmost coordinates

-o      Write the final sorted output to FILE, rather than to standard output.


The script (sam2bam.sh) can be found at /common/RNASeq_Workshop/RedSpruce/Align. Once the conversion is done it will produce the following files which will be binary format.

|-- elevated_sorted.bam
|-- cotreated_sorted.bam
`-- ambient_sorted.bam


Aligned Read Counts 

Extracting Raw Read Counts with eXpress:

Once it has been aligned, we will use the eXpress software, to get the counts of differential expression.

express ../CodingRegions2/centroids.fasta.transdecoder.cds ../Align/elevated_sorted.bam -o elevated_express
express ../CodingRegions2/centroids.fasta.transdecoder.cds ../Align/ambient_sorted.bam -o ambient_express
express ../CodingRegions2/centroids.fasta.transdecoder.cds ../Align/cotreated_sorted.bam -o cotreated_express


Usage:  express [options] [target_seqs.fa] [hits.(sam/bam)]

Required arguments:
 target_seqs.fa     target sequence file in fasta format
 hits.(sam/bam)     read alignment file in SAM or BAM format

-o [ --output-dir ] arg (=.)        write all output files to this directory


The script (express.sh) can be found at /common/RNASeq_Workshop/RedSpruce/Count/express.sh directory. The program will run and will create a directory for each run and will include the output files inside the directories.

|-- elevated_express\
|   |-- results.xprs
|   `-- params.xprs
|-- cotreated_express\
|   |-- results.xprs
|   `-- params.xprs
`-- ambient_express\
    |-- results.xprs
    `-- params.xprs



Converting counts to Gfold format:

Before analyzing for the differential expression using the gfold package, we need to convert our counts to Gfold count format. The program expects, 5 data columns, which are “Gene Symbol”, “Gene Name”, “Read Count”, “Gene exon length”, “FPKM”. To do this we will read the results.xprs file and grab the columns which have the required fields, and will rewrite them to a new file, where it will be used as the input file for the Gfold program.


express_file_1 <- data.frame(read.table("../Count/ambient_express/results.xprs",sep="\t", header = TRUE))
express_file_2 <- data.frame(read.table("../Count/cotreated_express/results.xprs",sep="\t", header = TRUE))
express_file_3 <- data.frame(read.table("../Count/elevated_express/results.xprs",sep="\t", header = TRUE))

subset1 <- data.frame(express_file_1$target_id, express_file_1$target_id, express_file_1$tot_counts, express_file_1$length,express_file_1$fpkm)
subset2 <- data.frame(express_file_2$target_id, express_file_2$target_id, express_file_2$tot_counts, express_file_2$length,express_file_2$fpkm)
subset3 <- data.frame(express_file_3$target_id, express_file_3$target_id, express_file_3$tot_counts, express_file_3$length,express_file_3$fpkm)

sorted1 <- subset1[order(express_file_1$target_id),]
sorted2 <- subset2[order(express_file_2$target_id),]
sorted3 <- subset3[order(express_file_3$target_id),]

subset1 <- rename(subset1,c("express_file_1.target_id"="GeneSymbol","express_file_1.target_id.1"="GeneName","express_file_1.tot_counts"="Read Count","express_file_1.length" ="Gene_exon_length","express_file_1.fpkm"="RPKM"))
subset2 <- rename(subset2,c("express_file_2.target_id"="GeneSymbol","express_file_2.target_id.1"="GeneName","express_file_2.tot_counts"="Read Count","express_file_2.length" ="Gene_exon_length","express_file_2.fpkm"="RPKM"))
subset3 <- rename(subset3,c("express_file_3.target_id"="GeneSymbol","express_file_3.target_id.1"="GeneName","express_file_3.tot_counts"="Read Count","express_file_3.length" ="Gene_exon_length","express_file_3.fpkm"="RPKM"))

write.table(sorted1, "ambient.read_cnt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(sorted2, "cotreated.read_cnt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(sorted3, "elevated.read_cnt", sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)


We have created a R script (eXpress2gfold.R), which will be called, through eXpress2gfold.sh script. The script is located at /common/RNASeq_Workshop/RedSpruce/gfold/eXpress2gfold.sh .
The above script will produce the following gfold read count files as the output, which will be used in analyzing the differential expression.

|-- ambient.read_cnt
|-- cotreated.read_cnt
`-- elevated.read_cnt


Differential expression using Gfold:

Using the formatted count files, we will use Gfold to analyze the differential expression of the three samples. Gfold program is very useful when no replicates are available. Gfold generalizes the fold change by, considering the log fold change value for each gene, and assigns it as a Gfold value. It overcomes the short comping of the p-value, that measure the expression under different conditions, by measuring a relative expression value. It also overcomes the deficiency of low read counts. The following code finds the differentially expressed genes between the two groups of samples.

gfold diff -s1 cotreated -s2 elevated -suf .read_cnt -o cotreated_VS_elevated.diff
gfold diff -s1 cotreated -s2 ambient -suf .read_cnt -o cotreated_VS_ambient.diff
gfold diff -s1 elevated -s2 ambient -suf .read_cnt -o elevated_VS_ambient.diff
Usage: gfold [jobs] [options]
diff     Calculate GFOLD value and other statics. It accepts the count values as input. 

Options explained:
-s1       prefix for read_count of 1st group separated by comma.
-s2       prefix for read_count of 2nd group separated by comma.
-suf      suffix of count file
-o        output file and for diff job, there will be two output files, where second will have .ext extension.
-norm     normalization method. 'Count' stands for normalization by total number of mapped reads.


The prepared script for running Gfold is called, gfold.sh and it is located at: /common/RNASeq_Workshop/RedSpruce/gfold. The job will create following files:

|-- cotreated_VS_ambient.diff
|-- cotreated_VS_ambient.diff.ext
|-- cotreated_VS_elevated.diff
|-- cotreated_VS_elevated.diff.ext
|-- elevated_VS_ambient.diff
`-- elevated_VS_ambient.diff.ext


The first few lines of the output file of cotreated_VS_ambient.diff will look like:

# This file is generated by gfold V1.1.4 on Wed Jul 19 13:51:02 2017
# Normalization constants :
#    cotreated  52838582        1.0984
#    elevated   55466744        1
# The GFOLD value could be considered as a reliable log2 fold change.
# It is positive/negative if the gene is up/down regulated.
# A gene with zero GFOLD value should never be considered as 
# differentially expressed. For a comprehensive description of 
# GFOLD, please refer to the manual.
#GeneSymbol   GeneName   GFOLD(0.01)  E-FDR   log2fdc     1stRPKM   2ndRPKM
Gene.10065    Gene.10065   -0.15334      1     -0.2063     84.977    63.876
Gene.10096    Gene.10096    2.46607      1      2.5501     25.495   129.567
Gene.10097    Gene.10097    1.90102      1      2.072      27.710   101.209
Gene.10138    Gene.10138    1.38044      1      1.4319     55.704   130.366
Gene.10198    Gene.10198   -0.24752      1     -0.2928    121.876    86.279
Gene.10423   Gene.10423     2.83385      1      2.91202     26.908   175.725
Gene.10440   Gene.10440     0            1     -0.05898     55.542    46.239
Gene.10487   Gene.10487     2.39166      1      2.56462      7.236    37.198


The output file contains 7 columns, namely “GeneSymbol”, “GeneName”, “GFOLD(0.01)”,
“E-FDR”, “log2fdc”, “1stRPKM”, “2ndRPKM”. From that, we will focus on the GFOLD column for each gene. The GFOLD value, can be considered as a reliable log2 fold value indicating the change. The positive value indicate a up regulated and a negative value indicate a down regulated gene.

Functional annotation of assembled transcripts with BLAST2GO:

Before proceed to the Functional annotation, we need to select the most differential expressed genes from GFOLD output and extract corresponded nucleotide sequences as input file of Blast2go. Python scirpt ExtractSeqence.py will do the job. The script takes three arguments, a “diff” file generated from GFOLD, a f”asta” file containning either protein sequences or nucleotide sequences, and a number n indicating first n gene IDs users want to select. The script ranks genes in diff file based on GFOLD value. In this tutorial, the first 5 Gene IDs are selected to extract sequences from the cds file. 5 non-identical Sequences are successfully found and stored in a output fasta file.

python ExtractSequence.py ../gfold/cotreated_VS_ambient.diff ../CodingRegions/centroids.fasta.transdecoder.cds 5

The Script ExtractSequence.py can be found at /common/RNASeq_Workshop/RedSpruce/blast2go/. The directory after running is shown below:

|-- ExtractSequence.py
`-- ExtractedSq.fasta

Blast2GO is an interactive bioinformatics platform for functional annotation and analysis of genomic datasets. Blast2GO incorporates NCBI BLAST and various public databases to identify sequences. It also allows user to easily visualize sequenced genomes and associated functional interpretation.  Its interactive interface makes it a great tool for starter in bioinformatics. To download and intall free version Blast2GO, go to its official website and follow the instruction. In this tutorial, we will perform function annotation on assembled transcripts that found to be deferentially expressed between treatments. We are going to run 4 analyses on the sequences: blast, mapping, annotation,  interproscan (only on the PFam database) and graphs. The general work flow is shown figure below.

To load the output fasta file, Copy seqs_to_blast.fasta over to your local machine and open Blast2Go. Navigate to the File menu then –> Load –> Load Sequences –> Create a new project and select your fasta file. Blast2Go should create a nice table from the file.

Basic Local Alignment Search Tool(BLAST) is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. In this tutorial, We are going to run NCBI blast. Make sure all sequences are selected. The blast setting are shown below. Since we use nucleotide sequence of a non-model species  in this example, Blastx and swiss_prot data base are selected in order to retrieve more accurate results from query. Click question mark after each parameter to learn details of each setting. Successfully blasted sequences are labeled in pink color. Check your BLAST settings based on figures below. The blast run will take about one hour. You can access results from BLAST run is located in /common/RNASeq_Workshop/RedSpruce/blast2go/results/

Once the blast is done you can look at the BLAST results. Right click on the orange color finished sequences, and select “Show Blast Results” to show the individual result for each sequence.


For each individual BLAST result, information on the e-value, UniProt codes and more information is given.

InterPro provides a one-stop-shop for protein classification, where all the signatures produced by the different member databases are placed into entries within the InterPro database. Signatures which represent equivalent domains, sites or families are put into the same entry and entries can also be related to one another. Additional information such as a description, consistent names and Gene Ontology (GO) terms are associated with each entry, where possible. To perform InterPro, click the purple symbol on the top and select only Pfam on the pop-up screen. This step can be executed concurrently with Blast, the previous step.

Blast2go performs different mapping steps to link all Blast Hits to the functional information stored in the Gene Ontology database. The go database contains several million functional annotated gene products for hundreds of different species. All annotations are associated to an evidence code which provides information about the quality of this functional assignment. Click the green symbol and hit run to perform mapping. Next step is annotation, where , keep the default settings and run it. Successfully mapped and annotated sequences will turn blue.

To check detailed results of each step, right click the sequence that you selected. For mapped and annotated sequences, we can visualize biological process for each GO term, right click selected sequence -> show GO description -> Choose a GO term -> click show. Figure below shows biological process generated from GO:008150 of sequence Gene.26346.

The results will help us to understand functionality of this deferentially expressed gene set. The b2g file after running is located at /common/RNASeq_Workshop/RedSpruce/blast2go/results/