This tutorial will serve as an example of how to use free and open-source genome assembly and secondary scaffolding tools to generate high quality assemblies of bacterial sequence data. The bacterial sample used in this tutorial will be referred to simply as “Species” since it is live data. This data is paired-end data, meaning that there are forward and reverse reads, which we will designate as Sample_R1.fastq and Sample_R2.fastq, respectively.
Software download links:
Assembly tutorial directory
Sickle: Quality control on raw reads
The first step is to perform quality control on the reads using sickle. To run the program we will use the
sickle command. Since our reads are paired-end reads, we indicate this with the
pe option. The
-f flag designates the input file containing the forward reads,
-r the input file containing the reverse reads,
-o the output file containing the trimmed forward reads,
-p the output file containing the trimmed reverse reads, and
-s the output file containing trimmed singles. The
-q flag designates the minimum quality,
-l the minimum read length, and
-t designates the type of read.
module load sickle/1.33
sickle pe -f /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Sample_R1.fastq -r /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Sample_R2.fastq -t sanger -o Sample_1.fastq -p Sample_2.fastq -s Sample_s.fastq -q 30 -l 45
The trimmed quality control files are located in
/UCHC/PublicShare/Tutorials/Assembly_Tutorial/Quality_Control and the script to perform the quality control is located at
ABySS: de novo sequence assembler
ABySS is the first assembly program we will use to assemble our trimmed reads. Since our reads are paired-end reads, to run the assembler we will use the
abyss-pe command. We will use the parameters
k for the size of the kmer,
name for the output file prefix,
in for the paths to the forward/reverse trimmed reads, and
se for the path to the singles file,
np for number of processors, which in this case should be as same as number of processors declared in the header of your shell script.
module load abyss/2.1.4
abyss-pe np=8 k=31 name=Sample_Kmer31 in='/UCHC/PublicShare/Tutorials/Assembly_Tutorial/Quality_Control/Sample_1.fastq /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Quality_Control/Sample_2.fastq' se='/UCHC/PublicShare/Tutorials/Assembly_Tutorial/Quality_Control/Sample_s.fastq'
# repeat for k=35, k=41, etc
The kmers used in this example can be viewed as a starting point to get an idea of what kmer would best assemble the data. The assembly output files are located in
/UCHC/PublicShare/Tutorials/Assembly_Tutorial/Assembly/ABySS and the script to perform assembly is located at
/UCHC/PublicShare/Tutorials/Assembly_Tutorial/Assembly/Sample_assembly.sh. Note that this script also includes the assembly commands for SOAP and SPAdes.
SOAPdenovo: de novo sequence assembler
SOAPdenovo is another de novo sequence assembler. Unlike the other assemblers, SOAP uses a config file to pass information about the sequences into the program. The configuration file is shown below. Notable fields include average insert size and read length, which differ depending on the sequencing technology, and q1, q2, and q; the paths to the forward, reverse and singles trimmed reads.
#maximal read length max_rd_len=250 [LIB] #average insert size avg_ins=550 #if sequence needs to be reversed reverse_seq=0 #in which part(s) the reads are used asm_flags=3 #use only first 250 bps of each read rd_len_cutoff=250 #in which order the reads are used while scaffolding rank=1 # cutoff of pair number for a reliable connection (at least 3 for short insert size) pair_num_cutoff=3 #minimum aligned length to contigs for a reliable read location (at least 32 for short insert size) map_len=32 # path to genes q1=/UCHC/PublicShare/Tutorials/Assembly_Tutorial/Quality_Control/Sample_1.fastq q2=/UCHC/PublicShare/Tutorials/Assembly_Tutorial/Quality_Control/Sample_2.fastq q=/UCHC/PublicShare/Tutorials/Assembly_Tutorial/Quality_Control/Sample_s.fastq
To run the assembler we will use the
SOAPdenovo-63mer command with the
all option (to perform kmer graph construction, contig error correction, mapping of reads to contigs, and scaffolding),
-s for the path to the config file,
-K for the size of the kmer,
-o for the output prefix, 1 for assembly log, and 2 for assembly errors.
module load SOAP-denovo/2.04 SOAPdenovo-63mer all -s /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Assembly/Sample.config -K 31 -R -o graph_Sample_31 1>ass31.log 2>ass31.err # repeat for k=35, k=41, etc
The assembly output files are located in
SPAdes: de Bruijn graph based assembler
The last assembler we will run is SPAdes. SPAdes is different from the other assemblers in that it generates a final assembly from multiple kmers. A list of kmers is automatically selected by SPAdes using the maximum read length of the input data, and each individual kmer contributes to the final assembly. To run SPAdes we will use the
spades.py command with the
--careful option to minimize the number of mismatches in the contigs,
-o for the output folder,
-1 for the path to the forward reads,
-2 for the path to the reverse reads, and
-s for the path to the singles reads. If desired, a list of kmers can be specified with the
-k flag which will override automatic kmer selection.
module load SPAdes/3.13.0 spades.py --careful -o SPAdes_out -1 /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Quality_Control/Sample_1.fastq -2 /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Quality_Control/Sample_2.fastq -s /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Quality_Control/Sample_s.fastq
The assembly output files are located in
QUAST: assembly statistics
Now that we have several assemblies, it’s time to analyze the quality of each assembly. ABySS and SOAPdenovo both have their own statistics output, but for consistency, we will be using the program QUAST. The statistics we are most interested in are number of contigs, total length, and N50. A good assembly would have a low number of contigs, a total length that makes sense for the species, and a high N50 value. To run quast on all of our final assembly files we will run the following commands, with the only parameters used being the name of the assembly file(s) and output directory.
module load quast/5.0.2 # ABySS statistics quast.py /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Assembly/ABySS/Sample_Kmer*-scaffolds.fa -o ABySS
# SOAPdenovo statistics quast.py /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Assembly/SOAP/graph_Sample_*.scafSeq -o SOAP
# SPAdes statistics quast.py /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Assembly/SPAdes/scaffolds.fasta -o SPAdes
From the data, it’s clear that SPAdes performed the best. SPAdes generated only 59 contigs as compared to ~200 from SOAP and ~300 from ABySS. Additionally, the largest contig size and N50 values were the highest. Finally, the total number of base pairs was closest to the number of base pairs in a different strain of this bacteria that has already been sequenced. We will proceed to secondary scaffolding with this assembly, located in
QUAST’s output consists of a folder containing results in multiple formats within each of the three assembly directories. The script to run QUAST is located at
SSPACE is a script able to extend and scaffold pre-assembled contigs. SSPACE requires a library file containing the paths to the paired end reads, average insert size, and type of data. This file is located at
We will run SSPACE using a perl command with the parameters
-l for the species library,
-s for the fasta file containing assembled scaffolds,
-b for the output prefix, and
-T for the number of threads.
module load SSPACE/3.0 SSPACE_Standard_v3.0.pl -l /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Species_library.txt -s /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Assembly/SPAdes/scaffolds.fasta -b SSPACE -T 16
The output file is located at
/UCHC/PublicShare/Tutorials/Assembly_Tutorial/Scaffolding/SSPACE/Sample_SSPACE.final.scaffolds.fasta. The script to run SSPACE is located at
We then will run QUAST on this file to compare it with previous assemblies. This time, we will run QUAST over the command line without a submit script, since it is only one line.
cd /UCHC/PublicShare/Tutorials/Assembly_Tutorial/QUAST module load quast/5.0.2 quast.py /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Scaffolding/SSPACE/SSPACE.final.scaffolds.fasta -o SSPACE
AlignGraph on close relation (different strain of species)
AlignGraph is the final step in this assembly pipeline. From the documentation, “AlignGraph is a software that extends and joins contigs or scaffolds by reassembling them with help provided by a reference genome of a closely related organism.” By using a reference genome of a closely related organism, it can improve the assembly.
To run AlignGraph we first need to convert the raw reads from fastq format to fasta format. There are many ways to do this, but one of the most efficient ways is to use a
sed command to parse out the reads from the fastq file:
sed -n '1~4s/^@/>/p;2~4p' /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Sample_R1.fastq > Sample_R1.fasta
sed -n '1~4s/^@/>/p;2~4p' /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Sample_R2.fastq > Sample_R2.fasta
Then we will run AlignGraph using the
AlignGraph command and the parameters
--read1 for the forward read in fasta format,
--read2 for the reverse read in fasta format,
--contig for the path to the assembly we are rescaffolding, and
--genome for the path to the reference genome we are using for rescaffolding. The genome we are using is named AlignGraph_genome.fasta, again to protect the live data.
Additionally, we have to define the
--distanceHigh parameters. From the documentation, distanceLow is the maximum of [insert size – 1000, insert size] and distanceHigh [insert size + 1000]. The insert size of this dataset is 550, giving us a distanceLow of 550 and distanceHigh of 1550. Finally, we define the output file names using
--remainingContigs. –remainingContigs will contain the final assembly.
module load AlignGraph/v1
AlignGraph --read1 /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Scaffolding/Sample_R1.fasta --read2 /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Scaffolding/Sample_R1.fasta --contig /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Scaffolding/SSPACE/SSPACE.final.scaffolds.fasta --genome /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Scaffolding/AlignGraph_genome.fasta --distanceLow 550 --distanceHigh 1550 --extendedContig Species_extendedContigs.fa --remainingContig Species_remainingContigs.fa
The output file is located at
/UCHC/PublicShare/Tutorials/Assembly_Tutorial/Scaffolding/AlignGraph/Sample_remainingContigs.fa. The script to run AlignGraph is located at
cd /UCHC/PublicShare/Tutorials/Assembly_Tutorial/QUAST module load quast/5.0.2 quast.py /UCHC/PublicShare/Tutorials/Assembly_Tutorial/Scaffolding/AlignGraph/Species_remainingContigs.fa -o AlignGraph
Unfortunately, this dataset was not improved by AlignGraph with this specific genome, but this tutorial still illustrates the general idea.