How can Kmer estimation help to find genome sizes?
For a given sequence of length L, and a kmer size of k, the total kmer’s possible will be given by ( L – k ) + 1
e.g. In a sequence of length of 14, and a kmer length of 8, the number of kmer’s generated will be:
GATCCTACTGATGC ( L = 14 ) on decomposition of kmers of length k = 8, Total number of kmer's generated will be n = ( L  k ) + 1 = (14  8 ) + 1 = 7 GATCCTAC, ATCCTACT, TCCTACTG, CCTACTGA, CTACTGAT, TACTGATG, ACTGATGC
For shorter fragment sizes, as in the above example, the Total number of kmers estimated is n = 7, and it is not close to actual fragment size of L which is 14 bps. But for larger fragment size, the total number of kmer’s (n) provide a good approximation to the actual genome size. The following table tries to illustrate the approximation:
k=18  
Genome Sizes 
Total Kmers of k=18 
% error in genome estimation 

L  N=(LK)+1  
100  83  17  
1000  983  1.7  
10000  9983  0.17  
100000  99983  0.017  
1000000  999983  0.0017  1MB genome size 
So for a genome size of 1 Mb, the error between estimation and reality is only .0017%. Which is a very good approximation of actual size.
In the case of, 10 copies (C) of GATCCTACTGATGC sequence, then the total no of kmer’s (n) of length k = 8 will be 70.
n = [( L  k ) + 1 ] * C = [(14  8 ) + 1] * 10 = 70
To get the actual genome size, we simply need to divide the total by the number of copies:
= n / C = 70 / 10 = 7
That will help us to understand, that we never sequence a single copy of genome but a population. Hence we end up sequencing C copies of genome. This is also referred as coverage in sequencing. To obtain actual genome size (N), divide the total kmers (n) by coverage (C).
N = n / C
kmer Distribution of a Typical Real World Genome
Major issue that is faced in a real world genome sequencing projects is a nonuniform coverage of genome. This can be accounted to technical and biological variables.
example: biased amplification of certain genomic regions during PCR (a step in preparation of sequencing libraries) and presence of repetitive sequences in genome.
kmer size:
The size of kmers should be large enough allowing the kmer to map uniquely to the genome (a concept used in designing primer/oligo length for PCR). Too large kmers leads to overuse of computational resources.
In the first step, kmer frequency is calculated to determine the coverage of genome achieved during sequencing. There are software tools like Jellyfish that helps in finding the kmer frequency in sequencing projects. The kmer frequency follows a pseudonormal distribution (actually it is a Poisson distribution) around the mean coverage in histogram of kmer counts.
Once the kmer frequencies are calculated a histogram is plotted to visualize the distribution and to calculate mean coverage.
Figure 1: (A2) An example of kmer histogram. The xaxis (V1), is the frequency or the number of times a given kmer is observed in the sequencing data. The yaxis (V2), is the total number of kmers with a given frequency.
The first peak is (in red region) primarily due to rare and random sequencing errors in reads. The values in graph can be trimmed to remove reads with sequencing errors.
With the assumption that kmers are uniquely mapped to genome, they should be present only once in a genome sequence. So their frequency will reflect the coverage of the genome.
For calculation purposes we use mean coverage which is 14 in above graph. The area under the curve will represent the total number of kmers.
So the genome estimation will be:
N = Total no. of kmers/Coverage = Area under curve /mean coverage(14)
Methodology
The study was conducted to estimate the genome size of the species with low coverage short read data to validate existing estimates (flow cytometry sourced) or produce a new computational estimate. The genome size is calculated from short subsequences (kmers) of Illumina short read data. A larger kmer size need to be considered for the genome estimation.
Reads should be quality controlled before the genome estimate is provided. Numerous programs exist for this purpose but Sickle (https://github.com/ucdavisbioinformatics/sickle) is the application of choice in this tutorial. We require a minimum Phredscaled quality value of 25 to for estimates.
Upon performing quality control, the kmer distribution was calculated using the Jellyfish kmer counting program. Then a histogram was constructed to perform the genome estimation using the same program.
The R statistical package is used to plot the binned distributions for the selected kmer lengths. Initally the full data set is plotted and initial data points are often very high number due to the low frequency data points, and thus it should be avoided. Once the peak position is determined the total number of kmers in the distribution is calculated, and then the genome size can be estimated using the peak position. In the ideal situation (or theoretically) the peak shape should be a Poisson distribution. In order to come up with a kmer size a number of kmer sizes are selected and genome estimation is done, where we see a regular distribution of the genome size.
Data sets used in this tutorial are available on BBC cluster:
Path: /common/Tutorial/Genome_estimation Data sets: sample_read_1.fastq sample_read_2.fastq Script: GenomeEstimationScript.sh
Tutorial Outline
 Count kmer occurrence using Jellyfish 2.2.6
 Generate histogram using R
 Determine the single copy region and the total kmers
 Determine peak position and genome size
 Compare the peak shape with Poisson distribution
1. Count kmer occurrence using Jellyfish 2.2.6
jellyfish count t 8 C m 19 s 5G o 19mer_out minqualchar=? /common/Tutorial/Genome_estimation/sample_read_1.fastq /common/Tutorial/Genome_estimation/sample_read_2.fastq
options used in the counting kmers:
t treads=unit32 Number of treads to be used in the run. eg: 1,2,3,..etc. C bothstrands Count both strands m merlen=unit32 Length of the kmer s size=unit32 Hash size / memory allocation o output=string Output file name minqualitychar Base quality value. Version 2.2.3 of Jellyfish uses the “Phred” score, where "?" = 30
If you need to look at a detailed usage of ‘count’ in jellyfish, type the following after loading the jellyfish module.
jellyfish <cmd> [options] jellyfish count help or jellyfish count fullhelp
In order to run the program Jellyfish in the bbc cluster, compose the following script and save it or you can copy GenomeEstimationScript.sh from the bbc cluster, which is located at the following path. (More on the commands used in the script can be found in here)
#!/bin/bash #$ N jellyfish #$ M user_email_id #$ q highmem.q #$ m bea #$ S /bin/bash #$ cwd #$ pe smp 8 #$ o JellyFish_$JOB_ID.out #$ e Jellyfish_$JOB_ID.err module load jellyfish/2.2.6 jellyfish count t 8 C m 19 s 5G o 19mer_out minqualchar=? /common/Tutorial/Genome_estimation/sample_read_1.fastq /common/Tutorial/Genome_estimation/sample_read_2.fastq
It will create a out put file called 19mer_out. Then using the above created file (19mer_out), data points for a histogram will be created using the following command:
jellyfish histo o 19mer_out.histo 19mer_out
If you open the 19mer_out.histo file, it will have the frequency counting of each kmer with the kmer length of 19.
1 13016694 2 3159677 3 3938273 4 5416130 5 7140173 6 8860956 7 10461902 8 11938782 9 13277372 10 14419501 11 15230762 12 15594907 13 15416499 14 14655690 15 13442589 16 11834646 17 10049240 18 8239408 19 6533641 20 5047445 21 3808930 22 2817665 23 2069050 24 1522365 25 1134562
2. Plot results using R
To visualize and to plot the data we use R Studio. Using the following command, we will load the data from the out_19mer.histo file in to a data frame called dataframe19.
dataframe19 < read.table("19mer_out.histo") #load the data into dataframe19 plot(dataframe19[1:200,], type="l") #plots the data points 1 through 200 in the dataframe19 using a line
This will create a line graph using the first 200 data points and the graph will look like the following:
In general, very low frequency kmers represent high numbers that would skew the y –axis. If we look at the data points in the histogram file, we can see that the very first data point has a exceptionally high value than the next (second) data point. So we remove just the first line and replot using R. From now on we will disregard the first data point for our calculations.
plot(dataframe19[2:200,], type="l")
3. Determine the single copy region and the total number of kmers
To determine the cut off points in the single copy region, we need to see the actual data point positions in the graph. In the initial examination the peak starts from the 2^{nd} data point onward. So we will disregard the first data point in determining the single copy region.
Now using R we will replot the graph to determine the single copy region. Then we will include the points for clarity on the same graph.
plot(dataframe19[2:100,], type="l") #plot line graph points(dataframe19[2:100,]) #plot the data points from 2 through 100
According to the graph, the single copy region would be between points 2 and 28.
Assuming the total number of data points is 9325, we can now calculate the total kmers in the distribution.
sum(as.numeric(dataframe19[2:9325,1]*dataframe19[2:9325,2]))
It will produce the results as:
[1] 3667909811
4. Determine peak position and genome size
From the plotted graph we can get an idea where the peak position lies, and it should be between 520 data points. Now to determine the peak, thus, we need to look at the actual data points in that region. Using the below command we will examine the actual data points between 5 and 20.
data[10:20,]
V1 V2
5 5 7140173
6 6 8860956
7 7 10461902
8 8 11938782
9 9 13277372
10 10 14419501
11 11 15230762
12 12 15594907
13 13 15416499
14 14 14655690
15 15 13442589
16 16 11834646
17 17 10049240
18 18 8239408
19 19 6533641
In this case the peak is at, 12. So the Genome Size can be estimated as:
sum(as.numeric(dataframe19[2:9325,1]*dataframe19[2:9325,2]))/12
Where the genome size is:
[1] 305659151
~ 305 Mb
It would be interesting to see the proportionality of the single copy region compared to the total genome size. In this data set the single copy region is between data point 2 and 28, So the size of single copy region can be roughly calculated as:
sum(as.numeric(dataframe19[2:28,1]*dataframe19[2:28,2]))/12
[1] 213956126
~ 213 Mb
The proportion can be calculated as:
(sum(as.numeric(dataframe19[2:28,1]*dataframe19[2:28,2]))) / (sum(as.numeric(dataframe19[2:9325,1]*dataframe19[2:9325,2])))
[1] 0.6999827
~ 70 %
5. Compare the peak shape with Poisson distribution
Since we have a nice curve, we can compare our curve to a theoretical curve, which is the Poission distribution.
singleC < sum(as.numeric(dataframe19[2:28,1]*dataframe19[2:28,2]))/12 poisdtb < dpois(1:100,12)*singleC plot(poisdtb, type='l', lty=2, col="green") lines(dataframe19[1:100,12] * singleC, type = "l", col=3)#, Ity=2) lines(dataframe19[1:100,],type= "l")
Likewise the procedure can be iterated across different number of kmers.
Kmer length  19  21  23  25  27  29  31 
total No of fields  9741  9653  9436  9325  9160  8939  8818 
Total Kmer count  3.66E+09  4.4E+09  4.53E+09  4.67E+09  4.26E+09  4.44E+09  3.98E+09 
Genome size  3.05E+08  2.9E+08  3.02E+08  3.14E+08  3.04E+08  3.41E+08  3.06E+08 
single copy region  2.13E+08  2E+08  2.08E+08  2.25E+08  2.21E+08  2.36E+08  2.3E+08 
Proportion  0.69998  0.69403  0.688915  0.716151  0.725814  0.690277  0.750801 