1 Read Mapping and Variant Calling
Author: Lizel Potgieter, adapted by Amrei Binzer-Panchal
In this part of the exercise, we will have a look at the data, do basic quality control and mapping of the reads.
If you don’t know how to use a tool, check out its help function. This is easily done by adding the -h or --help flag after the tool name, e.g. “fastqc -h”.
For this course we will assume that you have access to an HPC server. If you don’t have access to a server or a powerful personal computer, you can use, for example, Galaxy during the workshop.
2 Introduction
Read mapping and variant calling is the most important aspect of population genomics analyses. Appropriate quality control and the selection of applicable pipelines is therefore crucial to obtain trustworthy data for reliable and reproducible analyses. In this session, we will use simulated data to test the reliability of the methods.
In preparation for this course, I have created a forward simulation of a population of 20 individuals based on the first chromosome of Lentinula edodes (publication here). The assembly was produced from third-generation sequencing. Using simulated data allows us to test the precision and recall of our methods.
After each section, there are some basic questions I’d like you to think about and answer. These will not be graded, and are merely points to ensure that you have understood everything we are covering in each section.
3 The data
3.1 Download
You can download the data here.
3.2 Description
Our data was simulated to mimic the error profile of an Illumina HiSeq 2500.
- The reference sequence: lentinula_scaffold_1.fa
- Forward and reverse reads for 20 simulated individuals SE001 to SE020 in a separate folder: full_pop_fastq
- General feature file that contains the annotation: lentinula_scaffold_1.gff3
- Newick file for the phylogenetic tree: Lentinula_chr1.RealTree.nwk
4 Read Quality Assessment
Tools: FastQC, MultiQC
FastQC and MultiQC
Before we continue the analysis of our data, we need to perform Quality Control (QC) on our raw sequence data. We will use FastQC to generate quality reports from our fastq files, and we will use MultiQC to aggregate those reports.
The quality of short reads has improved drastically over the years, so the reports we’ll get from our simulated data are realistic representations of recently sequenced genomes. For some older sequencing runs, the quality may not be as good, and you may have to employ additional software to control for worse quality and the presence of adapters.
There are several different tools for pre-processing raw data. FastP is one such fastq pre-processing tool, with functions for quality control, adapter trimming, quality filtering, and read pruning.
Other tools for trimming poor-quality sequences are Trimmomatic and the scripts within the BBMap package. We won’t be touching on trimming and repairing mismatched reads in this tutorial, but a typical fastp call is sketched below for reference.
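We will not run it here, but if you ever need to trim adapters and filter by quality, a minimal fastp call looks something like this (the SE001 read file names are only placeholders; adjust to your data):
fastp -i SE001_1.fastq.gz -I SE001_2.fastq.gz -o SE001_1.trimmed.fastq.gz -O SE001_2.trimmed.fastq.gz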
4.1 FastQC
FastQC is a quality control tool for high throughput sequencing data.
For a few samples, generate the FastQC report. Read the help page to identify which options you should use.
solution:
mkdir -p ${infile}_fastqc    # FastQC expects the output directory to exist
fastqc $infile -o ${infile}_fastqc
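To generate reports for every read file in one go (this assumes the reads live in full_pop_fastq/ and end in .fastq.gz; adjust the pattern to your file names):
mkdir -p fastqc_reports
fastqc full_pop_fastq/*.fastq.gz -o fastqc_reports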
4.2 MultiQC
MultiQC complements FastQC: it aggregates the per-sample reports into a single overview. It can also detect and summarise output reports from a wide variety of tools, giving you an overview of the performance of the tools you used on your data.
solution:
multiqc $infile1 $infile2
Questions
What kind of reads do we have?
Why does the quality of the reads decrease slightly at the end of the read?
Is this kind of data expected from a real experiment?
What do you expect to see if there is primer contamination?
What is a deviation from the expected GC possibly indicative of?
How does GC content affect genome assembly and read mapping?
5 Read Mapping
Tools: BWA, Samtools, GATK
To perform variant calling, the reads must be mapped to a reference genome. To accomplish this, we employ the Burrows-Wheeler Alignment (BWA) tool.
“BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.
For all the algorithms, BWA first needs to construct the FM-index for the reference genome (the index command). Alignment algorithms are invoked with different sub-commands: aln/samse/sampe for BWA-backtrack, bwasw for BWA-SW and mem for the BWA-MEM algorithm.” https://bio-bwa.sourceforge.net/bwa.shtml
If you are working in Galaxy, use the BWA-MEM2 tool there; on the command line we will use bwa mem as shown below. Either way, you will need the lentinula_scaffold_1.fa reference sequence, an index of it, and the paired-end reads option to map the reads to the reference sequence.
First, the reference genome must be indexed. BWA needs its own index for mapping, and the .fai index from samtools is used by downstream tools.
bwa index $reference_genome
samtools faidx $reference_genome
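GATK will later also expect a sequence dictionary (.dict) next to the reference. If it is missing, one way to create it (assuming gatk is available on your PATH) is:
gatk CreateSequenceDictionary -R $reference_genome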
Next, the forward and reverse reads are mapped to the reference genome. GATK needs read-group information later, so we add a minimal @RG tag at this stage. Here, try to use a for loop to automate the process (a sketch of such a loop follows the command below). Let me know if you get stuck!
bwa mem -R "@RG\tID:$sample\tSM:$sample" $reference_genome $read1 $read2 > $outfile.sam
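A minimal loop over the 20 individuals might look like this (it assumes the reads are named like full_pop_fastq/SE001_1.fastq.gz and full_pop_fastq/SE001_2.fastq.gz; adjust the paths and suffixes to match your files):
for i in $(seq -w 1 20); do
    sample=SE0$i
    bwa mem -R "@RG\tID:$sample\tSM:$sample" $reference_genome \
        full_pop_fastq/${sample}_1.fastq.gz full_pop_fastq/${sample}_2.fastq.gz > $sample.sam
done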
Have a look at the contents of one of the .sam files. Can you see the various characteristics of the file I spoke about during my introductory lecture? To save some space, convert the .sam to a .bam file with Samtools.
samtools view -b $outfile.sam > $outfile.bam
Remove all the .sam files in your directory.
rm *.sam
Sort the contents of the .bam files.
samtools sort $outfile.bam -o $outfile.sorted.bam
Remove duplicates from the sorted .bam files.
samtools rmdup $outfile.sorted.bam $outfile.sorted.rmdup.bam
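Note that samtools rmdup is deprecated in recent samtools releases in favour of samtools markdup. If rmdup is unavailable to you, a sketch of the markdup route (file names follow the naming used above) would be:
samtools sort -n $outfile.bam -o $outfile.namesorted.bam             # name-sort for fixmate
samtools fixmate -m $outfile.namesorted.bam $outfile.fixmate.bam     # add mate score tags
samtools sort $outfile.fixmate.bam -o $outfile.sorted.bam            # back to coordinate order
samtools markdup -r $outfile.sorted.bam $outfile.sorted.rmdup.bam    # -r removes duplicates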
Index the sorted .bam files that have had duplicates removed.
samtools index $outfile.sorted.rmdup.bam
Now that the pre-processing has been done, we will move over to GATK to call variants. Here it is important to set the ploidy to that of the organism you’re working on; we’re working on a haploid genome in this workshop, so the ploidy is 1. We also run HaplotypeCaller in GVCF mode (-ERC GVCF) so that the per-sample calls can be combined afterwards.
gatk HaplotypeCaller -R $reference_genome -I $outfile.sorted.rmdup.bam \
    -O $outfile.g.vcf.gz -ERC GVCF -ploidy 1
Here you will have an individual GVCF file for each individual. These need to be combined into a single GVCF.
# one -V per sample, i.e. -V SE001.g.vcf.gz -V SE002.g.vcf.gz and so on
gatk CombineGVCFs -R $reference_genome $(printf ' -V %s' *.g.vcf.gz) -O combined.g.vcf.gz
The combined GVCF must then be genotyped.
gatk GenotypeGVCFs -R $reference_genome -V combined.g.vcf.gz -O raw_variants.vcf.gz
Now you have a raw VCF that can be filtered in subsequent steps. In this practical, we will practice plotting some of this really tidy data and compare it to a real-world VCF to see the extent of noise in such a dataset.
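To take a first look at the raw calls (the file name below matches the -O output chosen above), you can print the first few non-header records:
zcat raw_variants.vcf.gz | grep -v "^#" | head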
Questions
At which position is the first variant detected?
What does the QUAL field indicate? Why are some quality scores so much lower than others?
What kind of filtering criteria would you consider with this data?
If you had a diploid species, how would the VCF differ from the one you have produced today?
How would you benchmark your pipeline?
If you are feeling particularly adventurous and want to make full use of the fact that this was a simulated dataset, please see the first optional exercise. If you have had more than enough of read mapping, don’t worry about this exercise!
6 Genome Assembly and Quality Assessment
Tools: SPAdes, QUAST
It’s all well and good if you have a reference genome to work with, but if you don’t, you need to assemble your own. At the moment, this is mostly done with third-generation sequencing, as the reads are longer and telomere-to-telomere assemblies are fairly easily attainable. However, in a lot of studies only Illumina data is available. In this part of the workshop, you will create a de novo assembly for one of the isolates and assess its quality.
SPAdes
The current version of SPAdes works with reads produced by Illumina or IonTorrent and is capable of providing hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads. You can also provide additional contigs that will be used as long reads. SPAdes (v3.15.0) supports paired-end reads, mate-pairs and unpaired reads, and it can take several paired-end and mate-pair libraries as input simultaneously. Note that SPAdes was initially designed for small genomes. It was tested on bacterial (both single-cell multiple displacement amplification (MDA) and standard isolates), fungal and other small genomes. SPAdes is not intended for larger genomes (e.g. mammalian-size genomes); for such purposes you can use it at your own risk. (Reference: https://cab.spbu.ru/software/spades/)
spades.py -1 $infile_read1 -2 $infile_read2 -o $outdir_assembly
Try changing the k-mer sizes (the -k option) and see how that affects your assembly. To get quality statistics for your assembly, run QUAST on your final assembly (a sketch for comparing two k-mer settings follows the command below).
quast $outdir_assembly/scaffolds.fasta -o $outdir_assembly/quast
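One way to compare assemblies built with different k-mer sets (the -k values and directory names below are only examples):
spades.py -1 $infile_read1 -2 $infile_read2 -k 21,33,55 -o assembly_k21-55
spades.py -1 $infile_read1 -2 $infile_read2 -k 55,77 -o assembly_k55-77
quast assembly_k21-55/scaffolds.fasta assembly_k55-77/scaffolds.fasta -o quast_kmer_comparison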
To quickly see how many scaffolds are present in your assembly, use this neat little bash trick:
grep ">" $infile | wc -l
Questions
How many contigs have been built?
What is the mean, min and max length of the contigs?
What is the N50 of the contigs? How do you interpret N50 and L50?
What is the effect on the assembly when you vary the k-mer parameter?
What is the advantage of reference guided assembly over de novo assembly?