Author: Lizel Potgieter, adapted by Amrei Binzer-Panchal
2 Introduction
A variant call format (VCF) is a type of file that stores all the information regarding variant and non-variant sites of individuals mapped to a reference assembly. This assembly can be a single chromosome, or a collection of scaffolds- there really is no limit!
3 Data we will be using
To get familiar with the kind of tools and data one typically uses in bioinformatics analyses, we have selected a case from literature.
To get around storage and computational time restrictions, you can do the following anslyses on one scaffold. Redo some of the analyses and statistics that are presented in the supplementary material. Are there any regions under selection in your scaffold?
4 Dowload the data
Download the data from the above links. Make sure you download the data set, as well as the corresponding R scripts.
Hands-on
Look at the data set to see what files there are, what they contain and how they are structured (hint: start at the README file).
5 Tools for VCF analyses
There are a multitude of tools available for the analysis of VCFs. It is almost entirely up to the researcher how they wish to interact with these files. Some bioinformaticians prefer to analyse files in the CLI and only plot the final data in R or python, while others prefer to do everything with R packages. For this part of the workshop, you will work with bits of both aspects.
To plot data, we will use ggplot2, and some functions of the dplyr package.
Hands-on
Load the packages in R (for example in R-studio).
solution:
library("vcfR")
***** *** vcfR *** *****
This is vcfR 1.14.0
browseVignettes('vcfR') # Documentation
citation('vcfR') # Citation
***** ***** ***** *****
library("ggplot2")library("dplyr")
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
6.1 Metadata
To ensure that your plots are easy to understand, it is important to have some metadata stored in the data you wish to plot. This can be sampling location, identity to ancestral genetic population, sampling year, and so on.
The data we will be using has really neat identifiers as the individual names start with the population. Good sample names facilitate analyses, automation and therefore reduce the chance of human error (as careful as you are, mistakes can happen very easily with large amounts of data and automation ensures that mistakes can easily be caught and corrected!) Have a look in the online data to see if you can see where this comes from.
Hands-on
Read in the data. Which file contains the information we need?
In the supplementary data, the authors forgot to add the location of the kai population, here we just assigned it to NEF. Let us see if that was the correct designation.
ggplot(input_data, aes(x=ind, y =bio01, fill=Population)) +geom_bar(stat ="identity") +facet_wrap(~Location, scales ="free_x")
Hands-on
What is plotted here? What else would you like to plot to get to know more about your data?
solution:
The key to what the different columns in Environmental_variation.csv mean is in the file README_phenig.txt.
Column bio01, plotted here, is the annual mean temperature at the sample site.
7 Assessing quality of the VCF
We have a special version of VCFtools that supports haploid data (https://github.com/jydu/vcftools). Everything remains the same as the original version of VCFtools, with added support for haploid datasets that becomes important for some of the summary statistics. Have a look at the VCFtools manual to see the range of possible functions within this software (https://vcftools.sourceforge.net/man_latest.html).
In VCFtools, it is important to remember to use the –recode option if you would like to output a new VCF that is filtered or contains a subset of positions or individuals. –out only sets the name of the output file. Use concise names so that it is easy for you to see what’s in the output.
Let us start by having a look at the sequencing depth of various aspects of the VCF. From the manual, figure out what each function means. Use less to see what’s in each of these output files.
vcftools-haploid--vcf--depth
vcftools-haploid--vcf--site-depth
vcftools-haploid--vcf--site-mean-depth
Next, we want to see what the quality per site is
vcftools-haploid--vcf--site-quality
I prefer analysing these outputs in R. For that, download the output files to your computer through FileZilla and import them into R. From there, decide on upper and lower boundaries to filter the variants on your scaffold.
Keep only the scaffold your group is working on in your folder
vcftools-haploid--vcf--chr--recode
For downstream analyses, we need to remove all indels from the VCF
vcftools-haploid--vcf--remove-indels--recode
For most analyses, we only consider bi-allelic sites
Fst. Here, create a file of individuals from 6 populations (2 SWF, 2 NEF, Sweden, Poland) that you will use as pop1 and pop2 to do a pairwise Fst analysis to determine what the fixation index is.
PCA that will also be plotted in R. For the PCA, you will plot the eigenvalues to illustrate what each component explains in your data, as well as PC1 and PC2, Pc2 and PC3, and PC3 and PC4.
plink--vcf--pca
If there is a region that is under selection within your data, have a look at the gff3 file to see the genes that are present within the region.
With these kinds of analyses, you are very well on the way to finding regions that are under selection by combining Tajima’s D and Fst. You can show which regions have higher SNP density.