Population genomics – vcf

1 VCF Practical

Author: Lizel Potgieter, adapted by Amrei Binzer-Panchal

2 Introduction

A variant call format (VCF) is a type of file that stores all the information regarding variant and non-variant sites of individuals mapped to a reference assembly. This assembly can be a single chromosome, or a collection of scaffolds- there really is no limit!

3 Data we will be using

To get familiar with the kind of tools and data one typically uses in bioinformatics analyses, we have selected a case from literature.

To get around storage and computational time restrictions, you can do the following anslyses on one scaffold. Redo some of the analyses and statistics that are presented in the supplementary material. Are there any regions under selection in your scaffold?

4 Dowload the data

Download the data from the above links. Make sure you download the data set, as well as the corresponding R scripts.

Hands-on

Look at the data set to see what files there are, what they contain and how they are structured (hint: start at the README file).

5 Tools for VCF analyses

There are a multitude of tools available for the analysis of VCFs. It is almost entirely up to the researcher how they wish to interact with these files. Some bioinformaticians prefer to analyse files in the CLI and only plot the final data in R or python, while others prefer to do everything with R packages. For this part of the workshop, you will work with bits of both aspects.

6 R

One popular R package for VCF analyses is vcfR.

To plot data, we will use ggplot2, and some functions of the dplyr package.

Hands-on

Load the packages in R (for example in R-studio).

solution:

library("vcfR")


   *****       ***   vcfR   ***       *****
   This is vcfR 1.14.0 
     browseVignettes('vcfR') # Documentation
     citation('vcfR') # Citation
   *****       *****      *****       *****

library("ggplot2")
library("dplyr")


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

6.1 Metadata

To ensure that your plots are easy to understand, it is important to have some metadata stored in the data you wish to plot. This can be sampling location, identity to ancestral genetic population, sampling year, and so on.

The data we will be using has really neat identifiers as the individual names start with the population. Good sample names facilitate analyses, automation and therefore reduce the chance of human error (as careful as you are, mistakes can happen very easily with large amounts of data and automation ensures that mistakes can easily be caught and corrected!) Have a look in the online data to see if you can see where this comes from.

Hands-on

Read in the data. Which file contains the information we need?

solution:

input_data <- read.csv("/Users/amrei/Desktop/NGS_data_Lizel/doi_10.5061_dryad.7d7wm37wr__v4/Environmental_variation.csv") 
str(input_data)

'data.frame':   275 obs. of  24 variables:
 $ pop          : chr  "bil" "bil" "bil" "bil" ...
 $ ind          : chr  "Bil1Phenig4" "BilPhenig1" "BilPhenig10" "BilPhenig11" ...
 $ Latitude     : num  63.1 63.1 63.1 63.1 63.1 ...
 $ longitude    : num  16.6 16.6 16.6 16.6 16.6 ...
 $ bio01        : num  1.81 1.81 1.81 1.81 1.81 ...
 $ bio02        : num  8.32 8.32 8.32 8.32 8.32 ...
 $ bio03        : num  25.9 25.9 25.9 25.9 25.9 ...
 $ bio04        : num  845 845 845 845 845 ...
 $ bio05        : num  17.4 17.4 17.4 17.4 17.4 17.4 17.4 17.4 17.4 17.4 ...
 $ bio06        : num  -14.8 -14.8 -14.8 -14.8 -14.8 -14.8 -14.8 -14.8 -14.8 -14.8 ...
 $ bio07        : num  32.2 32.2 32.2 32.2 32.2 32.2 32.2 32.2 32.2 32.2 ...
 $ bio08        : num  12.6 12.6 12.6 12.6 12.6 ...
 $ bio09        : num  -3.98 -3.98 -3.98 -3.98 -3.98 ...
 $ bio10        : num  12.6 12.6 12.6 12.6 12.6 ...
 $ bio11        : num  -8.35 -8.35 -8.35 -8.35 -8.35 -8.35 -8.35 -8.35 -8.35 -8.35 ...
 $ bio12        : int  668 668 668 668 668 668 668 668 668 668 ...
 $ bio13        : int  94 94 94 94 94 94 94 94 94 94 ...
 $ bio14        : int  31 31 31 31 31 31 31 31 31 31 ...
 $ bio15        : num  32.8 32.8 32.8 32.8 32.8 ...
 $ bio16        : int  240 240 240 240 240 240 240 240 240 240 ...
 $ bio17        : int  109 109 109 109 109 109 109 109 109 109 ...
 $ bio18        : int  240 240 240 240 240 240 240 240 240 240 ...
 $ bio19        : int  123 123 123 123 123 123 123 123 123 123 ...
 $ mean_NHx_2009: num  3.27e-12 3.27e-12 3.27e-12 3.27e-12 3.27e-12 ...

Using we can use the information in the individual IDs to make two new columns, Population and Location:

input_data <- input_data %>%
  mutate(Population = case_when(
    startsWith(ind, "Bil") ~ "bil",
    startsWith(ind, "BNP") ~ "bnp",
    startsWith(ind, "Her") ~ "her",
    startsWith(ind, "Iss") ~ "iss",
    startsWith(ind, "Kai") ~ "kai",
    startsWith(ind, "Koi") ~ "koi",
    startsWith(ind, "Kot") ~ "kot",
    startsWith(ind, "Luu") ~ "luu",
    startsWith(ind, "Met") ~ "met",
    startsWith(ind, "Mor") ~ "mor",
    startsWith(ind, "Mus") ~ "mus",
    startsWith(ind, "Myl") ~ "myl",
    startsWith(ind, "Pam") ~ "pam",
    startsWith(ind, "Pat") ~ "pat",
    startsWith(ind, "Pet") ~ "pet",
    startsWith(ind, "Puk") ~ "puk",
    startsWith(ind, "Rus") ~ "rus",
    startsWith(ind, "Sig") ~ "sig",
    startsWith(ind, "Skj") ~ "skj",
    startsWith(ind, "eSK") ~ "skj",
    startsWith(ind, "Spk") ~ "spk",
    startsWith(ind, "Sus") ~ "sus",
    startsWith(ind, "Ulv") ~ "ulv",
    startsWith(ind, "Van") ~ "van",
    startsWith(ind, "Ves") ~ "ves"
  ))

 input_data <- input_data %>%
  mutate(Location = case_when(
    startsWith(ind, "Bil") ~ "Sweden",
    startsWith(ind, "BNP") ~ "Poland",
    startsWith(ind, "Her") ~ "SWF",
    startsWith(ind, "Iss") ~ "NEF",
    startsWith(ind, "Kai") ~ "NEF",
    startsWith(ind, "Koi") ~ "NEF",
    startsWith(ind, "Kot") ~ "SWF",
    startsWith(ind, "Luu") ~ "SWF",
    startsWith(ind, "Met") ~ "SWF",
    startsWith(ind, "Mor") ~ "Norway",
    startsWith(ind, "Mus") ~ "SWF",
    startsWith(ind, "Myl") ~ "SWF",
    startsWith(ind, "Pam") ~ "NEF",
    startsWith(ind, "Pat") ~ "NEF",
    startsWith(ind, "Pet") ~ "SWF",
    startsWith(ind, "Puk") ~ "SWF",
    startsWith(ind, "Rus") ~ "Russia",
    startsWith(ind, "Sig") ~ "Norway",
    startsWith(ind, "Skj") ~ "Norway",
    startsWith(ind, "eSK") ~ "Norway",
    startsWith(ind, "Spk") ~ "Norway",
    startsWith(ind, "Sus") ~ "SWF",
    startsWith(ind, "Ulv") ~ "NEF",
    startsWith(ind, "Van") ~ "NEF",
    startsWith(ind, "Ves") ~ "SWF"
))

Hands-on

Where did we find the location information for the samples?

solution:

In the supplementary information available for download together with the publication.

Scroll down to supporting information. Name of the file: mec16369-sup-0001-supinfo.docx.

In the supplementary data, the authors forgot to add the location of the kai population, here we just assigned it to NEF. Let us see if that was the correct designation.

Key:

SWF: South Western Finland
NEF: North Eastern Finland

To change the header of a column in R

 names(df)[names(df) == 'old.var.name'] <- 'new.var.name'

6.2 Explore the metadata

Here is a basic barplot in ggplot.

ggplot(input_data, aes(x=ind, y =bio01, fill=Population)) +
geom_bar(stat = "identity") +
facet_wrap(~Location, scales = "free_x")

Hands-on

What is plotted here? What else would you like to plot to get to know more about your data?

solution:

The key to what the different columns in Environmental_variation.csv mean is in the file README_phenig.txt.

Column bio01, plotted here, is the annual mean temperature at the sample site.

7 Assessing quality of the VCF

We have a special version of VCFtools that supports haploid data (https://github.com/jydu/vcftools). Everything remains the same as the original version of VCFtools, with added support for haploid datasets that becomes important for some of the summary statistics. Have a look at the VCFtools manual to see the range of possible functions within this software (https://vcftools.sourceforge.net/man_latest.html).

In VCFtools, it is important to remember to use the –recode option if you would like to output a new VCF that is filtered or contains a subset of positions or individuals. –out only sets the name of the output file. Use concise names so that it is easy for you to see what’s in the output.

Let us start by having a look at the sequencing depth of various aspects of the VCF. From the manual, figure out what each function means. Use less to see what’s in each of these output files.

vcftools-haploid --vcf --depth

vcftools-haploid --vcf --site-depth

vcftools-haploid --vcf  --site-mean-depth

Next, we want to see what the quality per site is

vcftools-haploid --vcf --site-quality

I prefer analysing these outputs in R. For that, download the output files to your computer through FileZilla and import them into R. From there, decide on upper and lower boundaries to filter the variants on your scaffold.

Keep only the scaffold your group is working on in your folder

vcftools-haploid --vcf --chr --recode

For downstream analyses, we need to remove all indels from the VCF

vcftools-haploid --vcf --remove-indels --recode

For most analyses, we only consider bi-allelic sites

vcftools-haploid --vcf --min-alleles 2 --max-alleles 2 --recode

Look at allele frequency

vcftools-haploid --vcf --freq

Tajima’s D in bins

vcftools-haploid --vcf --TajimaD

Pi per site

vcftools-haploid --vcf --site-pi

Pi per window

vcftools-haploid --vcf --window-pi

Fst. Here, create a file of individuals from 6 populations (2 SWF, 2 NEF, Sweden, Poland) that you will use as pop1 and pop2 to do a pairwise Fst analysis to determine what the fixation index is.

vcftools-haploid --vcf --weir-fst-pop pop1 --weir-fst-pop pop2 -fst-window-size

PCA that will also be plotted in R. For the PCA, you will plot the eigenvalues to illustrate what each component explains in your data, as well as PC1 and PC2, Pc2 and PC3, and PC3 and PC4.

plink --vcf --pca

If there is a region that is under selection within your data, have a look at the gff3 file to see the genes that are present within the region.

With these kinds of analyses, you are very well on the way to finding regions that are under selection by combining Tajima’s D and Fst. You can show which regions have higher SNP density.