Hands-on Lab: Variant Calling

A typical variant calling workflow consists of the following steps:

  • Step 1: Extract DNA from sample (experimental)

  • Step 2: Sequence DNA (experimental)

  • Step 3: Align reads to reference genome (computational)

  • Step 4: Identify genomic variants (computational)

For the purposes of this lab, we will focus on Step 4 and assume the first three steps have already been performed. After Step 3, our data consists of a file of aligned reads in sam format, and a reference genome in fna format. We will be using the samtools software package to do Step 4.

Identify Genomic Variants

First, make a new directory, create a new or copy one of your old job.slurm scripts for this exercise, and collect the following two inputs:

[ls6]$ cp /work/03439/wallen/public/samtools_example/ecoli_reads_aligned.sam ./
[ls6]$ cp /work/03439/wallen/public/samtools_example/ecoli_NC_008253.fna ./

Second, find an appropriate samtools and bcftools packages in the module system and load them into your environment. Record the commands for this step for your job.slurm script.

Tip

This was tested with samtools v1.20 and bcftools v1.21

Then, adapt the following steps be replacing the [ALL_CAPS_FILENAME_PLACEHOLDERS] with appropriate names for this exercise and put place them into your job.slurm script:

Step 4a: First convert aligned reads from sam format to bam format

samtools view -b -S -o [BINARY_ALIGNMENT_MAP] [ORIGINAL_SEQUENCE_ALIGNMENT]

Step 4b: Sort the bam file

samtools sort -o [SORTED_BINARY_ALIGNMENT_MAP] [BINARY_ALIGNMENT_MAP]

Step 4c: Index the sorted bam file (sorting and indexing enables fast, efficient access)

samtools index [SORTED_BINARY_ALIGNMENT_MAP]

Step 4d: Identify genomic variants

bcftools mpileup -f [REFERENCE_GENOME] -o [VARIANT_CALL_FILE] [SORTED_BINARY_ALIGNMENT_MAP]

Note

Typical extensions for the file formats above are:

  • sequence alignment map = .sam

  • binary alignment map = .bam

  • variant call file = .vcf

The job should take less than 10 minutes on one node. Submit the job, monitor the job status in the queue, and look for the results. Make sure you understand which file was generated at each step and what it represents. Make sure you can identify whether this job finished successfully or if ther were any errors.

BONUS

Visualize the genome in an idev session by performing the following:

samtools tview [SORTED_BINARY_ALIGNMENT_MAP] [REFERENCE_GENOME]

Additional Resources