BWA and samtools and variant calling

Here we will use the BWA aligner to map short reads to a reference genome, and then call variants (differences between the reads and the reference).

Getting started

Start up an m1.medium instance running Ubuntu 16.04 on Jetstream.

log in, and then install samtools:

  sudo apt-get -y update && \
  sudo apt-get -y install trimmomatic fastqc python-pip \
  zlib1g-dev ncurses-dev python-dev

Download data

Goal: get the sequence data!

  1. Run:

     mkdir ~/data
     cd ~/data
     curl -O

Map data

Goal: execute a basic mapping

  1. Run the following commands to install bwa:

     curl -L > bwa-0.7.15.tar.bz2
     tar xjvf bwa-0.7.15.tar.bz2
     cd bwa-0.7.15
     sudo cp bwa /usr/local/bin
     echo 'export PATH=$PATH:/usr/local/bin' >> ~/.bashrc
     source ~/.bashrc
  2. Make & change into a working directory:

     mkdir ~/work
     cd ~/work
  3. Copy and gunzip the reference:

     gunzip ecoli-rel606.fa.gz
  4. Prepare it for mapping:

     bwa index ecoli-rel606.fa
  5. Map! (This will take about 2 minutes.)

     bwa mem -t 6 ecoli-rel606.fa ~/data/SRR2584857.fq.gz > SRR2584857.sam
  6. Observe!

     head SRR2584857.sam

Visualize mapping

Goal: make it possible to go look at a specific bit of the genome.

  1. Install samtools:

     sudo apt-get -y install samtools
  2. Convert the SAM file into a BAM file that can be sorted and indexed:

     samtools view -hSbo SRR2584857.bam SRR2584857.sam
  3. Sort the BAM file by position in genome:

     samtools sort SRR2584857.bam SRR2584857.sorted
  4. Index the BAM file so that we can randomly access it quickly:

     samtools index SRR2584857.sorted.bam
  5. Visualize with tview:

     samtools tview SRR2584857.sorted.bam ecoli-rel606.fa

    tview commands of relevance:

    • left and right arrows scroll
    • q to quit
    • CTRL-h and CTRL-l do “big” scrolls
    • g ecoli:3931002 will take you to a specific location.

Call variants!

Goal: find places where the reads are systematically different from the genome.

Now we can call variants using samtools mpileup:

samtools mpileup -uD -f ecoli-rel606.fa SRR2584857.sorted.bam | \
    bcftools view -bvcg - > variants.raw.bcf

This will take a few minutes… and output a file that is not human readable! But we can quickly convert it into the ‘variant call format’ that is human readable:

bcftools view variants.raw.bcf > variants.vcf

Look at the VCF file

The output VCF file contains a list of all the variants that samtools thinks are there. What’s in it?

The official VCF specification is a great read…if you’re suffering from insomnia. Let’s skip this and just take a quick look at the file.

  1. Look at the non-commented lines along with the header:

     grep -v ^## variants.vcf

    The first five columns: CHROM POS ID REF ALT. It’s a little easier to see if you run

     grep -v ^## variants.vcf | less -S

    Use your left and right arrows to scroll, and ‘q’ to quit.

  2. Examine one of the variants with tview:

     samtools tview SRR2584857.sorted.bam ecoli-rel606.fa -p ecoli:920514

    ‘q’ to quit, left arrow to scroll a bit left.

Well, at least that variant looks real…

Look at the VCF file with bedtools.

bedtools docs

  1. Download and build bedtools:

     cd ~/
     curl -O -L
     tar -xzf bedtools-2.26.0.tar.gz
     cd bedtools2
     sudo make install
  2. Go back to work:

     cd ~/work
  3. Download a GFF3 file with annotations for E. coli:

     wget .
  4. Run bedtools intersect:

     bedtools intersect -a ecoli-rel606.gff.gz -b variants.vcf -wa -u

    Documentation for bedtools intersect

Extract reads with samtools.

  1. Execute:

     samtools view SRR2584857.sorted.bam 'ecoli:920514-920514' > out.sam
     wc -l out.sam

and this will give you the coverage of the relevant position.

Discussion points / extra things to cover

  • What are the drawbacks to mapping-based variant calling? What are the positives?
  • Where do reference genomes come from?