Outline:

Basic NGS Data Analysis

🎯 Module Introduction:

In this module, you will be introduced to the fundamental principles of analyzing Next-Generation Sequencing (NGS) data. The course begins with an overview of various NGS technologies and the types of data they generate. Following this, you will learn how to identify Single Nucleotide Polymorphisms (SNPs) within resequencing datasets. The module also covers essential quality control measures, techniques for filtering erroneous data, and basic statistical analyses to interpret the biological significance of the identified SNPs.

.

Section 1: Variant Calling

🏁 Learning Objectives:

  1. To develop proficiency in utilizing the Genome Analysis Toolkit (GATK) workflow for variant calling
  2. To understand the process of preparing raw data to meet the specifications of the GATK workflow
  3. Apply knowledge gained from GATK variant calling workflow to effectively identify and characterize genetic variants, including single nucleotide polymorphisms (SNPs) and small insertions/deletions (indels)

Variant Calling

Tools:

  1. Burrows-Wheeler Aligner (BWA)
    A software package for mapping low-divergent sequences against a large reference genome
  2. Genome Analysis Toolkit (GATK)
    Offers a wide variety of tools with primary focus on variant discovery and genotyping
  3. FastP
    A tool designed to provide fast all-in-one preprocessing for FastQ files
  4. FastQC
    A quality control tool for high throughput sequence data
  5. Picard
    A set of command line tools for manipulating high-throughput sequencing data
  6. SAMtools
    Tool that provides various utilities for manipulating alignments in the SAM and BAM formats

Input Files:

  1. Raw reads
    • Format: FASTQ (.fq, .fastq, .fastq.gz)
    • FASTQ: Used to store both the nucleotide sequence data and quality scores
      • 1st line: Sequence identifier
      • 2nd line: Sequence
      • 3rd line: Quality score identifier
      • 4th line: Base quality scores
  2. Reference
    1. Format:FASTA (.fasta, .fa, .fsa)
    2. FASTA: A text-based format for representing collections of nucleotide or protein sequences. The file consists of several blocks, each containing a header line and sequence lines. Typically, each block contains a sequence of a whole chromosome or a contig.
      • Line 1: A header line starting with a > character, which contains information about the sequence.
      • Line 2: Sequence (can occupy several lines)

Intermediate Files:

  1. Sequence Alignment/Map Format (SAM)
    • a. Format for storing large nucleotide sequence alignments
    • b. A .sam file consists of two sections:
      • Header (lines starting with @): Contains metadata about the reference genome and alignment process.
      • Alignment section: Contains one line per read, describing its alignment to the reference.

      Example .sam content:

      @HD	VN:1.6	SO:coordinate
      @SQ	SN:chr1	LN:248956422
      @PG	ID:bwa	PN:bwa	VN:0.7.17-r1188
      read123	0	chr1	1000	60	100M	*	0	0	ACGTACGT...	TGCAACGT...
      read456	16	chr1	1050	60	100M	*	0	0	TGCAGTCA...	ACGTTGCA...
      

      Explanation of fields in alignment lines (tab-delimited):

      1. QNAME: Query name (e.g., read ID)
      2. FLAG: Bitwise flag describing the read (e.g., strand, pairing)
      3. RNAME: Reference sequence name (e.g., chr1)
      4. POS: 1-based position where alignment starts
      5. MAPQ: Mapping quality (0–60, higher is better)
      6. CIGAR: CIGAR string describing alignment (e.g., 100M = 100 matched bases)
      7. RNEXT: Next read's reference name (for paired-end)
      8. PNEXT: Position of the next read (paired-end)
      9. TLEN: Template length
      10. SEQ: Read sequence
      11. QUAL: Quality scores (ASCII-encoded)
  2. Binary Alignment Map (BAM)
    1. Binary version of the SAM format
    2. To view the file, use:
    3. samtools view [BAMfile]
    4. Specification document (section 4)
  3. BAM Index files
    1. samtools index – indexes SAM/BAM/CRAM files
    2. Syntax:
    3. samtools index [options] [BAMfile]

Output File:

  1. Variant Call Format (VCF)
    1. A file format used to store genetic variants, including their genomic position, reference and alternate alleles, quality scores, and sample-specific genotype data . It only lists positions where a variant is present.
    2. Specification document
    3. VCF format example image

  2. Genomic Variant Call Format (gVCF)
    1. A kind of VCF that includes both variant and non-variant regions. It provides information about regions where no variation was found, which helps in later combining data from multiple samples.
    2. VCF format example image

Section 2: Variant Calling Pipeline

Description

This is a pipeline for variant calling based on best practices for GATK4

Requirements and Preparation

  1. Ubuntu installation on your laptop/workstation
  2. Installation of the necessary tools
    1. BWA, FastP, FastQC, GATK4, Picard, Samtools
  3. Demo dataset
    1. Reference:Ref.fasta
    2. Data:sample-1_1.fq.gz sample-1_2.fq.gz

Pipeline

Reference Data Preparation (Indexing)

Tools: bwa index, samtools faidx, picard CreateSequenceDictionary

Purpose
  • Indexing improves the ability of tools to rapidly access random locations of the file. It allows efficient retrieval of sequence data from the FASTA file.

Commands

The following commands are essential for preparing the reference genome and supporting tools for alignment and variant calling:

BWA Index

bwa index [Reference.fasta]

bwa index: This command generates an index for the reference genome. The index allows the aligner (like bwa) to quickly locate matching sequences during alignment, making the process more efficient.

Output Files: Several index files are generated in the same directory as the FASTA:

  • Ref.fasta.bwt
  • Ref.fasta.pac
  • Ref.fasta.ann
  • Ref.fasta.amb
  • Ref.fasta.sa

File descriptions:

  • .bwt – Burrows-Wheeler Transform of the reference
  • .pac – Packed version of the reference sequence
  • .ann – Annotation file with contig names and lengths
  • .amb – Information about ambiguous bases (Ns)
  • .sa – Suffix array used for fast search

These files allow bwa to perform fast sequence lookups during read alignment.

Samtools faidx

samtools faidx [Reference.fasta]

samtools faidx: This command creates an index file (.fai) for the reference FASTA file. This index is required by tools like GATK to quickly access specific sequences from the reference genome during operations like variant calling.

Output: Ref.fasta.fai

The .fai file is a tab-delimited index

This index is required by GATK and other tools to reference genome positions directly and efficiently.

Example .fai content:

chr1	248956422	52		60	61
 chr2	242193529	253404903	60	61
 chr3	198295559	500657153	60	61

Columns:

  1. Sequence name (e.g., chr1)
  2. Sequence length in bases
  3. Byte offset of the first base in the FASTA file
  4. Line length of sequence lines in the FASTA
  5. Line width including newline characters

Picard CreateSequenceDictionary

java -jar $PICARD CreateSequenceDictionary -REFERENCE [Reference.fasta] -OUTPUT [Reference.dict]

picard CreateSequenceDictionary: This command generates a sequence dictionary file (.dict) from the reference FASTA file. This file is required by tools like GATK for proper handling of BAM files and ensures that genomic sequences can be accessed correctly during analysis.

Output: Ref.dict

The .dict file is a text file containing metadata for each sequence (contig) in the FASTA, such as its name, length, and MD5 checksum, allowing tools to validate and synchronize reference data across workflows.

Example .dict content:

@HD	VN:1.6	SO:coordinate
 @SQ	SN:chr1	LN:248956422	M5:6aef897c3d6ff0c78aff06ac189178dd
 @SQ	SN:chr2	LN:242193529	M5:f98db672eb0993dcfdabafe2a8824adb
 @SQ	SN:chr3	LN:198295559	M5:3e8bb77c8e8e9d1217c335d58efb8c09

Raw Read Preparation

  1. Quality Check

    Tool: FastQC Analysis Modules

  2. Purpose
    • To ensure that subsequent analyses are based on reliable data.

    Syntax
    fastqc [in1.fq] [in2.fq] [...inN.fq]
    Arguments:
    • in1: First input sequence file
    • in2: Second input sequence file
    • ...inN: Additional input files for batch QC
    Command
    fastqc Data/sample-1_1.fq.gz Data/sample-1_2.fq.gz
    Output
    • FastQC Report - identify potential issues such as adapter contamination, overrepresented sequences, and base quality distribution.

  3. Cleaning

    Tool:FastP

    Functionality
    • A fast all-in-one preprocessing tool for removing adapters, filtering, and trimming raw sequencing data

    Syntax
    fastp -i [in1.fq] -I [in2.fq] -o [out1.fq] -O [out2.fq]
    Arguments:
    • -i: Input file for Read 1
    • -I: Input file for Read 2
    • -o: Output cleaned Read 1
    • -O: Output cleaned Read 2
    Command
    fastp -i Data/sample-1_1.fq.gz -I Data/sample-1_2.fq.gz -o Data/sample-1_1-cleaned.fq.gz -O Data/sample-1_2-cleaned.fq.gz
    Output
    • Cleaned reads ready for alignment(*-cleaned.fq.gz)

Alignment to Reference Genome

Tool:bwa mem

    Purpose
    • Maps sequencing reads to a reference genome to identify variation locations, enabling accurate variant calling.

    Syntax
    • bwa mem arguments [Reference.fasta] [in1.fq] [in2.fq] > [aln.sam]
      Arguments:
      • -t INT: number of threads (e.g., -t 8)
      Command
      bwa mem -t 8 Ref/Ref.fasta Data/sample-1_1-cleaned.fq.gz Data/sample-1_2-cleaned.fq.gz > Output/sample.sam
    Output
    • Sequence alignment data / alignment output (SAM format)

Sorting

Tool:SortSam

    Purpose
    1. To organize reads by their genomic position for efficient variant calling.
    2. To reduce file size and enable faster processing of files.
    Syntax
    java -jar $GATK SortSam --INPUT [aln.sam] --OUTPUT [sorted.bam] -SORT_ORDER [coordinate]
    Arguments:
    • --INPUT, -I: Input BAM or SAM file to sort. Required
    • --OUTPUT, -O: Sorted BAM or SAM output file. Required
    • --SORT_ORDER, -SO: Sort order of the output file.
      • coordinate: Sorts primarily according to the SEQ and POS fields of the record.
    Command
    java -jar $GATK SortSam --INPUT Output/sample.sam --OUTPUT Output/sample.sorted.bam -SORT_ORDER coordinate
    Output
    • Sorted aligned reads (.sorted.bam)

Fix mate information

Tool:FixMateInformation

    Purpose
    • To ensure that the paired-end sequencing reads have consistent and correct mate information.
    Syntax
    java -jar $GATK FixMateInformation --INPUT [sorted.bam] --OUTPUT [fxmt.bam] --CREATE_INDEX [TRUE] --VALIDATION_STRINGENCY [LENIENT]
    Arguments:
    • --INPUT, -I: The input files to check and fix. Required
    • --OUTPUT, -O: The output file to write to. Required
    • --CREATE_INDEX: Whether to create an index when writing VCF or coordinate sorted BAM output.
    • --VALIDATION_STRINGENCY: Validation stringency for all SAM files read by this program.
      • LENIENT: Emit warnings but keep going if possible.
    Command
    java -jar $GATK FixMateInformation --INPUT Output/sample.sorted.bam --OUTPUT Output/sample.fxmt.bam --CREATE_INDEX TRUE --VALIDATION_STRINGENCY LENIENT
    Output
    • Mate-fixed aligned reads (.fxmt.bam) and an index (.fxmt.bai).

Add Read Group Information

Tool:AddOrReplaceReadGroups

    Purpose
    1. To distinguish sequencing data sources and ensure accurate variant calling and analysis.
    2. To read more about "read groups", please see this link.
    Syntax
    java -jar $GATK AddOrReplaceReadGroups --INPUT [fxmt.bam] --OUTPUT [addrep.bam] --RGID [ID] --RGPU [Unit] --RGLB [Library] -PL [Platform] -SM [SampleName] -CN [CenterName] --VALIDATION_STRINGENCY [LENIENT] --CREATE_INDEX [TRUE]
    Arguments:
    • --INPUT, -I: Input file (BAM or SAM or a GA4GH url). Required
    • --OUTPUT, -O: Output file (BAM or SAM). Required
    • --RGLB, -LB: Read-Group library. Required
    • --RGPL, -PL: Read-Group platform (e.g., ILLUMINA). Required
    • --RGPU, -PU: Read-Group platform unit (e.g., run barcode). Required
    • --RGSM, -SM: Read-Group sample name. Required
    • --RGCN, -CN: Read-Group sequencing center name. Default: null
    • --RGID, -ID: Read-Group ID. Default: 1
    Command
    java -jar $GATK AddOrReplaceReadGroups --INPUT Output/sample.fxmt.bam --OUTPUT Output/sample.addrep.bam --RGID run100 --RGPU unit1 --RGLB lib1 -PL Illumina -SM sample -CN CN --VALIDATION_STRINGENCY LENIENT --CREATE_INDEX TRUE
    Output
    • Aligned reads with updated read group information (.addrep.bam) and an index (.addrep.bai).

Mark Duplicates

Tool:MarkDuplicates

    Purpose
    • To avoid bias and inaccuracies.
    Syntax
    java -jar $GATK MarkDuplicates --INPUT [addrep.bam] --OUTPUT [mkdup.bam] --CREATE_INDEX [TRUE] --VALIDATION_STRINGENCY [LENIENT] --METRICS_FILE [mkdup.metrics]
    Arguments:
    • --INPUT, -I: Input SAM or BAM files to analyze. Must be coordinate sorted. Required
    • --OUTPUT, -O: Output file with marked duplicates. Required
    • --METRICS_FILE, -M: File to write duplication metrics to. Required
    Command
    java -jar $GATK MarkDuplicates --INPUT Output/sample.addrep.bam --OUTPUT Output/sample.mkdup.bam --VALIDATION_STRINGENCY LENIENT --CREATE_INDEX TRUE --METRICS_FILE Output/sample.mkdup.metrics
    Output
    • Duplicate-marked alignment file (.mkdup.bam)
      Index (.mkdup.bai)
      Duplication metrics file (.mkdup.metrics)

Merging BAM files (Optional)

Tool:samtools merge

Purpose
  • Merging (if there are multiple read pairs per sample)
Syntax
samtools merge [out.bam] [in1.bam] [in2.bam]

Do not forget to index the merged BAM file. Use samtools index [out.bam]

Output
  • Merged BAM file

Variant Calling

Tool:HaplotypeCaller

    Functionality
    • Call SNPs and indels from aligned sequencing reads.
    Syntax
    java -jar $GATK HaplotypeCaller -R [Reference.fasta] -I [mkdup.bam] -O [file.g.vcf] -ERC [GVCF] --bam-output [file.bamout.bam]
    Arguments:
    • --input, -I: BAM/SAM/CRAM file containing reads. Required
    • --output, -O: File to which variants should be written. Required
    • --reference, -R: Reference sequence file. Required
    • --emit-ref-confidence, -ERC: Mode for emitting reference confidence scores.
      Default: NONE | Options: NONE, BP_RESOLUTION, GVCF
    Command
    java -jar $GATK HaplotypeCaller \
    -R Ref/Ref.fasta \
    -I Output/sample.mkdup.bam \
    -O Output/sample.g.vcf \
    -ERC GVCF \
    --bam-output Output/sample.bamout.bam
    Output
    • A set of raw variant calls, including SNPs and indels (.g.vcf)
      GVCF index (.g.vcf.idx)
      BAM file bamout.bam
      BAM file index bamout.bai

Merging GATK GVCFs

Tool:CombineGVCFS

    Purpose
    1. SNP calling produces multiple gVCF files – each file has variant for one sample.
    2. To analyze the data, it is necessary to merge the individual gVCF files.
    Syntax
    java -jar $GATK CombineGVCFS -R [Reference.fasta] -V [in1.g.vcf] -V [in2.g.vcf] -O [merged.g.vcf]
    Arguments:
    • --variant, -V: One or more VCF files containing variants. Required
    • --output, -O: File to which variants should be written. Required
    • --reference, -R: Reference sequence file. Required
    Output
    • The combined GVCF

Genotyping

Tool:GenotypeGVCFs

Purpose
  • Produces genotype calls, filters out non-variant positions
Syntax
java -jar $GATK GenotypeGVCFs -R [Reference.fasta] -V [input.g.vcf] -O [output.vcf]
Arguments:
  • --variant, -V: A VCF file containing variants. Required
  • --output, -O: File to which variants should be written. Required
  • --reference, -R: Reference sequence file. Required
Output
  • The combined, genotyped VCF

VCF Exploration

Motivation

VCF File: Structure


Here is an example VCF, colored using bioSyntax tool.

Lines 1-29 make up the Header Section, Lines 30 and below comprise the Data section Finally in the bottom we see data section (only variant related columns, because the INFO column takes a lot of horizontal space). The actual sample genotype data are in columns starting from the 10th (not shown here).

VCF File: Section

VCF file consists of two parts: header and data section.


QC Annotations

QUAL

QUAL is usually defined as: Phred-scaled probability of having no variant at this position

higher QUAL = lower prob of no variant = higher prob of real variant


QUAL P(no variant) -logP
10 0.1 1
20 0.01 2
30 0.001 3
40 0.0001 4

Phred scaled means take log10 and multiply by -10:


\(QUAL = (βˆ’log P(no variant)) * 10\)



Question: How bad/good is QUAL=30?

Read Depth (DP)


Note : DP shows depth AFTER filtering by UG or HC algorithm. It is not the same as BAM read depth.

Quality-by-Depth (QD)

MQ and MQ0


Recommended minimum MQ = S40

Allele Frequency Annotations

AN, AC, AF


Filtering on Allele Frequency

Base Calling Errors

Alignment Errors

Heterozygosity



Allele frequency vs heterozygosity in 3K data

Figure 1



Figure 2



Figure 3



SNP heterozygosity along genome

Figure 1



Filtering: based on heterozygosity



Filtering: based on Hardy-Weinberg

Why the standard Hardy-Weinberg can be inappropriate


Modified Hardy-Weinberg

Take-away

Variant Calling Pipeline

Description

This is a pipeline for variant calling based on best practices for GATK4

Requirements and Preparation

  1. Ubuntu installation on your laptop/workstation
  2. Installation of the necessary tools
    1. BWA, FastP, FastQC, GATK4, Picard, Samtools
  3. Demo dataset
    1. Reference:Ref.fasta
    2. Data:sample-1_1.fq.gz sample-1_2.fq.gz

Pipeline

Data Preparation and Import

  1. Unpack the sample dataset archive.
  2. Command
    • tar -xzvf sample.tar.gz
    • Extracts the sample.tar.gz archive

    Output
    • sample β”œβ”€β”€ Data β”‚Β Β  β”œβ”€β”€ sample-1_1.fq.gz β”‚Β Β  └── sample-1_2.fq.gz └── Ref └── Ref.fasta 3 directories, 3 files

  3. Create Output directory.
  4. Command
    • cd sample

    • mkdir Output
    • Changes the current directory to sample and creates a new directory named Output
      inside it.

    Output

Reference Data Preparation (Indexing)

  1. BWA Index

  2. Syntax
    bwa index [Reference.fasta]
    Command
    • bwa index Ref/Ref.fasta
    • Indexes the Ref.fasta file using the BWA tool

    Output
    • Files:
    • Ref.fasta.bwt
      Ref.fasta.pac
      Ref.fasta.ann
      Ref.fasta.amb
      Ref.fasta.sa

  3. Samtools faidx

  4. Syntax
    samtools faidx [Reference.fasta]
    Command
    • samtools faidx Ref/Ref.fasta
    • Indexes the Ref.fasta file using samtools

    Output
    • File: Ref.fasta.fai
  5. Picard CreateSequenceDictionary

  6. Syntax
    java -jar $PICARD CreateSequenceDictionary -REFERENCE [Reference.fasta] -OUTPUT [Reference.dict]
    Command
    • java -jar $PICARD CreateSequenceDictionary -REFERENCE Ref/Ref.fasta -OUTPUT Ref/Ref.dict
    • Creates a sequence dictionary for Ref.fasta

    Output
    • File: Ref.dict

Raw Read Preparation

  1. Quality Check

  2. Syntax
    fastqc [in1.fq] [in2.fq] [...inN.fq]
    Command
    • fastqc Data/sample-1_1.fq.gz Data/sample-1_2.fq.gz
    • Runs FastQC analysis on files sample-1_1.fq.gz and sample-1_2.fq.gz

    Output
    • File: FastQC reports
  3. Cleaning

  4. Syntax
    fastp -i [in1.fq] -I [in2.fq] -o [out1.fq] -O [out2.fq]
    Command
    • fastp -i Data/sample-1_1.fq.gz -I Data/sample-1_2.fq.gz -o Data/sample-1_1-cleaned.fq.gz -O Data/sample-1_2-cleaned.fq.gz
    • Clean reads sample-1_1.fq.gz and sample-1_2.fq.gz

    Output
    • File: *-cleaned.fq.gz

Alignment to Reference Genome

Syntax
bwa mem arguments [Reference.fasta] [in1.fq] [in2.fq] > [aln.sam]
Command
  • bwa mem -t 8 Ref/Ref.fasta Data/sample-1_1-cleaned.fq.gz Data/sample-1_2-cleaned.fq.gz > Output/sample.sam
  • Align cleaned reads sample-1_1-cleaned.fq.gz and sample-1_2-cleaned.fq.gz
    to a reference genome Ref.fasta

Output
  • File: sample.sam

Sorting

Tool:SortSam

Syntax
java -jar $GATK SortSam --INPUT [aln.sam] --OUTPUT [sorted.bam] -SORT_ORDER [coordinate]
Command
  • java -jar $GATK SortSam --INPUT Output/sample.sam --OUTPUT Output/sample.sorted.bam -SORT_ORDER coordinate
  • Sorts sample.sam file by coordinate.

Output
  • File: sample.sorted.bam

Fix mate information

Tool:FixMateInformation

Syntax
java -jar $GATK FixMateInformation --INPUT [sorted.bam] --OUTPUT [fxmt.bam] --CREATE_INDEX [TRUE] --VALIDATION_STRINGENCY [LENIENT]
Command
  • java -jar $GATK FixMateInformation --INPUT Output/sample.sorted.bam --OUTPUT Output/sample.fxmt.bam --CREATE_INDEX TRUE --VALIDATION_STRINGENCY LENIENT
  • Fixes mate information in a sorted BAM file sample.sorted.bam .

Output
  • Files:
  • sample.fxmt.bam
    sample.fxmt.bai

Add Read Group Information

Tool:AddOrReplaceReadGroups

Syntax
java -jar $GATK AddOrReplaceReadGroups --INPUT [fxmt.bam] --OUTPUT [addrep.bam] --RGID [ID] --RGPU [Unit] --RGLB [Library] -PL [Platform] -SM [SampleName] -CN [CenterName] --VALIDATION_STRINGENCY [LENIENT] --CREATE_INDEX [TRUE]
Command
  • java -jar $GATK AddOrReplaceReadGroups --INPUT Output/sample.fxmt.bam --OUTPUT Output/sample.addrep.bam --RGID run100 --RGPU unit1 --RGLB lib1 -PL Illumina -SM sample -CN CN --VALIDATION_STRINGENCY LENIENT --CREATE_INDEX TRUE
  • Adds read group information in sample.fxmt.bam file.

Output
  • Files:
  • sample.addrep.bam
    sample.addrep.bai

Mark Duplicates

Tool:MarkDuplicates

Syntax
java -jar $GATK MarkDuplicates --INPUT [addrep.bam] --OUTPUT [mkdup.bam] --CREATE_INDEX [TRUE] --VALIDATION_STRINGENCY [LENIENT] --METRICS_FILE [mkdup.metrics]
Command
  • java -jar $GATK MarkDuplicates --INPUT Output/sample.addrep.bam --OUTPUT Output/sample.mkdup.bam --VALIDATION_STRINGENCY LENIENT --CREATE_INDEX TRUE --METRICS_FILE Output/sample.mkdup.metrics
  • Identifies and marks duplicate reads in sample.addrep.bam file.

Output
  • Files:
  • sample.mkdup.bam
    sample.mkdup.bai
    sample.mkdup.metrics

Merging BAM Files (Optional)

Tool:samtools merge

Syntax
samtools merge [out.bam] [in1.bam] [in2.bam]
Command
  • samtools merge SB4-1A_SHB134-11-44.mkdup.bam \ ./V300088030.4/Output/V300088030.4.mkdup.bam \ ./V300081906.4/Output/V300081906.4.mkdup.bam \ ./V300082057.1/Output/V300082057.1.mkdup.bam

Variant Calling

Tool:HaplotypeCaller

Syntax
java -jar $GATK HaplotypeCaller -R [Reference.fasta] -I [mkdup.bam] -O [file.g.vcf] -ERC [GVCF] --bam-output [file.bamout.bam]
Command
  • java -jar $GATK HaplotypeCaller \
    -R Ref/Ref.fasta \
    -I Output/sample.mkdup.bam \
    -O Output/sample.g.vcf \
    -ERC GVCF \
    --bam-output Output/sample.bamout.bam
  • Performs variant calling on a BAM file sample.mkdup.bam against a reference genome
    Ref.fasta

Output
  • Files:
  • sample.g.vcf
    sample.g.vcf.idx
    sample.bamout.bam
    sample.bamout.bai

Merging GATK GVCFs

Tool:CombineGVCFS

Dataset
Syntax
java -jar $GATK CombineGVCFS -R [Reference.fasta] -V [in1.g.vcf] -V [in2.g.vcf] -O [merged.g.vcf]
Command
  • java -jar $GATK CombineGVCFs \
    -R Ref/chr01.fa \
    -V gvcfs/sample1.g.vcf \
    -V gvcfs/sample2.g.vcf \
    -V gvcfs/sample3.g.vcf \
    -O Output/combined.g.vcf
Output
  • File: Combined GVCFs

Genotyping GVCFs

Tool:GenotypeGVCFs

Syntax
java -jar $GATK GenotypeGVCFs -R [Reference.fasta] -V [input.g.vcf] -O [output.vcf]
Command
  • java -jar $GATK GenotypeGVCFs -R Ref/chr01.fa -V Output/combined.g.vcf -O Output/combined.vcf
Output
  • File: Merged VCF

Input and Output Files

Input Files

File Type File Name
VCF geno-250kb.vcf.gz
gVCF sample1-50kb.g.vcf
BAM kanin-4.sorted.bam


Output Files

Output Type Description
Statistics General summary metrics and counts
Tables Structured result tables for further analysis
Plots Visual representations such as charts and graphs

Getting to know the tool

Installing BCFtools - please follow the Software Installation Guide

  1. To get help directly within the terminal, you can run the command without arguments:
  2. Usage: bcftools [--version|--version-only] [--help] [command] [argument]

    • bcftools #
    • bcftools -h
    • bcftools view
  3. Most used commands:
    • bcftools # : with no arguments: general info and list of commands
    • bcftools query: transform VCF/BCF into user-defined formats
    • bcftools view: VCF/BCF conversion, view, subset and filter VCF/BCF files

    * Adding information to VCF

    • bcftools annotate: annotate and edit VCF/BCF files. E.g. add variant IDs
    • bcftools csq: call variation consequences (like snpEff)

    * Manipulating VCF files

    • bcftools isec: convert VCF/BCF files to different formats and back
    • bcftools reheader: modify VCF/BCF header, change sample names
    • bcftools sort: sort VCF/BCF file
    • bcftools index: needed for random access into VCF file

Initial getting to know the data

So, you got a fresh VCF file from the variant calling pipeline. It may be a gVCF file or a simple VCF file.

Question: What is the difference?

VCF Header

A lot of information about how the VCF was produced and what kind of information may be found in it, is located in the header. To see the header you may run:

bcftools view --header-only geno-250kb.vcf.gz


Based on the information in the header, please do the following exercise.

Hands-on exercise

For both input VCF files, extract the headers and answer the questions:

  1. Find which commands produced the original VCF file (GATKCommandLine).
  2. Find which commands were applied to the original VCF file to create the current file.
  3. Find names of chromosomes or contigs that are present in this (g)VCF.
  4. Does this VCF contain InbreedingCoefficient metrics?
  5. Does this VCF contain QD metrics? What is its type (number, string, etc)?
Overall Stats

To get an overall report containing many basic statistics, run bcftools stats

bcftools stats geno-250kb.vcf.gz > stats.txt
Exercise

Based on the stats file, find out how many of the following are in the file:

BCFtools query command

The main element in the structure of the command is the query string (or format string) that specifies what information you want to get. For each record in the VCF the query string will be evaluated and added as a row in the output

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz
Extract variant QC metrics

To get various QC metrics, include the names of the metrics with the % sign prefix into the query string. There are some metrics, like read depth DP, that are applicable to both variant as a whole or to each sample. To distinguish between these, use prefixes INFO and FORMAT


Example: some variant level QC metrics

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\t%QD\t%INFO/DP\n' file.vcf.gz > variant-DP.txt

For comparison, here is an example of read depths of each sample:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%FORMAT/DP]\n' file.vcf.gz

Extract variant genotype matrix as a table

Let us create a file by first putting row names, then dumping all the genotype calls:

bcftools query -f 'CHROM\tPOS\tREF\tALT[\t%SAMPLE]\n' geno-250kb.vcf.gz | head -n 1 > out.txt


bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' geno-250kb.vcf.gz >> out.txt


S

Note in the first command we don't use % except in the %SAMPLE - thus, only sample names will be interpolated into the string.

Exploring alignment (BAM) files

We will use SAMtools software to retrieve basic information about the alignment data from BAM files. In order to make sense of some of the following, it would help to be familiar with SAM file structure, which you can read about in SAM format specifications. See also the SAM format wiki page for easier reading.

The following SAMtools commands will be used:

View header and alignments

To view header:


To view alignments in a more familiar environment of MS Excel, you may use samtools view and save the result as a TSV file, which you can then view in Excel by importing as a tab separated file

echo "QUERYNAME,FLAG,REFNAME,POS,MAPQ,CIGAR,RNEXT,POSNEXT,TLEN,SEQ,QUAL" | tr "," "\t" > aln.tsv


samtools view kanin-4.sorted.bam | head -n 20 >> aln.tsv


Note: changing tabs to commas in the whole file using the tr command might not be a very good idea, since the base quality string may contain commas.

Understanding Alignment Metadata

A lot of information about alignment of a read pair to reference is contained in the flags column. Each flags is a boolean true or false value encoded in a bit of the flags byte. See Bitwise flags

To easily retrieve the flags pertaining to a certain read alignment, e.g. the first read in the input BAM file is

FCD0R2NACXX:7:2105:6385:19577#CCACATTC_24       163     Chr7    272768  0       83M     =       273143  458     AGTTGTTTCCTCCCGCCGTGGCCCACGCTTTATCCGCACTGCGCGGAAGCGTGTTGTTGGTTCTTCCACCGTGTTCCGTTTCC  BFHHHHJJIJIJJJIJJJGHJJJIIIIIJJJIIIIJIJJJJGHEDD@BDDDB@BDDDDDDDDDDDDCCDDDDDDDDDDDDDCD  NM:i:1  MD:Z:5C77       MC:Z:83M        AS:i:78 XS:i:78

Its flag column shows value 163

To understand meanings of these abbreviations, start with running samtools flags without arguments

samtools flags


About: Convert between textual and numeric flag representation
Usage: samtools flags FLAGS...

Each FLAGS argument is either an INT (in decimal/hexadecimal/octal) representing
a combination of the following numeric flag values, or a comma-separated string
NAME,...,NAME representing a combination of the following flag names:

   0x1     1  PAIRED         paired-end / multiple-segment sequencing technology
   0x2     2  PROPER_PAIR    each segment properly aligned according to aligner
   0x4     4  UNMAP          segment unmapped
   0x8     8  MUNMAP         next segment in the template unmapped
  0x10    16  REVERSE        SEQ is reverse complemented
  0x20    32  MREVERSE       SEQ of next segment in template is rev.complemented
  0x40    64  READ1          the first segment in the template
  0x80   128  READ2          the last segment in the template
 0x100   256  SECONDARY      secondary alignment
 0x200   512  QCFAIL         not passing quality controls or other filters
 0x400  1024  DUP            PCR or optical duplicate
 0x800  2048  SUPPLEMENTARY  supplementary alignment

The next read is
FCD0R2NACXX:7:2105:6385:19577#CCACATTC_24       83      Chr7    273143  0       83M     =       272768  -458    TCGGCGTAGCTCTGTGACTTGTCCACACCCATTGTATCAGGTGTTTTTATGATTTGAAAAATCCAGTTTATGCCTAGTGTGTG  BDDDBDEEEEEFFFDFFFHHHHJIHHHJJIJGIJJJJJJJJJJJJJJJJJJJJJJJJJJJIIJJJJJJIJJJJJJJJHHHHHF  NM:i:1  MD:Z:30G52 
with the flag 83

Interactive text Viewer

First, make sure the file is indexed

samtools index kanin-4.sorted.bam


Then (No need to pipe to less command!)
samtools tview kanin-4.sorted.bam -p Chr7:6011011


For a more feature-full experience, use a GUI alignment viewer such as IGV.

Module Summary

Understanding the genetic basis of diversity through the discovery of Single Nucleotide Polymorphisms (SNPs) via variant calling is a fundamental step in gene discovery. By identifying and analyzing SNPs, researchers can uncover genetic variations that contribute to differences in traits. This knowledge is critical for pinpointing specific genes associated with those traits, facilitating advancements in fields such as genetics, breeding, and biotechnology. In this module, you have gained an understanding of the Variant Calling workflow. You have explored the essential tools and software used for variant discovery, including FASTQC, BWA, Picard, SamTools, and GATK. You have also had the opportunity to perform variant calling, and you have learned how to analyze and interpret the resulting data files, such as VCF and BAM files.

References

Slatko BE, Gardner AF, Ausubel FM. Overview of Next-Generation Sequencing Technologies. Curr Protoc Mol Biol. 2018 Apr;122(1):e59. doi: 10.1002/cpmb.59. PMID: 29851291; PMCID: PMC6020069. https://doi.org/10.1002/cpmb.59