Compare Performance of Variant Callers

VAR-Seq Workflow

  1. Read preprocessing: filtering, quality trimming
  2. Alignments
  3. Alignment statistics
  4. Variant calling: focus of challenge project
  5. Variant filtering
  6. Variant annotation
  7. Combine results from many samples
  8. Summary statistics of samples

Challenge Project: Performance Comparisons of Variant Callers

  • Run the workflow from start to finish (steps 1-8) on the VAR-Seq data set from Lu et al (2012).
  • Challenge project tasks
    • Compare the performance of at least two variant callers, e.g. GATK, BCFtools, Octopus and DeepVariant (Barbitoff et al 2022; Cooke et al 2021; DePristo et al 2011; Li 2011; Poplin et al 2018). Include the following analysis and visualization steps in your comparison:
      1. Report unique and common variants identified by tested variant callers.
      2. Compare the results from step 1 with the variants identified by Lu et al (2012).
      3. Plot the results from steps 1 and 2 as Venn diagrams or similar (e.g. UpSet plots).
      4. If there is enough time and interest, plot the performance of the variant callers in the form of ROC plots and calculate AUC values. As pseudo ground truth, one can either use the published variants or the union of the variants identified by all methods.
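
The comparison steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the function names (variant_keys, compare_callers, roc_auc) are hypothetical, variants are matched by exact CHROM/POS/REF/ALT identity, and the ROC curve is computed only over the calls a caller actually made, each labelled true or false against the chosen pseudo ground truth and ranked by a score such as the VCF QUAL field.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from uncompressed VCF text lines.

    Assumption: exact-match keys; multi-allelic records are split per ALT allele.
    """
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def compare_callers(calls):
    """Given {caller_name: set_of_keys}, return (common, unique).

    common: variants reported by every caller.
    unique: {caller_name: variants reported by that caller only}.
    """
    common = set.intersection(*calls.values())
    unique = {}
    for name, keys in calls.items():
        others = set().union(*(k for n, k in calls.items() if n != name))
        unique[name] = keys - others
    return common, unique


def roc_auc(scores, labels):
    """Return (roc_points, auc) for scored calls with True/False labels.

    Calls are ranked by descending score; tied scores form one ROC step.
    AUC is computed with the trapezoidal rule.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for _, is_true in pairs if is_true)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one true and one false call")
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # consume all calls sharing the current score
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc
```

To use the sketch, read each caller's VCF into a key set with variant_keys, pass the sets to compare_callers for steps 1 and 2 (the published variants can be added as one more named set), and feed the resulting set sizes to a Venn or UpSet plotting tool. Because different callers may represent the same indel differently, normalizing the VCFs first (for example with bcftools norm) is advisable before exact-match comparison.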

References

  • Barbitoff YA, Abasov R, Tvorogova VE, Glotov AS, Predeus AV (2022) Systematic benchmark of state-of-the-art variant calling pipelines identifies major factors affecting accuracy of coding sequence variant discovery. BMC Genomics 23: 155.
  • Cooke DP, Wedge DC, Lunter G (2021) A unified haplotype-based method for accurate and comprehensive variant calling. Nat Biotechnol 39: 885–892.
  • DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43: 491–498.
  • Li H (2011) A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27: 2987–2993.
  • Lu P, Han X, Qi J, Yang J, Wijeratne AJ, Li T, Ma H (2012) Analysis of Arabidopsis genome-wide variations before and after meiosis and meiotic recombination by resequencing Landsberg erecta and all four products of a single meiosis. Genome Res 22: 508–518.
  • Poplin R, Chang P-C, Alexander D, Schwartz S, Colthurst T, Ku A, Newburger D, Dijamco J, Nguyen N, Afshar PT, et al (2018) A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol 36: 983–987.
Last modified 2024-05-27: some edits (1c417dd70)