Suggested article: Evaluation of variant identification methods for
whole genome sequencing data in dairy cattle
Background: Advances in human genomics have allowed unprecedented productivity in terms of algorithms, software, and literature available for translating raw next-generation sequence data into high-quality information. The challenges of variant identification in organisms with lower quality reference genomes are less well documented. We explored the consequences of commonly recommended preparatory steps and the effects of single and multi sample variant identification methods using four publicly available software applications (Platypus, HaplotypeCaller, Samtools and UnifiedGenotyper) on whole genome sequence data of 65 key ancestors of Swiss dairy cattle populations. Accuracy of calling next-generation sequence variants was assessed by comparison to the same loci from medium and high-density single nucleotide variant (SNV) arrays.
Results: The total number of SNVs identified varied by software and method, with single (multi) sample results ranging from 17.7 to 22.0 (16.9 to 22.0) million variants. Computing time varied considerably between software. Preparatory realignment of insertions and deletions and subsequent base quality score recalibration had only minor effects on the number and quality of SNVs identified by different software, but increased computing time considerably. Average concordance for single (multi) sample results with high-density chip data was 58.3% (87.0%) and average genotype concordance in correctly identified SNVs was 99.2% (99.2%) across software. The average quality of SNVs identified, measured as the ratio of transitions to transversions, was higher using single sample methods than multi sample methods. A consensus approach using results of different software generally provided the highest variant quality in terms of transition / transversion ratio.
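The two quality measures named in the results can be illustrated with a minimal Python sketch. It is not the pipeline code referred to in the conclusions: the function names, the input layout (reference/alternate base pairs, and dictionaries keyed by locus) and the treatment of genotypes as unordered allele pairs are assumptions made for illustration, and the exact concordance definitions used in the study are not stated in this abstract.

```python
# Illustrative sketch only; names and data layout are hypothetical.
BASES = set("ACGT")
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snvs):
    """Return the transition / transversion ratio for (ref, alt) base pairs.

    Pairs that are not single-base substitutions are skipped.
    Returns None when no transversions are present.
    """
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # not a single nucleotide variant
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else None


def genotype_concordance(called, array):
    """Fraction of loci present in both sets whose genotypes agree.

    Both arguments map a locus key to a pair of alleles; allele order
    is ignored. Returns None when no loci are shared.
    """
    shared = called.keys() & array.keys()
    if not shared:
        return None
    matches = sum(
        1 for locus in shared
        if sorted(called[locus]) == sorted(array[locus])
    )
    return matches / len(shared)
```

For example, two transitions and one transversion give a ratio of 2.0, and two shared loci of which one agrees give a genotype concordance of 0.5.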
Conclusions: Our findings serve as a reference for variant identification pipeline development in non-human organisms and help assess the implication of preparatory steps in next-generation sequencing pipelines for organisms with incomplete reference genomes (pipeline code is included). Benchmarking this information should prove particularly useful in processing next-generation sequencing data for use in genome-wide association studies and genomic selection.