Apologies for the silent period. We are back with a talk on genome alignment.
Title: A MapReduce approach to Genome Alignment
Speaker: Karl Pullicino
Date: Tue 10th April 2018, 4-5pm
Location: ICT Faculty Building, CS seminar room 38, Block B, 1st floor.
Abstract: Recent years have brought enormous growth in DNA sequencing capacity and speed, thanks to the application of next-generation sequencing (NGS) technologies. Aligning read sequences to a given reference genome is crucial for downstream diagnostic analysis. Finding the optimal alignment of short DNA reads from a biological sample to a human reference genome requires big data techniques, since the read data are in the region of 200GB in size. In this dissertation we present three approaches to distributed sequence alignment of genomic data. The first is based on an optimization of the Smith-Waterman algorithm. The other two are based on the MapReduce programming paradigm. MR-BWA presents a novel way of distributing BWA, an industry-standard tool for aligning genomic reads, that differs from existing work. MR-BWT-FM presents low-level optimizations of suffix array and BWT creation, which are used to build a custom FM-Index, which in turn is used for distributed genome sequence alignment. The application also produces insights and charts about the results. We evaluate the performance and correctness of both approaches by comparing our output with that of similar tools, using standard datasets from the 1000 Genomes Project. Performance and correctness results for both distributed approaches are comparable with those of similar tools, whilst the final custom FM-Index is smaller than the standard BWA index. The source code of the software described in this dissertation is publicly available at
https://github.com/kpullu/msc.
Keywords: DNA; genomics; sequence alignment; suffix array; BWT; FM-Index; MapReduce; big data; cloud computing