Nick Rozinsky GSoC Introduction


Николай Роз

May 15, 2019, 2:50:51 PM
to sambamba-...@googlegroups.com
Hello Sambamba community!

My name is Nick Rozinsky, and I am Pjotr Prins' GSoC student under the Open Bioinformatics Foundation.

My project aims to benchmark a column-oriented file format for storing genomic data.

The main bioinformatics file formats in use today (BAM, CRAM) store data in a row-oriented manner, which complicates and slows down the reading and processing stages. We want to address this issue radically by researching ways to store genomic data in a column-oriented manner, using the mark duplicates routine as the main performance indicator.
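To illustrate the difference, here is a minimal Python sketch. It is not Sambamba code and not the planned cBAM layout: the records are made up, each is reduced to three fields, and the helper names are hypothetical. The only real detail is the SAM flag bit 0x400, which marks a read as a duplicate. In the row-oriented version every whole record is visited just to read its flag; in the column-oriented version only the flag column is scanned.

```python
from array import array

# Hypothetical, heavily simplified alignment records: (read name, position, flag).
records = [
    ("read1", 100, 0),
    ("read2", 100, 1024),   # 0x400: duplicate
    ("read3", 250, 16),     # 0x10: reverse strand
    ("read4", 300, 1040),   # 0x400 | 0x10: duplicate on reverse strand
]

DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


def count_duplicates_rows(rows):
    """Row-oriented: each whole record is unpacked to reach one field."""
    return sum(1 for _name, _pos, flag in rows if flag & DUPLICATE)


def to_columns(rows):
    """Column-oriented: one contiguous sequence per field."""
    return {
        "name": [r[0] for r in rows],
        "pos": array("q", (r[1] for r in rows)),   # signed 64-bit integers
        "flag": array("H", (r[2] for r in rows)),  # unsigned 16-bit integers
    }


def count_duplicates_columns(columns):
    """Only the flag column is scanned; names and positions are never read."""
    return sum(1 for flag in columns["flag"] if flag & DUPLICATE)


columns = to_columns(records)
row_count = count_duplicates_rows(records)
col_count = count_duplicates_columns(columns)
```

Both functions return the same count. On disk, the column layout means a tool that needs only flags can skip the bytes of every other field, which is the effect the benchmarks are meant to measure.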

First, I will benchmark an implementation based on an existing column-oriented file format, Apache Parquet, which was created for efficient use in the Hadoop cluster-processing ecosystem. If the metrics are encouraging, I will implement and benchmark a new, simpler file format called cBAM (column BAM) to make file handling easier, since the Apache Parquet file layout is fairly complicated and the format itself is mainly intended for distributed systems rather than single machines.

If this experiment is successful, other Sambamba tools such as flagstat and sorting could be significantly accelerated by using the new file format.

Best Regards,
Nick Rozinsky

Nick.R...@gmail.com



