Scalable Clustering of Genotype Information using MapReduce — ASN Events

Scalable Clustering of Genotype Information using MapReduce (#67)

Aidan O'Brien 1
  1. CSIRO, Ryde, NSW, Australia

Processing genomic information from whole-genome sequence studies poses computational challenges because of the unprecedented volume of data generated, which renders traditional approaches insufficient. However, by utilising advances in modern hardware accelerators and data processing, we can provide the means for scalable solutions. We therefore aim to provide an interface between standard genomic data formats and advanced, scalable analysis libraries such as Mahout.
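As an illustration of what such an interface has to do, the sketch below turns genotype calls into numeric feature vectors that a clustering library could consume. It assumes VCF-style genotype strings (for example "0/1"), which the abstract does not name explicitly, and the function names `encode_genotype` and `encode_individual` are hypothetical. It is not the authors' implementation.

```python
def encode_genotype(gt):
    """Count alternate alleles in a VCF-style genotype string such as '0/1' or '1|1'."""
    # Phased ('|') and unphased ('/') separators are treated the same way.
    alleles = gt.replace("|", "/").split("/")
    # Assumption: missing calls ('.') are counted as reference, a deliberate simplification.
    return sum(1 for allele in alleles if allele not in ("0", "."))


def encode_individual(genotypes):
    """Encode one individual's genotype calls as a list of alternate-allele counts."""
    return [encode_genotype(gt) for gt in genotypes]
```

Each individual then becomes one vector with one entry per variant, which is the shape a k-means implementation expects.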

We achieve a 2-fold speedup by using the scalable k-means MapReduce implementation over the equivalent analysis performed in R, with comparable accuracy. However, the real benefit lies in scaling beyond R's capability to population-scale analysis. We successfully clustered more than 5000 individuals, each having more than 15 million variants.
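To show why k-means fits the MapReduce model, here is a minimal single-machine sketch of one iteration, written as a map step (assign each point to its nearest centroid) and a reduce step (average the points per centroid). This is a generic illustration in plain Python, not Mahout's code or the authors' pipeline, and all function names are hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_point(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step: average all points assigned to one centroid."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [t + p for t, p in zip(total, point)]
        count += n
    return [t / count for t in total]


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for point in points:
        key, value = map_point(point, centroids)
        grouped[key].append(value)
    # A centroid that attracted no points is kept unchanged.
    return [reduce_cluster(grouped[i]) if i in grouped else list(centroids[i])
            for i in range(len(centroids))]
```

Because each map call depends only on one point and the current centroids, the map step can be spread across many machines, and only the per-cluster sums need to be combined in the reduce step.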

Using modern compute paradigms is essential for scaling modern genomic research in an efficient, sustainable way.