Scalable Clustering of Genotype Information using MapReduce

Scalable Clustering of Genotype Information using MapReduce (#67)

Aidan O'Brien ¹

CSIRO, Ryde, NSW, Australia

Processing genomic information from whole-genome sequence studies poses computational challenges because of the unprecedented volume of data generated, which renders traditional approaches insufficient. However, advances in modern hardware accelerators and data processing provide the means for scalable solutions. We therefore aim to provide the interface between standard genomic data formats and advanced, scalable analysis libraries such as Mahout.

We achieve a 2-fold speedup using the scalable k-means MapReduce implementation over the equivalent analysis performed in R, with comparable accuracy. However, the real benefit lies in scaling beyond R's capability to population-scale analysis. We successfully clustered more than 5000 individuals, each with more than 15 million variants.
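To illustrate how k-means decomposes into map and reduce steps, the following is a minimal single-process sketch in Python. It is not Mahout's API and not the implementation evaluated here. It assumes each individual is encoded as a vector of alternate-allele counts (0, 1 or 2) per variant; that encoding, the function names, and the toy data are all hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(vector, centroids):
    """Map step: emit (index of the nearest centroid, vector)."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: squared_distance(vector, centroids[i]),
    )
    return nearest, vector


def reduce_mean(vectors):
    """Reduce step: average all vectors that share one centroid index."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans_iteration(vectors, centroids):
    """One map/shuffle/reduce round; returns the updated centroids."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for vector in vectors:
        key, value = map_assign(vector, centroids)
        groups[key].append(value)
    # A centroid with no assigned vectors keeps its previous position.
    return [
        reduce_mean(groups[i]) if i in groups else list(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(vectors, centroids, max_iterations=20):
    """Repeat rounds until the centroids stop changing."""
    for _ in range(max_iterations):
        updated = kmeans_iteration(vectors, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Toy data (hypothetical): 4 individuals x 4 variants, alternate-allele counts.
genotypes = [
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [2, 2, 1, 2],
    [2, 1, 2, 2],
]
initial = [genotypes[0], genotypes[2]]
centroids = kmeans(genotypes, initial)
labels = [map_assign(v, centroids)[0] for v in genotypes]
```

In a distributed setting the map calls run in parallel over partitions of the individuals and only the per-cluster sums travel to the reducers, which is what lets the approach scale beyond a single machine's memory.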

Modern compute paradigms are essential for scaling genomic research efficiently and sustainably.

Authors contributing to this presentation.

O'Brien, A

Aidan O'Brien graduated from the University of Queensland with a Bachelor of Biotechnology (Hons) in 2013. With his supervisor, Dr. Timothy Bailey, he created the computational tool GT-Scan, which assists users with CRISPR guide design. In the months following graduation, he continued work on GT-Scan, which is currently hosted through BRAEMBL. He now works on big data analysis with Denis Bauer at CSIRO.
