CloudSACA: Distributed Suffix Array Construction Algorithms Package on Cloud — ASN Events

CloudSACA: Distributed Suffix Array Construction Algorithms Package on Cloud (#22)

Ahmed Abdelhadi 1 , Ahmed Ali 2 , Ahmed Kandil 1 , Mohamed Abouelhoda 1 2
  1. Biomedical Engineering, Cairo University, Giza, Egypt
  2. Center of Information Systems, Nile University, Giza, Egypt

Suffix array is an important indexing data structure for biological sequence analysis. Since its introduction, there have been a lot of algorithms for constructing it either in memory or on disk. However, there is no such implementation for a distributed algorithm that can run on a computer cluster based on (MPI). Moreover, there is no ready-to-use package that can run on cloud computing platforms. There are so far two distributed suffix array construction algorithms that can run on a computer cluster; Futamura-Aluru-Kurtz and Kulla-Sanders. The latter algorithm has better theoretical time complexity but there is no proof of its superiority in practice. Due to the lack (and difficulty) of any implementation for both algorithms, it is still an open question which algorithm would be the best in practice.

In this paper, we have implemented the two distributed algorithms with a number of variations for them. These algorithms are further optimized with recent parallel distributed algorithms to improve their performance. Regarding the cloud implementation, we have developed a module that can automatically create resources either on Amazon or Azure clouds, depending on user preferences. The module also manages upload of input data, submission of jobs, and transfer of results to a persistent storage. We also conducted experiments to measure the performance of both algorithms for DNA sequences and provided a detailed profiling including the timing of intermediate steps. 

Our package is the first package that supports distributed construction of suffix array over a MPI computer cluster. It is also the first that supports multiple cloud providers. Our implementations is available through command line and web-interface, hiding all technical details for establishing and using cloud resources. Regarding the performance of algorithm, the algorithm of Futamura-Aluru-Kurtz is the more efficient in practice. This is although the algorithm of Kulla-Sanders has better time complexity