Utilising graph databases to analyse the 3D organisation of chromatin (#17)
Background: Biological data often describe relationships between objects that can be represented as graphs, such as networks, pathways or ontologies. These datasets are frequently too large to hold in random access memory, which complicates their analysis and their use as resources. The recent advent of NoSQL database technologies, with their focus on scalability and distributed analysis, therefore presents an attractive solution for large biological datasets. Graph databases in particular can store complex data transparently while providing efficient analysis algorithms developed in the field of graph theory. Here, we utilise Neo4j, a graph database, to combine and analyse relationships within and between gene expression and DNA looping datasets.
Results: Integrating ChIP-Seq, RNA-Seq and DNase hypersensitivity data into a common graph enables us to identify, by simple graph traversal, gene regulatory circuits of transcription factors (TFs) that can act as feedback loops. In addition, we demonstrate how data aggregation can be used to identify TF modules and illustrate how these modules may contribute to DNA looping interactions.
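To illustrate what identifying feedback loops by graph traversal can look like, the following is a minimal sketch in plain Python rather than the Neo4j queries used in this work. It assumes a directed graph in which an edge points from a regulating TF to a target gene that itself encodes a TF. The edge list, the TF names and the `max_len` cut-off are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy edges: (regulator TF, regulated TF). Not real data.
EDGES: List[Tuple[str, str]] = [
    ("TF_A", "TF_B"),
    ("TF_B", "TF_C"),
    ("TF_C", "TF_A"),  # closes a three-node feedback loop
    ("TF_C", "TF_D"),
    ("TF_D", "TF_D"),  # autoregulation (self-loop)
]


def build_adjacency(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map every node to the set of nodes it regulates."""
    adj: Dict[str, Set[str]] = {}
    for source, target in edges:
        adj.setdefault(source, set()).add(target)
        adj.setdefault(target, set())
    return adj


def feedback_loops(adj: Dict[str, Set[str]], max_len: int = 4) -> List[List[str]]:
    """Return simple directed cycles with at most max_len edges.

    Each cycle is reported once, starting at its alphabetically
    smallest node, so rotations of the same loop are not repeated.
    """
    loops: List[List[str]] = []

    def walk(start: str, node: str, path: List[str]) -> None:
        for nxt in sorted(adj[node]):
            if nxt == start:
                # Returned to the start node: the path is a feedback loop.
                loops.append(list(path))
            elif nxt > start and nxt not in path and len(path) < max_len:
                # Only extend through nodes "larger" than the start node.
                walk(start, nxt, path + [nxt])

    for start in sorted(adj):
        walk(start, start, [start])
    return loops


if __name__ == "__main__":
    for loop in feedback_loops(build_adjacency(EDGES)):
        print(" -> ".join(loop + [loop[0]]))
```

In a graph database the same question would typically be posed as a path query that starts and ends at the same node; the depth-first walk above only makes the underlying traversal explicit.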
Conclusions: Graph databases enable the connection of data from different domains and provide powerful algorithms for graph traversal and analysis. While some datasets naturally integrate into a graph framework, other datasets, such as sequential and temporal data, require discretisation first.
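As a sketch of what discretisation of sequential data might involve, the example below maps a genomic interval onto fixed-width bins, each of which could then serve as a node in a graph. Fixed-width binning, the bin size of 1000 bp, the half-open coordinate convention and the node-identifier format are all assumptions made for illustration, not a description of the scheme used in this work.

```python
from typing import List


def bins_for_interval(chrom: str, start: int, end: int, bin_size: int = 1000) -> List[str]:
    """Return identifiers of all fixed-width bins overlapped by [start, end).

    Coordinates are assumed to be 0-based and half-open. The identifier
    format "chrom:binStart-binEnd" is a hypothetical naming scheme.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    first = start // bin_size
    last = (end - 1) // bin_size  # last base covered is end - 1
    return [
        f"{chrom}:{i * bin_size}-{(i + 1) * bin_size}"
        for i in range(first, last + 1)
    ]


if __name__ == "__main__":
    # A hypothetical peak spanning three 1 kb bins.
    print(bins_for_interval("chr1", 950, 2100))
```

Features from different assays that fall into the same bin then share a node, which is one way continuous coordinates can be connected to the rest of a graph.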