Mining for Gene-Disease Associations with MeSH terms in MEDLINE and ArrayExpress — ASN Events

Mining for Gene-Disease Associations with MeSH terms in MEDLINE and ArrayExpress (#76)

Modest von Korff 1, Bernard Deffarges, Valerie Siefken, Thomas Sander
  1. Actelion Pharmaceuticals Ltd., Allschwil, BASEL, Switzerland

Background

This study examines whether meaningful gene-disease associations (GDAs) can be extracted from public databases. GDAs are of high interest in medicine and in drug discovery. Two fully automatic methods were implemented to mine two databases, PubMed Central and ArrayExpress. A database is queried with a gene name, and the retrieved records are searched for disease-related MeSH terms. The MeSH terms are then ranked by their frequency of occurrence in those records. To examine the relevance of the described approaches, a test dataset of 38 drugs was compiled. Drugs were chosen because each drug links a gene, which encodes the target of the drug, to a disease, which is treated by the drug. Each test record contained the drug name, the disease MeSH term (the indication), and the gene name of the target protein. For each test, one of the databases was queried with the gene name, the results were searched, and the MeSH terms were ranked. Finally, the relative rank of the disease MeSH term from the test record was used as the figure of merit for the relevance of the gene-disease association.
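The ranking step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the record structure (one set of disease MeSH terms per retrieved record), the function names, the toy records, and in particular the definition of the relative rank (1.0 for the most frequent term, decreasing towards 0 for the least frequent) are hypothetical, since the abstract does not give the formula.

```python
from collections import Counter


def rank_mesh_terms(records):
    """Rank disease MeSH terms by the number of records they occur in.

    `records` is an iterable of term collections, one per retrieved record
    (an assumed structure, not taken from the abstract).
    """
    counts = Counter()
    for terms in records:
        counts.update(set(terms))  # count each term at most once per record
    # Most frequent first; ties are broken alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ordered]


def relative_rank(ranked_terms, target):
    """Assumed definition: 1.0 for the top term, 1/n for the last of n terms.

    Returns None if the target term does not occur in the ranking.
    """
    if target not in ranked_terms:
        return None
    position = ranked_terms.index(target)  # 0-based position in the ranking
    return 1.0 - position / len(ranked_terms)


# Toy records for one gene query (made-up data, not from the study).
records = [
    {"Hypertension, Pulmonary", "Fibrosis"},
    {"Hypertension, Pulmonary"},
    {"Fibrosis", "Neoplasms"},
    {"Hypertension, Pulmonary"},
]
ranked = rank_mesh_terms(records)
score = relative_rank(ranked, "Hypertension, Pulmonary")
```

Under this assumed definition, a relative rank close to 1 means that the indication of the test record is among the most frequent disease terms retrieved for the gene.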

Results

A total of 53 test records were derived from the 38 drugs, because for some drugs more than one target was compiled from the literature. Mining ArrayExpress yielded a median relative rank of 0.675 over all test records. Mining PubMed Central yielded a median relative rank of 0.951.
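The summary statistic reported here is a median over the per-record relative ranks. A minimal sketch of that aggregation, using made-up values that are not taken from the study and a hypothetical variable layout:

```python
from statistics import median

# Relative rank of the indication for each test record
# (illustrative values only, not results from the study).
relative_ranks = [0.98, 0.40, 0.95, 0.71, 0.99]

# The median is robust to a few test records with a poor rank.
summary = median(relative_ranks)
```

A median is a reasonable choice for this kind of figure of merit because a small number of poorly ranked test records does not dominate the summary.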

Conclusions

Mining PubMed Central for relevant gene-disease associations was much more successful than mining ArrayExpress. For PubMed Central, the disease MeSH term for the indication of the test record was, in the majority of cases, within the top five percent of the ranked diseases. This demonstrated that the described method delivered meaningful gene-disease associations.