Estimation of amplicon methylation patterns from bisulphite sequencing data (#254)
Many of the known mechanisms driving gene regulation are epigenomic modifications. One example is DNA methylation, in which a methyl group is added to a cytosine (C) in the genomic DNA sequence. Methylation patterns in DNA amplicons are detected by bisulphite treatment, which converts unmethylated cytosines to uracils while leaving methylated cytosines intact. The treated amplicons are then sequenced and mapped to a reference genome, and the methylation patterns are inferred from the mapped reads.
However, bisulphite conversion is not 100% efficient, which introduces errors into the observed distribution of read patterns. Sequencing read errors are a second source of error. In this project we have developed a model for these two sources of error, based on an incomplete conversion rate and site-dependent read error rates, both of which can be estimated independently. We have also developed an algorithm that estimates the true distribution of methylation patterns and predicts spurious patterns, that is, patterns which appear in the reads solely because of incomplete conversion and read errors. This algorithm is currently being developed as an R Bioconductor package.
Because the true distribution is never known in the laboratory, we constructed synthetic data to test the effectiveness of the algorithm. We found that the distribution estimated by the algorithm is closer to the 'true' distribution than the observed read distribution is. The algorithm is also effective in predicting spurious patterns. We present the results of applying the model and the algorithm to methylation pattern data from honey bee amplicons.
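
To make the two error sources and the synthetic-data check described above concrete, the following is a minimal Python sketch. It is not the authors' model or the Bioconductor package. It assumes that an amplicon pattern is a binary string over its sites (1 = methylated), that an unmethylated site escapes conversion with probability `eps`, and that a read error at site i simply flips the observed state with probability `read_err[i]`. The estimator is a generic EM update for mixture weights with known error kernels, swapped in purely for illustration. All function names and numerical values are hypothetical.

```python
import numpy as np

def site_matrix(eps, e):
    """2x2 matrix M[o, t] = P(observed state o | true state t) at one site.

    Assumed simplification: an unmethylated site (t=0) looks methylated if
    conversion fails (prob eps) and the read is correct, or if conversion
    succeeds and a read error flips it (prob e).
    """
    p1_given_0 = eps * (1 - e) + (1 - eps) * e
    p1_given_1 = 1 - e
    return np.array([[1 - p1_given_0, 1 - p1_given_1],
                     [p1_given_0, p1_given_1]])

def pattern_matrix(eps, read_err):
    """Matrix A[o, t] over all 2**n patterns, assuming independent sites.

    Patterns are indexed as binary numbers with the first site as the
    most significant bit.
    """
    A = np.ones((1, 1))
    for e in read_err:
        A = np.kron(A, site_matrix(eps, e))
    return A

def em_estimate(counts, A, n_iter=500):
    """Generic EM for the pattern frequencies given observed read counts."""
    counts = np.asarray(counts, dtype=float)
    theta = np.full(A.shape[1], 1.0 / A.shape[1])
    for _ in range(n_iter):
        mix = np.maximum(A @ theta, 1e-300)  # expected observed frequencies
        theta = theta * (A.T @ (counts / mix)) / counts.sum()
    return theta

def tv_distance(p, q):
    """Total variation distance between two distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Synthetic check with two sites; every value here is made up.
eps = 0.05                      # hypothetical non-conversion rate
read_err = [0.01, 0.02]         # hypothetical site-dependent read error rates
A = pattern_matrix(eps, read_err)
theta_true = np.array([0.5, 0.1, 0.1, 0.3])   # patterns 00, 01, 10, 11
observed = A @ theta_true                     # expected read distribution
estimate = em_estimate(observed * 10000, A)   # expected counts for 10000 reads
```

Under these assumptions, `tv_distance(estimate, theta_true)` can be compared with `tv_distance(observed, theta_true)` to see whether correcting for the error model moves the estimate towards the synthetic truth. A pattern that receives a noticeable share of reads but an estimated frequency near zero would be a candidate spurious pattern in this toy setting.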