Evaluating Two-Factor Experimental Results for RNA-Seq Data using Simulation (#31)
Background: Many programs for the analysis of RNA-Seq data are available in R, and there has been some assessment of whether, and when, they may be expected to give correct answers. Generally, simulations used to test the software are based on the Poisson or the negative binomial distribution, and contrast two groups with varying numbers of replicates. However, experimenters often use more complex experimental designs. We simulate negative binomial data from a two-factor design to assess several R packages, using two different simulation methods. Our motivating data set is a two-factor experiment with eight patients suffering from Myelodysplastic Syndrome and seven from Chronic Myelomonocytic Leukaemia, with RNA-Seq counts estimated before and after drug treatment. We compare the differentially expressed (DE) genes used to generate the simulated data with the DE genes found from the simulations by the R packages DESeq2, PoissonSeq, edgeR, and QuasiSeq.
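As an illustration of what simulating negative binomial counts from a two-factor design can look like, here is a minimal sketch in Python using only the standard library. It is not one of the two simulation methods used in the study. The log-linear mean with additive disease and treatment effects, the single dispersion value, and all function and parameter names (`rpois`, `rnbinom`, `simulate_gene`) are assumptions made for this example. The counts are drawn as a gamma-Poisson mixture, which gives a negative binomial with variance mu + dispersion * mu^2.

```python
import math
import random


def rpois(lam, rng):
    """Draw one Poisson(lam) variate.

    Uses Knuth's multiplication method on chunks of at most 20, relying on
    the fact that a sum of independent Poisson draws is again Poisson.
    """
    total = 0
    remaining = lam
    while remaining > 0:
        chunk = min(remaining, 20.0)
        limit = math.exp(-chunk)
        k, p = 0, rng.random()
        while p > limit:
            k += 1
            p *= rng.random()
        total += k
        remaining -= chunk
    return total


def rnbinom(mu, dispersion, rng):
    """Draw a negative binomial count with mean mu and variance mu + dispersion * mu**2."""
    if mu <= 0:
        return 0
    shape = 1.0 / dispersion
    # Gamma with shape 1/dispersion and scale mu*dispersion has mean mu.
    lam = rng.gammavariate(shape, mu * dispersion)
    return rpois(lam, rng)


def simulate_gene(base_log_mean, disease_effect, treatment_effect, dispersion, rng):
    """Simulate one gene across 8 + 7 patients, each measured before and after treatment.

    Assumed (hypothetical) model: log(mu) = base + disease effect + treatment effect.
    """
    counts = []
    for disease, n_patients in (("MDS", 8), ("CMML", 7)):
        for _ in range(n_patients):
            for treated in (0, 1):
                log_mu = (base_log_mean
                          + disease_effect * (disease == "CMML")
                          + treatment_effect * treated)
                counts.append(rnbinom(math.exp(log_mu), dispersion, rng))
    return counts


rng = random.Random(1)
# A gene with baseline mean 50 whose expression doubles after treatment (illustrative values).
counts = simulate_gene(math.log(50), 0.0, math.log(2), 0.2, rng)
print(len(counts), counts[:6])
```

Each call returns 30 counts for one gene, ordered patient by patient with the pre-treatment value first. Genes simulated with a non-zero effect form the "true" DE set against which a package's calls can later be compared.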
Results: Since the simulated data follow a negative binomial distribution, the methods based on the Poisson distribution unsurprisingly performed poorly. The percentage of DE genes in common varied from 0% to 19% for one set of simulations, and from 4% to 37% for the other.
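A percentage of DE genes in common, as reported above, could be computed along the following lines. This is only a sketch: the choice of denominator (the genes simulated as DE) is an assumption, since the abstract does not state it, and the function name and gene labels are hypothetical.

```python
def percent_in_common(true_de, called_de):
    """Percentage of the simulated (true) DE genes that a package also reports as DE.

    Assumption: the denominator is the number of true DE genes.
    """
    true_de, called_de = set(true_de), set(called_de)
    if not true_de:
        return 0.0
    return 100.0 * len(true_de & called_de) / len(true_de)


# Hypothetical gene identifiers, for illustration only.
true_de = ["g1", "g2", "g3", "g4"]
called_de = ["g2", "g4", "g9"]
print(percent_in_common(true_de, called_de))  # 50.0
```

Using the number of called genes, or the union of both sets, as the denominator would give different percentages, so the definition should be fixed before packages are compared.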
Conclusions: We conclude that for a two-factor experiment with 30 experiments per gene, the negative binomial is sufficiently flexible to account for extra-Poisson variability. For simulated data on 2000 randomly selected genes, the task of normalisation made some packages almost entirely uninformative as to which genes were DE, with PoissonSeq performing least well. Normalisation affects parameter estimates, and the normalisation rates seemed to be poorly estimated with 2000 genes. These rates operate on the gene counts, raising or lowering them, and hence diminish or enhance the looked-for signal. The packages edgeR, with tagwise dispersion, and QuasiSeq, with tagwise, common and trend dispersion estimates, performed reasonably well. DESeq2 was harder to assess, but performed well when the interest is in looking at the ‘top’ genes.
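The effect of normalisation on the signal can be seen in a toy calculation. The sketch below assumes the simplest possible scheme, in which each sample's counts are divided by a per-sample scaling factor; the packages named above each use their own normalisation procedures, which are not reproduced here. The function name and all numbers are hypothetical.

```python
def normalised_fold_change(count_a, count_b, factor_a, factor_b):
    """Fold change of sample b over sample a after dividing each count by its scaling factor."""
    return (count_b / factor_b) / (count_a / factor_a)


# A gene with 100 reads before and 200 reads after treatment (illustrative values).
print(normalised_fold_change(100, 200, 1.0, 1.0))  # 2.0: the doubling is preserved
print(normalised_fold_change(100, 200, 1.0, 2.0))  # 1.0: an inflated factor removes the signal
print(normalised_fold_change(100, 200, 1.0, 0.5))  # 4.0: a deflated factor exaggerates it
```

Under this assumption, an error in an estimated factor is passed on to every gene in that sample, which is one way in which poorly estimated normalisation could leave a package uninformative about which genes are DE.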