Outlier detection in Bioinformatics with Mixtures of Gaussian and heavy-tailed distributions

  • Conference paper
Data Science – Analytics and Applications

Abstract

Starting from approaches in Bioinformatics, we investigate aspects of Bayesian robustness and compare them to methods from classical robust statistics. Bayesian robustness branches into three aspects: robustifying the prior, the likelihood or the loss function. Our focus is the likelihood itself. For computational convenience, normal likelihoods are the standard for many basic analyses, ranging from simple mean estimation to regression or discriminatory models. However, as in classical analyses, non-normal data cause problems in the estimation process and are often handled with complex models for the overestimated variance or with shrinkage. Most prominently, Bayesian non-parametrics approaches this challenge with infinite mixtures of distributions. However, infinite mixture models do not allow outlying values to be identified in “near-Gaussian” scenarios, because they are almost too flexible for that purpose. The goal of our work is to allow robust estimation of the parameters of the “main part of the data”, while identifying the outlying part of the data and providing a posterior probability of not fitting the main likelihood model. For this purpose, we propose to mix a Gaussian likelihood with heavy-tailed or skewed distributions of a similar structure, which can be related hierarchically to the normal distribution in order to allow consistent estimation of parameters and efficient simulation. We present an application of this approach in Bioinformatics for the robust estimation of genetic array data by mixing Gaussian and Student’s t distributions with various degrees of freedom. To this end, we use microarray data as a case study, as they are well known for their complicated, over-dispersed noise behaviour. Our secondary goal is to present a methodology that helps not only to identify noisy genes but also to recognise whether single arrays are responsible for this behaviour.
Although Bioinformatics research has dropped array technology in favour of sequencing, medical diagnostics has picked up the methodology and thus requires appropriate error estimators.
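
The idea of a posterior probability for not fitting the main Gaussian model can be illustrated with a two-component mixture of a normal and a Student's t density that share location and scale. The following is a minimal sketch under stated assumptions, not the estimation procedure of the paper: all parameters are treated as known and fixed, whereas the abstract describes full Bayesian estimation. The function names, the mixture weight `weight_t` and the default of 4 degrees of freedom are hypothetical choices made for illustration only.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def student_t_logpdf(x, mu, sigma, nu):
    """Log density of a location-scale Student's t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
        - (nu + 1.0) / 2.0 * math.log1p(z * z / nu)
    )


def outlier_probability(x, mu, sigma, nu=4.0, weight_t=0.1):
    """Posterior probability that x stems from the heavy-tailed t component.

    Assumes fixed, known parameters (an illustrative simplification):
    density = (1 - weight_t) * Normal(mu, sigma) + weight_t * t_nu(mu, sigma).
    """
    log_main = math.log1p(-weight_t) + normal_logpdf(x, mu, sigma)
    log_tail = math.log(weight_t) + student_t_logpdf(x, mu, sigma, nu)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_main, log_tail)
    tail = math.exp(log_tail - m)
    return tail / (math.exp(log_main - m) + tail)


if __name__ == "__main__":
    for value in (0.0, 2.0, 4.0, 6.0):
        print(value, outlier_probability(value, mu=0.0, sigma=1.0))
```

With these illustrative settings, an observation at the centre is attributed mainly to the Gaussian component, while one six standard deviations away is attributed almost entirely to the t component. In a full analysis the mixture weight, location, scale and degrees of freedom would themselves be estimated rather than fixed.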


Corresponding author

Correspondence to Alexandra Posekany.

Copyright information

© 2021 The Author(s), under exclusive licence to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature

About this paper

Cite this paper

Posekany, A. (2021). Outlier detection in Bioinformatics with Mixtures of Gaussian and heavy-tailed distributions. In: Haber, P., Lampoltshammer, T., Mayr, M., Plankensteiner, K. (eds) Data Science – Analytics and Applications. Springer Vieweg, Wiesbaden. https://doi.org/10.1007/978-3-658-32182-6_10
