Outlier detection in Bioinformatics with Mixtures of Gaussian and heavy-tailed distributions

Posekany, Alexandra

doi:10.1007/978-3-658-32182-6_10


Abstract

Starting from approaches in bioinformatics, we investigate aspects of Bayesian robustness and compare them to methods from classical robust statistics. Bayesian robustness branches into three aspects: robustifying the prior, the likelihood, or the loss function. Our focus is the likelihood itself. For computational convenience, normal likelihoods are the standard for many basic analyses, ranging from simple mean estimation to regression and discriminatory models. However, as in classical analyses, non-normal data cause problems in the estimation process and are often handled with complex models for the overestimated variance or with shrinkage. Most prominently, Bayesian non-parametrics approaches this challenge with infinite mixtures of distributions. However, infinite mixture models do not allow outlying values to be identified in “near-Gaussian” scenarios, because they are almost too flexible for that purpose. The goal of our work is to allow robust estimation of the parameters of the “main part of the data”, while identifying the outlying part of the data and providing a posterior probability of not fitting the main likelihood model. For this purpose, we propose to mix a Gaussian likelihood with heavy-tailed or skewed distributions of a similar structure, which can be related hierarchically to the normal distribution, in order to allow consistent estimation of parameters and efficient simulation. We present an application of this approach in bioinformatics: the robust estimation of genetic array data by mixing Gaussian and Student's t distributions with various degrees of freedom. We use microarray data as a case study, as they are well known for their complicated, over-dispersed noise behaviour. Our secondary goal is to present a methodology that helps not only to identify noisy genes but also to recognise whether single arrays are responsible for this behaviour. Although bioinformatics research has dropped array technology in favour of sequencing, medical diagnostics has picked up the methodology and thus requires appropriate error estimators.
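To make the idea of a "posterior probability for not fitting the main likelihood model" concrete, the following is a minimal sketch and not the chapter's actual method. It assumes a two-component mixture in which a Gaussian and a Student's t distribution share the same location and scale, and it treats the location `mu`, scale `sigma`, degrees of freedom `nu`, and heavy-tailed weight `weight_t` as known. All function and parameter names are hypothetical. In a fully Bayesian analysis these quantities would be inferred, for example by MCMC, and the probability below would be averaged over posterior draws.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def student_t_pdf(x, mu, sigma, nu):
    """Density of a location-scale Student's t with nu degrees of freedom."""
    z = (x - mu) / sigma
    log_norm = (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
    )
    return math.exp(log_norm - 0.5 * (nu + 1.0) * math.log1p(z * z / nu))


def outlier_probability(x, mu, sigma, nu, weight_t):
    """Probability that x belongs to the heavy-tailed component.

    Assumption: mu, sigma, nu and weight_t are fixed and known here,
    which is a simplification made only for illustration.
    """
    heavy = weight_t * student_t_pdf(x, mu, sigma, nu)
    main = (1.0 - weight_t) * normal_pdf(x, mu, sigma)
    return heavy / (heavy + main)


if __name__ == "__main__":
    # Hypothetical values: most points near 0, one far in the tail.
    for value in [0.0, 1.0, 3.0, 6.0]:
        p = outlier_probability(value, mu=0.0, sigma=1.0, nu=4.0, weight_t=0.1)
        print(f"x = {value:4.1f}  P(heavy-tailed component) = {p:.4f}")
```

Under these assumed settings, observations near the centre are attributed mostly to the Gaussian component, while observations far in the tails are attributed almost entirely to the Student's t component. That is the behaviour an outlier indicator of this kind is meant to show.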

Author information

Authors and Affiliations

Department of Computational Statistics, TU Wien, Vienna, Austria
Alexandra Posekany

Corresponding author

Correspondence to Alexandra Posekany.

Editor information

Editors and Affiliations

Informationstechnik & System-Management, Fachhochschule Salzburg, Puch/Salzburg, Austria
Peter Haber
Donau-Universität Krems Center for E-Governance, Krems an der Donau, Austria
Thomas Lampoltshammer
Informationstechnik & System-Management, Fachhochschule Salzburg, Puch/Salzburg, Salzburg, Austria
Manfred Mayr
Campus V, Fachhochschule Vorarlberg GmbH, Dornbirn, Austria
Kathrin Plankensteiner

Cite this paper

Posekany, A. (2021). Outlier detection in Bioinformatics with Mixtures of Gaussian and heavy-tailed distributions. In: Haber, P., Lampoltshammer, T., Mayr, M., Plankensteiner, K. (eds) Data Science – Analytics and Applications. Springer Vieweg, Wiesbaden. https://doi.org/10.1007/978-3-658-32182-6_10

DOI: https://doi.org/10.1007/978-3-658-32182-6_10
Published: 05 January 2021
Publisher Name: Springer Vieweg, Wiesbaden
Print ISBN: 978-3-658-32181-9
Online ISBN: 978-3-658-32182-6
eBook Packages: Computer Science and Engineering (German Language)
