Abstract
Starting from approaches in bioinformatics, we investigate ideas from Bayesian robustness and compare them with methods from classical robust statistics. Bayesian robustness branches into three aspects: robustifying the prior, the likelihood, or the loss function. Our focus is the likelihood. For computational convenience, normal likelihoods are the standard for many basic analyses, ranging from simple mean estimation to regression or discriminatory models. However, as in classical analyses, non-normal data cause problems in the estimation process and are often handled with complex models for the overestimated variance or with shrinkage. Most prominently, Bayesian non-parametrics approaches this challenge with infinite mixtures of distributions. Infinite mixture models, however, are almost too flexible to allow outlying values to be identified in "near-Gaussian" scenarios. The goal of our work is to allow robust estimation of the parameters of the "main part of the data" while identifying the outlying part of the data and providing, for each observation, a posterior probability of not fitting the main likelihood model. For this purpose, we propose to mix a Gaussian likelihood with heavy-tailed or skewed distributions of a similar structure that can be related hierarchically to the normal distribution, which allows consistent estimation of parameters and efficient simulation. We present an application of this approach in bioinformatics: the robust estimation of genetic array data by mixing Gaussian and Student's t distributions with various degrees of freedom. Microarray data serve as our case study, as they are well known for their complicated, over-dispersed noise behaviour. Our secondary goal is to present a methodology that helps not only to identify noisy genes but also to recognise whether single arrays are responsible for this behaviour.
Although bioinformatics research has dropped array technology in favour of sequencing, medical diagnostics has picked up the methodology and thus requires appropriate error estimators.
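As a rough illustration of the idea described in the abstract, the following sketch computes the posterior probability that a single observation belongs to the heavy-tailed component of a two-component mixture of a Gaussian and a Student's t distribution. It is not the chapter's implementation: the function names, the shared location and scale of both components, and the fixed parameter values are assumptions made for illustration only. In a full Bayesian analysis these parameters would be estimated, for example by MCMC, rather than fixed.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def student_t_pdf(x, mu, sigma, nu):
    """Density of a location-scale Student's t distribution with nu degrees of freedom."""
    z = (x - mu) / sigma
    log_norm = (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
    )
    return math.exp(log_norm - 0.5 * (nu + 1.0) * math.log1p(z * z / nu))


def outlier_probability(x, mu, sigma, nu, weight_t):
    """Posterior probability that x came from the t component.

    Assumes (hypothetically) that both components share mu and sigma and that
    weight_t is the prior mixture weight of the heavy-tailed component.
    """
    heavy = weight_t * student_t_pdf(x, mu, sigma, nu)
    main = (1.0 - weight_t) * normal_pdf(x, mu, sigma)
    return heavy / (heavy + main)


if __name__ == "__main__":
    # Illustrative values only: standardised data, 3 degrees of freedom,
    # 10% prior weight on the heavy-tailed component.
    for value in (0.0, 2.0, 4.0, 6.0):
        p = outlier_probability(value, mu=0.0, sigma=1.0, nu=3.0, weight_t=0.1)
        print(f"x = {value:4.1f}  P(heavy-tailed | x) = {p:.4f}")
```

Observations close to the centre receive a small probability of belonging to the heavy-tailed component, while observations far in the tails receive a probability near one. This is the kind of per-observation posterior probability that can be used to flag values that do not fit the main Gaussian model.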
© 2021 The Author(s), exclusively licensed to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature
Cite this paper
Posekany, A. (2021). Outlier detection in Bioinformatics with Mixtures of Gaussian and heavy-tailed distributions. In: Haber, P., Lampoltshammer, T., Mayr, M., Plankensteiner, K. (eds) Data Science – Analytics and Applications. Springer Vieweg, Wiesbaden. https://doi.org/10.1007/978-3-658-32182-6_10
DOI: https://doi.org/10.1007/978-3-658-32182-6_10
Publisher Name: Springer Vieweg, Wiesbaden
Print ISBN: 978-3-658-32181-9
Online ISBN: 978-3-658-32182-6
eBook Packages: Computer Science and Engineering (German Language)