UCSL : A Machine Learning Expectation-Maximization Framework for Unsupervised Clustering Driven by Supervised Learning

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Abstract

Subtype Discovery consists in finding interpretable and consistent sub-parts of a dataset that are also relevant to a certain supervised task. From a mathematical point of view, this can be defined as a clustering task driven by supervised learning in order to uncover subgroups in line with the supervised prediction. In this paper, we propose a general Expectation-Maximization ensemble framework entitled UCSL (Unsupervised Clustering driven by Supervised Learning). Our method is generic: it can integrate any clustering method and can be driven by both binary classification and regression. We propose to construct a non-linear model by merging multiple linear estimators, one per cluster. Each hyperplane is estimated so that it correctly discriminates, or predicts, only one cluster. We use SVC or Logistic Regression for classification and SVR for regression. Furthermore, to perform cluster analysis within a more suitable space, we also propose a dimension-reduction algorithm that projects the data onto an orthonormal space relevant to the supervised task. We analyze the robustness and generalization capability of our algorithm using synthetic and experimental datasets. In particular, we validate its ability to identify suitable consistent sub-types by conducting a psychiatric-diseases cluster analysis with known ground-truth labels. The gain of the proposed method over previous state-of-the-art techniques is about +1.9 points in terms of balanced accuracy. Finally, we make code and examples available in a scikit-learn-compatible Python package: https://github.com/neurospin-projects/2021_rlouiset_ucsl/.
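
The alternation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the API of their package. The function names `ucsl_like_em` and `supervised_subspace`, the softmax update of the assignments, the weight floor of 1e-3, and the synthetic data are all assumptions made for illustration. The sketch covers binary classification only: K-means initializes the clusters of the positive class, the M-step fits one weighted Logistic Regression per cluster, the E-step re-assigns positive samples according to the hyperplane scores, and a QR factorization of the hyperplane normals yields an orthonormal basis related to the supervised task.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


def ucsl_like_em(X, y, n_clusters=2, n_iter=10, seed=0):
    """Toy EM-style alternation (illustrative assumption, not the paper's code).

    Returns soft cluster assignments of the positive samples and one
    linear classifier per cluster.
    """
    pos = y == 1
    X_pos = X[pos]
    # Initialization: hard K-means labels on the positive class only.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    resp = np.eye(n_clusters)[km.fit_predict(X_pos)]  # shape (n_pos, K)

    for _ in range(n_iter):
        # M-step: one linear classifier per cluster. Positive samples are
        # weighted by their assignment to the cluster, negatives keep weight 1.
        models = []
        for k in range(n_clusters):
            w = np.ones(len(y))
            w[pos] = np.clip(resp[:, k], 1e-3, None)  # floor is an assumption
            models.append(LogisticRegression().fit(X, y, sample_weight=w))
        # E-step: softmax over the per-cluster decision scores (an assumption).
        scores = np.column_stack([m.decision_function(X_pos) for m in models])
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(scores)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp, models


def supervised_subspace(models):
    """Orthonormal basis spanning the hyperplane normals, via QR."""
    W = np.vstack([m.coef_.ravel() for m in models])  # shape (K, d)
    Q, _ = np.linalg.qr(W.T)  # shape (d, K), orthonormal columns
    return Q


# Synthetic data: one negative class and two positive subgroups.
rng = np.random.default_rng(0)
X_neg = rng.normal(size=(100, 5))
X_a = rng.normal(size=(50, 5)) + np.array([4.0, 0, 0, 0, 0])
X_b = rng.normal(size=(50, 5)) + np.array([0, 4.0, 0, 0, 0])
X = np.vstack([X_neg, X_a, X_b])
y = np.r_[np.zeros(100), np.ones(100)].astype(int)

resp, models = ucsl_like_em(X, y)
Q = supervised_subspace(models)
X_proj = X @ Q  # data expressed in the task-related orthonormal basis
subtypes = resp.argmax(axis=1)  # hard subtype label per positive sample
```

In this sketch the clustering is updated in the original feature space; the abstract indicates that the proposed method performs the cluster analysis in the reduced space, which would correspond to clustering `X_proj` instead.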

Author information

Corresponding authors

Correspondence to Pietro Gori or Antoine Grigis.

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 331 KB)

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Louiset, R., Gori, P., Dufumier, B., Houenou, J., Grigis, A., Duchesnay, E. (2021). UCSL : A Machine Learning Expectation-Maximization Framework for Unsupervised Clustering Driven by Supervised Learning. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12975. Springer, Cham. https://doi.org/10.1007/978-3-030-86486-6_46

  • DOI: https://doi.org/10.1007/978-3-030-86486-6_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86485-9

  • Online ISBN: 978-3-030-86486-6

  • eBook Packages: Computer Science (R0)
