Directional co-clustering

Regular Article · Advances in Data Analysis and Classification

Abstract

Co-clustering addresses the problem of simultaneous clustering of both dimensions of a data matrix. When dealing with high dimensional sparse data, co-clustering turns out to be more beneficial than one-sided clustering even if one is interested in clustering along one dimension only. Aside from being high dimensional and sparse, some datasets, such as document-term matrices, exhibit directional characteristics, and the \(L_2\) normalization of such data, so that it lies on the surface of a unit hypersphere, is useful. Popular co-clustering assumptions, such as the Gaussian or the Multinomial, are inadequate for this type of data. In this paper, we extend the scope of co-clustering to directional data. We present Diagonal Block Mixture of Von Mises–Fisher distributions (dbmovMFs), a co-clustering model which is well suited for directional data lying on a unit hypersphere. By deriving the model parameter estimates under the maximum likelihood (ML) and classification ML approaches, we develop a class of EM algorithms for estimating dbmovMFs from data. Extensive experiments, on several real-world datasets, confirm the advantage of our approach and demonstrate the effectiveness of our algorithms.


Notes

  1. https://pypi.python.org/pypi/coclust.

  2. http://www.dataminingresearch.com/.

  3. http://www.cad.zju.edu.cn/home/dengcai/.

  4. All algorithms, except stochastic variants, provide substantially better results when they are initialized using Skmeans (in almost all situations).

  5. For presentation purposes we omit CEM, EM\(_{\mathbf {b}}\) and CEM\(_{\mathbf {b}}\) from Fig. 4, but from Tables 2 and 3 it is straightforward to verify that our comments still hold for these algorithms.

References

  • Abramowitz M, Stegun IA (1964) Handbook of mathematical functions: with formulas, graphs, and mathematical tables, vol 55. Courier Corporation, North Chelmsford

  • Ailem M, Role F, Nadif M (2016) Graph modularity maximization as an effective method for co-clustering text data. Knowl Based Syst 109:160–173

  • Ailem M, Role F, Nadif M (2017a) Model-based co-clustering for the effective handling of sparse data. Pattern Recognit 72:108–122

  • Ailem M, Role F, Nadif M (2017b) Sparse poisson latent block model for document clustering. IEEE Trans Knowl Data Eng 29(7):1563–1576

  • Akaike H (1998) Information theory and an extension of the maximum likelihood principle. In: Parzen E, Tanabe K, Kitagawa G (eds) Selected papers of Hirotugu Akaike. Springer, New York, pp 199–213

  • Banerjee A, Dhillon IS, Ghosh J, Sra S (2005) Clustering on the unit hypersphere using von Mises–Fisher distributions. J Mach Learn Res 6:1345–1382

  • Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821

  • Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE TPAMI 22(7):719–725

  • Bock HH (1979) Simultaneous clustering of objects and variables. In: Tomassone R (ed) Analyse des Données et Informatique. INRIA, Le Chesnay, pp 187–203

  • Bock HH (1994) Information and entropy in cluster analysis. In: Bozdogan H et al (eds) Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: an informational approach. Springer, Dordrecht, pp 115–147

  • Bozdogan H (2000) Akaike’s information criterion and recent developments in information complexity. J Math Psychol 44(1):62–91

  • Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat Q 2(1):73–82

  • Celeux G, Diebolt J (1992) A stochastic approximation type EM algorithm for the mixture problem. Stoch Int J Probab Stoch Process 41(1–2):119–134

  • Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38

  • Deodhar M, Ghosh J (2010) Scoal: a framework for simultaneous co-clustering and learning from complex data. ACM Trans Knowl Discov Data 4(3):11

  • Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: ACM SIGKDD, pp 269–274

  • Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42(1–2):143–175

  • Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: ACM SIGKDD, pp 89–98. ACM

  • Gopal S, Yang Y (2014) Von Mises–Fisher clustering models. In: ICML, pp 154–162

  • Govaert G (1995) Simultaneous clustering of rows and columns. Control Cybern 24:437–458

  • Govaert G, Nadif M (2013) Co-Clustering. Wiley, New York

  • Govaert G, Nadif M (2016) Mutual information, phi-squared and model-based co-clustering for contingency tables. Adv Data Anal Classif, pp 1–34

  • Hanczar B, Nadif M (2010) Bagging for biclustering: application to microarray data. In: ECML/PKDD, pp 490–505

  • Hartigan JA (1975) Clustering algorithms, 99th edn. Wiley, New York

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218

  • Labiod L, Nadif M (2011) Co-clustering for binary and categorical data with maximum modularity. In: ICDM, pp 1140–1145

  • Laclau C, Nadif M (2017) Diagonal latent block model for binary data. Stat Comput 27(5):1145–1163

  • Li T (2005) A general model for clustering binary data. In: SIGKDD, pp 188–197

  • Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM TCBB 1(1):24–45

  • Mardia KV, Jupp PE (2000) Directional statistics. Wiley series in probability and statistics. Wiley, New York

  • McLachlan G, Krishnan T (2007) The EM algorithm and extensions. Wiley, New York

  • McLachlan G, Peel D (2004) Finite mixture models. Wiley, New York

  • Nadif M, Govaert G (2010) Model-based co-clustering for continuous data. In: ICMLA, pp 175–180

  • Reisinger J, Waters A, Silverthorn B, Mooney RJ (2010) Spherical topic models. In: ICML, pp 903–910

  • Salah A, Nadif M (2017) Social regularized von Mises–Fisher mixture model for item recommendation. Data Min Knowl Discov 31:1–24

  • Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  • Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. JMLR 3:583–617

  • van Dijk B, van Rosmalen J, Paap R (2009) A Bayesian approach to two-mode clustering. Econometric Institute, Erasmus University Rotterdam, Report no EI 2009-06, pp 1–26

  • Van Mechelen I, Bock HH, De Boeck P (2004) Two-mode clustering methods: a structured overview. Stat Methods Med Res 13(5):363–394

  • Vichi M (2001) Double k-means clustering for simultaneous classification of objects and variables. In: Borra S, Rocci R, Vichi M, Schader M (eds) Advances in classification and data analysis. Springer, Berlin, Heidelberg, pp 43–52

  • Wyse J, Friel N (2012) Block clustering with collapsed latent block models. Stat Comput 22(2):415–428

  • Zhong S, Ghosh J (2005) Generative model-based document clustering: a comparative study. Knowl Inf Syst 8(3):374–384

Author information

Corresponding author

Correspondence to Aghiles Salah.

Additional information

Availability: All algorithms presented in the paper are available at https://github.com/dbmovMFs/DirecCoclus.

Appendices

A. Appendix

A.1 Maximum likelihood estimate

The expectation of the classification log-likelihood of dbmovMFs is given by

$$\begin{aligned} \mathbb {E}[L_c(\varTheta |\mathcal {X},{\mathbf {Z}})] = \sum _{h}{\tilde{z}_{.h}\log {\alpha _h}} + \sum _{h}{\tilde{z}_{.h}\log (c_d(\kappa _h))} + \sum _{h,i,j}{\tilde{z}_{ih}w_{jh}\kappa _h\mu _{hh} x_{ij}} \end{aligned}$$
(9)

where \(\tilde{z}_{.h} = \sum _i \tilde{z}_{ih}\). We first maximize the expectation of the classification log-likelihood (9) with respect to \(\alpha _h\), subject to the constraint \(\sum _h\alpha _h = 1\). The corresponding Lagrangian, up to terms which are not functions of \(\alpha _h\), is given by \(L(\varvec{\alpha },\lambda )= \sum _{h}{\tilde{z}_{.h}\log {\alpha _h}} + \lambda (1 - \sum _h \alpha _h). \) Taking the derivative with respect to \(\alpha _h\), we obtain \( \frac{\partial L(\varvec{\alpha },\lambda )}{\partial \alpha _h} = \frac{\tilde{z}_{.h}}{\alpha _h} - \lambda . \) Setting this derivative to zero leads to \(\tilde{z}_{.h} = \lambda \alpha _h\). Summing both sides over all h yields \(\lambda = n\), so that the maximizing value of the parameter \(\alpha _h\) is given by \( \hat{\alpha }_h = \frac{\tilde{z}_{.h}}{n}. \) In the same manner, to maximize expectation (9) with respect to \(\varvec{\mu }_h^{{\mathbf {w}}}\), subject to the constraint \(||\varvec{\mu }_h^{{\mathbf {w}}}|| = 1\), we form the corresponding Lagrangian by isolating the terms depending on \(\varvec{\mu }_h^{{\mathbf {w}}}\); this leads to

$$\begin{aligned} L(\varvec{\mu },\lambda )= \sum _{h,i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h\mu _{hh} x_{ij}}} + \lambda \left( 1 - \sum _j w_{jh}\mu _{hh}^2\right) . \end{aligned}$$

Taking the derivative with respect to \(\mu _{hh}\), we obtain:

$$\begin{aligned} \frac{\partial L(\varvec{\mu },\lambda )}{\partial \mu _{hh}} = \sum _{i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h x_{ij}}} - 2\lambda w_{.h} \mu _{hh} \end{aligned}$$

where \(w_{.h} = \sum _j w_{jh}\). Setting this derivative to zero, we obtain \(\lambda \mu _{hh} = \frac{\sum _{i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h x_{ij}}}}{2w_{.h}}\), and thus \(\lambda ^2\mu _{hh}^2 = \frac{(\sum _{i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h x_{ij}}})^2}{4w_{.h}^2}\). Multiplying both sides by \(w_{.h}\) yields \( \lambda ^2w_{.h}\mu _{hh}^2 = \frac{w_{.h}(\sum _{i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h x_{ij}}})^2}{4w_{.h}^2}. \) Since \(w_{.h}\mu _{hh}^2 =1\), we have \(\lambda = \kappa _h\frac{\sqrt{w_{.h} (\sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}})^2}}{2 w_{.h}} \), i.e., \(\lambda = \kappa _h \frac{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert }{2 w_{.h}}\), where \({\mathbf {r}}_h^{{\mathbf {w}}}\) is the d-dimensional vector with components \(r_{hj'}^{{\mathbf {w}}} = r_{h}^{{\mathbf {w}}} = \sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}\) if \(w_{j'h} = 1\) and \(r_{hj'}^{{\mathbf {w}}} = 0\) otherwise, for \(j' = 1,\ldots ,d\). Hence, the maximizing value of the parameter \(\mu _{hh}\) is given by

$$\begin{aligned} \hat{\mu }_{hh}= \frac{\sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}}{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert } = \frac{\sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}}{\sqrt{w_{.h} (\sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}})^2}}=\pm \frac{1}{\sqrt{w_{.h}}} \end{aligned}$$
(10)

according to whether \(r_h^{{\mathbf {w}}} = \sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}\) is positive or negative.
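Note that this estimate is consistent with the unit-norm constraint: the block-constant direction \(\hat{\varvec{\mu }}_h^{{\mathbf {w}}}\) has exactly \(w_{.h}\) non-zero components, all equal to \(\pm 1/\sqrt{w_{.h}}\), so that

$$\begin{aligned} \Vert \hat{\varvec{\mu }}_h^{{\mathbf {w}}}\Vert ^2 = \sum _j w_{jh}\,\hat{\mu }_{hh}^2 = w_{.h}\times \frac{1}{w_{.h}} = 1. \end{aligned}$$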

Next we concentrate on maximizing Eq. (9) with respect to the concentration parameters \(\kappa _h\), subject to the constraint \(\kappa _h>0, \;\forall h\). The Lagrangian, up to terms which do not contain \(\kappa _h\), is given by \( L(\kappa )= \sum _{h}{\tilde{z}_{.h}\log (c_d(\kappa _h))} + \sum _{h,i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h\hat{\mu }_{hh} x_{ij}}}. \) Note that, by the KKT conditions, the Lagrange multiplier associated with the constraint \(\kappa _h>0\) has to be equal to zero. Taking the partial derivative with respect to \(\kappa _h\), we obtain

$$\begin{aligned} \frac{\partial L(\kappa )}{\partial \kappa _h} = \tilde{z}_{.h}\frac{c_d'(\kappa _h)}{c_d(\kappa _h)} + \sum _{i,j}{{\tilde{z}_{ih}w_{jh}\hat{\mu }_{hh} x_{ij}}}. \end{aligned}$$

Setting this derivative equal to zero leads to \(\frac{c_d'(\kappa _h)}{c_d(\kappa _h)} = - \frac{\hat{\mu }_{hh}\times \sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}}{\tilde{z}_{.h}}.\) Replacing \(\hat{\mu }_{hh}\) by \(\frac{\sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}}{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert }\) (see Eq. 10), we obtain \( \frac{c_d'(\kappa _h)}{c_d(\kappa _h)} = - \frac{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert }{\tilde{z}_{.h}\hat{w}_{.h}}.\) Let \(s = d/2-1\); then:

$$\begin{aligned} c_d'(\kappa _h) &= \frac{s\kappa _h^{s-1}(2\pi )^{s+1} I_s(\kappa _h) -\kappa _h^s(2\pi )^{s+1}I_s'(\kappa _h)}{(2\pi )^{2s+2}I_s^2(\kappa _h)}\\ &= \frac{s\kappa _h^{s-1}}{(2\pi )^{s+1}I_s(\kappa _h)}-\frac{\kappa _h^sI_s'(\kappa _h)}{(2\pi )^{s+1}I_s^2(\kappa _h)} = c_d(\kappa _h)\left( \frac{s}{\kappa _h}-\frac{I_s'(\kappa _h)}{I_s(\kappa _h)}\right) . \end{aligned}$$
(11)

Hence, we obtain

$$\begin{aligned} \frac{-c_d'(\kappa _h)}{c_d(\kappa _h)} = \frac{I_{s+1}(\kappa _h)}{I_{s}(\kappa _h)} = \frac{I_{d/2}(\kappa _h)}{I_{d/2-1}(\kappa _h)}. \end{aligned}$$
(12)

Equation (12) follows from the recurrence formula (Abramowitz and Stegun 1964, p. 376): \( \kappa _h I_{s+1}(\kappa _h)= \kappa _h I_s'(\kappa _h)- sI_s(\kappa _h). \) Note that computing the maximizing value \(\hat{\kappa }_h\) from Eq. (11) requires inverting a ratio of Bessel functions, a problem for which no closed-form solution is available. Thus, following Banerjee et al. (2005), we derive an accurate approximation of the concentration parameter by using the following continued fraction formula:

$$\begin{aligned} \frac{I_{d/2}(\kappa _h)}{I_{d/2-1}(\kappa _h)} = \frac{1}{\frac{d}{\kappa _h}+\frac{1}{\frac{d+2}{\kappa _h}+\cdots }}. \end{aligned}$$
(13)

Letting \( \bar{r}_h^{{\mathbf {w}}} = \frac{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert }{\tilde{z}_{.h}\hat{w}_{.h}} = \frac{I_{d/2}(\kappa _h)}{I_{d/2-1}(\kappa _h)}\) and using Eq. (13), we obtain \(\frac{1}{\bar{r}_h^{{\mathbf {w}}}} \approx \frac{d}{\kappa _h} + \bar{r}_h^{{\mathbf {w}}} \), which yields the following approximation: \(\hat{\kappa }_h = \frac{d\bar{r}_h^{{\mathbf {w}}}}{1-(\bar{r}_h^{{\mathbf {w}}})^2}.\) Finally, Banerjee et al. (2005) showed empirically that adding the correction term \(\frac{-(\bar{r}_h^{{\mathbf {w}}})^3}{1-(\bar{r}_h^{{\mathbf {w}}})^2}\) results in a better approximation of \(\hat{\kappa }_h\), which leads to \(\hat{\kappa }_h = \frac{d\bar{r}_h^{{\mathbf {w}}} - (\bar{r}_h^{{\mathbf {w}}})^3}{1-(\bar{r}_h^{{\mathbf {w}}})^2}. \)
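As an illustration of this approximation, the corrected estimate can be computed as in the following minimal sketch (ours, not taken from the released implementation; the function name estimate_kappa and its NumPy interface are assumptions):

```python
import numpy as np

def estimate_kappa(r_bar, d):
    """Corrected approximation of the vMF concentration parameter
    (Banerjee et al. 2005) from the mean resultant length r_bar in (0, 1)
    and the dimensionality d."""
    r_bar = np.asarray(r_bar, dtype=float)
    return (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)

print(estimate_kappa(0.9, 1000))  # approx. 4733.0
```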

A.2 Concentration parameters: dbmovMFs versus movMFs

Hereafter, we provide the proofs for Proposition 1 and Theorem 1. The following proposition is useful and necessary to prove both Proposition 1 and Theorem 1.

Proposition 3

Let \({\mathbf {r}}\) be a non-zero vector in \(\mathbb {R}^d\) (i.e., \({\mathbf {r}}= (r_1, \ldots ,r_d)^\top \), with \(d\ge 1\)). Let \({\mathbf {r}}^d\) be the vector in \(\mathbb {R}^d\) all of whose components are equal to the sum of the elements of \({\mathbf {r}}\) (i.e., \({\mathbf {r}}^d = (\sum _j r_j)\,\mathbf {1}\), where \(\mathbf {1}\) denotes the constant one vector). Then \(\frac{\Vert {\mathbf {r}}^d\Vert }{d} \;\le \; \Vert {\mathbf {r}}\Vert \), with equality only if all components of \({\mathbf {r}}\) are equal, i.e., \(r_1= \cdots = r_d\).

Proof

Let \({\mathbf {d}}\) and \({\mathbf {r}}^+\) be two vectors in \(\mathbb {R}^d\) defined as follows: \({\mathbf {d}} = \frac{1}{\sqrt{d}}\,\mathbf {1}\) and \( r_j^+\; = \; |r_j|, \;\;\; \forall j \in \{1,\ldots ,d\}. \) We have

$$\begin{aligned} \frac{\Vert {\mathbf {r}}^d\Vert }{d} \;=\; \frac{\sqrt{d}\times \left| \sum _j r_j\right| }{d} \;\le \; \frac{1}{\sqrt{d}} \times \sum _j\left| r_j\right| \;=\;{\mathbf {d}}^\top .{\mathbf {r}}^+ \;=\;\Vert {\mathbf {d}}\Vert \Vert {\mathbf {r}}^+\Vert \cos ({\mathbf {d}},{\mathbf {r}}^+). \end{aligned}$$

By definition of \({\mathbf {r}}^+\) and \({\mathbf {d}}\), we have \(\Vert {\mathbf {r}}^+\Vert = \Vert {\mathbf {r}}\Vert \) and \(\Vert {\mathbf {d}}\Vert = 1\), hence \( \frac{\Vert {\mathbf {r}}^d\Vert }{d} \;\le \;\Vert {\mathbf {r}}\Vert \cos ({\mathbf {d}},{\mathbf {r}}^+). \) Since both \({\mathbf {d}}\) and \({\mathbf {r}}^+\) are non-zero vectors lying in the first orthant, dividing both sides of the above inequality by \(\Vert {\mathbf {r}}\Vert \) gives \( 0\;\le \;\frac{\Vert {\mathbf {r}}^d\Vert }{d\Vert {\mathbf {r}}\Vert } \;\le \;\cos ({\mathbf {d}},{\mathbf {r}}^+) \;\le \;1. \) The right-hand side equality holds only if \({\mathbf {d}}\) and \({\mathbf {r}}^+\) are collinear, and therefore only if all components of \({\mathbf {r}}\) are equal (i.e., \(r_1 = \cdots =r_d\)). We now prove Proposition 1. \(\square \)
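The inequality of Proposition 3 is also easy to check numerically; the following minimal sketch (our illustration, with an assumed random-vector setup) verifies it on randomly drawn vectors:

```python
import numpy as np

# Numerical check of Proposition 3 on randomly drawn vectors.
rng = np.random.default_rng(0)
for _ in range(1000):
    d = int(rng.integers(1, 20))
    r = rng.normal(size=d)
    r_d = np.full(d, r.sum())  # all components equal to sum(r)
    assert np.linalg.norm(r_d) / d <= np.linalg.norm(r) + 1e-12
print("Proposition 3 verified on all sampled vectors.")
```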

Proof of Proposition 1

Based on Proposition 3 and the following inequality: \(\Vert {\mathbf {r}}\Vert = \Vert p_1{\mathbf {x}}_1 + \cdots + p_n{\mathbf {x}}_n\Vert \le \Vert p_1{\mathbf {x}}_1\Vert + \cdots + \Vert p_n{\mathbf {x}}_n\Vert = \sum _i p_i\) (since each \({\mathbf {x}}_i\) lies on the unit hypersphere, \(\Vert p_i{\mathbf {x}}_i\Vert = p_i\)), it is straightforward to verify that

$$\begin{aligned} 0\; \le \; \Vert {\mathbf {r}}^d\Vert \; \le \; {d \times \sum _i p_i} \end{aligned}$$

In the following, we prove Theorem 1, which states that, for a given row clustering \({\mathbf {z}}\), dbmovMFs-based algorithms lead to a concentration parameter that is less than or equal to that of movMFs-based algorithms, for each cluster and for any column partition \({\mathbf {w}}\). The following Lemma will be useful in the proof of Theorem 1. \(\square \)

Lemma 1

Let a and b be two real numbers in the interval [0, 1] (i.e., \(0\le a \le 1\) and \(0\le b \le 1\)). Then, for every natural number \(n\ge 2\), \( \left| a^n - b^n\right| \;\le \; n\left| a-b\right| \), with equality only if \(a = b\).

Proof

For every natural number \(n\ge 2\), we can use the well-known identity \( a^n - b^n \; = \; (a-b)\sum _{k=0}^{n-1}a^{n-1-k}b^k. \) Since both a and b are non-negative, we have \( \left| a^n - b^n\right| \; = \; \left| a-b\right| \sum _{k=0}^{n-1}a^{n-1-k}b^k \); and as both a and b are in [0, 1], we have \( \sum _{k=0}^{n-1}a^{n-1-k}b^k\; \le \; n \), thereby \( \left| a^n - b^n\right| \; = \;\left| a-b\right| \sum _{k=0}^{n-1}a^{n-1-k}b^k\; \le \; n\left| a-b\right| . \)\(\square \)

Proof of Theorem 1

We first prove that \( { \frac{\bar{r}_{h}^{{\mathbf {w}}} d }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h} d }{1 - \left( \bar{r}_{h}\right) ^2} \), which corresponds to the approximation of the concentration parameters under dbmovMFs and movMFs without the correction term. Since both \(\bar{r}_{h}^{{\mathbf {w}}}\) and \(\bar{r}_{h}\) lie in ]0, 1[ and the map \(x \mapsto \frac{xd}{1-x^2}\) is increasing on this interval, it is sufficient to show that \(\bar{r}_{h}^{{\mathbf {w}}} \;\le \; \bar{r}_{h}\), which is what we prove in what follows. \(\square \)

By definition, we have \( \bar{r}_h = \frac{\Vert {\mathbf {r}}_h\Vert }{\sum _i \tilde{z}_{ih}}\quad \text{ where } \quad {\mathbf {r}}_h = \sum _i \tilde{z}_{ih}{\mathbf {x}}_i \). Now, let’s define \(\bar{r}_{h}'\) as:

$$\begin{aligned} \bar{r}_h' = \frac{\Vert {\mathbf {r}}_h'\Vert }{\sum _i \tilde{z}_{ih}}\quad \text{ where } \quad r_{hj}' = \left\{ \begin{array}{ll} r_{hj}, &{}\quad \text{ if } w_{jh}= 1\\ 0, &{}\quad \text{ otherwise }. \end{array} \right. \\ \end{aligned}$$

Since \({\mathbf {r}}_h'\) retains only the entries of \({\mathbf {r}}_h\) belonging to the hth column cluster (a \(w_{.h}\)-dimensional sub-vector of \({\mathbf {r}}_h\), padded with zeros), it follows from the above definition that \(\bar{r}_h'\;\le \; \bar{r}_{h}\). On the other hand, \( \bar{r}_{h}^{{\mathbf {w}}} = \frac{\Vert {\mathbf {r}}_{h}^{{\mathbf {w}}}\Vert }{\sum _i \tilde{z}_{ih} {\sum _j \hat{w}_{jh}}} \), where \({\mathbf {r}}_{h}^{{\mathbf {w}}}\) denotes a \(w_{.h}\)-dimensional vector all of whose elements are equal to the sum of the elements of the hth co-cluster (i.e., \(r_{h1}^{{\mathbf {w}}} = \cdots = r_{hw_{.h}}^{{\mathbf {w}}} = r_h^{{\mathbf {w}}} =\sum _{i,j}\tilde{z}_{ih}\hat{w}_{jh}x_{ij} \)); equivalently, each element of \({\mathbf {r}}_{h}^{{\mathbf {w}}}\) is equal to the sum of the elements of \({\mathbf {r}}_h'\) (i.e., \(r_h^{{\mathbf {w}}} = \sum _j r_{hj}'\)). Hence, using Proposition 3 with dimensionality \(d = w_{.h}\), we have \( \frac{\Vert {\mathbf {r}}_{h}^{{\mathbf {w}}}\Vert }{{\sum _j \hat{w}_{jh}}} \;\le \; \Vert {\mathbf {r}}_{h}'\Vert .\) Thus,

$$\begin{aligned} \bar{r}_h^{{\mathbf {w}}} = \frac{\Vert {\mathbf {r}}_{h}^{{\mathbf {w}}}\Vert }{\sum _i \tilde{z}_{ih} {\sum _j \hat{w}_{jh}}} \;\le \; \bar{r}_h' = \frac{\Vert {\mathbf {r}}_{h}'\Vert }{\sum _i \tilde{z}_{ih}} \;\le \; \bar{r}_h, \text{ thereby }, { \frac{\bar{r}_{h}^{{\mathbf {w}}} d }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h} d }{1 - \left( \bar{r}_{h}\right) ^2}. \end{aligned}$$

We now prove that, \( { \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2}. \)

As \(\bar{r}_{h}^{{\mathbf {w}}} \le \bar{r}_{h}\), we have \( { \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}\right) ^2}, \) so it is sufficient to prove that \( \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}\right) ^2} \;\le \; \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2}. \) We have

$$\begin{aligned} \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}\right) ^2} - \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2} \; = \;\frac{d(\bar{r}_{h}^{{\mathbf {w}}}-\bar{r}_{h}) + (\bar{r}_{h}^3 -(\bar{r}_{h}^{{\mathbf {w}}})^3) }{1 - \left( \bar{r}_{h}\right) ^2}. \end{aligned}$$

Based on Lemma 1 (with \(n = 3\)), for all \(d\ge 3\) we have \( \left| \bar{r}_{h}^3 -(\bar{r}_{h}^{{\mathbf {w}}})^3\right| \;\le \; 3\left| \bar{r}_{h}^{{\mathbf {w}}}-\bar{r}_{h}\right| \;\le \; d\left| \bar{r}_{h}^{{\mathbf {w}}}-\bar{r}_{h}\right| .\) As \(\bar{r}_{h}^{{\mathbf {w}}}\;\le \;\bar{r}_{h}\), it is easy to verify that \(d(\bar{r}_{h}^{{\mathbf {w}}}-\bar{r}_{h}) + (\bar{r}_{h}^3 -(\bar{r}_{h}^{{\mathbf {w}}})^3)\;\le \; 0.\) Since \(1 - \left( \bar{r}_{h}\right) ^2 \;>\; 0\), we have \( \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}\right) ^2} \;\le \; \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2}. \) Hence, using the above inequalities, we get (for all \(d\;\ge \;3\))

$$\begin{aligned} { \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2}. \end{aligned}$$
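Although the result is established analytically above, it can also be sanity-checked numerically: the corrected estimate is increasing in the mean resultant length for \(d\ge 3\), so \(\bar{r}_{h}^{{\mathbf {w}}} \le \bar{r}_{h}\) implies the stated ordering of the two concentration estimates. A minimal sketch (our illustration; the helper name kappa_hat is an assumption):

```python
import numpy as np

def kappa_hat(r_bar, d):
    # Corrected concentration approximation (Banerjee et al. 2005).
    return (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)

rng = np.random.default_rng(1)
for _ in range(1000):
    d = int(rng.integers(3, 10_000))
    r_w, r = np.sort(rng.uniform(0.0, 0.999, size=2))  # ensures r_w <= r < 1
    assert kappa_hat(r_w, d) <= kappa_hat(r, d) + 1e-9
print("kappa_hat(r_w, d) <= kappa_hat(r, d) on all samples.")
```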

A.3 Computational complexity in the worst case

Hereafter, we provide the proof of Proposition 2 (Sect. 6).

Proof

(i) The computational bottleneck of hard-dbmovMF lies in the row and column assignment steps and in the updates of the concentration parameters.

First, we prove that the complexity of both the row and column assignment steps is O(nz). Let \({\mathbf {x}}_i\) denote the ith row of \({\mathbf {X}}\) (i.e., \({\mathbf {x}}_i\) is a d-dimensional vector in \(\mathbb {S}^{d-1}\)). Assume that we look for a co-clustering of \({\mathbf {X}}\) into g co-clusters, and let \(\varvec{\mu }_h^{{\mathbf {w}}}\) be the hth centroid, characterizing the hth co-cluster. The computational cost of the scalar product \((\varvec{\mu }_h^{{\mathbf {w}}})^{T}{\mathbf {x}}_i\) is \(O(x_i^h)\), where \(x_i^h\) is the number of non-zero entries of \({\mathbf {x}}_i\) within the hth column cluster; this complexity holds thanks to the form of \(\varvec{\mu }_h^{{\mathbf {w}}}\), see Eq. (2). Thereby, the complexity of the assignment of \({\mathbf {x}}_i\) is \(O(x_i^*)\), based on \(O(x_i^1+\cdots +x_i^g)\), where \(x_i^*\) denotes the number of non-zero entries of \({\mathbf {x}}_i\). Therefore, the total cost of one row assignment step is O(nz). Similarly, we can show that the cost of one column assignment step is O(nz).

We now prove that the computational cost of updating the concentration parameters is also O(nz). The main computation for updating the hth concentration parameter is that of \(r_h^{{\mathbf {w}}}\), whose cost is \(O(x_h^*)\), where \(x_h^*\) denotes the number of non-zero entries in the hth co-cluster. Thus, the complexity of updating all concentration parameters is O(nz), based on \(O(x_1^*+\cdots +x_g^*)\) and the fact that at most all non-zero entries of the matrix \({\mathbf {X}}\) are contained in the g diagonal co-clusters. Thereby, the total computational complexity of hard-dbmovMF is \(O(it\cdot nz)\).

\(\square \)
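To make the O(nz) argument for the row assignment step concrete, here is a minimal sketch (our illustration, not the authors' released code; the function name assign_rows and its NumPy/SciPy interface are assumptions) that scores each row against the g block-constant centroids while touching only its non-zero entries:

```python
import numpy as np
from scipy.sparse import csr_matrix

def assign_rows(X, w, mu_diag, kappa, log_alpha, log_cd):
    """Hard row-assignment sketch (hypothetical interface).
    X: (n, d) scipy CSR matrix with L2-normalized rows; w[j] in {0,...,g-1}
    is the column cluster of feature j; mu_diag[h] is the common value
    mu_hh of the block-constant centroid; kappa, log_alpha, log_cd hold
    kappa_h, log(alpha_h) and log(c_d(kappa_h)) for each cluster h."""
    n = X.shape[0]
    scores = np.tile(log_alpha + log_cd, (n, 1)).astype(float)
    for i in range(n):
        lo, hi = X.indptr[i], X.indptr[i + 1]
        cols, vals = X.indices[lo:hi], X.data[lo:hi]
        h = w[cols]  # column cluster of each non-zero entry of x_i
        # Accumulate kappa_h * mu_hh * x_ij per block: O(x_i^*) work per row.
        np.add.at(scores[i], h, kappa[h] * mu_diag[h] * vals)
    return scores.argmax(axis=1)  # hard row assignments

# Toy usage: 6 rows, 9 columns, g = 3 column clusters of 3 columns each.
rng = np.random.default_rng(0)
A = rng.random((6, 9)) * (rng.random((6, 9)) < 0.4)
A /= np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1e-12)
z = assign_rows(csr_matrix(A), w=np.repeat([0, 1, 2], 3),
                mu_diag=np.full(3, 1 / np.sqrt(3.0)), kappa=np.full(3, 50.0),
                log_alpha=np.log(np.full(3, 1 / 3)), log_cd=np.zeros(3))
print(z)
```

Each iteration touches only the non-zeros of \({\mathbf {x}}_i\), matching the \(O(x_i^*)\) per-row cost used in the proof.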

Proof

(ii) As for hard-dbmovMF, it is easy to verify that the total cost of the row assignments and of the concentration parameter updates is \(O(it\cdot nz)\) for soft-dbmovMF. We now prove that, in contrast to hard-dbmovMF, the computational cost of the column assignments for soft-dbmovMF is \(O(it\cdot g\cdot nz)\). The computational bottleneck of the column assignment step of soft-dbmovMF lies in the terms \(v_{hj} \leftarrow \sum _i \tilde{z}_{ih}x_{ij}\), \(h\in \{1,\ldots ,g\}\), \(j\in \{1,\ldots ,J\}\). The cost of computing \(v_{hj}\) is \(O(x_j^*)\), where \(x_j^*\) is the number of non-zero entries in the jth column. During the column assignment step, \(v_{hj}\) is computed for each cluster h and each column j; hence the computational cost of this step is \(O(g\cdot nz)\). Therefore, the total computational cost of soft-dbmovMF is \(O(it\cdot g\cdot nz)\), based on \(O(it\cdot nz\cdot (g+1))\). \(\square \)
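A minimal sketch of this bottleneck (our illustration; the function name column_statistics is an assumption) computes all the \(v_{hj}\) at once as a sparse-dense product, which makes the \(O(g\cdot nz)\) cost apparent: every non-zero \(x_{ij}\) is combined with the g posteriors \(\tilde{z}_{ih}\) of its row.

```python
import numpy as np
from scipy.sparse import csr_matrix

def column_statistics(Z_tilde, X):
    """Compute v_hj = sum_i z_tilde_ih * x_ij for all clusters h and columns j.
    Z_tilde: dense (n, g) array of row posteriors; X: (n, d) CSR matrix.
    Every non-zero of X contributes to g entries of the result: O(g * nz)."""
    return np.asarray(X.T @ Z_tilde).T  # shape (g, d)

# Toy usage: 4 rows, 5 columns, g = 2 row clusters.
rng = np.random.default_rng(0)
X = csr_matrix(rng.random((4, 5)) * (rng.random((4, 5)) < 0.4))
Z_tilde = rng.dirichlet(np.ones(2), size=4)  # soft row assignments
print(column_statistics(Z_tilde, X).shape)   # (2, 5)
```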

Cite this article

Salah, A., Nadif, M. Directional co-clustering. Adv Data Anal Classif 13, 591–620 (2019). https://doi.org/10.1007/s11634-018-0323-4
