Abstract
Co-clustering addresses the problem of simultaneously clustering both dimensions of a data matrix. When dealing with high-dimensional sparse data, co-clustering turns out to be more beneficial than one-sided clustering, even if one is interested in clustering along one dimension only. Aside from being high dimensional and sparse, some datasets, such as document-term matrices, exhibit directional characteristics, and the \(L_2\) normalization of such data, so that it lies on the surface of a unit hypersphere, is useful. Popular co-clustering models based on Gaussian or multinomial assumptions are inadequate for this type of data. In this paper, we extend the scope of co-clustering to directional data. We present Diagonal Block Mixture of von Mises–Fisher distributions (dbmovMFs), a co-clustering model which is well suited for directional data lying on a unit hypersphere. By deriving the parameter estimates under both the maximum likelihood (ML) and classification ML approaches, we develop a class of EM algorithms for estimating dbmovMFs from data. Extensive experiments on several real-world datasets confirm the advantage of our approach and demonstrate the effectiveness of our algorithms.
Notes
All algorithms, except stochastic variants, provide substantially better results when they are initialized using Skmeans (in almost all situations).
References
Abramowitz M, Stegun IA (1964) Handbook of mathematical functions: with formulas, graphs, and mathematical tables, vol 55. Courier Corporation, North Chelmsford
Ailem M, Role F, Nadif M (2016) Graph modularity maximization as an effective method for co-clustering text data. Knowl Based Syst 109:160–173
Ailem M, Role F, Nadif M (2017a) Model-based co-clustering for the effective handling of sparse data. Pattern Recognit 72:108–122
Ailem M, Role F, Nadif M (2017b) Sparse Poisson latent block model for document clustering. IEEE Trans Knowl Data Eng 29(7):1563–1576
Akaike H (1998) Information theory and an extension of the maximum likelihood principle. In: Parzen E, Tanabe K, Kitagawa G (eds) Selected papers of Hirotugu Akaike. Springer, New York, pp 199–213
Banerjee A, Dhillon IS, Ghosh J, Sra S (2005) Clustering on the unit hypersphere using von Mises–Fisher distributions. J Mach Learn Res 6:1345–1382
Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821
Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE TPAMI 22(7):719–725
Bock HH (1979) Simultaneous clustering of objects and variables. In: Tomassone R (ed) Analyse des Données et Informatique. INRIA, Le Chesnay, pp 187–203
Bock HH (1994) Information and entropy in cluster analysis. In: Bozdogan H et al (eds) Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: an informational approach. Springer, Dordrecht, pp 115–147
Bozdogan H (2000) Akaike’s information criterion and recent developments in information complexity. J Math Psychol 44(1):62–91
Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat Q 2(1):73–82
Celeux G, Diebolt J (1992) A stochastic approximation type EM algorithm for the mixture problem. Stoch Int J Probab Stoch Process 41(1–2):119–134
Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
Deodhar M, Ghosh J (2010) Scoal: a framework for simultaneous co-clustering and learning from complex data. ACM Trans Knowl Discov Data 4(3):11
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: ACM SIGKDD, pp 269–274
Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42(1–2):143–175
Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: ACM SIGKDD, pp 89–98. ACM
Gopal S, Yang Y (2014) Von Mises–Fisher clustering models. In: ICML, pp 154–162
Govaert G (1995) Simultaneous clustering of rows and columns. Control Cybern 24:437–458
Govaert G, Nadif M (2013) Co-Clustering. Wiley, New York
Govaert G, Nadif M (2016) Mutual information, phi-squared and model-based co-clustering for contingency tables. Adv Data Anal Classif, pp 1–34
Hanczar B, Nadif M (2010) Bagging for biclustering: application to microarray data. In: ECML/PKDD, pp 490–505
Hartigan JA (1975) Clustering algorithms. Wiley, New York
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218
Labiod L, Nadif M (2011) Co-clustering for binary and categorical data with maximum modularity. In: ICDM, pp 1140–1145
Laclau C, Nadif M (2017) Diagonal latent block model for binary data. Stat Comput 27(5):1145–1163
Li T (2005) A general model for clustering binary data. In: SIGKDD, pp 188–197
Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM TCBB 1(1):24–45
Mardia KV, Jupp PE (2000) Directional statistics. Wiley series in probability and statistics. Wiley, New York
McLachlan G, Krishnan T (2007) The EM algorithm and extensions. Wiley, New York
McLachlan G, Peel D (2004) Finite mixture models. Wiley, New York
Nadif M, Govaert G (2010) Model-based co-clustering for continuous data. In: ICMLA, pp 175–180
Reisinger J, Waters A, Silverthorn B, Mooney RJ (2010) Spherical topic models. In: ICML, pp 903–910
Salah A, Nadif M (2017) Social regularized von Mises–Fisher mixture model for item recommendation. Data Min Knowl Discov 31:1–24
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. JMLR 3:583–617
van Dijk B, van Rosmalen J, Paap R (2009) A Bayesian approach to two-mode clustering. Econometric Institute, Erasmus University Rotterdam, Report no EI 2009-06, pp 1–26
Van Mechelen I, Bock HH, De Boeck P (2004) Two-mode clustering methods: a structured overview. Stat Methods Med Res 13(5):363–394
Vichi M (2001) Double k-means clustering for simultaneous classification of objects and variables. In: Borra S, Rocci R, Vichi M, Schader M (eds) Advances in classification and data analysis. Springer, Berlin, Heidelberg, pp 43–52
Wyse J, Friel N (2012) Block clustering with collapsed latent block models. Stat Comput 22(2):415–428
Zhong S, Ghosh J (2005) Generative model-based document clustering: a comparative study. Knowl Inf Syst 8(3):374–384
Additional information
Availability: All algorithms presented in the paper are available at https://github.com/dbmovMFs/DirecCoclus.
Appendices
A. Appendix
A.1 Maximum likelihood estimate
The expectation of the classification log-likelihood of dbmovMFs is given by \( \sum _{h}\tilde{z}_{.h}\log \alpha _h + \sum _{h}\tilde{z}_{.h}\log c_d(\kappa _h) + \sum _{h,i,j}\tilde{z}_{ih}w_{jh}\kappa _h\mu _{hh}x_{ij} \) (9)
where \(\tilde{z}_{.h} = \sum _i \tilde{z}_{ih}\). We first maximize the expectation of the classification log-likelihood (9) with respect to \(\alpha _h\), subject to the constraint \(\sum _h\alpha _h = 1\). Up to terms which are not functions of \(\alpha _h\), the corresponding Lagrangian is given by \(L(\varvec{\alpha },\lambda )= \sum _{h}{\tilde{z}_{.h}\log {\alpha _h}} + \lambda (1 - \sum _h \alpha _h). \) Taking the derivative with respect to \(\alpha _h\), we obtain \( \frac{\partial L(\varvec{\alpha },\lambda )}{\partial \alpha _h} = \frac{\tilde{z}_{.h}}{\alpha _h} - \lambda . \) Setting this derivative to zero leads to \(\tilde{z}_{.h} = \lambda \alpha _h\). Summing both sides over all h yields \(\lambda = n\), so the maximizing value of the parameter \(\alpha _h\) is given by \( \hat{\alpha }_h = \frac{\tilde{z}_{.h}}{n}. \) In the same manner, to maximize expectation (9) with respect to \(\varvec{\mu }_h^{{\mathbf {w}}}\), subject to the constraint \(||\varvec{\mu }_h^{{\mathbf {w}}}|| = 1\), we form the corresponding Lagrangian by isolating the terms depending on \(\varvec{\mu }_h^{{\mathbf {w}}}\); this leads to
Taking the derivative with respect to \(\mu _{hh}\), we obtain \( \frac{\partial L}{\partial \mu _{hh}} = \sum _{i,j}{\tilde{z}_{ih}w_{jh}\kappa _h x_{ij}} - 2\lambda w_{.h}\mu _{hh}, \)
where \(w_{.h} = \sum _j w_{jh}\). Setting this derivative to zero, we obtain \(\lambda \mu _{hh} = \frac{\sum _{i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h x_{ij}}}}{2w_{.h}}\), hence \(\lambda ^2\mu _{hh}^2 = \frac{(\sum _{i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h x_{ij}}})^2}{4w_{.h}^2}.\) Multiplying both sides by \(w_{.h}\) yields \( \lambda ^2w_{.h}\mu _{hh}^2 = \frac{w_{.h}(\sum _{i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h x_{ij}}})^2}{4w_{.h}^2}. \) Since \(w_{.h}\mu _{hh}^2 =1\), we have \(\lambda = \kappa _h\frac{\sqrt{w_{.h} (\sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}})^2}}{2 w_{.h}} \), i.e., \(\lambda = \kappa _h \frac{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert }{2 w_{.h}}\), where \({\mathbf {r}}_h^{{\mathbf {w}}}\) is a d-dimensional vector defined, for \(j' = 1,\ldots ,d\), by \(r_{hj'}^{{\mathbf {w}}} = r_{h}^{{\mathbf {w}}} = \sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}\) if \(w_{j'h} = 1\), and \(r_{hj'}^{{\mathbf {w}}} = 0\) otherwise. Hence, the maximizing value of the parameter \(\mu _{hh}\) is given by \(\hat{\mu }_{hh} = \frac{r_h^{{\mathbf {w}}}}{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert } = \pm \frac{1}{\sqrt{w_{.h}}}\) (Eq. 10), according to whether \(r_h^{{\mathbf {w}}} = \sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}\) is positive or negative.
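To make the closed form concrete, here is a minimal numerical sketch in Python (not the authors' released code; the names `X`, `z_h`, `w_h` and the toy values are illustrative assumptions) of the estimate \(\hat{\mu }_{hh}\):

```python
import numpy as np

# Illustrative toy quantities for a single co-cluster h.
X = np.array([[0.6, 0.8, 0.0],
              [0.8, 0.6, 0.0]])      # rows of X, assumed L2-normalized
z_h = np.array([1.0, 1.0])           # row memberships z_ih for cluster h
w_h = np.array([1.0, 1.0, 0.0])      # column-cluster indicator w_jh

# r_h^w: the co-cluster sum, replicated on the active columns.
r_h = float(z_h @ X @ w_h)           # sum_{i,j} z_ih w_jh x_ij
r_vec = r_h * w_h                    # the d-dimensional vector r_h^w
mu_hh = r_h / np.linalg.norm(r_vec)  # = sign(r_h) / sqrt(w_.h)
```

Note that the constraint \(||\varvec{\mu }_h^{{\mathbf {w}}}|| = 1\) holds by construction, since \(w_{.h}\hat{\mu }_{hh}^2 = 1\).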
Next we concentrate on maximizing Eq. (9) with respect to the concentration parameters \(\kappa _h\), subject to the constraint \(\kappa _h>0, \;\forall h\). Up to terms which do not contain \(\kappa _h\), the Lagrangian is given by \( L(\kappa )= \sum _{h}{\tilde{z}_{.h}\log (c_d(\kappa _h))} + \sum _{h,i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h\hat{\mu }_{hh} x_{ij}}}. \) Note that, by the KKT conditions, the Lagrange multiplier for the constraint \(\kappa _h>0\) has to be equal to zero. Taking the partial derivative of Eq. (8) with respect to \(\kappa _h\), we obtain
Setting this derivative equal to zero, leads to: \(\frac{c_d'(\kappa _h)}{c_d(\kappa _h)} = - \frac{\hat{\mu }_{hh}\times \sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}}{\tilde{z}_{.h}}.\) Replacing \(\hat{\mu }_{hh}\) by \(\frac{\sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}}{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert }\) (see, Eq. 10), we obtain\( \frac{c_d'(\kappa _h)}{c_d(\kappa _h)} = - \frac{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert }{\tilde{z}_{.h}\hat{w}_{.h}}.\) Let \(s = d/2-1\), then:
Hence, we obtain
The latter Eq. (12) arises from the use of the following recurrence formula (Abramowitz and Stegun 1964, p. 376): \( \kappa _h I_{s+1}(\kappa _h)= \kappa _h I_s'(\kappa _h)- sI_s(\kappa _h). \) Note that computing the maximizing value \(\hat{\kappa }_h\) from Eq. (11) requires inverting a ratio of Bessel functions, a problem for which no closed-form solution exists. Thus, following Banerjee et al. (2005), we derive an accurate approximation of the concentration parameter using the following continued-fraction formula:
Letting \( \bar{r}_h^{{\mathbf {w}}} = \frac{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert }{\tilde{z}_{.h}\hat{w}_{.h}} = \frac{I_{d/2}(\kappa _h)}{I_{d/2-1}(\kappa _h)}\) and using Eq. (13), we obtain \(\frac{1}{\bar{r}_h^{{\mathbf {w}}}} \approx \frac{d}{\kappa _h} + \bar{r}_h^{{\mathbf {w}}} \), which yields the following approximation: \(\hat{\kappa }_h = \frac{d\bar{r}_h^{{\mathbf {w}}}}{1-(\bar{r}_h^{{\mathbf {w}}})^2}.\) Finally, Banerjee et al. (2005) showed empirically that adding the correction term \(\frac{-(\bar{r}_h^{{\mathbf {w}}})^3}{1-(\bar{r}_h^{{\mathbf {w}}})^2}\) results in a better approximation of \(\hat{\kappa }_h\), which leads to: \(\hat{\kappa }_h = \frac{d\bar{r}_h^{{\mathbf {w}}} - (\bar{r}_h^{{\mathbf {w}}})^3}{1-(\bar{r}_h^{{\mathbf {w}}})^2}. \)
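The corrected approximation above can be sketched as a one-line Python function (an illustrative implementation of the stated formula, not the authors' released code):

```python
import numpy as np

def approx_kappa(r_bar: float, d: int) -> float:
    """Approximate concentration parameter, following Banerjee et al. (2005):
    kappa_h ~ (d * r_bar - r_bar**3) / (1 - r_bar**2), for r_bar in (0, 1)."""
    return (d * r_bar - r_bar**3) / (1 - r_bar**2)
```

Here `r_bar` plays the role of \(\bar{r}_h^{{\mathbf {w}}}\); the estimate grows without bound as \(\bar{r}_h^{{\mathbf {w}}} \rightarrow 1\), reflecting increasingly concentrated clusters.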
A.2 Concentration parameters: dbmovMFs versus movMFs
Hereafter, we provide the proofs of Proposition 1 and Theorem 1. The following proposition is needed in both proofs.
Proposition 3
Let \({\mathbf {r}}\) be a non-zero vector in \(\mathbb {R}^d\) (i.e., \({\mathbf {r}}= (r_1, \ldots ,r_d)^\top \), with \(d\ge 1\)). Let \({\mathbf {r}}^d\) be the vector in \(\mathbb {R}^d\) all of whose components are equal to the sum of the elements of \({\mathbf {r}}\) (i.e., \({\mathbf {r}}^d = (\sum _j r_j)\,{\mathbf {1}}\), where \({\mathbf {1}}\) denotes the constant one vector). Then \(\frac{\Vert {\mathbf {r}}^d\Vert }{d} \;\le \; \Vert {\mathbf {r}}\Vert \), with equality only if all components of \({\mathbf {r}}\) are equal, i.e., \(r_1= \cdots = r_d\).
Proof
Let \({\mathbf {d}}\) and \({\mathbf {r}}^+\) be two vectors in \(\mathbb {R}^d\) defined as follows: \({\mathbf {d}} = \frac{1}{\sqrt{d}}{\mathbf {1}}\) and \( r_j^+\; = \; |r_j|, \;\;\; \forall j \in \{1,\ldots ,d\}. \) We have \( \frac{\Vert {\mathbf {r}}^d\Vert }{d} = \frac{|\sum _j r_j|}{\sqrt{d}} \;\le \; \frac{\sum _j |r_j|}{\sqrt{d}} = \langle {\mathbf {d}},{\mathbf {r}}^+\rangle = \Vert {\mathbf {d}}\Vert \Vert {\mathbf {r}}^+\Vert \cos ({\mathbf {d}},{\mathbf {r}}^+). \) By definition of \({\mathbf {r}}^+\) and \({\mathbf {d}}\), we have \(\Vert {\mathbf {r}}^+\Vert = \Vert {\mathbf {r}}\Vert \) and \(\Vert {\mathbf {d}}\Vert = 1\), hence \( \frac{\Vert {\mathbf {r}}^d\Vert }{d} \;\le \;\Vert {\mathbf {r}}\Vert \cos ({\mathbf {d}},{\mathbf {r}}^+). \) Since both \({\mathbf {d}}\) and \({\mathbf {r}}^+\) are non-zero vectors and lie in the first orthant of a d-dimensional unit hypersphere, by dividing both sides of the above inequality by \(\Vert {\mathbf {r}}\Vert \), we get \( 0\;\le \;\frac{\Vert {\mathbf {r}}^d\Vert }{d\Vert {\mathbf {r}}\Vert } \;\le \;\cos ({\mathbf {d}},{\mathbf {r}}^+) \;\le \;1. \) The right-hand side equality holds only if \({\mathbf {d}}\) and \({\mathbf {r}}^+\) are collinear, thereby all components of \({\mathbf {r}}\) are equal (i.e., \(r_1 = \cdots =r_d\)). We now prove Proposition 1. \(\square \)
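Proposition 3 is easy to check numerically; the following Python sketch (illustrative, not from the paper) verifies the inequality on random vectors and the equality case:

```python
import numpy as np

rng = np.random.default_rng(0)
# Inequality ||r^d|| / d <= ||r|| on random vectors of varying dimension.
for _ in range(100):
    d = int(rng.integers(1, 10))
    r = rng.normal(size=d)            # arbitrary vector in R^d
    r_d = np.full(d, r.sum())         # r^d: each component = sum of r
    assert np.linalg.norm(r_d) / d <= np.linalg.norm(r) + 1e-12

# Equality case: all components of r equal.
r = np.full(4, 2.5)
r_d = np.full(4, r.sum())
assert np.isclose(np.linalg.norm(r_d) / 4, np.linalg.norm(r))
```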
Proof of Proposition 1
Based on Proposition 3 and the following inequality: \(\Vert {\mathbf {r}}\Vert = \Vert p_1{\mathbf {x}}_1 + \cdots + p_n{\mathbf {x}}_n\Vert \le \Vert p_1{\mathbf {x}}_1\Vert + \cdots + \Vert p_n{\mathbf {x}}_n\Vert = \sum _i p_i\), it is straightforward to verify that
In the following, we prove Theorem 1, which states that for a given row clustering \({\mathbf {z}}\), dbmovMFs-based algorithms lead to a concentration parameter that is less than or equal to that of movMFs-based algorithms, for each cluster, whatever the column partition \({\mathbf {w}}\). The following lemma will be useful in the proof of Theorem 1. \(\square \)
Lemma 1
Let a and b be two real numbers in the interval [0, 1] (i.e., \(0\le a \le 1\) and \(0\le b \le 1\)). Then for every natural number \(n\ge 2\), \( \left| a^n - b^n\right| \;\le \; n\left| a-b\right| \), with equality only if \(a = b\).
Proof
For every natural number \(n\ge 2\), we can use the well-known factorization \( a^n - b^n \; = \; (a-b)\sum _{k=0}^{n-1}a^{n-1-k}b^k. \) Since both a and b are non-negative, we have \( \left| a^n - b^n\right| \; = \; \left| a-b\right| \sum _{k=0}^{n-1}a^{n-1-k}b^k. \) As both a and b are in [0, 1], we have \( \sum _{k=0}^{n-1}a^{n-1-k}b^k\; \le \; n, \) thereby \( \left| a^n - b^n\right| \; = \;\left| a-b\right| \sum _{k=0}^{n-1}a^{n-1-k}b^k\; \le \; n\left| a-b\right| . \) \(\square \)
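The bound of Lemma 1 can be sanity-checked numerically; this short Python sketch (illustrative, not from the paper) tests it on random pairs in [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(1)
# Verify |a^n - b^n| <= n * |a - b| for random a, b in [0, 1] and small n.
for a, b in rng.uniform(0.0, 1.0, size=(1000, 2)):
    for n in range(2, 7):
        assert abs(a**n - b**n) <= n * abs(a - b) + 1e-12
```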
Proof of Theorem 1
We first prove that \( { \frac{\bar{r}_{h}^{{\mathbf {w}}} d }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h} d }{1 - \left( \bar{r}_{h}\right) ^2} \), which corresponds to the approximation of the concentration parameters under dbmovMFs and movMFs without the correction term. Since both \(\bar{r}_{h}^{{\mathbf {w}}}\) and \(\bar{r}_{h}\) lie in \(]0, 1[\), it is straightforward to verify that if \(\bar{r}_{h}^{{\mathbf {w}}} \;\le \; \bar{r}_{h}\), then the above inequality always holds. Thus, in what follows, we prove that \(\bar{r}_{h}^{{\mathbf {w}}} \;\le \; \bar{r}_{h}\). \(\square \)
By definition, we have \( \bar{r}_h = \frac{\Vert {\mathbf {r}}_h\Vert }{\sum _i \tilde{z}_{ih}}\quad \text{ where } \quad {\mathbf {r}}_h = \sum _i \tilde{z}_{ih}{\mathbf {x}}_i \). Now, let us define \(\bar{r}_{h}'\) as \(\bar{r}_{h}' = \frac{\Vert {\mathbf {r}}_h'\Vert }{\sum _i \tilde{z}_{ih}}\), where \({\mathbf {r}}_h'\) denotes the restriction of \({\mathbf {r}}_h\) to the columns of the hth column cluster.
Since \({\mathbf {r}}_h'\) is a \(w_{.h}\)-dimensional sub-vector of \({\mathbf {r}}_h\), it follows from the above definition that \(\bar{r}_h'\;\le \; \bar{r}_{h}\). On the other hand, \( \bar{r}_{h}^{{\mathbf {w}}} = \frac{\Vert {\mathbf {r}}_{h}^{{\mathbf {w}}}\Vert }{\sum _i \tilde{z}_{ih} {\sum _j \hat{w}_{jh}}} \), where \({\mathbf {r}}_{h}^{{\mathbf {w}}}\) denotes a \(w_{.h}\)-dimensional vector whose elements are all equal to the sum of the elements of the hth co-cluster (i.e., \(r_{h1}^{{\mathbf {w}}} = \cdots = r_{hw_{.h}}^{{\mathbf {w}}} = r_h^{{\mathbf {w}}} =\sum _{i,j}\tilde{z}_{ih}\hat{w}_{jh}x_{ij} \)). Similarly, each element of \({\mathbf {r}}_{h}^{{\mathbf {w}}}\) is equal to the sum of the elements of \({\mathbf {r}}_h'\) (i.e., \(r_h^{{\mathbf {w}}} = \sum _j r_{hj}'\)). Hence, using Proposition 3 with dimensionality \(d = w_{.h}\), we have \( \frac{\Vert {\mathbf {r}}_{h}^{{\mathbf {w}}}\Vert }{{\sum _j \hat{w}_{jh}}} \;\le \; \Vert {\mathbf {r}}_{h}'\Vert .\) Thus,
We now prove that, \( { \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2}. \)
As \(\bar{r}_{h}^{{\mathbf {w}}} \le \bar{r}_{h}\), we have \( { \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}\right) ^2}; \) it is therefore sufficient to prove that \( \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}\right) ^2} \;\le \; \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2}. \) We have
Based on Lemma 1, for all \(d\ge 3\) we have \( \left| \bar{r}_{h}^3 -(\bar{r}_{h}^{{\mathbf {w}}})^3\right| \;\le \; d\left| \bar{r}_{h}^{{\mathbf {w}}}-\bar{r}_{h}\right| .\) As \(\bar{r}_{h}^{{\mathbf {w}}}\;\le \;\bar{r}_{h}\), it is easy to verify that \(d(\bar{r}_{h}^{{\mathbf {w}}}-\bar{r}_{h}) + (\bar{r}_{h}^3 -(\bar{r}_{h}^{{\mathbf {w}}})^3)\;\le \; 0.\) Since \(1 - \left( \bar{r}_{h}\right) ^2 \;>\; 0\), we have \( \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}\right) ^2} \;\le \; \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2}. \) Hence, using the above inequalities, we get (for all \(d\;\ge \;3\))
A.3 Computational complexity in the worst case
Hereafter, we provide the proof of Proposition 2 (Sect. 6).
Proof
(i) The computational bottleneck of hard-dbmovMF lies in the row and column assignments and the updates of the concentration parameters.
First, we prove that the complexity of both the row and column assignments is O(nz). Let \({\mathbf {x}}_i\) denote the ith row of \({\mathbf {X}}\) (i.e., \({\mathbf {x}}_i\) is a d-dimensional vector in \(\mathbb {S}^{d-1}\)). Assume that we seek a co-clustering of \({\mathbf {X}}\) into g co-clusters, and let \(\varvec{\mu }_h^{{\mathbf {w}}}\) be the hth centroid, characterizing the hth co-cluster. The computational cost of the scalar product \((\varvec{\mu }_h^{{\mathbf {w}}})^{T}{\mathbf {x}}_i\) is \(O(x_i^h)\), where \(x_i^h\) is the number of non-zero entries of \({\mathbf {x}}_i\) within the hth column cluster; this complexity holds thanks to the form of \(\varvec{\mu }_h^{{\mathbf {w}}}\) (see Eq. (2)). Thereby, the complexity of assigning \({\mathbf {x}}_i\) is \(O(x_i^*)\) (based on \(O(x_i^1+\cdots +x_i^g)\)), where \(x_i^*\) denotes the number of non-zero entries of \({\mathbf {x}}_i\). Therefore, the total cost of one row assignment step is O(nz). Similarly, we can show that the cost of one column assignment step is O(nz).
We now prove that the computational cost of updating the concentration parameters is also O(nz). The main computation in updating the hth concentration parameter is that of \(r_h^{{\mathbf {w}}}\), whose cost is \(O(x_h^*)\), where \(x_h^*\) is the number of non-zero entries in the hth co-cluster. Thus, the complexity of updating all concentration parameters is O(nz), based on \(O(x_1^*+\cdots +x_g^*)\) and the fact that at most all non-zero entries of the matrix \({\mathbf {X}}\) are contained in the g diagonal co-clusters. Thereby, the total computational complexity of hard-dbmovMF is \(O(it\cdot nz)\).
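The O(nz) argument can be illustrated with sparse-matrix operations; the following Python sketch (illustrative, with hypothetical toy sizes, not the authors' code) computes one row-assignment pass as a sparse product whose work is proportional to the number of non-zeros of \({\mathbf {X}}\):

```python
import numpy as np
from scipy.sparse import random as sparse_rand

# With X in CSR format, one row-assignment pass reduces to the sparse
# product X @ M, whose cost is proportional to nz (non-zeros of X).
n, d, g = 100, 50, 4
X = sparse_rand(n, d, density=0.05, format="csr", random_state=0)
W = np.zeros((d, g))
W[np.arange(d), np.arange(d) % g] = 1.0   # toy column partition w
M = W / np.sqrt(W.sum(axis=0))            # centroids: 1/sqrt(w_.h) on cluster h
scores = np.asarray(X @ M)                # n x g similarities, O(nz) work
rows = scores.argmax(axis=1)              # new row assignments
```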
\(\square \)
Proof
(ii) As for hard-dbmovMF, it is easy to verify that the total cost of the row assignments and concentration parameter updates is \(O(it\cdot nz)\) for soft-dbmovMF. We now prove that, in contrast to hard-dbmovMF, the computational cost of the column assignments for soft-dbmovMF is \(O(it\cdot g\cdot nz)\). The computational bottleneck of the column assignment step of soft-dbmovMF is the computation of the terms \(v_{hj} \leftarrow \sum _i \tilde{z}_{ih}x_{ij}\), \(h\in \{1,\ldots ,g\}\), \(j\in \{1,\ldots ,J\}\). The cost of computing \(v_{hj}\) is \(O(x_j^*)\), where \(x_j^*\) is the number of non-zero entries in the jth column. During the column assignment step, we compute \(v_{hj}\) for each cluster h and column j; hence the computational cost of this step is \(O(g\cdot nz)\). Therefore, the total computational cost of soft-dbmovMF is \(O(it\cdot g\cdot nz)\), based on \(O(it\cdot nz\cdot (g+1))\). \(\square \)
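The \(v_{hj}\) terms above can be computed in one pass as a matrix product; this Python sketch (illustrative, with hypothetical toy sizes, not the authors' code) shows the \(O(g\cdot nz)\) computation with a sparse \({\mathbf {X}}\):

```python
import numpy as np
from scipy.sparse import random as sparse_rand

# All terms v_hj = sum_i z~_ih x_ij at once: the product Z^T X,
# which costs O(g * nz) when X is sparse.
n, d, g = 100, 50, 4
X = sparse_rand(n, d, density=0.05, format="csr", random_state=0)
Z = np.random.default_rng(0).dirichlet(np.ones(g), size=n)  # soft memberships
V = (X.T @ Z).T                           # g x d matrix of the v_hj terms
```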
Salah, A., Nadif, M. Directional co-clustering. Adv Data Anal Classif 13, 591–620 (2019). https://doi.org/10.1007/s11634-018-0323-4