Directional co-clustering

Regular Article · Advances in Data Analysis and Classification

Abstract

Co-clustering addresses the problem of simultaneous clustering of both dimensions of a data matrix. When dealing with high dimensional sparse data, co-clustering turns out to be more beneficial than one-sided clustering even if one is interested in clustering along one dimension only. Aside from being high dimensional and sparse, some datasets, such as document-term matrices, exhibit directional characteristics, and the \(L_2\) normalization of such data, so that it lies on the surface of a unit hypersphere, is useful. Popular co-clustering assumptions, such as the Gaussian or the Multinomial, are inadequate for this type of data. In this paper, we extend the scope of co-clustering to directional data. We present Diagonal Block Mixture of Von Mises–Fisher distributions (dbmovMFs), a co-clustering model which is well suited for directional data lying on a unit hypersphere. By deriving the model parameter estimates under the maximum likelihood (ML) and classification ML approaches, we develop a class of EM algorithms for estimating dbmovMFs from data. Extensive experiments, on several real-world datasets, confirm the advantage of our approach and demonstrate the effectiveness of our algorithms.


Notes

  1. https://pypi.python.org/pypi/coclust.

  2. http://www.dataminingresearch.com/.

  3. http://www.cad.zju.edu.cn/home/dengcai/.

  4. All algorithms, except stochastic variants, provide substantially better results when they are initialized using Skmeans (in almost all situations).

  5. For presentation purposes we omit CEM, EM\(_{\mathbf {b}}\) and CEM\(_{\mathbf {b}}\) from Fig. 4, but from Tables 2 and 3 it is straightforward to verify that our comments still hold for these algorithms.

References

  • Abramowitz M, Stegun IA (1964) Handbook of mathematical functions: with formulas, graphs, and mathematical tables, vol 55. Courier Corporation, North Chelmsford

  • Ailem M, Role F, Nadif M (2016) Graph modularity maximization as an effective method for co-clustering text data. Knowl Based Syst 109:160–173

  • Ailem M, Role F, Nadif M (2017a) Model-based co-clustering for the effective handling of sparse data. Pattern Recognit 72:108–122

  • Ailem M, Role F, Nadif M (2017b) Sparse poisson latent block model for document clustering. IEEE Trans Knowl Data Eng 29(7):1563–1576

  • Akaike H (1998) Information theory and an extension of the maximum likelihood principle. In: Parzen E, Tanabe K, Kitagawa G (eds) Selected papers of Hirotugu Akaike. Springer, New York, pp 199–213

  • Banerjee A, Dhillon IS, Ghosh J, Sra S (2005) Clustering on the unit hypersphere using von Mises–Fisher distributions. J Mach Learn Res 6:1345–1382

  • Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821

  • Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE TPAMI 22(7):719–725

  • Bock HH (1979) Simultaneous clustering of objects and variables. In: Tomassone R (ed) Analyse des Données et Informatique. INRIA, Le Chesnay, pp 187–203

  • Bock HH (1994) Information and entropy in cluster analysis. In: Bozdogan H et al (eds) Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: an informational approach. Springer, Dordrecht, pp 115–147

  • Bozdogan H (2000) Akaike’s information criterion and recent developments in information complexity. J Math Psychol 44(1):62–91

  • Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat Q 2(1):73–82

  • Celeux G, Diebolt J (1992) A stochastic approximation type EM algorithm for the mixture problem. Stoch Int J Probab Stoch Process 41(1–2):119–134

  • Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38

  • Deodhar M, Ghosh J (2010) Scoal: a framework for simultaneous co-clustering and learning from complex data. ACM Trans Knowl Discov Data 4(3):11

  • Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: ACM SIGKDD, pp 269–274

  • Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42(1–2):143–175

  • Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: ACM SIGKDD, pp 89–98. ACM

  • Gopal S, Yang Y (2014) Von Mises–Fisher clustering models. In: ICML, pp 154–162

  • Govaert G (1995) Simultaneous clustering of rows and columns. Control Cybern 24:437–458

  • Govaert G, Nadif M (2013) Co-Clustering. Wiley, New York

  • Govaert G, Nadif M (2016) Mutual information, phi-squared and model-based co-clustering for contingency tables. Adv Data Anal Classif, pp 1–34

  • Hanczar B, Nadif M (2010) Bagging for biclustering: application to microarray data. In: ECML/PKDD, pp 490–505

  • Hartigan JA (1975) Clustering algorithms, 99th edn. Wiley, New York

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218

  • Labiod L, Nadif M (2011) Co-clustering for binary and categorical data with maximum modularity. In: ICDM, pp 1140–1145

  • Laclau C, Nadif M (2017) Diagonal latent block model for binary data. Stat Comput 27(5):1145–1163

  • Li T (2005) A general model for clustering binary data. In: SIGKDD, pp 188–197

  • Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM TCBB 1(1):24–45

  • Mardia KV, Jupp PE (2000) Directional statistics. Wiley series in probability and statistics. Wiley, New York

  • McLachlan G, Krishnan T (2007) The EM algorithm and extensions. Wiley, New York

  • McLachlan G, Peel D (2004) Finite mixture models. Wiley, New York

  • Nadif M, Govaert G (2010) Model-based co-clustering for continuous data. In: ICMLA, pp 175–180

  • Reisinger J, Waters A, Silverthorn B, Mooney RJ (2010) Spherical topic models. In: ICML, pp 903–910

  • Salah A, Nadif M (2017) Social regularized von Mises–Fisher mixture model for item recommendation. Data Min Knowl Discov 31:1–24

  • Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  • Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. JMLR 3:583–617

  • van Dijk B, van Rosmalen J, Paap R (2009) A Bayesian approach to two-mode clustering. Econometric Institute, Erasmus University Rotterdam, Report no EI 2009-06, pp 1–26

  • Van Mechelen I, Bock HH, De Boeck P (2004) Two-mode clustering methods: a structured overview. Stat Methods Med Res 13(5):363–394

  • Vichi M (2001) Double k-means clustering for simultaneous classification of objects and variables. In: Borra S, Rocci R, Vichi M, Schader M (eds) Advances in classification and data analysis. Springer, Berlin, Heidelberg, pp 43–52

  • Wyse J, Friel N (2012) Block clustering with collapsed latent block models. Stat Comput 22(2):415–428

  • Zhong S, Ghosh J (2005) Generative model-based document clustering: a comparative study. Knowl Inf Syst 8(3):374–384

Author information

Corresponding author

Correspondence to Aghiles Salah.

Additional information

Availability: All algorithms presented in the paper are available at https://github.com/dbmovMFs/DirecCoclus.

Appendices

A. Appendix

A.1 Maximum likelihood estimate

The expectation of the classification log-likelihood of dbmovMFs is given by

$$\begin{aligned} \mathbb {E}[L_c(\varTheta |\mathcal {X},{\mathbf {Z}})] = \sum _{h}{\tilde{z}_{.h}\log {\alpha _h}} + \sum _{h}{\tilde{z}_{.h}\log (c_d(\kappa _h))} + \sum _{h,i,j}{\tilde{z}_{ih}w_{jh}\kappa _h\mu _{hh} x_{ij}} \end{aligned}$$
(9)

where \(\tilde{z}_{.h} = \sum _i \tilde{z}_{ih}\). We first maximize the expectation of the classification log-likelihood (9) with respect to \(\alpha _h\), subject to the constraint \(\sum _h\alpha _h = 1\). The corresponding Lagrangian, up to terms which are not functions of \(\alpha _h\), is given by \(L(\varvec{\alpha },\lambda )= \sum _{h}{\tilde{z}_{.h}\log {\alpha _h}} + \lambda (1 - \sum _h \alpha _h). \) Taking the derivative with respect to \(\alpha _h\), we obtain \( \frac{\partial L(\varvec{\alpha },\lambda )}{\partial \alpha _h} = \frac{\tilde{z}_{.h}}{\alpha _h} - \lambda . \) Setting this derivative to zero leads to \(\tilde{z}_{.h} = \lambda \alpha _h\). Summing both sides over all h yields \(\lambda = n\), so that the maximizing value of the parameter \(\alpha _h\) is given by \( \hat{\alpha }_h = \frac{\tilde{z}_{.h}}{n}. \) In the same manner, to maximize expectation (9) with respect to \(\varvec{\mu }_h^{{\mathbf {w}}}\), subject to the constraint \(||\varvec{\mu }_h^{{\mathbf {w}}}|| = 1\), we form the corresponding Lagrangian by isolating the terms depending on \(\varvec{\mu }_h^{{\mathbf {w}}}\); this leads to

$$\begin{aligned} L(\varvec{\mu },\lambda )= \sum _{h,i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h\mu _{hh} x_{ij}}} + \lambda \left( 1 - \sum _j w_{jh}\mu _{hh}^2\right) . \end{aligned}$$

Taking the derivative with respect to \(\mu _{hh}\), we obtain:

$$\begin{aligned} \frac{\partial L(\varvec{\mu },\lambda )}{\partial \mu _{hh}} = \sum _{i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h x_{ij}}} - 2\lambda w_{.h} \mu _{hh} \end{aligned}$$

where \(w_{.h} = \sum _j w_{jh}\). Setting this derivative to zero, we obtain \(\lambda \mu _{hh} = \frac{\sum _{i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h x_{ij}}}}{2w_{.h}}\), and thus \(\lambda ^2\mu _{hh}^2 = \frac{(\sum _{i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h x_{ij}}})^2}{4w_{.h}^2}\). Multiplying both sides by \(w_{.h}\) yields \( \lambda ^2w_{.h}\mu _{hh}^2 = \frac{w_{.h}(\sum _{i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h x_{ij}}})^2}{4w_{.h}^2}. \) Since \(w_{.h}\mu _{hh}^2 =1\), we have \(\lambda = \kappa _h\frac{\sqrt{w_{.h} (\sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}})^2}}{2 w_{.h}} \), i.e., \(\lambda = \kappa _h \frac{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert }{2 w_{.h}}\), where \({\mathbf {r}}_h^{{\mathbf {w}}}\) is the d-dimensional vector with components \(r_{hj'}^{{\mathbf {w}}} = r_{h}^{{\mathbf {w}}} = \sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}\) if \(w_{j'h} = 1\) and \(r_{hj'}^{{\mathbf {w}}} = 0\) otherwise, for \(j' = 1,\ldots ,d\). Hence, the maximizing value of the parameter \(\mu _{hh}\) is given by

$$\begin{aligned} \hat{\mu }_{hh}= \frac{\sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}}{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert } = \frac{\sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}}{\sqrt{w_{.h} (\sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}})^2}}=\pm \frac{1}{\sqrt{w_{.h}}} \end{aligned}$$
(10)

according to whether \(r_h^{{\mathbf {w}}} = \sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}\) is positive or negative.
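Note that this estimate is consistent with the unit-norm constraint: the block-constant direction \(\hat{\varvec{\mu }}_h^{{\mathbf {w}}}\) has exactly \(w_{.h}\) non-zero components, all equal to \(\pm 1/\sqrt{w_{.h}}\), so that

$$\begin{aligned} \Vert \hat{\varvec{\mu }}_h^{{\mathbf {w}}}\Vert ^2 = \sum _j w_{jh}\,\hat{\mu }_{hh}^2 = w_{.h}\times \frac{1}{w_{.h}} = 1. \end{aligned}$$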

Next we concentrate on maximizing Eq. (9) with respect to the concentration parameters \(\kappa _h\), subject to the constraint \(\kappa _h>0, \;\forall h\). The Lagrangian, up to terms which do not contain \(\kappa _h\), is given by \( L(\kappa )= \sum _{h}{\tilde{z}_{.h}\log (c_d(\kappa _h))} + \sum _{h,i,j}{{\tilde{z}_{ih}w_{jh}\kappa _h\hat{\mu }_{hh} x_{ij}}}. \) Note that, by the KKT conditions, the Lagrange multiplier associated with the constraint \(\kappa _h>0\) has to be equal to zero. Taking the partial derivative with respect to \(\kappa _h\), we obtain

$$\begin{aligned} \frac{\partial L(\kappa )}{\partial \kappa _h} = \tilde{z}_{.h}\frac{c_d'(\kappa _h)}{c_d(\kappa _h)} + \sum _{i,j}{{\tilde{z}_{ih}w_{jh}\hat{\mu }_{hh} x_{ij}}}. \end{aligned}$$

Setting this derivative equal to zero leads to \(\frac{c_d'(\kappa _h)}{c_d(\kappa _h)} = - \frac{\hat{\mu }_{hh}\times \sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}}{\tilde{z}_{.h}}.\) Replacing \(\hat{\mu }_{hh}\) by \(\frac{\sum _{i,j}{{\tilde{z}_{ih}w_{jh} x_{ij}}}}{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert }\) (see Eq. 10), we obtain \( \frac{c_d'(\kappa _h)}{c_d(\kappa _h)} = - \frac{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert }{\tilde{z}_{.h}\hat{w}_{.h}}.\) Let \(s = d/2-1\); then:

$$\begin{aligned} c_d'(\kappa _h) &= \frac{s\kappa _h^{s-1}(2\pi )^{s+1} I_s(\kappa _h) -\kappa _h^s(2\pi )^{s+1}I_s'(\kappa _h)}{(2\pi )^{2s+2}I_s^2(\kappa _h)}\\ &= \frac{s\kappa _h^{s-1}}{(2\pi )^{s+1}I_s(\kappa _h)}-\frac{\kappa _h^sI_s'(\kappa _h)}{(2\pi )^{s+1}I_s^2(\kappa _h)} = c_d(\kappa _h)\left( \frac{s}{\kappa _h}-\frac{I_s'(\kappa _h)}{I_s(\kappa _h)}\right) . \end{aligned}$$
(11)

Hence, we obtain

$$\begin{aligned} \frac{-c_d'(\kappa _h)}{c_d(\kappa _h)} = \frac{I_{s+1}(\kappa _h)}{I_{s}(\kappa _h)} = \frac{I_{d/2}(\kappa _h)}{I_{d/2-1}(\kappa _h)}. \end{aligned}$$
(12)

Equation (12) follows from the recurrence formula (Abramowitz and Stegun 1964, p. 376): \( \kappa _h I_{s+1}(\kappa _h)= \kappa _h I_s'(\kappa _h)- sI_s(\kappa _h). \) Note that computing the maximizing value \(\hat{\kappa }_h\) from Eq. (11) requires inverting a ratio of Bessel functions, a problem for which no closed-form solution is available. Thus, following Banerjee et al. (2005), we derive an accurate approximation of the concentration parameter by using the following continued fraction formula:

$$\begin{aligned} \frac{I_{d/2}(\kappa _h)}{I_{d/2-1}(\kappa _h)} = \frac{1}{\frac{d}{\kappa _h}+\frac{1}{\frac{d+2}{\kappa _h}+\cdots }}. \end{aligned}$$
(13)

Letting \( \bar{r}_h^{{\mathbf {w}}} = \frac{\Vert {\mathbf {r}}_h^{{\mathbf {w}}}\Vert }{\tilde{z}_{.h}\hat{w}_{.h}} = \frac{I_{d/2}(\kappa _h)}{I_{d/2-1}(\kappa _h)}\) and using Eq. (13), we obtain \(\frac{1}{\bar{r}_h^{{\mathbf {w}}}} \approx \frac{d}{\kappa _h} + \bar{r}_h^{{\mathbf {w}}} \), which yields the following approximation: \(\hat{\kappa }_h = \frac{d\bar{r}_h^{{\mathbf {w}}}}{1-(\bar{r}_h^{{\mathbf {w}}})^2}.\) Finally, Banerjee et al. (2005) showed empirically that adding the correction term \(\frac{-(\bar{r}_h^{{\mathbf {w}}})^3}{1-(\bar{r}_h^{{\mathbf {w}}})^2}\) results in a better approximation of \(\hat{\kappa }_h\), which leads to \(\hat{\kappa }_h = \frac{d\bar{r}_h^{{\mathbf {w}}} - (\bar{r}_h^{{\mathbf {w}}})^3}{1-(\bar{r}_h^{{\mathbf {w}}})^2}. \)
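As an illustration of this approximation, the corrected estimate can be computed as in the following minimal sketch (ours, not taken from the released implementation; the function name estimate_kappa and its NumPy interface are assumptions):

```python
import numpy as np

def estimate_kappa(r_bar, d):
    """Corrected approximation of the vMF concentration parameter
    (Banerjee et al. 2005) from the mean resultant length r_bar in (0, 1)
    and the dimensionality d."""
    r_bar = np.asarray(r_bar, dtype=float)
    return (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)

print(estimate_kappa(0.9, 1000))  # approx. 4733.0
```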

A.2 Concentration parameters: dbmovMFs versus movMFs

Hereafter, we provide the proofs for Proposition 1 and Theorem 1. The following proposition is useful and necessary to prove both Proposition 1 and Theorem 1.

Proposition 3

Let \({\mathbf {r}}\) be a non-zero vector in \(\mathbb {R}^d\) (i.e., \({\mathbf {r}}= (r_1, \ldots ,r_d)^\top \), with \(d\ge 1\)). Let \({\mathbf {r}}^d\) be the vector in \(\mathbb {R}^d\) all of whose components are equal to the sum of the elements of \({\mathbf {r}}\) (i.e., \({\mathbf {r}}^d = (\sum _j r_j)\,\mathbf {1}\), where \(\mathbf {1}\) denotes the constant one vector). Then \(\frac{\Vert {\mathbf {r}}^d\Vert }{d} \;\le \; \Vert {\mathbf {r}}\Vert \), with equality only if all components of \({\mathbf {r}}\) are equal, i.e., \(r_1= \cdots = r_d\).

Proof

Let \({\mathbf {d}}\) and \({\mathbf {r}}^+\) be two vectors in \(\mathbb {R}^d\) defined as follows: \({\mathbf {d}} = \frac{1}{\sqrt{d}}\,\mathbf {1}\) and \( r_j^+\; = \; |r_j|, \;\;\; \forall j \in \{1,\ldots ,d\}. \) We have

$$\begin{aligned} \frac{\Vert {\mathbf {r}}^d\Vert }{d} \;=\; \frac{\sqrt{d}\times \left| \sum _j r_j\right| }{d} \;\le \; \frac{1}{\sqrt{d}} \times \sum _j\left| r_j\right| \;=\;{\mathbf {d}}^\top .{\mathbf {r}}^+ \;=\;\Vert {\mathbf {d}}\Vert \Vert {\mathbf {r}}^+\Vert \cos ({\mathbf {d}},{\mathbf {r}}^+). \end{aligned}$$

By definition of \({\mathbf {r}}^+\) and \({\mathbf {d}}\), we have \(\Vert {\mathbf {r}}^+\Vert = \Vert {\mathbf {r}}\Vert \) and \(\Vert {\mathbf {d}}\Vert = 1\), hence \( \frac{\Vert {\mathbf {r}}^d\Vert }{d} \;\le \;\Vert {\mathbf {r}}\Vert \cos ({\mathbf {d}},{\mathbf {r}}^+). \) Since both \({\mathbf {d}}\) and \({\mathbf {r}}^+\) are non-zero vectors lying in the first orthant, dividing both sides of the above inequality by \(\Vert {\mathbf {r}}\Vert \) gives \( 0\;\le \;\frac{\Vert {\mathbf {r}}^d\Vert }{d\Vert {\mathbf {r}}\Vert } \;\le \;\cos ({\mathbf {d}},{\mathbf {r}}^+) \;\le \;1. \) The right-hand side equality holds only if \({\mathbf {d}}\) and \({\mathbf {r}}^+\) are collinear, and therefore only if all components of \({\mathbf {r}}\) are equal (i.e., \(r_1 = \cdots =r_d\)). We now prove Proposition 1. \(\square \)
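The inequality of Proposition 3 is also easy to check numerically; the following minimal sketch (our illustration, with an assumed random-vector setup) verifies it on randomly drawn vectors:

```python
import numpy as np

# Numerical check of Proposition 3 on randomly drawn vectors.
rng = np.random.default_rng(0)
for _ in range(1000):
    d = int(rng.integers(1, 20))
    r = rng.normal(size=d)
    r_d = np.full(d, r.sum())  # all components equal to sum(r)
    assert np.linalg.norm(r_d) / d <= np.linalg.norm(r) + 1e-12
print("Proposition 3 verified on all sampled vectors.")
```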

Proof of Proposition 1

Based on Proposition 3 and the following inequality: \(\Vert {\mathbf {r}}\Vert = \Vert p_1{\mathbf {x}}_1 + \cdots + p_n{\mathbf {x}}_n\Vert \le \Vert p_1{\mathbf {x}}_1\Vert + \cdots + \Vert p_n{\mathbf {x}}_n\Vert = \sum _i p_i\) (since each \({\mathbf {x}}_i\) lies on the unit hypersphere, \(\Vert p_i{\mathbf {x}}_i\Vert = p_i\)), it is straightforward to verify that

$$\begin{aligned} 0\; \le \; \Vert {\mathbf {r}}^d\Vert \; \le \; {d \times \sum _i p_i} \end{aligned}$$

In the following, we prove Theorem 1, which states that, for a given row clustering \({\mathbf {z}}\), dbmovMFs-based algorithms lead to a concentration parameter that is less than or equal to that of movMFs-based algorithms, for each cluster and for any column partition \({\mathbf {w}}\). The following Lemma will be useful in the proof of Theorem 1. \(\square \)

Lemma 1

Let a and b be two real numbers in the interval [0, 1] (i.e., \(0\le a \le 1\) and \(0\le b \le 1\)). Then, for every natural number \(n\ge 2\), \( \left| a^n - b^n\right| \;\le \; n\left| a-b\right| \), with equality only if \(a = b\).

Proof

For every natural number \(n\ge 2\), we can use the well-known identity \( a^n - b^n \; = \; (a-b)\sum _{k=0}^{n-1}a^{n-1-k}b^k. \) Since both a and b are non-negative, we have \( \left| a^n - b^n\right| \; = \; \left| a-b\right| \sum _{k=0}^{n-1}a^{n-1-k}b^k \); and as both a and b are in [0, 1], we have \( \sum _{k=0}^{n-1}a^{n-1-k}b^k\; \le \; n \), thereby \( \left| a^n - b^n\right| \; = \;\left| a-b\right| \sum _{k=0}^{n-1}a^{n-1-k}b^k\; \le \; n\left| a-b\right| . \)\(\square \)

Proof of Theorem 1

We first prove that \( { \frac{\bar{r}_{h}^{{\mathbf {w}}} d }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h} d }{1 - \left( \bar{r}_{h}\right) ^2} \), which corresponds to the approximation of the concentration parameters under dbmovMFs and movMFs without the correction term. Since both \(\bar{r}_{h}^{{\mathbf {w}}}\) and \(\bar{r}_{h}\) lie in ]0, 1[ and the map \(x \mapsto \frac{xd}{1-x^2}\) is increasing on this interval, it is sufficient to show that \(\bar{r}_{h}^{{\mathbf {w}}} \;\le \; \bar{r}_{h}\), which is what we prove in what follows. \(\square \)

By definition, we have \( \bar{r}_h = \frac{\Vert {\mathbf {r}}_h\Vert }{\sum _i \tilde{z}_{ih}}\quad \text{ where } \quad {\mathbf {r}}_h = \sum _i \tilde{z}_{ih}{\mathbf {x}}_i \). Now, let’s define \(\bar{r}_{h}'\) as:

$$\begin{aligned} \bar{r}_h' = \frac{\Vert {\mathbf {r}}_h'\Vert }{\sum _i \tilde{z}_{ih}}\quad \text{ where } \quad r_{hj}' = \left\{ \begin{array}{ll} r_{hj}, &{}\quad \text{ if } w_{jh}= 1\\ 0, &{}\quad \text{ otherwise }. \end{array} \right. \\ \end{aligned}$$

Since \({\mathbf {r}}_h'\) retains only the entries of \({\mathbf {r}}_h\) belonging to the hth column cluster (a \(w_{.h}\)-dimensional sub-vector of \({\mathbf {r}}_h\), padded with zeros), it follows from the above definition that \(\bar{r}_h'\;\le \; \bar{r}_{h}\). On the other hand, \( \bar{r}_{h}^{{\mathbf {w}}} = \frac{\Vert {\mathbf {r}}_{h}^{{\mathbf {w}}}\Vert }{\sum _i \tilde{z}_{ih} {\sum _j \hat{w}_{jh}}} \), where \({\mathbf {r}}_{h}^{{\mathbf {w}}}\) denotes a \(w_{.h}\)-dimensional vector all of whose elements are equal to the sum of the elements of the hth co-cluster (i.e., \(r_{h1}^{{\mathbf {w}}} = \cdots = r_{hw_{.h}}^{{\mathbf {w}}} = r_h^{{\mathbf {w}}} =\sum _{i,j}\tilde{z}_{ih}\hat{w}_{jh}x_{ij} \)); equivalently, each element of \({\mathbf {r}}_{h}^{{\mathbf {w}}}\) is equal to the sum of the elements of \({\mathbf {r}}_h'\) (i.e., \(r_h^{{\mathbf {w}}} = \sum _j r_{hj}'\)). Hence, using Proposition 3 with dimensionality \(d = w_{.h}\), we have \( \frac{\Vert {\mathbf {r}}_{h}^{{\mathbf {w}}}\Vert }{{\sum _j \hat{w}_{jh}}} \;\le \; \Vert {\mathbf {r}}_{h}'\Vert .\) Thus,

$$\begin{aligned} \bar{r}_h^{{\mathbf {w}}} = \frac{\Vert {\mathbf {r}}_{h}^{{\mathbf {w}}}\Vert }{\sum _i \tilde{z}_{ih} {\sum _j \hat{w}_{jh}}} \;\le \; \bar{r}_h' = \frac{\Vert {\mathbf {r}}_{h}'\Vert }{\sum _i \tilde{z}_{ih}} \;\le \; \bar{r}_h, \text{ thereby }, { \frac{\bar{r}_{h}^{{\mathbf {w}}} d }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h} d }{1 - \left( \bar{r}_{h}\right) ^2}. \end{aligned}$$

We now prove that, \( { \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2}. \)

As \(\bar{r}_{h}^{{\mathbf {w}}} \le \bar{r}_{h}\), we have \( { \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}\right) ^2}, \) so it is sufficient to prove that \( \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}\right) ^2} \;\le \; \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2}. \) We have

$$\begin{aligned} \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}\right) ^2} - \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2} \; = \;\frac{d(\bar{r}_{h}^{{\mathbf {w}}}-\bar{r}_{h}) + (\bar{r}_{h}^3 -(\bar{r}_{h}^{{\mathbf {w}}})^3) }{1 - \left( \bar{r}_{h}\right) ^2}. \end{aligned}$$

Based on Lemma 1 (with \(n = 3\)), for all \(d\ge 3\) we have \( \left| \bar{r}_{h}^3 -(\bar{r}_{h}^{{\mathbf {w}}})^3\right| \;\le \; 3\left| \bar{r}_{h}^{{\mathbf {w}}}-\bar{r}_{h}\right| \;\le \; d\left| \bar{r}_{h}^{{\mathbf {w}}}-\bar{r}_{h}\right| .\) As \(\bar{r}_{h}^{{\mathbf {w}}}\;\le \;\bar{r}_{h}\), it is easy to verify that \(d(\bar{r}_{h}^{{\mathbf {w}}}-\bar{r}_{h}) + (\bar{r}_{h}^3 -(\bar{r}_{h}^{{\mathbf {w}}})^3)\;\le \; 0.\) Since \(1 - \left( \bar{r}_{h}\right) ^2 \;>\; 0\), we have \( \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}\right) ^2} \;\le \; \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2}. \) Hence, using the above inequalities, we get (for all \(d\;\ge \;3\))

$$\begin{aligned} { \frac{\bar{r}_{h}^{{\mathbf {w}}} d - (\bar{r}_{h}^{{\mathbf {w}}})^3 }{1 - \left( \bar{r}_{h}^{{\mathbf {w}}}\right) ^2}} \;\le \; \frac{\bar{r}_{h} d - \bar{r}_{h}^3}{1 - \left( \bar{r}_{h}\right) ^2}. \end{aligned}$$
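Although the result is established analytically above, it can also be sanity-checked numerically: the corrected estimate is increasing in the mean resultant length for \(d\ge 3\), so \(\bar{r}_{h}^{{\mathbf {w}}} \le \bar{r}_{h}\) implies the stated ordering of the two concentration estimates. A minimal sketch (our illustration; the helper name kappa_hat is an assumption):

```python
import numpy as np

def kappa_hat(r_bar, d):
    # Corrected concentration approximation (Banerjee et al. 2005).
    return (r_bar * d - r_bar ** 3) / (1.0 - r_bar ** 2)

rng = np.random.default_rng(1)
for _ in range(1000):
    d = int(rng.integers(3, 10_000))
    r_w, r = np.sort(rng.uniform(0.0, 0.999, size=2))  # ensures r_w <= r < 1
    assert kappa_hat(r_w, d) <= kappa_hat(r, d) + 1e-9
print("kappa_hat(r_w, d) <= kappa_hat(r, d) on all samples.")
```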

A.3 Computational complexity in the worst case

Hereafter, we provide the proof of Proposition 2 (Sect. 6).

Proof

(i) The computational bottleneck of hard-dbmovMF lies in the row and column assignment steps and in the updates of the concentration parameters.

First, we prove that the complexity of both the row and column assignment steps is O(nz). Let \({\mathbf {x}}_i\) denote the ith row of \({\mathbf {X}}\) (i.e., \({\mathbf {x}}_i\) is a d-dimensional vector in \(\mathbb {S}^{d-1}\)). Assume that we look for a co-clustering of \({\mathbf {X}}\) into g co-clusters, and let \(\varvec{\mu }_h^{{\mathbf {w}}}\) be the hth centroid, characterizing the hth co-cluster. The computational cost of the scalar product \((\varvec{\mu }_h^{{\mathbf {w}}})^{T}{\mathbf {x}}_i\) is \(O(x_i^h)\), where \(x_i^h\) is the number of non-zero entries of \({\mathbf {x}}_i\) within the hth column cluster; this complexity holds thanks to the form of \(\varvec{\mu }_h^{{\mathbf {w}}}\), see Eq. (2). Thereby, the complexity of the assignment of \({\mathbf {x}}_i\) is \(O(x_i^*)\), based on \(O(x_i^1+\cdots +x_i^g)\), where \(x_i^*\) denotes the number of non-zero entries of \({\mathbf {x}}_i\). Therefore, the total cost of one row assignment step is O(nz). Similarly, we can show that the cost of one column assignment step is O(nz).

We now prove that the computational cost of updating the concentration parameters is also O(nz). The main computation for updating the hth concentration parameter is that of \(r_h^{{\mathbf {w}}}\), whose cost is \(O(x_h^*)\), where \(x_h^*\) denotes the number of non-zero entries in the hth co-cluster. Thus, the complexity of updating all concentration parameters is O(nz), based on \(O(x_1^*+\cdots +x_g^*)\) and the fact that at most all non-zero entries of the matrix \({\mathbf {X}}\) are contained in the g diagonal co-clusters. Thereby, the total computational complexity of hard-dbmovMF is \(O(it\cdot nz)\).

\(\square \)
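To make the O(nz) argument for the row assignment step concrete, here is a minimal sketch (our illustration, not the authors' released code; the function name assign_rows and its NumPy/SciPy interface are assumptions) that scores each row against the g block-constant centroids while touching only its non-zero entries:

```python
import numpy as np
from scipy.sparse import csr_matrix

def assign_rows(X, w, mu_diag, kappa, log_alpha, log_cd):
    """Hard row-assignment sketch (hypothetical interface).
    X: (n, d) scipy CSR matrix with L2-normalized rows; w[j] in {0,...,g-1}
    is the column cluster of feature j; mu_diag[h] is the common value
    mu_hh of the block-constant centroid; kappa, log_alpha, log_cd hold
    kappa_h, log(alpha_h) and log(c_d(kappa_h)) for each cluster h."""
    n = X.shape[0]
    scores = np.tile(log_alpha + log_cd, (n, 1)).astype(float)
    for i in range(n):
        lo, hi = X.indptr[i], X.indptr[i + 1]
        cols, vals = X.indices[lo:hi], X.data[lo:hi]
        h = w[cols]  # column cluster of each non-zero entry of x_i
        # Accumulate kappa_h * mu_hh * x_ij per block: O(x_i^*) work per row.
        np.add.at(scores[i], h, kappa[h] * mu_diag[h] * vals)
    return scores.argmax(axis=1)  # hard row assignments

# Toy usage: 6 rows, 9 columns, g = 3 column clusters of 3 columns each.
rng = np.random.default_rng(0)
A = rng.random((6, 9)) * (rng.random((6, 9)) < 0.4)
A /= np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1e-12)
z = assign_rows(csr_matrix(A), w=np.repeat([0, 1, 2], 3),
                mu_diag=np.full(3, 1 / np.sqrt(3.0)), kappa=np.full(3, 50.0),
                log_alpha=np.log(np.full(3, 1 / 3)), log_cd=np.zeros(3))
print(z)
```

Each iteration touches only the non-zeros of \({\mathbf {x}}_i\), matching the \(O(x_i^*)\) per-row cost used in the proof.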

Proof

(ii) As for hard-dbmovMF, it is easy to verify that the total cost of the row assignments and of the concentration parameter updates is \(O(it\cdot nz)\) for soft-dbmovMF. We now prove that, in contrast to hard-dbmovMF, the computational cost of the column assignments for soft-dbmovMF is \(O(it\cdot g\cdot nz)\). The computational bottleneck of the column assignment step of soft-dbmovMF lies in the terms \(v_{hj} \leftarrow \sum _i \tilde{z}_{ih}x_{ij}\), \(h\in \{1,\ldots ,g\}\), \(j\in \{1,\ldots ,J\}\). The cost of computing \(v_{hj}\) is \(O(x_j^*)\), where \(x_j^*\) is the number of non-zero entries in the jth column. During the column assignment step, \(v_{hj}\) is computed for each cluster h and each column j; hence the computational cost of this step is \(O(g\cdot nz)\). Therefore, the total computational cost of soft-dbmovMF is \(O(it\cdot g\cdot nz)\), based on \(O(it\cdot nz\cdot (g+1))\). \(\square \)
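A minimal sketch of this bottleneck (our illustration; the function name column_statistics is an assumption) computes all the \(v_{hj}\) at once as a sparse-dense product, which makes the \(O(g\cdot nz)\) cost apparent: every non-zero \(x_{ij}\) is combined with the g posteriors \(\tilde{z}_{ih}\) of its row.

```python
import numpy as np
from scipy.sparse import csr_matrix

def column_statistics(Z_tilde, X):
    """Compute v_hj = sum_i z_tilde_ih * x_ij for all clusters h and columns j.
    Z_tilde: dense (n, g) array of row posteriors; X: (n, d) CSR matrix.
    Every non-zero of X contributes to g entries of the result: O(g * nz)."""
    return np.asarray(X.T @ Z_tilde).T  # shape (g, d)

# Toy usage: 4 rows, 5 columns, g = 2 row clusters.
rng = np.random.default_rng(0)
X = csr_matrix(rng.random((4, 5)) * (rng.random((4, 5)) < 0.4))
Z_tilde = rng.dirichlet(np.ones(2), size=4)  # soft row assignments
print(column_statistics(Z_tilde, X).shape)   # (2, 5)
```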

Cite this article

Salah, A., Nadif, M. Directional co-clustering. Adv Data Anal Classif 13, 591–620 (2019). https://doi.org/10.1007/s11634-018-0323-4
