1 Introduction

The field of artificial intelligence (AI) advanced significantly in the previous decade due to developments in deep learning (LeCun et al., 2015). In the early years of this field, deep learning methods exhibited stellar supervised learning performance, where each data sample was coupled with a ground truth label (labeled data), e.g., each image was associated with a category. Unfortunately, generating labeled datasets is time consuming and expensive, and there may not be enough experts to label the data at hand (e.g., medical images). A natural alternative for such large-scale problems is unsupervised learning, and clustering in particular.

In this work, we focus on representation learning for the unsupervised learning task of clustering images. Clustering is a ubiquitous task and has been actively used in many different scientific and practical pursuits (Frey & Dueck, 2007; Masulli & Schenone, 1999; Jain et al., 1999; Xu & Wunsch, 2005). Classical clustering algorithms do not learn representations and are hence limited to data for which a good representation is already available.

Advancements in deep learning techniques have enabled the end-to-end learning of rich image representations for supervised learning. For clustering, however, such features cannot be obtained via supervised learning due to the lack of labels; supervised approaches therefore fall short of providing a solution. Self-supervised learning addresses the issue of learning representations without labeled data. It is a subfield of unsupervised learning in which the main goal is to learn general-purpose representations by exploiting user-defined tasks (pretext tasks) (Wu et al., 2018; Zhuang et al., 2019; He et al., 2020; Chen et al., 2020; Grill et al., 2020; Caron et al., 2020). Representation learning algorithms have been shown to achieve good results when evaluated using a linear evaluation protocol, semisupervised training on ImageNet, or transfer to downstream tasks. A straightforward solution to the clustering problem is therefore to take the features obtained via self-supervised learning and apply an out-of-the-box clustering algorithm (such as k-means) to compute data clusters. However, the clustering performance of these features is not well characterized, and as our results show, they can be improved for clustering purposes.

On the other hand, deep clustering involves simultaneously learning cluster assignments and features using deep neural networks. Simultaneously learning the feature space with a clustering objective may lead to degenerate solutions, which until recently limited end-to-end implementations of clustering with representation learning approaches (Caron et al., 2018a). Subsequently, several works have been developed (Xie et al., 2016a; Caron et al., 2018a; Shah and Koltun, 2018; Ji et al., 2019a; Niu et al., 2020a; Wu et al., 2019a; Huang et al., 2020a; Tao et al., 2021). We provide details regarding some of these works in Sect. 1.1. Our previous work (Regatti et al., 2021) showed some encouraging results, and we extend it substantially here. We categorize current clustering and representation learning works based on the consistency constraints that are used to define their objective functions. We define an additional notion of consistency, consensus consistency, which ensures that representations are learned to induce similar partitions for variations in the representation space, different clustering algorithms or different initializations of a clustering algorithm. Using consensus consistency, we propose an end-to-end learning approach that outperforms other end-to-end methods for image clustering. We summarize our contributions as follows:

  1. We introduce different notions of consistency (exemplar, population and consensus) that are used in unsupervised representation learning.

  2. We propose a novel clustering algorithm that incorporates the above three consistency constraints and can be trained in an end-to-end way. An ensemble is generated in the consensus clustering objective by performing random transformations on the underlying embeddings. We combine several methods, which is not trivial, and this combination, along with our new consensus loss, is novel.

  3. We show that the proposed algorithm ConCURL (consensus clustering with unsupervised representation learning) outperforms baselines on popularly used computer vision datasets when evaluated with clustering metrics.

  4. We demonstrate the clustering abilities of trained models under a data shift and argue for the need for different evaluation metrics for deep clustering algorithms.

  5. We study the impacts of various hyperparameters, data augmentation methods, and image resolutions on the clustering ability of the proposed algorithm.

1.1 Related work

1.1.1 Self-supervised learning

Self-supervised learning is used to learn representations in an unsupervised way by defining pretext tasks. Self-supervised learning comes in many different flavors; instance discrimination (ID) tasks (Wu et al., 2018) are one prominent example, and more recent methods have achieved state-of-the-art results without requiring negative pairs. Although self-supervised learning methods exhibit impressive performance on a variety of problems, it is not clear whether the learned representations are well suited for clustering.

1.1.2 Clustering with representation learning

DEC (Xie et al., 2016a) is one of the first algorithms to show that deep learning can be used to effectively cluster images in an unsupervised manner; this approach uses features learned from an autoencoder to fine-tune the cluster assignments. DeepCluster (Caron et al., 2018a) shows that it is possible to train deep convolutional neural networks (DeCNNs) in an end-to-end manner with pseudolabels that are generated by a clustering algorithm. Subsequently, several works (Shah and Koltun, 2018; Ji et al., 2019a; Niu et al., 2020a; Wu et al., 2019a; Huang et al., 2020a) have introduced end-to-end clustering-based objectives and achieved state-of-the-art clustering results. For example, in the Gaussian attention network for image clustering (GATCluster) (Niu et al., 2020a), training is performed in two distinct steps (similar to Caron et al. (2018a)), where the first step is to compute pseudotargets for a large batch of data and the second step is to train the model in a supervised way using these pseudotargets. Both DeepCluster and GATCluster use k-means to generate pseudolabels, which may not scale well. Wu et al. (2019a) proposed deep comprehensive correlation mining (DCCM), where discriminative features are learned by taking advantage of the correlations among the data using pseudolabel supervision and the triplet mutual information among the features. However, DCCM may be susceptible to trivial solutions (Niu et al., 2020a). Invariant information clustering (IIC) (Ji et al., 2019a) maximizes the mutual information between the class assignments of two different views of the same image (paired samples) to learn representations that preserve the commonalities between the views while discarding instance-specific details. It has been argued that the presence of an entropy term in the mutual information plays an important role in avoiding degenerate solutions. However, a large batch size is needed for the computation of mutual information in IIC; this may not be scalable for larger image sizes, which are common in popular datasets (Ji et al., 2019a; Niu et al., 2020a). Huang et al. (2020a) extended the celebrated maximal margin clustering idea to the deep learning paradigm by learning the most semantically plausible clusters through the minimization of a proposed partition uncertainty index. Their algorithm, PICA, uses a stochastic version of this index, thereby facilitating minibatch training. PICA fails to assign a sample to the correct cluster when that sample has either high foreground or background similarity to samples in other clusters. In a more recent approach, contrastive clustering (Li et al., 2021), a contrastive learning loss (as in SimCLR (Chen et al., 2020)) was adopted along with an entropy term to avoid degenerate solutions. Similarly, Tao et al. (2021) combined ID (Wu et al., 2018) with novel softmax-formulated decorrelation constraints for representation learning and clustering. Their approach outperforms state-of-the-art methods and improves upon the instance discrimination method. Our method also improves upon ID and outperforms the method of Tao et al. (2021) on all datasets considered. There are other non-end-to-end approaches, such as SCAN (Van Gansbeke et al., 2020), which use the representations learned from a pretext task to find the images that are semantically closest to a given image via nearest neighbors.
Similarly, SPICE (Niu et al., 2021), another state-of-the-art non-end-to-end approach, divides the clustering network into two parts: one measures instance-level similarity, and the other identifies cluster-level discrepancy.

2 Consensus clustering

One of the distinguishing factors between supervised and unsupervised learning is the existence of ground truth labels, which impose a global constraint over the examples. In most self-supervised learning methods, the ground truth is replaced with some consistency constraint (Chen et al., 2020). The performance of any self-supervised method is therefore a function of the power of the consistency constraint used. We define two types of consistency constraints: exemplar consistency and population consistency.

Definition 1

Exemplar consistency: Representation learning algorithms that learn closer representations (in terms of some distance metric) for different augmentations of the same data point are said to follow exemplar consistency.

Examples of the usage of exemplar consistency include contrastive learning methods such as MoCo (He et al., 2019) and SimCLR (Chen et al., 2020). In these methods, a positive pair of images is defined as any two image augmentations of the same image, and a negative pair consists of any two different images.

Definition 2

Population consistency: Representation learning algorithms that ensure that two similar data points, or any augmentations of the same data point, belong to the same cluster (or population) are said to follow population consistency.

DeepCluster (Caron et al., 2018a) is a prominent self-supervised method that utilizes population consistency, i.e., Definition 2, by enforcing a clustering constraint on the input dataset. Note that each cluster contains data points that are similar to one another. Similarly, SwAV (Caron et al., 2020) is another example of a method that enforces population consistency.

Definition 3

Consensus consistency: Representation learning algorithms that are able to learn representations that induce similar partitions for variations in the given representation space (subsets of features, random projections, etc.), different clustering algorithms (k-means, Gaussian mixture models (GMMs), etc.) or different initializations of clustering algorithms are said to follow consensus consistency.

Earlier works on consensus consistency did not consider representation learning and used the knowledge reuse framework (see Strehl and Ghosh (2002); Ghosh and Acharya (2011)), where the cluster partitions were available (the features were irrelevant) or the features of the data were fixed. For example, Fern and Brodley (2003) successfully applied random projections to consensus clustering by performing k-means clustering on multiple random projections of the fixed features of the input data. In contrast, the notion of consensus consistency here deals with learning representations that achieve a consensus regarding the cluster assignments of multiple clustering algorithms. One example of a method that enforces consensus consistency is local aggregation (LA) (Zhuang et al., 2019). LA builds on the ID task (Wu et al., 2018) and was proposed as a method based on a robust clustering objective (using multiple runs of k-means) to move statistically similar data points closer in the representation space and dissimilar data points further away. However, Zhuang et al. (2019) did not evaluate the method with clustering metrics and focused only on linear evaluation using the learned features. We therefore conducted a study to evaluate the clustering performance of these features (see the Appendix) and observed that LA performed poorly when evaluated for clustering accuracy. In Definition 3, we inherently assume that the clustering algorithms under consideration have been tuned properly. Unfortunately, the definition of consensus consistency is ill posed, and there can be arbitrarily many different partitions that satisfy the given condition. We show that when exemplar consistency is used as an inductive bias, the resulting objective function achieves impressive performance on challenging datasets. Combining the exemplar and population constraints with consensus consistency seamlessly and effectively for clustering is the basis of our proposed method.

2.1 Loss for consensus and population consistency

We focus on learning generic representations that satisfy Definition 3 for clustering. By using different clustering algorithms or different representation variations (such as projections), one can easily generate multiple different partitions of the same data. In unsupervised learning, it is not known which partitioning is correct. To tackle this problem, some additional assumptions are needed.

We assume that there is an underlying latent space \(\mathcal {Z}^*\) (possibly not unique) such that all clusterings (based on latent space, algorithm or initialization variations) that take input data from this latent space produce similar data partitions. Furthermore, every clustering algorithm that also takes the true number of clusters as input produces the partition that is closest to the hypothetical ground truth. Moreover, we assume that there exists a function \(h:X \rightarrow \mathcal {Z}^*\), where X represents the input space and \(\mathcal {Z}^*\) represents the underlying latent space. We call this assumption the principle of consensus. The open question is how one constructs an efficient loss that reflects the principle of consensus. We define one such way below.

Given an input batch of images \(\mathcal {X}_b\subset \mathcal {X}\), the goal is to partition these images into K clusters. We obtain p views of these images (via different image augmentations) and define a loss such that the cluster assignment of any of the p views matches the target estimated from any other view. Without loss of generality, we define the loss for \(p = 2\) views. The two views \(\mathcal {X}_b^1, \mathcal {X}_b^2\) are generated using two randomly chosen image augmentations.

We learn a representation space \(\mathcal {Z}_0\) at the end of every training iteration and obtain M variations of \(\mathcal {Z}_0\) as \(\{ \mathcal {Z}_1, \mathcal {Z}_2, ... , \mathcal {Z}_M\}\) (e.g., random projections). The goal is to build an efficient loss according to the principle of consensus among \(\mathcal {Z}_0\) and its M variations \(\{ \mathcal {Z}_1, \mathcal {Z}_2, ... , \mathcal {Z}_M\}\) such that we learn the latent space \(\mathcal {Z}^*\) at the end of training (i.e., the learned features lie in the latent space described above). For a given batch of images \(\mathcal {X}_b\) and a representation space \(\mathcal {Z}_m, \forall m \in [1,...,M]\), we denote the cluster assignment probability of image i and cluster j for view 1 as \(\textbf {p}_{i,j}^{1}(\mathcal {Z}_m)\) and that for view 2 as \(\textbf {p}_{i,j}^{2}(\mathcal {Z}_m)\). We concisely use \(\tilde{\textbf {p}}^{(1,m)},\tilde{\textbf {p}}^{(2,m)}\) when we talk about all the images and all the clusters. Here, we define a loss that incorporates “population consistency" and “consensus consistency". We assume that the target cluster assignment probabilities for the representation \(\mathcal {Z}_0\) are given (as in DeepCluster (Caron et al., 2018a)), and they are denoted as \(\textbf {q}_{i,j}^{1}\) for view 1 and \(\textbf {q}_{i,j}^{2}\) for view 2.

We define the loss for any representation space \(\mathcal {Z}\) and batch of images \(\mathcal {X}_b\) as

$$\begin{aligned} {\begin{matrix} L_{\mathcal {Z}_m}^1 &=& - \frac{1}{2B}\sum _{i=1}^{B}\sum _{j=1}^K \textbf {q}^{2}_{ij} \log \textbf {p}^{1}_{ij}(\mathcal {Z}_m) \\ L_{\mathcal {Z}_m}^2 &=& - \frac{1}{2B}\sum _{i=1}^{B}\sum _{j=1}^K \textbf {q}^{1}_{ij} \log \textbf {p}^{2}_{ij}(\mathcal {Z}_m), \\ L_{\mathcal {Z}} &=& \sum _{m = 1}^M \Big ( L_{\mathcal {Z}_m}^1 + L_{\mathcal {Z}_m}^2 \Big ). \end{matrix}} \end{aligned}$$
(1)

Note that here, consensus among the clustering results is defined via the number of common targets \(\textbf {q}\). An overview of the procedure is shown in Fig. 1. The exact details regarding how to obtain variations of \(\mathcal {Z}_0\) and calculate the cluster assignment probabilities \(\textbf {p}\) and targets \(\textbf {q}\) are described in the next section.
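To make the swapped-prediction structure of Eq. (1) concrete, the sketch below is a minimal PyTorch-style implementation written by us (not the authors' released code); it assumes the per-variation assignment probabilities and the targets are already available as tensors.

```python
import torch

def consensus_loss(p_view1, p_view2, q_view1, q_view2):
    """Sketch of the consensus loss in Eq. (1).

    p_view1, p_view2: lists of M tensors of shape (B, K) with cluster
        assignment probabilities under each variation Z_1..Z_M.
    q_view1, q_view2: (B, K) target codes computed from the reference
        space Z_0 for views 1 and 2.
    """
    B = q_view1.shape[0]
    total = 0.0
    for p1, p2 in zip(p_view1, p_view2):
        # swapped prediction: the targets of one view supervise the other view
        l1 = -(q_view2 * torch.log(p1 + 1e-8)).sum() / (2 * B)
        l2 = -(q_view1 * torch.log(p2 + 1e-8)).sum() / (2 * B)
        total = total + l1 + l2
    return total
```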

Fig. 1 An illustration of the consensus loss part of ConCURL

2.2 End-to-end stochastic gradient descent (SGD)-based trainable consensus loss

In this section, we propose an end-to-end trainable algorithm and define a way to compute \(\textbf {p}\) and \(\textbf {q}\). When the cluster assignment probabilities \(\textbf {p}\) can take any values in the set [0, 1], we refer to the process as soft clustering, and when \(\textbf {p}\) is restricted to the set \(\{0,1\}\), we refer to the process as hard clustering.

Without loss of generality, in this paper, we focus on soft clustering, which makes it easier to define a loss function using the probabilities and update the parameters using the gradients to enable end-to-end learning. We follow the soft clustering framework presented in SwAV (Caron et al., 2020), which is a centroid-based technique that aims to maintain consistency between the clusterings of the augmented views \(\mathcal {X}_b^{1}\) and \(\mathcal {X}_b^{2}\). We store a set of randomly initialized prototypes \(C_0=\{ \textbf {c}_0^1,\cdots ,\textbf {c}_0^K \} \in \mathbb {R}^{d\times K}\), where K is the number of clusters and d is the dimensionality of the prototypes. These prototypes are used to represent clusters and define a “consensus consistency" loss. We compute M variations of \(C_0\) as \(C_1,...,C_M\) exactly as we compute the M variations of \(\mathcal {Z}_0\).

2.2.1 Cluster assignment probability \(\textbf {p}\)

We use a two-layer multilayer perceptron (MLP) g to project the features \(\textbf {f}^1 = f_\theta (\mathcal {X}_b^1)\) and \(\textbf {f}^2 = f_\theta (\mathcal {X}_b^2)\) to a lower-dimensional space \(\mathcal {Z}_0\) (of size d). The outputs of this MLP (referred to as cluster embeddings) are denoted as \({Z}_0^1 = \{\textbf {z}_0^{1,1}, \ldots , \textbf {z}_0^{1,B} \}\) and \({Z}_0^2 = \{\textbf {z}_0^{2,1}, \ldots , \textbf {z}_0^{2,B} \}\) for view 1 and view 2, respectively. Note that \(h: \mathcal {X} \rightarrow \mathcal {Z}\) defined in Sect. 2.1 is equivalent to the composition of \(f: \mathcal {X} \rightarrow \Phi \) and \(g: \Phi \rightarrow \mathcal {Z}\).
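As an illustration, a two-layer projection head of this kind could look as follows in PyTorch; the hidden width and the output dimensionality d below are assumptions, not values taken from the paper.

```python
import torch.nn as nn

# Illustrative projection head g: backbone features -> cluster embeddings in Z_0.
proj_head = nn.Sequential(
    nn.Linear(2048, 512),   # 2048 = ResNet-50 feature size (512 for ResNet-18)
    nn.ReLU(inplace=True),
    nn.Linear(512, 128),    # d = 128 cluster-embedding dimension (assumed)
)
```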

For a latent space \(\mathcal {Z}\), we compute the probability of assigning a cluster j to image i using the normalized vectors \(\bar{\textbf {z}}^{1,i} = \frac{\textbf {z}^{1,i}}{\Vert \textbf {z}^{1,i}\Vert }\), \(\bar{\textbf {z}}^{2,i} = \frac{\textbf {z}^{2,i}}{\Vert \textbf {z}^{2,i}\Vert }\) and \(\bar{\textbf {c}}_j = \frac{{\textbf{c}}^j}{\Vert {\textbf{c}}^j\Vert }\) as

$$\begin{aligned} \textbf {p}_{i,j}^{1}(\mathcal {Z},C) &= \frac{\exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{1,i}, \bar{\textbf {c}}_{j} \rangle \right) }{\sum _{j'} \exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{1,i}, \bar{\textbf {c}}_{j'} \rangle \right) }, \\ \textbf {p}_{i,j}^{2}(\mathcal {Z},C) &= \frac{\exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{2,i}, \bar{\textbf {c}}_{j} \rangle \right) }{\sum _{j'} \exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{2,i}, \bar{\textbf {c}}_{j'} \rangle \right) }. \end{aligned}$$
(2)

We concisely write \( \textbf {p}^1_{i}(\mathcal {Z}) = \{ \textbf {p}^1_{i,j}(\mathcal {Z},C) \}_{j = 1}^K \) and \( \textbf {p}^2_{i} = \{ \textbf {p}^2_{i,j}(\mathcal {Z},C) \}_{j = 1}^K \). Here, \(\tau \) is a temperature parameter, and we set its value to 0.1, similar to Caron et al. (2020). Note that we use \(\textbf {p}_{i}\) to denote the predicted cluster assignment probabilities for image i (when not referring to a particular view), and the shorthand notation \(\textbf {p}\) is used when i is clear from context.
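A minimal sketch of Eq. (2) in PyTorch (function and variable names are ours); it assumes the prototypes are stored as a (K, d) matrix.

```python
import torch
import torch.nn.functional as F

def assignment_probs(z, prototypes, tau=0.1):
    """Cluster assignment probabilities of Eq. (2).

    z: (B, d) cluster embeddings for one view.
    prototypes: (K, d) cluster prototypes.
    Returns: (B, K) soft assignment probabilities.
    """
    z = F.normalize(z, dim=1)            # normalized embeddings
    c = F.normalize(prototypes, dim=1)   # normalized prototypes
    logits = z @ c.t() / tau             # cosine similarities scaled by 1/tau
    return logits.softmax(dim=1)
```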

2.2.2 Targets \(\textbf {q}\)

The idea of predicting the assignments \(\textbf {p}\) and then comparing them with high-confidence estimates \(\textbf {q}\) (referred to as codes henceforth) of the predictions was proposed by Xie et al. (2016a). While Xie et al. (2016a) used pretrained features (from autoencoders) to compute the predicted assignments and the codes, using their approach in an end-to-end unsupervised manner might lead to degenerate solutions. Asano et al. (2019) avoided such degenerate solutions by enforcing an equipartition constraint (the prototypes equally partition the data) during code computation using the Sinkhorn-Knopp algorithm (Cuturi, 2013). Caron et al. (2020) followed a similar formulation but computed the codes for the two views separately in an online manner for each minibatch. The assignment codes are computed by solving the following optimization problem:

$$\begin{aligned} {\begin{matrix} Q^1 &{}= \mathop {\hbox {arg max}}\limits _{Q\in \mathcal {Q}} \text {Tr}(Q^TC_0^TZ_0^1) + \epsilon H(Q) \\ Q^2 &{}= \mathop {\hbox {arg max}}\limits _{Q\in \mathcal {Q}} \text {Tr}(Q^TC_0^TZ_0^2) + \epsilon H(Q), \end{matrix}} \end{aligned}$$
(3)

where \( Q = \{\textbf {q}_1, \ldots , \textbf {q}_B \} \in \mathbb {R}_{+}^{K\times B}\), \(\mathcal {Q}\) is the transportation polytope defined by

$$\begin{aligned} \mathcal {Q} = \{\textbf {Q}\in \mathbb {R}^{K\times B}_{+}~\text {s.t}~ \textbf {Q}\textbf {1}_B = \frac{1}{K}\textbf {1}_K, \textbf {Q}^T\textbf {1}_K = \frac{1}{B}\textbf {1}_B \} \end{aligned}$$

Here, \(\textbf {1}_K\) is a vector of ones of dimension K, and \( H(Q) = -\sum _{i,j}Q_{i,j}\log Q_{i,j} \) is the entropy of Q. The above optimization problem is solved using a fast version of the Sinkhorn-Knopp algorithm (Cuturi, 2013), as described by Caron et al. (2020).

After computing the codes \(Q^1 \) and \(Q^2\), to maintain the consistency between the clustering results of the augmented views, the loss is computed using the probabilities \(\textbf {p}_{ij}\) and the assigned codes \(\textbf {q}_{ij}\) by comparing the probabilities of view 1 with the assigned codes of view 2 and vice versa, as in (1).
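The codes can be obtained with a few Sinkhorn-Knopp iterations. The sketch below follows the general recipe of Cuturi (2013) and Caron et al. (2020) but is our own simplified version, with an assumed regularization value, and is not the exact implementation used in the paper.

```python
import torch

@torch.no_grad()
def sinkhorn_codes(scores, eps=0.05, n_iters=3):
    """Approximate solution of Eq. (3) via the Sinkhorn-Knopp algorithm.

    scores: (B, K) similarity matrix (embeddings vs. prototypes) for one view.
    eps: entropy regularization strength (assumed value).
    Returns: (B, K) codes Q, one soft target per image.
    """
    Q = torch.exp(scores / eps).t()      # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)  # each cluster receives 1/K of the mass
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)  # each image receives 1/B of the mass
        Q /= B
    return (Q * B).t()                   # rows sum to 1: usable as soft targets
```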

2.2.3 Defining variations of \(Z_0\) and \(C_0\)

To compute \(\{Z_1,...,Z_M \}\), we project the d-dimensional space \(Z_0\) to a D-dimensional space using a random projection matrix. We follow the same procedure to compute \(\{C_1,...,C_M \}\) from \(C_0\). At the beginning of the algorithm, we randomly initialize M such transformations and fix them throughout training. Suppose that a particular random transformation (a randomly generated matrix A) yields \(\tilde{\textbf {z}} = A\textbf {z},\; \tilde{\textbf {c}} = A\textbf {c}\). We then compute the softmax probabilities using the normalized vectors \(\tilde{\textbf {z}}/\Vert \tilde{\textbf {z}}\Vert \) and \(\tilde{\textbf {c}}/\Vert \tilde{\textbf {c}}\Vert \). Repeating this step with each of the M transformations results in M predicted cluster assignment probabilities for each view. When the network is untrained, the embeddings \(\textbf {z}\) are random, and applying the random transformations, followed by computing the predicted cluster assignments, leads to a diverse set of soft cluster assignments. The parameter weights are trained by using the stochastic gradients of the loss for updates.
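A sketch of how the M fixed random projections could be generated and applied (our illustrative version; the use of a Gaussian projection matrix and its scaling are assumptions).

```python
import torch

def make_random_projections(M, d, D, seed=0):
    """Create M fixed random projection matrices A_m of shape (D, d)."""
    g = torch.Generator().manual_seed(seed)
    return [torch.randn(D, d, generator=g) / D ** 0.5 for _ in range(M)]

def project(z, c, A):
    """Apply one random transformation to embeddings z (B, d) and prototypes c (K, d)."""
    z_t = z @ A.t()   # \tilde{z} = A z
    c_t = c @ A.t()   # \tilde{c} = A c
    return z_t, c_t   # normalized inside assignment_probs before the softmax
```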

2.2.4 Backbone loss


To better capture exemplar consistency, and based on previous evidence of successful clustering with the ID approach (Tao et al., 2021), we use ID (Wu et al., 2018) as one of our losses. The exemplar objective of ID is to classify each image as its own class.

Given n images and a neural network \(f_{\theta }\) for calculating features, we first normalize the features \(\bar{f}_{\theta }(x) = \frac{f_{\theta }(x)}{\Vert f_{\theta }(x) \Vert }\). Then, ID defines the probability of an example x being recognized as the i-th example as

$$\begin{aligned} P(i \vert f_{\theta }(x)) = \frac{\exp \left( \langle \bar{f}_{\theta }(x_i), \bar{f}_\theta (x) \rangle / \tau \right) }{\sum _{j=1}^n \exp \left( \langle \bar{f}_{\theta }(x_j), \bar{f}_\theta (x) \rangle / \tau \right) }. \end{aligned}$$
(4)

ID then uses the uniform distribution as a noise distribution \(P_n = \frac{1}{n}\) to compute the probability that data example x comes from a data distribution \(P_d\) as opposed to the noise distribution \(P_n\) as \(h(i, f_{\theta }(x)) := \frac{P(i\vert f_{\theta }(x))}{P(i\vert f_{\theta }(x)) + m P_n(i)}\). Assuming that the noise samples are m times more frequent than actual data samples, the ID loss is defined as

$$\begin{aligned} {\begin{matrix} L_{b}&= - E_{P_d} \left[ \log h(i, x)\right] - m E_{P_n} \left[ \log (1 - h(i, x')) \right] , \end{matrix}} \end{aligned}$$
(5)

where \(x'\) is the feature from a randomly drawn image other than image x in a given dataset. We exactly follow the framework developed in Wu et al. (2018) to implement the ID loss.
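For reference, below is a simplified sketch of the non-parametric softmax in Eq. (4), assuming a memory bank of stored features as in Wu et al. (2018); the NCE approximation that yields Eq. (5) is omitted for brevity, and all names are ours.

```python
import torch
import torch.nn.functional as F

def instance_probs(features, memory_bank, tau=0.1):
    """Non-parametric softmax of Eq. (4).

    features: (B, d) batch features f_theta(x).
    memory_bank: (n, d) stored features, one row per training image.
    Returns: (B, n) probability of each batch image being recognized as each instance.
    """
    f = F.normalize(features, dim=1)
    v = F.normalize(memory_bank, dim=1)
    logits = f @ v.t() / tau
    return logits.softmax(dim=1)
```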

The final loss that we seek to minimize is the combination of the losses \(L_{\mathcal {Z}}\) in (1) and \(L_b\) in (5),

$$\begin{aligned} L_{\text {total}} = \alpha L_{\mathcal {Z}} + \beta L_b. \end{aligned}$$
(6)

where \(\alpha , \beta \) are nonnegative constants. Details of the algorithm are given in Algorithm 1, and we also provide PyTorch-style pseudocode in Algorithm 2 in the Appendix.

Algorithm 1
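Because Algorithm 1 is rendered as a figure in the published version, the following condensed PyTorch-style pseudocode (ours, not the authors' Algorithm 2) ties together the pieces sketched above; `augment`, `backbone`, `proj_head`, `id_loss`, `projections`, `prototypes` and the optimizer are placeholders.

```python
# Pseudocode: reuses assignment_probs, sinkhorn_codes, project and consensus_loss
# from the sketches above; F is torch.nn.functional.
for images, idx in loader:                                   # minibatch X_b with instance indices
    x1, x2 = augment(images), augment(images)                # two augmented views
    f1, f2 = backbone(x1), backbone(x2)                      # features used by the ID loss
    z1, z2 = proj_head(f1), proj_head(f2)                    # cluster embeddings in Z_0

    # targets q: Sinkhorn codes computed in the reference space Z_0 (no gradient)
    q1 = sinkhorn_codes(F.normalize(z1, dim=1) @ F.normalize(prototypes, dim=1).t())
    q2 = sinkhorn_codes(F.normalize(z2, dim=1) @ F.normalize(prototypes, dim=1).t())

    # predictions p: assignment probabilities under each fixed random transformation
    p1 = [assignment_probs(*project(z1, prototypes, A)) for A in projections]
    p2 = [assignment_probs(*project(z2, prototypes, A)) for A in projections]

    loss = alpha * consensus_loss(p1, p2, q1, q2) + beta * id_loss(f1, idx)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```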

2.2.5 Computing the cluster metrics

In this section, we describe the approach used to compute the cluster assignments and the metrics chosen to evaluate their quality. Note that we assume that the number of true clusters (K) in the data is known.

There are two ways to compute the cluster assignments. The first way is to use the embeddings generated by the backbone; here, the embeddings are the outputs of the ID block \(f_{\theta }(x)\). The embeddings of all the images are computed, and then we perform k-means clustering.

The second method is to use the soft clustering block to compute the cluster assignments. It is sufficient to use the computed probability assignments \(\{\textbf {p}_i\}_{i=1}^N\) or the computed codes \(\{\textbf {q}_i\}_{i=1}^N\) and assign the cluster index as \(c_i = \arg \max _{k} \textbf {q}_{ik}\) for the \(i^{\text {th}}\) data point. Once the model is trained, in this second approach, cluster assignment can be performed online without requiring the computation of the embeddings of all the input data.

We evaluate the quality of the clusterings using metrics such as the cluster accuracy, normalized mutual information (NMI), and adjusted Rand index (ARI). To compute the clustering accuracy, we are required to solve an assignment problem (computed using a Hungarian match  (Kuhn, 1955, 1956)) between the true class labels and the cluster assignments. In our analysis, we observe that using k-means with the embeddings produced by the ID block achieves better clustering accuracy, and we use this method throughout the paper while evaluating our proposed algorithm.
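For reference, the Hungarian-match clustering accuracy can be computed as in the sketch below (our code, using SciPy's linear_sum_assignment); it assumes the true labels and cluster indices both lie in {0, ..., K-1}.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred, num_clusters):
    """Best one-to-one match between cluster indices and class labels (Kuhn, 1955)."""
    cost = np.zeros((num_clusters, num_clusters), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                       # count cluster/label co-occurrences
    row, col = linear_sum_assignment(-cost)   # maximize the matched counts
    return cost[row, col].sum() / len(y_true)
```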

2.3 Generating multiple clustering results

Fred and Jain (2005) discussed different ways to generate cluster ensembles; these methods are tabulated in Table 1. In our proposed algorithm, we focus on choosing an appropriate data representation to generate cluster ensembles.

Table 1 Different ways to generate ensembles

By fixing a stable clustering algorithm, we can generate arbitrarily large ensembles by applying different transformations on the embeddings. Random projections were previously used successfully in consensus clustering (Fern and Brodley, 2003). By generating ensembles using random projections, we have control over the amount of diversity we induce into the framework by varying the dimensionality of the random projection. In addition to random projections, we also use diagonal transformations (Hsu et al.).

In Figs. 11 and 12, we show how the running mean of accuracy progresses during training for each of the experiments in Table 10.

Table 10 Data augmentation details
Fig. 11 Effect of data augmentation on CIFAR-10

Fig. 12 Effect of data augmentation on CIFAR100-20

5.3 Effect of image resolution

Image resolution is often considered a free parameter (Niu et al., 2020a); however, its effect on clustering performance is not evaluated rigorously in most works. We try to quantify the effects of different resolutions to the greatest extent possible, given that some datasets are available only at specific resolutions. For STL-10, we use \(32\times 32\), \(64\times 64\) and \(96\times 96\) resolutions. For ImageNet-10 and ImageNet-Dog-15, we use \(96\times 96\), \(160\times 160\) and \(224\times 224\) resolutions. The results are given in Table 11.

Table 11 Effects of different resolutions for STL-10, ImageNet-10 and ImageNet-Dogs

The best performance for ImageNet-10 and ImageNet-Dogs is obtained at a resolution of \(160\times 160\), and for STL-10, the best performance is obtained at a resolution of \(96\times 96\). It is not clear why ImageNet-10 and ImageNet-Dogs do not yield the best performance at higher resolutions; further investigation is needed, and we leave this as an open problem.

5.4 Distribution of accuracies across the set of hyperparameters

Table 12 Hyperparameters and the range values used for the experiments
Table 13 Hyperparameters for obtaining maximum performance
Fig. 13 Components of the consensus loss; ablation of STL-10 and CIFAR100-20

The proposed consensus loss has two parameters. The first is the number of transformations used, and the second is the dimensionality of the projection space. To understand the proposed loss, we conduct a detailed experimental study on STL-10 and CIFAR100-20. The hyperparameters used are given in Table 12.

Due to the sheer number of conducted experiments, we supply summary statistics obtained on a random set of experiments. We report the empirical mean and standard deviation of the marginal distribution of the quantity under investigation. Let \(P_{\tau ,\eta ,d,l}\) be the joint distribution over the hyperparameters \(\tau \) (temperature parameter), l (learning rate), \(\eta \) (natural log of the number of transformations) and d (dimensionality of the projection space). We consider \(n_h\) to be the number of distinct values used in the experiment for each hyperparameter \(h \in \{ \tau ,\eta ,d,l \}\) based on Table 12. We denote the accuracy of each experiment, based on the hyperparameters used, as \(a_{\tau ,\eta ,d,l}\). Let \(P_{h_i \vert h_j}\) be the conditional marginal distribution of hyperparameter \(h_i\) given \(h_j\), and let the conditional empirical mean of \(P_{h_i \vert h_j}\) be \(m(P_{h_i \vert h_j})\). In this case, the conditional empirical mean \(m(P_{h_i \vert h_j})\) when \(h_i = d\) and \(h_j=\tau \) can be calculated using \(m(P_{d \vert \tau }) = \frac{1}{n_{\eta } \times n_{l}} \sum _{\eta } \sum _{l} a_{\tau ,\eta ,d,l}\). The conditional empirical means and standard deviations of the other hyperparameters are calculated in the same way. In Fig. 13, we show each conditional empirical mean with a blue dot, and each red line around a dot represents one standard deviation. For both STL-10 and CIFAR100-20, we see a trend in the number of projections. For STL-10, the smaller the number of random projections, the better the results are, and for CIFAR100-20, increasing the number of random projections improves the clustering accuracy up to a point. Note that when the number of random projections is equal to zero, our setting is equivalent to the baseline ID model, and our approach always performs better than ID. This means that the optimal number of random projections is greater than or equal to one. There is no such clear trend in the dimensionality of the random projections.
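As a small illustration, the conditional empirical means and standard deviations described above can be computed from a table of grid-search results; the DataFrame layout, file name and column names below are hypothetical.

```python
import pandas as pd

# One row per run; columns: tau, eta (log #transformations), d (projection dim), lr, acc.
runs = pd.read_csv("grid_search_results.csv")   # hypothetical results file

# Conditional empirical mean and std of accuracy, e.g. m(P_{d | tau}):
# average over eta and lr for each (tau, d) pair.
cond_stats = runs.groupby(["tau", "d"])["acc"].agg(["mean", "std"])
print(cond_stats)
```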

Fig. 14 The dotted red lines show the accuracy of the baseline, i.e., ID (Tao et al., 2021), on the corresponding dataset, and DE is the density estimate of the empirical distribution: (a) empirical accuracy distribution for STL-10; (b) empirical accuracy distribution for CIFAR100-20

The max-performance procedure provides some insights into the performance of the algorithms at hand, although it does not provide the whole picture because it does not consider the robustness of the performance differences. In Table 13, we give the hyperparameters that yield the maximum performance. In other words, finding a hyperparameter set that yields better performance than the baseline is the core idea behind the max-performance procedure. We ask the following question: given a hyperparameter grid, how likely is our method to achieve better accuracy than the baseline? In Fig. 14, we report the empirical accuracy distributions on STL-10 and CIFAR100-20 for all hyperparameters given in Table 12. The red dotted lines show the corresponding baseline accuracy for each dataset. For STL-10, only approximately \(12.5\%\) of the hyperparameter sets yield better results than the baseline. On the other hand, for CIFAR100-20, approximately \(90\%\) of the hyperparameter sets yield better results than the baseline. In other words, it does not require a significant amount of computational power to find a better model than the state-of-the-art models for CIFAR100-20; however, the situation is the opposite for STL-10. The results given in Fig. 14 suggest that when comparing models, multiple metrics need to be considered, not only the max-performance procedure.

5.5 Effect of architecture choice

Fig. 15 Empirical distribution of the performance difference between Residual Network (ResNet)-50 and ResNet-18. DE is the density estimate of the empirical distribution

In this work, we use ResNet-18 and ResNet-50 as network architectures. For both ResNet-18 and ResNet-50, we sweep over the same set of hyperparameter choices, i.e., the temperature, the number of projections and the projection dimensionality, and report the results for the ImageNet-10 dataset with an image resolution of \(160\times 160\). Figure 15 shows the distribution of \(\Delta _{acc}\), which is defined as the accuracy difference between ResNet-50 and ResNet-18. Figure 15 indicates that ResNet-50 slightly outperforms ResNet-18, i.e., the mean difference is approximately \(0.5\%\).

5.6 Runtime comparison

Fig. 16 Comparison of the runtimes per epoch for ID and ConCURL on the CIFAR-10 dataset. The number of random transformations used in the computation of the consensus loss is indicated in parentheses

To study the runtime of the proposed method, we compare the time taken per epoch for the baseline ID algorithm and the proposed algorithm. Due to the additional loss computation, the time taken to run the proposed algorithm is higher, as can be observed in Fig. 16. The additional time is mainly due to computing the consensus loss for the different numbers of transformations. The current implementation computes the forward passes for the different transformations sequentially, thus increasing the runtime. However, a more time-efficient implementation in which the forward passes for all the different random transformations are computed in parallel could make the runtime more comparable to that of the baseline ID algorithm.
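One way to realize such a parallel implementation is to stack the M projection matrices and apply them in a single batched contraction; the sketch below is our suggestion under that assumption, not the paper's implementation.

```python
import torch

def project_all(z, c, A_stack):
    """Apply all M random projections at once.

    z: (B, d) embeddings, c: (K, d) prototypes, A_stack: (M, D, d) stacked projections.
    Returns z_t: (M, B, D) and c_t: (M, K, D), one slice per transformation.
    """
    z_t = torch.einsum("mdk,bk->mbd", A_stack, z)   # (M, B, D)
    c_t = torch.einsum("mdk,nk->mnd", A_stack, c)   # (M, K, D)
    return z_t, c_t
```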

6 Conclusion

In this work, we introduce different notions of the consistency constraints that are enforced in different unsupervised/self-supervised learning algorithms. We propose a novel clustering algorithm that seamlessly incorporates all three consistency constraints (exemplar, population and consensus) and achieves state-of-the-art clustering results for four out of five popular and challenging computer vision datasets. Our work on consensus clustering is significantly different from earlier consensus clustering works that do not learn representations. Moreover, we initiate a discussion on the adequacy of the currently used methods for evaluating clustering algorithms. We significantly extend the evaluation procedure for clustering algorithms, thereby reflecting the challenges of applying clustering to real-world tasks. We provide evaluation results for ConCURL and other state-of-the-art clustering algorithms based on max-performance criteria, according to which ConCURL outperforms other algorithms on most datasets. However, its average performance according to out-of-distribution criteria highlights the need to use the proposed evaluation methods for deep clustering algorithms.