1 Introduction

The field of artificial intelligence (AI) advanced significantly in the previous decade due to developments in deep learning (LeCun et al., 2015). In the early years of this field, deep learning methods exhibited stellar supervised learning performance, where each data sample was coupled with a ground truth label (labeled data), e.g., each image was associated with a category. Unfortunately, generating labeled datasets is time consuming and expensive, and there may not be enough experts to label the data at hand (e.g., medical images). A natural alternative for such large-scale problems is unsupervised learning, and clustering in particular.

In this work, we focus on representation learning for the unsupervised learning task of clustering images. Clustering is a ubiquitous task and has been actively used in many different scientific and practical pursuits (Frey & Dueck, 2007; Masulli & Schenone, 1999; Jain et al., 1999; Xu & Wunsch, 2005). Classical clustering algorithms do not learn representations and are hence limited to data for which a good representation is already available.

Advancements in deep learning techniques have enabled the end-to-end learning of rich image representations for supervised learning. For clustering, however, such features cannot be obtained via supervised learning due to the lack of labels; supervised approaches therefore fall short of providing a solution. Self-supervised learning addresses the issue of learning representations without labeled data. It is a subfield of unsupervised learning in which the main goal is to learn general-purpose representations by exploiting user-defined tasks (pretext tasks) (Wu et al., 2018; Zhuang et al., 2019; He et al., 2020; Chen et al., 2020; Grill et al., 2020; Caron et al., 2020). Representation learning algorithms have been shown to achieve good results when evaluated using a linear evaluation protocol, semisupervised training on ImageNet, or transfer to downstream tasks. A straightforward solution to the clustering problem is therefore to take the features obtained via self-supervised learning and apply an out-of-the-box clustering algorithm (such as k-means) to compute data clusters. However, the clustering performance of these features is not well characterized, and as our results show, they can be improved for clustering purposes.

On the other hand, deep clustering involves simultaneously learning cluster assignments and features using deep neural networks. Simultaneously learning the feature space with a clustering objective may lead to degenerate solutions, which until recently limited end-to-end implementations of clustering with representation learning approaches (Caron et al., 2018a). Subsequently, several works have been developed (Xie et al., 2016a; Caron et al., 2018a; Shah and Koltun, 2018; Ji et al., 2019a; Niu et al., 2020a; Wu et al., 2019a; Huang et al., 2020a; Tao et al., 2021). We provide details regarding some of these works in Sect. 1.1. Our previous work (Regatti et al., 2021) showed some encouraging results, and we extend it substantially here. We categorize current clustering and representation learning works based on the consistency constraints that are used to define their objective functions. We define an additional notion of consistency, consensus consistency, which ensures that representations are learned to induce similar partitions for variations in the representation space, different clustering algorithms or different initializations of a clustering algorithm. Using consensus consistency, we propose an end-to-end learning approach that outperforms other end-to-end methods for image clustering. We summarize our contributions as follows:

  1. We introduce different notions of consistency (exemplar, population and consensus) that are used in unsupervised representation learning.

  2. We propose a novel clustering algorithm that incorporates the above three consistency constraints and can be trained in an end-to-end way. An ensemble is generated in the consensus clustering objective by performing random transformations on the underlying embeddings. We combine several methods, which is not trivial, and this combination, along with our new consensus loss, is novel.

  3. We show that the proposed algorithm ConCURL (consensus clustering with unsupervised representation learning) outperforms baselines on popularly used computer vision datasets when evaluated with clustering metrics.

  4. We demonstrate the clustering abilities of trained models under a data shift and argue for the need for different evaluation metrics for deep clustering algorithms.

  5. We study the impacts of various hyperparameters, data augmentation methods, and image resolutions on the clustering ability of the proposed algorithm.

1.1 Related work

1.1.1 Self-supervised learning

Self-supervised learning is used to learn representations in an unsupervised way by defining pretext tasks. Self-supervised learning comes in many different flavors; instance discrimination (ID) tasks (Wu et al., 2018) are one prominent example, and more recent methods have achieved state-of-the-art results without requiring negative pairs. Although self-supervised learning methods exhibit impressive performance on a variety of problems, it is not clear whether the learned representations are well suited for clustering.

1.1.2 Clustering with representation learning

DEC (Xie et al., 2016a) is one of the first algorithms to show that deep learning can be used to effectively cluster images in an unsupervised manner; this approach uses features learned from an autoencoder to fine-tune the cluster assignments. DeepCluster (Caron et al., 2018a) shows that it is possible to train deep convolutional neural networks (DeCNNs) in an end-to-end manner with pseudolabels that are generated by a clustering algorithm. Subsequently, several works (Shah and Koltun, 2018; Ji et al., 2019a; Niu et al., 2020a; Wu et al., 2019a; Huang et al., 2020a) have introduced end-to-end clustering-based objectives and achieved state-of-the-art clustering results. For example, in the Gaussian attention network for image clustering (GATCluster) (Niu et al., 2020a), training is performed in two distinct steps (similar to Caron et al. (2018a)), where the first step is to compute pseudotargets for a large batch of data and the second step is to train the model in a supervised way using these pseudotargets. Both DeepCluster and GATCluster use k-means to generate pseudolabels, which may not scale well. Wu et al. (2019a) proposed deep comprehensive correlation mining (DCCM), where discriminative features are learned by taking advantage of the correlations among the data using pseudolabel supervision and the triplet mutual information among the features. However, DCCM may be susceptible to trivial solutions (Niu et al., 2020a). Invariant information clustering (IIC) (Ji et al., 2019a) maximizes the mutual information between the class assignments of two different views of the same image (paired samples) to learn representations that preserve the commonalities between the views while discarding instance-specific details. It has been argued that the presence of an entropy term in the mutual information plays an important role in avoiding degenerate solutions. However, a large batch size is needed for the computation of mutual information in IIC; this may not be scalable for larger image sizes, which are common in popular datasets (Ji et al., 2019a; Niu et al., 2020a). Huang et al. (2020a) extended the celebrated maximal margin clustering idea to the deep learning paradigm by learning the most semantically plausible clusters through the minimization of a proposed partition uncertainty index. Their algorithm, PICA, uses a stochastic version of this index, thereby facilitating minibatch training. PICA fails to assign a sample to the correct cluster when that sample has either high foreground or background similarity to samples in other clusters. In a more recent approach, contrastive clustering (Li et al., 2021), a contrastive learning loss (as in SimCLR (Chen et al., 2020)) was adopted along with an entropy term to avoid degenerate solutions. Similarly, Tao et al. (2021) combined ID (Wu et al., 2018) with novel softmax-formulated decorrelation constraints for representation learning and clustering. Their approach outperforms state-of-the-art methods and improves upon the instance discrimination method. Our method also improves upon ID and outperforms the method of Tao et al. (2021) on all datasets considered. There are other non-end-to-end approaches, such as SCAN (Van Gansbeke et al., 2020), which use the representations learned from a pretext task to find the images that are semantically closest to a given image via nearest neighbors.
Similarly, SPICE (Niu et al., 2021), another state-of-the-art non-end-to-end approach, divides the clustering network into two parts: one measures instance-level similarity, and the other identifies cluster-level discrepancy.

2 Consensus clustering

One of the distinguishing factors between supervised and unsupervised learning is the existence of ground truth labels, which impose a global constraint over the examples. In most self-supervised learning methods, the ground truth is replaced with some consistency constraint (Chen et al., 2020). The performance of any self-supervised method is therefore a function of the power of the consistency constraint used. We define two types of consistency constraints: exemplar consistency and population consistency.

Definition 1

Exemplar consistency: Representation learning algorithms that learn closer representations (in terms of some distance metric) for different augmentations of the same data point are said to follow exemplar consistency.

Examples of the usage of exemplar consistency include contrastive learning methods such as MoCo (He et al., 2019) and SimCLR (Chen et al., 2020). In these methods, a positive pair of images is defined as any two image augmentations of the same image, and a negative pair consists of any two different images.

Definition 2

Population consistency: Representation learning algorithms that ensure that two similar data points, or any augmentations of the same data point, belong to the same cluster (or population) are said to follow population consistency.

DeepCluster (Caron et al., 2018a) is a prominent self-supervised method that utilizes population consistency, i.e., Definition 2, by enforcing a clustering constraint on the input dataset. Note that each cluster contains data points that are similar to one another. Similarly, SwAV (Caron et al., 2020) is another example of a method that enforces population consistency.

Definition 3

Consensus consistency: Representation learning algorithms that are able to learn representations that induce similar partitions for variations in the given representation space (subsets of features, random projections, etc.), different clustering algorithms (k-means, Gaussian mixture models (GMMs), etc.) or different initializations of clustering algorithms are said to follow consensus consistency.

Earlier works on consensus consistency did not consider representation learning and used the knowledge reuse framework (see Strehl and Ghosh (2002); Ghosh and Acharya (2011)), where the cluster partitions were available (the features were irrelevant) or the features of the data were fixed. For example, Fern and Brodley (2003) successfully applied random projections to consensus clustering by performing k-means clustering on multiple random projections of the fixed features of the input data. In contrast, the notion of consensus consistency here deals with learning representations that achieve a consensus regarding the cluster assignments of multiple clustering algorithms. One example of a method that enforces consensus consistency is local aggregation (LA) (Zhuang et al., 2019). LA builds on the ID task (Wu et al., 2018) and was proposed as a method based on a robust clustering objective (using multiple runs of k-means) to move statistically similar data points closer in the representation space and dissimilar data points further away. However, Zhuang et al. (2019) did not evaluate the method with clustering metrics and focused only on linear evaluation using the learned features. We therefore conducted a study to evaluate the clustering performance of these features (see the Appendix) and observed that LA performed poorly when evaluated for clustering accuracy. In Definition 3, we inherently assume that the clustering algorithms under consideration have been tuned properly. Unfortunately, the definition of consensus consistency is ill posed, and there can be arbitrarily many different partitions that satisfy the given condition. We show that when exemplar consistency is used as an inductive bias, the resulting objective function achieves impressive performance on challenging datasets. Combining the exemplar and population constraints with consensus consistency seamlessly and effectively for clustering is the basis of our proposed method.

2.1 Loss for consensus and population consistency

We focus on learning generic representations that satisfy Definition 3 for clustering. By using different clustering algorithms or different representation variations (such as projections), one can easily generate multiple different partitions of the same data. In unsupervised learning, it is not known which partitioning is correct. To tackle this problem, some additional assumptions are needed.

We assume that there is an underlying latent space \(\mathcal {Z}^*\) (possibly not unique) such that all clusterings (based on latent space, algorithm or initialization variations) that take input data from this latent space produce similar data partitions. Furthermore, every clustering algorithm that also takes the true number of clusters as input produces the partition that is closest to the hypothetical ground truth. Moreover, we assume that there exists a function \(h:X \rightarrow \mathcal {Z}^*\), where X represents the input space and \(\mathcal {Z}^*\) represents the underlying latent space. We call this assumption the principle of consensus. The open question is how one constructs an efficient loss that reflects the principle of consensus. We define one such way below.

Given an input batch of images \(\mathcal {X}_b\subset \mathcal {X}\), the goal is to partition these images into K clusters. We obtain p views of these images (via different image augmentations) and define a loss such that the cluster assignment of any of the p views matches the target estimated from any other view. Without loss of generality, we define the loss for \(p = 2\) views. The two views \(\mathcal {X}_b^1, \mathcal {X}_b^2\) are generated using two randomly chosen image augmentations.

We learn a representation space \(\mathcal {Z}_0\) at the end of every training iteration and obtain M variations of \(\mathcal {Z}_0\) as \(\{ \mathcal {Z}_1, \mathcal {Z}_2, ... , \mathcal {Z}_M\}\) (e.g., random projections). The goal is to build an efficient loss according to the principle of consensus among \(\mathcal {Z}_0\) and its M variations \(\{ \mathcal {Z}_1, \mathcal {Z}_2, ... , \mathcal {Z}_M\}\) such that we learn the latent space \(\mathcal {Z}^*\) at the end of training (i.e., the learned features lie in the latent space described above). For a given batch of images \(\mathcal {X}_b\) and a representation space \(\mathcal {Z}_m, \forall m \in [1,...,M]\), we denote the cluster assignment probability of image i and cluster j for view 1 as \(\textbf {p}_{i,j}^{1}(\mathcal {Z}_m)\) and that for view 2 as \(\textbf {p}_{i,j}^{2}(\mathcal {Z}_m)\). We concisely use \(\tilde{\textbf {p}}^{(1,m)},\tilde{\textbf {p}}^{(2,m)}\) when we talk about all the images and all the clusters. Here, we define a loss that incorporates “population consistency" and “consensus consistency". We assume that the target cluster assignment probabilities for the representation \(\mathcal {Z}_0\) are given (as in DeepCluster (Caron et al., 2018a)), and they are denoted as \(\textbf {q}_{i,j}^{1}\) for view 1 and \(\textbf {q}_{i,j}^{2}\) for view 2.

We define the loss for any representation space \(\mathcal {Z}\) and batch of images \(\mathcal {X}_b\) as

$$\begin{aligned} {\begin{matrix} L_{\mathcal {Z}_m}^1 &=& - \frac{1}{2B}\sum _{i=1}^{B}\sum _{j=1}^K \textbf {q}^{2}_{ij} \log \textbf {p}^{1}_{ij}(\mathcal {Z}_m) \\ L_{\mathcal {Z}_m}^2 &=& - \frac{1}{2B}\sum _{i=1}^{B}\sum _{j=1}^K \textbf {q}^{1}_{ij} \log \textbf {p}^{2}_{ij}(\mathcal {Z}_m), \\ L_{\mathcal {Z}} &=& \sum _{m = 1}^M \Big ( L_{\mathcal {Z}_m}^1 + L_{\mathcal {Z}_m}^2 \Big ). \end{matrix}} \end{aligned}$$
(1)

Note that here, consensus among the clustering results is defined via the number of common targets \(\textbf {q}\). An overview of the procedure is shown in Fig. 1. The exact details regarding how to obtain variations of \(\mathcal {Z}_0\) and calculate the cluster assignment probabilities \(\textbf {p}\) and targets \(\textbf {q}\) are described in the next section.
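To make the swapped-prediction structure of Eq. (1) concrete, the sketch below is a minimal PyTorch-style implementation written by us (not the authors' released code); it assumes the per-variation assignment probabilities and the targets are already available as tensors.

```python
import torch

def consensus_loss(p_view1, p_view2, q_view1, q_view2):
    """Sketch of the consensus loss in Eq. (1).

    p_view1, p_view2: lists of M tensors of shape (B, K) with cluster
        assignment probabilities under each variation Z_1..Z_M.
    q_view1, q_view2: (B, K) target codes computed from the reference
        space Z_0 for views 1 and 2.
    """
    B = q_view1.shape[0]
    total = 0.0
    for p1, p2 in zip(p_view1, p_view2):
        # swapped prediction: the targets of one view supervise the other view
        l1 = -(q_view2 * torch.log(p1 + 1e-8)).sum() / (2 * B)
        l2 = -(q_view1 * torch.log(p2 + 1e-8)).sum() / (2 * B)
        total = total + l1 + l2
    return total
```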

Fig. 1 An illustration of the consensus loss part of ConCURL

2.2 End-to-end stochastic gradient descent (SGD)-based trainable consensus loss

In this section, we propose an end-to-end trainable algorithm and define a way to compute \(\textbf {p}\) and \(\textbf {q}\). When the cluster assignment probabilities \(\textbf {p}\) can take any values in the set [0, 1], we refer to the process as soft clustering, and when \(\textbf {p}\) is restricted to the set \(\{0,1\}\), we refer to the process as hard clustering.

Without loss of generality, in this paper, we focus on soft clustering, which makes it easier to define a loss function using the probabilities and update the parameters using the gradients to enable end-to-end learning. We follow the soft clustering framework presented in SwAV (Caron et al., 2020), which is a centroid-based technique that aims to maintain consistency between the clusterings of the augmented views \(\mathcal {X}_b^{1}\) and \(\mathcal {X}_b^{2}\). We store a set of randomly initialized prototypes \(C_0=\{ \textbf {c}_0^1,\cdots ,\textbf {c}_0^K \} \in \mathbb {R}^{d\times K}\), where K is the number of clusters and d is the dimensionality of the prototypes. These prototypes are used to represent clusters and define a “consensus consistency" loss. We compute M variations of \(C_0\) as \(C_1,...,C_M\) exactly as we compute the M variations of \(\mathcal {Z}_0\).

2.2.1 Cluster assignment probability \(\textbf {p}\)

We use a two-layer multilayer perceptron (MLP) g to project the features \(\textbf {f}^1 = f_\theta (\mathcal {X}_b^1)\) and \(\textbf {f}^2 = f_\theta (\mathcal {X}_b^2)\) to a lower-dimensional space \(\mathcal {Z}_0\) (of size d). The outputs of this MLP (referred to as cluster embeddings) are denoted as \({Z}_0^1 = \{\textbf {z}_0^{1,1}, \ldots , \textbf {z}_0^{1,B} \}\) and \({Z}_0^2 = \{\textbf {z}_0^{2,1}, \ldots , \textbf {z}_0^{2,B} \}\) for view 1 and view 2, respectively. Note that \(h: \mathcal {X} \rightarrow \mathcal {Z}\) defined in Sect. 2.1 is equivalent to the composition of \(f: \mathcal {X} \rightarrow \Phi \) and \(g: \Phi \rightarrow \mathcal {Z}\).
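As an illustration, a two-layer projection head of this kind could look as follows in PyTorch; the hidden width and the output dimensionality d below are assumptions, not values taken from the paper.

```python
import torch.nn as nn

# Illustrative projection head g: backbone features -> cluster embeddings in Z_0.
proj_head = nn.Sequential(
    nn.Linear(2048, 512),   # 2048 = ResNet-50 feature size (512 for ResNet-18)
    nn.ReLU(inplace=True),
    nn.Linear(512, 128),    # d = 128 cluster-embedding dimension (assumed)
)
```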

For a latent space \(\mathcal {Z}\), we compute the probability of assigning a cluster j to image i using the normalized vectors \(\bar{\textbf {z}}^{1,i} = \frac{\textbf {z}^{1,i}}{\Vert \textbf {z}^{1,i}\Vert }\), \(\bar{\textbf {z}}^{2,i} = \frac{\textbf {z}^{2,i}}{\Vert \textbf {z}^{2,i}\Vert }\) and \(\bar{\textbf {c}}_j = \frac{{\textbf{c}}^j}{\Vert {\textbf{c}}^j\Vert }\) as

$$\begin{aligned} \textbf {p}_{i,j}^{1}(\mathcal {Z},C) &= \frac{\exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{1,i}, \bar{\textbf {c}}_{j} \rangle \right) }{\sum _{j'} \exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{1,i}, \bar{\textbf {c}}_{j'} \rangle \right) }, \\ \textbf {p}_{i,j}^{2}(\mathcal {Z},C) &= \frac{\exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{2,i}, \bar{\textbf {c}}_{j} \rangle \right) }{\sum _{j'} \exp \left( \frac{1}{\tau }\langle \bar{\textbf {z}}^{2,i}, \bar{\textbf {c}}_{j'} \rangle \right) }. \end{aligned}$$
(2)

We concisely write \( \textbf {p}^1_{i}(\mathcal {Z}) = \{ \textbf {p}^1_{i,j}(\mathcal {Z},C) \}_{j = 1}^K \) and \( \textbf {p}^2_{i} = \{ \textbf {p}^2_{i,j}(\mathcal {Z},C) \}_{j = 1}^K \). Here, \(\tau \) is a temperature parameter, and we set its value to 0.1, similar to Caron et al. (2020). Note that we use \(\textbf {p}_{i}\) to denote the predicted cluster assignment probabilities for image i (when not referring to a particular view), and the shorthand notation \(\textbf {p}\) is used when i is clear from context.
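A minimal sketch of Eq. (2) in PyTorch (function and variable names are ours); it assumes the prototypes are stored as a (K, d) matrix.

```python
import torch
import torch.nn.functional as F

def assignment_probs(z, prototypes, tau=0.1):
    """Cluster assignment probabilities of Eq. (2).

    z: (B, d) cluster embeddings for one view.
    prototypes: (K, d) cluster prototypes.
    Returns: (B, K) soft assignment probabilities.
    """
    z = F.normalize(z, dim=1)            # normalized embeddings
    c = F.normalize(prototypes, dim=1)   # normalized prototypes
    logits = z @ c.t() / tau             # cosine similarities scaled by 1/tau
    return logits.softmax(dim=1)
```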

2.2.2 Targets \(\textbf {q}\)

The idea of predicting the assignments \(\textbf {p}\) and then comparing them with high-confidence estimates \(\textbf {q}\) (referred to as codes henceforth) of the predictions was proposed by Xie et al. (2016a). While Xie et al. (2016a) used pretrained features (from autoencoders) to compute the predicted assignments and the codes, using their approach in an end-to-end unsupervised manner might lead to degenerate solutions. Asano et al. (2019) avoided such degenerate solutions by enforcing an equipartition constraint (the prototypes equally partition the data) during code computation using the Sinkhorn-Knopp algorithm (Cuturi, 2013). Caron et al. (2020) followed a similar formulation but computed the codes for the two views separately in an online manner for each minibatch. The assignment codes are computed by solving the following optimization problem:

$$\begin{aligned} {\begin{matrix} Q^1 &{}= \mathop {\hbox {arg max}}\limits _{Q\in \mathcal {Q}} \text {Tr}(Q^TC_0^TZ_0^1) + \epsilon H(Q) \\ Q^2 &{}= \mathop {\hbox {arg max}}\limits _{Q\in \mathcal {Q}} \text {Tr}(Q^TC_0^TZ_0^2) + \epsilon H(Q), \end{matrix}} \end{aligned}$$
(3)

where \( Q = \{\textbf {q}_1, \ldots , \textbf {q}_B \} \in \mathbb {R}_{+}^{K\times B}\), \(\mathcal {Q}\) is the transportation polytope defined by

$$\begin{aligned} \mathcal {Q} = \{\textbf {Q}\in \mathbb {R}^{K\times B}_{+}~\text {s.t}~ \textbf {Q}\textbf {1}_B = \frac{1}{K}\textbf {1}_K, \textbf {Q}^T\textbf {1}_K = \frac{1}{B}\textbf {1}_B \} \end{aligned}$$

Here, \(\textbf {1}_K\) is a vector of ones of dimension K, and \( H(Q) = -\sum _{i,j}Q_{i,j}\log Q_{i,j} \) is the entropy of Q. The above optimization problem is solved using a fast version of the Sinkhorn-Knopp algorithm (Cuturi, 2013), as described by Caron et al. (2020).

After computing the codes \(Q^1 \) and \(Q^2\), to maintain the consistency between the clustering results of the augmented views, the loss is computed using the probabilities \(\textbf {p}_{ij}\) and the assigned codes \(\textbf {q}_{ij}\) by comparing the probabilities of view 1 with the assigned codes of view 2 and vice versa, as in (1).
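The codes can be obtained with a few Sinkhorn-Knopp iterations. The sketch below follows the general recipe of Cuturi (2013) and Caron et al. (2020) but is our own simplified version, with an assumed regularization value, and is not the exact implementation used in the paper.

```python
import torch

@torch.no_grad()
def sinkhorn_codes(scores, eps=0.05, n_iters=3):
    """Approximate solution of Eq. (3) via the Sinkhorn-Knopp algorithm.

    scores: (B, K) similarity matrix (embeddings vs. prototypes) for one view.
    eps: entropy regularization strength (assumed value).
    Returns: (B, K) codes Q, one soft target per image.
    """
    Q = torch.exp(scores / eps).t()      # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)  # each cluster receives 1/K of the mass
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)  # each image receives 1/B of the mass
        Q /= B
    return (Q * B).t()                   # rows sum to 1: usable as soft targets
```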

2.2.3 Defining variations of \(Z_0\) and \(C_0\)

To compute \(\{Z_1,...,Z_M \}\), we project the d-dimensional space \(Z_0\) to a D-dimensional space using a random projection matrix. We follow the same procedure to compute \(\{C_1,...,C_M \}\) from \(C_0\). At the beginning of the algorithm, we randomly initialize M such transformations and fix them throughout training. Suppose that a particular random transformation (a randomly generated matrix A) yields \(\tilde{\textbf {z}} = A\textbf {z},\; \tilde{\textbf {c}} = A\textbf {c}\). We then compute the softmax probabilities using the normalized vectors \(\tilde{\textbf {z}}/\Vert \tilde{\textbf {z}}\Vert \) and \(\tilde{\textbf {c}}/\Vert \tilde{\textbf {c}}\Vert \). Repeating this step with each of the M transformations results in M predicted cluster assignment probabilities for each view. When the network is untrained, the embeddings \(\textbf {z}\) are random, and applying the random transformations, followed by computing the predicted cluster assignments, leads to a diverse set of soft cluster assignments. The parameter weights are trained by using the stochastic gradients of the loss for updates.
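A sketch of how the M fixed random projections could be generated and applied (our illustrative version; the use of a Gaussian projection matrix and its scaling are assumptions).

```python
import torch

def make_random_projections(M, d, D, seed=0):
    """Create M fixed random projection matrices A_m of shape (D, d)."""
    g = torch.Generator().manual_seed(seed)
    return [torch.randn(D, d, generator=g) / D ** 0.5 for _ in range(M)]

def project(z, c, A):
    """Apply one random transformation to embeddings z (B, d) and prototypes c (K, d)."""
    z_t = z @ A.t()   # \tilde{z} = A z
    c_t = c @ A.t()   # \tilde{c} = A c
    return z_t, c_t   # normalized inside assignment_probs before the softmax
```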

2.2.4 Backbone loss


To better capture exemplar consistency, and based on previous evidence of successful clustering with the ID approach (Tao et al., 2021), we use ID (Wu et al., 2018) as one of our losses. The exemplar objective of ID is to classify each image as its own class.

Given n images and a neural network \(f_{\theta }\) for calculating features, we first normalize the features \(\bar{f}_{\theta }(x) = \frac{f_{\theta }(x)}{\Vert f_{\theta }(x) \Vert }\). Then, ID defines the probability of an example x being recognized as the i-th example as

$$\begin{aligned} P(i \vert f_{\theta }(x)) = \frac{\exp \left( \langle \bar{f}_{\theta }(x_i), \bar{f}_\theta (x) \rangle / \tau \right) }{\sum _{j=1}^n \exp \left( \langle \bar{f}_{\theta }(x_j), \bar{f}_\theta (x) \rangle / \tau \right) }. \end{aligned}$$
(4)

ID then uses the uniform distribution as a noise distribution \(P_n = \frac{1}{n}\) to compute the probability that data example x comes from a data distribution \(P_d\) as opposed to the noise distribution \(P_n\) as \(h(i, f_{\theta }(x)) := \frac{P(i\vert f_{\theta }(x))}{P(i\vert f_{\theta }(x)) + m P_n(i)}\). Assuming that the noise samples are m times more frequent than actual data samples, the ID loss is defined as

$$\begin{aligned} {\begin{matrix} L_{b}&= - E_{P_d} \left[ \log h(i, x)\right] - m E_{P_n} \left[ \log (1 - h(i, x')) \right] , \end{matrix}} \end{aligned}$$
(5)

where \(x'\) is the feature from a randomly drawn image other than image x in a given dataset. We exactly follow the framework developed in Wu et al. (2018) to implement the ID loss.
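For reference, below is a simplified sketch of the non-parametric softmax in Eq. (4), assuming a memory bank of stored features as in Wu et al. (2018); the NCE approximation that yields Eq. (5) is omitted for brevity, and all names are ours.

```python
import torch
import torch.nn.functional as F

def instance_probs(features, memory_bank, tau=0.1):
    """Non-parametric softmax of Eq. (4).

    features: (B, d) batch features f_theta(x).
    memory_bank: (n, d) stored features, one row per training image.
    Returns: (B, n) probability of each batch image being recognized as each instance.
    """
    f = F.normalize(features, dim=1)
    v = F.normalize(memory_bank, dim=1)
    logits = f @ v.t() / tau
    return logits.softmax(dim=1)
```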

The final loss that we seek to minimize is the combination of the losses \(L_{\mathcal {Z}}\) in (1) and \(L_b\) in (5),

$$\begin{aligned} L_{\text {total}} = \alpha L_{\mathcal {Z}} + \beta L_b. \end{aligned}$$
(6)

where \(\alpha , \beta \) are nonnegative constants. Details of the algorithm are given in Algorithm 1, and we also provide PyTorch-style pseudocode in Algorithm 2 in the Appendix.

Algorithm 1
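Because Algorithm 1 is rendered as a figure in the published version, the following condensed PyTorch-style pseudocode (ours, not the authors' Algorithm 2) ties together the pieces sketched above; `augment`, `backbone`, `proj_head`, `id_loss`, `projections`, `prototypes` and the optimizer are placeholders.

```python
# Pseudocode: reuses assignment_probs, sinkhorn_codes, project and consensus_loss
# from the sketches above; F is torch.nn.functional.
for images, idx in loader:                                   # minibatch X_b with instance indices
    x1, x2 = augment(images), augment(images)                # two augmented views
    f1, f2 = backbone(x1), backbone(x2)                      # features used by the ID loss
    z1, z2 = proj_head(f1), proj_head(f2)                    # cluster embeddings in Z_0

    # targets q: Sinkhorn codes computed in the reference space Z_0 (no gradient)
    q1 = sinkhorn_codes(F.normalize(z1, dim=1) @ F.normalize(prototypes, dim=1).t())
    q2 = sinkhorn_codes(F.normalize(z2, dim=1) @ F.normalize(prototypes, dim=1).t())

    # predictions p: assignment probabilities under each fixed random transformation
    p1 = [assignment_probs(*project(z1, prototypes, A)) for A in projections]
    p2 = [assignment_probs(*project(z2, prototypes, A)) for A in projections]

    loss = alpha * consensus_loss(p1, p2, q1, q2) + beta * id_loss(f1, idx)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```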

2.2.5 Computing the cluster metrics

In this section, we describe the approach used to compute the cluster assignments and the metrics chosen to evaluate their quality. Note that we assume that the number of true clusters (K) in the data is known.

There are two ways to compute the cluster assignments. The first way is to use the embeddings generated by the backbone; here, the embeddings are the outputs of the ID block \(f_{\theta }(x)\). The embeddings of all the images are computed, and then we perform k-means clustering.

The second method is to use the soft clustering block to compute the cluster assignments. It is sufficient to use the computed probability assignments \(\{\textbf {p}_i\}_{i=1}^N\) or the computed codes \(\{\textbf {q}_i\}_{i=1}^N\) and assign the cluster index as \(c_i = \arg \max _{k} \textbf {q}_{ik}\) for the \(i^{\text {th}}\) data point. Once the model is trained, in this second approach, cluster assignment can be performed online without requiring the computation of the embeddings of all the input data.

We evaluate the quality of the clusterings using metrics such as the cluster accuracy, normalized mutual information (NMI), and adjusted Rand index (ARI). To compute the clustering accuracy, we are required to solve an assignment problem (computed using a Hungarian match  (Kuhn, 1955, 1956)) between the true class labels and the cluster assignments. In our analysis, we observe that using k-means with the embeddings produced by the ID block achieves better clustering accuracy, and we use this method throughout the paper while evaluating our proposed algorithm.
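For reference, the Hungarian-match clustering accuracy can be computed as in the sketch below (our code, using SciPy's linear_sum_assignment); it assumes the true labels and cluster indices both lie in {0, ..., K-1}.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred, num_clusters):
    """Best one-to-one match between cluster indices and class labels (Kuhn, 1955)."""
    cost = np.zeros((num_clusters, num_clusters), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                       # count cluster/label co-occurrences
    row, col = linear_sum_assignment(-cost)   # maximize the matched counts
    return cost[row, col].sum() / len(y_true)
```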

2.3 Generating multiple clustering results

Fred and Jain (2005) discussed different ways to generate cluster ensembles; these methods are tabulated in Table 1. In our proposed algorithm, we focus on choosing an appropriate data representation to generate cluster ensembles.

Table 1 Different ways to generate ensembles

By fixing a stable clustering algorithm, we can generate arbitrarily large ensembles by applying different transformations on the embeddings. Random projections were previously used successfully in consensus clustering (Fern and Brodley, 2003). By generating ensembles using random projections, we have control over the amount of diversity we induce into the framework by varying the dimensionality of the random projection. In addition to random projections, we also use diagonal transformations (Hsu et al.).

In Figs. 11 and 12, we show how the running mean of accuracy progresses during training for each of the experiments in Table 10.

Table 10 Data augmentation details
Fig. 11 Effect of data augmentation on CIFAR-10

Fig. 12 Effect of data augmentation on CIFAR100-20

5.3 Effect of image resolution

Image resolution is often considered a free parameter (Niu et al., 2020a); however, its effect on clustering performance is not evaluated rigorously in most works. We try to quantify the effects of different resolutions to the greatest extent possible, given that some datasets are available only at specific resolutions. For STL-10, we use \(32\times 32\), \(64\times 64\) and \(96\times 96\) resolutions. For ImageNet-10 and ImageNet-Dog-15, we use \(96\times 96\), \(160\times 160\) and \(224\times 224\) resolutions. The results are given in Table 11.

Table 11 Effects of different resolutions for STL-10, ImageNet-10 and ImageNet-Dogs

The best performance for ImageNet-10 and ImageNet-Dogs is obtained at a resolution of \(160\times 160\), and for STL-10, the best performance is obtained at a resolution of \(96\times 96\). It is not clear why ImageNet-10 and ImageNet-Dogs do not yield the best performance at higher resolutions; further investigation is needed, and we leave this as an open problem.

5.4 Distribution of accuracies across the set of hyperparameters

Table 12 Hyperparameters and the range values used for the experiments
Table 13 Hyperparameters for obtaining maximum performance
Fig. 13 Components of the consensus loss; ablation of STL-10 and CIFAR100-20

The proposed consensus loss has two parameters. The first is the number of transformations used, and the second is the dimensionality of the projection space. To understand the proposed loss, we conduct a detailed experimental study on STL-10 and CIFAR100-20. The hyperparameters used are given in Table 12.

Due to the sheer number of conducted experiments, we supply summary statistics obtained on a random set of experiments. We report the empirical mean and standard deviation of the marginal distribution of the quantity under investigation. Let \(P_{\tau ,\eta ,d,l}\) be the joint distribution over the hyperparameters \(\tau \) (temperature parameter), l (learning rate), \(\eta \) (natural log of the number of transformations) and d (dimensionality of the projection space). We consider \(n_h\) to be the number of distinct values used in the experiment for each hyperparameter \(h \in \{ \tau ,\eta ,d,l \}\) based on Table 12. We denote the accuracy of each experiment, based on the hyperparameters used, as \(a_{\tau ,\eta ,d,l}\). Let \(P_{h_i \vert h_j}\) be the conditional marginal distribution of hyperparameter \(h_i\) given \(h_j\), and let the conditional empirical mean of \(P_{h_i \vert h_j}\) be \(m(P_{h_i \vert h_j})\). In this case, the conditional empirical mean \(m(P_{h_i \vert h_j})\) when \(h_i = d\) and \(h_j=\tau \) can be calculated using \(m(P_{d \vert \tau }) = \frac{1}{n_{\eta } \times n_{l}} \sum _{\eta } \sum _{l} a_{\tau ,\eta ,d,l}\). The conditional empirical means and standard deviations of the other hyperparameters are calculated in the same way. In Fig. 13, we show each conditional empirical mean with a blue dot, and each red line around a dot represents one standard deviation. For both STL-10 and CIFAR100-20, we see a trend in the number of projections. For STL-10, the smaller the number of random projections, the better the results are, and for CIFAR100-20, increasing the number of random projections improves the clustering accuracy up to a point. Note that when the number of random projections is equal to zero, our setting is equivalent to the baseline ID model, and our approach always performs better than ID. This means that the optimal number of random projections is greater than or equal to one. There is no such clear trend in the dimensionality of the random projections.
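As a small illustration, the conditional empirical means and standard deviations described above can be computed from a table of grid-search results; the DataFrame layout, file name and column names below are hypothetical.

```python
import pandas as pd

# One row per run; columns: tau, eta (log #transformations), d (projection dim), lr, acc.
runs = pd.read_csv("grid_search_results.csv")   # hypothetical results file

# Conditional empirical mean and std of accuracy, e.g. m(P_{d | tau}):
# average over eta and lr for each (tau, d) pair.
cond_stats = runs.groupby(["tau", "d"])["acc"].agg(["mean", "std"])
print(cond_stats)
```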

Fig. 14 The dotted red lines show the accuracy of the baseline, i.e., ID (Tao et al., 2021), on the corresponding dataset, and DE is the density estimate of the empirical distribution: (a) empirical accuracy distribution for STL-10; (b) empirical accuracy distribution for CIFAR100-20

The max-performance procedure provides some insights into the performance of the algorithms at hand, although it does not provide the whole picture because it does not consider the robustness of the performance differences. In Table 13, we give the hyperparameters that yield the maximum performance. In other words, finding a hyperparameter set that yields better performance than the baseline is the core idea behind the max-performance procedure. We ask the following question: given a hyperparameter grid, how likely is our method to achieve better accuracy than the baseline? In Fig. 14, we report the empirical accuracy distributions on STL-10 and CIFAR100-20 for all hyperparameters given in Table 12. The red dotted lines show the corresponding baseline accuracy for each dataset. For STL-10, only approximately \(12.5\%\) of the hyperparameter sets yield better results than the baseline. On the other hand, for CIFAR100-20, approximately \(90\%\) of the hyperparameter sets yield better results than the baseline. In other words, it does not require a significant amount of computational power to find a better model than the state-of-the-art models for CIFAR100-20; however, the situation is the opposite for STL-10. The results given in Fig. 14 suggest that when comparing models, multiple metrics need to be considered, not only the max-performance procedure.

5.5 Effect of architecture choice

Fig. 15 Empirical distribution of the performance difference between Residual Network (ResNet)-50 and ResNet-18. DE is the density estimate of the empirical distribution

In this work, we use ResNet-18 and ResNet-50 as network architectures. For both ResNet-18 and ResNet-50, we sweep over the same set of hyperparameter choices, i.e., the temperature, the number of projections and the projection dimensionality, and report the results for the ImageNet-10 dataset with an image resolution of \(160\times 160\). Figure 15 shows the distribution of \(\Delta _{acc}\), which is defined as the accuracy difference between ResNet-50 and ResNet-18. Figure 15 indicates that ResNet-50 slightly outperforms ResNet-18, i.e., the mean difference is approximately \(0.5\%\).

5.6 Runtime comparison

Fig. 16 Comparison of the runtimes per epoch for ID and ConCURL on the CIFAR-10 dataset. The number of random transformations used in the computation of the consensus loss is indicated in parentheses

To study the runtime of the proposed method, we compare the time taken per epoch for the baseline ID algorithm and the proposed algorithm. Due to the additional loss computation, the time taken to run the proposed algorithm is higher, as can be observed in Fig. 16. The additional time is mainly due to computing the consensus loss for the different numbers of transformations. The current implementation computes the forward passes for the different transformations sequentially, thus increasing the runtime. However, a more time-efficient implementation in which the forward passes for all the different random transformations are computed in parallel could make the runtime more comparable to that of the baseline ID algorithm.
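One way to realize such a parallel implementation is to stack the M projection matrices and apply them in a single batched contraction; the sketch below is our suggestion under that assumption, not the paper's implementation.

```python
import torch

def project_all(z, c, A_stack):
    """Apply all M random projections at once.

    z: (B, d) embeddings, c: (K, d) prototypes, A_stack: (M, D, d) stacked projections.
    Returns z_t: (M, B, D) and c_t: (M, K, D), one slice per transformation.
    """
    z_t = torch.einsum("mdk,bk->mbd", A_stack, z)   # (M, B, D)
    c_t = torch.einsum("mdk,nk->mnd", A_stack, c)   # (M, K, D)
    return z_t, c_t
```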

6 Conclusion

In this work, we introduce different notions of the consistency constraints that are enforced in different unsupervised/self-supervised learning algorithms. We propose a novel clustering algorithm that seamlessly incorporates all three consistency constraints (exemplar, population and consensus) and achieves state-of-the-art clustering results for four out of five popular and challenging computer vision datasets. Our work on consensus clustering is significantly different from earlier consensus clustering works that do not learn representations. Moreover, we initiate a discussion on the adequacy of the currently used methods for evaluating clustering algorithms. We significantly extend the evaluation procedure for clustering algorithms, thereby reflecting the challenges of applying clustering to real-world tasks. We provide evaluation results for ConCURL and other state-of-the-art clustering algorithms based on max-performance criteria, according to which ConCURL outperforms other algorithms on most datasets. However, its average performance according to out-of-distribution criteria highlights the need to use the proposed evaluation methods for deep clustering algorithms.