1 Introduction

Classification is a fundamental task in machine learning, which consists of assigning data objects to apriori classes based on the values they assume on a set of features. It has received significant interest and has been extensively utilized in fields such as healthcare and medical diagnosis (Sivasankari et al., 2022; Malakouti, 2023), as well as image and video recognition (Wang et al., 2023; Chen et al., 2021).

The accuracy of a classifier depends on the availability of relevant and informative features, as well as on the choice of the classification algorithm. However, in many real-world scenarios, the data is complex, and the available features may not provide enough information to achieve high accuracy.

The statistical literature on supervised classification is growing rapidly. A single algorithm often produces unsatisfactory results, typically because of the complex structure of the groups to be classified (e.g., unbalanced distributions, non-linear relationships among predictors, presence of anomalous values). In recent years, techniques integrating or merging multiple algorithms from both supervised and unsupervised learning have been developed to enhance the decision rules provided by the model (Soheily-Khah et al., 2018; Sarker, 2021).

The K-nearest neighbor (KNN) algorithm has gained recognition as a powerful tool in the field of machine learning, providing an effective and straightforward method for classification in various pattern recognition scenarios (Zhang, 2016; Taunk et al., 2019). The primary approach employed by KNN involves determining the class of query samples by measuring the distance to the objects in the training set. The label of the query sample is then set by majority voting on the membership of the k-nearest objects in the training set. Recently, several novel adaptations of the KNN algorithm have been developed (Zhang et al., 2017b; Luo et al., 2020; Rastin et al., 2021a).

The Euclidean distance is often used with the KNN algorithm to measure the dissimilarity between training and testing data. This procedure involves computing the dissimilarities, determining the k nearest neighbors based on them, and subsequently classifying the test sample according to the dominant class among those k neighbors. Although the Euclidean distance is easy to understand, it assigns equal importance to all sample features when calculating the distance. This equal weighting can be a limitation, particularly in situations where distinct features have differing degrees of relevance to the classification objective. To tackle this problem, several studies have suggested alternative distance metrics that provide a more nuanced method for calculating distances in KNN-based classification and have the potential to improve its performance (Chomboon et al., 2015; Ruan et al., 2018). Adaptive distances have also proven valuable in unsupervised learning: an adaptive fuzzy c-means algorithm has been proposed for interval-valued data clustering that accounts for interval membership across the clusters of the partition, and de Carvalho et al. (2022) presented a batch self-organizing map (SOM) algorithm for distributional-valued data based on a weighted Wasserstein distance, where the weights are computed through the optimization of the clustering loss function.

3 Methodology

This section presents two novel approaches to supervised classification using clustering. The first step involves developing a new DC variant to cluster the apriori classes. The approach incorporates a new objective function that combines intra-cluster compactness, which uses an adaptive distance metric to compute dispersion information between subgroups of a given class, and inter-cluster separation, which measures the distance between a subgroup of a specific class and the subgroups of the other apriori classes. Finally, the results obtained from DC are used for classification with the new KNN algorithm. In this process, the subgroup weights are used with the adaptive distance to measure the similarity between a testing sample and the neighboring subgroup centroids. Subsequently, the k nearest neighbors are determined based on the calculated similarities. The allocation of a new instance to a class is based on the majority vote among its k nearest subgroup centroids.

3.1 Dynamic Clustering Algorithm

Clustering is a widely utilized technique in various applications, including image processing (Chang et al., 2017), video processing (Alayrac et al., 2016), gene analysis (Dapas et al., 2020), healthcare (Liao et al., 2016), and community detection (Li et al., 2022), among others. It involves dividing a dataset into groups, or clusters, based on similarity criteria, where objects within the same cluster are more alike than those in different clusters.

In this paper, our focus is on DC, an unsupervised learning algorithm that aims to partition data into clusters while simultaneously finding cluster representatives consistent with the distance function used for allocating units. Typically, the representatives are obtained as the minimizers of the sum of distances. The classic K-means algorithm can be seen as a specific case of DC where the distance metric is the Euclidean distance, and the representatives (centroids) are calculated as cluster averages.

The original concept of dynamic clustering was introduced by Diday (1971) and involves a two-step process of constructing clusters and selecting the best prototype for each cluster based on an adequacy criterion (Diday & Simon, 1976). The advantages of this scheme are mainly its flexibility with respect to the nature of the analyzed data and the choice of the distance function and the focus on providing cluster representatives, named prototypes. For instance, DC methods exist for datasets described by interval variables (De Carvalho & Lechevallier, 2009) and histogram variables (Balzanella & Verde, 2020; de Carvalho et al., 2022).

Let \(X=\{X_{1},\ldots ,X_{i},\ldots ,X_{n}\}\) be a set of n objects, where each object \(X_{i}=\{x_{i1},\ldots ,x_{im}\}\) is described by a set of m features. The general DC looks for the partition \(G=\{C_1,\ldots , C_K\}\) in K clusters and the set \(Z=\{Z_1,\ldots , Z_K\}\) of K prototypes representing the clusters in G, such that the following \(\Delta\) fitting criterion between the set Z of prototypes and the partition G is minimized:

$$\begin{aligned} \Delta (G,Z)=\sum _{k=1}^{K}\sum _{X_i\in C_k}d(X_i,Z_k) \end{aligned}$$
(1)

The fitting criterion is defined as the sum of dissimilarities or distance measures between each object \(X_i\) belonging to a class \(C_k\in G\) and the class representation \(Z_k \in Z\).

In this context, the DC algorithm iteratively implements the following representation and allocation steps:

  1. 1.

    The representation step describes the K clusters \((C_1,\dots ,C_K)\) of the partition G through a vector \(Z=(Z_1,\dots ,Z_K)\) of prototypes. Keeping the partition \(G=\hat{G}\) fixed for the current iteration of the algorithm, Z is obtained from the minimization of \(\Delta (\hat{G},Z)\), which is equivalent to finding the \(Z_k\;\;(k=1,\dots ,K)\) that minimize \(\sum _{X_i\in C_k}d(X_i,Z_k)\).

  2. 2.

    The allocation step assigns each element \(X_i\) to a cluster \(C_k\) according to its proximity to the prototype \(Z_k\in Z\). Keeping \(Z=\hat{Z}\) fixed for the current iteration of the algorithm, it finds the partition G that minimizes \(\Delta (G,\hat{Z})\) by setting \(C_k=\{X_i\in X\mid d(X_i,Z_k)\le d(X_i,Z_l),\forall l=1,\dots ,K; l \ne k \}\). A minimal sketch of both steps is given below.
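A minimal sketch of this two-step scheme in its simplest setting, with the squared Euclidean distance and cluster means as prototypes (the function name and the stopping rule are illustrative choices, not part of the general formulation):

```python
import numpy as np

def dynamic_clustering(X, K, n_iter=100, seed=0):
    """Generic DC loop: alternate the representation (prototype) and
    allocation steps until the partition stops changing."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.integers(0, K, size=n)          # random initial partition
    for _ in range(n_iter):
        # representation step: with the squared Euclidean distance the
        # prototype Z_k minimizing the within-cluster sum is the cluster mean
        Z = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                      else X[rng.integers(0, n)] for k in range(K)])
        # allocation step: assign each X_i to its closest prototype
        d = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # criterion no longer improves
            break
        labels = new_labels
    return labels, Z
```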

3.1.1 DC as a Generalization of K-Means Algorithm

The K-means clustering methodology is broadly utilized as a partitioning strategy. The proposed dynamic clustering method is a generalization of the K-means algorithm: cluster centers do not necessarily have to be the centroids of clusters in \(R^m\); rather, they can be replaced by representatives that take various forms, depending on the problem to be addressed.

The K-means algorithm starts by selecting K initial cluster centers and then assigns each object to the closest cluster through the optimization of an objective function. As mentioned previously, the classical K-means algorithm only considers the intracluster compactness and the distances between the cluster centroids and individual data points. The membership matrix U, an \(n\times K\) binary matrix, indicates which objects are assigned to which clusters, and \(Z=\{Z_{1},\ldots , Z_{k},\ldots , Z_{K}\}\) represents the centroids of the K clusters, with elements \(Z_{k}=\{z_{k1},\ldots , z_{kj},\ldots , z_{km}\}\) for each feature \(j=1, \ldots , m\).

The objective function of the classic K-means without considering the inter-cluster separation is the following:

$$\begin{aligned} \begin{aligned} \Delta (U,Z) =&\sum _{k=1}^{K}\sum _{i=1}^{n}u_{ik}\sum _{j=1}^{m}(x_{ij}-z_{kj})^{2}, \end{aligned} \end{aligned}$$
(2)

such that \(\displaystyle \sum _{k=1}^{K} u_{ik}=1\), (with \(u_{ik}\in \{0,1\}\)), and \(X_i=\{ x_{i1},\ldots ,x_{ij},\ldots , x_{im} \}\) is an object of X described by m features.

DC algorithm optimizes the objective function by alternating the representation and allocation steps:

  1. 1.

    Representation step (the matrix of membership \(\hat{U}\) is fixed). The solution for the optimization problem \(\Delta (\hat{U},Z)\) is provided by the minimizer Z

    $$\begin{aligned} z_{kj}=\frac{\sum _{i=1}^{n}u_{ik}x_{ij}}{\sum _{i=1}^{n}u_{ik}}, \end{aligned}$$
    (3)

    where \(1\le k\le K\) and \(1\le j\le m\).

  2. 2.

    Allocation step (the vector of centroids \(\hat{Z}\) is fixed). According to Chan et al. (2004), the minimizer U of the optimization problem \(\Delta (U,\hat{Z})\) is given by

    $$\begin{aligned} u_{ik} = \left\{ \begin{array}{ll} 1 &{} \text {if}\, {\displaystyle \sum _{j=1}^{m}(x_{ij}-z_{kj})^{2}\le \sum _{j=1}^{m}(x_{ij}-z_{lj})^{2}},\;\;\forall l,\ 1\le l\le K, \\ 0 &{} \text {otherwise.} \end{array}\right. \end{aligned}$$
    (4)

The partitioning criterion in Eq. 2 decreases at each iteration and converges to a stationary value; a minimal sketch of this alternation follows.
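As a sketch of this special case, the two updates in Eqs. 3 and 4 can be written directly in terms of the binary membership matrix U (function names are illustrative; empty clusters are not handled):

```python
import numpy as np

def representation_step(X, U):
    """Eq. 3: z_kj = sum_i u_ik x_ij / sum_i u_ik, i.e. the cluster means.
    U is the (n, K) binary membership matrix, X the (n, m) data matrix;
    the sketch assumes every cluster has at least one member."""
    return (U.T @ X) / U.sum(axis=0)[:, None]

def allocation_step(X, Z):
    """Eq. 4: u_ik = 1 for the centroid with the smallest squared
    Euclidean distance to X_i, and 0 otherwise."""
    d = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # (n, K)
    U = np.zeros_like(d)
    U[np.arange(X.shape[0]), d.argmin(axis=1)] = 1
    return U
```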

3.1.2 The Need for Adaptive Distances in DC

The central concept of dynamic clustering with adaptive distances is to assign a specific distance measure, denoted as \(d_k\), to each cluster \(C_k\) and to minimize the sum of distances \(d_k(X_i, Z_k)\) between objects \(X_i\) belonging to cluster \(C_k\) and the centroid \(Z_k\). Importantly, the distances employed in the DC algorithm are not fixed in advance but rather are tailored to each cluster.

In this clustering algorithm, a weighting step is introduced. It assigns a weight to each variable for each cluster, reflecting the relevance of the variable in that cluster. The use of adaptive distances can also be viewed as a means of automatically scaling the variables, since scaling can greatly impact the dissimilarity values and the clustering outcomes.

The DC criterion, which incorporates adaptive distances, is expressed as follows:

$$\begin{aligned} \begin{aligned} \Delta (U,W,Z) =&\sum _{k=1}^{K}\sum _{X_i\in C_k}u_{ik}d_{k}(X_i,Z_k), \end{aligned} \end{aligned}$$
(5)

such that \(u_{ik}\in \{0,1\}\), \(\displaystyle \sum _{k=1}^{K} u_{ik}=1\)

In this context, distance \(d_{k}\) is a weighted sum of distances \(d_{w_{kj}}\)

$$\begin{aligned} \begin{aligned} d_{k}(X_i,Z_k) =&\sum _{j=1}^{m}d_{w_{kj}}(x_{ij},z_{kj}) = \sum _{j=1}^{m} w_{kj}d(x_{ij},z_{kj}) \end{aligned} \end{aligned}$$
(6)

The adaptivity of the distance \(d_{w_{kj}}\) is expressed by the vector of weights \(W_k\).
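For instance, taking the squared Euclidean distance as the component-wise measure, the cluster-specific distance of Eq. 6 becomes a weighted squared Euclidean distance. A small sketch with illustrative values:

```python
import numpy as np

def adaptive_distance(X_i, Z_k, W_k):
    """Eq. 6 with d(x, z) = (x - z)^2: d_k(X_i, Z_k) = sum_j w_kj (x_ij - z_kj)^2."""
    return np.sum(W_k * (X_i - Z_k) ** 2)

# Example: a weight vector that emphasizes the first feature makes the
# first coordinate dominate the comparison.
x, z = np.array([1.0, 4.0]), np.array([0.0, 0.0])
print(adaptive_distance(x, z, np.array([0.9, 0.1])))  # 0.9*1 + 0.1*16 = 2.5
```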

When using adaptive distances, the representation step is divided into two stages, so that the global optimization scheme is

  1. 1.

    Representation step

    1. 1.

      Stage 1: fix the matrix \(\hat{U}\) of membership and the vector of weights \(\hat{W}\). Find the solution \(Z_k= \{z_{k1}, \ldots , z_{km}\}\) of the optimization problem \(\Delta (\hat{U},\hat{W},Z)\).

    2. 2.

      Stage 2: fix the matrix \(\hat{U}\) of membership and the vector of centroids \(\hat{Z}\). Find the vector of weights \(W_k=\{w_{k1}, \ldots , w_{km}\}\) that minimizes the criterion \(\Delta (\hat{U},W,\hat{Z})\).

  2. 2.

    Allocation step. Fix the set of vectors of weights \(\hat{W}\) and the set of vectors of centroids \(\hat{Z}\). Find the membership matrix U that minimizes the criterion \(\Delta (U,\hat{W},\hat{Z})\).

The paper employs an adaptive distance metric, specifically a weighted Euclidean distance, to calculate the distance between subgroups within a given cluster. Explicit formulas for the optimum cluster centroids, as well as for the weights of the adaptive distances, are found based on a new objective function criterion. By integrating the procedures of data partitioning and centroids selection with adaptive distances, DC algorithm provides a comprehensive and flexible approach to clustering analysis for apriori classes.

3.2 Dynamic Clustering Algorithm to Partition Apriori Groups

In this section, a new objective function is proposed to discover new information in the original data by combining the intra-cluster compactness of the subgroups of the same apriori group and the inter-cluster separation between one subgroup and the subgroups of the other apriori classes, as illustrated in Fig. 1. Evaluating the weights of the variables of a subgroup using only the variation within the groups of a data set may be ineffectual. Under these conditions, inter-cluster separation can play a significant role in differentiating the significance of various patterns and in accounting for the heterogeneity among the subgroups of each original group.

We apply inter-cluster separation by introducing the global subgroups centroids of a data set. In contrast to the conventional DC, our proposed DC algorithm maximizes the distances between the subgroup’s centroid of an apriori group and the global subgroups centroid of the other apriori groups partition, while minimizing the distances between objects and their subgroups centroid.

Fig. 1 Scatter plot illustrating subgroups within two apriori groups

Let N be the total number of apriori classes, and \(U = \{U_{1}, \ldots ,U_{g}, \ldots , U_{N}\}\) the set of N matrices. Let \(n_g\) be the number of elements of class g and \(c_g\) be the number of subgroups in class g. Each \(U_g\) is an \(n_g \times c_{g}\) indicator matrix containing the membership of each element i of the apriori class g to the subgroup p, where \(u_{gip}=1\) denotes that the i-th object belonging to group g is assigned to subgroup p; otherwise, \(u_{gip} = 0\), indicating that the object is not assigned to subgroup p. Let \(Z=\{Z_1,\ldots ,Z_{g},\ldots ,Z_{N}\}\) be a set of vectors representing the centroids of each original group. For group g, let \(Z_{g} = \{Z_{g1}, \dots , Z_{gc_g}\}\) be a set of \(c_g\) vectors that represent the subgroups’ centroids and let \(W_g = \{W_{g1}, W_{g2}, \dots , W_{gc_g}\}\) be a set of weight vectors associated with the subgroups, where \(w_{gpj}\) represents the weight of the j-th variable related to the p-th subgroup for class g. Let \(\beta\) represent a parameter used for adjusting the weights.

With the aim of achieving both intra-cluster compactness and inter-cluster separation, the optimization process is performed using a DC algorithm in which the objective function is modified to emphasize the separation between clusters belonging to different apriori classes:

$$\begin{aligned} \begin{aligned} P(U,W,Z) =&\sum _{g=1}^{N}\sum _{p=1}^{c_{g}}\frac{\sum _{i=1}^{n_{g}}u_{gip}d^{2}_{g}(X_{i},Z_{gp})}{n_{g}d^{2}(Z_{gp},Z_{gG})}\\ =&\sum _{j=1}^{m}(\sum _{g=1}^{N}\sum _{p=1}^{c_{g}}\frac{w^{\beta }_{gpj}\sum _{i=1}^{n_{g}}u_{gip}(x_{ij}-z_{gpj})^{2}}{n_{g}(z_{gpj}-z_{gGj})^{2}}), \end{aligned} \end{aligned}$$
(7)

such that \(u_{gip}\in \{0,1\}\), \(\displaystyle \sum _{p=1}^{c_{g}} u_{gip}=1\), and \(\displaystyle \sum _{j=1}^{m}w_{gpj}=1\).

In the context of our study, the distance metric \(d_{g}\) is defined as a weighted sum of distances \(d_{w_{gpj}}\), where \(d_{w_{gpj}}\) represents the distance metric for the p-th subgroup of the g-th apriori class. The vector of weights \(W_{gp}\) demonstrates the adaptivity of the distance metric \(d_{w_{gpj}}\):

$$\begin{aligned} \begin{aligned} d_{g}(X_i,Z_{gp}) = \sum _{j=1}^{m}d_{w_{gpj}}(x_{ij},z_{gpj}) = \sum _{j=1}^{m} w_{gpj}(x_{ij}-z_{gpj})^2. \end{aligned} \end{aligned}$$
(8)

Let us assume that the present group is the \(g^{th}\) group. \(z_{gGj}\) represents the \(j^{th}\) feature of the global subgroups centroid of all other apriori groups, excluding the current group g.

We calculate \(z_{gGj}\) as

$$\begin{aligned} z_{gGj}=\frac{\displaystyle \sum _{h\in \{1,\ldots ,N\}\setminus \{g\}}c_{h}\sum _{q=1}^{c_{h}}z_{hqj}}{c_{1}+\cdots +c_{g-1}+c_{g+1}+\cdots +c_{N}}. \end{aligned}$$
(9)

To initiate the solution process of the objective function, it is necessary to initialize the parameters \(\hat{U}\), \(\hat{W}\), and \(\hat{Z}\) of all groups (for \(g=1, \ldots , N)\). Subsequently, the partition of group g is evaluated, thus reducing the minimization problem to

$$\begin{aligned} \begin{aligned} P(U,W,Z) =&\sum _{p=1}^{c_{g}}\sum _{i=1}^{n_{g}}u_{gip}\sum _{j=1}^{m}w^{\beta }_{gpj}\frac{(x_{ij}-z_{gpj})^{2}}{n_{g}(z_{gpj}-z_{gGj})^{2}}, \end{aligned} \end{aligned}$$
(10)

such that \(u_{gip}\in \{0,1\}\), \(\displaystyle \sum _{p=1}^{c_{g}} u_{gip}=1\), and \(\displaystyle \sum _{j=1}^{m}w_{gpj}=1\), \(1\le p\le c_g\).
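As an illustration, the global centroid of Eq. 9 and the per-class criterion of Eq. 10 could be evaluated as follows (a sketch; the containers `Z`, `U`, `W` indexed by class and subgroup, and the default value of \(\beta\), are assumptions of this example):

```python
import numpy as np

def global_centroid(Z, g):
    """Eq. 9: z_gG is a weighted mean of the subgroup centroids of all
    apriori classes h != g, each class weighted by its number of subgroups c_h."""
    # Z: list of arrays, Z[h] has shape (c_h, m)
    num = sum(Z[h].shape[0] * Z[h].sum(axis=0) for h in range(len(Z)) if h != g)
    den = sum(Z[h].shape[0] for h in range(len(Z)) if h != g)
    return num / den

def class_criterion(Xg, Ug, Wg, Zg, z_gG, beta=2.0):
    """Eq. 10 for one class g: weighted within-subgroup dispersion
    normalized by the separation from the global centroid z_gG."""
    n_g, c_g = Ug.shape
    total = 0.0
    for p in range(c_g):
        sep = (Zg[p] - z_gG) ** 2                           # (z_gpj - z_gGj)^2
        disp = (Ug[:, p, None] * (Xg - Zg[p]) ** 2).sum(axis=0)
        total += np.sum(Wg[p] ** beta * disp / (n_g * sep))
    return total
```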

To minimize Eq. 10, it is necessary to solve the problems P1, P2, and P3 iteratively.

  1. 1.

    Specifically, the representation step requires solving two distinct problems, P1 and P2. Problem P1: fix \(U=\hat{U}\), \(W=\hat{W}\) and solve the reduced problem \(P(\hat{U}, Z, \hat{W})\). Problem P2: fix \(U=\hat{U}\), \(Z=\hat{Z}\) and solve the reduced problem \(P(\hat{U}, \hat{Z}, W)\).

  2. 2.

    To address the allocation step, it is necessary to solve the problem denoted as P3. Problem P3: fix \(Z=\hat{Z}\), \(W=\hat{W}\) and solve the reduced problem \(P(U, \hat{Z}, \hat{W})\).

To solve the problem P1, we calculate the gradient of P with respect to \(z_{gpj}\) as

$$\begin{aligned} \frac{\partial P(\hat{U},\hat{W},Z)}{\partial z_{gpj}}=-2w_{gpj}^{\beta }\sum _{i=1}^{n_{g}}u_{gip}\frac{(x_{ij}-z_{gpj})(z_{gpj}-z_{gGj})^{2}+(z_{gpj}-z_{gGj})(x_{ij}-z_{gpj})^{2}}{n_{g}(z_{gpj}-z_{gGj})^{4}}; \end{aligned}$$
(11)

by setting Eq. 11 to zero, we have:

$$\begin{aligned} z_{gpj}=\frac{\sum _{i=1}^{n_{g}}u_{gip}x_{ij}(x_{ij}-z_{gGj})}{\sum _{i=1}^{n_{g}}u_{gip}(x_{ij}-z_{gGj})}. \end{aligned}$$
(12)

The initial section of the supplementary material contains the proof and the necessary and sufficient conditions required for the realization of this finding.

It is worth noticing that \(z_{gpj}\), the representative (e.g., centroid) of \(C_{gp}\), can be interpreted as a weighted average of the elements of the \(p^{th}\) subgroup, with weights given by the difference between \(x_{ij}\) and the global centroid \(z_{gGj}\) computed as in Eq. 9 on all other apriori groups, excluding the current group.

The higher the difference, the more the subgroup element contributes to the determination of the subgroup’s centroid. This result is due to the optimization of the discriminant component of the criterion which emphasizes the separation between classes.

The internality condition of the centroid \(z_{gpj}\) of the subgroup \(C_{gp}\) (for each variable j) is guaranteed under the conditions demonstrated in the Appendix, whereas the centroid can fall outside the cluster's interval of values (for each j) the closer \(z_{gGj}\) is to the mean of the elements of the cluster \(C_{gp}\).
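A sketch of the centroid update of Eq. 12 for one subgroup p of class g (names are illustrative; `z_gG` is the global centroid of Eq. 9):

```python
import numpy as np

def update_centroid(Xg, u_gp, z_gG):
    """Eq. 12: z_gpj is a weighted mean of the x_ij of subgroup p, with
    weights (x_ij - z_gGj); elements farther from the global centroid of
    the other classes contribute more to the subgroup centroid."""
    # Xg: (n_g, m) objects of class g; u_gp: (n_g,) binary membership of subgroup p
    diff = Xg - z_gG                                   # (x_ij - z_gGj)
    num = (u_gp[:, None] * Xg * diff).sum(axis=0)
    den = (u_gp[:, None] * diff).sum(axis=0)
    return num / den
```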

The problem P2 will be solved by setting up a Lagrangian equation to \(P(\hat{U}, \hat{Z}, W)\) with multiplier \(\lambda\). Let \(L(W, \lambda )\) be the Lagrangian

$$\begin{aligned} L(W, \lambda )=\sum _{p=1}^{c_{g}}\sum _{j=1}^{m}w^{\beta }_{gpj}D_{gpj}-\lambda (\sum _{j=1}^{m}w_{gpj}-1), \end{aligned}$$
(13)

where \(D_{gpj}=\sum _{i=1}^{n_{g}}u_{gip}\frac{(x_{ij}-z_{gpj})^{2}}{n_{g}(z_{gpj}-z_{gGj})^{2}}\). Setting the gradient of Eq. 13 with respect to \(w_{gpj}\) and \(\lambda\) to zero, we obtain

$$\begin{aligned} \frac{\partial L(W, \lambda )}{\partial w_{gpj}}=\beta w_{gpj}^{\beta -1}D_{gpj}-\lambda =0; \end{aligned}$$
(14)

from Eq. 14, we obtain

$$\begin{aligned} w_{gpj}=\left( \frac{\lambda }{\beta D_{gpj}}\right) ^{\frac{1}{\beta -1}}. \end{aligned}$$
(15)

The gradient with respect to \(\lambda\)

$$\begin{aligned} \frac{\partial L(W, \lambda )}{\partial \lambda }=-(\sum _{j=1}^{m}w_{gpj}-1)=0; \end{aligned}$$
(16)

substituting Eq. 15 into Eq. 16, we obtain

$$\begin{aligned} \lambda ^{\frac{1}{\beta -1}}=\displaystyle \frac{\beta ^{\frac{1}{\beta -1}}}{\displaystyle \sum _{j=1}^{m}D_{gpj}^{-\frac{1}{\beta -1}}}; \end{aligned}$$
(17)

substituting Eq. 17 into Eq. 15, we have

$$\begin{aligned} w_{gpj}=\frac{1}{\displaystyle (D_{gpj})^{\frac{1}{\beta -1}}\sum _{l=1}^{m}D_{gpl}^{-\frac{1}{\beta -1}}}. \end{aligned}$$
(18)

The minimizer \(W_{gp}\) of the optimization problem P2 is given by

$$\begin{aligned} w_{gpj} = {\left\{ \begin{array}{ll} 0 &{} \text {if}\, {\displaystyle (z_{gpj}-z_{gGj})^2=0}, \\ 0 &{} \text {if}\, {\displaystyle D_{gpj}\ne 0, \;\;\text {but}\;\;D_{gpl}=0,\;\;\text {for some}\;l},\\ \displaystyle \frac{1}{(D_{gpj})^{\frac{1}{\beta -1}}\displaystyle \sum _{l=1}^{m}D_{gpl}^{-\frac{1}{\beta -1}}} &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(19)
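A sketch of the weight update for one subgroup, in the regular case of Eq. 19 (the value of \(\beta\) is an illustrative assumption; the degenerate cases listed above are not handled):

```python
import numpy as np

def update_weights(Xg, u_gp, z_gp, z_gG, beta=2.0):
    """Eq. 18/19, regular case: w_gpj proportional to D_gpj^{-1/(beta-1)},
    normalized so that the weights sum to 1 over the m variables."""
    n_g = Xg.shape[0]
    sep = (z_gp - z_gG) ** 2                                 # (z_gpj - z_gGj)^2
    # D_gpj = sum_i u_gip (x_ij - z_gpj)^2 / (n_g (z_gpj - z_gGj)^2)
    D = (u_gp[:, None] * (Xg - z_gp) ** 2).sum(axis=0) / (n_g * sep)
    w = D ** (-1.0 / (beta - 1.0))
    return w / w.sum()
```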

The problem P3 is solved by

$$\begin{aligned} u_{gip} = \left\{ \begin{array}{ll} 1 &{} \text {if}\, {\displaystyle \sum _{j=1}^{m}w^{\beta }_{gpj}{\frac{(x_{ij}-z_{gpj})^{2}}{n_{g}(z_{gpj}-z_{gGj})^{2}}}\le \sum _{j=1}^{m}w^{\beta }_{grj}\frac{(x_{ij}-z_{grj})^{2}}{n_{g}(z_{grj}-z_{gGj})^{2}}}, \\ 0 &{} \text {otherwise,} \end{array}\right. \end{aligned}$$
(20)

where \(1\le r \le c_g\), \(r\ne p\).
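A sketch of the allocation rule of Eq. 20, assigning each object of class g to the subgroup with the smallest weighted, separation-normalized distance (names are illustrative):

```python
import numpy as np

def allocate(Xg, Wg, Zg, z_gG, beta=2.0):
    """Eq. 20: assign each object of class g to the subgroup p minimizing
    sum_j w_gpj^beta (x_ij - z_gpj)^2 / (n_g (z_gpj - z_gGj)^2)."""
    n_g, c_g = Xg.shape[0], Zg.shape[0]
    crit = np.empty((n_g, c_g))
    for p in range(c_g):
        sep = n_g * (Zg[p] - z_gG) ** 2
        crit[:, p] = ((Wg[p] ** beta) * (Xg - Zg[p]) ** 2 / sep).sum(axis=1)
    U = np.zeros((n_g, c_g))
    U[np.arange(n_g), crit.argmin(axis=1)] = 1             # binary membership u_gip
    return U
```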

The same process is applied to the partitions of the other groups \(q \ne g\), for \(q=1, \ldots , N\), of the primary dataset, by optimally computing \(u_{qip}\), \(z_{qpj}\), and \(w_{qpj}\).

3.3 New DC Variant Algorithm

In this section, we provide a comprehensive explanation of the algorithm used in the novel DC variant clustering method. The aim of this algorithm is to create new subgroups from the available labeled data by detecting hidden patterns.

The algorithm is designed to associate a unique distance metric with each cluster, which is used to compare clusters and their representatives. This distance measure is not fixed and varies from subgroup to subgroup, changing with each iteration until convergence. The adaptive nature of this distance measure offers the advantage of assigning weights to the variables that are more representative or informative of a particular cluster, resulting in a more accurate clustering algorithm.

This adaptive approach aims to identify a partition of each original class, denoted as \(G_{1},\ldots ,G_{N}\), respectively into \(n_{c_{1}}, \ldots ,n_{c_{N}}\) subgroups. Here, \(G_{g}=\{C_{g1},\ldots ,C_{gn_{c_{g}}}\}\) specifies the partitions for group g, and the corresponding centroids, denoted as \(Z_{g}=\{Z_{g1},\ldots , Z_{gn_{c_{g}}}\}\), for each group are computed using the formula for centroids Eq. 12. Additionally, for each subgroup, a set of weights is assigned from the set \(W_{g}=\{W_{g1},\ldots ,W_{gn_{c_{g}}}\}\).

The algorithm we propose looks for a local minimum of the objective function in Eq. 7.

It requires, as an input, the training dataset, with N apriori classes, and the number of subgroups for each apriori class. It starts from an initial random partitioning of the apriori classes into subgroups, then initializes the weights of variables for each subgroup to 1/m (where m is the number of variables). An initial set of centroids Z is computed according to Eq. 12, based on the initial random partition and on the weights in W.

The iterative part of the algorithm alternates, at each iteration t, the representation and allocation steps introduced in Sect. 3.2, in order to provide the partitions of the N apriori classes \(G_{1},\ldots ,G_{N}\) into \(n_{c_{1}}, \ldots ,n_{c_{N}}\) subgroups, the set of centroids Z, and the weights W.

At each iteration t, a check of the convergence of the algorithm is performed by evaluating the criterion \(P_t\):

$$\begin{aligned} {\begin{matrix} P_{t} =&\sum _{g=1}^{N}\sum _{p=1}^{c_{g}}\frac{\sum _{i=1}^{n_{g}}u_{gip}d^{2}_{g}(X_{i},Z_{gp})}{n_{g}d^{2}(Z_{gp},Z_{gG})}, \end{matrix}} \end{aligned}$$
(21)

where \(X_i\), \(Z_{gp}\), and \(u_{gip}\) are defined as before.

The algorithm iterates as long as \(\Vert P_{t+1}-P_{t}\Vert >0\), which means that a further iteration improves the criterion. In other words, the algorithm continues as long as there is a decrease in the intra-cluster distances within the subgroups of an original group and/or an increase in the inter-cluster distances.

Algorithm 1 displays the pseudocode of the suggested DC clustering algorithm.

Algorithm 1 Weighted dynamic clustering algorithm with adaptive Euclidean distance.
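For concreteness, the following high-level sketch alternates the P1, P2, and P3 updates over all apriori classes until the criterion stops decreasing; it assumes the hypothetical helpers `global_centroid`, `class_criterion`, `update_centroid`, `update_weights`, and `allocate` sketched earlier are in scope, ignores degenerate cases, and is not the verbatim Algorithm 1:

```python
import numpy as np

def dc_variant(X_by_class, n_sub, beta=2.0, max_iter=100):
    """Sketch of the weighted DC variant over N apriori classes.
    X_by_class: list of (n_g, m) arrays; n_sub: list of subgroup counts c_g."""
    rng = np.random.default_rng(0)
    N, m = len(X_by_class), X_by_class[0].shape[1]
    # random initial partitions, uniform weights 1/m, and mean-based initial
    # centroids (assumes each random subgroup is non-empty)
    U = [np.eye(n_sub[g])[rng.integers(0, n_sub[g], X_by_class[g].shape[0])] for g in range(N)]
    W = [np.full((n_sub[g], m), 1.0 / m) for g in range(N)]
    Z = [np.array([X_by_class[g][U[g][:, p] == 1].mean(axis=0) for p in range(n_sub[g])])
         for g in range(N)]
    P_prev = np.inf
    for _ in range(max_iter):
        for g in range(N):
            z_gG = global_centroid(Z, g)
            for p in range(n_sub[g]):                      # P1: centroids (Eq. 12)
                Z[g][p] = update_centroid(X_by_class[g], U[g][:, p], z_gG)
            for p in range(n_sub[g]):                      # P2: weights (Eq. 19)
                W[g][p] = update_weights(X_by_class[g], U[g][:, p], Z[g][p], z_gG, beta)
            U[g] = allocate(X_by_class[g], W[g], Z[g], z_gG, beta)   # P3 (Eq. 20)
        P_t = sum(class_criterion(X_by_class[g], U[g], W[g], Z[g],
                                  global_centroid(Z, g), beta) for g in range(N))
        if not P_t < P_prev:                               # criterion no longer improves
            break
        P_prev = P_t
    return U, W, Z
```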

3.4 KNN Classifier Based on Adaptive Distances and Novel Patterns

The K-nearest neighbor classifier is a supervised learning technique that, given training data and a predetermined value of k, finds the k nearest training points of a query by means of a distance computation and assigns a class label to the query through the majority voting rule. Despite its simplicity, its efficacy is comparable to that of the most complex classifiers in the literature. This classifier relies heavily on measuring the distance or similarity between the tested examples and the training examples, which raises an essential question about which of the numerous available distance or similarity measures should be used. Therefore, we propose an adaptive distance parameterized by weight vectors. The weights are estimated during the first clustering step on the apriori classes, so that each subgroup is associated with its own weight vector. The main idea of the KNN classifier with adaptive distances is that the distance used to compare a test object with its nearest points from the training dataset changes with each training point; objects belonging to the same subgroup, however, share the same weight vector.

Let \(T = \{(X_{i}, y'_{ip})\}_{i=1}^{n}\) represent a training set consisting of n training instances, with each instance belonging to one of N classes. Each training instance \(X_{i}\) is an element of an m-dimensional space \(R^{m}\), and its corresponding label \(y'_{ip}\) is obtained from the initial clustering step, where p represents the subgroup of \(X_i\) within its original apriori class \(y_i\). When a new query \(S_{h}\) is given, we first compute the adaptive distances between \(S_{h}\) and each training instance in T. The adaptive distance for a given data point \(X_{i}\) and query \(S_{h}\) is defined as follows:

$$\begin{aligned} \begin{aligned} d_{y'_{ip}}(X_{i},S_{h})= d_{W_{y_{i}p}}(X_{i},S_{h}) = \sum _{j=1}^{m}w_{y_{i}pj}(x_{ij}-s_{hj})^{2}. \end{aligned} \end{aligned}$$
(22)

Here, \(W_{y_{i}p}\) is a vector of weights corresponding to the p-th subgroup of the apriori class \(y_{i}\).

The n distances are then sorted in ascending order. \(N_{K}(S_{h})=\{(X_{j},y'_{jp})\}_{j=1}^{K}\) denotes the set of K-nearest neighbors of \(S_{h}\), i.e., the K training instances with the smallest distances. Finally, the majority voting rule is used to assign the query \(S_{h}\) to a subgroup \(C_{gp}\):

$$\begin{aligned} C_{gp} = \underset{C_{ip'}}{{{\,\textrm{argmax}\,}}} \sum _{(x,y)\in N_{K}(S_{h})}\mathbbm {1}_{C_{ip'}}(y),\quad i=1,\dots ,N,\;\; p'=1,\dots ,c_i \end{aligned}$$
(23)

where \(\mathbbm {1}_{C}(.)\) is the indicator function:

$$\begin{aligned} \mathbbm {1}_{C}(y)={\left\{ \begin{array}{ll} 1 &{} \text {if } y\in C,\\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(24)
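A sketch of this adaptive-distance KNN rule (Eqs. 22–23); the names `train_X`, `train_sub` (subgroup labels from the clustering step), and `train_W` (the weight vector inherited by each training point from its subgroup) are illustrative:

```python
import numpy as np
from collections import Counter

def dc_knn1_predict(train_X, train_sub, train_W, s, k):
    """Eq. 22-23: weighted squared Euclidean distance between the query s and
    each training point, using the weight vector of that point's subgroup,
    then majority vote among the k nearest neighbors' subgroup labels."""
    d = (train_W * (train_X - s) ** 2).sum(axis=1)       # adaptive distance per point
    nn = np.argsort(d)[:k]                               # k smallest distances
    return Counter(train_sub[nn]).most_common(1)[0][0]   # winning subgroup label
```

The apriori class of the query is then obtained by mapping the winning subgroup back to the class from which it was carved out.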

3.5 The DC-KNN Combined Algorithm

DC-KNN (dynamic clustering and K-nearest neighbors) is an algorithm that combines the efficiency of the DC algorithm with the classification by KNN. The basic idea behind DC-KNN is to use DC with adaptive distances to re-cluster the classes of the training dataset into different subgroups and then use KNN to classify the test set based on the new labels obtained from the clustering step.

The algorithm starts by using DC to cluster each original group into a specific number of clusters; the optimal number of subgroups can be determined by the Silhouette or Elbow method. The resulting clusters are used as a preprocessing step for the KNN algorithm, allowing it to work on more homogeneous subsets of data. This combination improves the accuracy and efficiency of the KNN algorithm, as it enables the classifier to exploit the within-class patterns uncovered by the clustering step.

After the clustering step, KNN is used to classify new data points based on the newly discovered patterns. The KNN algorithm finds the K nearest neighbors of a given data point and determines the class of the majority of those neighbors. In this way, the DC-KNN algorithm is able to effectively combine the strengths of both DC and KNN, making it a powerful tool for data classification.

It is worth noting that the DC-KNN algorithm can be sensitive to the initial partitions. Hence, choosing the appropriate number of clusters and setting the parameters for the DC algorithm is essential.

The execution of the DC-KNN classification algorithm follows the steps outlined in Algorithm 2.

Algorithm 2 DC-KNN combining algorithm.
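A compact sketch of the DC-KNN1 pipeline, assuming the `dc_variant` and `dc_knn1_predict` sketches above are available (the subgroup-to-class bookkeeping is an illustrative choice):

```python
import numpy as np

def dc_knn1(X_by_class, X_test, n_sub, k, beta=2.0):
    """Step 1: cluster each apriori class with the weighted DC variant.
    Step 2: relabel the training points by subgroup, classify test points
    with the adaptive-distance KNN rule, and map subgroups back to classes."""
    U, W, Z = dc_variant(X_by_class, n_sub, beta)
    train_X, train_sub, train_W, sub_class = [], [], [], []
    sid = 0
    for g, Xg in enumerate(X_by_class):
        for p in range(n_sub[g]):
            idx = U[g][:, p] == 1
            train_X.append(Xg[idx])
            train_sub += [sid] * int(idx.sum())            # new label = subgroup id
            train_W.append(np.tile(W[g][p], (int(idx.sum()), 1)))
            sub_class.append(g)                            # subgroup sid belongs to class g
            sid += 1
    train_X, train_W = np.vstack(train_X), np.vstack(train_W)
    train_sub = np.array(train_sub)
    preds = [sub_class[dc_knn1_predict(train_X, train_sub, train_W, s, k)]
             for s in X_test]
    return np.array(preds)
```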

3.6 The DC-KNN Using the Centroids as the Nearest Neighbors

A new variant of the KNN algorithm is introduced in this study for more accurate predictions of new data points based on the new subgroups. The proposed classification approach classifies instances by computing the distance between new instances and the subgroup centroids. The allocation is then determined by the label of the apriori class to which the nearest subgroups belong.

The DC-KNN2 classifier uses the centroids and weights obtained from DC to determine the nearest neighbors of a new element. Here, K denotes the number of centroids closest to the new query. The optimal value of K is chosen from the range 1 to \((\displaystyle \min _{1\le i\le N}(c_i)+1)\). The algorithm follows the sequential steps outlined in Algorithm 3.

Algorithm 3 DC-KNN combining algorithm using centroids as NN.
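A sketch of the centroid-based variant: the neighbors are the subgroup centroids themselves, each compared with its own weight vector, and the query takes the apriori class most represented among the K closest centroids (illustrative names, assuming the outputs of the clustering step):

```python
import numpy as np
from collections import Counter

def dc_knn2_predict(Z, W, s, K):
    """Distances from the query s to every subgroup centroid, using each
    subgroup's own weight vector; majority vote over the apriori classes
    of the K closest centroids."""
    dist, cls = [], []
    for g in range(len(Z)):                      # Z[g]: (c_g, m) centroids of class g
        for p in range(Z[g].shape[0]):
            dist.append(np.sum(W[g][p] * (Z[g][p] - s) ** 2))
            cls.append(g)
    nn = np.argsort(dist)[:K]
    return Counter(np.array(cls)[nn]).most_common(1)[0][0]
```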

4 Experiments

This section reports extensive experiments on different real and synthetic datasets to validate the classification performance of the proposed DC-KNN classifiers. The DC-KNN approaches are compared to the KNN and Kmeans-KNN methods in terms of classification accuracy.

4.1 Experimental Results on Real Datasets

To thoroughly assess the performance and robustness of the proposed DC-KNN algorithms, experiments are conducted comparing them with classical KNN and Kmeans-KNN. The latter uses the K-means clustering algorithm as a first step and then applies KNN to the results of the clustering. The experiments are conducted on real datasets sourced from the UCI Machine Learning Repository (Bache & Lichman, 2013), the KEEL attribute-noise datasets (Alcala-Fdez et al., 2011), and the UCR Time Series Classification Repository (Dau et al., 2018). Classification accuracy is used to measure the performance of all the approaches in each experiment. Note that DC-KNN1 denotes Algorithm 2, which employs data points as nearest neighbors. In the KNN algorithm, the k nearest neighbors are selected from the data points of all available classes. Similarly, in the Kmeans-KNN and DC-KNN1 algorithms, the nearest neighbors are selected from the data points of all the subgroups of the apriori classes. In the DC-KNN2 algorithm, however, the nearest neighbors are the centroids of the subgroups obtained from DC.

The objective of the experiments is to demonstrate the classification capabilities of the proposed techniques on different kinds of datasets: real datasets, noisy numerical datasets obtained from the KEEL machine learning repository, and time series datasets from the UCR database. The datasets used are summarized in Table 1; they vary in the number of total samples, features, classes, and test samples.

We utilize the Breast Cancer Wisconsin (Diagnostic) dataset, abbreviated as “Breast,” the “ILPD” dataset, and the noisy “Yeast” dataset from the UCI database. Additionally, we perform tests on the “Computers,” “ScreenType,” and “StarLightCurves” datasets from the UCR repository for our time series analysis. Following the approach employed in Maturo and Verde (2022), we use functional data analysis to describe the time series datasets and extract the coefficients of the B-spline decomposition as features. The six noise datasets from the KEEL repository are “Sonar,” “Iono,” “Heart,” “Pima,” “Spambase,” and “Iris.” The experiments employ the abbreviations Yeast-n, Sonar-n, Iono-n, Heart-n, Pima-n, Spambase-n, and Iris-n to distinguish the noisy data. The noise datasets are exposed to a noise intensity of 10%: around 10% of the samples in each dataset are chosen at random, and the values of a particular attribute for these samples are replaced with random values drawn uniformly from the minimum-maximum range of the attribute’s domain. Around one-third of the total samples from each dataset are chosen as test samples, while the rest are designated as training samples. Each noise dataset within the KEEL repository has been divided into five distinct training/testing splits, and the number of testing samples for each set is presented in Table 1. The final classification evaluation of each competing method is obtained by averaging the classification results over the five splits of each noise dataset. In addition, most of the datasets have a small number of samples; they can therefore be used to validate the classification performance in scenarios with a small sample size.
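A sketch of the attribute-noise procedure described above, under the assumption that one randomly chosen attribute is corrupted per selected sample (the exact KEEL protocol may differ):

```python
import numpy as np

def add_attribute_noise(X, rate=0.10, seed=0):
    """Pick `rate` of the samples at random and replace the value of one
    attribute with a uniform draw from [min, max] of that attribute."""
    rng = np.random.default_rng(seed)
    X_noisy = X.copy()
    rows = rng.choice(X.shape[0], size=int(rate * X.shape[0]), replace=False)
    for i in rows:
        j = rng.integers(0, X.shape[1])                    # attribute to corrupt
        X_noisy[i, j] = rng.uniform(X[:, j].min(), X[:, j].max())
    return X_noisy
```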

The classical KNN algorithm is recognized for its effectiveness in scenarios with a clear separation between classes. However, accurate classification of datasets containing noise poses a greater challenge. To address this, we evaluated the proposed DC-KNN methods on selected noisy datasets from the KEEL repository. The evaluation is based on the classification accuracy as shown in Table 2.

Table 1 Experimental datasets from the UCR, UCI, and KEEL repositories used in the study
Table 2 Classification accuracy (%) of different methods on various datasets: values of K and the number of subgroups are presented in parentheses

In the experiments, we assess the classification performance of the proposed DC-KNN methods by varying the neighborhood size K on each dataset. The parameters \((n_{c_1}, \ldots , n_{c_N})\) of the DC step, which represent the numbers of subgroups determined by the Silhouette method, are also taken into consideration. The values of K range from 1 to 20 (from 1 to 7 for DC-KNN2), in increments of 1, for all datasets. The classification results of the suggested techniques, for different values of K, are shown in Fig. 2.

Fig. 2 The classification accuracy of each approach is evaluated on different datasets, with varying values of K

The DC-KNN2 algorithm (Algorithm 3), which uses the DC centroids for detecting the nearest neighbors, requires fewer neighbors and achieves the highest accuracy on the majority of datasets compared to the other algorithms.

Moreover, on most of the real datasets, the classification accuracy remains constant as the value of K increases. This is because the DC-KNN2 algorithm requires the number of neighbors to be smaller than \((\displaystyle \min _{1\le i\le N}(c_i)+1)\). On the other hand, the DC-KNN1 algorithm (Algorithm 2) consistently achieves satisfactory classification results when varying the value of K in comparison to the classical methods, particularly at higher values of K. This implies that the suggested DC-KNN1 and DC-KNN2 algorithms are more robust to changes in K while still achieving accurate classification. This advantage may be attributed to the use of DC and of a novel objective function in the initial phase of both approaches, which supports the good performance of the classifier. The classification results depicted in Fig. 2 clearly show the good classification performance of the two proposed approaches. In most instances, the proposed methods outperform the other comparison methods.

4.2 Experimental Simulation Results

In order to show the effectiveness of the proposed DC-KNN classifiers, we generate synthetic datasets from different models in this subsection. In this experiment, we performed simulations using various data generating processes (DGPs) with distinct characteristics. The specifications of the DGPs are detailed in the second section of the Supplementary Material and illustrated in Fig. 3. Specifically, we generated datasets based on different DGPs and examined scenarios where the number of clusters remained fixed at the true value, as well as scenarios where the number of clusters was estimated.

Fig. 3 Two-dimensional representation of the simulated data using principal component analysis (PCA)

To determine the number of subgroups inside each apriori class, we can employ either the Silhouette or the Elbow method. However, these traditional criteria do not guarantee an improvement of the approach's performance; our future work will therefore concentrate on improving this aspect.

The results of the simulations performed on the data generating processes (DGPs) are displayed in Table 3. The table presents a detailed analysis of the effectiveness of the new techniques, DC-KNN1 and DC-KNN2, which use the clustering results as stated in Algorithms 2 and 3. The comparison is made against the classical KNN and Kmeans-KNN methods. As expected, the DC-KNN algorithms lead to higher accuracy values. The unsatisfactory results of classical KNN can be attributed to the complexity and the overlapping of the data. A more effective approach is to first use clustering to discover hidden patterns within the classes before performing classification.

Table 3 Classification accuracy (%) of different methods on simulated datasets: values of K and the number of subgroups are presented in parentheses

A detailed analysis of the classification effectiveness of DC-KNN techniques with varying K values can be found in Section 3 of the Supplementary Material.

The effectiveness of the proposed DC-KNN approaches in classifying diverse dataset types, including real datasets, time series datasets, noisy datasets, and simulated datasets, has been thoroughly demonstrated through extensive experiments. Based on the results of these classification studies, we highlight the following observations, which emphasize the main contributions of our research:

  1. 1.

    The DC-KNN techniques are robust to changes in the neighborhood size K compared to their competitors. The experimental results show that the proposed DC-KNN1 and DC-KNN2 algorithms consistently outperform the other approaches and achieve good classification performance. In particular, the DC-KNN2 algorithm demonstrates strong and consistent performance when the value of K is smaller than \((\displaystyle \min _{1\le i\le N}(c_i)+1)\); this indicates a sensitivity of the DC-KNN2 algorithm to the number of centroid neighbors.

  2. 2.

    The process of learning clustering with adaptive distances aims to uncover concealed patterns within training groups. This is achieved by optimizing a newly proposed objective function and utilizing the outcomes of the clustering step in conjunction with the KNN classifier. This approach effectively enhances the performance of KNN-based classification.

  3. 3.

    DC-KNN exhibits strong performance in scenarios with limited training data. The performance of KNN-based classification can be significantly influenced by the selection of neighbors, particularly when working with various data sets and small sample sizes. Nevertheless, the experimental results in these situations demonstrate that our DC-KNN outperforms the competing methods.

  4. 4.

    The DC-KNN algorithms exhibit greater resilience to noisy data. Our DC-KNN algorithms outperform existing algorithms when applied to data containing noise.

The excellent performance of our methods can be attributed to several factors. Firstly, we utilize DC to uncover hidden patterns by adjusting the distances between data points. This allows us to effectively weigh the features of different subclasses. Secondly, we introduce a novel objective function that considers both the compactness within each apriori group and the separation between apriori classes. This enables us to accurately cluster the training predefined groups. Lastly, we adapt the KNN classifier to incorporate the augmented labels obtained from the clustering step, enhancing the accuracy of classification tasks. Therefore, our DC-KNN algorithms exhibit strong potential as a KNN-based classifier due to their robustness and efficacy in pattern classification.

5 Conclusions and Future Works

This research presents the DC-KNN algorithm, a novel supervised approach that combines dynamic clustering and the K-nearest neighbor classifier. The unsupervised clustering phase is used to discover new information from the original datasets that can help to improve supervised classification accuracy. DC in the unsupervised phase uses a new objective function that takes into account both intra-cluster and inter-cluster similarity, with cluster weights for variables being computed automatically and optimized as the algorithm converges. These weights can be used to identify important variables for clustering and eliminate variables that could introduce noise in the classification process. In the supervised phase, the new weights are employed to determine the nearest neighbor of new data points. Overall, the DC-KNN algorithm provides a unique and effective classification technique by combining DC and KNN.

The results of the experiments on real datasets, in which the usual K-means and the proposed dynamic clustering algorithm are used to enhance the supervised KNN classification, show that better results are obtained when dynamic clustering is applied before training the supervised classifier. The reason is that the proposed method provides more precise information and more homogeneous clusters by using clustering in the first step as a preprocessing step to discover the hidden patterns. As a result, the proposed method improves the classification results and increases the classifier's accuracy.

The focus of this study is to discover hidden information that can be captured through novel patterns, leading to the identification of subgroups of instances, characterized by these new sub-patterns, within the previously established classes. The objective is to ascertain whether the DC-KNN methodology enhances the accuracy of classification. Initially, a DC algorithm is employed to identify novel patterns within the original classes. This research demonstrates the use of this algorithm on a range of datasets that are not affected by outliers. It is worth noting that alternative metrics can be chosen to handle outliers when computing the adaptive distances between the data points and the centroids.

The primary aim of this investigation is to examine the theoretical aspect of integrating unsupervised and supervised classification and to evaluate whether this novel approach enhances classification performance compared to traditional classifiers. Additionally, several techniques can be employed to ascertain the optimal number of subgroups for each original class.

In this two-stage study, the unsupervised method utilized is DC, while the supervised strategy employed is KNN. Future research aims may concentrate on the combination of different clustering techniques with alternative classifiers to investigate the performance of combining unsupervised and supervised classification using various strategies, as well as determining how such combinations can impact the final outcome. Moreover, attempts could be made to formulate an objective function that condenses the process into a single-step strategy.