1 Introduction

Classification is a fundamental task in machine learning, which consists of assigning data objects to apriori classes based on the values they assume on a set of features. It has received significant interest and has been extensively utilized in fields such as healthcare and medical diagnosis (Sivasankari et al., 2022; Malakouti, 2023), as well as image and video recognition (Wang et al., 2023; Chen et al., 2021).

The accuracy of a classifier depends on the availability of relevant and informative features, as well as on the choice of the classification algorithm. However, in many real-world scenarios, the data is complex, and the available features may not provide enough information to achieve high accuracy.

The statistical literature on supervised classification is growing rapidly. A single algorithm often produces unsatisfactory results, typically because of the complex structure of the groups to be classified (e.g., unbalanced distributions, non-linear relationships among predictors, presence of anomalous values). In recent years, techniques integrating or merging multiple algorithms from both supervised and unsupervised learning have been developed to enhance the decision rules provided by the model (Soheily-Khah et al., 2018; Sarker, 2021).

The K-nearest neighbor (KNN) algorithm has gained recognition as a powerful tool in the field of machine learning, providing an effective and straightforward method for classification in various pattern recognition scenarios (Zhang, 2016; Taunk et al., 2019). The primary approach employed by KNN involves determining the class of query samples by measuring the distance to the objects in the training set. The label of the query sample is then set by majority voting on the membership of the k-nearest objects in the training set. Recently, several novel adaptations of the KNN algorithm have been developed (Zhang et al., 2017b; Luo et al., 2020; Rastin et al., 2021a).

The Euclidean distance is often used with the KNN algorithm to measure the dissimilarity between training and testing data. This procedure involves computing the dissimilarities, determining the k nearest neighbors based on them, and subsequently classifying the test sample according to the dominant class among those k neighbors. Although the Euclidean distance is easy to understand, it assigns equal importance to all sample features when calculating the distance. This equal weighting can be a limitation, particularly in situations where distinct features have differing degrees of relevance to the classification objective. To tackle this problem, several studies have suggested alternative distance metrics that provide a more nuanced method for calculating distances in KNN-based classification and have the potential to improve its performance (Chomboon et al., 2015; Ruan et al., 2018). Adaptive distances have also proven valuable in unsupervised learning: an adaptive fuzzy c-means algorithm has been proposed for interval-valued data clustering that accounts for interval membership across the clusters of the partition, and de Carvalho et al. (2022) presented a batch self-organizing map (SOM) algorithm for distributional-valued data based on a weighted Wasserstein distance, where the weights are computed through the optimization of the clustering loss function.

3 Methodology

This section presents two novel approaches to supervised classification using clustering. The first step involves developing a new DC variant to cluster the apriori classes. The approach incorporates a new objective function that combines intra-cluster compactness, which uses an adaptive distance metric to compute dispersion information between subgroups of a given class, and inter-cluster separation, which measures the distance between a subgroup of a specific class and the subgroups of the other apriori classes. Finally, the results obtained from DC are used for classification with the new KNN algorithm. In this process, the subgroup weights are used with the adaptive distance to measure the similarity between a testing sample and the neighboring subgroup centroids. Subsequently, the k nearest neighbors are determined based on the calculated similarities. The allocation of a new instance to a class is based on the majority vote among its k nearest subgroup centroids.

3.1 Dynamic Clustering Algorithm

Clustering is a widely utilized technique in various applications, including image processing (Chang et al., 2017), video processing (Alayrac et al., 2016), gene analysis (Dapas et al., 2020), healthcare (Liao et al., 2016), and community detection (Li et al., 2022), among others. It involves dividing a dataset into groups, or clusters, based on similarity criteria, where objects within the same cluster are more alike than those in different clusters.

In this paper, our focus is on DC, an unsupervised learning algorithm that aims to partition data into clusters while simultaneously finding cluster representatives consistent with the distance function used for allocating units. Typically, the representatives are obtained as the minimizers of the sum of distances. The classic K-means algorithm can be seen as a specific case of DC where the distance metric is the Euclidean distance, and the representatives (centroids) are calculated as cluster averages.

The original concept of dynamic clustering was introduced by Diday (1971) and involves a two-step process of constructing clusters and selecting the best prototype for each cluster based on an adequacy criterion (Diday & Simon, 1976). The advantages of this scheme are mainly its flexibility with respect to the nature of the analyzed data and the choice of the distance function and the focus on providing cluster representatives, named prototypes. For instance, DC methods exist for datasets described by interval variables (De Carvalho & Lechevallier, 2009) and histogram variables (Balzanella & Verde, 2020; de Carvalho et al., 2022).

Let \(X=\{X_{1},\ldots ,X_{i},\ldots ,X_{n}\}\) be a set of n objects, where each object \(X_{i}=\{x_{i1},\ldots ,x_{im}\}\) is described by a set of m features. The general DC looks for the partition \(G=\{C_1,\ldots , C_K\}\) in K clusters and the set \(Z=\{Z_1,\ldots , Z_K\}\) of K prototypes representing the clusters in G, such that the following \(\Delta\) fitting criterion between the set Z of prototypes and the partition G is minimized:

$$\begin{aligned} \Delta (G,Z)=\sum _{k=1}^{K}\sum _{X_i\in C_k}d(X_i,Z_k) \end{aligned}$$
(1)

The fitting criterion is defined as the sum of dissimilarities or distance measures between each object \(X_i\) belonging to a class \(C_k\in G\) and the class representation \(Z_k \in Z\).

In this context, the DC algorithm iteratively implements the following representation and allocation steps:

  1. 1.

    The representation step describes the K clusters \((C_1,\dots ,C_K)\) of the partition G through a vector \(Z=(Z_1,\dots ,Z_K)\) of prototypes. Keeping the partition \(G=\hat{G}\) fixed for the current iteration of the algorithm, Z is obtained from the minimization of \(\Delta (\hat{G},Z)\), which is equivalent to finding the \(Z_k\;\;(k=1,\dots ,K)\) that minimize \(\sum _{X_i\in C_k}d(X_i,Z_k)\).

  2. 2.

    The allocation step assigns each element \(X_i\) to a cluster \(C_k\) according to its proximity to the prototype \(Z_k\in Z\). Keeping \(Z=\hat{Z}\) fixed for the current iteration of the algorithm, it finds the partition G that minimizes \(\Delta (G,\hat{Z})\) by setting \(C_k=\{X_i\in X\mid d(X_i,Z_k)\le d(X_i,Z_l),\forall l=1,\dots ,K; l \ne k \}\). A minimal sketch of both steps is given below.
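A minimal sketch of this two-step scheme in its simplest setting, with the squared Euclidean distance and cluster means as prototypes (the function name and the stopping rule are illustrative choices, not part of the general formulation):

```python
import numpy as np

def dynamic_clustering(X, K, n_iter=100, seed=0):
    """Generic DC loop: alternate the representation (prototype) and
    allocation steps until the partition stops changing."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.integers(0, K, size=n)          # random initial partition
    for _ in range(n_iter):
        # representation step: with the squared Euclidean distance the
        # prototype Z_k minimizing the within-cluster sum is the cluster mean
        Z = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                      else X[rng.integers(0, n)] for k in range(K)])
        # allocation step: assign each X_i to its closest prototype
        d = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # criterion no longer improves
            break
        labels = new_labels
    return labels, Z
```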

3.1.1 DC as a Generalization of K-Means Algorithm

The K-means clustering methodology is broadly utilized as a partitioning strategy. The proposed dynamic clustering method is a generalization of the K-means algorithm: cluster centers do not necessarily have to be the centroids of clusters in \(R^m\); rather, they can be replaced by representatives that take various forms, depending on the problem to be addressed.

The K-means algorithm starts by selecting K initial cluster centers and then assigns each object to the closest cluster through the optimization of an objective function. As mentioned previously, the classical K-means algorithm only considers the intracluster compactness and the distances between the cluster centroids and individual data points. The membership matrix U, an \(n\times K\) binary matrix, indicates which objects are assigned to which clusters, and \(Z=\{Z_{1},\ldots , Z_{k},\ldots , Z_{K}\}\) represents the centroids of the K clusters, with elements \(Z_{k}=\{z_{k1},\ldots , z_{kj},\ldots , z_{km}\}\) for each feature \(j=1, \ldots , m\).

The objective function of the classic K-means without considering the inter-cluster separation is the following:

$$\begin{aligned} \begin{aligned} \Delta (U,Z) =&\sum _{k=1}^{K}\sum _{i=1}^{n}u_{ik}\sum _{j=1}^{m}(x_{ij}-z_{kj})^{2}, \end{aligned} \end{aligned}$$
(2)

such that \(\displaystyle \sum _{k=1}^{K} u_{ik}=1\), (with \(u_{ik}\in \{0,1\}\)), and \(X_i=\{ x_{i1},\ldots ,x_{ij},\ldots , x_{im} \}\) is an object of X described by m features.

DC algorithm optimizes the objective function by alternating the representation and allocation steps:

  1. 1.

    Representation step (the matrix of membership \(\hat{U}\) is fixed). The solution for the optimization problem \(\Delta (\hat{U},Z)\) is provided by the minimizer Z

    $$\begin{aligned} z_{kj}=\frac{\sum _{i=1}^{n}u_{ik}x_{ij}}{\sum _{i=1}^{n}u_{ik}}, \end{aligned}$$
    (3)

    where \(1\le k\le K\) and \(1\le j\le m\).

  2. 2.

    Allocation step (the vector of centroids \(\hat{Z}\) is fixed). According to Chan et al. (2004), the minimizer U of the optimization problem \(\Delta (U,\hat{Z})\) is given by

    $$\begin{aligned} u_{ik} = \left\{ \begin{array}{ll} 1 &{} \text {if}\, {\displaystyle \sum _{j=1}^{m}(x_{ij}-z_{kj})^{2}\le \sum _{j=1}^{m}(x_{ij}-z_{lj})^{2}},\;\;\forall l,\ 1\le l\le K, \\ 0 &{} \text {otherwise.} \end{array}\right. \end{aligned}$$
    (4)

The partitioning criterion in Eq. 2 decreases at each iteration and converges to a stationary value; a minimal sketch of this alternation follows.
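As a sketch of this special case, the two updates in Eqs. 3 and 4 can be written directly in terms of the binary membership matrix U (function names are illustrative; empty clusters are not handled):

```python
import numpy as np

def representation_step(X, U):
    """Eq. 3: z_kj = sum_i u_ik x_ij / sum_i u_ik, i.e. the cluster means.
    U is the (n, K) binary membership matrix, X the (n, m) data matrix;
    the sketch assumes every cluster has at least one member."""
    return (U.T @ X) / U.sum(axis=0)[:, None]

def allocation_step(X, Z):
    """Eq. 4: u_ik = 1 for the centroid with the smallest squared
    Euclidean distance to X_i, and 0 otherwise."""
    d = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # (n, K)
    U = np.zeros_like(d)
    U[np.arange(X.shape[0]), d.argmin(axis=1)] = 1
    return U
```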

3.1.2 The Need for Adaptive Distances in DC

The central concept of dynamic clustering with adaptive distances is to assign a specific distance measure, denoted as \(d_k\), to each cluster \(C_k\) and to minimize the sum of distances \(d_k(X_i, Z_k)\) between objects \(X_i\) belonging to cluster \(C_k\) and the centroid \(Z_k\). Importantly, the distances employed in the DC algorithm are not fixed in advance but rather are tailored to each cluster.

In this clustering algorithm, a weighting step is introduced. It assigns a weight to each variable for each cluster, reflecting the relevance of the variable in that cluster. The use of adaptive distances can also be viewed as a means of automatically scaling the variables, since scaling can greatly impact the dissimilarity values and the clustering outcomes.

The DC criterion, which incorporates adaptive distances, is expressed as follows:

$$\begin{aligned} \begin{aligned} \Delta (U,W,Z) =&\sum _{k=1}^{K}\sum _{X_i\in C_k}u_{ik}d_{k}(X_i,Z_k), \end{aligned} \end{aligned}$$
(5)

such that \(u_{ik}\in \{0,1\}\), \(\displaystyle \sum _{k=1}^{K} u_{ik}=1\)

In this context, distance \(d_{k}\) is a weighted sum of distances \(d_{w_{kj}}\)

$$\begin{aligned} \begin{aligned} d_{k}(X_i,Z_k) =&\sum _{j=1}^{m}d_{w_{kj}}(x_{ij},z_{kj}) = \sum _{j=1}^{m} w_{kj}d(x_{ij},z_{kj}) \end{aligned} \end{aligned}$$
(6)

The adaptivity of the distance \(d_{w_{kj}}\) is expressed by the vector of weights \(W_k\).
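For instance, taking the squared Euclidean distance as the component-wise measure, the cluster-specific distance of Eq. 6 becomes a weighted squared Euclidean distance. A small sketch with illustrative values:

```python
import numpy as np

def adaptive_distance(X_i, Z_k, W_k):
    """Eq. 6 with d(x, z) = (x - z)^2: d_k(X_i, Z_k) = sum_j w_kj (x_ij - z_kj)^2."""
    return np.sum(W_k * (X_i - Z_k) ** 2)

# Example: a weight vector that emphasizes the first feature makes the
# first coordinate dominate the comparison.
x, z = np.array([1.0, 4.0]), np.array([0.0, 0.0])
print(adaptive_distance(x, z, np.array([0.9, 0.1])))  # 0.9*1 + 0.1*16 = 2.5
```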

When using adaptive distances, the representation step is divided into two stages, so that the global optimization scheme is

  1. 1.

    Representation step

    1. 1.

      Stage 1: fix the matrix \(\hat{U}\) of membership and the vector of weights \(\hat{W}\). Find the solution \(Z_k= \{z_{k1}, \ldots , z_{km}\}\) of the optimization problem \(\Delta (\hat{U},\hat{W},Z)\).

    2. 2.

      Stage 2: fix the matrix \(\hat{U}\) of membership and the vector of centroids \(\hat{Z}\). Find the vector of weights \(W_k=\{w_{k1}, \ldots , w_{km}\}\) that minimizes the criterion \(\Delta (\hat{U},W,\hat{Z})\).

  2. 2.

    Allocation step. Fix the set of vectors of weights \(\hat{W}\) and the set of vectors of centroids \(\hat{Z}\). Find the membership matrix U that minimizes the criterion \(\Delta (U,\hat{W},\hat{Z})\).

The paper employs an adaptive distance metric, specifically a weighted Euclidean distance, to calculate the distance between subgroups within a given cluster. Explicit formulas for the optimum cluster centroids, as well as for the weights of the adaptive distances, are found based on a new objective function criterion. By integrating the procedures of data partitioning and centroids selection with adaptive distances, DC algorithm provides a comprehensive and flexible approach to clustering analysis for apriori classes.

3.2 Dynamic Clustering Algorithm to Partition Apriori Groups

In this section, a new objective function is proposed to discover new information in the original data by combining the intra-cluster compactness of the subgroups of the same apriori group and the inter-cluster separation between one subgroup and the subgroups of the other apriori classes, as illustrated in Fig. 1. Evaluating the weights of the variables of a subgroup using only the variation within the groups of a data set may be ineffectual. Under these conditions, inter-cluster separation can play a significant role in differentiating the significance of various patterns and in accounting for the heterogeneity among the subgroups of each original group.

We apply inter-cluster separation by introducing the global subgroups centroids of a data set. In contrast to the conventional DC, our proposed DC algorithm maximizes the distances between the subgroup’s centroid of an apriori group and the global subgroups centroid of the other apriori groups partition, while minimizing the distances between objects and their subgroups centroid.

Fig. 1 Scatter plot illustrating subgroups within two apriori groups

Let N be the total number of apriori classes, and \(U = \{U_{1}, \ldots ,U_{g}, \ldots , U_{N}\}\) the set of N matrices. Let \(n_g\) be the number of elements of class g and \(c_g\) be the number of subgroups in class g. Each \(U_g\) is an \(n_g \times c_{g}\) indicator matrix containing the membership of each element i of the apriori class g to the subgroup p, where \(u_{gip}=1\) denotes that the i-th object belonging to group g is assigned to subgroup p; otherwise, \(u_{gip} = 0\), indicating that the object is not assigned to subgroup p. Let \(Z=\{Z_1,\ldots ,Z_{g},\ldots ,Z_{N}\}\) be a set of vectors representing the centroids of each original group. For group g, let \(Z_{g} = \{Z_{g1}, \dots , Z_{gc_g}\}\) be a set of \(c_g\) vectors that represent the subgroups’ centroids and let \(W_g = \{W_{g1}, W_{g2}, \dots , W_{gc_g}\}\) be a set of weight vectors associated with the subgroups, where \(w_{gpj}\) represents the weight of the j-th variable related to the p-th subgroup for class g. Let \(\beta\) represent a parameter used for adjusting the weights.

With the aim of achieving both intra-cluster compactness and inter-cluster separation, the optimization process is performed using a DC algorithm in which the objective function is modified to emphasize the separation between clusters belonging to different apriori classes:

$$\begin{aligned} \begin{aligned} P(U,W,Z) =&\sum _{g=1}^{N}\sum _{p=1}^{c_{g}}\frac{\sum _{i=1}^{n_{g}}u_{gip}d^{2}_{g}(X_{i},Z_{gp})}{n_{g}d^{2}(Z_{gp},Z_{gG})}\\ =&\sum _{j=1}^{m}(\sum _{g=1}^{N}\sum _{p=1}^{c_{g}}\frac{w^{\beta }_{gpj}\sum _{i=1}^{n_{g}}u_{gip}(x_{ij}-z_{gpj})^{2}}{n_{g}(z_{gpj}-z_{gGj})^{2}}), \end{aligned} \end{aligned}$$
(7)

such that \(u_{gip}\in \{0,1\}\), \(\displaystyle \sum _{p=1}^{c_{g}} u_{gip}=1\), and \(\displaystyle \sum _{j=1}^{m}w_{gpj}=1\).

In the context of our study, the distance metric \(d_{g}\) is defined as a weighted sum of distances \(d_{w_{gpj}}\), where \(d_{w_{gpj}}\) represents the distance metric for the p-th subgroup of the g-th apriori class. The vector of weights \(W_{gp}\) demonstrates the adaptivity of the distance metric \(d_{w_{gpj}}\):

$$\begin{aligned} \begin{aligned} d_{g}(X_i,Z_{gp}) = \sum _{j=1}^{m}d_{w_{gpj}}(x_{ij},z_{gpj}) = \sum _{j=1}^{m} w_{gpj}(x_{ij}-z_{gpj})^2. \end{aligned} \end{aligned}$$
(8)

Let us assume that the present group is the \(g^{th}\) group. \(z_{gGj}\) represents the \(j^{th}\) feature of the global subgroups centroid of all other apriori groups, excluding the current group g.

We calculate \(z_{gGj}\) as

$$\begin{aligned} z_{gGj}=\frac{\displaystyle \sum _{h\in \{1,\ldots ,N\}\setminus \{g\}}c_{h}\sum _{q=1}^{c_{h}}z_{hqj}}{c_{1}+\cdots +c_{g-1}+c_{g+1}+\cdots +c_{N}}. \end{aligned}$$
(9)

To initiate the solution process of the objective function, it is necessary to initialize the parameters \(\hat{U}\), \(\hat{W}\), and \(\hat{Z}\) of all groups (for \(g=1, \ldots , N)\). Subsequently, the partition of group g is evaluated, thus reducing the minimization problem to

$$\begin{aligned} \begin{aligned} P(U,W,Z) =&\sum _{p=1}^{c_{g}}\sum _{i=1}^{n_{g}}u_{gip}\sum _{j=1}^{m}w^{\beta }_{gpj}\frac{(x_{ij}-z_{gpj})^{2}}{n_{g}(z_{gpj}-z_{gGj})^{2}}, \end{aligned} \end{aligned}$$
(10)

such that \(u_{gip}\in \{0,1\}\), \(\displaystyle \sum _{p=1}^{c_{g}} u_{gip}=1\), and \(\displaystyle \sum _{j=1}^{m}w_{gpj}=1\), \(1\le p\le c_g\).
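As an illustration, the global centroid of Eq. 9 and the per-class criterion of Eq. 10 could be evaluated as follows (a sketch; the containers `Z`, `U`, `W` indexed by class and subgroup, and the default value of \(\beta\), are assumptions of this example):

```python
import numpy as np

def global_centroid(Z, g):
    """Eq. 9: z_gG is a weighted mean of the subgroup centroids of all
    apriori classes h != g, each class weighted by its number of subgroups c_h."""
    # Z: list of arrays, Z[h] has shape (c_h, m)
    num = sum(Z[h].shape[0] * Z[h].sum(axis=0) for h in range(len(Z)) if h != g)
    den = sum(Z[h].shape[0] for h in range(len(Z)) if h != g)
    return num / den

def class_criterion(Xg, Ug, Wg, Zg, z_gG, beta=2.0):
    """Eq. 10 for one class g: weighted within-subgroup dispersion
    normalized by the separation from the global centroid z_gG."""
    n_g, c_g = Ug.shape
    total = 0.0
    for p in range(c_g):
        sep = (Zg[p] - z_gG) ** 2                           # (z_gpj - z_gGj)^2
        disp = (Ug[:, p, None] * (Xg - Zg[p]) ** 2).sum(axis=0)
        total += np.sum(Wg[p] ** beta * disp / (n_g * sep))
    return total
```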

To minimize Eq. 10, it is necessary to solve the problems P1, P2, and P3 iteratively.

  1. 1.

    Specifically, the representation step requires solving two distinct problems, P1 and P2. Problem P1: fix \(U=\hat{U}\), \(W=\hat{W}\) and solve the reduced problem \(P(\hat{U}, Z, \hat{W})\). Problem P2: fix \(U=\hat{U}\), \(Z=\hat{Z}\) and solve the reduced problem \(P(\hat{U}, \hat{Z}, W)\).

  2. 2.

    To address the allocation step, it is necessary to solve the problem denoted as P3. Problem P3: fix \(Z=\hat{Z}\), \(W=\hat{W}\) and solve the reduced problem \(P(U, \hat{Z}, \hat{W})\).

To solve the problem P1, we calculate the gradient of P with respect to \(z_{gpj}\) as

$$\begin{aligned} \frac{\partial P(\hat{U},\hat{W},Z)}{\partial z_{gpj}}=-2w_{gpj}^{\beta }\sum _{i=1}^{n_{g}}u_{gip}\frac{(x_{ij}-z_{gpj})(z_{gpj}-z_{gGj})^{2}+(z_{gpj}-z_{gGj})(x_{ij}-z_{gpj})^{2}}{n_{g}(z_{gpj}-z_{gGj})^{4}}; \end{aligned}$$
(11)

by setting Eq. 11 to zero, we have:

$$\begin{aligned} z_{gpj}=\frac{\sum _{i=1}^{n_{g}}u_{gip}x_{ij}(x_{ij}-z_{gGj})}{\sum _{i=1}^{n_{g}}u_{gip}(x_{ij}-z_{gGj})}. \end{aligned}$$
(12)

The initial section of the supplementary material contains the proof and the necessary and sufficient conditions required for the realization of this finding.

It is worth noticing that \(z_{gpj}\), the representative (e.g., centroid) of \(C_{gp}\), can be interpreted as a weighted average of the elements of the \(p^{th}\) subgroup, with weights given by the difference between \(x_{ij}\) and the global centroid \(z_{gGj}\) computed as in Eq. 9 on all other apriori groups, excluding the current group.

The higher the difference, the more the subgroup element contributes to the determination of the subgroup’s centroid. This result is due to the optimization of the discriminant component of the criterion which emphasizes the separation between classes.

The internality condition of the centroid \(z_{gpj}\) of the subgroup \(C_{gp}\) (for each variable j) is guaranteed under the conditions demonstrated in the Appendix, whereas the centroid can fall outside the cluster's interval of values (for each j) the closer \(z_{gGj}\) is to the mean of the elements of the cluster \(C_{gp}\).
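A sketch of the centroid update of Eq. 12 for one subgroup p of class g (names are illustrative; `z_gG` is the global centroid of Eq. 9):

```python
import numpy as np

def update_centroid(Xg, u_gp, z_gG):
    """Eq. 12: z_gpj is a weighted mean of the x_ij of subgroup p, with
    weights (x_ij - z_gGj); elements farther from the global centroid of
    the other classes contribute more to the subgroup centroid."""
    # Xg: (n_g, m) objects of class g; u_gp: (n_g,) binary membership of subgroup p
    diff = Xg - z_gG                                   # (x_ij - z_gGj)
    num = (u_gp[:, None] * Xg * diff).sum(axis=0)
    den = (u_gp[:, None] * diff).sum(axis=0)
    return num / den
```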

The problem P2 will be solved by setting up a Lagrangian equation to \(P(\hat{U}, \hat{Z}, W)\) with multiplier \(\lambda\). Let \(L(W, \lambda )\) be the Lagrangian

$$\begin{aligned} L(W, \lambda )=\sum _{p=1}^{c_{g}}\sum _{j=1}^{m}w^{\beta }_{gpj}D_{gpj}-\lambda (\sum _{j=1}^{m}w_{gpj}-1), \end{aligned}$$
(13)

where \(D_{gpj}=\sum _{i=1}^{n_{g}}u_{gip}\frac{(x_{ij}-z_{gpj})^{2}}{n_{g}(z_{gpj}-z_{gGj})^{2}}\). Setting the gradient of Eq. 13 with respect to \(w_{gpj}\) and \(\lambda\) to zero, we obtain

$$\begin{aligned} \frac{\partial L(W, \lambda )}{\partial w_{gpj}}=\beta w_{gpj}^{\beta -1}D_{gpj}-\lambda =0; \end{aligned}$$
(14)

from Eq. 14, we obtain

$$\begin{aligned} w_{gpj}=\left( \frac{\lambda }{\beta D_{gpj}}\right) ^{\frac{1}{\beta -1}}. \end{aligned}$$
(15)

The gradient with respect to \(\lambda\)

$$\begin{aligned} \frac{\partial L(W, \lambda )}{\partial \lambda }=-(\sum _{j=1}^{m}w_{gpj}-1)=0; \end{aligned}$$
(16)

substituting Eq. 15 into Eq. 16, we obtain

$$\begin{aligned} \lambda ^{\frac{1}{\beta -1}}=\displaystyle \frac{\beta ^{\frac{1}{\beta -1}}}{\displaystyle \sum _{j=1}^{m}D_{gpj}^{-\frac{1}{\beta -1}}}; \end{aligned}$$
(17)

substituting Eq. 17 into Eq. 15, we have

$$\begin{aligned} w_{gpj}=\frac{1}{\displaystyle (D_{gpj})^{\frac{1}{\beta -1}}\sum _{l=1}^{m}D_{gpl}^{-\frac{1}{\beta -1}}}. \end{aligned}$$
(18)

The minimizer \(W_{gp}\) of the optimization problem P2 is given by

$$\begin{aligned} w_{gpj} = {\left\{ \begin{array}{ll} 0 &{} \text {if}\, {\displaystyle (z_{gpj}-z_{gGj})^2=0}, \\ 0 &{} \text {if}\, {\displaystyle D_{gpj}\ne 0, \;\;\text {but}\;\;D_{gpl}=0,\;\;\text {for some}\;l},\\ \displaystyle \frac{1}{(D_{gpj})^{\frac{1}{\beta -1}}\displaystyle \sum _{l=1}^{m}D_{gpl}^{-\frac{1}{\beta -1}}} &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(19)
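A sketch of the weight update for one subgroup, in the regular case of Eq. 19 (the value of \(\beta\) is an illustrative assumption; the degenerate cases listed above are not handled):

```python
import numpy as np

def update_weights(Xg, u_gp, z_gp, z_gG, beta=2.0):
    """Eq. 18/19, regular case: w_gpj proportional to D_gpj^{-1/(beta-1)},
    normalized so that the weights sum to 1 over the m variables."""
    n_g = Xg.shape[0]
    sep = (z_gp - z_gG) ** 2                                 # (z_gpj - z_gGj)^2
    # D_gpj = sum_i u_gip (x_ij - z_gpj)^2 / (n_g (z_gpj - z_gGj)^2)
    D = (u_gp[:, None] * (Xg - z_gp) ** 2).sum(axis=0) / (n_g * sep)
    w = D ** (-1.0 / (beta - 1.0))
    return w / w.sum()
```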

The problem P3 is solved by

$$\begin{aligned} u_{gip} = \left\{ \begin{array}{ll} 1 &{} \text {if}\, {\displaystyle \sum _{j=1}^{m}w^{\beta }_{gpj}{\frac{(x_{ij}-z_{gpj})^{2}}{n_{g}(z_{gpj}-z_{gGj})^{2}}}\le \sum _{j=1}^{m}w^{\beta }_{grj}\frac{(x_{ij}-z_{grj})^{2}}{n_{g}(z_{grj}-z_{gGj})^{2}}}, \\ 0 &{} \text {otherwise,} \end{array}\right. \end{aligned}$$
(20)

where \(1\le r \le c_g\), \(r\ne p\).
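A sketch of the allocation rule of Eq. 20, assigning each object of class g to the subgroup with the smallest weighted, separation-normalized distance (names are illustrative):

```python
import numpy as np

def allocate(Xg, Wg, Zg, z_gG, beta=2.0):
    """Eq. 20: assign each object of class g to the subgroup p minimizing
    sum_j w_gpj^beta (x_ij - z_gpj)^2 / (n_g (z_gpj - z_gGj)^2)."""
    n_g, c_g = Xg.shape[0], Zg.shape[0]
    crit = np.empty((n_g, c_g))
    for p in range(c_g):
        sep = n_g * (Zg[p] - z_gG) ** 2
        crit[:, p] = ((Wg[p] ** beta) * (Xg - Zg[p]) ** 2 / sep).sum(axis=1)
    U = np.zeros((n_g, c_g))
    U[np.arange(n_g), crit.argmin(axis=1)] = 1             # binary membership u_gip
    return U
```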

The same process is applied to the partitions of the other groups \(q \ne g\), for \(q=1, \ldots , N\), of the primary dataset, by optimally computing \(u_{qip}\), \(z_{qpj}\), and \(w_{qpj}\).

3.3 New DC Variant Algorithm

In this section, we provide a comprehensive explanation of the algorithm used in the novel DC variant clustering method. The aim of this algorithm is to create new subgroups from the available labeled data by detecting hidden patterns.

The algorithm is designed to associate a unique distance metric with each cluster, which is used to compare clusters and their representatives. This distance measure is not fixed and varies from subgroup to subgroup, changing with each iteration until convergence. The adaptive nature of this distance measure offers the advantage of assigning weights to the variables that are more representative or informative of a particular cluster, resulting in a more accurate clustering algorithm.

This adaptive approach aims to identify a partition of each original class, denoted as \(G_{1},\ldots ,G_{N}\), respectively into \(n_{c_{1}}, \ldots ,n_{c_{N}}\) subgroups. Here, \(G_{g}=\{C_{g1},\ldots ,C_{gn_{c_{g}}}\}\) specifies the partitions for group g, and the corresponding centroids, denoted as \(Z_{g}=\{Z_{g1},\ldots , Z_{gn_{c_{g}}}\}\), for each group are computed using the formula for centroids Eq. 12. Additionally, for each subgroup, a set of weights is assigned from the set \(W_{g}=\{W_{g1},\ldots ,W_{gn_{c_{g}}}\}\).

The algorithm we propose looks for a local minimum of the objective function in Eq. 7.

It requires, as an input, the training dataset, with N apriori classes, and the number of subgroups for each apriori class. It starts from an initial random partitioning of the apriori classes into subgroups, then initializes the weights of variables for each subgroup to 1/m (where m is the number of variables). An initial set of centroids Z is computed according to Eq. 12, based on the initial random partition and on the weights in W.

The iterative part of the algorithm alternates, at each iteration t, the representation and allocation steps introduced in Sect. 3.2, in order to provide the partitions of the N apriori classes \(G_{1},\ldots ,G_{N}\) into \(n_{c_{1}}, \ldots ,n_{c_{N}}\) subgroups, the set of centroids Z, and the weights W.

At each iteration t, a check of the convergence of the algorithm is performed by evaluating the criterion \(P_t\):

$$\begin{aligned} {\begin{matrix} P_{t} =&\sum _{g=1}^{N}\sum _{p=1}^{c_{g}}\frac{\sum _{i=1}^{n_{g}}u_{gip}d^{2}_{g}(X_{i},Z_{gp})}{n_{g}d^{2}(Z_{gp},Z_{gG})}, \end{matrix}} \end{aligned}$$
(21)

where \(X_i\), \(Z_{gp}\), and \(u_{gip}\) are defined as before.

The algorithm iterates as long as \(\Vert P_{t+1}-P_{t}\Vert >0\), which means that a further iteration improves the criterion. In other words, the algorithm continues as long as there is a decrease in the intra-cluster distances within the subgroups of an original group and/or an increase in the inter-cluster distances.

Algorithm 1 displays the pseudocode of the suggested DC clustering algorithm.

Algorithm 1 Weighted dynamic clustering algorithm with adaptive Euclidean distance.
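For concreteness, the following high-level sketch alternates the P1, P2, and P3 updates over all apriori classes until the criterion stops decreasing; it assumes the hypothetical helpers `global_centroid`, `class_criterion`, `update_centroid`, `update_weights`, and `allocate` sketched earlier are in scope, ignores degenerate cases, and is not the verbatim Algorithm 1:

```python
import numpy as np

def dc_variant(X_by_class, n_sub, beta=2.0, max_iter=100):
    """Sketch of the weighted DC variant over N apriori classes.
    X_by_class: list of (n_g, m) arrays; n_sub: list of subgroup counts c_g."""
    rng = np.random.default_rng(0)
    N, m = len(X_by_class), X_by_class[0].shape[1]
    # random initial partitions, uniform weights 1/m, and mean-based initial
    # centroids (assumes each random subgroup is non-empty)
    U = [np.eye(n_sub[g])[rng.integers(0, n_sub[g], X_by_class[g].shape[0])] for g in range(N)]
    W = [np.full((n_sub[g], m), 1.0 / m) for g in range(N)]
    Z = [np.array([X_by_class[g][U[g][:, p] == 1].mean(axis=0) for p in range(n_sub[g])])
         for g in range(N)]
    P_prev = np.inf
    for _ in range(max_iter):
        for g in range(N):
            z_gG = global_centroid(Z, g)
            for p in range(n_sub[g]):                      # P1: centroids (Eq. 12)
                Z[g][p] = update_centroid(X_by_class[g], U[g][:, p], z_gG)
            for p in range(n_sub[g]):                      # P2: weights (Eq. 19)
                W[g][p] = update_weights(X_by_class[g], U[g][:, p], Z[g][p], z_gG, beta)
            U[g] = allocate(X_by_class[g], W[g], Z[g], z_gG, beta)   # P3 (Eq. 20)
        P_t = sum(class_criterion(X_by_class[g], U[g], W[g], Z[g],
                                  global_centroid(Z, g), beta) for g in range(N))
        if not P_t < P_prev:                               # criterion no longer improves
            break
        P_prev = P_t
    return U, W, Z
```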

3.4 KNN Classifier Based on Adaptive Distances and Novel Patterns

The K-nearest neighbor classifier is a supervised learning technique that, given training data and a predetermined value of k, finds the k nearest training points of a query by means of a distance computation and assigns a class label to the query through the majority voting rule. Despite its simplicity, its efficacy is comparable to that of the most complex classifiers in the literature. This classifier relies heavily on measuring the distance or similarity between the tested examples and the training examples, which raises an essential question about which of the numerous available distance or similarity measures should be used. Therefore, we propose an adaptive distance parameterized by weight vectors. The weights are estimated during the first clustering step on the apriori classes, so that each subgroup is associated with its own weight vector. The main idea of the KNN classifier with adaptive distances is that the distance used to compare a test object with its nearest points from the training dataset changes with each training point; objects belonging to the same subgroup, however, share the same weight vector.

Let \(T = \{(X_{i}, y'_{ip})\}_{i=1}^{n}\) represent a training set consisting of n training instances, with each instance belonging to one of N classes. Each training instance \(X_{i}\) is an element of an m-dimensional space \(R^{m}\), and its corresponding label \(y'_{ip}\) is obtained from the initial clustering step, where p represents the subgroup of \(X_i\) within its original apriori class \(y_i\). When a new query \(S_{h}\) is given, we first compute the adaptive distances between \(S_{h}\) and each training instance in T. The adaptive distance for a given data point \(X_{i}\) and query \(S_{h}\) is defined as follows:

$$\begin{aligned} \begin{aligned} d_{y'_{ip}}(X_{i},S_{h})= d_{W_{y_{i}p}}(X_{i},S_{h}) = \sum _{j=1}^{m}w_{y_{i}pj}(x_{ij}-s_{hj})^{2}. \end{aligned} \end{aligned}$$
(22)

Here, \(W_{y_{i}p}\) is a vector of weights corresponding to the p-th subgroup of the apriori class \(y_{i}\).

The n distances are then sorted in ascending order. \(N_{K}(S_{h})=\{(X_{j},y'_{jp})\}_{j=1}^{K}\) denotes the set of K-nearest neighbors of \(S_{h}\), i.e., the K training instances with the smallest distances. Finally, the majority voting rule is used to assign the query \(S_{h}\) to a subgroup \(C_{gp}\):

$$\begin{aligned} C_{gp} = \underset{C_{ip'}}{{{\,\textrm{argmax}\,}}} \sum _{(x,y)\in N_{K}(S_{h})}\mathbbm {1}_{C_{ip'}}(y),\quad i=1,\dots ,N,\;\; p'=1,\dots ,c_i \end{aligned}$$
(23)

where \(\mathbbm {1}_{C}(.)\) is the indicator function:

$$\begin{aligned} \mathbbm {1}_{C}(y)={\left\{ \begin{array}{ll} 1 &{} \text {if } y\in C,\\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(24)
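A sketch of this adaptive-distance KNN rule (Eqs. 22–23); the names `train_X`, `train_sub` (subgroup labels from the clustering step), and `train_W` (the weight vector inherited by each training point from its subgroup) are illustrative:

```python
import numpy as np
from collections import Counter

def dc_knn1_predict(train_X, train_sub, train_W, s, k):
    """Eq. 22-23: weighted squared Euclidean distance between the query s and
    each training point, using the weight vector of that point's subgroup,
    then majority vote among the k nearest neighbors' subgroup labels."""
    d = (train_W * (train_X - s) ** 2).sum(axis=1)       # adaptive distance per point
    nn = np.argsort(d)[:k]                               # k smallest distances
    return Counter(train_sub[nn]).most_common(1)[0][0]   # winning subgroup label
```

The apriori class of the query is then obtained by mapping the winning subgroup back to the class from which it was carved out.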

3.5 The DC-KNN Combined Algorithm

DC-KNN (dynamic clustering and K-nearest neighbors) is an algorithm that combines the efficiency of the DC algorithm with the classification by KNN. The basic idea behind DC-KNN is to use DC with adaptive distances to re-cluster the classes of the training dataset into different subgroups and then use KNN to classify the test set based on the new labels obtained from the clustering step.

The algorithm starts by using DC to cluster each original group into a specific number of clusters; the optimal number of subgroups can be determined by the Silhouette or Elbow method. The resulting clusters are used as a preprocessing step for the KNN algorithm, allowing it to work on more homogeneous subsets of data. This combination improves the accuracy and efficiency of the KNN algorithm, as it enables the classifier to exploit the within-class patterns uncovered by the clustering step.

After the clustering step, KNN is used to classify new data points based on the newly discovered patterns. The KNN algorithm finds the K nearest neighbors of a given data point and determines the class of the majority of those neighbors. In this way, the DC-KNN algorithm is able to effectively combine the strengths of both DC and KNN, making it a powerful tool for data classification.

It is worth noting that the DC-KNN algorithm can be sensitive to the initial partitions. Hence, choosing the appropriate number of clusters and setting the parameters for the DC algorithm is essential.

The execution of the DC-KNN classification algorithm follows the steps outlined in Algorithm 2.

Algorithm 2 DC-KNN combining algorithm.
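A compact sketch of the DC-KNN1 pipeline, assuming the `dc_variant` and `dc_knn1_predict` sketches above are available (the subgroup-to-class bookkeeping is an illustrative choice):

```python
import numpy as np

def dc_knn1(X_by_class, X_test, n_sub, k, beta=2.0):
    """Step 1: cluster each apriori class with the weighted DC variant.
    Step 2: relabel the training points by subgroup, classify test points
    with the adaptive-distance KNN rule, and map subgroups back to classes."""
    U, W, Z = dc_variant(X_by_class, n_sub, beta)
    train_X, train_sub, train_W, sub_class = [], [], [], []
    sid = 0
    for g, Xg in enumerate(X_by_class):
        for p in range(n_sub[g]):
            idx = U[g][:, p] == 1
            train_X.append(Xg[idx])
            train_sub += [sid] * int(idx.sum())            # new label = subgroup id
            train_W.append(np.tile(W[g][p], (int(idx.sum()), 1)))
            sub_class.append(g)                            # subgroup sid belongs to class g
            sid += 1
    train_X, train_W = np.vstack(train_X), np.vstack(train_W)
    train_sub = np.array(train_sub)
    preds = [sub_class[dc_knn1_predict(train_X, train_sub, train_W, s, k)]
             for s in X_test]
    return np.array(preds)
```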

3.6 The DC-KNN Using the Centroids as the Nearest Neighbors

A new variant of the KNN algorithm is introduced in this study for more accurate predictions of new data points based on the new subgroups. The proposed classification approach classifies instances by computing the distance between new instances and the subgroup centroids. The allocation is then determined by the label of the apriori class to which the nearest subgroups belong.

The DC-KNN2 classifier uses the centroids and weights obtained from DC to determine the nearest neighbors of a new element. Here, K denotes the number of centroids closest to the new query. The optimal value of K is chosen from the range 1 to \((\displaystyle \min _{1\le i\le N}(c_i)+1)\). The algorithm follows the sequential steps outlined in Algorithm 3.

Algorithm 3 DC-KNN combining algorithm using centroids as NN.
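A sketch of the centroid-based variant: the neighbors are the subgroup centroids themselves, each compared with its own weight vector, and the query takes the apriori class most represented among the K closest centroids (illustrative names, assuming the outputs of the clustering step):

```python
import numpy as np
from collections import Counter

def dc_knn2_predict(Z, W, s, K):
    """Distances from the query s to every subgroup centroid, using each
    subgroup's own weight vector; majority vote over the apriori classes
    of the K closest centroids."""
    dist, cls = [], []
    for g in range(len(Z)):                      # Z[g]: (c_g, m) centroids of class g
        for p in range(Z[g].shape[0]):
            dist.append(np.sum(W[g][p] * (Z[g][p] - s) ** 2))
            cls.append(g)
    nn = np.argsort(dist)[:K]
    return Counter(np.array(cls)[nn]).most_common(1)[0][0]
```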

4 Experiments

This section reports extensive experiments on different real and synthetic datasets to validate the classification performance of the proposed DC-KNN classifiers. The DC-KNN approaches are compared to the KNN and Kmeans-KNN methods in terms of classification accuracy.

4.1 Experimental Results on Real Datasets

To thoroughly assess the performance and robustness of the proposed DC-KNN algorithms, experiments are conducted comparing them with classical KNN and Kmeans-KNN. The latter uses the K-means clustering algorithm as a first step and then applies KNN to the results of the clustering. The experiments are conducted on real datasets sourced from the UCI Machine Learning Repository (Bache & Lichman, 2013), the KEEL attribute-noise datasets (Alcala-Fdez et al., 2011), and the UCR Time Series Classification Repository (Dau et al., 2018). Classification accuracy is used to measure the performance of all the approaches in each experiment. Note that DC-KNN1 denotes Algorithm 2, which employs data points as nearest neighbors. In the KNN algorithm, the k nearest neighbors are selected from the data points of all available classes. Similarly, in the Kmeans-KNN and DC-KNN1 algorithms, the nearest neighbors are selected from the data points of all the subgroups of the apriori classes. In the DC-KNN2 algorithm, however, the nearest neighbors are the centroids of the subgroups obtained from DC.

The objective of the experiments is to demonstrate the classification capabilities of the proposed techniques on different kinds of datasets: real datasets, noisy numerical datasets obtained from the KEEL machine learning repository, and time series datasets from the UCR database. The datasets used are summarized in Table 1; they vary in the number of total samples, features, classes, and test samples.

We utilize the Breast Cancer Wisconsin (Diagnostic) dataset, abbreviated as “Breast,” the “ILPD” dataset, and the noisy “Yeast” dataset from the UCI database. Additionally, we perform tests on the “Computers,” “ScreenType,” and “StarLightCurves” datasets from the UCR repository for our time series analysis. Following the approach employed in Maturo and Verde (2022), we use functional data analysis to describe the time series datasets and extract the coefficients of the B-spline decomposition as features. The six noise datasets from the KEEL repository are “Sonar,” “Iono,” “Heart,” “Pima,” “Spambase,” and “Iris.” The experiments employ the abbreviations Yeast-n, Sonar-n, Iono-n, Heart-n, Pima-n, Spambase-n, and Iris-n to distinguish the noisy data. The noise datasets are exposed to a noise intensity of 10%: around 10% of the samples in each dataset are chosen at random, and the values of a particular attribute for these samples are replaced with random values drawn uniformly from the minimum-maximum range of the attribute’s domain. Around one-third of the total samples from each dataset are chosen as test samples, while the rest are designated as training samples. Each noise dataset within the KEEL repository has been divided into five distinct training/testing splits, and the number of testing samples for each set is presented in Table 1. The final classification evaluation of each competing method is obtained by averaging the classification results over the five splits of each noise dataset. In addition, most of the datasets have a small number of samples; they can therefore be used to validate the classification performance in scenarios with a small sample size.
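A sketch of the attribute-noise procedure described above, under the assumption that one randomly chosen attribute is corrupted per selected sample (the exact KEEL protocol may differ):

```python
import numpy as np

def add_attribute_noise(X, rate=0.10, seed=0):
    """Pick `rate` of the samples at random and replace the value of one
    attribute with a uniform draw from [min, max] of that attribute."""
    rng = np.random.default_rng(seed)
    X_noisy = X.copy()
    rows = rng.choice(X.shape[0], size=int(rate * X.shape[0]), replace=False)
    for i in rows:
        j = rng.integers(0, X.shape[1])                    # attribute to corrupt
        X_noisy[i, j] = rng.uniform(X[:, j].min(), X[:, j].max())
    return X_noisy
```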

The classical KNN algorithm is recognized for its effectiveness in scenarios with a clear separation between classes. However, accurate classification of datasets containing noise poses a greater challenge. To address this, we evaluated the proposed DC-KNN methods on selected noisy datasets from the KEEL repository. The evaluation is based on the classification accuracy as shown in Table 2.

Table 1 Experimental datasets from the UCR, UCI, and KEEL repositories used in the study
Table 2 Classification accuracy (%) of different methods on various datasets: values of K and the number of subgroups are presented in parentheses

In the experiments, we assess the classification performance of the proposed DC-KNN methods by varying the neighborhood size K on each dataset. The parameters \((n_{c_1}, \ldots , n_{c_N})\) of the DC step, which represent the numbers of subgroups determined by the Silhouette method, are also taken into consideration. The values of K range from 1 to 20 (from 1 to 7 for DC-KNN2), in increments of 1, for all datasets. The classification results of the suggested techniques, for different values of K, are shown in Fig. 2.

Fig. 2 The classification accuracy of each approach is evaluated on different datasets, with varying values of K

The DC-KNN2 algorithm (Algorithm 3), which uses the DC centroids for detecting the nearest neighbors, requires fewer neighbors and achieves the highest accuracy on the majority of datasets compared to the other algorithms.

Moreover, on most of the real datasets, the classification accuracy remains constant as the value of K increases. This is because the DC-KNN2 algorithm requires the number of neighbors to be smaller than \((\displaystyle \min _{1\le i\le N}(c_i)+1)\). On the other hand, the DC-KNN1 algorithm (Algorithm 2) consistently achieves satisfactory classification results when varying the value of K in comparison to the classical methods, particularly at higher values of K. This implies that the suggested DC-KNN1 and DC-KNN2 algorithms are more robust to changes in K while still achieving accurate classification. This advantage may be attributed to the use of DC and of a novel objective function in the initial phase of both approaches, which supports the good performance of the classifier. The classification results depicted in Fig. 2 clearly show the good classification performance of the two proposed approaches. In most instances, the proposed methods outperform the other comparison methods.

4.2 Experimental Simulation Results

In order to show the effectiveness of the proposed DC-KNN classifiers, we generate synthetic datasets from different models in this subsection. In this experiment, we performed simulations using various data generating processes (DGPs) with distinct characteristics. The specifications of the DGPs are detailed in the second section of the Supplementary Material and illustrated in Fig. 3. Specifically, we generated datasets based on different DGPs and examined scenarios where the number of clusters remained fixed at the true value, as well as scenarios where the number of clusters was estimated.

Fig. 3 Two-dimensional representation of the simulated data using principal component analysis (PCA)

To determine the number of subgroups inside each apriori class, we can employ either the Silhouette or the Elbow method. However, these traditional criteria do not guarantee an improvement of the approach's performance; our future work will therefore concentrate on improving this aspect.

The results of the simulations performed on the data generating processes (DGPs) are displayed in Table 3. The table presents a detailed analysis of the effectiveness of the new techniques, DC-KNN1 and DC-KNN2, which use the clustering results as stated in Algorithms 2 and 3. The comparison is made against the classical KNN and Kmeans-KNN methods. As expected, the DC-KNN algorithms lead to higher accuracy values. The unsatisfactory results of classical KNN can be attributed to the complexity and the overlapping of the data. A more effective approach is to first use clustering to discover hidden patterns within the classes before performing classification.

Table 3 Classification accuracy (%) of different methods on simulated datasets: values of K and the number of subgroups are presented in parentheses

A detailed analysis of the classification effectiveness of DC-KNN techniques with varying K values can be found in Section 3 of the Supplementary Material.

The effectiveness of the proposed DC-KNN approaches in classifying diverse dataset types, including real datasets, time series datasets, noisy datasets, and simulated datasets, has been thoroughly demonstrated through extensive experiments. Based on the results of these classification studies, we highlight the following observations, which emphasize the main contributions of our research:

  1. 1.

    The DC-KNN techniques are robust to changes in the neighborhood size K compared to their competitors. The experimental results show that the proposed DC-KNN1 and DC-KNN2 algorithms consistently outperform the other approaches and achieve good classification performance. In particular, the DC-KNN2 algorithm demonstrates strong and consistent performance when the value of K is smaller than \((\displaystyle \min _{1\le i\le N}(c_i)+1)\); this indicates a sensitivity of the DC-KNN2 algorithm to the number of centroid neighbors.

  2. 2.

    The process of learning clustering with adaptive distances aims to uncover concealed patterns within training groups. This is achieved by optimizing a newly proposed objective function and utilizing the outcomes of the clustering step in conjunction with the KNN classifier. This approach effectively enhances the performance of KNN-based classification.

  3. 3.

    DC-KNN exhibits strong performance in scenarios with limited training data. The performance of KNN-based classification can be significantly influenced by the selection of neighbors, particularly when working with various data sets and small sample sizes. Nevertheless, the experimental results in these situations demonstrate that our DC-KNN outperforms the competing methods.

  4. 4.

    The DC-KNN algorithms exhibit greater resilience to noisy data. Our DC-KNN algorithms outperform existing algorithms when applied to data containing noise.

The excellent performance of our methods can be attributed to several factors. Firstly, we utilize DC to uncover hidden patterns by adjusting the distances between data points. This allows us to effectively weigh the features of different subclasses. Secondly, we introduce a novel objective function that considers both the compactness within each apriori group and the separation between apriori classes. This enables us to accurately cluster the training predefined groups. Lastly, we adapt the KNN classifier to incorporate the augmented labels obtained from the clustering step, enhancing the accuracy of classification tasks. Therefore, our DC-KNN algorithms exhibit strong potential as a KNN-based classifier due to their robustness and efficacy in pattern classification.

5 Conclusions and Future Works

This research presents the DC-KNN algorithm, a novel supervised approach that combines dynamic clustering and the K-nearest neighbor classifier. The unsupervised clustering phase is used to discover new information from the original datasets that can help to improve supervised classification accuracy. DC in the unsupervised phase uses a new objective function that takes into account both intra-cluster and inter-cluster similarity, with cluster weights for variables being computed automatically and optimized as the algorithm converges. These weights can be used to identify important variables for clustering and eliminate variables that could introduce noise in the classification process. In the supervised phase, the new weights are employed to determine the nearest neighbor of new data points. Overall, the DC-KNN algorithm provides a unique and effective classification technique by combining DC and KNN.

The results of the experiments on real datasets, in which the usual K-means and the proposed dynamic clustering algorithm are used to enhance the supervised KNN classification, show that better results are obtained when dynamic clustering is applied before training the supervised classifier. The reason is that the proposed method provides more precise information and more homogeneous clusters by using clustering in the first step as a preprocessing step to discover the hidden patterns. As a result, the proposed method improves the classification results and increases the classifier's accuracy.

The focus of this study is to discover hidden information that can be captured through novel patterns, leading to the identification of subgroups of instances, characterized by these new sub-patterns, within the previously established classes. The objective is to ascertain whether the DC-KNN methodology enhances the accuracy of classification. Initially, a DC algorithm is employed to identify novel patterns within the original classes. This research demonstrates the use of this algorithm on a range of datasets that are not affected by outliers. It is worth noting that alternative metrics can be chosen to handle outliers when computing the adaptive distances between the data points and the centroids.

The primary aim of this investigation is to examine the theoretical aspect of integrating unsupervised and supervised classification and to evaluate whether this novel approach enhances classification performance compared to traditional classifiers. Additionally, several techniques can be employed to ascertain the optimal number of subgroups for each original class.

In this two-stage study, the unsupervised method utilized is DC, while the supervised strategy employed is KNN. Future research aims may concentrate on the combination of different clustering techniques with alternative classifiers to investigate the performance of combining unsupervised and supervised classification using various strategies, as well as determining how such combinations can impact the final outcome. Moreover, attempts could be made to formulate an objective function that condenses the process into a single-step strategy.