Double-weighted kNN: a simple and efficient variant with embedded feature selection

Abstract

Predictive modeling aims at providing estimates of an unknown variable, the target, from a set of known ones, the inputs. The k Nearest Neighbors (kNN) algorithm is one of the best-known predictive methods due to its simplicity and good overall performance. However, this class of models has some drawbacks, such as a lack of robustness to irrelevant input features and the need to transform qualitative variables into dummies, with the corresponding loss of information for ordinal ones. In this work, a kNN regression variant, easily adaptable for classification purposes, is suggested. The proposal handles all types of input variables while embedding feature selection in a simple and efficient manner, thereby shortening the tuning phase. More precisely, making use of the weighted Gower distance, we develop a powerful tool to cope with these inconveniences. Finally, to boost the tool's predictive power, a second weighting scheme is added to the neighbors. The proposed method is applied to a collection of 20 data sets that differ in size, data types, and distribution of the target variable. Moreover, the results are compared with previously proposed kNN variants, showing the superiority of our proposal, particularly when the weighting scheme is based on non-linear association measures.
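
To make the idea concrete, the following minimal Python sketch outlines a double-weighted kNN regressor of the kind described above: a first set of weights acts on the features inside a Gower-type distance, and a second set of weights acts on the selected neighbors (here nw = 1 − d, cf. Note 4 below). The feature-weighting choice (absolute Spearman correlation for numeric inputs, a fixed placeholder for nominal ones) and all function names are illustrative assumptions, not the exact specification used in the paper.

import numpy as np
from scipy.stats import spearmanr

def gower_distance(x, X, num_idx, cat_idx, ranges, w):
    # Weighted Gower-type distance between a query row x and every row of X.
    d = np.zeros(len(X))
    for k in num_idx:                                  # interval (or ordinal-as-index) variables
        d += w[k] * np.abs(X[:, k].astype(float) - float(x[k])) / ranges[k]
    for k in cat_idx:                                  # nominal variables: simple mismatch
        d += w[k] * (X[:, k] != x[k])
    return d / w.sum()                                 # distances lie in [0, 1]

def dw_knn_predict(x, X, y, num_idx, cat_idx, k=5):
    # First weighting (features): absolute Spearman correlation with the target
    # for numeric columns, a fixed placeholder for nominal ones. This particular
    # choice is an assumption made for the sketch.
    w = np.array([abs(spearmanr(X[:, j].astype(float), y)[0]) if j in num_idx else 0.5
                  for j in range(X.shape[1])])
    ranges = {j: max(np.ptp(X[:, j].astype(float)), 1e-12) for j in num_idx}
    d = gower_distance(x, X, num_idx, cat_idx, ranges, w)
    nn = np.argsort(d)[:k]                             # indices of the k nearest neighbours
    nw = 1.0 - d[nn]                                   # second weighting (neighbours), cf. Note 4
    return np.sum(nw * y[nn]) / np.sum(nw)             # weighted average of the neighbour targets

In a full implementation, the feature weights would come from the association measures proposed by the authors and the per-variable terms would follow Eq. (2) exactly; the sketch only aims to show how the two weighting layers interact.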

Notes

  1. For example, when dummy variables are created, each one is considered an independent addend and, thus, the overall contribution of a categorical variable is multiplied by the number of its dummy variables, overpowering that of the interval ones.

  2. In this sense, it can be argued that, in the case of the Euclidean distance, ordinal variables can also be replaced with the corresponding position index in their natural ordering and treated as quantitative ones. However, this strategy does not generally outperform the classic dummy approach, as can be seen in the “Appendix.”

  3. We remind the reader that the values \({d}_{ij}\) are Gower distances and, thus, they take values from 0 to 1. The values of zero in the distance matrix were previously converted to \(10^{-7}\).

  4. We note that, as we are working with the Gower distance, the maximum and minimum values are fixed and, thus, Eq. (3) is equivalent to \({nw}_{ij}=1-{d}_{ij}\) (see the illustrative sketch after these notes).
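
The following fragment illustrates the two details footnoted above: exact zeros in the Gower distance matrix are replaced by \(10^{-7}\) (Note 3) and, since Gower distances are bounded between 0 and 1, a min–max rescaling of distances into neighbor weights reduces to \({nw}_{ij}=1-{d}_{ij}\) (Note 4). The min–max form is an assumption about Eq. (3), which is not reproduced in this excerpt.

import numpy as np

# Toy symmetric Gower distance matrix; by construction its values lie in [0, 1].
D = np.array([[0.00, 0.12, 0.40],
              [0.12, 0.00, 0.55],
              [0.40, 0.55, 0.00]])

D[D == 0] = 1e-7                       # Note 3: replace exact zeros by 10^-7

# Assumed min-max form of Eq. (3); with the fixed Gower bounds it equals 1 - D.
d_min, d_max = 0.0, 1.0
nw = (d_max - D) / (d_max - d_min)     # neighbour weights, nw = 1 - d (Note 4)
assert np.allclose(nw, 1 - D)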

Funding

NextGenerationEU Funds, Programa Investigo (Grant No. CT36/22-04-UCM-INV).

Author information

Corresponding author

Correspondence to Aida Calviño.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In this “Appendix” we further explore the different possibilities regarding the treatment of ordinal variables. The default strategy when working with this type of variable and the Euclidean distance consists of ignoring the natural order of the levels and generating several dummy variables, which are then treated as interval ones.

An alternative to this strategy, inspired by the treatment given to this kind of variable in the Gower distance, consists of replacing them with the corresponding position index in their natural ordering and then treating them as quantitative ones (which includes the standardization phase). A small illustration of both coding strategies is given in the sketch below.
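
The sketch codes an ordinal variable under both strategies, assuming a pandas representation with an ordered categorical type; the variable and level names are made up for the example.

import pandas as pd

x = pd.Series(["low", "medium", "high", "medium"],
              dtype=pd.CategoricalDtype(["low", "medium", "high"], ordered=True))

# Strategy 1: ignore the ordering and create dummy variables,
# which are then treated as interval ones.
dummies = pd.get_dummies(x, prefix="x").astype(float)

# Strategy 2: replace each level by its position index in the natural
# ordering and standardize, as for any interval variable.
idx = x.cat.codes + 1                          # positions 1, 2, 3
z = (idx - idx.mean()) / idx.std(ddof=0)       # standardized position index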

From a theoretical point of view, the fact that interval variables need to be standardized before being used in the Euclidean distance implies that the values the distances take are determined by each variable's variance:

$$d_{\text{euc}}\left(z_{i},z_{j}\right)=\sqrt{\sum_{k=1}^{p}\left(z_{ik}-z_{jk}\right)^{2}}=\sqrt{\sum_{k=1}^{p}\left(\frac{x_{ik}-\overline{x}_{k}}{\text{sd}\left(x_{k}\right)}-\frac{x_{jk}-\overline{x}_{k}}{\text{sd}\left(x_{k}\right)}\right)^{2}}=\sqrt{\sum_{k=1}^{p}\left(\frac{x_{ik}-x_{jk}}{\text{sd}\left(x_{k}\right)}\right)^{2}}.$$

This means that, for instance, two ordinal variables (treated as interval ones) with the same number of levels will contribute differently to the overall distance simply because of differences in their variances (which, in turn, depend on the frequencies of the categories of the original variable). This does not happen with the Gower distance [see Eq. (2)], where the contribution of a variable depends solely on the values it takes (and not on its distribution). As an illustrative example, let us assume that there are two ordinal variables with three categories, which are transformed into 1–2–3 according to their natural order. The first of them, X1, has frequencies equal to 10–80–10% and the second, X2, 33–33–33%. This leads to variances of 0.2 and 2.8867, respectively. If two observations take the furthest possible values in both variables, the contributions of the two variables would be 20 and 1.3857, respectively. In other words, even though the difference between the values is the same, the contribution of X1 is more than 14 times larger than that of X2.
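
The computation for the first variable of the example can be checked with a few lines; the Gower-side contribution shown, |x_i − x_j| divided by the variable's range, is an illustrative simplification of Eq. (2), which is not reproduced in this excerpt.

import numpy as np

levels = np.array([1, 2, 3])
probs = np.array([0.10, 0.80, 0.10])           # frequencies of X1 in the example

mean = (levels * probs).sum()                  # 2.0
var = ((levels - mean) ** 2 * probs).sum()     # 0.2

# Contribution of X1 to the squared standardized Euclidean distance when two
# observations take the furthest values (1 and 3): diff^2 / var = 4 / 0.2 = 20.
euclid_contrib = (3 - 1) ** 2 / var

# Under a Gower-type treatment the same pair contributes |diff| / range = 1,
# irrespective of how the categories are distributed.
gower_contrib = abs(3 - 1) / (levels.max() - levels.min())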

Furthermore, in order to show that this strategy (coding ordinal variables as interval ones) does not outperform the classic one, we have conducted another experiment comparing the performance of both strategies when applying kNN with the Euclidean distance. For this purpose, we have repeated the experiment explained in the “Experimental results” section, but only on the data sets that have at least one ordinal variable (see Table 2 for the data sets and their order in the plot). The results are included in Fig. 5, which can be interpreted in the same fashion as Fig. 4B. As can be seen, no significant differences arise between the two strategies, except for the third data set (where some improvement can be seen, but not enough to surpass the Gower-based proposal; see Fig. 4B), which implies that, regardless of the treatment applied to ordinal variables in the Euclidean context, the Gower-based proposal leads to better results.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Moreno-Ribera, A., Calviño, A. Double-weighted kNN: a simple and efficient variant with embedded feature selection. J Market Anal (2024). https://doi.org/10.1057/s41270-024-00302-5
