Abstract
Predictive modeling aims at providing estimates of an unknown variable, the target, from a set of known ones, the input. The k-nearest neighbors (kNN) algorithm is one of the best-known predictive methods due to its simplicity and good performance. However, this class of models has some drawbacks, such as its lack of robustness to irrelevant input features and the need to transform qualitative variables into dummies, with the corresponding loss of information for ordinal ones. In this work, a kNN regression variant, easily adaptable for classification purposes, is proposed. The proposal handles all types of input variables while embedding feature selection in a simple and efficient manner, reducing the tuning phase. More precisely, making use of the weighted Gower distance, we develop a powerful tool to cope with these inconveniences. Finally, to boost the tool's predictive power, a second weighting scheme is applied to the neighbors. The proposed method is applied to a collection of 20 data sets that differ in size, data type, and distribution of the target variable. Moreover, the results are compared with previously proposed kNN variants, showing its superiority, particularly when the weighting scheme is based on non-linear association measures.
![Fig. 1](http://media.springernature.com/m312/springer-static/image/art%3A10.1057%2Fs41270-024-00302-5/MediaObjects/41270_2024_302_Fig1_HTML.png)
![Fig. 2](http://media.springernature.com/m312/springer-static/image/art%3A10.1057%2Fs41270-024-00302-5/MediaObjects/41270_2024_302_Fig2_HTML.png)
![Fig. 3](http://media.springernature.com/m312/springer-static/image/art%3A10.1057%2Fs41270-024-00302-5/MediaObjects/41270_2024_302_Fig3_HTML.png)
![Fig. 4](http://media.springernature.com/m312/springer-static/image/art%3A10.1057%2Fs41270-024-00302-5/MediaObjects/41270_2024_302_Fig4_HTML.png)
![Fig. 5](http://media.springernature.com/m312/springer-static/image/art%3A10.1057%2Fs41270-024-00302-5/MediaObjects/41270_2024_302_Fig5_HTML.png)
Notes
For example, when dummy variables are created, each one is treated as an independent addend; thus, the overall contribution of a categorical variable is multiplied by its number of dummy variables, overpowering that of the interval ones.
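This effect can be illustrated with a minimal, purely hypothetical sketch: a single one-hot-encoded categorical variable injects as many squared addends into the Euclidean distance as it has levels, whereas an interval variable contributes exactly one.

```python
# Hypothetical illustration of the footnote above: a 4-level categorical
# variable, one-hot encoded, contributes one squared addend per dummy to the
# Euclidean distance, while an interval variable contributes a single addend.

def one_hot(level, levels):
    """Encode a category as a list of 0/1 dummy indicators."""
    return [1.0 if level == l else 0.0 for l in levels]

levels = ["A", "B", "C", "D"]
x = one_hot("A", levels)   # [1, 0, 0, 0]
y = one_hot("C", levels)   # [0, 0, 1, 0]

# Squared-distance addends coming from this single categorical variable:
addends = [(a - b) ** 2 for a, b in zip(x, y)]
print(len(addends))   # 4 addends instead of 1
print(sum(addends))   # 2.0: the two differing dummy coordinates each add 1
```

Note that two observations that differ in category always differ in exactly two dummy coordinates, so every categorical mismatch weighs twice as much as a unit difference in a single interval coordinate, before any standardization.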
In this sense, it can be argued that, in the case of the Euclidean distance, ordinal variables can also be replaced with the corresponding position index in their natural ordering and treated as quantitative ones. However, this strategy does not generally outperform the classic dummy approach, as can be seen in the “Appendix.”
We remind the reader that the values \({d}_{ij}\) are Gower distances and, thus, they take values from 0 to 1. The values of zero in the distance matrix were previously converted to \({10}^{-7}\).
We note that, as we are working with the Gower distance, the maximum and minimum values are fixed and thus, Eq. (3) is equivalent to \({nw}_{ij}=1-{d}_{ij}.\)
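Since Eq. (3) reduces to \({nw}_{ij}=1-{d}_{ij}\) under the Gower distance, a neighbor-weighted regression prediction can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name and data are hypothetical, and distances are assumed to be precomputed Gower distances in [0, 1].

```python
# Sketch of a neighbor-weighted kNN regression prediction, assuming the
# distances are Gower distances already bounded in [0, 1], so that the
# neighbor weight simplifies to nw = 1 - d.
def weighted_knn_predict(distances, targets, k=3):
    """Predict the target as the weighted mean of the k nearest neighbors."""
    nearest = sorted(zip(distances, targets))[:k]
    weights = [1.0 - d for d, _ in nearest]
    if sum(weights) == 0:  # degenerate case: all k neighbors at distance 1
        return sum(t for _, t in nearest) / k
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)

d = [0.1, 0.4, 0.8, 0.2]       # Gower distances to the training points
y = [10.0, 20.0, 30.0, 12.0]   # their target values
print(weighted_knn_predict(d, y, k=3))
```

Closer neighbors (smaller \({d}_{ij}\)) thus receive weights nearer to 1, and a neighbor at the maximal Gower distance of 1 is ignored entirely.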
Funding
NextGenerationEU Funds, Programa Investigo (Grant No. CT36/22-04-UCM-INV).
Appendix
In this “Appendix” we further explore the different possibilities regarding the treatment of ordinal variables. The default strategy when working with this type of variable and the Euclidean distance consists of ignoring the natural order of the levels and generating several dummy variables, which are treated as interval ones afterward.
An alternative to this strategy, inspired by the treatment that the Gower distance gives to this kind of variable, consists of replacing them with the corresponding position index in their natural ordering and then treating them as quantitative ones (which includes the standardization phase).
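A minimal sketch of this alternative coding, using only the standard library (the level names and sample are hypothetical): each ordered level is replaced by its position index and the resulting scores are then standardized like any interval variable.

```python
from statistics import mean, pstdev

# Sketch of the alternative ordinal treatment described above: replace each
# level by its position in the natural ordering, then standardize the scores.
order = ["low", "medium", "high"]                        # natural ordering
rank = {level: i + 1 for i, level in enumerate(order)}   # low=1, ..., high=3

raw = ["low", "high", "medium", "medium", "high"]
scores = [rank[v] for v in raw]                          # [1, 3, 2, 2, 3]

mu, sigma = mean(scores), pstdev(scores)                 # standardization step
standardized = [(s - mu) / sigma for s in scores]
print(scores)
```

After this transformation the variable enters the Euclidean distance exactly like an interval one, which is what produces the variance-driven contributions discussed next.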
From a theoretical point of view, the fact that interval variables need to be standardized before being used in the Euclidean distance implies that the values the distances take are determined by the variables' variances:

\({d}_{ij}^{2}=\sum_{v}\frac{{\left({x}_{iv}-{x}_{jv}\right)}^{2}}{{s}_{v}^{2}}\)
This translates to the fact that, for instance, two ordinal variables (treated as interval ones) with the same number of levels will contribute differently to the overall distance simply because of differences in their variances (which, in turn, depend on the frequencies of the categories of the original variable). This does not happen with the Gower distance [see Eq. (2)], where the contribution of a variable depends solely on the values it takes (and not on their distribution). As an illustrative example, let us assume that there are two ordinal variables with three categories, which are transformed into 1–2–3 according to their natural order. The first of them, X1, has frequencies equal to 10–80–10% and the second, X2, 33–33–33%. This leads to variances of 0.2 and 2.8867, respectively. If two observations take the furthest possible values in both variables, the contributions of the two variables would be 20 and 1.3857, respectively. In other words, even when the difference between values is the same, the contribution of X1 is more than 14 times higher than that of X2.
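The X1 figures in the example above can be checked numerically. The following sketch assumes the population variance and the standardized-Euclidean addend \({({x}_{iv}-{x}_{jv})}^{2}/{s}_{v}^{2}\); the sample is hypothetical, built only to reproduce the 10–80–10% frequencies.

```python
from statistics import pvariance

# Check of the X1 figures above: levels 1-2-3 with 10-80-10% frequencies.
x1 = [1] * 10 + [2] * 80 + [3] * 10      # sample matching the frequencies
var_x1 = pvariance(x1)                    # population variance
print(var_x1)                             # 0.2

# Contribution of X1 when two observations take the furthest values (1 and 3)
# in the standardized Euclidean distance: (x_i - x_j)^2 / s^2.
contribution = (3 - 1) ** 2 / var_x1
print(contribution)                       # 20.0
```

The same squared difference of \({(3-1)}^{2}=4\) divided by a larger variance yields a far smaller addend, which is the distributional sensitivity the Gower distance avoids.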
Furthermore, in order to show that this strategy (coding ordinal variables as interval ones) does not outperform the classic one, we have conducted another experiment comparing the performance of both strategies when applying kNN with the Euclidean distance. For this purpose, we have repeated the experiment explained in the “Experimental results” section, but only on the data sets that have at least one ordinal variable (see Table 2 for the data sets and their order in the plot). The results are included in Fig. 5, which can be interpreted in the same fashion as Fig. 4B. As can be seen, no significant differences arise between the two strategies, except for the third data set (where some improvement can be observed, although not enough to surpass the Gower-based proposal; see Fig. 4B). This implies that, regardless of the treatment applied to ordinal variables in the Euclidean context, the Gower-based proposal leads to better results.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Moreno-Ribera, A., Calviño, A. Double-weighted kNN: a simple and efficient variant with embedded feature selection. J Market Anal (2024). https://doi.org/10.1057/s41270-024-00302-5