Double-weighted kNN: a simple and efficient variant with embedded feature selection

Abstract

Predictive modeling aims at providing estimates of an unknown variable, the target, from a set of known ones, the inputs. The k Nearest Neighbors (kNN) algorithm is one of the best-known predictive methods due to its simplicity and good overall performance. However, this class of models has some drawbacks, such as a lack of robustness to irrelevant input features and the need to transform qualitative variables into dummies, with the corresponding loss of information for ordinal ones. In this work, a kNN regression variant, easily adaptable for classification purposes, is suggested. The proposal handles all types of input variables while embedding feature selection in a simple and efficient manner, thereby shortening the tuning phase. More precisely, making use of the weighted Gower distance, we develop a powerful tool to cope with these inconveniences. Finally, to boost the tool's predictive power, a second weighting scheme is added to the neighbors. The proposed method is applied to a collection of 20 data sets that differ in size, data types, and distribution of the target variable. Moreover, the results are compared with previously proposed kNN variants, showing the superiority of our proposal, particularly when the weighting scheme is based on non-linear association measures.
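
To make the idea concrete, the following minimal Python sketch outlines a double-weighted kNN regressor of the kind described above: a first set of weights acts on the features inside a Gower-type distance, and a second set of weights acts on the selected neighbors (here nw = 1 − d, cf. Note 4 below). The feature-weighting choice (absolute Spearman correlation for numeric inputs, a fixed placeholder for nominal ones) and all function names are illustrative assumptions, not the exact specification used in the paper.

import numpy as np
from scipy.stats import spearmanr

def gower_distance(x, X, num_idx, cat_idx, ranges, w):
    # Weighted Gower-type distance between a query row x and every row of X.
    d = np.zeros(len(X))
    for k in num_idx:                                  # interval (or ordinal-as-index) variables
        d += w[k] * np.abs(X[:, k].astype(float) - float(x[k])) / ranges[k]
    for k in cat_idx:                                  # nominal variables: simple mismatch
        d += w[k] * (X[:, k] != x[k])
    return d / w.sum()                                 # distances lie in [0, 1]

def dw_knn_predict(x, X, y, num_idx, cat_idx, k=5):
    # First weighting (features): absolute Spearman correlation with the target
    # for numeric columns, a fixed placeholder for nominal ones. This particular
    # choice is an assumption made for the sketch.
    w = np.array([abs(spearmanr(X[:, j].astype(float), y)[0]) if j in num_idx else 0.5
                  for j in range(X.shape[1])])
    ranges = {j: max(np.ptp(X[:, j].astype(float)), 1e-12) for j in num_idx}
    d = gower_distance(x, X, num_idx, cat_idx, ranges, w)
    nn = np.argsort(d)[:k]                             # indices of the k nearest neighbours
    nw = 1.0 - d[nn]                                   # second weighting (neighbours), cf. Note 4
    return np.sum(nw * y[nn]) / np.sum(nw)             # weighted average of the neighbour targets

In a full implementation, the feature weights would come from the association measures proposed by the authors and the per-variable terms would follow Eq. (2) exactly; the sketch only aims to show how the two weighting layers interact.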

Notes

  1. For example, when dummy variables are created, each one is considered an independent addend and, thus, the overall contribution of a categorical variable is multiplied by the number of its dummy variables, overpowering that of the interval ones.

  2. In this sense, it can be argued that, in the case of the Euclidean distance, ordinal variables can also be replaced with the corresponding position index in their natural ordering and treated as quantitative ones. However, this strategy does not generally outperform the classic dummy approach, as can be seen in the “Appendix.”

  3. We remind the reader that the values \({d}_{ij}\) are Gower distances and, thus, they take values from 0 to 1. The values of zero in the distance matrix were previously converted to \(10^{-7}\).

  4. We note that, as we are working with the Gower distance, the maximum and minimum values are fixed and, thus, Eq. (3) is equivalent to \({nw}_{ij}=1-{d}_{ij}\) (see the illustrative sketch after these notes).
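
The following fragment illustrates the two details footnoted above: exact zeros in the Gower distance matrix are replaced by \(10^{-7}\) (Note 3) and, since Gower distances are bounded between 0 and 1, a min–max rescaling of distances into neighbor weights reduces to \({nw}_{ij}=1-{d}_{ij}\) (Note 4). The min–max form is an assumption about Eq. (3), which is not reproduced in this excerpt.

import numpy as np

# Toy symmetric Gower distance matrix; by construction its values lie in [0, 1].
D = np.array([[0.00, 0.12, 0.40],
              [0.12, 0.00, 0.55],
              [0.40, 0.55, 0.00]])

D[D == 0] = 1e-7                       # Note 3: replace exact zeros by 10^-7

# Assumed min-max form of Eq. (3); with the fixed Gower bounds it equals 1 - D.
d_min, d_max = 0.0, 1.0
nw = (d_max - D) / (d_max - d_min)     # neighbour weights, nw = 1 - d (Note 4)
assert np.allclose(nw, 1 - D)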

Funding

NextGenerationEU Funds, Programa Investigo (Grant No. CT36/22-04-UCM-INV).

Author information

Corresponding author

Correspondence to Aida Calviño.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In this “Appendix” we further explore the different possibilities regarding the treatment of ordinal variables. The default strategy when working with this type of variable and the Euclidean distance consists of ignoring the natural order of the levels and generating several dummy variables, which are then treated as interval ones.

An alternative to this strategy, inspired by the treatment given to this kind of variable in the Gower distance, consists of replacing them with the corresponding position index in their natural ordering and then treating them as quantitative ones (which includes the standardization phase). A small illustration of both coding strategies is given in the sketch below.
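
The sketch codes an ordinal variable under both strategies, assuming a pandas representation with an ordered categorical type; the variable and level names are made up for the example.

import pandas as pd

x = pd.Series(["low", "medium", "high", "medium"],
              dtype=pd.CategoricalDtype(["low", "medium", "high"], ordered=True))

# Strategy 1: ignore the ordering and create dummy variables,
# which are then treated as interval ones.
dummies = pd.get_dummies(x, prefix="x").astype(float)

# Strategy 2: replace each level by its position index in the natural
# ordering and standardize, as for any interval variable.
idx = x.cat.codes + 1                          # positions 1, 2, 3
z = (idx - idx.mean()) / idx.std(ddof=0)       # standardized position index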

From a theoretical point of view, the fact that interval variables need to be standardized before being used in the Euclidean distance implies that the values the distances take are determined by each variable's variance:

$$d_{\text{euc}}\left(z_{i},z_{j}\right)=\sqrt{\sum_{k=1}^{p}\left(z_{ik}-z_{jk}\right)^{2}}=\sqrt{\sum_{k=1}^{p}\left(\frac{x_{ik}-\overline{x}_{k}}{\text{sd}\left(x_{k}\right)}-\frac{x_{jk}-\overline{x}_{k}}{\text{sd}\left(x_{k}\right)}\right)^{2}}=\sqrt{\sum_{k=1}^{p}\left(\frac{x_{ik}-x_{jk}}{\text{sd}\left(x_{k}\right)}\right)^{2}}.$$

This means that, for instance, two ordinal variables (treated as interval ones) with the same number of levels will contribute differently to the overall distance simply because of differences in their variances (which, in turn, depend on the frequencies of the categories of the original variable). This does not happen with the Gower distance [see Eq. (2)], where the contribution of a variable depends solely on the values it takes (and not on its distribution). As an illustrative example, let us assume that there are two ordinal variables with three categories, which are transformed into 1–2–3 according to their natural order. The first of them, X1, has frequencies equal to 10–80–10% and the second, X2, 33–33–33%. This leads to variances of 0.2 and 2.8867, respectively. If two observations take the furthest possible values in both variables, the contributions of the two variables would be 20 and 1.3857, respectively. In other words, even though the difference between the values is the same, the contribution of X1 is more than 14 times larger than that of X2.
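
The computation for the first variable of the example can be checked with a few lines; the Gower-side contribution shown, |x_i − x_j| divided by the variable's range, is an illustrative simplification of Eq. (2), which is not reproduced in this excerpt.

import numpy as np

levels = np.array([1, 2, 3])
probs = np.array([0.10, 0.80, 0.10])           # frequencies of X1 in the example

mean = (levels * probs).sum()                  # 2.0
var = ((levels - mean) ** 2 * probs).sum()     # 0.2

# Contribution of X1 to the squared standardized Euclidean distance when two
# observations take the furthest values (1 and 3): diff^2 / var = 4 / 0.2 = 20.
euclid_contrib = (3 - 1) ** 2 / var

# Under a Gower-type treatment the same pair contributes |diff| / range = 1,
# irrespective of how the categories are distributed.
gower_contrib = abs(3 - 1) / (levels.max() - levels.min())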

Furthermore, in order to show that this strategy (coding ordinal variables as interval ones) does not outperform the classic one, we have conducted another experiment comparing the performance of both strategies when applying kNN with the Euclidean distance. For this purpose, we have repeated the experiment explained in the “Experimental results” section, but only on the data sets that have at least one ordinal variable (see Table 2 for the data sets and their order in the plot). The results are included in Fig. 5, which can be interpreted in the same fashion as Fig. 4B. As can be seen, no significant differences arise between the two strategies, except for the third data set (where some improvement can be seen, but not enough to surpass the Gower-based proposal; see Fig. 4B), which implies that, regardless of the treatment applied to ordinal variables in the Euclidean context, the Gower-based proposal leads to better results.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Moreno-Ribera, A., Calviño, A. Double-weighted kNN: a simple and efficient variant with embedded feature selection. J Market Anal (2024). https://doi.org/10.1057/s41270-024-00302-5
