Towards exploiting linear regression for multi-class/multi-label classification: an empirical analysis

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Regression and classification are the two main tasks in supervised learning, and both can be solved by learning a hyperplane from training samples. However, the hyperplane in a regression task aims to approximate the samples' labels as closely as possible, whereas the hyperplane in a classification task aims to separate samples belonging to different classes as well as possible. From this perspective, regression and classification are two quite different learning tasks. Nevertheless, linear regression is often used to solve multi-class/multi-label classification problems, which can be decomposed into a set of binary classification problems. In this paper, we focus on analyzing the issues that arise when regression models are used for classification tasks. First, when \(\{-1, +1\}\) is used to denote the negative and positive classes, we derive that learning a linear regression model to solve a binary classification problem is essentially equivalent to optimizing the square loss as a surrogate for the zero-one loss. Then, we derive what happens to the model when \(\{-1, +1\}\) is replaced with \(\{0, 1\}\) for three different versions of linear regression. Finally, extensive experiments are conducted on multi-label/multi-class classification tasks and the results are discussed in detail.
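
To make this setting concrete, the following minimal sketch, written for illustration rather than taken from the paper, fits an ordinary least-squares model to \(\{-1,+1\}\) targets on synthetic data and classifies by the sign of the real-valued output, i.e., it minimizes the square loss as a surrogate for the zero-one loss.

```python
# Minimal sketch (not the authors' implementation): binary classification via
# linear regression on {-1, +1} targets, thresholding the real-valued output at 0.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y01 = make_classification(n_samples=500, n_features=20, random_state=0)
y = 2 * y01 - 1                                   # map {0, 1} labels to {-1, +1}
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# Augment with a constant feature so the bias is learned jointly with the weights.
Xtr_b = np.hstack([Xtr, np.ones((Xtr.shape[0], 1))])
Xte_b = np.hstack([Xte, np.ones((Xte.shape[0], 1))])

# Least-squares fit: minimizes the square loss ||Xw - y||^2 over the training set.
w, *_ = np.linalg.lstsq(Xtr_b, ytr, rcond=None)

# Prediction: the sign of the regression output plays the role of the class label,
# so the square loss acts as a surrogate for the zero-one loss.
y_pred = np.sign(Xte_b @ w)
print("test accuracy:", np.mean(y_pred == yte))
```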


Data availability statement

The data sets used in Sect. 6 are publicly available at https://mulan.sourceforge.net/datasets-mlc.html. The data sets used in Sect. 7 are publicly available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html.

Notes

  1. Multi-class/multi-label classification tasks can be solved by decomposing them into a set of binary classification problems [10,11,12], so we focus only on the basic binary classification task here (a minimal one-vs-rest decomposition sketch is given after these notes).

  2. The two problem settings differ in label cardinality: each instance has exactly one relevant label in multi-class classification, while it can have multiple relevant labels in multi-label classification. Here, we use \(+1\) and \(-1\) to represent relevant and irrelevant labels; 1 and 0 can be used instead.

  3. For example, \({\mathcal{R}}({{\textbf{W}}}) = \left\| {{\textbf{W}}} \right\| _F^2\) and \({\mathcal{R}}({{\textbf{W}}}) = \left\| {{\textbf{W}}} \right\| _1\) aim at obtaining balanced weights and sparse weights, respectively.

  4. The threshold should be set to 0.5 if we use 1 and 0 to represent relevant and irrelevant labels.

  5. Strictly speaking, linear regression [5] does not include the second term \(\left\| {{{\varvec{w}}}} \right\| _2^2\), while the regularized version in formulation (10) is usually termed ridge regression [44, 45].

  6. Generally speaking, the \(\ell _2\)-regularization \(\left\| {{{\varvec{w}}}} \right\| _2^2\) aims at obtaining balanced model parameters so that the model does not overfit to a few features. For example, if the j-th entry \(w_j\) of \({{\varvec{w}}}\) is very large compared with the other entries, then a relatively small variation in the j-th feature will lead to a large change in the model output. In contrast, the bias term b affects all instances equally, so it is unnecessary to regularize it (a sketch of ridge regression with an unregularized bias is given after these notes).

  7. In fact, the labels in \(\{0, 1\}\) can be regarded as the ground-truth posterior probabilities of an instance \({{\varvec{x}}}_i\) belonging to the positive class, i.e., \(p(+1 \mid {{{\varvec{x}}}}_i) = 1\) holds for positive samples while \(p(+1 \mid {{{\varvec{x}}}}_i) = 0\) holds for negative samples. Thus, in this section we use \(p_i\) to denote \({{\varvec{x}}}_i\)’s label when it takes values in \(\{0,1\}\).

  8. In experiments, the linear equation \(\textbf{A}{{{\varvec{x}}}} = {{\varvec{b}}}\) is solved via the Matlab command “\({{{\varvec{x}}}} = \textbf{A} \setminus {{\varvec{b}}}\)”, which is recommended over “\({{{\varvec{x}}}} = \text{pinv}(\textbf{A}) * {{\varvec{b}}}\)”; a numpy analogue is shown in the sketch after these notes.

  9. https://mulan.sourceforge.net/datasets-mlc.html.

  10. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html.
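
The one-vs-rest decomposition mentioned in note 1 can be sketched as follows. This is an illustrative implementation under our own assumptions (per-class least-squares regressors on \(\{-1,+1\}\) targets, prediction by the largest regression output); the function names are hypothetical and this is not the authors' code.

```python
# Sketch of one-vs-rest decomposition with per-class linear regressors (illustrative only).
import numpy as np

def fit_ovr_regressors(X, y, n_classes):
    """Fit one least-squares regressor per class on {-1, +1} targets."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])      # append a bias column
    W = np.zeros((Xb.shape[1], n_classes))
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)                 # class c vs the rest
        W[:, c], *_ = np.linalg.lstsq(Xb, t, rcond=None)
    return W

def predict_ovr(X, W):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmax(Xb @ W, axis=1)                    # largest regression output wins

# Example usage with random data (purely illustrative):
rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 10)), rng.integers(0, 4, size=300)
W = fit_ovr_regressors(X, y, n_classes=4)
print(predict_ovr(X[:5], W))
```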
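
Notes 5, 6 and 8 can be illustrated together. The sketch below is our own assumption-laden example rather than the paper's formulation (10) verbatim: it fits ridge regression in which the \(\ell_2\) penalty is applied to the weights but not to the bias, and solves the resulting normal equations with a direct solver, the numpy analogue of Matlab's backslash operator, instead of forming a pseudo-inverse explicitly.

```python
# Sketch: ridge regression with an unregularized bias term (cf. notes 5 and 6),
# solved directly rather than via an explicit pseudo-inverse (cf. note 8).
import numpy as np

def ridge_with_bias(X, y, lam):
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])     # last column corresponds to the bias b
    R = lam * np.eye(d + 1)
    R[d, d] = 0.0                            # do not regularize the bias term
    A = Xb.T @ Xb + R                        # normal equations: (Xb^T Xb + R) theta = Xb^T y
    b = Xb.T @ y
    theta = np.linalg.solve(A, b)            # analogue of Matlab's "A \ b"; avoids pinv(A) @ b
    return theta[:d], theta[d]               # weights w and bias b

# Hypothetical usage: w, b = ridge_with_bias(X_train, y_train, lam=1.0)
```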

References

  1. Zhou Z-H (2021) Machine learning. Springer, Singapore. https://doi.org/10.1007/978-981-15-1967-3

  2. Han J, Pei J, Tong H (2022) Data mining: concepts and techniques, 4th edn. Morgan Kaufmann, Cambridge

  3. Bzdok D, Krzywinski M, Altman N (2018) Machine learning: supervised methods. Nat Methods 15:5–6. https://doi.org/10.1038/nmeth.4551

  4. Verdhan V (2020) Supervised learning with Python. Apress, Berkeley. https://doi.org/10.1007/978-1-4842-6156-9

  5. Bingham NH, Fry JM (2010) Regression: linear models in statistics. Springer, London. https://doi.org/10.1007/978-1-84882-969-5

  6. Drummond C (2017) Classification. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning and data mining. Springer, Boston, pp 205–208. https://doi.org/10.1007/978-1-4899-7687-1_111

  7. Zhang M-L, Zhou Z-H (2014) A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng 26(8):1819–1837. https://doi.org/10.1109/TKDE.2013.39

  8. Gibaja E, Ventura S (2015) A tutorial on multilabel learning. ACM Comput Surv 47(3):52. https://doi.org/10.1145/2716262

  9. Liu W, Wang H, Shen X, Tsang IW (2022) The emerging trends of multi-label learning. IEEE Trans Pattern Anal Mach Intell 44(11):7955–7974. https://doi.org/10.1109/TPAMI.2021.3119334

  10. Dietterich TG, Bakiri G (1995) Solving multiclass learning problems via error-correcting output codes. J Artif Intell Res 2:263–286. https://doi.org/10.1613/jair.105

  11. Jia B-B, Liu J-Y, Hang J-Y, Zhang M-L (2023) Learning label-specific features for decomposition-based multi-class classification. Front Comput Sci 17(6):176348. https://doi.org/10.1007/s11704-023-3076-y

  12. Zhang M-L, Li Y-K, Liu X-Y, Geng X (2018) Binary relevance for multi-label learning: an overview. Front Comput Sci 12(2):191–202. https://doi.org/10.1007/s11704-017-7031-7

  13. Aggarwal CC (2018) Linear classification and regression for text. In: Machine learning for text. Springer, Cham, pp 159–208. https://doi.org/10.1007/978-3-319-73531-3_6

  14. Xue H, Chen S, Yang Q (2009) Discriminatively regularized least-squares classification. Pattern Recognit 42(1):93–104. https://doi.org/10.1016/j.patcog.2008.07.010

  15. Xiang S, Nie F, Meng G, Pan C, Zhang C (2012) Discriminative least squares regression for multiclass classification and feature selection. IEEE Trans Neural Netw Learn Syst 23(11):1738–1754. https://doi.org/10.1109/TNNLS.2012.2212721

  16. Zhang X-Y, Wang L, Xiang S, Liu C-L (2015) Retargeted least squares regression algorithm. IEEE Trans Neural Netw Learn Syst 26(9):2206–2213. https://doi.org/10.1109/TNNLS.2014.2371492

  17. Liu M, Zhang D, Chen S, Xue H (2016) Joint binary classifier learning for ECOC-based multi-class classification. IEEE Trans Pattern Anal Mach Intell 38(11):2335–2341. https://doi.org/10.1109/TPAMI.2015.2430325

  18. Ma Z, Chen S (2018) Multi-dimensional classification via a metric approach. Neurocomputing 275:1121–1131. https://doi.org/10.1016/j.neucom.2017.09.057

  19. Yang C, Wang W, Feng X, He R (2020) Group discriminative least square regression for multicategory classification. Neurocomputing 407:175–184. https://doi.org/10.1016/j.neucom.2020.05.016

  20. Zhan S, Wu J, Han N, Wen J, Fang X (2020) Group low-rank representation-based discriminant linear regression. IEEE Trans Circuits Syst Video Technol 30(3):760–770. https://doi.org/10.1109/TCSVT.2019.2897072

  21. Huang J, Li G, Huang Q, Wu X (2016) Learning label-specific features and class-dependent labels for multi-label classification. IEEE Trans Knowl Data Eng 28(12):3309–3323. https://doi.org/10.1109/TKDE.2016.2608339

  22. Yu Z-B, Zhang M-L (2022) Multi-label classification with label-specific feature generation: a wrapped approach. IEEE Trans Pattern Anal Mach Intell 44(9):5199–5210. https://doi.org/10.1109/TPAMI.2021.3070215

  23. Zhou W-J, Yu Y, Zhang M-L (2017) Binary linear compression for multi-label classification. In: Proceedings of the 26th international joint conference on artificial intelligence. ijcai.org, Melbourne, Australia, pp 3546–3552. https://doi.org/10.24963/ijcai.2017/496

  24. Jia B-B, Zhang M-L (2023) Multi-dimensional classification via decomposed label encoding. IEEE Trans Knowl Data Eng 35(2):1844–1856. https://doi.org/10.1109/TKDE.2021.3100436

  25. Naseem I, Togneri R, Bennamoun M (2010) Linear regression for face recognition. IEEE Trans Pattern Anal Mach Intell 32(11):2106–2112. https://doi.org/10.1109/TPAMI.2010.128

  26. Liu W, Tsang IW (2015) Large margin metric learning for multi-label prediction. In: Proceedings of the 29th AAAI conference on artificial intelligence. AAAI Press, Austin, pp 2800–2806. https://doi.org/10.1609/aaai.v29i1.9610

  27. Lv J, Wu T, Peng C-L, Liu Y, Xu N, Geng X (2021) Compact learning for multi-label classification. Pattern Recognit 113:107833. https://doi.org/10.1016/j.patcog.2021.107833

  28. Bishop CM (2006) Pattern recognition and machine learning. Springer, Singapore

  29. Mohri M, Rostamizadeh A, Talwalkar A (2018) Foundations of machine learning, 2nd edn. MIT Press, Cambridge

  30. Zhang T (2004) Statistical behavior and consistency of classification methods based on convex risk minimization. Ann Stat 32(5):56–85. https://doi.org/10.1214/aos/1079120130

  31. Cai X, Ding CHQ, Nie F, Huang H (2013) On the equivalent of low-rank linear regressions and linear discriminant analysis based regressions. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Chicago, pp 1124–1132. https://doi.org/10.1145/2487575.2487701

  32. Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 7:2399–2434

  33. Wen J, Xu Y, Li Z, Ma Z, Xu Y (2018) Inter-class sparsity based discriminative least square regression. Neural Netw 102:36–47. https://doi.org/10.1016/j.neunet.2018.02.002

  34. Shao R, Xu N, Geng X (2018) Multi-label learning with label enhancement. In: Proceedings of the IEEE international conference on data mining. IEEE, Singapore, pp 437–446. https://doi.org/10.1109/ICDM.2018.00059

  35. Tao A, Xu N, Geng X (2018) Labeling information enhancement for multi-label learning with low-rank subspace. In: Proceedings of the 15th Pacific rim international conference on artificial intelligence. Springer, Nanjing, pp 671–683. https://doi.org/10.1007/978-3-319-97304-3_51

  36. Xu N, Liu Y-P, Geng X (2021) Label enhancement for label distribution learning. IEEE Trans Knowl Data Eng 33(4):1632–1643. https://doi.org/10.1109/TKDE.2019.2947040

  37. Hou P, Geng X, Zhang M-L (2016) Multi-label manifold learning. In: Proceedings of the 30th AAAI conference on artificial intelligence. AAAI Press, Phoenix, pp 1680–1686. https://doi.org/10.1609/aaai.v30i1.10258

  38. Zhang M-L, Zhou B-B, Liu X-Y (2016) Partial label learning via feature-aware disambiguation. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, San Francisco, pp 1335–1344. https://doi.org/10.1145/2939672.2939788

  39. Zhang Q-W, Zhong Y, Zhang M-L (2018) Feature-induced labeling information enrichment for multi-label learning. In: Proceedings of the 32nd AAAI conference on artificial intelligence. AAAI Press, New Orleans, pp 4446–4453. https://doi.org/10.1609/aaai.v32i1.11656

  40. Lv J, Xu N, Zheng R, Geng X (2019) Weakly supervised multi-label learning via label enhancement. In: Proceedings of the 28th international joint conference on artificial intelligence. ijcai.org, Macao, China, pp 3101–3107. https://doi.org/10.24963/ijcai.2019/430

  41. Xu N, Lv J, Geng X (2019) Partial label learning via label enhancement. In: Proceedings of the 33rd AAAI conference on artificial intelligence. AAAI Press, Honolulu, pp 5557–5564. https://doi.org/10.1609/aaai.v33i01.33015557

  42. Xu N, Liu Y-P, Geng X (2020) Partial multi-label learning with label distribution. In: Proceedings of the 34th AAAI conference on artificial intelligence. AAAI Press, New York, pp 6510–6517. https://doi.org/10.1609/aaai.v34i04.6124

  43. Wang L, Pan C (2018) Groupwise retargeted least-squares regression. IEEE Trans Neural Netw Learn Syst 29(4):1352–1358. https://doi.org/10.1109/TNNLS.2017.2651169

  44. Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1):55–67

  45. van Wieringen WN (2015) Lecture notes on ridge regression. arXiv preprint arXiv:1509.09169

  46. Schölkopf B, Smola AJ (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge

  47. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297. https://doi.org/10.1007/BF00994018

  48. Kleinbaum DG, Klein M (2010) Logistic regression: a self-learning text. Springer, New York. https://doi.org/10.1007/978-1-4419-1742-3

  49. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139. https://doi.org/10.1006/jcss.1997.1504

  50. Vapnik V, Chervonenkis A (1991) The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognit Image Anal 1(3):283–305

  51. Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874

  52. Zhang M-L, Wu L (2015) LIFT: multi-label learning with label-specific features. IEEE Trans Pattern Anal Mach Intell 37(1):107–120. https://doi.org/10.1109/TPAMI.2014.2339815

  53. Wu X-Z, Zhou Z-H (2017) A unified view of multi-label performance measures. In: Proceedings of the 34th international conference on machine learning. PMLR, Sydney, pp 3780–3788

  54. Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  55. Lorena AC, Carvalho ACPLF, Gama J (2008) A review on the combination of binary classifiers in multiclass problems. Artif Intell Rev 30(1–4):19–37. https://doi.org/10.1007/s10462-009-9114-9

  56. Patro SGK, Sahu KK (2015) Normalization: a preprocessing stage. Int Adv Res J Sci Eng Technol 2(3):20–22. https://doi.org/10.17148/IARJSET.2015.2305

  57. Henderi H, Wahyuningsih T, Rahwanto E (2021) Comparison of min-max normalization and z-score normalization in the k-nearest neighbor (kNN) algorithm to test the accuracy of types of breast cancer. Int J Inform Inf Syst 4(1):13–20. https://doi.org/10.47738/IJIIS.V4I1.73

  58. Liu J-Y, Jia B-B (2020) Combining one-vs-one decomposition and instance-based learning for multi-class classification. IEEE Access 8:197499–197507. https://doi.org/10.1109/ACCESS.2020.3034448

Acknowledgements

The authors wish to thank the associate editor and anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Science Foundation of China (62306131, 62176055), the Fundamental Research Funds for the Central Universities, and the Red Willow Outstanding Youth Talent Support Program of Lanzhou University of Technology. We thank the Big Data Center of Southeast University for providing the facility support on the numerical calculations in this paper.

Author information

Corresponding author

Correspondence to Jun-Ying Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1: Detailed experimental results

In this appendix, Tables 18, 19, 20, 21, 22, 23 and 24 report the detailed experimental results on the multi-label data sets, and Tables 25 and 26 report the detailed experimental results on the multi-class data sets; the corresponding Wilcoxon signed-ranks test results are reported and discussed in the main paper. We purposely keep five decimal digits so that even tiny performance differences among the compared approaches remain visible. For convenience, the performance rank of every compared approach on each data set is shown in parentheses, and the average ranks over all data sets are given in the last row of each table; sketches of this rank-based comparison and of how the metrics in the table captions can be computed are given below. Note that when the RBF kernel is used, KerRegBias and KerNoBias achieve identical performance because the extra bias does not affect the distance between any two instances; we therefore report their results together and denote them as KerNo/RegBias.
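
As a sketch of the rank-based comparison described above (with made-up scores standing in for the values in Tables 18, 19, 20, 21, 22, 23, 24, 25 and 26), the per-data-set ranks, average ranks and a pairwise Wilcoxon signed-ranks test can be computed as follows.

```python
# Sketch of the rank-based comparison (illustrative; plug in the scores from Tables 18-26).
import numpy as np
from scipy.stats import rankdata, wilcoxon

# rows = data sets, columns = compared approaches; here a higher score is better
scores = np.array([
    [0.812, 0.797, 0.803],
    [0.654, 0.671, 0.660],
    [0.903, 0.899, 0.901],
    [0.748, 0.745, 0.752],
    [0.566, 0.571, 0.560],
])

ranks = np.array([rankdata(-row) for row in scores])   # rank 1 = best on that data set
print("average ranks:", ranks.mean(axis=0))

# Pairwise Wilcoxon signed-ranks test between approach 0 and approach 1
stat, p = wilcoxon(scores[:, 0], scores[:, 1])
print("Wilcoxon statistic:", stat, "p-value:", p)
```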

Table 18 Experimental results in terms of hamming loss (the smaller, the better)
Table 19 Experimental results in terms of macro-F1 (the larger, the better)
Table 20 Experimental results in terms of micro-F1 (the larger, the better)
Table 21 Experimental results in terms of average precision (the larger, the better)
Table 22 Experimental results in terms of ranking loss (the smaller, the better)
Table 23 Experimental results in terms of one error (the smaller, the better)
Table 24 Experimental results in terms of coverage (the smaller, the better)
Table 25 Experimental results in terms of accuracy (the larger, the better)
Table 26 Experimental results in terms of average-F1 (the larger, the better)
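
The multi-label metrics named in these captions can be computed, for example, with scikit-learn. The snippet below is a sketch with random ground truth and predictions, not the paper's evaluation code; Y denotes the binary label matrix and S the real-valued modeling outputs. Note that scikit-learn's coverage_error counts ranks from 1, so it may differ from the paper's definition of coverage by a constant offset or normalization.

```python
# Sketch: computing several of the multi-label metrics listed above with scikit-learn.
import numpy as np
from sklearn.metrics import (hamming_loss, f1_score,
                             label_ranking_average_precision_score,
                             label_ranking_loss, coverage_error)

rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(100, 5))          # ground-truth label matrix (100 instances, 5 labels)
Y[Y.sum(axis=1) == 0, 0] = 1                   # ensure every instance has at least one relevant label
S = rng.normal(size=(100, 5))                  # real-valued modeling outputs
P = (S > 0).astype(int)                        # thresholded predictions (threshold 0 for {-1,+1}-style outputs)

print("hamming loss     :", hamming_loss(Y, P))
print("macro-F1         :", f1_score(Y, P, average="macro", zero_division=0))
print("micro-F1         :", f1_score(Y, P, average="micro", zero_division=0))
print("average precision:", label_ranking_average_precision_score(Y, S))
print("ranking loss     :", label_ranking_loss(Y, S))
print("coverage         :", coverage_error(Y, S))
# one error: fraction of instances whose top-ranked label is irrelevant (not in scikit-learn)
print("one error        :", np.mean(Y[np.arange(Y.shape[0]), S.argmax(axis=1)] == 0))
```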

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jia, B.-B., Liu, J.-Y. & Zhang, M.-L. Towards exploiting linear regression for multi-class/multi-label classification: an empirical analysis. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02114-6

  • DOI: https://doi.org/10.1007/s13042-024-02114-6
