Abstract
Regression and classification are the two main learning tasks in supervised learning, and both can be solved by learning a hyperplane from training samples. However, the hyperplane in a regression task aims to approximate the labels of the samples as closely as possible, whereas the hyperplane in a classification task aims to separate the samples belonging to different classes as far as possible. From this perspective, regression and classification are two completely different learning tasks. Nevertheless, linear regression is often used to solve multi-class/multi-label classification problems, which can be decomposed into a set of binary classification problems. In this paper, we focus on analyzing the issues that arise when regression models are used for classification tasks. Firstly, when \(\{-1, +1\}\) is used to denote the negative and positive classes, we derive that solving a binary classification problem by learning a linear regression model is essentially equivalent to optimizing the square loss as a surrogate for the zero-one loss. Then, we derive what happens to the model when \(\{-1, +1\}\) is replaced with \(\{0, 1\}\) for three different versions of linear regression. Finally, extensive experiments are conducted on multi-label/multi-class classification tasks and the results are discussed in detail.
Data availability statement
The data sets used in Sect. 6 are publicly available at https://mulan.sourceforge.net/datasets-mlc.html. The data sets used in Sect. 7 are publicly available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html.
Notes
According to the different problem settings, each instance has exactly one relevant label in multi-class classification, while it can have multiple relevant labels in multi-label classification. Here, we use \(+1\) and \(-1\) to represent relevant and irrelevant labels, respectively; 1 and 0 can be used instead, as the sketch below illustrates.
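To make the two settings concrete, the following minimal NumPy sketch (ours, with hypothetical label names, not code from the paper) builds the \(\{-1, +1\}\) label matrix for both cases and converts it to the equivalent \(\{0, 1\}\) encoding; the only difference between the settings is how many entries of each row equal \(+1\):

```python
import numpy as np

labels = ["cat", "dog", "bird"]              # hypothetical label space

# Multi-class: exactly one relevant label per instance.
y_mc = ["dog", "cat"]
Y_mc = np.full((len(y_mc), len(labels)), -1)
for i, lab in enumerate(y_mc):
    Y_mc[i, labels.index(lab)] = +1          # exactly one +1 per row

# Multi-label: possibly several relevant labels per instance.
y_ml = [{"cat", "bird"}, {"dog"}]
Y_ml = np.full((len(y_ml), len(labels)), -1)
for i, labs in enumerate(y_ml):
    for lab in labs:
        Y_ml[i, labels.index(lab)] = +1      # one or more +1 per row

Y_01 = (Y_ml + 1) // 2                       # the equivalent {0, 1} encoding
```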
For example, \({\mathcal{R}}({{\textbf{W}}}) = \left\| {{\textbf{W}}} \right\| _F^2\) and \({\mathcal{R}}({{\textbf{W}}}) = \left\| {{\textbf{W}}} \right\| _1\) aim at obtaining balanced weights and sparse weights, respectively.
The threshold should be set to 0.5 if we use 1 and 0 to represent relevant and irrelevant labels.
Generally speaking, \(\ell _2\)-regularization \(\left\| {{{\varvec{w}}}} \right\| _2^2\) aims at obtaining balanced model parameters so as to avoid overfitting to particular features. For example, if the j-th entry \(w_j\) of \({{\varvec{w}}}\) is very large compared with the other entries, then a relatively small variation in the j-th feature will lead to a large difference in the model output. In contrast, the bias term b affects all instances equally, so it is unnecessary to regularize it.
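As an illustration of leaving the bias unpenalized, here is a minimal NumPy sketch (our own, on synthetic data, not the paper's code) of the closed-form ridge solution in which the appended constant column is excluded from the \(\ell _2\) penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # 100 instances, 5 features
y = X @ rng.normal(size=5) + 3.0            # synthetic targets with a true bias of 3.0
lam = 1.0                                   # regularization strength

Xb = np.hstack([X, np.ones((len(X), 1))])   # append a constant column for the bias b
D = np.eye(Xb.shape[1])
D[-1, -1] = 0.0                             # leave the bias term unpenalized

# Closed-form solution of  min_w ||Xb w - y||^2 + lam * w' D w
w = np.linalg.solve(Xb.T @ Xb + lam * D, Xb.T @ y)
print("weights:", w[:-1], "bias:", w[-1])   # the recovered bias is close to 3.0
```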
In fact, the label set \(\{0, 1\}\) can be regarded as the ground-truth posterior probability of an instance \({{\varvec{x}}}_i\) belonging to the positive class, i.e., \(p(+1 \mid {{{\varvec{x}}}}_i) = 1\) holds for positive samples while \(p(+1 \mid {{{\varvec{x}}}}_i) = 0\) holds for negative samples. Thus, in this section we use \(p_i\) to denote \({{\varvec{x}}}_i\)’s label whenever it takes values in \(\{0, 1\}\).
In the experiments, the linear system \(\textbf{A}{{{\varvec{x}}}} = {{\varvec{b}}}\) is solved via the Matlab command “\({{{\varvec{x}}}} = \textbf{A} \setminus {{\varvec{b}}}\)”, which is recommended over “\({{{\varvec{x}}}} = \text{pinv}(\textbf{A}) * {{\varvec{b}}}\)”.
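A rough NumPy analogue of the two Matlab routes, assuming a square nonsingular \(\textbf{A}\) (for rank-deficient systems the two routes can return different solutions); the direct solver avoids explicitly forming a (pseudo-)inverse, which is both cheaper and numerically more stable:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x_solve = np.linalg.solve(A, b)      # direct solver, analogous to Matlab's A \ b
x_pinv = np.linalg.pinv(A) @ b       # explicit pseudo-inverse, analogous to pinv(A) * b

assert np.allclose(x_solve, x_pinv)  # identical here because A is nonsingular
```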
Acknowledgements
The authors wish to thank the associate editor and anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Natural Science Foundation of China (62306131, 62176055), the Fundamental Research Funds for the Central Universities, and the Red Willow Outstanding Youth Talent Support Program of Lanzhou University of Technology. We thank the Big Data Center of Southeast University for providing the facility support for the numerical calculations in this paper.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix 1: Detailed experimental results
In this appendix, Tables 18, 19, 20, 21, 22, 23 and 24 report the detailed experimental results on the multi-label data sets, and Tables 25 and 26 report the detailed experimental results on the multi-class data sets; the corresponding Wilcoxon signed-ranks test results are reported and discussed in the main text. We deliberately keep five decimal places in order to expose the tiny performance differences that may exist among the compared approaches. For convenience, the performance rank of each approach on each data set is shown in parentheses, and the average ranks over all data sets are shown in the last row of each table. Note that when the RBF kernel is used, KerRegBias and KerNoBias achieve identical performance because the extra bias does not affect the distance between any two instances; we therefore report their results together under the name KerNo/RegBias.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jia, BB., Liu, JY. & Zhang, ML. Towards exploiting linear regression for multi-class/multi-label classification: an empirical analysis. Int. J. Mach. Learn. & Cyber. (2024). https://doi.org/10.1007/s13042-024-02114-6