Abstract
In this paper we describe an experimental study in which we analyzed data difficulty factors encountered in imbalanced clinical data sets and examined how well selected data preprocessing methods addressed these factors. We considered five data sets describing various pediatric acute conditions. In all of these data sets the minority class was sparse and overlapped with the majority classes, and was thus difficult to learn. We studied five preprocessing methods (random undersampling, random oversampling, SMOTE, the neighborhood cleaning rule, and SPIDER2) combined with the following classifiers: k-nearest neighbors, decision trees and rules, naive Bayes, neural networks, and support vector machines. Application of preprocessing always improved classification performance, and the largest improvement was observed for random undersampling. Moreover, naive Bayes was the best-performing classifier regardless of the preprocessing method used.
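To make the two simplest preprocessing methods named above concrete, the following is a minimal sketch of random undersampling and random oversampling in plain Python. It is an illustration under stated assumptions, not the implementation used in the study: the function names, the toy data, the fixed seed, and the choice to balance the minority class against all remaining examples pooled together are all hypothetical.

```python
import random
from collections import Counter


def _split_indices(y, minority_label):
    """Return (minority, majority) index lists; all non-minority labels are pooled."""
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    if not minority:
        raise ValueError("no examples carry the minority label")
    return minority, majority


def random_undersample(X, y, minority_label, seed=0):
    """Randomly drop majority examples until both groups have the same size."""
    rng = random.Random(seed)
    minority, majority = _split_indices(y, minority_label)
    # Keep every minority example and an equally sized random subset of the rest.
    kept_majority = rng.sample(majority, min(len(minority), len(majority)))
    keep = sorted(minority + kept_majority)
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, minority_label, seed=0):
    """Randomly duplicate minority examples until both groups have the same size."""
    rng = random.Random(seed)
    minority, majority = _split_indices(y, minority_label)
    # Draw with replacement from the minority class to fill the gap.
    missing = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(missing)]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data, made up for illustration: 2 minority and 8 majority examples.
X = [[float(i)] for i in range(10)]
y = ["pos"] * 2 + ["neg"] * 8

X_under, y_under = random_undersample(X, y, "pos")
X_over, y_over = random_oversample(X, y, "pos")
print(Counter(y_under))  # class counts after undersampling
print(Counter(y_over))   # class counts after oversampling
```

Undersampling discards majority examples and so shrinks the training set, while oversampling keeps all original examples and adds duplicates of minority ones. In either case the resampling would normally be applied to the training data only, so that the evaluation data keep their original class distribution.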
Acknowledgments
The first three authors would like to acknowledge support by the Polish National Science Center under Grant No. DEC-2013/11/B/ST6/00963.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Wilk, S., Stefanowski, J., Wojciechowski, S., Farion, K.J., Michalowski, W. (2016). Application of Preprocessing Methods to Imbalanced Clinical Data: An Experimental Study. In: Piętka, E., Badura, P., Kawa, J., Wieclawek, W. (eds) Information Technologies in Medicine. ITiB 2016. Advances in Intelligent Systems and Computing, vol 471. Springer, Cham. https://doi.org/10.1007/978-3-319-39796-2_41
DOI: https://doi.org/10.1007/978-3-319-39796-2_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39795-5
Online ISBN: 978-3-319-39796-2
eBook Packages: Engineering, Engineering (R0)