Application of Preprocessing Methods to Imbalanced Clinical Data: An Experimental Study

Wilk, Szymon; Stefanowski, Jerzy; Wojciechowski, Szymon; Farion, Ken J.; Michalowski, Wojtek

doi:10.1007/978-3-319-39796-2_41

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 471)

Included in the following conference series:

Conference of Information Technologies in Biomedicine

Abstract

In this paper we describe an experimental study in which we analyzed data difficulty factors encountered in imbalanced clinical data sets and examined how selected data preprocessing methods were able to address these factors. We considered five data sets describing various pediatric acute conditions. In all these data sets the minority class was sparse and overlapped with the majority classes, and was thus difficult to learn. We studied five preprocessing methods (random undersampling, random oversampling, SMOTE, the neighborhood cleaning rule, and SPIDER2) combined with the following classifiers: k-nearest neighbors, decision trees and rules, naive Bayes, neural networks, and support vector machines. Application of preprocessing always improved classification performance, and the largest improvement was observed for random undersampling. Moreover, naive Bayes was the best-performing classifier regardless of the preprocessing method used.
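To make the two simplest preprocessing methods named in the abstract concrete, the following is a minimal Python sketch of random undersampling and random oversampling. It is an illustration only and not the authors' implementation: the function names, the toy data, and the choice to balance all classes to exactly equal sizes are assumptions made for this example.

```python
import random
from collections import Counter


def _indices_by_class(y):
    """Group example indices by class label (helper name is hypothetical)."""
    groups = {}
    for idx, label in enumerate(y):
        groups.setdefault(label, []).append(idx)
    return groups


def random_undersample(X, y, seed=0):
    """Randomly discard examples until every class matches the smallest class.

    Assumption: full balancing; a study may target a different class ratio.
    """
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = min(len(ids) for ids in groups.values())
    keep = []
    for ids in groups.values():
        keep.extend(rng.sample(ids, target))  # sample without replacement
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, seed=0):
    """Randomly duplicate examples until every class matches the largest class.

    Assumption: full balancing; a study may target a different class ratio.
    """
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = max(len(ids) for ids in groups.values())
    keep = []
    for ids in groups.values():
        keep.extend(ids)  # keep every original example
        keep.extend(rng.choices(ids, k=target - len(ids)))  # add duplicates
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): 2 minority examples (label 1), 8 majority examples (label 0).
X = [[i] for i in range(10)]
y = [1, 1] + [0] * 8

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
print(sorted(Counter(y_under).items()))  # [(0, 2), (1, 2)]
print(sorted(Counter(y_over).items()))   # [(0, 8), (1, 8)]
```

Undersampling discards majority examples, while oversampling only duplicates existing minority examples. SMOTE, the neighborhood cleaning rule, and SPIDER2 are more involved and are not covered by this sketch.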


Notes

1. http://www.cs.waikato.ac.nz/ml/weka/


Acknowledgments

The first three authors would like to acknowledge support by the Polish National Science Center under Grant No. DEC-2013/11/B/ST6/00963.

Author information

Authors and Affiliations

Poznan University of Technology, Poznan, Poland
Szymon Wilk, Jerzy Stefanowski & Szymon Wojciechowski
Children’s Hospital of Eastern Ontario, Ottawa, Canada
Ken J. Farion
University of Ottawa, Ottawa, Canada
Wojtek Michalowski

Corresponding author

Correspondence to Szymon Wilk.

Editor information

Editors and Affiliations

Fac of Biomed Engg, ul. Roosevelta 40, Silesian University of Technology, Gliwice, Poland
Ewa Piętka
Fac of Biomed Engg, ul. Roosevelta 40, Silesian University of Technology, Gliwice, Poland
Pawel Badura
Faculty of Biomedical Engineering, Silesian University of Technology, Gliwice, Poland
Jacek Kawa
Faculty of Biomedical Engineering, Silesian University of Technology, Gliwice, Poland
Wojciech Wieclawek

About this paper

Cite this paper

Wilk, S., Stefanowski, J., Wojciechowski, S., Farion, K.J., Michalowski, W. (2016). Application of Preprocessing Methods to Imbalanced Clinical Data: An Experimental Study. In: Piętka, E., Badura, P., Kawa, J., Wieclawek, W. (eds) Information Technologies in Medicine. ITiB 2016. Advances in Intelligent Systems and Computing, vol 471. Springer, Cham. https://doi.org/10.1007/978-3-319-39796-2_41

DOI: https://doi.org/10.1007/978-3-319-39796-2_41
Published: 26 May 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39795-5
Online ISBN: 978-3-319-39796-2
eBook Packages: Engineering, Engineering (R0)
