Abstract
Data quality issues have attracted widespread attention because dirty data degrade the results of regression models. Understanding the relationship between data quality and result accuracy can guide two decisions: which regression model to select for data of a given quality, and what share of the data to clean. However, little research has explored this relationship. Motivated by this, we design a generalized framework to evaluate the impacts of dirty data on models. Using the framework, we conduct an experimental evaluation of the effects of missing, inconsistent, and conflicting data on regression models. Based on the experimental findings, we provide guidelines for regression model selection and data cleaning.
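As an illustration of the kind of experiment the abstract describes, the following is a minimal sketch and not the paper's framework. It injects missing values into training data at several rates, repairs them with mean imputation, and records the test error of a one-feature least-squares fit. The synthetic data, the error rates, the choice of mean imputation, and all function names (`inject_missing`, `impute_mean`, `evaluate`) are assumptions made for this example.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(model, xs, ys):
    """Root mean squared error of the fitted line on (xs, ys)."""
    intercept, slope = model
    squared = [(intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5


def inject_missing(xs, rate, rng):
    """Replace a fraction `rate` of the values with None, chosen at random."""
    n_missing = int(round(rate * len(xs)))
    missing_idx = set(rng.sample(range(len(xs)), n_missing))
    return [None if i in missing_idx else x for i, x in enumerate(xs)]


def impute_mean(xs):
    """Fill None entries with the mean of the observed values."""
    observed = [x for x in xs if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in xs]


def evaluate(rates, seed=0):
    """Return {missing rate: test RMSE} on synthetic data (assumed setup)."""
    rng = random.Random(seed)
    # Hypothetical ground truth: y = 2 + 3x plus Gaussian noise.
    train_x = [rng.uniform(0, 10) for _ in range(200)]
    train_y = [2 + 3 * x + rng.gauss(0, 1) for x in train_x]
    test_x = [rng.uniform(0, 10) for _ in range(100)]
    test_y = [2 + 3 * x + rng.gauss(0, 1) for x in test_x]
    results = {}
    for rate in rates:
        dirty_x = impute_mean(inject_missing(train_x, rate, rng))
        results[rate] = rmse(fit_line(dirty_x, train_y), test_x, test_y)
    return results


results = evaluate([0.0, 0.2, 0.4, 0.6])
for rate, err in results.items():
    print(f"missing rate {rate:.1f}: test RMSE {err:.3f}")
```

The numbers this prints come from synthetic data and say nothing about the paper's findings. Replacing `inject_missing` with a routine that introduces inconsistent or conflicting values, or `fit_line` with another regression model, would follow the same pattern.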
Acknowledgment
This paper was partially supported by NSFC grant U1866602 and the CCF-Huawei Database System Innovation Research Plan CCF-HuaweiDBIR2020007B.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Qi, Z., Wang, H. (2021). Dirty-Data Impacts on Regression Models: An Experimental Evaluation. In: Jensen, C.S., et al. Database Systems for Advanced Applications. DASFAA 2021. Lecture Notes in Computer Science, vol 12681. Springer, Cham. https://doi.org/10.1007/978-3-030-73194-6_6
DOI: https://doi.org/10.1007/978-3-030-73194-6_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73193-9
Online ISBN: 978-3-030-73194-6
eBook Packages: Computer Science (R0)