Preprocessing Matters: Automated Pipeline Selection for Fair Classification

  • Conference paper
Modeling Decisions for Artificial Intelligence (MDAI 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13890)

Abstract

Improving fairness by manipulating the preprocessing stages of classification pipelines is an active area of research, closely related to AutoML. We propose a genetic optimisation algorithm, FairPipes, which optimises for user-defined combinations of fairness and accuracy and for multiple definitions of fairness, providing flexibility in the fairness-accuracy trade-off. FairPipes heuristically searches a large space of pipeline configurations, efficiently reaching near-optimal solutions and presenting the user with an estimate of the solutions’ Pareto front. We also observe that the optimal pipelines differ between datasets, suggesting that no “universal best” pipeline exists and confirming that FairPipes fills a niche in the fairness-aware AutoML space.
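
To make the two ideas in the abstract concrete, the following is a minimal sketch of (a) a user-weighted combination of fairness and accuracy and (b) a Pareto-front estimate over evaluated pipelines. It is not the FairPipes implementation: the names `weighted_fitness`, `dominates`, `pareto_front` and `fairness_weight`, the linear weighting, and the scores in `evaluated` are all illustrative assumptions, with both objectives treated as "higher is better".

```python
from typing import List, Tuple

# A pipeline's evaluation as (fairness, accuracy); both assumed "higher is better".
Score = Tuple[float, float]


def weighted_fitness(score: Score, fairness_weight: float) -> float:
    """Scalarise a (fairness, accuracy) pair with a user-chosen weight in [0, 1].

    The linear form is an assumption made for illustration only.
    """
    fairness, accuracy = score
    return fairness_weight * fairness + (1.0 - fairness_weight) * accuracy


def dominates(a: Score, b: Score) -> bool:
    """True if a is at least as good as b on both objectives and differs from b."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def pareto_front(scores: List[Score]) -> List[Score]:
    """Return the non-dominated scores, de-duplicated and sorted by fairness."""
    front = [s for s in scores if not any(dominates(t, s) for t in scores)]
    return sorted(set(front))


# Hypothetical evaluation results for five pipeline configurations.
evaluated = [(0.70, 0.85), (0.90, 0.80), (0.95, 0.70), (0.80, 0.75), (0.60, 0.84)]

front = pareto_front(evaluated)
best_balanced = max(evaluated, key=lambda s: weighted_fitness(s, 0.5))

print("Pareto front:", front)
print("Best at equal weighting:", best_balanced)
```

Changing `fairness_weight` moves the selected pipeline along the front, which is one simple way to picture the flexibility in the fairness-accuracy trade-off that the abstract describes.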

Notes

  1. FairPipes is available at https://github.com/vladoxNCL/fairPipes.

Author information

Corresponding author

Correspondence to Julián Salas.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

González-Zelaya, V., Salas, J., Prangle, D., Missier, P. (2023). Preprocessing Matters: Automated Pipeline Selection for Fair Classification. In: Torra, V., Narukawa, Y. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2023. Lecture Notes in Computer Science, vol 13890. Springer, Cham. https://doi.org/10.1007/978-3-031-33498-6_14

  • DOI: https://doi.org/10.1007/978-3-031-33498-6_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-33497-9

  • Online ISBN: 978-3-031-33498-6

  • eBook Packages: Computer Science, Computer Science (R0)
