A Stochastic Alternating Balance k-Means Algorithm for Fair Clustering

  • Conference paper
Learning and Intelligent Optimization (LION 2022)

Abstract

In the application of data clustering to human-centric decision-making systems, such as loan applications and advertisement recommendations, the clustering outcome might discriminate against people across different demographic groups, leading to unfairness. A natural conflict occurs between the cost of clustering (in terms of distance to cluster centers) and the balanced representation of all demographic groups across the clusters, leading to a bi-objective optimization problem that is nonconvex and nonsmooth. To determine the complete trade-off between these two competing goals, we design a novel stochastic alternating balance fair k-means (SAfairKM) algorithm, which consists of alternating classical mini-batch k-means updates and group swap updates. The number of k-means updates and the number of swap updates essentially parameterize the weight put on optimizing each objective function. Our numerical experiments show that the proposed SAfairKM algorithm is robust and computationally efficient in constructing well-spread and high-quality Pareto fronts on both synthetic and real datasets.

L. N. Vicente: Support for this author was partially provided by the Centre for Mathematics of the University of Coimbra under grant FCT/MCTES UIDB/MAT/00324/2020.

Notes

  1. Our implementation code is available at https://github.com/sul217/SAfairKM. All the experiments were conducted on a MacBook Pro with an Intel Core i5 processor.

References

  1. Abbasi, M., Bhaskara, A., Venkatasubramanian, S.: Fair clustering via equitable group representations. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 504–514 (2021)

  2. Abraham, S.S., Sundaram, S.S.: Fairness in clustering with multiple sensitive attributes. arXiv preprint arXiv:1910.05113 (2019)

  3. Ahmadian, S., Epasto, A., Kumar, R., Mahdian, M.: Clustering without over-representation. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 267–275 (2019)

  4. Arthur, D., Vassilvitskii, S.: \(k\)-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035 (2007)

  5. Backurs, A., Indyk, P., Onak, K., Schieber, B., Vakilian, A., Wagner, T.: Scalable fair clustering. In: International Conference on Machine Learning, pp. 405–413. PMLR (2019)

  6. Barocas, S., Selbst, A.D.: Big data's disparate impact. Calif. Law Rev. 104, 671 (2016)

  7. Bera, S., Chakrabarty, D., Flores, N., Negahbani, M.: Fair algorithms for clustering. In: Advances in Neural Information Processing Systems, pp. 4954–4965 (2019)

  8. Berkhin, P.: A survey of clustering data mining techniques. In: Kogan, J., Nicholas, C., Teboulle, M. (eds.) Grouping Multidimensional Data. Springer, Berlin, Heidelberg (2006). https://doi.org/10.1007/3-540-28349-8_2

  9. Bottou, L., Bengio, Y.: Convergence properties of the \(k\)-means algorithms. In: Advances in Neural Information Processing Systems, pp. 585–592 (1995)

  10. Calders, T., Kamiran, F., Pechenizkiy, M.: Building classifiers with independency constraints. In: 2009 IEEE International Conference on Data Mining Workshops, pp. 13–18. IEEE (2009)

  11. Chen, X., Fain, B., Lyu, L., Munagala, K.: Proportionally fair clustering. In: International Conference on Machine Learning, pp. 1032–1041 (2019)

  12. Chierichetti, F., Kumar, R., Lattanzi, S., Vassilvitskii, S.: Fair clustering through fairlets. In: Advances in Neural Information Processing Systems, pp. 5029–5037 (2017)

  13. Datta, A., Tschantz, M.C., Datta, A.: Automated experiments on ad privacy settings: a tale of opacity, choice, and discrimination. Proc. Priv. Enhancing Technol. 2015, 92–112 (2015)

  14. Dua, D., Graff, C.: UCI Machine Learning Repository (2017). http://archive.ics.uci.edu/ml

  15. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.: Fairness through awareness. In: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226. ACM (2012)

  16. Gan, G., Ma, C., Wu, J.: Data Clustering: Theory, Algorithms, and Applications. SIAM, Philadelphia (2020)

  17. Gass, S., Saaty, T.: The computational algorithm for the parametric objective function. Nav. Res. Logist. Q. 2, 39–45 (1955)

  18. Ghadiri, M., Samadi, S., Vempala, S.: Socially fair \(k\)-means clustering. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 438–448 (2021)

  19. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. In: Advances in Neural Information Processing Systems, pp. 3315–3323 (2016)

  20. Huang, L., Jiang, S., Vishnoi, N.: Coresets for clustering with fairness constraints. In: Advances in Neural Information Processing Systems, pp. 7589–7600 (2019)

  21. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: A local search approximation algorithm for \(k\)-means clustering. Comput. Geom. 28, 89–112 (2004)

  22. Kleindessner, M., Awasthi, P., Morgenstern, J.: Fair \(k\)-center clustering for data summarization. In: International Conference on Machine Learning, pp. 3448–3457. PMLR (2019)

  23. Kleindessner, M., Awasthi, P., Morgenstern, J.: A notion of individual fairness for clustering. arXiv preprint arXiv:2006.04960 (2020)

  24. Kleindessner, M., Samadi, S., Awasthi, P., Morgenstern, J.: Guarantees for spectral clustering with fairness constraints. In: International Conference on Machine Learning, pp. 3458–3467. PMLR (2019)

  25. Kohavi, R.: Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 202–207. KDD1996, AAAI Press (1996)

  26. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982)

  27. Mahabadi, S., Vakilian, A.: Individual fairness for \(k\)-clustering. In: Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 6586–6596. PMLR, Virtual (13–18 Jul 2020)

  28. Moro, S., Cortez, P., Rita, P.: A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst. 62, 22–31 (2014)

  29. Rösner, C., Schmidt, M.: Privacy preserving clustering with constraints. In: 45th International Colloquium on Automata, Languages, and Programming. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2018)

  30. Schmidt, M., Schwiegelshohn, C., Sohler, C.: Fair coresets and streaming algorithms for fair \(k\)-means. In: Bampis, E., Megow, N. (eds.) WAOA 2019. LNCS, vol. 11926, pp. 232–251. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39479-0_16

  31. Selim, S.Z., Ismail, M.A.: \(k\)-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-6(1), 81–87 (1984)

  32. Ziko, I.M., Granger, E., Yuan, J., Ayed, I.B.: Variational fair clustering. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 11202–11209 (2021)

Author information

Corresponding author

Correspondence to Suyun Liu.

Appendices

A Description of an Existing Approach for Comparison

The authors in [32] measured the fairness error by the Kullback-Leibler (KL) divergence and added it as a penalty term to the classical clustering objective. When using the k-means clustering cost, the resulting problem takes the form:

$$\begin{aligned} \min ~ f_1(s) + \mu \sum _{k=1}^{K} \mathcal {D}_{KL}(U \Vert \mathbb {P}_k) \quad \text {s.t. } \sum _{k=1}^{K} s_{p, k} = 1, ~ \forall p \in [N], \end{aligned}$$
(6)

where \(\mathcal {D}_{KL}\) is the KL divergence between the desired demographic proportion \(U = [u_j, j \in [J]]\) (usually specified by the demographic composition of the whole dataset) and the marginal probability \(\mathbb {P}_k = [\mathbb {P}(j|k) = s_k^\top v_j/{e_N}^\top s_k, j \in [J]]\). The penalty coefficient \(\mu \) associated with the fairness error is the tool to control the trade-offs between the clustering cost and the clustering balance. To solve problem (6) for a fixed \(\mu \ge 0\), the authors in [32] have developed an optimization scheme based on a concave-convex decomposition of the fairness term.
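To make the penalized objective in (6) concrete, the following is a minimal sketch that only evaluates it for a given hard assignment; it is not the concave-convex optimization scheme of [32]. Several things here are assumptions for illustration: the function names, the matrix layout (`S` stacks the assignment vectors \(s_k\) as columns, `V` stacks the group indicator vectors \(v_j\) as columns), the small constant `eps` that guards against \(\log 0\) when a group is absent from a cluster, and the form of \(f_1\), taken here as the sum of squared distances to cluster means.

```python
import numpy as np


def kmeans_cost(X, S):
    """Assumed form of f_1: sum of squared distances to cluster means.

    X: (N, d) data matrix; S: (N, K) hard assignment matrix (one 1 per row).
    """
    X = np.asarray(X, dtype=float)
    S = np.asarray(S, dtype=float)
    sizes = np.maximum(S.sum(axis=0), 1.0)      # avoid division by zero for empty clusters
    centers = (S.T @ X) / sizes[:, None]        # (K, d) cluster means
    assigned = S @ centers                      # (N, d) center of each point's cluster
    return float(np.sum((X - assigned) ** 2))


def kl_fairness_penalty(S, V, U, eps=1e-12):
    """Sum over clusters k of D_KL(U || P_k), with P(j|k) = s_k^T v_j / e_N^T s_k.

    V: (N, J) group-membership matrix; U: (J,) target proportions summing to 1.
    eps is an illustrative smoothing constant, not part of the source formulation.
    """
    S = np.asarray(S, dtype=float)
    V = np.asarray(V, dtype=float)
    U = np.asarray(U, dtype=float)
    cluster_sizes = S.sum(axis=0)               # e_N^T s_k for each cluster k
    group_counts = S.T @ V                      # entry (k, j) equals s_k^T v_j
    P = group_counts / np.maximum(cluster_sizes, eps)[:, None]
    # D_KL(U || P_k) = sum_j u_j * (log u_j - log P(j|k))
    kl_per_cluster = np.sum(U * (np.log(U + eps) - np.log(P + eps)), axis=1)
    return float(kl_per_cluster.sum())


def penalized_objective(X, S, V, U, mu):
    """Objective of problem (6) for a fixed penalty coefficient mu >= 0."""
    return kmeans_cost(X, S) + mu * kl_fairness_penalty(S, V, U)
```

Under these assumptions, a clustering in which every cluster reproduces the target proportions \(U\) gives a penalty of zero, while a cluster that excludes a group entirely gives a large penalty whose size is governed by `eps`. Sweeping `mu` over a grid and minimizing for each value is how a penalty approach of this kind traces out trade-off points.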

B More Numerical Results

Fig. 2. Demographic composition of four synthetic datasets.

Fig. 3. Syn_equal_ds1 data: SAfairKM: 400 iterations, 10 starting labels, and 3 pairs of \((n_a, n_b)\); VfairKM: \(\mu _{\max } = 202\).

Fig. 4. Syn_unequal_ds1 data: SAfairKM: 400 iterations, 10 starting labels, and 3 pairs of \((n_a, n_b)\); VfairKM: \(\mu _{\max } = 223\).

Fig. 5. Syn_equal_ds2 data: SAfairKM: 400 iterations, 10 starting labels, and 3 pairs of \((n_a, n_b)\); VfairKM: \(\mu _{\max } = 60\).

Fig. 6. Syn_unequal_ds2 data: SAfairKM: 400 iterations, 10 starting labels, and 3 pairs of \((n_a, n_b)\); VfairKM: \(\mu _{\max } = 0\).

Fig. 7. Pareto fronts for \(K = 5\): SAfairKM: 2500 iterations for Adult and 1500 iterations for Bank, 30 starting labels, and 4 pairs of \((n_a, n_b)\); VfairKM: \(\mu _{\max } = 6190\) for Adult and \(\mu _{\max } = 4790\) for Bank.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, S., Vicente, L.N. (2022). A Stochastic Alternating Balance k-Means Algorithm for Fair Clustering. In: Simos, D.E., Rasskazova, V.A., Archetti, F., Kotsireas, I.S., Pardalos, P.M. (eds) Learning and Intelligent Optimization. LION 2022. Lecture Notes in Computer Science, vol 13621. Springer, Cham. https://doi.org/10.1007/978-3-031-24866-5_6

  • DOI: https://doi.org/10.1007/978-3-031-24866-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24865-8

  • Online ISBN: 978-3-031-24866-5

  • eBook Packages: Computer Science, Computer Science (R0)
