Log in

A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering Algorithms

  • Survey
  • Published:
Journal of Computer Science and Technology Aims and scope Submit manuscript

Abstract

The advent of Big Data has led to the rapid growth in the usage of parallel clustering algorithms that work over distributed computing frameworks such as MPI, MapReduce, and Spark. An important step for any parallel clustering algorithm is the distribution of data amongst the cluster nodes. This step governs the methodology and performance of the entire algorithm. Researchers typically use random, or a spatial/geometric distribution strategy like kd-tree based partitioning and grid-based partitioning, as per the requirements of the algorithm. However, these strategies are generic and are not tailor-made for any specific parallel clustering algorithm. In this paper, we give a very comprehensive literature survey of MPI-based parallel clustering algorithms with special reference to the specific data distribution strategies they employ. We also propose three new data distribution strategies namely Parameterized Dimensional Split for parallel density-based clustering algorithms like DBSCAN and OPTICS, Cell-Based Dimensional Split for dGridSLINK, which is a grid-based hierarchical clustering algorithm that exhibits efficiency for disjoint spatial distribution, and Projection-Based Split, which is a generic distribution strategy. All of these preserve spatial locality, achieve disjoint partitioning, and ensure good data load balancing. The experimental analysis shows the benefits of using the proposed data distribution strategies for algorithms they are designed for, based on which we give appropriate recommendations for their usage.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or Ebook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (Germany)

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Tan P N, Steinbach M, Kumar V. Introduction to Data Mining. Addison-Wesley Longman Publishing Co., Inc., 2005.

  2. MacQueen J. Some methods for classification and analysis of multivariate observations. In Proc. the 5th Berkeley Symposium on Mathematical Statistics and Probability, Jan. 1967, pp.281–297.

  3. Park H S, Jun C H. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 2009, 36(2): 3336–3341. DOI: https://doi.org/10.1016/j.eswa.2008.01.039.

    Article  Google Scholar 

  4. Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. Technical Report, TR 00-034, University of Minnesota, 2000. https://conservancy.umn.edu/handle/11299/215421, Mar. 2024.

  5. Ester M, Kriegel H P, Sander J, Xu X W. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. the 2nd International Conference on Knowledge Discovery and Data Mining, Aug. 1996, pp.226–231. DOI: https://doi.org/10.5555/3001460.3001507.

  6. Ankerst M, Breunig M M, Kriegel H P, Sander J. OPTICS: Ordering points to identify the clustering structure. In Proc. the 1999 ACM SIGMOD International Conference on Management of Data, Jun. 1999, pp.49–60. DOI: https://doi.org/10.1145/304182.304187.

  7. Jarvis R A, Patrick E. Clustering using a similarity measure based on shared near neighbors. IEEE Trans. Computers, 1973, C-22(11): 1025–1034. DOI: https://doi.org/10.1109/T-C.1973.223640.

    Article  Google Scholar 

  8. Hinneburg A, Keim D A. An efficient approach to clustering in large multimedia databases with noise. In Proc. the 4th Int. Conf. Knowledge Discovery and Data Mining, Aug. 1998, pp.58–65. DOI: https://doi.org/10.5555/3000292.3000302.

  9. Sibson R. SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 1973, 16(1): 30–34. DOI: https://doi.org/10.1093/comjnl/16.1.30.

    Article  MathSciNet  Google Scholar 

  10. Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Record, 1998, 27(2): 94–105. DOI: https://doi.org/10.1145/276305.276314.

    Article  Google Scholar 

  11. Goil S, Nagesh H, Choudhary A. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report, CPDC-TR-9906-010, Northwestern University, 1999. https://grid.cs.gsu.edu/∼wkim/indesfiles/papers/mafia.pdf, Mar. 2024.

  12. Cheng C H, Fu A W, Zhang Y. Entropy-based subspace clustering for mining numerical data. In Proc. the 5th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Aug. 1999, pp.84–93. DOI: https://doi.org/10.1145/312129.312199.

  13. Aggarwal C C, Wolf J L, Yu P S, Procopiuc C, Park J S. Fast algorithms for projected clustering. In Proc. the 1999 ACM SIGMOD Int. Conf. Management of Data, Jun. 1999, pp.61–72. DOI: https://doi.org/10.1145/304182.304188.

  14. Aggarwal C C, Yu P S. Finding generalized projected clusters in high dimensional spaces. In Proc. the 2000 ACM SIGMOD Int. Conf. Management of Data, May 2000, pp.70–81. DOI: https://doi.org/10.1145/342009.335383.

  15. Woo K G, Lee J H, Kim M H, Lee Y J. FINDIT: A fast and intelligent subspace clustering algorithm using dimension voting. Information and Software Technology, 2004, 46(4): 255–271. DOI: https://doi.org/10.1016/j.infsof.2003.07.003.

    Article  Google Scholar 

  16. Wang W, Yang J, Muntz R R. STING: A statistical information grid approach to spatial data mining. In Proc. the 23rd Int. Conf. Very Large Data Bases. Aug. 1997, pp.186–195. DOI: https://doi.org/10.5555/645923.758369.

  17. Mukhopadhyay A, Maulik U. Unsupervised satellite image segmentation by combining SA based fuzzy clustering with support vector machine. In Proc. the 7th Int. Conf. Advances in Pattern Recognition, Feb. 2009, pp.381–384. DOI: https://doi.org/10.1109/ICAPR.2009.50.

  18. Thang T M, Kim J. The anomaly detection by using DBSCAN clustering with multiple parameters. In Proc. the 2011 Int. Conf. Information Science and Applications, Apr. 2011. DOI: https://doi.org/10.1109/ICISA.2011.5772437.

  19. Madeira S C, Oliveira A L. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Computational Biology and Bioinformatics, 2004, 1(1): 24–45. DOI: https://doi.org/10.1109/TCBB.2004.2.

    Article  Google Scholar 

  20. Huo S. Detecting self-correlation of nonlinear, lognormal, time-series data via DBSCAN clustering method, using stock price data as example [Ph. D. Thesis]. Ohio State University, Columbus, 2011.

    Google Scholar 

  21. Zhang J, Wu G Q, Hu X G, Li S Y, Hao S L. A parallel k-means clustering algorithm with MPI. In Proc. the 4th International Symposium on Parallel Architectures, Algorithms and Programming, Dec. 2011, pp.60–64. DOI: https://doi.org/10.1109/PAAP.2011.17.

  22. Kumari S, Maheshwari A, Goyal P, Goyal N. Parallel framework for efficient k-means clustering. In Proc. the 8th Annual ACM India Conference, Oct. 2015, pp.63–71. DOI: https://doi.org/10.1145/2835043.2835060.

  23. Song H, Lee J G, Han W S. PAMAE: Parallel k-medoids clustering with high accuracy and efficiency. In Proc. the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2017, pp.1087–1096. DOI: https://doi.org/10.1145/3097983.3098098.

  24. Patwary M A, Palsetia D, Agrawal A, Liao W K, Manne F, Choudhary A. A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In Proc. the 2012 International Conference on High Performance Computing, Networking, Storage and Analysis, Nov. 2012, Article No. 62. DOI: https://doi.org/10.5555/2388996.2389081.

  25. Patwary M M A, Satish N, Sundaram N, Manne F, Habib S, Dubey P. Pardicle: Parallel approximate density-based clustering. In Proc. the 2014 Int. Conf. for High Performance Computing, Networking, Storage and Analysis, Nov. 2014, pp.560–571. DOI: https://doi.org/10.1109/SC.2014.51.

  26. Patwary M M A, Byna S, Satish N R et al. BD-CATS: Big data clustering at trillion particle scale. In Proc. the 2015 Int. Conf. for High Performance Computing, Networking, Storage and Analysis, Nov. 2015, Article No. 6. DOI: https://doi.org/10.1145/2807591.2807616.

  27. Götz M, Bodenstein C, Riedel M. HPDBSCAN: Highly parallel DBSCAN. In Proc. the 2015 Workshop on Machine Learning in High-Performance Computing Environments, Nov. 2015, Article No. 2. DOI: https://doi.org/10.1145/2834892.2834894.

  28. Kumari S, Goyal P, Sood A, Kumar D, Balasubramaniam S, Goyal N. Exact, fast and scalable parallel DBSCAN for commodity platforms. In Proc. the 18th Int. Conf. Distributed Computing and Networking, Jan. 2017, Article No. 14. DOI: https://doi.org/10.1145/3007748.3007773.

  29. Song H, Lee J G. RP-DBSCAN: A superfast parallel DBSCAN algorithm based on random partitioning. In Proc. the 2018 Int. Conf. Management of Data, May 2018, pp.1173–1187. DOI: https://doi.org/10.1145/3183713.3196887.

  30. Sarma A, Goyal P, Kumari S, Wani A, Challa J S, Islam S, Goyal N. μDBSCAN: An exact scalable DBSCAN algorithm for big data exploiting spatial locality. In Proc. the 2019 IEEE International Conference on Cluster Computing, Sept. 2019. DOI: https://doi.org/10.1109/CLUSTER.2019.8891020.

  31. Nazerzadeh H, Ghodsi M, Sadjadian S. Parallel sub-space clustering. In Proc. the 10th Annual Conference of Computer Society of Iran, Feb. 2005.

  32. Adinetz A, Kraus J, Meinke J, Pleiter D. GPUMAFIA: Efficient subspace clustering with MAFIA on GPUs. In Proc. the 19th Int. Conf. Parallel Processing, Aug. 2013, pp.838–849. DOI: https://doi.org/10.1007/978-3-642-40047-6_83.

  33. Goyal P, Kumari S, Singh S, Kishore V, Balasubramaniam S S, Goyal N. A parallel framework for grid-based bottom-up subspace clustering. In Proc. the 2016 IEEE Int. Conf. Data Science and Advanced Analytics, Oct. 2016, pp.331–340. DOI: https://doi.org/10.1109/DSAA.2016.42.

  34. Hendrix W, Palsetia D, Patwary M M A, Agrawal A, Liao W K, Choudhary A. A scalable algorithm for single-linkage hierarchical clustering on distributed-memory architectures. In Proc. the 2013 IEEE Symposium on Large-Scale Data Analysis and Visualization, Oct. 2013, pp.7–13. DOI: https://doi.org/10.1109/LDAV.2013.6675153.

  35. Goyal P, Kumari S, Sharma S, Kumar D, Kishore V, Balasubramaniam S, Goyal N. A fast, scalable SLINK algorithm for commodity cluster computing exploiting spatial locality. In Proc. the 18th Int. Conf. High Performance Computing and Communications, Dec. 2016, pp.268–275. DOI: https://doi.org/10.1109/HPCC-SmartCity-DSS.2016.0047.

  36. Hendrix W, Patwary M M A, Agrawal A, Liao W K, Choudhary A. Parallel hierarchical clustering on shared memory platforms. In Proc. the 19th International Conference on High Performance Computing, Dec. 2012. DOI: https://doi.org/10.1109/HiPC.2012.6507511.

  37. Olman V, Mao F L, Wu H W, Xu Y. Parallel clustering algorithm for large data sets with applications in bioinformatics. IEEE/ACM Trans. Computational Biology and Bioinformatics, 2009, 6(2): 344–352. DOI: https://doi.org/10.1109/TCBB.2007.70272.

    Article  Google Scholar 

  38. Patwary M A, Palsetia D, Agrawal A, Liao W K, Manne F, Choudhary A. Scalable parallel OPTICS data clustering using graph algorithmic techniques. In Proc. the 2013 International Conference on High Performance Computing, Networking, Storage and Analysis, Nov. 2013, Article No. 49. DOI: https://doi.org/10.1145/2503210.2503255.

  39. Goyal P, Kumari S, Kumar D, Balasubramaniam S, Goyal N, Islam S, Challa J S. Parallelizing OPTICS for commodity clusters. In Proc. the 16th International Conference on Distributed Computing and Networking, Jan. 2015, Article No. 33. DOI: https://doi.org/10.1145/2684464.2684477.

  40. Kumari S, Maurya S, Goyal P, Balasubramaniam S S, Goyal N. Scalable parallel algorithms for shared nearest neighbor clustering. In Proc. the 23rd International Conference on High Performance Computing, Dec. 2016, pp.72–81. DOI: https://doi.org/10.1109/HiPC.2016.018.

  41. Challa J S, Goyal P, Nikhil S, Mangla A, Balasubramaniam S S, Goyal N. DD-Rtree: A dynamic distributed data structure for efficient data distribution among cluster nodes for spatial data mining algorithms. In Proc. the 2016 IEEE International Conference on Big Data, Dec. 2016, pp.27–36. DOI: https://doi.org/10.1109/BigData.2016.7840586.

  42. Welton B, Miller B P. Mr. Scan: A hybrid/hybrid extreme scale density based clustering algorithm. Technical Report, Northwestern University, 2015. https://www.paradyn.org/papers/Welton15MrScan.pdf, Mar. 2024.

  43. Dhillon I S, Modha D S. A data-clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, Zaki M J, Ho C T (eds.), Springer-Verlag, 2000, pp.245–260. DOI: https://doi.org/10.1007/3-540-46502-2_13.

  44. Zhang J, Wu G Q, Hu X G, Li S Y, Hao S L. A parallel clustering algorithm with MPI-MKmeans. Journal of Computers, 2013, 8(1): 10–17. DOI: https://doi.org/10.4304/jcp.8.1.10-17.

    Article  Google Scholar 

  45. Kumar J, Mills R T, Hoffman F M, Hargrove W W. Parallel k-means clustering for quantitative ecoregion delineation using large data sets. Procedia Computer Science, 2011, 4: 1602–1611. DOI: https://doi.org/10.1016/j.procs.2011.04.173.

    Article  Google Scholar 

  46. Kerdprasop K, Taokok S, Kerdprasop N. Declarative parallelized techniques for K-means data clustering. International Journal of Mathematics and Computers in Simulation, 2012, 6(5): 483–495.

    Google Scholar 

  47. Balcan M F, Ehrlich S, Liang Y Y. Distributed k-means and k-median clustering on general topologies. In Proc. the 26th International Conference on Neural Information Processing Systems, Dec. 2013, pp.1995–2003. DOI: https://doi.org/10.5555/2999792.2999835.

  48. Gursoy A. Data decomposition for parallel K-means clustering. In Proc. the 5th International Conference on Parallel Processing and Applied Mathematics, Sept. 2003, pp.241–248. DOI: https://doi.org/10.1007/978-3-540-24669-5_31.

  49. Di Fatta G, Pettinger D. Dynamic load balancing in parallel KD-tree k-means. In Proc. the 10th IEEE Int. Conf. Computer and Information Technology, Jul. 2010, pp.2478–2485. DOI: https://doi.org/10.1109/CIT.2010.424.

  50. Arbelaez A, Quesada L. Parallelising the k-Medoids clustering problem using space-partitioning. In Proc. the 6th International Symposium on Combinatorial Search, Jul. 2013, pp.20–28. DOI: https://doi.org/10.1609/socs.v4i1.18282.

  51. Li Y J, Chung S M. Parallel bisecting k-means with prediction clustering algorithm. The Journal of Supercomputing, 2007, 39(1): 19–37. DOI: https://doi.org/10.1007/s11227-0060002-7.

    Article  Google Scholar 

  52. Xu X W, Jäger J, Kriegel H P. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery, 1999, 3(3): 263–290. DOI: https://doi.org/10.1023/A:1009884809343.

    Article  Google Scholar 

  53. Zhou A Y, Zhou S G, Cao J, Fan Y, Hu Y F. Approaches for scaling DBSCAN algorithm to large spatial databases. Journal of Computer Science and Technology, 2000, 15(6): 509–526. DOI: https://doi.org/10.1007/BF02948834.

    Article  Google Scholar 

  54. Arlia D, Coppola M. Experiments in parallel clustering with DBSCAN. In Proc. the 7th International Euro-Par Conference Manchester on Parallel Processing, Aug. 2001, pp.326–331. DOI: https://doi.org/10.5555/646666.699596.

  55. Coppola M, Vanneschi M. High-performance data mining with skeleton-based structured parallel programming. Parallel Computing, 2002, 28(5): 793–813. DOI: https://doi.org/10.1016/S0167-8191(02)00095-9.

    Article  Google Scholar 

  56. Brecheisen S, Kriegel H P, Pfeifle M. Parallel density-based clustering of complex objects. In Proc. the 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Apr. 2006, pp.179–188. DOI: https://doi.org/10.1007/11731139_22.

  57. Chen M, Gao X D, Li H F. Parallel DBSCAN with priority R-tree. In Proc. the 2nd IEEE International Conference on Information Management and Engineering, Apr. 2010, pp.508–511. DOI: https://doi.org/10.1109/ICIME.2010.5477926.

  58. Yang K Y, Gao Y J, Ma R, Chen L, Wu S, Chen G. DBSCAN-MS: Distributed density-based clustering in metric spaces. In Proc. the 35th International Conference on Data Engineering, Apr. 2019, pp.1346–1357. DOI: https://doi.org/10.1109/ICDE.2019.00122.

  59. Rajasekaran S. Efficient parallel hierarchical clustering algorithms. IEEE Trans. Parallel and Distributed Systems, 2005, 16(6): 497–502. DOI: https://doi.org/10.1109/TPDS.2005.72.

    Article  Google Scholar 

  60. Dash M, Petrutiu S, Scheuermann P. pPOP: Fast yet accurate parallel hierarchical clustering using partitioning. Data & Knowledge Engineering, 61 (3): 563–578. DOI: https://doi.org/10.1016/j.datak.2006.07.004.

  61. Nagesh H S, Goil S, Choudhary A. A scalable parallel subspace clustering algorithm for massive data sets. In Proc. the 2000 International Conference on Parallel Processing, Aug. 2000, pp.477–484. DOI: https://doi.org/10.1109/ICPP.2000.876164.

  62. Bradley P S, Mangasarian O L, Street W N. Clustering via concave minimization. In Proc. the 9th International Conference on Neural Information Processing Systems, Dec. 1996, pp.368–374. DOI: https://doi.org/10.5555/2998981.2999033.

  63. Deb B, Srirama S N. Parallel K-Means clustering for gene expression data on SNOW. International Journal of Computer Applications, 2013, 71(24): 26–30. DOI: https://doi.org/10.5120/12691-9486.

    Google Scholar 

  64. Torti E, Florimbi G, Castelli F, Ortega S, Fabelo H, Callicó G M, Marrero-Martin M, Leporati F. Parallel K-means clustering for brain cancer detection using hyper-spectral images. Electronics, 2018, 7(11): 283. DOI: https://doi.org/10.3390/electronics7110283.

    Article  Google Scholar 

  65. Sardar T H, Ansari Z. An analysis of MapReduce efficiency in document clustering using parallel K-means algorithm. Future Computing and Informatics Journal, 2018, 3(2): 200–209. DOI: https://doi.org/10.1016/j.fcij.2018.03.003.

    Article  Google Scholar 

  66. Zhou G J. Improved optimization of canopy-Kmeans clustering algorithm based on Hadoop platform. In Proc. the 2018 International Conference on Information Technology and Electrical Engineering, Dec. 2018, Article No. 19. DOI: https://doi.org/10.1145/3148453.3306258.

  67. Megarchioti S, Mamalis B. The BigKClustering approach for document clustering using Hadoop MapReduce. In Proc. the 22nd Pan-Hellenic Conference on Informatics, Nov. 2018, pp.261–266. DOI: https://doi.org/10.1145/3291533.3291546.

  68. Bousbaci A, Kamel N. Efficient data distribution and results merging for parallel data clustering in MapReduce environment. Applied Intelligence, 2018, 48(8): 2408–2428. DOI: https://doi.org/10.1007/s10489-017-1089-7.

    Article  Google Scholar 

  69. Santhi V, Jose R. Performance analysis of parallel K-means with optimization algorithms for clustering on Spark. In Proc. the 14th International Conference on Distributed Computing and Internet Technology, Jan. 2018, pp.158–162. DOI: https://doi.org/10.1007/978-3-319-72344-0_12.

  70. Chitrakar A S, Petrović S. Efficient k-means using triangle inequality on spark for cyber security analytics. In Proc. the 2019 ACM International Workshop on Security and Privacy Analytics, Mar. 2019, pp.37–45. DOI: https://doi.org/10.1145/3309182.3309187.

  71. Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S. Scalable k-means++. Proceedings of the VLDB Endowment, 2012, 5(7): 622–633. DOI: https://doi.org/10.14778/2180912.2180915.

    Article  Google Scholar 

  72. Shafiq M O, Torunski E. A parallel K-Medoids algorithm for clustering based on MapReduce. In Proc. the 15th Int. Conf. Machine Learning and Applications, Dec. 2016, pp.502–507. DOI: https://doi.org/10.1109/ICMLA.2016.0089.

  73. Yue X, Man W, Yue J, Liu G C. Parallel K-Medoids++ spatial clustering algorithm based on MapReduce. ar**v: 1608.06861, 2016. https://doi.org/10.48550/ar**v.1608.06861, Mar. 2024.

  74. Martino A, Rizzi A, Frattale Mascioli F M. Efficient approaches for solving the large-scale k-medoids problem: Towards structured data. In Proc. the 9th International Joint Conference on Computational Intelligence, Nov. 2017, pp.199–219. DOI: https://doi.org/10.1007/978-3-030-16469-0_11.

  75. Beckmann N, Kriegel H P, Schneider R, Seeger B. The R*-tree: An efficient and robust access method for points and rectangles. In Proc. the 1990 ACM SIGMOD International Conference on Management of Data, May 1990, pp.322–331. DOI: https://doi.org/10.1145/93597.98741.

  76. Goyal P, Challa J S, Kumar D, Balasubramaniam S, Goyal N. Grid-R-tree: A data structure for efficient neighborhood and nearest neighbor queries in data mining. International Journal of Data Science and Analytics, 2020, 10(1): 25–47. DOI: https://doi.org/10.1007/s41060-020-00208-2.

    Article  Google Scholar 

  77. Chen L, Gao Y J, Huang X R, Jensen C S, Zheng B L. Efficient distributed clustering algorithms on star-schema heterogeneous graphs. IEEE Trans. Knowledge and Data Engineering, 2022, 34(10): 4781–4796. DOI: https://doi.org/10.1109/TKDE.2020.3047631.

    Article  Google Scholar 

  78. Andrade G, Ramos G, Madeira D, Sachetto R, Ferreira R, Rocha L. G-DBSCAN: A GPU accelerated algorithm for density-based clustering. Procedia Computer Science, 2013, 18: 369–378. DOI: https://doi.org/10.1016/j.procs.2013.05.200.

    Article  Google Scholar 

  79. Chen C C, Chen M S. HiClus: Highly scalable density-based clustering with heterogeneous cloud. Procedia Computer Science, 2015, 53: 149–157. DOI: https://doi.org/10.1016/j.procs.2015.07.289.

    Article  Google Scholar 

  80. Hu X J, Liu L, Qiu N J, Yang D, Li M. A MapReduce-based improvement algorithm for DBSCAN. Journal of Algorithms & Computational Technology, 2018, 12(1): 53–61. DOI: https://doi.org/10.1177/1748301817735665.

    Article  MathSciNet  Google Scholar 

  81. Gu Y H, Ye X Y, Zhang F, Du Z H, Liu R Y, Yu L F. A parallel varied density-based clustering algorithm with optimized data partition. Journal of Spatial Science, 2018, 63(1): 93–114. DOI: https://doi.org/10.1080/14498596.2017.1352542.

    Article  Google Scholar 

  82. Han D W, Agrawal A, Liao W K, Choudhary A. A novel scalable DBSCAN algorithm with Spark. In Proc. the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, May 2016, pp.1393–1402. DOI: https://doi.org/10.1109/IPDPSW.2016.57.

  83. Huang F, Zhu Q, Zhou J, Tao J, Zhou X C, ** D, Tan X C, Wang L Z. Research on the parallelization of the DBSCAN clustering algorithm for spatial data mining based on the spark platform. Remote Sensing, 2017, 9(12): 1301. DOI: https://doi.org/10.3390/rs9121301.

    Article  Google Scholar 

  84. Zhang Y F, Chen S M, Yu G. Efficient distributed density peaks for clustering large data sets in MapReduce. IEEE Trans. Knowledge and Data Engineering, 2016, 28(12): 3218–3230. DOI: https://doi.org/10.1109/TKDE.2016.2609423.

    Article  Google Scholar 

  85. Guttman A. R-trees: A dynamic index structure for spatial searching. In Proc. the 1984 ACM SIGMOD International Conference on Management of Data, Jun. 1984, pp.47–57. DOI: https://doi.org/10.1145/602259.602266.

  86. Ertöz L, Steinbach M, Kumar V. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proc. the 2003 SIAM International Conference on Data Mining, Jan. 2003, pp.47–58. DOI: https://doi.org/10.1137/1.9781611972733.5.

  87. Cao Z W, Zhou Y. Parallel text clustering based on MapReduce. In Proc. the 2nd International Conference on Cloud and Green Computing, Nov. 2012, pp.226–229. DOI: https://doi.org/10.1109/CGC.2012.128.

  88. Wang S J, Eick C F. MR-SNN: Design of parallel shared nearest neighbor clustering algorithm using MapReduce. In Proc. the 2nd International Conference on Big Data Analysis, Mar. 2017, pp.312–315. DOI: https://doi.org/10.1109/ICBDA.2017.8078831.

  89. Gagolewski M, Bartoszuk M, Cena A. Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm. Information Sciences, 2016, 363: 8–23. DOI: https://doi.org/10.1016/j.ins.2016.05.003.

    Article  Google Scholar 

  90. Li X. Parallel algorithms for hierarchical clustering and cluster validity. IEEE Trans. Pattern Analysis and Machine Intelligence, 1990, 12(11): 1088–1092. DOI: https://doi.org/10.1109/34.61708.

    Article  Google Scholar 

  91. Wu C H, Horng S J, Tsai H R. Efficient parallel algorithms for hierarchical clustering on arrays with reconfigurable optical buses. Journal of Parallel and Distributed Computing, 2000, 60(9): 1137–1153. DOI: https://doi.org/10.1006/jpdc.2000.1644.

    Article  Google Scholar 

  92. Du Z, Lin F. A novel parallelization approach for hierarchical clustering. Parallel Computing, 2005, 31(5): 523–527. DOI: https://doi.org/10.1016/j.parco.2005.01.001.

    Article  Google Scholar 

  93. Johnson E, Kargupta H. Collective, hierarchical clustering from distributed, heterogeneous data. In Proc. the 2000 Large-Scale Parallel Data Mining, Feb. 2000, pp.221–244. DOI: https://doi.org/10.1007/3-540-46502-2_12.

  94. Olson C F. Parallel algorithms for hierarchical clustering. Parallel Computing, 1995, 21(8): 1313–1325. DOI: https://doi.org/10.1016/0167-8191(95)00017-I.

    Article  MathSciNet  Google Scholar 

  95. Dash M, Liu H, Scheuermann P, Tan K L. Fast hierarchical clustering and its validation. Data & Knowledge Engineering, 2003, 44(1): 109–138. DOI: https://doi.org/10.1016/S0169-023X(02)00138-6.

    Article  Google Scholar 

  96. ** C, Liu R Q, Chen Z Z, Hendrix W, Agrawal A, Choudhary A. A scalable hierarchical clustering algorithm using Spark. In Proc. the 1st Int. Conf. Big Data Computing Service and Applications, Mar. 30–Apr. 2, 2015, pp.418–426. DOI: https://doi.org/10.1109/BigDataService.2015.67.

  97. Mazzeo G, Zanilo C. The parallelization of a complex hierarchical clustering algorithm: Faster unsupervised learning on larger data sets. Technical Report, University of California, Los Angeles, 2016..

  98. Wang Y, Narasayya V, He Y Y, Chaudhuri S. PACk: An efficient partition-based distributed agglomerative hierarchical clustering algorithm for deduplication. Proceedings of the VLDB Endowment, 2022, 15(6): 1132–1145. DOI: https://doi.org/10.14778/3514061.3514062.

    Article  Google Scholar 

  99. Yang J, Wang W, Wang H X, Yu P. δ-Clusters: Capturing subspace correlation in a large data set. In Proc. the 18th Int. Conf. Data Engineering, Feb. 26–Mar. 1, 2002, pp.517–528. DOI: https://doi.org/10.1109/ICDE.2002.994771.

  100. Friedman J H, Meulman J J. Clustering objects on subsets of attributes. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2004, 66(4): 815–849. DOI: https://doi.org/10.1111/j.1467-9868.2004.02059.x.

    Article  MathSciNet  Google Scholar 

  101. Domeniconi C, Papadopoulos D, Gunopulos D, Ma S. Subspace clustering of high dimensional data. In Proc. the 2004 SIAM Int. Conf. Data Mining, Apr. 2004, pp.517–521. DOI: https://doi.org/10.1137/1.9781611972740.58.

  102. Sequeira K, Zaki M. SCHISM: A new approach for interesting subspace mining. In Proc. the 4th IEEE International Conference on Data Mining, Nov. 2004, pp.186–193. DOI: https://doi.org/10.1109/ICDM.2004.10099.

  103. Chang J W, ** D S. A new cell-based clustering method for large, high-dimensional data in data mining applications. In Proc. the 2002 ACM Symposium on Applied Computing, Mar. 2002, pp.503–507. DOI: https://doi.org/10.1145/508791.508886.

  104. Kailing K, Kriegel H P, Kröger P. Density-connected subspace clustering for high-dimensional data. In Proc. the 4th SIAM International Conference on Data Mining, Apr. 2004, pp.246–256. DOI: https://doi.org/10.1137/1.9781611972740.23.

  105. Kriegel H P, Kroger P, Renz M, Wurst S. A generic framework for efficient subspace clustering of high-dimensional data. In Proc. the 5th IEEE Int. Conf. Data Mining, Nov. 2005, pp.250–257. DOI: https://doi.org/10.1109/ICDM.2005.5.

  106. Assent I, Krieger R, Müller E, Seidl T. DUSC: Dimensionality unbiased subspace clustering. In Proc. the 7th IEEE International Conference on Data Mining, Oct. 2007, pp.409–414. DOI: https://doi.org/10.1109/ICDM.2007.49.

  107. Assent I, Krieger R, Müller E, Seidl T. INSCY: Indexing subspace clusters with in-process-removal of redundancy. In Proc. the 8th IEEE Int. Conf. Data Mining, Dec. 2008, pp.719–724. DOI: https://doi.org/10.1109/ICDM.2008.46.

  108. Kaur A, Datta A. A novel algorithm for fast and scalable subspace clustering of high-dimensional data. Journal of Big Data, 2015, 2 (1): Article No. 17. DOI: https://doi.org/10.1186/s40537-015-0027-y.

  109. Zhu B, Mara A, Mozo A. CLUS: Parallel subspace clustering algorithm on Spark. In Proc. the 2015 Short Papers and Workshops on New Trends in Databases and Information Systems, Sept. 2015, pp.175–185. DOI: https://doi.org/10.1007/978-3-319-23201-0_20.

  110. Zhu B, Mozo A, Ordozgoiti B. PSCEG: An unbiased parallel subspace clustering algorithm using exact grids. In Proc. the 24th European Symposium on Artificial Neural Networks, Apr. 2016, pp.581–586.

  111. Gao Z P, Fan Y D, Niu K, Ying Z Y. MR-Mafia: Parallel subspace clustering algorithm based on MapReduce for large multi-dimensional datasets. In Proc. the 2018 IEEE International Conference on Big Data and Smart Computing, Jan. 2018, pp.257–262. DOI: https://doi.org/10.1109/Big-Comp.2018.00045.

  112. Kaul M, Yang B, Jensen C S. Building accurate 3D spatial networks to enable next generation intelligent transportation systems. In Proc. the 14th International Conference on Mobile Data Management, Jun. 2013, pp.137–146. DOI: https://doi.org/10.1109/MDM.2013.24.

  113. Springel V, White S D M, Jenkins A et al. Simulations of the Formation, evolution and clustering of galaxies and quasars. Nature, 2005, 435(1): 629–636. DOI: https://doi.org/10.1038/nature03597

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Poonam Goyal.

Ethics declarations

Conflict of Interest The authors declare that they have no conflict of interest.

Additional information

Jagat Sesh Challa got his M.Sc. (Tech.) degree in information systems in 2010, his M.E. degree in software system in 2012, and his Ph.D. degree in computer science in 2019, all from the Birla Institute of Technology and Science (BITS), Pilani Campus. Since Feb. 2021, he has been employed as an assistant professor at BITS Pilani. His current research interests include big data analytics, high performance computing, stream analytics, computer vision, materials informatics, human-computer interaction, and federated learning.

Navneet Goyal got his M.Sc. degree in mathematics from H.N.Bahuguna University, Srinagar (U.P) in 1988, his M.Phil. degree in mathematics in 1989 and his Ph.D. degree in applied mathematics in 1995 from the University of Roorkee (now IIT, Roorkee) Roorkee. Since 1995, he has been working as a faculty of various designations (assistant professor, associate professor, professor, senior professor) at Birla Institute of Technology and Science (BITS), Pilani Campus. Currently, he is also the head of the CSIS Department at BITS Pilani, Pilani Campus. His current research interests include big data analytics, AI/ML, earth observation, satellite image analytics, IoT data analytics, data provenance, databases and warehousing, etc.

Amogh Sharma got his B.E. degree in computer science from BITS Pilani in 2019. Since July 2019, he has been working with Uber (India and USA) as a software engineer (July 2019–Jan. 2022) and senior software engineer (since Feb 2022). His research interests include data mining and parallel algorithms.

Nikhil Sreekumar got his B.Tech. degree in CSE from TKM College of Engineering, Kollam in 2014, his M.E. degree in CS from BITS Pilani in 2016, and currently is pursuing his Ph.D. degree from the University of Minnesota. His current research interests include big data and distributed computing.

Sundar Balasubramaniam completed his B.E. degree in electronics and communication from the College of Engineering, Anna University, Chennai, in 1988, his M.Tech. degree in computer science from NIT Warangal in 1990, and his M.S. degree in computer science from Indiana University Bloomington in 1998. Since 2019, he has been working as an independent consultant and product designer for higher education in Bengaluru, India. His research interests include HPC, compilers, cloud computing, parallel algorithms, big data, etc.

Poonam Goyal is a professor in the Computer Science Department at BITS Pilani, Pilani Campus, Pilani. She completed her post-graduation in mathematics in 1989, and got her Ph.D. degree in applied mathematics in 1995, both from the University of Roorkee (now IIT, Roorkee), Roorkee, India, and her M.E. degree in software systems from the CSIS Department, BITS Pilani, Pilani Campus, in 2001. Her research has contributed to various social and scientific domains like social media analytics, multi-modal knowledge graphs, HPC for AI, etc.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Challa, J.S., Goyal, N., Sharma, A. et al. A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering Algorithms. J. Comput. Sci. Technol. (2024). https://doi.org/10.1007/s11390-024-2700-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11390-024-2700-0

Keywords

Navigation