Open benchmark for filtering techniques in entity resolution

  • Regular Paper
The VLDB Journal

Abstract

Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To fill this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
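
To make the two families of filtering techniques concrete, the following toy sketch contrasts them on three made-up profiles. It is not the implementation evaluated in the article: the whitespace tokenization, the binary bag-of-words vectors, the brute-force cosine search, and all function names (`tokens`, `token_blocking`, `knn_candidates`) are illustrative assumptions.

```python
import math
from collections import defaultdict
from itertools import combinations


def tokens(profile):
    """Schema-agnostic signatures: lower-cased whitespace tokens of all values."""
    return {tok for value in profile.values() for tok in str(value).lower().split()}


def token_blocking(profiles):
    """Blocking workflow (toy): profiles sharing at least one token become candidates."""
    blocks = defaultdict(set)
    for pid, profile in profiles.items():
        for tok in tokens(profile):
            blocks[tok].add(pid)
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_candidates(profiles, k):
    """Nearest-neighbor workflow (toy): vectorize profiles, pair each with its k closest."""
    token_sets = {pid: tokens(p) for pid, p in profiles.items()}
    vocab = sorted(set().union(*token_sets.values()))
    vectors = {pid: [1.0 if t in ts else 0.0 for t in vocab]
               for pid, ts in token_sets.items()}
    candidates = set()
    for query, qvec in vectors.items():
        # Brute-force ranking; ties are broken by profile id for determinism.
        ranked = sorted((pid for pid in vectors if pid != query),
                        key=lambda pid: (-cosine(qvec, vectors[pid]), pid))
        for pid in ranked[:k]:
            candidates.add(tuple(sorted((query, pid))))
    return candidates


# Hypothetical profiles with heterogeneous attribute names.
profiles = {
    1: {"name": "Apple iPhone 12", "color": "black"},
    2: {"title": "iphone 12 apple"},
    3: {"name": "Samsung Galaxy S21", "color": "black"},
}
print(token_blocking(profiles))
print(knn_candidates(profiles, k=1))
```

In both cases the output is a set of candidate pairs that is smaller than the set of all pairs; real systems replace the brute-force search with an index and add further cleaning steps.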

Notes

  1. http://oaei.ontologymatching.org/2010.

  2. http://www.freedb.org.

  3. https://github.com/Rehket/FEBRL.

  4. https://github.com/scify/JedAIToolkit.

  5. https://github.com/anhaidgroup/py_entitymatching.

  6. https://github.com/FALCONN-LIB/FALCONN.

  7. https://github.com/facebookresearch/faiss.

  8. https://github.com/google-research/google-research/tree/master/scann.

  9. https://github.com/qcri/DeepBlocker.

  10. See examples at: https://github.com/FALCONN-LIB/FALCONN/blob/master/src/examples/glove/glove.py.

  11. For a schema-aware scalability analysis involving part of the blocking workflows, please refer to [14].

  12. More specifically, a speedup lower than 200 (=2M/10K) indicates sublinear scalability, a speedup around 200 indicates linear scalability, whereas a speedup between 200 and 40,000 (=\(200^2\)) indicates superlinear but sub-quadratic scalability.
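
The thresholds in note 12 can be sketched as a small helper. This is only an illustration under stated assumptions: "speedup" is taken to be the ratio between the run-times on the largest (2M) and the smallest (10K) dataset, the 10% tolerance used for "around 200" is an arbitrary choice, the label for values of 40,000 or more is not given in the note, and the function name `classify_scalability` is hypothetical.

```python
def classify_scalability(speedup, size_ratio=2_000_000 / 10_000, tolerance=0.1):
    """Map a run-time ratio to the scalability classes described in note 12.

    size_ratio: ratio of dataset sizes (200 for 2M vs. 10K entities).
    tolerance:  assumed relative band that still counts as "around" size_ratio.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    lower = size_ratio * (1 - tolerance)
    upper = size_ratio * (1 + tolerance)
    if speedup < lower:
        return "sublinear"
    if speedup <= upper:
        return "linear"
    if speedup < size_ratio ** 2:
        return "superlinear, sub-quadratic"
    # Not covered by the note; labeled here by assumption.
    return "quadratic or worse"


for value in (50, 205, 5_000, 40_000):
    print(value, classify_scalability(value))
```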

References

  1. Getoor, L., Machanavajjhala, A.: Entity Resolution: Theory, Practice and Open Challenges. PVLDB (2012)

  2. Dong, X.L., Srivastava, D.: Big Data Integration. Morgan and Claypool Publishers (2015)

  3. Christen, P.: Data Matching. Springer (2012)

  4. Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate record detection: a survey. TKDE 19(1) (2007)

  5. Papadakis, G., Ioannou, E., Thanos, E., Palpanas, T.: The Four Generations of Entity Resolution. Morgan & Claypool Publishers (2021)

  6. Barlaug, N., Gulla, J.A.: Neural networks for entity matching: a survey. ACM TKDD (2021)

  7. Hassanzadeh, O., Chiang, F., Miller, R.J., Lee, H.C.: Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1) (2009)

  8. Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. TKDE (2012)

  9. Thirumuruganathan, S., et al.: Deep learning for blocking in entity matching: a design space exploration. PVLDB 14(11), 2459–2472 (2021)

  10. Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. PVLDB 9(9) (2016)

  11. Mann, W., Augsten, N., Bouros, P.: An empirical evaluation of set similarity join techniques. Proc. VLDB Endow. 9(9), 636–647 (2016)

  12. Jiang, Y., Li, G., Feng, J., Li, W.: String similarity joins: an experimental evaluation. Proc. VLDB Endow. 7(8), 625–636 (2014)

  13. Aumüller, M., Bernhardsson, E., Faithfull, A.J.: Ann-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. Inf. Syst. 87 (2020)

  14. Papadakis, G., Alexiou, G., Papastefanatos, G., Koutrika, G.: Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. Proc. VLDB Endow. 9(4), 312–323 (2015)

  15. Fier, F., Augsten, N., Bouros, P., Leser, U., Freytag, J.: Set similarity joins on mapreduce: an experimental survey. PVLDB 11(10), 1110–1122 (2018)

  16. Papadakis, G., Fisichella, M., Schoger, F., Mandilaras, G., Augsten, N., Nejdl, W.: Benchmarking filtering techniques for entity resolution. In: ICDE (2023)

  17. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In: ACM SIGMOD, pp. 495–506 (2010)

  18. Papadakis, G., et al.: Three-dimensional entity resolution with JedAI. Inf. Syst. 93 (2020)

  19. Papadakis, G., Tsekouras, L., Thanos, E., Giannakopoulos, G., Palpanas, T., Koubarakis, M.: Domain- and structure-agnostic end-to-end entity resolution with JedAI. SIGMOD Rec. 48(4), 30–36 (2019)

  20. Paulsen, D., Govind, Y., Doan, A.: Sparkly: a simple yet surprisingly strong TF/IDF blocker for entity matching. PVLDB 16(6), 1507–1519 (2023)

  21. Konda, P., et al.: Magellan: toward building entity matching management systems. Proc. VLDB Endow. 9(12), 1197–1208 (2016)

  22. Brunner, U., Stockinger, K.: Entity matching with transformer architectures: a step forward in data integration. In: EDBT, pp. 463–473 (2020)

  23. Galhotra, S., Firmani, D., Saha, B., Srivastava, D.: BEER: blocking for effective entity resolution. In: SIGMOD, pp. 2711–2715 (2021)

  24. Galhotra, S., Firmani, D., Saha, B., Srivastava, D.: Efficient and effective ER with progressive blocking. VLDB J. 30(4), 537–557 (2021)

  25. Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)

  26. Nanayakkara, C., Christen, P.: Locality sensitive hashing with temporal and spatial constraints for efficient population record linkage. In: ACM CIKM, pp. 4354–4358 (2022)

  27. Papadakis, G., Koutrika, G., Palpanas, T., Nejdl, W.: Meta-blocking: taking entity resolution to the next level. TKDE 26(8), 1946–1960 (2014)

  28. Gagliardelli, L., Papadakis, G., Simonini, G., Bergamaschi, S., Palpanas, T.: Generalized supervised meta-blocking. PVLDB 15(9), 1902–1910 (2022)

  29. Simonini, G., Bergamaschi, S., Jagadish, H.: BLAST: a loosely schema-aware meta-blocking approach for entity resolution. PVLDB 9(12), 1173–1184 (2016)

  30. Gravano, L., et al.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)

  31. Augsten, N., Böhlen, M.H.: Similarity Joins in Relational Database Systems. Morgan & Claypool (2013)

  32. Augsten, N.: A roadmap towards declarative similarity queries. In: EDBT, pp. 509–512 (2018)

  33. Silva, Y., et al.: Similarity queries: their conceptual evaluation, transformations, and processing. VLDB J. (2013)

  34. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW, pp. 131–140 (2007)

  35. Chaudhuri, S., et al.: A primitive operator for similarity joins in data cleaning. In: ICDE (2006)

  36. Bouros, P., Ge, S., Mamoulis, N.: Spatio-textual similarity joins. PVLDB 6(1), 1–12 (2012)

  37. Deng, D., Li, G., Wen, H., Feng, J.: An efficient partition based method for exact set similarity joins. Proc. VLDB Endow. 9(4), 360–371 (2015)

  38. Deng, D., Tao, Y., Li, G.: Overlap set similarity joins with theoretical guarantees. In: SIGMOD (2018)

  39. Zhu, E., Deng, D., Nargesian, F., Miller, R.J.: JOSIE: overlap set similarity search for finding joinable tables in data lakes. In: SIGMOD, pp. 847–864 (2019)

  40. Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 36(3), 15:1–15:41 (2011)

  41. Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE, pp. 257–266 (2008)

  42. Kocher, D., Augsten, N.: A scalable index for top-k subtree similarity queries. In: SIGMOD (2019)

  43. Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)

  44. Li, G., et al.: PASS-JOIN: a partition-based method for similarity joins. PVLDB 5(3), 253–264 (2011)

  45. Xiao, C., Wang, W., Lin, X., Shang, H.: Top-k set similarity joins. In: ICDE, pp. 916–927 (2009)

  46. Yang, Z., Zheng, B., Li, G., Zhao, X., Zhou, X., Jensen, C.S.: Adaptive top-k overlap set similarity joins. In: ICDE, pp. 1081–1092 (2020)

  47. Broder, A.Z.: On the resemblance and containment of documents. In: Sequences, pp. 21–29 (1997)

  48. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Data Sets. Cambridge University Press (2020)

  49. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC, pp. 604–613 (1998)

  50. Fisichella, M., Deng, F., Nejdl, W.: Efficient incremental near duplicate detection based on locality sensitive hashing. In: DEXA, pp. 152–166 (2010)

  51. Zhang, W., Wei, H., Sisman, B., Dong, X.L., Faloutsos, C., Page, D.: Autoblock: a hands-off blocking framework for entity matching. In: WSDM, pp. 744–752 (2020)

  52. Ebraheem, M., et al.: Distributed representations of tuples for entity resolution. PVLDB, pp. 1454–1467 (2018)

  53. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)

  54. Mudgal, S., et al.: Deep learning for entity matching: a design space exploration. In: SIGMOD, pp. 19–34 (2018)

  55. Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: STOC, pp. 380–388 (2002)

  56. Nelson, B., et al.: Multiprobe-lsh. https://github.com/gopalmenon/Multi-Probe-LSH (2018)

  57. Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Trans. Big Data (2021)

  58. Guo, R., et al.: Accelerating large-scale inference with anisotropic vector quantization. In: ICML (2020)

  59. Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow. 3(1), 484–493 (2010)

  60. Obraczka, D., Schuchart, J., Rahm, E.: Embedding-assisted entity resolution for knowledge graphs. In: ESWC, vol. 2873 (2021)

  61. Kenig, B., Gal, A.: MFIBlocks: an effective blocking algorithm for entity resolution. Inf. Syst. 38(6), 908–926 (2013)

  62. Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I.P., Schmidt, L.: Practical and optimal LSH for angular distance. In: NIPS, pp. 1225–1233 (2015)

  63. Jain, A., Sarawagi, S., Sen, P.: Deep indexed active learning for matching heterogeneous entity representations. Proc. VLDB Endow. 15(1), 31–45 (2021)

Author information

Corresponding author

Correspondence to George Papadakis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 5414 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Neuhof, F., Fisichella, M., Papadakis, G. et al. Open benchmark for filtering techniques in entity resolution. The VLDB Journal (2024). https://doi.org/10.1007/s00778-024-00868-7

  • DOI: https://doi.org/10.1007/s00778-024-00868-7
