Abstract
Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To fill this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
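To make the two forms of filtering concrete, the following minimal Python sketch contrasts the brute-force search space with the candidate pairs produced by a blocking workflow and by a nearest-neighbor workflow. It is an illustration under stated assumptions, not the implementation evaluated in this study: the toy profiles are hypothetical, whitespace tokens serve as the blocking signatures, and bag-of-words vectors with cosine similarity stand in for whatever vectorization a real nearest-neighbor workflow would use.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Toy entity profiles (hypothetical data, for illustration only).
profiles = {
    "a1": "apple iphone 12 black",
    "a2": "iphone 12 by apple",
    "b1": "samsung galaxy s21",
    "b2": "galaxy s21 samsung phone",
}


def brute_force_pairs(ids):
    """All unordered pairs: n * (n - 1) / 2 comparisons."""
    return set(combinations(sorted(ids), 2))


def token_blocking_pairs(profiles):
    """Blocking: each token is a signature; profiles sharing one become candidates."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for token in text.split():
            blocks[token].add(pid)
    candidates = set()
    for members in blocks.values():
        candidates.update(combinations(sorted(members), 2))
    return candidates


def knn_pairs(profiles, k=1):
    """Nearest-neighbor: vectorize all profiles, keep the k closest per query."""
    vocab = sorted({t for text in profiles.values() for t in text.split()})

    def vectorize(text):
        tokens = text.split()
        return [tokens.count(t) for t in vocab]

    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

    vectors = {pid: vectorize(text) for pid, text in profiles.items()}
    candidates = set()
    for query, query_vec in vectors.items():
        others = [pid for pid in vectors if pid != query]
        others.sort(key=lambda pid: cosine(query_vec, vectors[pid]), reverse=True)
        for pid in others[:k]:
            candidates.add(tuple(sorted((query, pid))))
    return candidates


print(len(brute_force_pairs(profiles)))   # size of the quadratic search space
print(token_blocking_pairs(profiles))     # candidates from shared signatures
print(knn_pairs(profiles, k=1))           # candidates from nearest neighbors
```

On this toy input both filters keep only the two pairs that share tokens, whereas the brute-force approach compares every pair. The number of neighbors `k` is an example of the configuration parameters whose fine-tuning the study addresses.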
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00778-024-00868-7/MediaObjects/778_2024_868_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00778-024-00868-7/MediaObjects/778_2024_868_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00778-024-00868-7/MediaObjects/778_2024_868_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00778-024-00868-7/MediaObjects/778_2024_868_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00778-024-00868-7/MediaObjects/778_2024_868_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00778-024-00868-7/MediaObjects/778_2024_868_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00778-024-00868-7/MediaObjects/778_2024_868_Fig7_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00778-024-00868-7/MediaObjects/778_2024_868_Fig8_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00778-024-00868-7/MediaObjects/778_2024_868_Fig9_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00778-024-00868-7/MediaObjects/778_2024_868_Fig10_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00778-024-00868-7/MediaObjects/778_2024_868_Fig11_HTML.png)
Notes
For a schema-aware scalability analysis involving part of the blocking workflows, please refer to [14].
More specifically, a speedup lower than 200 (= 2M/10K) indicates sublinear scalability, a speedup around 200 indicates linear scalability, whereas a speedup between 200 and 40,000 (= \(200^2\)) indicates superlinear, but sub-quadratic scalability.
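The thresholds in this note can be expressed as a small helper. The sketch below is only illustrative: the function name, the `tolerance` band used to interpret "around 200", and the label for values at or above \(200^2\) are assumptions that do not appear in the text. It also assumes that "speedup" denotes the factor by which the measured quantity grows between the smallest (10K) and the largest (2M) dataset.

```python
def classify_scalability(speedup, size_ratio=2_000_000 / 10_000, tolerance=0.1):
    """Map a speedup to a scalability class, given the growth in input size.

    `tolerance` is a hypothetical relative band around `size_ratio` that
    counts as "around" linear; the note itself does not quantify it.
    """
    linear = size_ratio            # 200 for 10K -> 2M entities
    quadratic = size_ratio ** 2    # 40,000
    if speedup < linear * (1 - tolerance):
        return "sublinear"
    if speedup <= linear * (1 + tolerance):
        return "linear"
    if speedup < quadratic:
        return "superlinear, sub-quadratic"
    return "quadratic or worse"


for value in (50, 200, 5_000, 40_000):
    print(value, classify_scalability(value))
```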
References
Getoor, L., Machanavajjhala, A.: Entity Resolution: Theory, Practice and Open Challenges. PVLDB (2012)
Dong, X.L., Srivastava, D.: Big Data Integration. Morgan and Claypool Publishers (2015)
Christen, P.: Data Matching. Springer (2012)
Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate record detection: a survey. TKDE 19(1) (2007)
Papadakis, G., Ioannou, E., Thanos, E., Palpanas, T.: The Four Generations of Entity Resolution. Morgan & Claypool Publishers (2021)
Barlaug, N., Gulla, J.A.: Neural networks for entity matching: a survey. In: ACM TKDD (2021)
Hassanzadeh, O., Chiang, F., Miller, R.J., Lee, H.C.: Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1) (2009)
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. TKDE (2012)
Thirumuruganathan, S., et al.: Deep learning for blocking in entity matching: a design space exploration. PVLDB 14(11), 2459–2472 (2021)
Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. PVLDB 9(9) (2016)
Mann, W., Augsten, N., Bouros, P.: An empirical evaluation of set similarity join techniques. Proc. VLDB Endow. 9(9), 636–647 (2016)
Jiang, Y., Li, G., Feng, J., Li, W.: String similarity joins: an experimental evaluation. Proc. VLDB Endow. 7(8), 625–636 (2014)
Aumüller, M., Bernhardsson, E., Faithfull, A.J.: Ann-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. Inf. Syst. 87 (2020)
Papadakis, G., Alexiou, G., Papastefanatos, G., Koutrika, G.: Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. Proc. VLDB Endow. 9(4), 312–323 (2015)
Fier, F., Augsten, N., Bouros, P., Leser, U., Freytag, J.: Set similarity joins on mapreduce: an experimental survey. PVLDB 11(10), 1110–1122 (2018)
Papadakis, G., Fisichella, M., Schoger, F., Mandilaras, G., Augsten, N., Nejdl, W.: Benchmarking filtering techniques for entity resolution. In: ICDE (2023)
Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In: ACM SIGMOD, pp. 495–506 (2010)
Papadakis, G., et al.: Three-dimensional entity resolution with jedai. Inf. Syst. 93 (2020)
Papadakis, G., Tsekouras, L., Thanos, E., Giannakopoulos, G., Palpanas, T., Koubarakis, M.: Domain- and structure-agnostic end-to-end entity resolution with jedai. SIGMOD Rec. 48(4), 30–36 (2019)
Paulsen, D., Govind, Y., Doan, A.: Sparkly: a simple yet surprisingly strong TF/IDF blocker for entity matching. PVLDB 16(6), 1507–1519 (2023)
Konda, P., et al.: Magellan: toward building entity matching management systems. Proc. VLDB Endow. 9(12), 1197–1208 (2016)
Brunner, U., Stockinger, K.: Entity matching with transformer architectures: a step forward in data integration. In: EDBT, pp. 463–473 (2020)
Galhotra, S., Firmani, D., Saha, B., Srivastava, D.: BEER: blocking for effective entity resolution. In: SIGMOD, pp. 2711–2715 (2021)
Galhotra, S., Firmani, D., Saha, B., Srivastava, D.: Efficient and effective ER with progressive blocking. VLDB J. 30(4), 537–557 (2021)
Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Nanayakkara, C., Christen, P.: Locality sensitive hashing with temporal and spatial constraints for efficient population record linkage. In: ACM CIKM, pp. 4354–4358 (2022)
Papadakis, G., Koutrika, G., Palpanas, T., Nejdl, W.: Meta-blocking: taking entity resolution to the next level. TKDE 26(8), 1946–1960 (2014)
Gagliardelli, L., Papadakis, G., Simonini, G., Bergamaschi, S., Palpanas, T.: Generalized supervised meta-blocking. PVLDB 15(9), 1902–1910 (2022)
Simonini, G., Bergamaschi, S., Jagadish, H.: BLAST: a loosely schema-aware meta-blocking approach for entity resolution. PVLDB 9(12), 1173–1184 (2016)
Gravano, L., et al.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)
Augsten, N., Böhlen, M.H.: Similarity Joins in Relational Database Systems. Morgan & Claypool (2013)
Augsten, N.: A roadmap towards declarative similarity queries. In: EDBT, pp. 509–512 (2018)
Silva, Y., et al.: Similarity queries: their conceptual evaluation, transformations, and processing. VLDB J. (2013)
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW, pp. 131–140 (2007)
Chaudhuri, S., et al.: A primitive operator for similarity joins in data cleaning. In: ICDE (2006)
Bouros, P., Ge, S., Mamoulis, N.: Spatio-textual similarity joins. PVLDB 6(1), 1–12 (2012)
Deng, D., Li, G., Wen, H., Feng, J.: An efficient partition based method for exact set similarity joins. Proc. VLDB Endow. 9(4), 360–371 (2015)
Deng, D., Tao, Y., Li, G.: Overlap set similarity joins with theoretical guarantees. In: SIGMOD (2018)
Zhu, E., Deng, D., Nargesian, F., Miller, R.J.: JOSIE: overlap set similarity search for finding joinable tables in data lakes. In: SIGMOD, pp. 847–864 (2019)
Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 36(3), 15:1–15:41 (2011)
Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE, pp. 257–266 (2008)
Kocher, D., Augsten, N.: A scalable index for top-k subtree similarity queries. In: SIGMOD (2019)
Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)
Li, G., et al.: PASS-JOIN: a partition-based method for similarity joins. PVLDB 5(3), 253–264 (2011)
Xiao, C., Wang, W., Lin, X., Shang, H.: Top-k set similarity joins. In: ICDE, pp. 916–927 (2009)
Yang, Z., Zheng, B., Li, G., Zhao, X., Zhou, X., Jensen, C.S.: Adaptive top-k overlap set similarity joins. In: ICDE, pp. 1081–1092 (2020)
Broder, A.Z.: On the resemblance and containment of documents. In: Sequences, pp. 21–29 (1997)
Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Data Sets. Cambridge University Press (2020)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC, pp. 604–613 (1998)
Fisichella, M., Deng, F., Nejdl, W.: Efficient incremental near duplicate detection based on locality sensitive hashing. In: DEXA, pp. 152–166 (2010)
Zhang, W., Wei, H., Sisman, B., Dong, X.L., Faloutsos, C., Page, D.: Autoblock: a hands-off blocking framework for entity matching. In: WSDM, pp. 744–752 (2020)
Ebraheem, M., et al.: Distributed representations of tuples for entity resolution. PVLDB, pp. 1454–1467 (2018)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Mudgal, S., et al.: Deep learning for entity matching: a design space exploration. In: SIGMOD, pp. 19–34 (2018)
Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: STOC, pp. 380–388 (2002)
Nelson, B., et al.: Multiprobe-lsh. https://github.com/gopalmenon/Multi-Probe-LSH (2018)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. IEEE Trans Big Data (2021)
Guo, R., et al.: Accelerating large-scale inference with anisotropic vector quantization. In: ICML (2020)
Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow. 3(1), 484–493 (2010)
Obraczka, D., Schuchart, J., Rahm, E.: Embedding-assisted entity resolution for knowledge graphs. In: ESWC, vol. 2873 (2021)
Kenig, B., Gal, A.: Mfiblocks: an effective blocking algorithm for entity resolution. Inf. Syst. 38(6), 908–926 (2013)
Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I.P., Schmidt, L.: Practical and optimal LSH for angular distance. In: NIPS, pp. 1225–1233 (2015)
Jain, A., Sarawagi, S., Sen, P.: Deep indexed active learning for matching heterogeneous entity representations. Proc. VLDB Endow. 15(1), 31–45 (2021)
About this article
Cite this article
Neuhof, F., Fisichella, M., Papadakis, G. et al. Open benchmark for filtering techniques in entity resolution. The VLDB Journal (2024). https://doi.org/10.1007/s00778-024-00868-7
DOI: https://doi.org/10.1007/s00778-024-00868-7