Open benchmark for filtering techniques in entity resolution

Neuhof, Franziska; Fisichella, Marco; Papadakis, George; Nikoletos, Konstantinos; Augsten, Nikolaus; Nejdl, Wolfgang; Koubarakis, Manolis

doi:10.1007/s00778-024-00868-7

Regular Paper
Published: 09 July 2024

The VLDB Journal (2024)

Franziska Neuhof¹,
Marco Fisichella¹,
George Papadakis ORCID: orcid.org/0000-0002-7298-9431²,
Konstantinos Nikoletos²,
Nikolaus Augsten³,
Wolfgang Nejdl¹ &
…
Manolis Koubarakis²

Abstract

Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To ameliorate this issue, filtering techniques reduce the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been juxtaposed in a systematic way and, thus, their relative performance is unknown. To fill this gap, we perform an extensive experimental study that investigates the relative performance of the main representatives per type over numerous established datasets. Comparing techniques of different types in a fair way is a non-trivial task, because the configuration parameters of each approach have a significant impact on its performance, but are hard to fine-tune. We consider a plethora of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, the time efficiency, the memory footprint, and the scalability of the considered techniques.
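
To make the two forms of filtering concrete, the following is a minimal Python sketch, not the workflows evaluated in the article. It assumes a toy set of four entity profiles, uses every lowercased token as a blocking signature (a simple schema-agnostic choice), and represents profiles as token-count vectors compared by cosine similarity. All function names, the sample data, and the parameter `k` are hypothetical. The nearest-neighbor part scans all profiles exhaustively for clarity, whereas libraries such as FAISS or ScaNN replace that scan with an index.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tokens(attrs):
    """Lowercased tokens from all attribute values (schema-agnostic)."""
    return [t for v in attrs.values() for t in str(v).lower().split()]


def brute_force_pairs(profiles):
    """Every unordered pair of profile ids: n * (n - 1) / 2 candidates."""
    return set(combinations(sorted(profiles), 2))


def token_blocking_pairs(profiles):
    """Blocking: profiles sharing at least one token signature are candidates."""
    blocks = defaultdict(set)
    for pid, attrs in profiles.items():
        for token in tokens(attrs):
            blocks[token].add(pid)
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates


def cosine(a, b):
    """Cosine similarity of two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(
        sum(x * x for x in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_neighbor_pairs(profiles, k=1):
    """Nearest-neighbor: pair every query profile with its k closest vectors.

    Exhaustive scan used only for illustration; an approximate index
    would avoid comparing each query with every other profile.
    """
    vectors = {pid: Counter(tokens(attrs)) for pid, attrs in profiles.items()}
    candidates = set()
    for query, query_vec in vectors.items():
        ranked = sorted(
            (p for p in vectors if p != query),
            key=lambda p: cosine(query_vec, vectors[p]),
            reverse=True,
        )
        for neighbor in ranked[:k]:
            candidates.add(tuple(sorted((query, neighbor))))
    return candidates


# Hypothetical profiles with heterogeneous attribute names.
profiles = {
    1: {"name": "Apple iPhone 12", "brand": "Apple"},
    2: {"title": "iphone 12 apple 64GB"},
    3: {"name": "Samsung Galaxy S21"},
    4: {"title": "galaxy s21 samsung phone"},
}

all_pairs = brute_force_pairs(profiles)
blocked = token_blocking_pairs(profiles)
neighbors = nearest_neighbor_pairs(profiles, k=1)
print(len(all_pairs), sorted(blocked), sorted(neighbors))
```

On this toy input, both filters keep only the pairs (1, 2) and (3, 4) out of the six pairs that the brute-force approach would compare. Real datasets are far noisier, which is why the choice of signatures, vector representation, and configuration parameters matters.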

Notes

http://oaei.ontologymatching.org/2010.
http://www.freedb.org.
https://github.com/Rehket/FEBRL.
https://github.com/scify/JedAIToolkit.
https://github.com/anhaidgroup/py_entitymatching.
https://github.com/FALCONN-LIB/FALCONN.
https://github.com/facebookresearch/faiss.
https://github.com/google-research/google-research/tree/master/scann.
https://github.com/qcri/DeepBlocker.
See examples at: https://github.com/FALCONN-LIB/FALCONN/blob/master/src/examples/glove/glove.py.
For a schema-aware scalability analysis involving part of the blocking workflows, please refer to [14].
More specifically, a speedup lower than 200 (= 2M/10K) indicates sublinear scalability, a speedup around 200 indicates linear scalability, whereas a speedup between 200 and 40,000 (= \(200^2\)) indicates superlinear but sub-quadratic scalability.
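
The thresholds in the last note can be sketched as a small Python helper. This is an illustration under stated assumptions rather than code from the article: "speedup" is taken to mean the ratio of the run-time on the larger dataset to the run-time on the smaller one, the `tolerance` that decides what counts as "around 200" is a hypothetical 10%, and the label for values of 40,000 or more is our own addition, since the note does not cover that case.

```python
def classify_scalability(speedup, small_size=10_000, large_size=2_000_000,
                         tolerance=0.10):
    """Map a run-time ratio to the scalability classes named in the note.

    speedup:   run-time on the large dataset / run-time on the small one
               (assumed interpretation).
    tolerance: hypothetical band that defines "around" the size ratio.
    """
    size_ratio = large_size / small_size  # 2M / 10K = 200
    if speedup < size_ratio * (1 - tolerance):
        return "sublinear"
    if speedup <= size_ratio * (1 + tolerance):
        return "linear"
    if speedup < size_ratio ** 2:  # 200^2 = 40,000
        return "superlinear, sub-quadratic"
    # Not covered by the note; labeled here as an assumption.
    return "quadratic or worse"


for value in (50, 200, 5_000, 40_000):
    print(value, classify_scalability(value))
```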

References

Getoor, L., Machanavajjhala, A.: Entity Resolution: Theory, Practice and Open Challenges. PVLDB (2012)
Dong, X.L., Srivastava, D.: Big Data Integration. Morgan and Claypool Publishers (2015)
Christen, P.: Data Matching. Springer (2012)
Elmagarmid, A., Ipeirotis, P., Verykios, V.: Duplicate record detection: a survey. TKDE 19(1) (2007)
Papadakis, G., Ioannou, E., Thanos, E., Palpanas, T.: The Four Generations of Entity Resolution. Morgan & Claypool Publishers (2021)
Barlaug, N., Gulla, J.A.: Neural networks for entity matching: a survey. In: ACM TKDD (2021)
Hassanzadeh, O., Chiang, F., Miller, R.J., Lee, H.C.: Framework for evaluating clustering algorithms in duplicate detection. PVLDB 2(1) (2009)
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. TKDE (2012)
Thirumuruganathan, S., et al.: Deep learning for blocking in entity matching: a design space exploration. PVLDB 14(11), 2459–2472 (2021)
Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. PVLDB 9(9) (2016)
Mann, W., Augsten, N., Bouros, P.: An empirical evaluation of set similarity join techniques. Proc. VLDB Endow. 9(9), 636–647 (2016)
Jiang, Y., Li, G., Feng, J., Li, W.: String similarity joins: an experimental evaluation. Proc. VLDB Endow. 7(8), 625–636 (2014)
Aumüller, M., Bernhardsson, E., Faithfull, A.J.: Ann-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. Inf. Syst. 87 (2020)
Papadakis, G., Alexiou, G., Papastefanatos, G., Koutrika, G.: Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. Proc. VLDB Endow. 9(4), 312–323 (2015)
Fier, F., Augsten, N., Bouros, P., Leser, U., Freytag, J.: Set similarity joins on mapreduce: an experimental survey. PVLDB 11(10), 1110–1122 (2018)
Papadakis, G., Fisichella, M., Schoger, F., Mandilaras, G., Augsten, N., Nejdl, W.: Benchmarking filtering techniques for entity resolution. In: ICDE (2023)
Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In: ACM SIGMOD, pp. 495–506 (2010)
Papadakis, G., et al.: Three-dimensional entity resolution with jedai. Inf. Syst. 93 (2020)
Papadakis, G., Tsekouras, L., Thanos, E., Giannakopoulos, G., Palpanas, T., Koubarakis, M.: Domain- and structure-agnostic end-to-end entity resolution with jedai. SIGMOD Rec. 48(4), 30–36 (2019)
Paulsen, D., Govind, Y., Doan, A.: Sparkly: a simple yet surprisingly strong TF/IDF blocker for entity matching. PVLDB 16(6), 1507–1519 (2023)
Konda, P., et al.: Magellan: toward building entity matching management systems. Proc. VLDB Endow. 9(12), 1197–1208 (2016)
Brunner, U., Stockinger, K.: Entity matching with transformer architectures: a step forward in data integration. In: EDBT, pp. 463–473 (2020)
Galhotra, S., Firmani, D., Saha, B., Srivastava, D.: BEER: blocking for effective entity resolution. In: SIGMOD, pp. 2711–2715 (2021)
Galhotra, S., Firmani, D., Saha, B., Srivastava, D.: Efficient and effective ER with progressive blocking. VLDB J. 30(4), 537–557 (2021)
Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Nanayakkara, C., Christen, P.: Locality sensitive hashing with temporal and spatial constraints for efficient population record linkage. In: ACM CIKM, pp. 4354–4358 (2022)
Papadakis, G., Koutrika, G., Palpanas, T., Nejdl, W.: Meta-blocking: taking entity resolution to the next level. TKDE 26(8), 1946–1960 (2014)
Gagliardelli, L., Papadakis, G., Simonini, G., Bergamaschi, S., Palpanas, T.: Generalized supervised meta-blocking. PVLDB 15(9), 1902–1910 (2022)
Simonini, G., Bergamaschi, S., Jagadish, H.: BLAST: a loosely schema-aware meta-blocking approach for entity resolution. PVLDB 9(12), 1173–1184 (2016)
Gravano, L., et al.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)
Augsten, N., Böhlen, M.H.: Similarity Joins in Relational Database Systems. Morgan & Claypool (2013)
Augsten, N.: A roadmap towards declarative similarity queries. In: EDBT, pp. 509–512 (2018)
Silva, Y., et al.: Similarity queries: their conceptual evaluation, transformations, and processing. VLDB J. (2013)
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW, pp. 131–140 (2007)
Chaudhuri, S., et al.: A primitive operator for similarity joins in data cleaning. In: ICDE (2006)
Bouros, P., Ge, S., Mamoulis, N.: Spatio-textual similarity joins. PVLDB 6(1), 1–12 (2012)
Deng, D., Li, G., Wen, H., Feng, J.: An efficient partition based method for exact set similarity joins. Proc. VLDB Endow. 9(4), 360–371 (2015)
Deng, D., Tao, Y., Li, G.: Overlap set similarity joins with theoretical guarantees. In: SIGMOD (2018)
Zhu, E., Deng, D., Nargesian, F., Miller, R.J.: JOSIE: overlap set similarity search for finding joinable tables in data lakes. In: SIGMOD, pp. 847–864 (2019)
Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 36(3), 15:1–15:41 (2011)
Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE, pp. 257–266 (2008)
Kocher, D., Augsten, N.: A scalable index for top-k subtree similarity queries. In: SIGMOD (2019)
Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)
Li, G., et al.: PASS-JOIN: a partition-based method for similarity joins. PVLDB 5(3), 253–264 (2011)
Xiao, C., Wang, W., Lin, X., Shang, H.: Top-k set similarity joins. In: ICDE, pp. 916–927 (2009)
Yang, Z., Zheng, B., Li, G., Zhao, X., Zhou, X., Jensen, C.S.: Adaptive top-k overlap set similarity joins. In: ICDE, pp. 1081–1092 (2020)
Broder, A.Z.: On the resemblance and containment of documents. In: Sequences, pp. 21–29 (1997)
Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Data Sets. Cambridge University Press (2020)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC, pp. 604–613 (1998)
Fisichella, M., Deng, F., Nejdl, W.: Efficient incremental near duplicate detection based on locality sensitive hashing. In: DEXA, pp. 152–166 (2010)
Zhang, W., Wei, H., Sisman, B., Dong, X.L., Faloutsos, C., Page, D.: Autoblock: a hands-off blocking framework for entity matching. In: WSDM, pp. 744–752 (2020)
Ebraheem, M., et al.: Distributed representations of tuples for entity resolution. PVLDB, pp. 1454–1467 (2018)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Mudgal, S., et al.: Deep learning for entity matching: a design space exploration. In: SIGMOD, pp. 19–34 (2018)
Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: STOC, pp. 380–388 (2002)
Nelson, B., et al.: Multiprobe-lsh. https://github.com/gopalmenon/Multi-Probe-LSH (2018)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. IEEE Trans Big Data (2021)
Guo, R., et al.: Accelerating large-scale inference with anisotropic vector quantization. In: ICML (2020)
Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow. 3(1), 484–493 (2010)
Obraczka, D., Schuchart, J., Rahm, E.: Embedding-assisted entity resolution for knowledge graphs. In: ESWC, vol. 2873 (2021)
Kenig, B., Gal, A.: Mfiblocks: an effective blocking algorithm for entity resolution. Inf. Syst. 38(6), 908–926 (2013)
Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I.P., Schmidt, L.: Practical and optimal LSH for angular distance. In: NIPS, pp. 1225–1233 (2015)
Jain, A., Sarawagi, S., Sen, P.: Deep indexed active learning for matching heterogeneous entity representations. Proc. VLDB Endow. 15(1), 31–45 (2021)

Author information

Authors and Affiliations

L3S Research Center, Hannover, Germany
Franziska Neuhof, Marco Fisichella & Wolfgang Nejdl
National and Kapodistrian University of Athens, Athens, Greece
George Papadakis, Konstantinos Nikoletos & Manolis Koubarakis
University of Salzburg, Salzburg, Austria
Nikolaus Augsten

Corresponding author

Correspondence to George Papadakis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 5414 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Neuhof, F., Fisichella, M., Papadakis, G. et al. Open benchmark for filtering techniques in entity resolution. The VLDB Journal (2024). https://doi.org/10.1007/s00778-024-00868-7

Received: 02 June 2023
Accepted: 29 June 2024
Published: 09 July 2024
DOI: https://doi.org/10.1007/s00778-024-00868-7
