Abstract
Entity matching (EM) is a fundamental task in data integration that involves identifying records that refer to the same real-world entity. Unsupervised EM is often preferred in real-world applications because labeling data is labor-intensive. However, existing unsupervised methods do not always perform well, because the assumptions they rely on may not hold for tasks in different domains. In this paper, we propose QA-Matcher, an unsupervised EM model that is domain-agnostic and does not require any particular assumptions. Our idea is to frame EM as question answering (QA) by utilizing a trained QA model. Specifically, we generate a question that asks which record has the characteristics of a particular record, together with a passage that describes the other records. We then use the trained QA model to answer the question, and we predict the record pair formed by the question and its answer as a match. QA-Matcher leverages the power of a QA model to represent the semantics of various types of entities, allowing it to identify identical entities in a QA-like fashion. In extensive experiments on 16 real-world datasets, we demonstrate that QA-Matcher outperforms existing unsupervised EM methods and is competitive with supervised methods.
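The framing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the question and passage templates, the helper names (`serialize`, `build_question`, `build_passage`, `match`), and the example records are all assumptions made for illustration, and `toy_qa_model` is a token-overlap stand-in for a trained QA model so that the example runs without downloading anything.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]
# A QA model here is any callable that takes (question, passage) and returns
# the character offset where the answer starts, or None (assumed interface).
QAModel = Callable[[str, str], Optional[int]]


def serialize(record: Record) -> str:
    """Flatten a record into 'attribute is value' text (hypothetical template)."""
    return ", ".join(f"{k} is {v}" for k, v in record.items())


def build_question(query: Record) -> str:
    """Ask which record has the characteristics of the query record."""
    return f"Which record has {serialize(query)}?"


def build_passage(candidates: List[Record]) -> Tuple[str, List[Tuple[int, int]]]:
    """Describe each candidate in one sentence and remember its character span."""
    sentences, spans, pos = [], [], 0
    for i, rec in enumerate(candidates):
        sentence = f"Record {i} has {serialize(rec)}."
        spans.append((pos, pos + len(sentence)))
        sentences.append(sentence)
        pos += len(sentence) + 1  # +1 for the joining space
    return " ".join(sentences), spans


def match(query: Record, candidates: List[Record], qa_model: QAModel) -> Optional[int]:
    """Return the index of the candidate whose sentence contains the answer."""
    question = build_question(query)
    passage, spans = build_passage(candidates)
    answer_start = qa_model(question, passage)
    if answer_start is None:
        return None
    for i, (start, end) in enumerate(spans):
        if start <= answer_start < end:
            return i
    return None


def toy_qa_model(question: str, passage: str) -> Optional[int]:
    """Stand-in for a trained QA model: pick the sentence sharing most tokens."""
    q_tokens = set(re.findall(r"\w+", question.lower()))
    best_start, best_score = None, -1
    pattern = r"Record \d+ has .*?\.(?= Record \d+ has |$)"
    for m in re.finditer(pattern, passage):
        score = len(q_tokens & set(re.findall(r"\w+", m.group().lower())))
        if score > best_score:
            best_start, best_score = m.start(), score
    return best_start


# Hypothetical example records, not taken from the paper's datasets.
query = {"title": "iPhone 13 Pro 128GB", "brand": "Apple"}
candidates = [
    {"title": "Galaxy S21 128GB", "brand": "Samsung"},
    {"title": "Apple iPhone 13 Pro (128 GB)", "brand": "Apple"},
    {"title": "Pixel 6", "brand": "Google"},
]
predicted = match(query, candidates, toy_qa_model)
print(predicted)
```

In practice, `toy_qa_model` would be replaced by a trained extractive QA model wrapped to expose the same interface, for example one that reports the start offset of its predicted answer span. Mapping the answer offset back to a candidate sentence is one possible design choice for turning a QA answer into a predicted record pair.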
S. Hayashi—This work was conducted while the author was affiliated with NEC.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hayashi, S., Dong, Y., Oyamada, M. (2023). QA-Matcher: Unsupervised Entity Matching Using a Question Answering Model. In: Kashima, H., Ide, T., Peng, WC. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2023. Lecture Notes in Computer Science, vol 13938. Springer, Cham. https://doi.org/10.1007/978-3-031-33383-5_14
DOI: https://doi.org/10.1007/978-3-031-33383-5_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33382-8
Online ISBN: 978-3-031-33383-5
eBook Packages: Computer Science, Computer Science (R0)