Abstract
Duplicated records, which describe the same entity in the real world, are frequently generated by data integration. Ideally, duplicated records should have identical values on the same attributes. In practice, however, they may have conflicting values on the same attributes due to ambiguity and data errors. Clearly, the more conflicts there are among duplicated records in a data set, the poorer the quality of the data set. To address this problem, we explore a new data quality measure, entity-description conflict, to evaluate the conflict among duplicated records. Since current entity resolution algorithms can hardly identify duplicated records correctly and completely, computing the entity-description conflict is challenging. To this end, this paper studies how to compute the range of the entity-description conflict when the entity resolution result is not completely correct. (1) A mathematical model of the entity-description conflict is introduced. (2) Four primary operators for computing the range of the entity-description conflict are identified and proved to be NP-hard; consequently, the problem of computing the range of the entity-description conflict is proved to be NP-hard. (3) Approximation algorithms for the four primary operators are provided, and a framework based on these operators is proposed for computing the range of the entity-description conflict. (4) Using real-life and synthetic data, the effectiveness and efficiency of the proposed algorithms are experimentally verified.
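As a hedged illustration only (the paper's formal definition of the entity-description conflict is given in the model section and may differ), a conflict measure of this kind can be sketched as a count of disagreeing attribute values inside each cluster of records resolved to the same entity; the function names and record format below are assumptions:

```python
from itertools import combinations

def cluster_conflict(records, attributes):
    """Count pairs of records in one cluster that disagree on an
    attribute both records specify (hypothetical per-cluster measure)."""
    conflicts = 0
    for r1, r2 in combinations(records, 2):
        conflicts += sum(1 for a in attributes
                         if r1.get(a) is not None
                         and r2.get(a) is not None
                         and r1[a] != r2[a])
    return conflicts

def entity_description_conflict(clusters, attributes):
    """Sum per-cluster conflicts over an entity resolution result."""
    return sum(cluster_conflict(c, attributes) for c in clusters)
```

For example, two duplicated records that agree on `city` but spell `name` differently contribute one conflict, so a noisier resolution result yields a larger measure.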
Acknowledgments
This paper was partially supported by NGFR 973 Grant 2012CB316200, NGFR 863 Grant 2012AA011004 and NSFC Grant 61472099.
Appendix
Theorem 5 MaxDec-Comp is a polynomial-time 2-approximation algorithm.
Proof
From Proposition 1, MaxDec-Comp runs in polynomial time.
Suppose the solution obtained by MaxDec-Comp is \(c_1\), the optimal solution is \(c^*_1\), and the costs of \(c_1\) and \(c^*_1\) are denoted by \(cost({c_1})\) and \(cost({c^*_1})\) respectively such that \(cost({c_1})=edc(c)-edc(c\backslash {c_1})\) and \(cost({c^*_1})=edc(c)-edc(c\backslash {c^*_1})\). Now we prove \(2cost({c_1})\ge {cost({c^*_1})}\).
Because the records in \(c_1\) have the top \(b\) weights, we obtain
By the definition of \(edc\) and the definition of weight \(w_i\) of each record \(r_i\), the following formula can be derived.
Similarly, we have
From inequalities (11), (12) and (13), it follows that
This completes the proof of Theorem 5. \(\square \)
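The selection step this proof relies on (MaxDec-Comp removes the \(b\) records with the top weights) can be sketched as follows; the interface is hypothetical and assumes each record's weight \(w_i\) has already been computed:

```python
def max_dec_comp(weights, b):
    """Return the indices of the b records with the largest weights,
    i.e. the candidate set c_1 whose removal decreases edc the most
    (a sketch of the greedy selection used in the proof of Theorem 5)."""
    # sort record indices by weight, largest first
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(order[:b])
```

The greedy runs in \(O(n \log n)\) time, consistent with the polynomial-time claim above.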
Before proving Theorem 7, the following lemma is proved.
Lemma 1
Given an instance of \(\mathsf {MaxInc^*}\), suppose the optimal solution contains \(\{x_i|1\le {i}\le {n}\}\) and \(\{y_{ij}|1\le {i}<j\le {n}\}\). Then \(y_{ij}=\min \{x_i,x_j\}\) for all \(1\le {i}<j\le {n}\).
Proof
From inequalities (7) and (8), it follows that
From inequality (14), we have
As \(0\le {x_j}\le {1}\) for \(1\le j\le n\), it implies that
Combining inequalities (14)-(16) gives
As \(\{y_{ij}|1\le {i}<j\le {n}\}\) is part of the optimal solution for \(\mathsf {MaxInc^*}\), \(\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}\) is maximized. Thus \(y_{ij}=\min \{x_i,x_j\}\). \(\square \)
Theorem 7 MaxInc-Comp is a polynomial-time \(\rho \)-approximation algorithm, where \(\rho =\frac{(n-1)b}{b-1}\).
Proof
We have already shown that MaxInc-Comp runs in polynomial time.
Now we prove that MaxInc-Comp is a \(\rho \)-approximation algorithm. Suppose the optimal solution to \(\mathsf {MaxInc^*}\) contains \(\{x_i|1\le {i}\le {n}\}\) and \(\{y_{ij}|1\le {i}<j\le {n}\}\). Then the cost of this optimal solution is \(cost^*=\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}\), which is an upper bound on the cost of the optimal solution to \(\mathsf {MaxInc}\), denoted by \(cost\). Suppose the solution obtained by MaxInc-Comp is \((U,F)\); its cost is \(\sum _{(i,j)\in {F}}c_{ij}\).
Applying Lemma 1, \(y_{ij}=\min \{x_i,x_j\}\le {\frac{x_i+x_j}{2}}\) for all \(1\le {i}<j\le {n}\); hence we have,
Let \(d_i={\frac{1}{2}} \sum _{j\ne {i}}{c_{ij}}\). Inequality (17) can be written as
Suppose the values \(d_i\) are sorted in descending order as \(d_1,d_2,\ldots ,d_{n-1}\). As \(\sum _{i}{x_i}=b\), the following inequality can be proved by contradiction.
The combination of inequalities (18) and (19) gives
The algorithm sorts the edges in \(E\) in descending order of weight, denoted \(e_{i_1j_1},e_{i_2j_2},\ldots ,e_{i_kj_k},\ldots \), and picks the edges with larger weights, placing their endpoints into \(U\) until \(|U|=b\). Let \(E^{\prime }=\{e_{i_1j_1},e_{i_2j_2},\ldots ,e_{i_mj_m}\}\) be the edges picked during the loop of the algorithm. Since \(E^{\prime }\subseteq {F}\) and \(m\ge {\lfloor \frac{b}{2}\rfloor }\ge {\frac{b-1}{2}}\), we have
From inequality (22), we have
Combining inequalities (20) and (23) gives
From inequalities (21) and (24), we have
which implies that
where \(\rho =\frac{(n-1)b}{b-1}\). Hence MaxInc-Comp is a \(\rho \)-approximation algorithm. \(\square \)
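The edge-greedy loop described in this proof (sort edges by weight, collect endpoints into \(U\) until \(|U|=b\), then take the edge set \(F\) induced by \(U\)) can be sketched as follows; the input format and the trimming of a possible one-node overshoot are assumptions not spelled out in the proof:

```python
def max_inc_comp(edge_weights, b):
    """Greedy sketch of MaxInc-Comp's edge-selection loop.
    edge_weights: dict {(i, j): c_ij} with i < j."""
    U = set()
    # scan edges in descending order of weight c_ij
    for (i, j), c in sorted(edge_weights.items(),
                            key=lambda kv: kv[1], reverse=True):
        if len(U) >= b:
            break
        U.update((i, j))
    U = set(sorted(U)[:b])  # trim a possible one-node overshoot
    # F: edges of the subgraph induced by U
    F = [e for e in edge_weights if e[0] in U and e[1] in U]
    return U, F, sum(edge_weights[e] for e in F)
```

The picked edges form the set \(E^{\prime }\subseteq F\) from the proof, so the returned cost is at least the total weight of \(E^{\prime }\).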
Theorem 8 MinInc-Comp is a polynomial-time \(\rho \)-approximation algorithm, where \(\rho =\frac{(n-1)b}{b-1}\).
Proof
Obviously MinInc-Comp runs in polynomial time.
Now we prove that MinInc-Comp is a \(\rho \)-approximation algorithm. We denote the given \(\mathsf {MinInc}\) instance by \(I\). Suppose \(I\) is encoded into a \(\mathsf {MaxInc}\) instance denoted by \(I^{\prime }\), and the relaxed instance of \(I^{\prime }\), which is a \(\mathsf {MaxInc^*}\) problem, is denoted by \(I^{\prime \prime }\). Suppose the cost of the optimal solution to \(I^{\prime }\) is \(cost^{\prime }\) and the cost of the optimal solution to \(I^{\prime \prime }\) is \(cost^{\prime \prime }\). Since \(cost^{\prime }=\sum _{(i,j)\in {F}}{(w-c_{ij})}\), the cost of the solution generated by MinInc-Comp, denoted by \(cost\), satisfies \(cost=\sum _{(i,j)\in {F}}{w}-cost^{\prime }\). Applying Theorem 7, we have \(\frac{cost^{\prime \prime }}{cost}\le {\rho }\). We denote \(M=\sum _{(i,j)\in {F}}{w}\). Then we have,
Suppose the optimal solution to \(I^{\prime \prime }\) involves \(\{x_i|1\le {i}\le {n}\}\) and \(\{y_{ij}|1\le {i}<j\le {n}\}\). Since \(cost^{\prime \prime }=\sum _{1\le {i}<j\le {n}}{(w-c_{ij}) y_{ij}}\) is maximized and the cost \(cost^*=\sum _{1\le {i}<j\le {n}}{c_{ij} y_{ij}}\) of the optimal solution to \(I\) is minimized, we have
Combining equalities (27) and (28) gives
By Theorem 7, we have
Since \(w=(\rho +1) \max \{c_{ij}\}\), the following formula can be obtained.
From inequalities (30) and (31), it follows that
which implies that
Combining inequalities (29) and (32) gives
Hence MinInc-Comp is a \(\rho \)-approximation algorithm. \(\square \)
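The reduction used in this proof (complement each weight to \(w-c_{ij}\) with \(w=(\rho +1)\max \{c_{ij}\}\), run a MaxInc-style greedy on the complemented instance, and read off the original-weight cost of the induced edge set \(F\)) can be sketched as follows; the interface and the greedy step are a hypothetical reconstruction from the proof text:

```python
def min_inc_comp(n, edge_weights, b):
    """Sketch of MinInc-Comp: solve MinInc by complementing weights and
    running the MaxInc edge-greedy. edge_weights: {(i, j): c_ij}, i < j."""
    rho = (n - 1) * b / (b - 1)
    w = (rho + 1) * max(edge_weights.values())
    comp = {e: w - c for e, c in edge_weights.items()}   # encode I into I'
    # MaxInc-style greedy on the complemented weights
    U = set()
    for (i, j), c in sorted(comp.items(), key=lambda kv: kv[1], reverse=True):
        if len(U) >= b:
            break
        U.update((i, j))
    U = set(sorted(U)[:b])  # trim a possible one-node overshoot
    F = [e for e in edge_weights if e[0] in U and e[1] in U]
    # report the cost in the ORIGINAL weights, as MinInc requires
    return U, sum(edge_weights[e] for e in F)
```

Heavy complemented edges correspond to light original edges, so the greedy on \(I^{\prime }\) favors a low-conflict set \(U\) for \(I\).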
Li, L., Li, J. & Gao, H. Evaluating entity-description conflict on duplicated data. J Comb Optim 31, 918–941 (2016). https://doi.org/10.1007/s10878-014-9801-6