Abstract
CiteSeer is regarded as the first academic search engine and has been serving data for almost twenty years. Recently, CiteSeer graciously made all of its data public, including the raw PDF files, the text converted from the PDFs, and the metadata extracted from that text. Numerous efforts have been made to improve the accuracy of the metadata extraction, but the problem is inherently challenging and errors remain abundant. In this paper, we propose an innovative record-linkage-based method for data cleaning, which uses two new matching algorithms to significantly improve cleaning performance on the CiteSeer dataset: one is an enhanced matching algorithm for local datasets, and the other is developed for online datasets. Experimental results show that our method corrects 48.1% of the wrong metadata entries in total, an improvement of more than 539% over existing state-of-the-art data cleaning methods.
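As a rough illustration of the general record-linkage idea only, and not of the two matching algorithms proposed in this paper, the following sketch links a noisy extracted record to a cleaner reference collection by title similarity and then overwrites the noisy fields with the matched ones. The use of Jaccard similarity over word tokens, the 0.6 threshold, the function names, and the toy records are all assumptions made for the example.

```python
import re


def tokens(title):
    """Lowercase a title and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_record(noisy, reference, threshold=0.6):
    """Return the reference record whose title is most similar to the
    noisy record's title, or None if no score reaches the threshold.
    The threshold value is an assumption, not taken from the paper."""
    noisy_tokens = tokens(noisy["title"])
    best, best_score = None, 0.0
    for ref in reference:
        score = jaccard(noisy_tokens, tokens(ref["title"]))
        if score > best_score:
            best, best_score = ref, score
    return best if best_score >= threshold else None


def clean(noisy, reference, threshold=0.6):
    """Return a copy of the noisy record with fields replaced by the
    linked reference record; unchanged if no link is found."""
    match = link_record(noisy, reference, threshold)
    cleaned = dict(noisy)
    if match is not None:
        cleaned.update(match)
    return cleaned


# Hypothetical toy data: a small "clean" reference set and one noisy
# record whose title picked up trailing text during extraction.
REFERENCE = [
    {"title": "Efficient record linkage in large data sets", "year": 2003},
    {"title": "Data cleaning: problems and current approaches", "year": 2000},
]
NOISY = {"title": "Efficient record linkage in large data sets 1 Introduction",
         "year": None}

if __name__ == "__main__":
    print(clean(NOISY, REFERENCE))
```

A linear scan over the reference set is used here for brevity; on a collection of CiteSeer's size, some form of indexing or blocking would be needed to avoid comparing every pair of records.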
Acknowledgements
This work has been partially supported by the National Key Research Program of China (2016YFB1001101), NSFC (No. 61440020, No. 61272398 and No. 61309030), an NSERC Discovery grant (RGPIN-2014-04463), and Programs for Innovation Research in CUFE.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Wang, Y. et al. (2016). A Data Cleaning Method for CiteSeer Dataset. In: Cellary, W., Mokbel, M., Wang, J., Wang, H., Zhou, R., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2016. WISE 2016. Lecture Notes in Computer Science, vol 10041. Springer, Cham. https://doi.org/10.1007/978-3-319-48740-3_3
DOI: https://doi.org/10.1007/978-3-319-48740-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48739-7
Online ISBN: 978-3-319-48740-3
eBook Packages: Computer Science, Computer Science (R0)