Abstract
Collecting valuable information from fast text streams has become a challenging task. In this work, we propose a method that extracts useful information effectively and efficiently. First, we maintain an analyzer based on the Trie structure and a dynamic N-Gram tokenizer; second, unlike the traditional search engine principle, we treat each document as a query and build indexes over the whole query base. The experimental results show that, compared with conventional methods, the approach offers strong adaptability, low latency, and high-quality support for complex query combinations.
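The abstract gives no implementation details, so the following is only a minimal sketch of one plausible reading of the two ideas it names: a trie that recognizes indexed terms as variable-length character n-grams, and an index built over standing queries so that each incoming document plays the role of the query. The class `QueryIndex`, its methods, and the restriction to conjunctive (AND) queries are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict

_END = ""  # end-of-term key; never collides with a single-character key


class QueryIndex:
    """Hypothetical index of standing AND-queries, matched against documents."""

    def __init__(self):
        self.trie = {}                    # nested dicts keyed by character
        self.queries = {}                 # query id -> set of required terms
        self.by_term = defaultdict(set)   # term -> ids of queries using it

    def add_query(self, qid, terms):
        """Register a query and insert each of its terms into the trie."""
        self.queries[qid] = set(terms)
        for term in terms:
            node = self.trie
            for ch in term:
                node = node.setdefault(ch, {})
            node[_END] = term
            self.by_term[term].add(qid)

    def tokenize(self, text):
        """Return every indexed term that occurs in text.

        Each start position is walked down the trie, so matches of any
        length (variable-length character n-grams) are found.
        """
        found = set()
        for start in range(len(text)):
            node = self.trie
            for ch in text[start:]:
                node = node.get(ch)
                if node is None:
                    break
                if _END in node:
                    found.add(node[_END])
        return found

    def match(self, text):
        """Return sorted ids of queries whose terms all occur in text."""
        terms = self.tokenize(text)
        candidates = set()
        for term in terms:
            candidates |= self.by_term[term]
        return sorted(q for q in candidates if self.queries[q] <= terms)


idx = QueryIndex()
idx.add_query("q1", ["text", "stream"])
idx.add_query("q2", ["trie"])
idx.add_query("q3", ["stream", "index"])
print(idx.match("fast text streams arrive"))  # ['q1']
```

In this sketch the work per document depends on the document length and on the queries that share at least one term with it, not on the total number of registered queries, which is the usual motivation for indexing queries instead of documents in a streaming setting.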
© 2014 IFIP International Federation for Information Processing
About this paper
Cite this paper
Qi, B., Ma, G., Shi, Z., Wang, W. (2014). Collecting Valuable Information from Fast Text Streams. In: Shi, Z., Wu, Z., Leake, D., Sattler, U. (eds) Intelligent Information Processing VII. IIP 2014. IFIP Advances in Information and Communication Technology, vol 432. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44980-6_11
DOI: https://doi.org/10.1007/978-3-662-44980-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44979-0
Online ISBN: 978-3-662-44980-6
eBook Packages: Computer Science, Computer Science (R0)