Abstract
Clustering categorical sequences remains a difficult problem because sequences lack an efficient representation model. Existing models mainly represent sequences by fixed-length tuples; this paper proposes a new representation model based on variable-length tuples. A suffix tree is first built over fixed-length tuples using a large memory length, and an entropy-based measure of tuple redundancy is then used to prune redundant tuples from the tree, leaving tuples of variable length. A partitioning algorithm for clustering categorical sequences is then defined on the normalized representation built from the tuples collected from the pruned tree. Experimental studies on six real-world sequence sets show the effectiveness and suitability of the proposed method for subsequence-based clustering.
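The pipeline outlined in the abstract (count tuples up to a memory length, prune redundant ones, represent each sequence over the survivors) can be illustrated with a minimal sketch. Everything specific here is an assumption, since the abstract does not give the details: a flat `Counter` stands in for the suffix tree, the pruning test (drop in next-symbol entropy when a context is lengthened by one symbol) stands in for the paper's entropy-based redundancy measure, and sum-to-one scaling stands in for its normalization. The names `count_tuples`, `prune_tuples`, `represent`, and `min_gain` are hypothetical.

```python
from collections import Counter
from math import log2


def count_tuples(sequences, max_len):
    """Count every contiguous tuple of length 1..max_len over all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            for k in range(1, max_len + 1):
                if i + k <= len(seq):
                    counts[tuple(seq[i:i + k])] += 1
    return counts


def next_symbol_entropy(counts, context, alphabet):
    """Entropy in bits of the symbol following `context`, from tuple counts."""
    child = [counts.get(context + (a,), 0) for a in alphabet]
    total = sum(child)
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in child if c > 0)


def prune_tuples(counts, alphabet, min_gain):
    """Return the retained tuples, shortest first.

    Assumed stand-in criterion, not the paper's measure: a tuple longer than
    one symbol is kept only if its prefix is kept and lengthening the context
    from prefix[1:] to prefix lowers the next-symbol entropy by >= min_gain.
    """
    kept = {t for t in counts if len(t) == 1}
    for t in sorted((t for t in counts if len(t) > 1), key=len):
        context = t[:-1]
        if context not in kept:
            continue  # the parent was pruned, so the whole subtree goes too
        gain = (next_symbol_entropy(counts, context[1:], alphabet)
                - next_symbol_entropy(counts, context, alphabet))
        if gain >= min_gain:
            kept.add(t)
    return sorted(kept, key=lambda t: (len(t), t))


def represent(seq, tuples):
    """Frequency vector of `seq` over `tuples`, scaled to sum to one."""
    local = count_tuples([seq], max(len(t) for t in tuples))
    vec = [local.get(t, 0) for t in tuples]
    total = sum(vec)
    return [v / total if total else 0.0 for v in vec]
```

Vectors produced by `represent` all share the same coordinates, so any partitioning algorithm that works on numeric vectors could be run on them; which algorithm and distance the paper actually uses is not stated in the abstract.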
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant No. 61175123, and partially supported by the Natural Science Foundation of Fujian Province of China under Grant No. 2015J01238.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Yuan, L., Hong, Z., Chen, L., Cai, Q. (2016). Clustering Categorical Sequences with Variable-Length Tuples Representation. In: Lehner, F., Fteimi, N. (eds) Knowledge Science, Engineering and Management. KSEM 2016. Lecture Notes in Computer Science, vol 9983. Springer, Cham. https://doi.org/10.1007/978-3-319-47650-6_2
DOI: https://doi.org/10.1007/978-3-319-47650-6_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47649-0
Online ISBN: 978-3-319-47650-6
eBook Packages: Computer Science, Computer Science (R0)