Abstract
The bag-of-words representation commonly used in text analysis can be analyzed very efficiently and retains a great deal of useful information, but it is also troublesome because the same thought can be expressed using many different terms and a single term can have very different meanings. Dimension reduction can collapse together terms that have the same semantics, identify and disambiguate terms with multiple meanings, and provide a lower-dimensional representation of documents that reflects concepts instead of raw terms. In this chapter, we survey two influential forms of dimension reduction. Latent semantic indexing uses spectral decomposition to identify a lower-dimensional representation that maintains the semantic properties of the documents. Topic modeling, including probabilistic latent semantic indexing and latent Dirichlet allocation, is a form of dimension reduction that uses a probabilistic model to find the co-occurrence patterns of terms that correspond to semantic topics in a collection of documents. We describe the basic technologies in detail and expose the underlying mechanisms. We also discuss recent advances that have made it possible to apply these techniques to very large and evolving text collections and to incorporate network structure or other contextual information.
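As a rough illustration of how spectral decomposition can collapse terms with the same semantics, the following minimal sketch applies a truncated singular value decomposition to a toy term-document matrix. The vocabulary, the matrix, the choice of two latent dimensions, and the helper names `cosine` and `lsi_term_vectors` are all assumptions made for this example; they are not taken from the chapter.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# "car" and "automobile" never appear in the same document, but both
# co-occur with "engine". All values here are made up for illustration.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
], dtype=float)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def lsi_term_vectors(matrix, k):
    """Term coordinates in a k-dimensional latent space via truncated SVD."""
    U, s, _ = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular values and scale the term axes by them.
    return U[:, :k] * s[:k]


term_vecs = lsi_term_vectors(A, k=2)

# Similarity of "car" and "automobile" in the raw bag-of-words space.
raw_sim = cosine(A[0], A[1])
# Similarity of the same two terms in the latent space.
latent_sim = cosine(term_vecs[0], term_vecs[1])
# Similarity of "car" and "flower" in the latent space.
unrelated_sim = cosine(term_vecs[0], term_vecs[3])

print(f"raw car/automobile:    {raw_sim:.3f}")
print(f"latent car/automobile: {latent_sim:.3f}")
print(f"latent car/flower:     {unrelated_sim:.3f}")
```

In this toy setting the two synonyms are orthogonal in the raw term space but land on the same latent direction, while the unrelated term stays orthogonal to both. Real collections are noisier, and the number of latent dimensions has to be chosen with care.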
Copyright information
© 2012 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Crain, S.P., Zhou, K., Yang, SH., Zha, H. (2012). Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond. In: Aggarwal, C., Zhai, C. (eds) Mining Text Data. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-3223-4_5
DOI: https://doi.org/10.1007/978-1-4614-3223-4_5
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4614-3222-7
Online ISBN: 978-1-4614-3223-4
eBook Packages: Computer Science (R0)