Abstract
Training a large probabilistic neural network language model with a typical high-dimensional output layer is excessively time-consuming; this is one of the main reasons that simpler models such as n-grams remain more popular despite their inferior performance. In this paper, a Chinese neural probabilistic language model is trained on the Fudan Chinese Language Corpus. Since hundreds of thousands of distinct words are tokenized from the raw corpus, the model contains tens of millions of parameters. To address this challenge, we implement a parallel neural network language model on a cluster using the popular parallel computing platform MPI (Message Passing Interface). Specifically, we propose a new method, termed Parallel Randomized Block Coordinate Descent (PRBCD), to train the model cost-effectively. Unlike traditional coordinate descent methods, PRBCD can be applied to networks with multiple layers, scaling up the gradients with respect to hidden units in proportion to the fraction of sampled parameters. We empirically show that PRBCD is stable and well suited to language models, which contain only a few layers yet often have a large number of parameters and extremely high-dimensional output targets.
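To make the sampling idea concrete, here is a minimal single-process sketch of one randomized block coordinate step on the output layer. This is only our reading of the abstract, not the paper's exact procedure: the dimensions, the sampling scheme (a uniform block that always contains the gold words), and the learning rate are illustrative assumptions, and the softmax is restricted to the sampled block with the hidden-unit gradient rescaled by the inverse sampling fraction, mirroring "scaling up the gradients with respect to hidden units proportionally based on sampled parameters".

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, B = 50_000, 256, 32            # hypothetical vocab, hidden, batch sizes
W = rng.normal(0.0, 0.01, (H, V))    # output-layer weights (the large block)

def prbcd_step(h, targets, W, block_frac=0.01, lr=0.1):
    """One randomized block coordinate step on the output layer.

    h:       (B, H) hidden activations for a mini-batch
    targets: (B,)   gold word indices
    Samples a random block of output columns, updates only those
    columns, and scales the gradient flowing back to the hidden units
    by the inverse sampling fraction to compensate for the coordinates
    left out of the block.
    """
    V = W.shape[1]
    k = max(1, int(block_frac * V))
    # Always include the gold words so their columns receive signal.
    block = np.union1d(targets, rng.choice(V, size=k, replace=False))

    logits = h @ W[:, block]                      # (B, |block|)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)             # softmax over the block only
    y = (block[None, :] == targets[:, None]).astype(p.dtype)
    g = (p - y) / len(targets)                    # dLoss/dlogits, batch-averaged

    dW_block = h.T @ g                            # gradient for sampled columns
    dh = (g @ W[:, block].T) * (V / len(block))   # scaled-up hidden gradient
    W[:, block] -= lr * dW_block                  # coordinate update in place
    return dh                                     # pass back to lower layers

# Usage on random data:
h = rng.normal(0.0, 1.0, (B, H))
targets = rng.integers(0, V, size=B)
dh = prbcd_step(h, targets, W)
```

Because each step touches only |block| of the V output columns, the cost of the otherwise dominant output layer shrinks roughly by the sampling fraction, which is what makes this kind of method attractive for high-dimensional output targets.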
Notes
- 1. Here 'identical' means each processor keeps exactly the same parameters and training samples in the embedding layer and hidden layer. When the proposed PRBCD (discussed later in the paper) is applied, the processors are no longer strictly identical. A minimal MPI sketch of this replicated setup appears after these notes.
- 2.
- 3.
- 4.
- 5. In parallel computing, an embarrassingly parallel workload or problem is one where little or no effort is needed to separate it into multiple parallel tasks [7].
- 6.
- 7. Overall categories include Agriculture, Art, Communication, Computer, Economy, Education, Electronics, Energy, Environment, History, Law, Literature, Medical, Military, Mine, Philosophy, Politics, Space, Sports, Transport.
- 8. Toolbox: https://code.google.com/p/word2vec/.
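Expanding on note 1, the baseline parallel scheme keeps an identical model replica on every processor and averages gradients across them. Below is a minimal mpi4py sketch of that data-parallel pattern; the parameter shapes, seeds, and the stand-in gradient function are hypothetical, not code from the paper.

```python
# Run with, e.g.: mpiexec -n 4 python this_script.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(42)        # same seed -> identical init on every rank
params = rng.normal(0.0, 0.01, 1_000)  # replicated parameter vector

def local_gradient(params, rank):
    """Stand-in for backpropagation over this rank's shard of the corpus."""
    shard_rng = np.random.default_rng(1_000 + rank)
    return shard_rng.normal(0.0, 1.0, params.shape) * 1e-3

grad = local_gradient(params, rank)
avg = np.empty_like(grad)
comm.Allreduce(grad, avg, op=MPI.SUM)  # sum gradients from all ranks
params -= 0.1 * (avg / size)           # identical update on every replica
```

After the `Allreduce`, every rank applies the same averaged update, so the replicas remain identical; as note 1 points out, PRBCD's per-processor sampling is what relaxes this strict identity.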
References
Bengio, Y., Schwenk, H., Senécal, J.S., Morin, F., Gauvain, J.L.: Neural probabilistic language models. In: Holmes, D.E., Jain, L.C. (eds.) Innovations in Machine Learning, pp. 137–186. Springer, Heidelberg (2006)
Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–479 (1992)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Fredkin, E.: Trie memory. Commun. ACM 3(9), 490–499 (1960)
Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: International Conference on Artificial Intelligence and Statistics, pp. 315–323 (2011)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: ICCV (2015)
Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming, Revised Reprint. Morgan Kaufmann (2012)
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
Jelinek, F.: Interpolated estimation of Markov source parameters from sparse data. In: Pattern Recognition in Practice (1980)
Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)
Lai, S., Liu, K., Xu, L., Zhao, J.: How to generate a good word embedding? arXiv preprint arXiv:1507.05523 (2015)
Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of ICML, vol. 30, p. 1 (2013)
Mikolov, T., Deoras, A., Kombrink, S., Burget, L., Cernocky, J.: Empirical evaluation and combination of advanced language modeling techniques. In: Proceedings of Interspeech, pp. 605–608 (2011)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Mnih, A., Hinton, G.: A scalable hierarchical distributed language model. In: Advances in Neural Information Processing Systems (2009)
Mnih, A., Teh, Y.W.: A fast and simple algorithm for training neural probabilistic language models. In: Proceedings of the 29th International Conference on Machine Learning (2012)
Morin, F., Bengio, Y.: Hierarchical probabilistic neural network language model. In: Proceedings of AISTATS, pp. 246–252 (2005)
Sainath, T., Kingsbury, B., Sindhwani, V., Arisoy, E., Ramabhadran, B.: Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In: ICASSP, pp. 6655–6659 (2013)
Schwenk, H., Gauvain, J.L.: Training neural network language models on very large corpora. In: Proceedings of HLT/EMNLP, pp. 201–208 (2005)
Tieleman, T., Hinton, G.: Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 4, 2 (2012)
Viterbi, A.J.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13(2), 260–269 (1967)
Zeiler, M.D., Fergus, R.: Stochastic pooling for regularization of deep convolutional neural networks. In: ICLR (2013)
Zhao, T., Yu, M., Wang, Y., Arora, R., Liu, H.: Accelerated mini-batch randomized block coordinate descent method. In: Advances in Neural Information Processing Systems, pp. 3329–3337 (2014)
Acknowledgement
This work is partially supported by China Postdoctoral Science Foundation Funded Project (2016M590337), NSFC (11501210), Shanghai YangFan Plan (15YF1403400), and Shanghai Science and Technology Committee Project (15JC1401700).
Copyright information
© 2016 Springer Nature Singapore Pte Ltd.
Cite this paper
Liu, X., Yan, J., Wang, X., Zha, H. (2016). Parallel Randomized Block Coordinate Descent for Neural Probabilistic Language Model with High-Dimensional Output Targets. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds) Pattern Recognition. CCPR 2016. Communications in Computer and Information Science, vol 663. Springer, Singapore. https://doi.org/10.1007/978-981-10-3005-5_28
DOI: https://doi.org/10.1007/978-981-10-3005-5_28
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3004-8
Online ISBN: 978-981-10-3005-5