Abstract
This chapter provides a brief overview of how machine learning and deep learning algorithms are trained for biomedical natural language processing tasks. Its contents will be familiar to readers who have previously studied machine learning and deep learning methods. It discusses the design of inputs and outputs for machine learning models, training algorithms including gradient descent, feature-based methods including logistic regression and decision trees, and deep learning methods including convolutional, recurrent, and transformer networks.
Glossary
- Machine learning: algorithms that improve their performance on a task automatically from data, rather than being explicitly programmed.
- Gradient descent: an optimization algorithm that repeatedly updates a model’s parameters based upon the gradient of the model’s cost function.
- Parameter (of a machine learning model): a choice (typically a numeric value) that is determined automatically by the machine learning algorithm from the training data.
- Hyperparameter (of a machine learning model): a choice (typically a numeric value) that is not made by the learning algorithm but must instead be made by the machine learning designer.
- Parameter initialization: how initial values are assigned to the parameters of a machine learning model.
- Mini-batch: a small sample of the training data.
- Stochastic gradient descent: gradient descent using mini-batches, rather than the entire training data, for each gradient step (see the first sketch after this glossary).
- Training set: example inputs and outputs on which the parameters of a machine learning model are to be tuned.
- Development set: example inputs and outputs on which the hyperparameters of a machine learning model are to be tuned.
- Test set: example inputs and outputs on which the generalization of a machine learning model is to be evaluated.
- Learning curve: a plot of model performance on the training and/or development data against varying amounts of training data.
- Underfitting: when a model has a high cost on the training set.
- Overfitting: when a model has a low cost on the training set but a high cost on the development or test set.
- Regularization: including a measure of model complexity, in addition to a measure of error on the training data, in the cost function that is minimized for a machine learning model.
- Grid search: a hyperparameter search algorithm where the designer selects a small number of values for each hyperparameter and then explores all possible combinations of those hyperparameter values.
- Random search: a hyperparameter search algorithm where the designer selects a numeric range of values for each hyperparameter and then samples a fixed number of combinations, each time sampling all hyperparameter values from their specified ranges (both search strategies are sketched after this glossary).
- Feature engineering: designing the inputs that are fed into a machine learning algorithm.
- 1-hot vector: a vector where a single entry is 1 and all other entries are 0.
- Word embedding: a small dense vector used to represent a word.
- Bag-of-words: a representation of text as a vector of counts of each of the words in a vocabulary that were found in that text (see the sketch after this glossary).
- Linear regression: a machine learning model that assumes that outputs can be predicted as a weighted sum of the inputs.
- Logistic regression: a machine learning model that assumes that outputs can be predicted by applying a logistic sigmoid over the weighted sum of the inputs.
- Support vector machine: a machine learning model that assumes that outputs can be predicted as a weighted sum of the inputs, under a learning process that maximizes the margin between the closest examples of one class and the other.
- Decision tree: a machine learning model that assumes that outputs can be predicted by applying a series of tests to different features of the input.
- k-nearest neighbors: a machine learning model that assumes that an output can be predicted by taking a new input, finding the most similar inputs among the training examples, and producing the most common output from those training examples (see the sketch after this glossary).
- Deep learning: applying complex neural network architectures to learn non-linear transformations of the input.
- Feedforward network: a machine learning model that assumes that outputs can be predicted as non-linear combinations of the inputs.
- Convolutional network: a machine learning model that assumes that outputs can be predicted by aggregating over non-linear combinations of small regions of the input.
- Recurrent network: a machine learning model that assumes that an output can be predicted by walking through the input one step at a time and, at each step, non-linearly combining the new step’s input with an aggregation of all of the previous steps’ inputs.
- Transformer network: a machine learning model that assumes that an output can be predicted by non-linear combinations over all pairs of time steps in the input.
- Sequence-to-sequence model: a model that takes a sequence of inputs and produces a sequence of outputs, where there is no guaranteed relation between the lengths of the input and output sequences.
- Pre-training (a neural network): training a neural network on a task, typically using large unlabeled data, with the assumption that the network will later be fine-tuned.
- Fine-tuning (a neural network): taking the learned parameters of a pre-trained model and using those as the starting point for training on a new task.
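To make the gradient descent, mini-batch, and stochastic gradient descent entries concrete, the following is a minimal sketch (not code from the chapter) of mini-batch stochastic gradient descent training a logistic regression classifier. The synthetic data, learning rate, batch size, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: 200 examples, 5 features, binary labels.
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (X @ true_w + 0.1 * rng.normal(size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters, learned from the training data; the learning rate, batch size,
# and number of epochs are hyperparameters chosen by the designer.
w = np.zeros(5)
b = 0.0
learning_rate = 0.1
batch_size = 32

for epoch in range(50):
    order = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]  # one mini-batch
        Xb, yb = X[idx], y[idx]
        p = sigmoid(Xb @ w + b)                # predicted probabilities
        # Gradient of the mean cross-entropy cost with respect to w and b.
        grad_w = Xb.T @ (p - yb) / len(idx)
        grad_b = np.mean(p - yb)
        # Gradient descent step: move the parameters against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```

Plain gradient descent would compute the gradient over the entire training set at each step; the sketch instead shuffles the data and takes one step per mini-batch.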
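The 1-hot vector and bag-of-words entries can likewise be illustrated with a short sketch; the toy vocabulary and example sentence below are made up for illustration and do not come from the chapter.

```python
import numpy as np

vocab = ["patient", "reports", "chest", "pain", "no", "fever"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """1-hot vector: a single 1 at the word's index, 0 everywhere else."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

def bag_of_words(text):
    """Bag-of-words: counts of each vocabulary word found in the text."""
    v = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in index:
            v[index[word]] += 1.0
    return v

print(one_hot("pain"))  # [0. 0. 0. 1. 0. 0.]
print(bag_of_words("patient reports chest pain no fever no cough"))
```

A word embedding would replace these sparse vectors with a small dense vector learned for each word.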
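The grid search and random search entries are contrasted in the sketch below; the `evaluate` function is a hypothetical stand-in for training a model with the given hyperparameters and scoring it on the development set, and the value ranges are illustrative.

```python
import itertools
import random

def evaluate(learning_rate, hidden_size):
    # Placeholder for: train a model with these hyperparameters on the
    # training set, then measure its performance on the development set.
    return -(learning_rate - 0.01) ** 2 - (hidden_size - 128) ** 2 / 1e4

# Grid search: every combination of a few hand-picked values.
grid_lr = [0.001, 0.01, 0.1]
grid_hidden = [64, 128, 256]
best = max(itertools.product(grid_lr, grid_hidden),
           key=lambda combo: evaluate(*combo))
print("grid search best:", best)

# Random search: a fixed budget of combinations, each hyperparameter
# sampled from its specified range.
random.seed(0)
candidates = [(10 ** random.uniform(-4, -1), random.randint(32, 512))
              for _ in range(9)]
best = max(candidates, key=lambda combo: evaluate(*combo))
print("random search best:", best)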
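Finally, a minimal sketch of the k-nearest neighbors entry, using Euclidean distance over a toy training set; the data and the choice of k are illustrative assumptions.

```python
import numpy as np
from collections import Counter

train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1])

def knn_predict(x, k=3):
    # Find the k training inputs most similar (closest) to the new input x...
    distances = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(distances)[:k]
    # ...and produce the most common output among those training examples.
    return Counter(train_y[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([0.2, 0.1])))  # 0
print(knn_predict(np.array([0.8, 0.9])))  # 1
```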