Abstract
This chapter provides a brief overview of how machine learning and deep learning algorithms are trained for biomedical natural language processing tasks. Its contents will be familiar to readers who have previously studied machine learning and deep learning methods. It discusses the design of inputs and outputs for machine learning models, training algorithms including gradient descent, feature-based methods including logistic regression and decision trees, and deep learning methods including convolutional, recurrent, and transformer networks.
Glossary
- Machine learning: algorithms that improve their performance on a task automatically from data, rather than being explicitly programmed.
- Gradient descent: an optimization algorithm that repeatedly updates a model’s parameters based upon the gradient of the model’s cost function.
- Parameter (of a machine learning model): a choice (typically a numeric value) that is determined automatically by the machine learning algorithm from the training data.
- Hyperparameter (of a machine learning model): a choice (typically a numeric value) that is not made by the learning algorithm but must instead be made by the machine learning designer.
- Parameter initialization: how initial values are assigned to the parameters of a machine learning model.
- Mini-batch: a small sample of the training data.
- Stochastic gradient descent: gradient descent using mini-batches, rather than the entire training data, for each gradient step (see the first sketch after this glossary).
- Training set: example inputs and outputs on which the parameters of a machine learning model are to be tuned.
- Development set: example inputs and outputs on which the hyperparameters of a machine learning model are to be tuned.
- Test set: example inputs and outputs on which the generalization of a machine learning model is to be evaluated.
- Learning curve: a plot of model performance on the training and/or development data against varying amounts of training data.
- Underfitting: when a model has a high cost on the training set.
- Overfitting: when a model has a low cost on the training set but a high cost on the development or test set.
- Regularization: including a measure of model complexity, in addition to a measure of error on the training data, in the cost function that is minimized for a machine learning model.
- Grid search: a hyperparameter search algorithm where the designer selects a small number of values for each hyperparameter and then explores all possible combinations of those hyperparameter values.
- Random search: a hyperparameter search algorithm where the designer selects a numeric range of values for each hyperparameter and then samples a fixed number of combinations, each time sampling all hyperparameter values from their specified ranges (both search strategies are sketched after this glossary).
- Feature engineering: designing the inputs that are fed into a machine learning algorithm.
- 1-hot vector: a vector where a single entry is 1 and all other entries are 0.
- Word embedding: a small dense vector used to represent a word.
- Bag-of-words: a representation of text as a vector of counts of each of the words in a vocabulary that were found in that text (see the sketch after this glossary).
- Linear regression: a machine learning model that assumes that outputs can be predicted as a weighted sum of the inputs.
- Logistic regression: a machine learning model that assumes that outputs can be predicted by applying a logistic sigmoid over the weighted sum of the inputs.
- Support vector machine: a machine learning model that assumes that outputs can be predicted as a weighted sum of the inputs, under a learning process that maximizes the margin between the closest examples of one class and the other.
- Decision tree: a machine learning model that assumes that outputs can be predicted by applying a series of tests to different features of the input.
- k-nearest neighbors: a machine learning model that assumes that an output can be predicted by taking a new input, finding the most similar inputs among the training examples, and producing the most common output from those training examples (see the sketch after this glossary).
- Deep learning: applying complex neural network architectures to learn non-linear transformations of the input.
- Feedforward network: a machine learning model that assumes that outputs can be predicted as non-linear combinations of the inputs.
- Convolutional network: a machine learning model that assumes that outputs can be predicted by aggregating over non-linear combinations of small regions of the input.
- Recurrent network: a machine learning model that assumes that an output can be predicted by walking through the input one step at a time and, at each step, non-linearly combining the new step’s input with an aggregation of all of the previous steps’ inputs.
- Transformer network: a machine learning model that assumes that an output can be predicted by non-linear combinations over all pairs of time steps in the input.
- Sequence-to-sequence model: a model that takes a sequence of inputs and produces a sequence of outputs, where there is no guaranteed relation between the lengths of the input and output sequences.
- Pre-training (a neural network): training a neural network on a task, typically using large unlabeled data, with the assumption that the network will later be fine-tuned.
- Fine-tuning (a neural network): taking the learned parameters of a pre-trained model and using those as the starting point for training on a new task.
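To make the gradient descent, mini-batch, and stochastic gradient descent entries concrete, the following is a minimal sketch (not code from the chapter) of mini-batch stochastic gradient descent training a logistic regression classifier. The synthetic data, learning rate, batch size, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: 200 examples, 5 features, binary labels.
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (X @ true_w + 0.1 * rng.normal(size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters, learned from the training data; the learning rate, batch size,
# and number of epochs are hyperparameters chosen by the designer.
w = np.zeros(5)
b = 0.0
learning_rate = 0.1
batch_size = 32

for epoch in range(50):
    order = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]  # one mini-batch
        Xb, yb = X[idx], y[idx]
        p = sigmoid(Xb @ w + b)                # predicted probabilities
        # Gradient of the mean cross-entropy cost with respect to w and b.
        grad_w = Xb.T @ (p - yb) / len(idx)
        grad_b = np.mean(p - yb)
        # Gradient descent step: move the parameters against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```

Plain gradient descent would compute the gradient over the entire training set at each step; the sketch instead shuffles the data and takes one step per mini-batch.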
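The 1-hot vector and bag-of-words entries can likewise be illustrated with a short sketch; the toy vocabulary and example sentence below are made up for illustration and do not come from the chapter.

```python
import numpy as np

vocab = ["patient", "reports", "chest", "pain", "no", "fever"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """1-hot vector: a single 1 at the word's index, 0 everywhere else."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

def bag_of_words(text):
    """Bag-of-words: counts of each vocabulary word found in the text."""
    v = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in index:
            v[index[word]] += 1.0
    return v

print(one_hot("pain"))  # [0. 0. 0. 1. 0. 0.]
print(bag_of_words("patient reports chest pain no fever no cough"))
```

A word embedding would replace these sparse vectors with a small dense vector learned for each word.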
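The grid search and random search entries are contrasted in the sketch below; the `evaluate` function is a hypothetical stand-in for training a model with the given hyperparameters and scoring it on the development set, and the value ranges are illustrative.

```python
import itertools
import random

def evaluate(learning_rate, hidden_size):
    # Placeholder for: train a model with these hyperparameters on the
    # training set, then measure its performance on the development set.
    return -(learning_rate - 0.01) ** 2 - (hidden_size - 128) ** 2 / 1e4

# Grid search: every combination of a few hand-picked values.
grid_lr = [0.001, 0.01, 0.1]
grid_hidden = [64, 128, 256]
best = max(itertools.product(grid_lr, grid_hidden),
           key=lambda combo: evaluate(*combo))
print("grid search best:", best)

# Random search: a fixed budget of combinations, each hyperparameter
# sampled from its specified range.
random.seed(0)
candidates = [(10 ** random.uniform(-4, -1), random.randint(32, 512))
              for _ in range(9)]
best = max(candidates, key=lambda combo: evaluate(*combo))
print("random search best:", best)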
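Finally, a minimal sketch of the k-nearest neighbors entry, using Euclidean distance over a toy training set; the data and the choice of k are illustrative assumptions.

```python
import numpy as np
from collections import Counter

train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1])

def knn_predict(x, k=3):
    # Find the k training inputs most similar (closest) to the new input x...
    distances = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(distances)[:k]
    # ...and produce the most common output among those training examples.
    return Counter(train_y[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([0.2, 0.1])))  # 0
print(knn_predict(np.array([0.8, 0.9])))  # 1
```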