
A model that computes either the joint or the conditional probability of natural language texts is called a language model, as it potentially covers all information about the language. In this chapter, we present the main architecture types of attention-based language models (LMs), which process texts consisting of sequences of tokens, i.e. words, numbers, punctuation, etc.:

  • Autoencoders (AE) receive an input text and produce a contextual embedding for each token. These models are also called BERT models and are described in Sect. 2.1.

  • Autoregressive language models (AR) receive a subsequence v1, …, vt−1 of tokens of the input text. They generate contextual embeddings for each token and use them to predict the next token vt. In this way, they can successively predict all tokens of the sequence. These models are also called GPT models and are outlined in Sect. 2.2.

  • Transformer Encoder-Decoders have the task to translate an input sequence into another sequence, e.g. for language translation. First, they generate a contextual embedding for each input token by an autoencoder. Then these embeddings are used as input to an autoregressive language model, which sequentially generates the output sequence tokens. These models are also called Transformers and are defined in Sect. 2.3.

In this chapter, we focus on NLP, where we consider sequences of text tokens. Historically, the transformer encoder-decoder was developed in 2017 by Vaswani et al. [141] to perform translation of text into another language. The autoencoder BERT [39] followed in 2018. A word usually has several different meanings depending on its context. In the conventional word embeddings of Sect. 1.5 all these meanings are conflated. As a consequence, the interpretation of text based on these embeddings is flawed.

As an alternative, contextual embeddings or contextualized embeddings were developed, where the details of a word embedding depend on the word itself as well as on the neighboring words occurring in the specific document. Consequently, each occurrence of the same word in the text has a different embedding depending on the context. Starting with the Transformer [141], a number of approaches have been designed to generate these contextual embeddings, which are generally trained in an unsupervised manner using a large corpus of documents.

BERT (Bidirectional Encoder Representations from Transformers) was proposed by Devlin et al. [39]. Each input token vt is represented by an embedding vector xt of length demb. For self-attention, every embedding is transformed into a query vector qt, a key vector kt of length dk, and a value vector vt of length dv by the matrices W(q), W(k), and W(v):

$$\displaystyle \begin{aligned} \boldsymbol{q}_t^\intercal={\boldsymbol{x}}_t^\intercal {\boldsymbol{W}}^{(q)} \qquad \boldsymbol{k}_t^\intercal = {\boldsymbol{x}}_t^\intercal {\boldsymbol{W}}^{(k)} \qquad {\boldsymbol{v}}_t^\intercal={\boldsymbol{x}}_t^\intercal {\boldsymbol{W}}^{(v)}. {} \end{aligned} $$
(2.1)

Note that the query- and key-vectors have the same length. Then scalar products \(\boldsymbol {q}^\intercal _r\boldsymbol {k}_t\) between the query-vector qr of a target token vr and the key-vectors kt of all tokens of the sequence are computed:

$$\displaystyle \begin{aligned} (\alpha_{r,1},\ldots,\alpha_{r,T})=\operatorname{\mathrm{softmax}}\left( \frac{\boldsymbol{q}^\intercal_r\boldsymbol{k}_1}{\sqrt{d_k}},\ldots, \frac{\boldsymbol{q}^\intercal_r\boldsymbol{k}_T}{\sqrt{d_k}}\right). {} \end{aligned} $$
(2.2)

Each scalar product yields a real-valued association score \((\boldsymbol {q}^\intercal _r\boldsymbol {k}_t)/\sqrt {d_k}\) between the tokens, which depends on the matrices W(q) and W(k). This association score is called scaled dot-product attention. It is normalized to a probability score αr,t by the softmax function. The factor \(1/\sqrt {d_k}\) avoids large values of the scalar product, for which the softmax function has only tiny gradients. With these weights a weighted average of the value vectors vt of all sequence elements is formed, yielding the new embedding x̆r of length dv for the target token vr:

$$\displaystyle \begin{aligned} \breve{{\boldsymbol{x}}}_r = \alpha_{r,1}*{\boldsymbol{v}}_1+\cdots+\alpha_{r,T}*{\boldsymbol{v}}_T {}. \end{aligned} $$
(2.3)

This algorithm is called self-attention and was first proposed by Vaswani et al. [141]. Figure 2.2 shows the computations for the r-th token “mouse”. Note that the resulting embedding is a contextual embedding, as it includes information about all words in the input text. The value vector vt gets a high weight whenever the scalar product \(\boldsymbol {q}^\intercal _r\boldsymbol {k}_t\) is large. It measures a specific form of correlation between xr and xt and is maximal if the vector \({\boldsymbol {x}}_r^\intercal {\boldsymbol {W}}^{(q)}\) points in the same direction as \({\boldsymbol {x}}_t^\intercal {\boldsymbol {W}}^{(k)}\).

Fig. 2.2

Computation of a contextual embedding for a single token “mouse” by self-attention. By including the embedding of “cheese”, the embedding of mouse can be shifted to the meaning of “rodent” and away from “computer pointing device”. Such an embedding is computed for every word of the input sequence

The self-attention mechanism in general is non-symmetric, as the matrices W(q) and W(k) are different. If token vi has a high attention to token vj (i.e. \(\boldsymbol {q}^\intercal _i\boldsymbol {k}_j\) is large), this does not necessarily mean that vj will highly attend to token vi (i.e. \(\boldsymbol {q}^\intercal _j\boldsymbol {k}_i\) also is large). The influence of vi on the contextual embedding of vj therefore is different from the influence of vj on the contextual embedding of vi. Consider the example text “Fred gave roses to Mary”. Here the word “gave” has different relations to the remaining words: “Fred” is the person who is performing the giving, “roses” are the objects being given, and “Mary” is the recipient of the given objects. Obviously these semantic role relations are non-symmetric. Therefore, they can be captured by the different matrices W(q) and W(k) and encoded in the embeddings.

Self-attention allows for shorter computation paths and provides direct avenues to compare distant elements in the input sequence, such as a pronoun and its antecedent in a sentence. The multiplicative interaction involved in attention provides a flexible alternative to the fixed-weight computation of MLPs and CNNs by dynamically adjusting the computation to the input at hand. This is especially useful for language modeling, where, for instance, the sentence “She ate the ice-cream with the X” is processed. While a feed-forward network would always process it in the same way, an attention-based model can adapt its computation to the input and update the contextual embedding of the word “ate” if X is “spoon”, or update the embedding of “ice-cream” if X refers to “strawberries” [17].

In practice all query, key, and value vectors are computed in parallel by Q = XW(q), K = XW(k), V  = XW(v), where X is the T × demb matrix of input embeddings [141]. The query-vectors qt, key-vectors kt and value vectors vt are the rows of Q, K, V respectively. Then the self-attention output matrix ATTL(X) is calculated by one large matrix expression

$$\displaystyle \begin{aligned} \breve{{\boldsymbol{X}}}=\text{ATTL}({\boldsymbol{X}})=\text{ATTL}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})=\operatorname{\mathrm{softmax}}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^\intercal}{\sqrt{d_k}}\right)\boldsymbol{V} {}, \end{aligned} $$
(2.4)

resulting in a T × dv-matrix X̆. Its r-th row contains the new embedding x̆r of the r-th token vr.
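The following Python sketch illustrates the scaled dot-product self-attention of (2.1) to (2.4) with NumPy. The matrix names follow the notation of the text; the random initialization and the tiny dimensions are only placeholders for illustration.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention (2.4) for a T x d_emb matrix X of input embeddings."""
    Q = X @ W_q                        # query vectors,  T x d_k
    K = X @ W_k                        # key vectors,    T x d_k
    V = X @ W_v                        # value vectors,  T x d_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # association scores (2.2), T x T
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)   # row-wise softmax -> probability scores
    return alpha @ V                   # weighted value vectors (2.3), T x d_v

# toy example: T = 4 tokens, d_emb = 8, d_k = d_v = 8
rng = np.random.default_rng(0)
T, d_emb, d_k = 4, 8, 8
X = rng.normal(size=(T, d_emb))                              # input embeddings
W_q, W_k, W_v = (rng.normal(size=(d_emb, d_k)) for _ in range(3))
X_new = self_attention(X, W_q, W_k, W_v)                     # contextual embeddings, T x d_v
```

Each row of X_new corresponds to one contextual embedding x̆r computed by (2.3).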

A number of alternative compatibility measures instead of the scaled dot-product attention (2.2) have been proposed. They are, however, rarely used in PLMs, as described in the surveys [27, 46].

It turns out that a single self-attention module is not sufficient to characterize the tokens. Therefore, in each layer, dhead parallel self-attentions are computed with different matrices \({\boldsymbol {W}}^{(q)}_m\), \({\boldsymbol {W}}^{(k)}_m\), and \({\boldsymbol {W}}^{(v)}_m\), m = 1, …, dhead, yielding partial new embeddings

$$\displaystyle \begin{aligned} \breve{{\boldsymbol{X}}}_m = \text{ATTL}({\boldsymbol{X}}{\boldsymbol{W}}^{(q)}_m, {\boldsymbol{X}}{\boldsymbol{W}}^{(k)}_m, {\boldsymbol{X}}{\boldsymbol{W}}^{(v)}_m) {}. \end{aligned} $$
(2.5)

The emerging partial embeddings x̆m,t for a token vt are able to concentrate on complementary semantic aspects, which develop during training.

The BERTBASE model has dhead=12 of these parallel attention heads. The lengths of these head embeddings are only a fraction demb∕dhead of the original length demb. The resulting embeddings are concatenated and multiplied with a (dhead ∗ dv) × demb-matrix W0 yielding the matrix of intermediate embeddings

$$\displaystyle \begin{aligned} \breve{{\boldsymbol{X}}} &= \left[\breve{{\boldsymbol{X}}}_1,\ldots,\breve{{\boldsymbol{X}}}_{d_{\text{head}}}\right] {\boldsymbol{W}}_0 {}, \end{aligned} $$
(2.6)

where W0 is a parameter matrix. If the length of the input embeddings is demb, the length of the query, key, and value vectors is chosen as dk = dv = demb∕dhead. Therefore, the concatenation again creates a T × demb matrix X̆. This setup is called multi-head self-attention. Because of the reduced dimension of the individual heads, the total computational cost is similar to that of a single-head attention with full dimensionality.
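Continuing the sketch above, multi-head self-attention as in (2.5) and (2.6) can be written as follows. The per-head matrices and the output matrix W0 are again random placeholders; in a trained model they are learned parameters.

```python
def multi_head_attention(X, heads, W_0):
    """heads: list of (W_q, W_k, W_v) triples; W_0: (d_head*d_v) x d_emb output matrix."""
    partial = [self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]   # (2.5)
    return np.concatenate(partial, axis=-1) @ W_0                             # concatenation (2.6)

d_head = 2                                    # number of heads (12 for BERT_BASE)
d_k = d_emb // d_head                         # reduced per-head dimension d_k = d_v = d_emb/d_head
heads = [tuple(rng.normal(size=(d_emb, d_k)) for _ in range(3)) for _ in range(d_head)]
W_0 = rng.normal(size=(d_head * d_k, d_emb))
X_breve = multi_head_attention(X, heads, W_0)   # intermediate embeddings, T x d_emb
```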

Subsequently, each row of X̆, i.e. the intermediate embedding vector \(\breve {{\boldsymbol {x}}}_t^\intercal \), is transformed by a fully connected layer Fcl with a ReLU activation followed by another linear transformation [141]

$$\displaystyle \begin{aligned} \tilde{{\boldsymbol{x}}}_t^\intercal &= \text{FCL}(\breve{{\boldsymbol{x}}}_t) =ReLU(\breve{{\boldsymbol{x}}}_t^\intercal*{\boldsymbol{W}}_1+\boldsymbol{b}_1^\intercal)*{\boldsymbol{W}}_2 + \boldsymbol{b}_2^\intercal {}. \end{aligned} $$
(2.7)

The matrices W0, W1, W2 and the vectors b1, b2 are parameters. These transformations are the same for each token vt of the sequence yielding the embedding \(\tilde {{\boldsymbol {x}}}_t \).

To improve training speed, residual connections are added as a “bypass”, which simply copy the input. They were shown to be extremely helpful for the optimization of multi-layer image classifiers [54]. In addition, layer normalization [6] is used for regularization (Sect. 2.4.2), as shown in Fig. 2.3. Together the multi-head self-attention (2.5), the concatenation (2.6), and the fully connected layer (2.7) form an encoder block.
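Putting the pieces together, a complete encoder block, i.e. multi-head self-attention (2.5) and (2.6), the fully connected layer (2.7), residual connections, and layer normalization, could look as sketched below, continuing the code above. The placement of the normalization and the omission of its learnable gain and bias are simplifications; implementations differ in these details.

```python
def layer_norm(x, eps=1e-6):
    """Normalize each embedding vector to zero mean and unit variance (gain/bias omitted)."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def encoder_block(X, heads, W_0, W_1, b_1, W_2, b_2):
    # multi-head self-attention with residual "bypass" connection and layer normalization
    X_breve = layer_norm(X + multi_head_attention(X, heads, W_0))
    # position-wise fully connected layer (2.7), applied to every token in the same way
    H = np.maximum(0.0, X_breve @ W_1 + b_1)             # ReLU activation
    return layer_norm(X_breve + H @ W_2 + b_2)           # second residual connection

d_ff = 4 * d_emb                                         # intermediate width, 4*d_emb as in [141]
W_1, b_1 = rng.normal(size=(d_emb, d_ff)), np.zeros(d_ff)
W_2, b_2 = rng.normal(size=(d_ff, d_emb)), np.zeros(d_emb)
X_out = encoder_block(X, heads, W_0, W_1, b_1, W_2, b_2)   # output embeddings, T x d_emb
```

Stacking several such blocks, each with its own parameters, and feeding the output of one block into the next yields the contextual embeddings of the last layer described below.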

Fig. 2.3

Multi-head self-attention computes self-attentions for each layer l and head m with different matrices \({ \boldsymbol {W}}^{(q)}_{l,m}\), \({ \boldsymbol {W}}^{(k)}_{l,m}\), and \({ \boldsymbol {W}}^{(v)}_{l,m}\). In this way, different aspects of the association between token pairs, e.g. “mouse” and “cheese”, can be computed. The resulting embeddings are concatenated and transformed by a feedforward network. In addition, residual connections and layer normalization improve training convergence [39]

This procedure is repeated in k layers with different encoder blocks, using the output embeddings of one block as input embeddings of the next block. This setup is shown in Fig. 2.4. The embeddings \(\tilde {{\boldsymbol {x}}}_{k,t}\) of the last encoder block provide the desired contextual embeddings. The structure of an encoder block overcomes the limitations of RNNs (namely their sequential nature) by allowing each token in the input sequence to directly determine associations with every other token in the sequence. BERTBASE has k=12 encoder blocks. It was developed at Google by Devlin et al. [39]. More details on the implementation of self-attention can be found in [38, 41, 126].

Fig. 2.4

Parallel computation of contextual embeddings in each encoder block by BERT. The output embeddings of an encoder block are used as input embeddings of the next encoder block. Finally, masked tokens are predicted by a logistic classifier L using the corresponding contextual embedding of the last encoder block as input

2.1.2 Training BERT by Predicting Masked Tokens

The BERT model has a large number of unknown parameters. These parameters are trained in a two-step procedure.

  • Pre-training enables the model to acquire general knowledge about language in an unsupervised way. The model has the task to fill in missing words in a text. As no manual annotation is required, pre-training can use large text corpora.

  • Fine-tuning adjusts the pre-trained model to a specific task, e.g. sentiment analysis. Here, the model parameters are adapted to solve this task using a smaller labeled training dataset.

The performance on the fine-tuning task is much better than without pre-training because the model can use the knowledge acquired during pre-training through transfer learning.

To pre-train the model parameters, a training task is designed: the masked language model (MLM). Roughly 15% of the input tokens in the training documents are selected for prediction, which is performed by a logistic classifier (Sect. 1.3)

$$\displaystyle \begin{aligned} p(V_t|v_1,\ldots,v_{t-1},v_{t+1}\ldots,v_T)=\operatorname{\mathrm{softmax}}(A\tilde{{\boldsymbol{x}}}_{k,t}+\boldsymbol{b}) {}, \end{aligned} $$
(2.8)

receiving the embedding \(\tilde {{\boldsymbol {x}}}_{k,t}\) of the last layer at position t as input to predict the random variable Vt of possible tokens at position t. This approach avoids cycles where words can indirectly “see themselves”.
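Continuing the NumPy sketch, the logistic classifier of (2.8) maps the last-layer embedding of a masked position to a probability distribution over the vocabulary. The matrix A and the vector b are additional parameters trained together with the rest of the model; the vocabulary size and the tiny embedding dimension are only illustrative.

```python
def predict_token(x_kt, A, b):
    """Logistic classifier (2.8): token probabilities from the last-layer embedding x_kt."""
    logits = A @ x_kt + b                  # one score per vocabulary entry
    p = np.exp(logits - logits.max())
    return p / p.sum()                     # probability of every token at this position

vocab_size = 30000                         # size of BERT's WordPiece vocabulary
A = 0.02 * rng.normal(size=(vocab_size, d_emb))
b = np.zeros(vocab_size)
p_masked = predict_token(X_out[2], A, b)   # distribution for the token at the masked position
```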

The tokens to be predicted have to be changed, as otherwise the prediction would be trivial. Therefore, a token selected for prediction is replaced by:

  • a special [MASK] token for 80% of the time (e.g., “the mouse likes cheese” becomes “the mouse [MASK] cheese”);

  • a random token for 10% of the time (e.g., “the mouse likes cheese” becomes “the mouse absent cheese”);

  • the unchanged label token for 10% of the time (e.g., “the mouse likes cheese” becomes “the mouse likes cheese”).

The second and third variants were introduced because there is a discrepancy between pre-training and the subsequent fine-tuning, where there is no [MASK] token. The authors mitigate this issue by occasionally replacing [MASK] with the original token, or by sampling from the vocabulary. Note that in 1.5% of the cases a random token is inserted. This occasional noise encourages BERT to be less biased towards the masked token (especially when the label token remains unchanged) in its bidirectional context encoding. To predict the masked token, BERT has to concentrate all knowledge about this token in the corresponding output embedding of the last layer, which is the input to the logistic classifier. Therefore, the model is often called an autoencoder, which generates extremely rich output embeddings.
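The selection and corruption of tokens can be sketched as follows. The 15% selection rate and the 80/10/10 split follow the description above; representing the input as a plain list of strings is a simplification for illustration.

```python
import random

def mask_tokens(tokens, vocab, p_select=0.15):
    """Return the corrupted input sequence and the positions selected for prediction."""
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < p_select:            # roughly 15% of the tokens are selected
            targets.append((i, tok))              # the original token is the training label
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"           # 80%: replace by the [MASK] token
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)   # 10%: replace by a random token
            # remaining 10%: keep the unchanged label token
    return corrupted, targets

vocab = ["the", "mouse", "likes", "cheese", "absent"]   # toy vocabulary
print(mask_tokens(["the", "mouse", "likes", "cheese"], vocab))
```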

In addition to predicting the masked tokens, BERT also has to predict whether the next sentence is a randomly chosen sentence or the actual following sentence (next sentence prediction). This requires BERT to consider the relation between two consecutive pieces of text. Again a logistic classifier receiving the embedding of the first [CLS] token is used for this classification. However, this task did not have a major impact on BERT’s performance, as BERT simply learned whether the topics of both sentences are similar [158].

In Fig. 2.4 the task is to predict a high probability of the token “likes” for the input text “The mouse [MASK] cheese”. At the beginning of the training this probability will be very small (≈ 1∕no. of tokens). By backpropagation, the derivative with respect to each unknown parameter can be determined, indicating how the parameters should be changed to increase the probability of “likes”. The unknown parameters of BERT comprise the input embeddings for each token of the vocabulary, the position embeddings for each position, the matrices \({\boldsymbol {W}}^{(q)}_{l,m}\), \({\boldsymbol {W}}^{(k)}_{l,m}\), \({\boldsymbol {W}}^{(v)}_{l,m}\) for each layer l and attention head m (2.5), the parameters of the fully connected layers (2.7), as well as A and b of the logistic classifier (2.8). BERT uses the Adam algorithm [69] for stochastic gradient descent.

The BERTBASE model has a hidden size of demb=768, k=12 encoder blocks each with dhead=12 attention heads, and a total of 110 million parameters. The BERTLARGE model has a hidden size of demb=1024, k=24 encoder blocks each with dhead=16 attention heads, and a total of 340 million parameters [39]. The English Wikipedia and a book corpus with 3.3 billion words were encoded by the WordPiece tokenizer [154] with a vocabulary of 30,000 tokens and used to pre-train BERT. No annotations of the texts by humans were required, so the training is self-supervised. The pre-training took 4 days on 64 TPU chips, which are specialized processors for fast parallel matrix computations. Fine-tuning can be done on a single Graphical Processing Unit (GPU).
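The reported parameter counts can be checked roughly with a back-of-the-envelope calculation. The sketch below uses the 4∗demb² self-attention and 8∗demb² fully connected parameters per encoder block discussed in Sect. 2.1.6 and ignores bias, layer-normalization, and output-layer terms, so the results are approximations rather than exact counts.

```python
def approx_bert_params(d_emb, k_blocks, vocab_size=30000, max_positions=512):
    """Rough parameter count: embeddings plus 12*d_emb^2 weights per encoder block."""
    embeddings = (vocab_size + max_positions) * d_emb    # token and position embeddings
    per_block = 4 * d_emb**2 + 8 * d_emb**2              # self-attention + fully connected layer
    return embeddings + k_blocks * per_block

print(approx_bert_params(768, 12) / 1e6)    # BERT_BASE:  about 108 million (reported: 110 million)
print(approx_bert_params(1024, 24) / 1e6)   # BERT_LARGE: about 333 million (reported: 340 million)
```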

To predict the masked tokens, the model has to learn many types of language understanding features: syntax ([MASK] is a good position for a verb), semantics (e.g. the mouse prefers cheese), pragmatics, coreference, etc. Note that the computations can be processed in parallel for each token of the input sequence, eliminating the sequential dependency in Recurrent Neural Networks. This parallelism enables BERT and related models to leverage the full power of modern SIMD (single instruction multiple data) hardware accelerators like GPUs/TPUs, thereby facilitating training of NLP models on datasets of unprecedented size. Reconstructing missing tokens in a sentence has long been used in psychology. Therefore, predicting masked tokens is also called a cloze task from ‘closure’ in Gestalt theory (a school of psychology).

It turns out that BERT achieves excellent results for the prediction of the masked tokens, and that additional encoder blocks markedly increase the accuracy. For example, BERT is able to predict the original words (or parts of words) with an accuracy of 45.9%, although in many cases several values are valid at the target position [125]. In contrast to conventional language models, the MLM takes into account the tokens before and after the masked target token. Hence, it is called a bidirectional encoder. In addition, self-attention directly provides the relation to distant tokens without recurrent model application. Finally, self-attention is fast, as it can be computed in parallel for all input tokens of an encoder block.

2.1.3 Fine-Tuning BERT to Downstream Tasks

Neural networks were already pre-trained many years ago [16], but the success of pre-training has become more evident in recent years. During pre-training, BERT learns general syntactic and semantic properties of the language. This knowledge can be exploited during subsequent fine-tuning with a modified training task. This approach is also called transfer learning, as the knowledge acquired during pre-training is transferred to a related application. In contrast to other models, BERT requires minimal architecture changes for a wide range of natural language processing tasks. At the time of its publication, BERT improved the Sota on various natural language processing tasks.

Usually, a fine-tuning task requires a classification, solved by applying a logistic classifier L to the output embedding \(\tilde {{\boldsymbol {x}}}_{k,1}\) of the [CLS] token at position 1 of BERT’s last encoder block. There are different types of fine-tuning tasks, as shown in Fig. 2.5.

Fig. 2.5

For fine-tuning, BERT is enhanced with an additional layer containing one or more logistic classifiers L using the embeddings of the last layer as inputs. This setup may be employed for text classification and comparison of texts with the embedding of [CLS] as input of the logistic classifier. For sequence tagging, L predicts a class for each sequence token. For span prediction, two logistic classifiers L1 and L2 predict the start and end of the answer phrase [39]

  • Text classification assigns a sentence to one of two or more classes. Examples are the classification of restaurant reviews as positive/negative or the categorization of sentences as good/bad English. Here the output embedding of the start token [CLS] is used as input to L to generate the final classification.

  • Text pair classification compares two sentences separated by “[SEP]”. Examples include classifying whether the second sentence implies, contradicts, or is neutral with respect to the first sentence, or whether the two sentences are semantically equivalent. Again the output embedding of the start token [CLS] is used as input to L. Sometimes more than one sentence is compared to the root sentence. Then outputs are computed for every sentence pair and jointly normalized to a probability.

  • Word annotation marks each word or token of the input text with a specific property. An example is Named Entity Recognition (NER) annotating the tokens with five name classes (e.g. “person”, “location”, …, “other”). Here the same logistic model L is applied to every token output embedding \(\tilde {{\boldsymbol {x}}}_{k,t}\) at position t and yields a probability vector of the different entity classes.

  • Span prediction tags a short sequence of tokens within a text. An example is question answering. The input to BERT consists of a question followed by “[SEP]” and a context text, which is assumed to contain the answer. Here two different logistic classifiers L and \(\tilde {L}\) are applied to every token output embedding \(\tilde {{\boldsymbol {x}}}_{k,t}\) of the context and generate the probability that the answer to the question starts/ends at the specific position. The valid span (i.e. the end is not before the start) with the highest sum of start/end scores is selected as the answer. An example is the input “[CLS] When did Caesar die ? [SEP] … On the Ides of March, 44 BC, Caesar was assassinated by a group of rebellious senators …”, where the answer to the question is the span “Ides of March, 44 BC”, with the start predicted at “Ides” and the end at “BC”. Span prediction may be applied to a number of similar tasks.

Therefore, BERT just needs an extra layer with one or more logistic classifiers for fine-tuning. During fine-tuning with a downstream application, parameters of the logistic models are learned from scratch and usually all parameters in the pre-trained BERT model are adapted. The parameters for the logistic classifiers of the masked language model and the next sentence prediction are not used during fine-tuning.
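As an illustration of how little has to be changed for fine-tuning, the following sketch adds a classification head to a pre-trained BERT model using the Hugging Face Transformers library (assumed to be installed; the model name, example text, and labels are placeholders). The library attaches a classifier to the [CLS] embedding; during fine-tuning, gradients are computed for all parameters, not only for the new classifier.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)           # logistic classifier on the [CLS] embedding

inputs = tokenizer("The food was great!", return_tensors="pt")
labels = torch.tensor([1])                       # e.g. 1 = positive sentiment
outputs = model(**inputs, labels=labels)         # forward pass returns loss and logits
outputs.loss.backward()                          # gradients for all parameters of the model
```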

2.1.4 Visualizing Attentions and Embeddings

According to Bengio et al. [14], a good representation of language should capture the implicit linguistic rules and common sense knowledge contained in text data, such as lexical meanings, syntactic relations, semantic roles, and the pragmatics of language use. The contextual word embeddings of BERT can be seen as a big step in this direction. They may be used to disambiguate different meanings of the same word.

The self-attention mechanism of BERT computes a large number of “associations” between tokens and merges embeddings according to the strengths of these associations. If x1, …, xT are the embeddings of the input tokens v1, …, vT, the associations \(\boldsymbol {q}^\intercal _r\boldsymbol {k}_t\) are determined between the query \(\boldsymbol {q}_r^\intercal ={\boldsymbol {x}}_r^\intercal {\boldsymbol {W}}^{(q)}\) and the key \(\boldsymbol {k}_t^\intercal = {\boldsymbol {x}}_t^\intercal {\boldsymbol {W}}^{(k)}\) vectors (2.1). Then a sum of value vectors \({\boldsymbol {v}}_t^\intercal ={\boldsymbol {x}}_t^\intercal {\boldsymbol {W}}^{(v)}\) weighted with the normalized associations is formed yielding the new embeddings (2.3).

This is repeated with different matrices \({\boldsymbol {W}}^{(q)}_{l,m},{\boldsymbol {W}}^{(k)}_{l,m},{\boldsymbol {W}}^{(v)}_{l,m}\) in m self-attention heads and l layers. In each layer and head, the new embeddings thus capture different aspects of the relations between the embeddings of the previous layer. For BERTBASE we have l = 12 layers and m = 12 bidirectional self-attention heads in each layer, yielding 144 different “associations” or self-attentions. For the input sentence “The girl and the boy went home. She entered the door.”, Fig. 2.6 shows on the left side the strength of associations for one of the 144 self-attention heads. Between every pair of tokens of the sentence an attention value is calculated, and its strength is symbolized by lines of different widths. We see that the pronoun “she” is strongly associated with “the girl”. In the subsequent calculations (cf. Fig. 2.2) the word “she” is disambiguated by merging its embedding with the embeddings of “the” and “girl”, generating a new contextual embedding of “she”, which includes its relation to “girl”. On the right side of the figure the input “The girl and the boy went home. He entered the door.” is processed. Then the model creates an association of “boy” with “he”.
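Attention patterns such as those visualized in Fig. 2.6 with BERTviz [142] can also be inspected directly by returning the attention weights of a pre-trained model. The sketch below uses the Hugging Face Transformers library; the chosen layer and head are arbitrary, and different heads will show different association patterns.

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "The girl and the boy went home. She entered the door."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

attentions = outputs.attentions            # tuple with one tensor per layer,
                                           # each of shape (batch, heads, tokens, tokens)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer, head = 4, 0                         # inspect one of the 144 heads
weights = attentions[layer][0, head]       # attention of every token (rows) to every token (columns)
she = tokens.index("she")
print(sorted(zip(weights[she].tolist(), tokens), reverse=True)[:3])   # strongest attention targets of "she"
```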

Fig. 2.6

Visualization of a specific self-attention in the fifth layer of a BERT model with BERTviz [142]. If the next sentence contains the pronoun “she” this is associated with “the girl”. If this pronoun is changed to “he” it is related to “the boy”. Image created with BERTviz [142], with kind permission of the author

Figure 2.7 shows a subset of the self-attention patterns for the sentence “[CLS] the cat sat on the mat [SEP] the cat lay on the rug [SEP]”. The self-attention patterns are automatically optimized in such a way that they jointly lead to an optimal prediction of the masked tokens. It can be seen that the special tokens [CLS] and [SEP] often are prominent targets of attentions. They usually function as representatives of the whole sentence [124]. Note, however, that in a multilayer PLM the embeddings generated by different heads are concatenated and transformed by a nonlinear transformation. Therefore, the attention patterns of a single head do not contain the complete information [124]. Since the matrices are randomly initialized, the individual self-attention patterns will be completely different if training is restarted with new random parameter values. However, the overall pattern of attentions between tokens will be similar.

Fig. 2.7

Visualization of some of the 144 self-attention patterns computed for the sentence “[CLS] the cat sat on the mat [SEP] the cat lay on the rug [SEP]” with BERTviz. Image reprinted with kind permission of the author [142]

Figure 2.10 shows on the left side a plot of six different senses of the token embeddings of “bank” in the Senseval-3 dataset, projected to two dimensions by T-SNE [140]. The different senses are identified by different colors and form well-separated clusters of their own. Senses which are difficult to distinguish, like “bank building” and “financial institution”, show a strong overlap [153]. The graphic demonstrates that BERT embeddings are able to distinguish different senses of words which are observed frequently enough.
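A plot of this kind can be reproduced in outline by collecting the contextual embeddings of a word in different sentences and projecting them to two dimensions. The sketch below uses the Hugging Face Transformers library and scikit-learn with a handful of hand-written sentences instead of the Senseval-3 data; it only illustrates the procedure, not the figure itself.

```python
import numpy as np
from sklearn.manifold import TSNE
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentences = ["He deposited the check at the bank.",          # financial institution
             "The bank raised its interest rates.",
             "They had a picnic on the bank of the river.",  # sloping land beside a river
             "Fish were swimming near the river bank."]

embeddings = []
for s in sentences:
    inputs = tokenizer(s, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    hidden = model(**inputs).last_hidden_state[0]            # contextual embeddings of all tokens
    embeddings.append(hidden[tokens.index("bank")].detach().numpy())

# perplexity must be smaller than the number of samples for such a tiny example
proj = TSNE(n_components=2, perplexity=2).fit_transform(np.array(embeddings))
print(proj)    # occurrences with similar senses of "bank" should lie close to each other
```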

There is an ongoing discussion on the inner workings of self-attention. Tay et al. [134] empirically evaluated the importance of the dot product \(\boldsymbol {q}^\intercal _r\boldsymbol {k}_s\) on natural language processing tasks and concluded that query-key interaction is “useful but not that important”. Consequently, they derived alternative formulae, which worked well in some cases and failed in others. A survey of attention approaches is provided by de Santana Correia et al. [37]. There are a number of different attention mechanisms computing the association between embedding vectors [50, 61, 104, 151]. However, most current large-scale models still use the original scaled dot-product attention with minor variations, such as other activation functions and regularizers (cf. Sect. 3.1.4).

The fully connected layers Fcl(x̆t) in (2.7) contain 2/3 of the parameters of BERT, but their role in the network has hardly been discussed. Geva et al. [49] show that fully connected layers operate as key-value memories, where each key is correlated with text patterns in the training samples, and each value induces a distribution over the output vocabulary. For a key, the authors retrieve the training inputs which yield the highest activation of the key. Experts were able to assign one or more interpretations to each key. Usually, the lower fully connected layers were associated with shallow patterns, often sharing the last word. The upper layers are characterized by more semantic patterns that describe similar contexts. The authors demonstrate that the output of a feed-forward layer is a composition of its memories.

2.1.5 Natural Language Understanding by BERT

An outstanding goal of PLMs is Natural Language Understanding (NLU). This cannot be evaluated against a single task, but requires a set of benchmarks covering different areas to assess the ability of machines to understand natural language text and acquire linguistic, common sense, and world knowledge. Therefore, PLMs are fine-tuned to corresponding real-world downstream tasks.

GLUE [146] is a prominent benchmark for NLU. It is a collection of nine NLU tasks with public training data, and an evaluation server using private test data. Its benchmarks cover a number of different aspects, which can be formulated as classification problems:

  • Determine the sentiment (positive/negative) of a sentence (SST-2).

  • Classify a sentence as grammatically acceptable or unacceptable (CoLA).

  • Check if two sentences are similar or are paraphrases (MRPC, STS-B, QQP).

  • Determine if the first sentence entails the second one (MNLI, RTE).

  • Check if sentence B contains the answer to question A (QNLI).

  • Specify the target of a pronoun from a set of alternatives (WNLI).

Each task can be posed as a text classification or text pair classification problem. The performance of a model is summarized in a single average value, which is 87.1 for human annotators [145]. Usually, there is an online leaderboard where the performance of the different models is recorded. A very large repository of leaderboards is on the PapersWithCode website [109]. Table 2.1 describes the tasks by examples and reports the performance of BERTLARGE. BERT was able to lift the Sota of average accuracy from 75.2% to 82.1%. This is a remarkable increase, although the value is still far below the human performance of 87.1, leaving much room for improvement. Recent benchmark results for NLU are described in Sect. 4.1 for the more demanding SuperGLUE and other benchmarks.

Table 2.1 GLUE language understanding tasks. BERTLARGE was trained for three epochs on the fine-tuning datasets [38]. The performance of the resulting models is printed in the last column yielding an average value of 82.1

2.1.5.1 BERT’s Performance on Other Fine-Tuning Tasks

The pre-training data is sufficient to adapt the large number of BERT parameters and learn very detailed peculiarities about language. The amount of training data for pre-training usually is much higher than for fine-tuning. Fine-tuning usually only requires two or three passes through the fine-tuning training data. Therefore, the stochastic gradient optimizer changes most parameters only slightly and sticks relatively close to the optimal pre-training parameters. Consequently, the model is usually capable of preserving its information about general language and combining it with the information about the fine-tuning task.

Because BERT can reuse its general knowledge about language acquired during pre-training, it produces excellent results even with small fine-tuning training data [39].

  • CoNLL 2003 [128] is a benchmark dataset for Named entity recognition (NER), where each token has to be marked with a named entity tag, e.g. PER (for person), LOC (for location), …, O (for no name) (Sect. 5.3). The task involves text annotation, where a label is predicted for every input token. BERT increased Sota from 92.6% to 92.8% F1-value on the test data.

  • SQuAD 1.0 [120] is a collection of 100k triples of questions, contexts, and answers. The task is to mark the span of the answer tokens in the context. An example is the question “When did Augustus die?”, where the answer “14 AD” has to be marked in the context “…the death of Augustus in AD 14 …” (Sect. 6.2). Using span prediction BERT increased the Sota of SQuAD from 91.7% to 93.2%, while the human performance was measured as 91.2%.

From these experiments a large body of evidence has been collected demonstrating the strengths and weaknesses of BERT [124]. This is discussed in Sect. 4.2.

In summary, the advent of the BERT model marks a new era of NLP. It combines two pre-training tasks, i.e., predicting masked tokens and determining whether the second sentence actually follows the first sentence. Transfer learning with unsupervised pre-training and supervised fine-tuning becomes the new standard.

2.1.6 Computational Complexity

It is instructive to illustrate the computational effort required to train PLMs. Its growth determines the time needed to train larger models that can massively improve the quality of language representation. Assume D is the size of the hidden embeddings and T the length of the input sequence. As in Vaswani et al. [141], the intermediate dimension of the fully connected layer Fcl is set to 4D and the dimension of the keys and values is set to D∕H, where H is the number of attention heads. Then according to Lin et al. [81] we get the following computational complexities and parameter counts of self-attention and the position-wise Fcl (2.7):

Module              Complexity   # Parameters
Self-attention      O(T² ∗ D)    4D²
Position-wise Fcl   O(T ∗ D²)    8D²

As long as the input sequence length T is small, the hidden dimension D mainly determines the complexity of self-attention and position-wise Fcl. The main limiting factor is the Fcl. But when the input sequences become longer, the sequence length T gradually dominates the complexity of these modules, so that self-attention becomes the bottleneck of the PLM. Moreover, the computation of self-attention requires that an attention score matrix of size T × T is stored, which prevents the computation for long input sequences. Therefore, modifications reducing the computational effort for long input sequences are required.
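A small calculation illustrates the crossover. Ignoring constant factors, self-attention costs on the order of T² ∗ D operations per layer, while the position-wise Fcl costs on the order of T ∗ D², so self-attention starts to dominate once the sequence length T exceeds roughly the hidden size D; the concrete numbers below are only illustrative.

```python
D = 768                                  # hidden size of BERT_BASE
for T in (128, 512, 4096):
    attention = T * T * D                # self-attention: O(T^2 * D)
    fcl = T * D * D                      # position-wise Fcl: O(T * D^2)
    score_matrix = T * T                 # entries of the T x T attention score matrix to be stored
    print(f"T={T:5d}  attention/Fcl cost ratio = {attention / fcl:5.2f}  "
          f"score matrix entries = {score_matrix:,}")
```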

To connect all input embeddings with each other, we could employ different modules. Fully connected layers require a separate connection between each of the T ∗ T pairs of embeddings. Convolutional layers with a kernel width K do not connect all pairs directly and therefore need O(logK(T)) layers in the case of dilated convolutions. RNNs have to apply a network T times. This leads to the following complexities per layer [81, 141]:

Layer type                 Complexity per layer   Sequential operations   Maximum path length
Self-attention             O(T² ∗ D)              O(1)                    O(1)
Recurrent                  O(T ∗ D²)              O(T)                    O(T)
Fully connected            O(T² ∗ D²)             O(1)                    O(1)
Convolutional              O(K ∗ T ∗ D²)          O(1)                    O(logK(T))
Restricted self-attention  O(R ∗ T ∗ D)           O(1)                    O(T∕R)

The last line describes a restricted self-attention, where self-attention only considers a neighborhood of size R to reduce the computational effort. Obviously, the computational complexity per layer is a limiting factor. In addition, the computations for recurrent layers have to be performed sequentially and cannot be parallelized, as shown in the column of sequential operations. The last column shows the maximum path length, i.e. the number of computations needed to communicate information between far-away positions. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies. Here self-attention has a definite advantage compared to all other layer types. Section 3.2 discusses advanced approaches to process input sequences of larger length. In conclusion, BERT requires less computational effort than alternative layer types.

2.1.7 Summary

BERT is an autoencoder model whose main task is to derive context-sensitive embeddings for tokens. In a preliminary step, tokens are generated from the words and letters of the training data in such a way that the most frequent words are tokens themselves and arbitrary words can be composed of tokens. Each token is encoded by an input embedding. To mark the position of each input token, a position embedding is added to the input embedding.

In each layer of BERT, the lower layer embeddings are transformed by self-attention to a new embedding. Self-attention involves the computation of scalar products between linear transformations of embeddings. In this way, the embeddings in the next layer can adapt to tokens from the context, and the embeddings become context-sensitive. The operation is performed in parallel for several attention heads involving different linear projections. The heads can compute associations in parallel with respect to different semantic features. The resulting partial embeddings are concatenated to a new embedding. In addition to self-attention heads, each encoder block contains a fully connected layer as well as normalization operations.

The original BERT model consists of twelve encoder blocks and generates a final embedding for each input token. BERT is pre-trained on a very large document collection. The main pre-training task is to predict words from the input sequence, which have been replaced by a [MASK] token. This is done by using the last-layer embedding of the token as input to a logistic classifier, which predicts the probabilities of tokens for this position. This forces the model to collect all available information about that token in the output embedding. During pre-training, the model parameters are optimized by stochastic gradient descent. The first input token is the [CLS] token. During pre-training, it is used for next sentence prediction, where a logistic classifier with the [CLS]-embedding as input has to decide if the first and second sentence of the input sequence belong together or not.

Typically, the pre-trained model is fine-tuned for a specific task using a small annotated training dataset. An example is the supervised classification task of whether the input text expresses a positive, negative, or neutral sentiment. Again a logistic classifier with the [CLS]-embedding as input has to determine the probability of the three sentiments. During fine-tuning, all parameters of the model are adjusted slightly. It turns out that this transfer learning approach has a much higher accuracy than supervised training only on the small training dataset, since the model can use knowledge about language acquired during pre-training.

Experiments show that BERT is able to raise the Sota considerably in many language understanding tasks, e.g. the GLUE benchmark. Other applications are named entity recognition, where names of persons, locations, etc. have to be identified in a text, or question answering, where the answer to a question has to be extracted from a paragraph. An analysis of computational complexity shows that BERT requires less computational effort than alternative layer types. Overall, BERT is the workhorse of natural language processing and is used in different variants to solve language understanding problems. Its encoder blocks are reused in many other models.

Chapter 3 describes ways to improve the performance of BERT models, especially by designing new pre-training tasks (Sect. 3.1.1). In Chap. 4 the knowledge acquired by BERT models is discussed. In Chaps. 5 to 7, we describe a number of applications of BERT models such as relation extraction (Sect. 5.4) or document retrieval (Sect. 6.1).