Background

As the number of biomedical articles and resources grows, searching for and extracting valuable information becomes increasingly challenging. Researchers draw on a variety of information sources to transform unstructured text into refined knowledge and improve research efficiency. Manual annotation and feature generation by biomedical experts are inefficient because they involve complex processes [1]. Deep learning (DL) and natural language processing (NLP) are therefore particularly important for biomedical text mining and computational data analysis. Extracting valuable information, such as relationships between objects, first requires identifying meaningful terms in the text. A meaningful term or phrase that can be distinguished from similar objects in a domain is called a named entity (NE) [2]. Named entity recognition (NER) [3] has become a mature technology for mining terms from medical text; it is a fundamental NLP task that aims to recognize named entities, such as persons, locations, and diseases, in raw text and classify them into pre-defined categories [4]. Over the past few decades, NER has attracted a great deal of attention owing to its importance in downstream tasks such as entity linking [5], question answering [6], and relation extraction [7]. In the biomedical field, biomedical named entity recognition (BioNER) likewise serves as a fundamental task in biomedical text mining, aiming to automatically recognize and classify biomedical entities (e.g., genes, proteins, chemicals, and diseases) in biomedical text. Although BioNER is a fundamental upstream task, many difficulties remain, largely because most of the medical literature is poorly structured. Medical texts have several special characteristics: a large volume of disease terminology (such as “adenomatous polyposis coli”), chemical names written as combinations of letters and numbers (such as “CD-832”), many professional abbreviations (such as “SYN”), and biomedical named entities that constantly grow with new discoveries (e.g., COVID-19). These characteristics make it harder to treat BioNER as a sequence labeling problem. Moreover, unlike general-domain NER, BioNER is more challenging due to naming complexity [8], the lack of large-scale labeled training data [9], the need for domain knowledge [8, 10], data privacy [11], and ethical concerns [12]. These factors impose limitations and challenges on solving BioNER. With the development of machine learning, researchers have traditionally combined a variety of NLP tools and domain knowledge to address BioNER through carefully designed feature engineering [13,14,15]. Because feature engineering depends heavily on models and domain-specific knowledge, research on BioNER over the past few decades has gradually shifted from traditional feature-based approaches to recent deep learning-based neural approaches.

In recent years, BioNER methods based on DL and NLP have attracted increasing attention due to their excellent performance: deep learning-based approaches typically do not require manually engineered features and automatically learn useful features from sentences. Furthermore, advances in deep learning techniques for NLP have driven advances in biomedical text mining models. In NLP, a deep learning-based approach transforms text into embeddings and then extracts useful features from these embeddings for biomedical entity recognition, so choosing a suitable feature encoder has always been a crucial step. From 2017 to 2022, research on BioNER can roughly be divided into several categories: methods based on various neural networks [16,17,18], pre-trained models [19, 20], external knowledge [10, 21], and multi-task learning [22,23,24,25,26]. For example, studies that use neural network models to generate high-quality features have become prevalent in solving BioNER tasks [16]. The feature extractors are usually convolutional neural networks [27] (CNN), long short-term memory networks [28] (LSTM), bi-directional LSTMs [29] (BiLSTM), or combinations of these. The conditional random field [30] (CRF), a machine learning method, is often used as a classifier on top of these feature extractors; by considering the correlation between neighboring labels, a CRF can obtain the globally optimal label chain for a given sequence. For instance, BiLSTM-CRF [16] is the most common deep learning architecture for BioNER [31]. Since 2018, large-scale pre-trained language models (PLMs) have proved effective in many NLP tasks, and integrating or fine-tuning PLM embeddings for BioNER has become a new paradigm. Pre-trained models such as BERT [19] show the effectiveness of first pre-training a language model on unlabeled text and then fine-tuning it on downstream NLP tasks [32]. Lee et al. [20] therefore proposed a BERT variant for the biomedical domain, BioBERT [20], which is pre-trained on large raw biomedical corpora and achieves state-of-the-art performance in BioNER [10]. The performance of BioBERT was hard to beat until recent work exploited external knowledge and multi-task learning to further improve BioNER [10, 21,22,23,24,25,26]. The recent state-of-the-art (SoTA) multi-task models on several BioNER datasets were proposed by Tong et al. [26] and Chai et al. [25]. Tong et al. [26] combine BioBERT with multi-task learning by designing three auxiliary classification tasks and one main BioNER task to exploit multi-granularity information in the dataset; the loss functions of the multiple tasks are jointly trained with different fixed weight coefficients, and their multi-task model uses hard parameter sharing. In contrast, Chai et al. [25] select 14 datasets containing four types of entities for training and evaluate the model on a specific task, realizing multi-level information fusion between low-level entity features and higher-level data features. Differently from the above models, Tian et al. [10] are the first to leverage additional syntactic knowledge to enhance BioNER. However, these methods have major disadvantages.
For example, although multi-task learning is an effective way to guide a language model to learn task-specific knowledge [33], the relationships between different BioNER tasks are often difficult to account for comprehensively because of the differences among datasets. In addition, multi-task learning complicates model training: losses from different tasks may conflict, cancel each other out, or even cause negative transfer, which makes it hard to balance the joint training of all tasks. The disadvantages of methods that leverage additional knowledge are also obvious: (1) acquiring external knowledge is labor-intensive (e.g., building a knowledge base) [34, 35] or computationally costly (e.g., dependency parsing); (2) integrating external knowledge hinders end-to-end learning and compromises the generality of DL-based systems [31]. Although some syntactic information is easy to obtain from off-the-shelf NLP toolkits such as spaCy or Stanford CoreNLP [10, 17, 36], the text structure in BioNER corpora is usually complex, and it is difficult to integrate general syntactic structure information across multiple BioNER datasets. Finally, all of the above methods are based on sequence labeling, in which the label of each word is predicted independently from a context-dependent representation, regardless of its neighbors. We believe that ignoring the neighbors around entity words weakens the ability to recognize specialized medical words, and the complexity of biomedical terminology makes sequence labeling difficult. Unlike all the above methods, we adopt a new prediction mode, word-pair relation classification, instead of sequence tagging; the differences and advantages of this approach over sequence tagging are introduced in the Related work and Method sections. We also enhance the pre-trained BioBERT model with the proposed attention mechanism, which requires no additional knowledge, to improve the recognition of complex medical terms. To summarize, this paper makes the following contributions:

  • We are the first to apply word-pair relation classification to BioNER, which avoids the difficulties of sequence labeling.

  • We design an attention mechanism guided by fused prefix and attention map discrimination to enhance BioBERT. The proposed attention can easily be integrated into Transformer-based PLMs [37]: it allows initialization from PLMs without introducing any new parameters and only affects the fine-tuning of the standard model parameters.

  • We evaluate the proposed model on five BioNER datasets to demonstrate its generality.

Related work

Sequence labeling (also known as sequence tagging) takes a string as input and outputs a label for each character or token in the string. Word segmentation, for example, can be completed through sequence tagging by marking whether each character is the beginning, middle, or end of a word. Sequence labeling has long been used to model and solve NLP tasks [38], including BioNER. It is a relatively simple yet fundamental NLP task, since it covers a wide range of token-level classification problems such as word segmentation, part-of-speech tagging, named entity recognition, and relation extraction. In this setting, a sequence labeling model is trained by designing and assigning a label under some tagging scheme to each token in a given sequence. However, sequence labeling has several disadvantages for BioNER. First, designing a general labeling scheme for all BioNER subtasks is difficult and labor-intensive [68]. Prompt learning offers an alternative that operates differently from the previous fine-tuning paradigm for PLMs. In prompt learning, especially for text classification, downstream tasks are reformulated as equivalent cloze-style tasks, and PLMs are asked to handle these cloze-style tasks instead of the original downstream tasks. Compared with conventional fine-tuning, prompt learning reconstructs the input data through a template so that the content to be predicted is embedded in the input, and a masked language model-like [19] (MLM) objective can then be used to learn the label information. There are two types of prompts: discrete prompts and continuous prompts (also known as prefixes). Prompt learning has shown good results on some simple NLP tasks, including text classification and natural language inference. Unfortunately, it may perform poorly compared with fine-tuning on harder sequence tasks such as information extraction and sequence tagging [70], because template-based prompting needs to iterate over all spans, which is computationally expensive [41]. Later, Liu et al. and Li et al. [69, 71, 72] proposed prompt tuning, the idea of tuning only the continuous prompts, and applied a continuous prefix to every layer of the pre-trained model [70]. In other words, prefix tuning prepends a sequence of continuous task-specific vectors to the input [71]. This is a great inspiration for our work.

In this work, instead of treating BioNER as a sequence labeling problem, we formulate it as a word-pair relation classification problem [39]. To the best of our knowledge, there is currently no BioNER research that uses this formulation, and we are the first to explore enhancing PLMs under this new formulation for BioNER. We believe that generating continuous prompts can provide guiding semantic information for word-pair representations on BioNER datasets, because word-pair relation classification can be seen as a dimensionality reduction of sequence labeling. We aim to design a more diverse attention mechanism based on prompt tuning, which makes the representations within the same head as similar as possible while keeping the distributions of different heads as diverse as possible; in this way, the probability of entity words being attended to increases. This kind of attention enriches the diversity of multi-head attention at different layers of PLMs without introducing external knowledge or syntax trees and without modifying the self-attention mechanism. We design this attention as a unified auxiliary task that can be applied to any efficient model (this is left for future work). We therefore propose the prefix and attention map discrimination fusion guided attention (PAMDFGA). As far as we know, no prior BioNER research has explored similar prompt-based attention guidance; our work is the first to use prompts to guide the attention distribution of pre-trained models for BioNER. The following section introduces how PAMDFGA guides our model in detail.

Method

Task definition

Formally, for a sequence labeling task, given a sequence of tokens \(s =\) \(\langle\) \(s_{1}\), \(s_{2}\), ... , \(s_{n}\) \(\rangle\), PLMs output a list of tuples \(\langle\) \({l_s}\), \({l_e}\), t \(\rangle\), where \({l_s}\) \(\in\) [1, n] and \({l_e}\) \(\in\) [1, n] are the start and end indexes of a named entity mention and t is the entity type from a pre-defined category set [31]. However, our model does not use this prediction mode, because it does not fully exploit the entity information in biomedical text; instead, we explore a model that strengthens the attention paid to biomedical entities. Inspired by Li et al. [39], our task is to predict the relationships between biomedical word pairs. Specifically, we design two pointer-like word-pair relations for BioNER, namely Next-Neighboring-Word (NNW) and Tail-Head-Word (THW). The NNW relation addresses entity word identification, indicating whether two argument words are adjacent within an entity, while the THW relation accounts for entity boundary and type detection, indicating whether two argument words are the tail and head boundaries of an entity of a given type. An example is shown in Fig. 1 for better understanding. Our task is to extract the relations \(\mathfrak {R}\) between each word pair (\({x_i}, {x_j}\)), where \(\mathfrak {R}\) is pre-defined and includes None, NNW, and THW-\(\star\) (“\(\star\)” denotes the entity type). As shown in Fig. 1, “CD-832” is a complete chemical entity. This entity contains two NNW relations (CD\(\rightarrow\)-, and -\(\rightarrow\)832) and one THW-C relation (832\(\rightarrow\)CD). If there is no relationship between a word pair, we set it to None. A 2-dimensional grid over word pairs is therefore constructed in Fig. 1. If an entity such as “calcium” consists of a single word, we simply set it to THW-C. To avoid the sparsity of relation instances, NNW and THW-\(\star\) relations are tagged in the upper and lower triangular regions, respectively. Our model predicts the relation between all word pairs and finally decodes the entities. In this way, we can better capture the semantic relationship between adjacent entity words, and with this constructed grid we do not have to design a label for each word [39].

Fig. 1
figure 1

An example to show our relation classification method for BioNER. NNW denotes the Next-Neighboring-Word relation and THW-C denotes the Tail-Head-Word relation that exists in a “Chemical” entity
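
As a concrete illustration of the grid in Fig. 1, the short Python sketch below labels the word pairs of the example sentence with NNW and THW relations; the integer label ids and the helper function are illustrative assumptions rather than our exact implementation.

```python
# A minimal sketch of the word-pair grid labeling described above.
# Label ids and the entity spans are illustrative assumptions.
NONE, NNW = 0, 1          # THW labels start from 2 and encode the entity type

def build_grid(tokens, entities, type2id):
    """entities: list of (start, end, type) with the end index inclusive."""
    n = len(tokens)
    grid = [[NONE] * n for _ in range(n)]
    for start, end, etype in entities:
        # NNW: adjacent words inside the entity, tagged in the upper triangle
        for i in range(start, end):
            grid[i][i + 1] = NNW
        # THW-*: tail -> head, tagged in the lower triangle (this also covers
        # single-word entities such as "calcium", where tail == head)
        grid[end][start] = 2 + type2id[etype]
    return grid

tokens = ["Effects", "of", "a", "new", "calcium", "antagonist", "CD", "-", "832"]
entities = [(4, 4, "Chemical"), (6, 8, "Chemical")]   # "calcium", "CD-832"
print(build_grid(tokens, entities, {"Chemical": 0}))
```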

Model

In this section, we present the overall architecture of our method, illustrated in Fig. 2. It consists of three main components. First, the enhanced BioBERT (E-BioBERT) and a widely used bi-directional LSTM [29] serve as the encoder, yielding contextualized word representations from the input sentences. Then a simple convolution layer builds and refines the representation of the word-pair grid for the subsequent word-word relation classification. Finally, a multi-layer perceptron is used to infer the relations between all word pairs.

Fig. 2
figure 2

Our model

Fig. 3
figure 3

Prefix construction. We use BioBERT as the PLM

Encoder layer

Answer engineering has a strong impact on the performance of prompt learning. For entity class prediction in BioNER, adding extra label-specific parameters to represent different entity types hinders the applicability of prompt learning [41, 72]. As shown in Fig. 3, we use prefix tuning to tune the attention weights of BioBERT. This approach eliminates the need for a verbalizer and yields a fully generative model that outputs a token-level class at each token position. The prompts at different layers are added as prefix tokens to the input sequence and are independent of the other layers (rather than being computed by the previous Transformer layers). Inspired by Chen et al. and Li et al. [41, 71], we add a set of trainable embedding matrices \(\{\phi _{1},\phi _{2},\ldots ,\phi _{l}\}\) to each layer of BioBERT, where l is the number of BioBERT layers and \(\phi _{\theta } \in {\mathbb {R}}^{P \times d }\) (P is the prompt length and d is the hidden dimension of the encoder). The prefix of each layer participates in the calculation of self-attention. That is, unlike methods that place templates in the original input sequence, we incorporate continuous prompts into the self-attention layer and use these prefixes to guide attention allocation, which is flexible and lightweight. Specifically, we inherit the structure of the Transformer and introduce a prefix-guided attention layer over the original queries, keys, and values (\({\textbf {Q}}\), \({\textbf {K}}\), and \({\textbf {V}}\)) to achieve a more guided attention effect. The Transformer uses stacked self-attention to encode contextual information for the input tokens [48]. The calculation of self-attention depends on Q, K, and V, which are projected from the hidden vectors of the previous layer. The attention output A of one head is then computed as follows:

$$\begin{aligned} {\textbf {A}} = softmax\left( \frac{{\textbf {Q}}{\textbf {K}}^{T}}{\sqrt{d}}\right) {\textbf {V}} \end{aligned}$$
(1)

where d is the dimension of the keys. In the standard self-attention layer, a global attention mechanism is employed in which each token provides information to every other token in the input sentence. A key feature of the Transformer architecture is the multi-head attention mechanism, which allows the model to focus on different parts of the input simultaneously; the Transformer relies on multi-head self-attention to capture dependencies between tokens. Given a hidden state H (the input of the initialized BioBERT), multi-head self-attention first projects it linearly into queries \({{\textbf {Q}}_{h}}\), keys \({{\textbf {K}}_{h}}\), and values \({{\textbf {V}}_{h}}\) using parameter matrices \({\textbf {W}}_{h}^{Q}\), \({\textbf {W}}_{h}^{K}\), and \({\textbf {W}}_{h}^{V}\), respectively. The formulation is as follows:

$$\begin{aligned} {\textbf {Q}}_{h}, {\textbf {K}}_{h}, {\textbf {V}}_{h} = {\textbf {HW}}_{h}^{Q}, {\textbf {HW}}_{h}^{K}, {\textbf {HW}}_{h}^{V} \end{aligned}$$
(2)

Then, we introduce the prefix into the attention mechanism and redefine the self-attention \({\textbf {A}}_{h}\) as follows:

$$\begin{aligned} {\textbf {A}}_{h} = softmax\left( \frac{{\textbf {Q}}_{h}[{\textbf {K}}_{h};\phi _{k}^{h}]^{T}}{\sqrt{d}}\right) [{\textbf {V}}_{h};\phi _{v}^{h}] \end{aligned}$$
(3)

where the self-attention distribution (attention weight) \({\textbf {A}}_{h}\) is computed via the scaled dot-product of \({\textbf {Q}}_{h}\) and the prefixed keys in Eq. 3. These weights are assigned to the corresponding value vectors \({\textbf {V}}_{h}\) to obtain the output states \({\textbf {O}}_{h}\):

$$\begin{aligned} {\textbf {O}}_{h} = {\textbf {A}}_{h}{} {\textbf {V}}_{h} \end{aligned}$$
(4)

Finally, the output states \({\textbf {O}}_{h}\) of all heads are concatenated to produce the final states. To allow the different attention heads to interact with each other, the Transformer applies a non-linear feed-forward network over the multi-head attention output at each layer. However, even with prefix-guided attention, we still find redundant attention patterns and insufficient attention to entities for the BioNER task. To address this shortcoming, we propose PAMDFGA. Inspired by the instance discrimination learning proposed by Wu et al. [73], and taking BioBERT's twelve layers with twelve heads per layer as an example, we treat each head in BioBERT as an instance and contrast different heads across different layers to maximize the differences between them. This lets our model gather information from the input text from different aspects and perspectives. We want to learn a good feature representation for each instance (head), which requires the semantic information learned by different heads to be as different as possible. Instance discrimination learning can implicitly group similar instances together in the representation space without any explicit learning force directing it to do so [74]. Our attention discrimination design is shown in Fig. 4. The construction of PAMDFGA proceeds as follows: we first obtain the attention weights from the different heads and layers of the prefix-guided BioBERT; the proposed mechanism operates on the whole attention map. This is expressed as follows:

$$\begin{aligned} \{{\textbf {A}}_{1}, {\textbf {A}}_{2}\ldots {\textbf {A}}_{i}, {\textbf {A}}_{i+1}\ldots {\textbf {A}}_{l*h}\}= BioBERT(x_{i} | \theta _{BioBERT}) \end{aligned}$$
(5)
Fig. 4
figure 4

Our proposed attention. The losses calculated from p and o are fused

where \(\{{\textbf {A}}_{1}, {\textbf {A}}_{2}\ldots {\textbf {A}}_{i}, {\textbf {A}}_{i+1}\ldots {\textbf {A}}_{l*h}\}\) is the multi-head attention map of BioBERT; l and h denote the number of layers and the number of heads per layer, respectively. Each attention map \({\textbf {A}}_{i}\) \(\in\) \({\mathbb {R}}^{L \times (L+P)}\), where L is the maximum sentence length in each batch and P is the length of the randomly initialized prefix. \(x_{i}\) represents the input tokens and \(\theta _{BioBERT}\) denotes the trainable parameters of BioBERT, which are fine-tuned during training. We then stack the attention maps of the twelve layers and perform an average pooling operation on each \(\textbf{A}_{i}\), summing the attention values received by the original (o) input tokens and by the original input tokens together with the prefix (\({\textbf {p}}\)). The attention map \({\textbf {A}}_{i}\) is transformed into the attention vectors \({\textbf {o}}_{i}\) and \({\textbf {p}}_{i}\) as follows:

$$\begin{aligned} \begin{aligned} {\textbf {o}}_{i}&= \sum _{j}^{L}{{\textbf {A}}_{i,j}} \\ {\textbf {p}}_{i}&= \sum _{j}^{L+P}{{\textbf {A}}_{i,j}} \end{aligned} \end{aligned}$$
(6)

where i indexes the i-th attention map and j is the column index of \({\textbf {A}}_{i}\); \({\textbf {o}}_{i}\) \(\in\) \({\mathbb {R}}^{L}\) and \({\textbf {p}}_{i}\) \(\in\) \({\mathbb {R}}^{L+P}\). We then rebuild the entire attention map as follows:

$$\begin{aligned} \begin{aligned} {\textbf {O}}&= {\textbf {o}}_{1} \oplus {\textbf {o}}_{2} \oplus \,\cdots , \oplus \ {\textbf {o}}_{i} ,\ldots , \oplus \ {\textbf {o}}_{l*h} \\ {\textbf {P}}&= {\textbf {p}}_{1} \oplus {\textbf {p}}_{2} \oplus \,\cdots , \oplus \ {\textbf {p}}_{i} ,\ldots , \oplus \ {\textbf {p}}_{l*h} \end{aligned} \end{aligned}$$
(7)

where \(\oplus\) denotes the concatenation operation, and \(\textbf{O}\) \(\in\) \({\mathbb {R}}^{(l*h) \times L}\) and \({\textbf {P}}\) \(\in\) \({\mathbb {R}}^{(l*h) \times (L+P)}\) are the resulting attention matrices. Finally, we push the diversity of the attention maps via the idea of instance discrimination [73]: we treat each attention head as a distinct class of its own and train the model so that the final class assignment of each head differs, which means that each head captures different information. The probability of an attention map \({\textbf {o}}\) or \({\textbf {p}}\) being assigned to the i-th class is computed as follows:

$$\begin{aligned} \begin{aligned} {\textbf {O}}(i|{\textbf {o}})&=\frac{{exp}({\textbf {o}}_{i}^{T}{\textbf {o}}/\tau ) }{\sum _{j=1}^{{l*h}}{exp}({\textbf {o}}_{j}^{T}{\textbf {o}}/\tau )} \\ {\textbf {P}}(i|{\textbf {p}})&=\frac{{exp}({\textbf {p}}_{i}^{T}{\textbf {p}}/\tau ) }{\sum _{j=1}^{{l*h}}{exp}({\textbf {p}}_{j}^{T}{\textbf {p}}/\tau )} \end{aligned} \end{aligned}$$
(8)

where \({\textbf {o}}_{j}^{T}{\textbf {o}}\) measures how well \({\textbf {o}}\) matches the j-th class, because \({\textbf {o}}_{j}\) is regarded as the weight of the j-th class. \(\tau\) is a temperature parameter that controls the concentration of the distribution [75]; it is necessary for tuning the concentration of \({\textbf {o}}\) on the unit sphere, and we enforce \(||{\textbf {p}}|| = 1\) and \(||{\textbf {o}}|| = 1\) via an L2-normalization layer [73]. The objective of the auxiliary task is to maximize the joint probabilities \(\prod _{i=1}^{l*h}P_{\theta }(i|f_{\theta }({\textbf {p}}_{i}))\) and \(\prod _{i=1}^{l*h}P_{\theta }(i|f_{\theta }({\textbf {o}}_{i}))\), or equivalently to minimize the negative log-likelihood over the training set [51], as

$$\begin{aligned} \begin{aligned} Loss_{p}&= -\sum _{i=1}^{l*h}log P(i|f_{\theta }{({\textbf {p}}_{i}})) \\&=-\sum _{i=1}^{l*h}log\left( \frac{{exp}({\textbf {p}}_{i}^{T}{\textbf {p}}/\tau ) }{\sum _{j=1}^{l*h}exp({\textbf {p}}_{j}^{T}{\textbf {p}}/\tau )}\right) \\ Loss_{o}&= -\sum _{i=1}^{l*h}log P(i|f_{\theta }{({\textbf {o}}_{i}})) \\&=-\sum _{i=1}^{l*h}log\left( \frac{exp({\textbf {o}}_{i}^{T}{\textbf {o}}/\tau ) }{\sum _{j=1}^{l*h}exp({\textbf {o}}_{j}^{T}{\textbf {o}}/\tau )}\right) \\ \end{aligned} \end{aligned}$$
(9)

As such, the training objective of our PAMDFGA is revised as:

$$Loss_{{PAMDFGA}} = (Loss_{p} + Loss_{o} )/2$$
(10)

where \(Loss_{PAMDFGA}\) fuses the information of the prefix and the original attention weights. We use \(Loss_{PAMDFGA}\) as an auxiliary loss for the main task loss \(Loss_{BioNER}\).
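
To make the two components above concrete, the following PyTorch sketch shows one way to implement the prefix-guided attention of Eq. 3 and the fused discrimination loss of Eqs. 6-10; the tensor shapes, function names, and batch-first layout are assumptions for illustration, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def prefix_guided_attention(Q, K, V, prefix_k, prefix_v):
    """Q, K, V: (batch, heads, L, d_head); prefix_k, prefix_v: (heads, P, d_head).
    Sketch of Eq. 3: the prefix is concatenated to the keys and values."""
    B = Q.size(0)
    K = torch.cat([K, prefix_k.unsqueeze(0).expand(B, -1, -1, -1)], dim=2)
    V = torch.cat([V, prefix_v.unsqueeze(0).expand(B, -1, -1, -1)], dim=2)
    A = (Q @ K.transpose(-1, -2) / Q.size(-1) ** 0.5).softmax(dim=-1)
    return A @ V, A          # output states, attention map of shape (B, heads, L, L+P)

def pamdfga_loss(attn_maps, L, tau=2.0):
    """attn_maps: (batch, l*h, L, L+P), attention maps stacked over layers and heads.
    Pools each map into the attention received by the original tokens (o) and by the
    original tokens plus the prefix (p), then applies instance discrimination."""
    def discriminate(x):                  # x: (batch, n_maps, dim)
        x = F.normalize(x, dim=-1)        # enforce unit norm as described above
        n = x.size(1)
        sim = torch.matmul(x, x.transpose(-1, -2)) / tau   # head-vs-head similarity
        target = torch.arange(n, device=x.device).repeat(x.size(0))
        return F.cross_entropy(sim.reshape(-1, n), target)  # -log P(i | map_i)
    o = attn_maps[..., :L].sum(dim=-2)    # (batch, l*h, L)
    p = attn_maps.sum(dim=-2)             # (batch, l*h, L+P)
    return (discriminate(p) + discriminate(o)) / 2          # Eq. 10
```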

Combined with our attention guidance mechanism, as shown at the beginning of Fig. 3, we prepend a special token, ’[CLS]’, to each input sample of BioBERT [76]. After being combined with the position embeddings and segmentation embeddings, the token embeddings are fed into the E-BioBERT model to obtain the output representation \({\varvec{{e(x}}}_{i})\) \(\in\) \({\mathbb {R}}^{d_{x}}\). Formally, given the input tokens, the encoder calculates:

$$\begin{aligned} \begin{aligned}{}[{\varvec{{e(x}}}_{0}),{\varvec{{e(x}}}_{1}),\ldots ,{\varvec{{e(x}}}_{n})] = E\text{-}BioBERT ([x_{0},x_{1},\ldots ,x_{n}];\theta _{E\text{-}BioBERT}) \end{aligned} \end{aligned}$$
(11)

where \(\theta _{E-BioBERT}\) denotes the trainable parameters of the E-BioBERT model, which are fine-tuned during training, \({x_0}\) is the special token ’[CLS]’, and \(d_{x}\) = 768 is the dimensionality of the local representation. We then use a bi-directional LSTM [29] to produce contextual word representations from these embeddings. The contextualized sentence-level representations \([{\varvec{{e(x}}}_{0}),{\varvec{{e(x}}}_{1}),\ldots ,{\varvec{{e(x}}}_{n})]\) are used as the input of the bi-directional LSTM layer, denoted as

$$\begin{aligned} \begin{aligned}{}[{\varvec{{h}}}_{0},{\varvec{{h}}}_{1},\ldots ,{\varvec{{h}}}_{n}] = BiLSTM([{\varvec{{e(x}}}_{0}),{\varvec{{e(x}}}_{1}),\ldots ,{\varvec{{e(x}}}_{n})];\theta _{BiLSTM}) \end{aligned} \end{aligned}$$
(12)

where \(\theta _{BiLSTM}\) denotes the trainable parameters of the BiLSTM, \({\varvec{{h}}}_{i}\) \(\in\) \({\mathbb {R}}^{d_{h}}\), and \({d_{h}}\) is the dimension of a word representation. \([{\varvec{{h}}}_{0},{\varvec{{h}}}_{1},\ldots ,{\varvec{{h}}}_{n}]\) is the hidden state sequence of the BiLSTM [76].
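
A minimal sketch of this encoder pipeline (Eqs. 11-12) is given below, assuming the HuggingFace transformers implementation of BioBERT and a per-direction LSTM size of 256 so that the concatenated output matches \(d_{h}=512\); the real E-BioBERT additionally carries the prefix-guided attention described above.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name for BioBERT v1.1; the prefix-guided layers are omitted here.
name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(name)
biobert = AutoModel.from_pretrained(name)
bilstm = nn.LSTM(input_size=768, hidden_size=256,
                 bidirectional=True, batch_first=True)

inputs = tokenizer("Effects of a new calcium antagonist CD-832",
                   return_tensors="pt")
e_x = biobert(**inputs).last_hidden_state    # [e(x_0), ..., e(x_n)], Eq. 11
h, _ = bilstm(e_x)                           # [h_0, ..., h_n] of size d_h, Eq. 12
```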

Convolution layer

The second part of the model is the convolutional layer. CNNs are naturally suited to 2-dimensional convolution on the grid and have also shown strong performance on relation determination tasks [39, 77], so we use a convolution module to capture grid information. Our convolution layer consists of three modules: a conditional layer normalization [39, 78] (CLN) that generates the representation of the word-pair grid, a hybrid sentence grid representation that enriches this representation, and a single-layer dilated convolution that captures the interactions between close and distant words. Specifically, we follow prior work [39, 78] and use conditional layer normalization to generate the representation of the word-pair grid. We then combine the enhanced word-pair representations from the CLN with randomly initialized distance and region embeddings to augment the sentence representation. In the third part of the convolutional layer, a simple single-layer dilated convolutional neural network captures the interaction information between different word pairs. The modules of the convolutional layer are detailed below.

Conditional layer normalization

The idea of conditional layer normalization comes from conditional batch normalization (CBN), popularized by conditional generative adversarial networks (GANs) in the image field: a conditional vector is introduced as external contextual information to generate the gain and bias parameters of the well-known layer normalization [79] (LN) mechanism. In our BioNER framework, we need to predict the final relations between word pairs by generating a grid representation, which can be regarded as a 3-dimensional matrix \({\textbf {W}}\) \(\in\) \({\mathbb {R}}^{N \times N \times d_{h}}\), where \({\textbf {W}}_{ij}\) denotes the representation of the word pair \(({x}_{i},{x}_{j})\) and N is the number of tokens in each batch. Because both NNW and THW relations are directional, the representation \({\textbf {W}}_{ij}\) of the word pair \(({x}_{i},{x}_{j})\) can be considered a combination of the representation \({\varvec{{h}}}_{i}\) of \({x}_{i}\) and \({\varvec{{h}}}_{j}\) of \({x}_{j}\), where the combination should imply that \({x}_{j}\) is conditioned on \({x}_{i}\). We adopt the CLN to calculate \({\textbf {W}}_{ij}\):

$$\begin{aligned} \begin{aligned} {\textbf {W}}_{ij}&= CLN({\varvec{{h}}}_{i},{\varvec{{h}}}_{j}) \\&= \gamma _{ij}\odot \left( \frac{{\varvec{{h}}}_{j}-\mu }{\sigma }\right) + \lambda _{ij} \end{aligned} \end{aligned}$$
(13)

where \({\varvec{{h}}}_{i}\) is the condition to generate the gain parameter \(\gamma _{ij}={\textbf {W}}_{\alpha }{\varvec{{h}}}_{i}+{\textbf {b}}_{\alpha }\) and bias \(\lambda _{ij}={\textbf {W}}_{\beta }{\varvec{{h}}}_{i}+{\textbf {b}}_{\beta }\) of layer normalization. \(\mu\) and \(\sigma\) are the mean and standard deviation across the elements of \({\varvec{{h}}}_{j}\), denoted as:

$$\begin{aligned} \begin{aligned} \mu&= \frac{1}{d_{h}}\sum _{k=1}^{d_{h}}h_{jk} \\ \sigma&= \sqrt{\frac{1}{d_{h}}\sum _{k=1}^{d_{h}}(h_{jk}-\mu )^{2}} \end{aligned} \end{aligned}$$
(14)

where \(h_{jk}\) denotes the k-th dimension of \({\varvec{{h}}}_{j}\) [39].
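
The following PyTorch sketch illustrates Eqs. 13-14: \({\varvec{{h}}}_{i}\) acts as the condition that generates the gain and bias used to normalize \({\varvec{{h}}}_{j}\). The module and parameter names are assumptions for illustration.

```python
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    def __init__(self, d_h, eps=1e-6):
        super().__init__()
        self.gain = nn.Linear(d_h, d_h)   # gamma_ij = W_alpha h_i + b_alpha
        self.bias = nn.Linear(d_h, d_h)   # lambda_ij = W_beta  h_i + b_beta
        self.eps = eps

    def forward(self, h):                 # h: (batch, N, d_h) from the BiLSTM
        cond = h.unsqueeze(2)             # h_i as the condition, (B, N, 1, d_h)
        target = h.unsqueeze(1)           # h_j to be normalized, (B, 1, N, d_h)
        mu = target.mean(dim=-1, keepdim=True)
        sigma = target.std(dim=-1, keepdim=True, unbiased=False)
        normed = (target - mu) / (sigma + self.eps)
        return self.gain(cond) * normed + self.bias(cond)   # W_ij: (B, N, N, d_h)
```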

Hybrid sentence representation

Building a grid representation of word pairs is a key step in our word-pair classification. To further enhance the sentence representations from E-BioBERT and conditional layer normalization, distance embeddings (\({\textbf {D}}\)) and region embeddings (\({\textbf {R}}\)) are leveraged to better represent the positional information of word pairs in the grid. After obtaining the 3-dimensional tensor \({\textbf {W}}\) \(\in\) \({\mathbb {R}}^{N \times N \times d_{h}}\) encoded by E-BioBERT and the BiLSTM encoder, we concatenate the word-pair embedding (\({\textbf {W}}\)), distance embedding (\({\textbf {D}}\)), and region embedding (\({\textbf {R}}\)), where \({\textbf {D}}\) \(\in\) \({\mathbb {R}}^{N \times N \times d_{d}}\) and \({\textbf {R}}\) \(\in\) \({\mathbb {R}}^{N \times N \times d_{r}}\) are also 3-dimensional tensors. The concatenation of these three tensors enhances the region and distance information of the hybrid sentence grid representation \({\textbf {G}}\) \(\in\) \({\mathbb {R}}^{N \times N \times d_{g}}\). The overall process can be formulated as:

$$\begin{aligned} {\textbf {G}} = MLP([{\textbf {W}}\otimes {\textbf {R}}\otimes {\textbf {D}}]) \end{aligned}$$
(15)

where MLP is a multi-layer perceptron that reduces the dimensionality and \(\otimes\) represents the concatenation operation.
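
A sketch of the hybrid grid representation of Eq. 15 is given below; the embedding sizes follow the Settings section, while the distance bucketing and the region encoding (upper triangle, diagonal, lower triangle) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GridRepresentation(nn.Module):
    def __init__(self, d_h=512, d_d=20, d_r=20, d_g=128):
        super().__init__()
        self.dist_emb = nn.Embedding(20, d_d)      # D: bucketed relative distance
        self.region_emb = nn.Embedding(3, d_r)     # R: lower / diagonal / upper
        self.mlp = nn.Linear(d_h + d_r + d_d, d_g) # dimension reduction, Eq. 15

    def forward(self, W):                          # W: (B, N, N, d_h) from the CLN
        B, N = W.size(0), W.size(1)
        idx = torch.arange(N, device=W.device)
        dist = (idx[None, :] - idx[:, None]).clamp(-9, 9) + 10   # values in [1, 19]
        region = torch.sign(idx[None, :] - idx[:, None]) + 1     # values in {0, 1, 2}
        D = self.dist_emb(dist).unsqueeze(0).expand(B, -1, -1, -1)
        R = self.region_emb(region).unsqueeze(0).expand(B, -1, -1, -1)
        return self.mlp(torch.cat([W, R, D], dim=-1))            # G: (B, N, N, d_g)
```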

Convolutional neural network

Convolutional neural networks are generally used in computer vision for tasks such as image classification and detection. The core idea of a CNN is to capture local features; for text, local features are sliding windows over several words, similar to N-grams. The advantage of a CNN is that it can automatically combine and filter N-gram features to obtain semantic information at different levels of abstraction, which enriches the semantic information of \({\textbf {G}}\). We use a single-layer dilated convolutional neural network (SDConv) to capture the interactions between word pairs, denoted as:

$$\begin{aligned} {\textbf {S}} = \sigma (SDConv({\textbf {G}})) \end{aligned}$$
(16)

where S \(\in\) \({\mathbb {R}}^{N \times N \times d_{g}}\) and \(\sigma\) is the GELU activation function [80].
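
The single-layer dilated convolution of Eq. 16 can be sketched as follows; the kernel size and dilation value are assumptions, and the channel dimension is simply permuted to the CNN convention and back.

```python
import torch.nn as nn
import torch.nn.functional as F

class SDConv(nn.Module):
    def __init__(self, d_g=128, dilation=2):
        super().__init__()
        # padding = dilation keeps the N x N spatial size for a 3 x 3 kernel
        self.conv = nn.Conv2d(d_g, d_g, kernel_size=3,
                              dilation=dilation, padding=dilation)

    def forward(self, G):                 # G: (B, N, N, d_g)
        x = G.permute(0, 3, 1, 2)         # (B, d_g, N, N), channel-first for Conv2d
        x = F.gelu(self.conv(x))          # sigma in Eq. 16 is the GELU activation
        return x.permute(0, 2, 3, 1)      # S: (B, N, N, d_g)
```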

Classifier

Our model predicts the relationships of word pairs, that is, the probability that a directed edge in the graph belongs to each category. The tensor \({\textbf {S}}\) from SDConv represents the grid information of word pairs; we use an MLP to calculate the relation scores of each word pair (\(x_{i}\), \(x_{j}\)) from \({\textbf {S}}_{ij}\) and apply the softmax function to obtain the final relation probabilities,

$$\begin{aligned} {\textbf {y}}_{ij} = softmax (MLP({\textbf {S}}_{ij})) \end{aligned}$$
(17)

where \(MLP({\textbf {S}}_{ij})\) \(\in\) \({\mathbb {R}}^{|\mathfrak {R}|}\) contains the scores of the relations pre-defined in \(\mathfrak {R}\). Finally, for \(Loss_{BioNER}\), the BioNER training objective is to minimize the negative log-likelihood loss with respect to the corresponding gold labels, formalized as:

$$\begin{aligned} \begin{aligned} Loss_{BioNER} = -\frac{1}{N^2}\sum _{i=1}^{N}\sum _{j=1}^N\sum _{r=1}^{|\mathfrak {R}|}\hat{{\textbf {y}}}^{r}_{ij}log{\textbf {y}}^{r}_{ij} \end{aligned} \end{aligned}$$
(18)

where N is the number of words in the sentence, \(\hat{{\textbf {y}}}_{ij}\) is the binary vector denoting the gold relation labels for the word pair (\(x_{i}\), \(x_{j}\)), \({\textbf {y}}_{ij}\) is the predicted probability vector, and r indexes the r-th relation of the pre-defined relation set \(\mathfrak {R}\). The total training objective is to minimize the sum of the BioNER loss and the PAMDFGA loss, formalized as:

$$Loss_{{Total}} = Loss_{{BioNER}} + \alpha Loss_{{PAMDFGA}}$$
(19)

where \(Loss_{PAMDFGA}\) is defined in Eq. 10 and \(\alpha Loss_{PAMDFGA}\) can be seen as a regularization term whose strength is controlled by \(\alpha\); like an L2 term, it does not introduce any new parameters and only influences the fine-tuning of the standard model parameters [51].
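
The following sketch combines the word-pair classifier of Eq. 17 with the joint objective of Eqs. 18-19; the MLP depth, the number of relation classes, and the value of \(\alpha\) are assumptions for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class PairClassifier(nn.Module):
    def __init__(self, d_g=128, n_relations=3):      # e.g. None, NNW, THW-*
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_g, d_g), nn.GELU(),
                                 nn.Linear(d_g, n_relations))

    def forward(self, S):                             # S: (B, N, N, d_g) from SDConv
        return self.mlp(S)                            # relation logits, Eq. 17

def total_loss(logits, gold, loss_pamdfga, alpha=0.01):
    """logits: (B, N, N, |R|); gold: (B, N, N) gold relation ids.
    Cross-entropy over all word pairs realizes Eq. 18; Eq. 19 adds the
    PAMDFGA auxiliary loss weighted by alpha."""
    loss_bioner = F.cross_entropy(logits.flatten(0, 2), gold.flatten())
    return loss_bioner + alpha * loss_pamdfga
```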

Decoding

The five BioNER datasets used in our framework all contain flat (non-nested) entities. Given the word-pair relation scores predicted by the framework, we decode the predictions as a directed graph. The decoding objective is to find paths from one word to another in the graph using NNW relations, while THW relations determine the boundaries and types of entities; in particular, for sentences without entities no THW relation is predicted, so no category needs to be judged. Specifically, the relations \(\mathfrak {R}\) of all word pairs serve as the input, and the decoding objective is to find all entity word-index sequences together with their corresponding categories. First, since our datasets contain no nested examples, entities can be decoded directly from THW-\(\star\) in the lower-triangle part of Fig. 1. For entities consisting of multiple consecutive words, we construct a graph in which nodes are words and edges are NNW relations, and then use a depth-first search algorithm to find all paths from the head word to the tail word, which are the word-index sequences of the corresponding entities [39].
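
A minimal sketch of this decoding procedure is shown below, using the “CD-832” example from Fig. 1; the helper names and the input format are illustrative assumptions.

```python
def decode(nnw_edges, thw_relations, n_words):
    """nnw_edges: set of (i, j) word-index pairs that hold the NNW relation;
    thw_relations: list of (tail, head, type) tuples from THW-* predictions."""
    adj = {i: [] for i in range(n_words)}
    for i, j in nnw_edges:
        adj[i].append(j)

    def dfs(node, tail, path):            # enumerate all head-to-tail paths
        if node == tail:
            yield tuple(path)
        for nxt in adj[node]:
            if nxt <= tail:
                yield from dfs(nxt, tail, path + [nxt])

    entities = []
    for tail, head, etype in thw_relations:
        if head == tail:                  # single-word entity such as "calcium"
            entities.append(((head,), etype))
        else:
            for path in dfs(head, tail, [head]):
                entities.append((path, etype))
    return entities

# "CD - 832": NNW edges (6,7) and (7,8); THW-Chemical from tail 8 back to head 6
print(decode({(6, 7), (7, 8)}, [(8, 6, "Chemical")], 9))
```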

Results

Datasets and metrics

We evaluate our model on five publicly available datasets containing various biomedical entities: BC4CHEMD [81], BC5CDR [82] (including two sub-datasets, BC5CDR-Disease and BC5CDR-Chem), NCBI-Disease [83], and BC2GM [84], all of which are pre-processed and provided by previous SoTA work. Table 1 summarizes these datasets. Among them, BC4CHEMD has the most sentences and entities, and NCBI-Disease is the smallest. Following previous work [18, 22, 23, 25, 26], we merged the training and development sets, used the same data split, and evaluated our model on the test set for a fair comparison. We follow prior SoTA work [25, 26] and adopt the standard entity-level F1-score as the evaluation metric. Specifically, a predicted entity is counted as a true positive only if both its token sequence and its type match those of a gold entity. The corresponding metrics are Precision (\(\mathrm P\)), Recall (\(\mathrm R\)), and F1-score (\(\mathrm F1\)), where \(\mathrm F1 = 2 \times \mathrm P \times \mathrm R/(\mathrm P + \mathrm R)\).
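
The entity-level evaluation can be sketched as follows: a prediction counts as a true positive only when both its span and its type match a gold entity. The entity tuples used here are illustrative.

```python
def entity_f1(pred, gold):
    """pred, gold: sets of (start, end, type) tuples for one evaluation split."""
    tp = len(pred & gold)                       # exact span-and-type matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

print(entity_f1({(4, 4, "Chem"), (6, 8, "Chem")}, {(6, 8, "Chem")}))  # ~0.67
```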

Table 1 Datasets description

Settings

The BioBERTv1.1 (+PubMed, cased) [20] model was used, containing 12 Transformer layers with a hidden size of 768. The dimensionality of the BiLSTM hidden state \(d_{h}\) is 512, the channel size of the convolutional layer \(d_{g}\) is set to 128, and the sizes of the distance and region embeddings are initialized to 20. All datasets are trained with a batch size of 8 except BC4CHEMD, which uses a batch size of 4. We use the AdamW optimizer [85] with a learning rate of 1e-3 for all datasets. Within each batch, sentences are padded to the length of the longest sample. A linear learning-rate decay schedule with a warm-up ratio of 0.1 and a weight decay of 0.1 is applied throughout training [26]. The \(\alpha\) in Eq. 19 is selected from the set {0.1, 0.01, 0.001, 0.0001} by grid search, and the temperature parameter is set to 2.0 [51]. On each dataset, every experiment is repeated five times, and we report the maximum F1-score (“Max”), the average F1-score (“Mean”), and the standard deviation (“Std”); Table 2 presents these results. The proposed attention-guiding mechanism acts on all attention heads of BioBERT, and the best results on all datasets are obtained by integrating PAMDFGA into the last four layers of BioBERT. The best training runs use 6 epochs for BC4CHEMD, 10 epochs for BC2GM, 41 epochs for NCBI-Disease, 34 epochs for BC5CDR-Disease, and 47 epochs for BC5CDR-Chem. All ablation and case studies are performed with the same parameters and epochs. Because model training is not complicated, we do not freeze the parameters of BioBERT. All models are trained on an NVIDIA RTX 3090.
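
As an illustration of this optimization setup, the sketch below builds the AdamW optimizer and the linear warm-up schedule, assuming the HuggingFace transformers scheduler helper; the placeholder model and the number of training steps are illustrative.

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)            # placeholder for the full framework
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
num_training_steps = 10_000              # depends on dataset size and epoch count
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),   # warm-up ratio of 0.1
    num_training_steps=num_training_steps)
```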

Table 2 Experimental results over five runs

Performance and comparisons

We compare our model with a wide range of methods, all of which are based on sequence tagging: approaches based on neural networks [16,17,18], approaches based on pre-trained language models such as BERT [19] and BioBERT [20], approaches based on external knowledge [10, 21], and approaches based on multi-task learning [22,23,24,25,26]. As can be seen from Table 3, multi-task learning has become increasingly popular for BioNER. Among these methods, Chai et al. [25] achieved SoTA performance on the BC4CHEMD and BC5CDR-Chem datasets by training a model on 14 datasets, realizing multi-level information fusion between low-level entity features and higher-level data features; their model combines multi-task learning and fine-tuning. Tong et al. [26] design multiple auxiliary classification losses that incorporate multi-granularity information from the datasets and achieve the best performance on the BC4CHEMD, BC5CDR-Chem, and BC5CDR-Disease datasets. Both obtain their best performance without utilizing additional resources. It is worth noting that Tian et al. [10] inject a large amount of external syntactic knowledge (i.e., POS labels, syntactic constituents, and dependency relations) into BioBERT in the form of key-value pairs, which works best on the BC2GM, BC5CDR-Chem, and NCBI-Disease datasets. Although additional knowledge and multi-task learning can alleviate the problem of insufficient data, additional knowledge usually contains considerable noise, and it is difficult to control how much additional information should be used. The training process of multi-task learning is also complicated, making it hard to design a general multi-task framework for many BioNER datasets. As a result, current methods are only effective on some specific datasets. In contrast, we achieve the best performance on all five datasets by using a novel word-pair relation classification scheme and the proposed PAMDFGA. As indicated in Table 3, the improvement of our model is even more pronounced when compared with models that do not use additional knowledge. First, our model outperforms existing methods regardless of whether they introduce external knowledge, which confirms the validity of our innovations in enhancing BioNER feature extraction. Second, although some models exploit higher-level features, for example, Tian et al. [10] leverage POS tags, syntactic constituents, and dependency rules, and Tong et al. [26] employ multi-task learning, our model achieves better results with simple attention guiding. This suggests that our model can better mine semantic information and alleviate entity sparsity in all datasets, especially on the disease-entity datasets (NCBI-Disease and BC5CDR-Disease) and the gene-entity dataset (BC2GM). We conclude that the features extracted by the PAMDFGA module, driven by whole-attention-map guidance, effectively assist biomedical text representation and even show more potential than designs tailored specifically to the biomedical field.

Table 3 Model performance comparison on the five benchmark datasets
Table 4 Ablation study

Discussion

Ablation study

Our proposed attention mechanism PAMDFGA performs an average fusion of the attention weights along the two dimensions of prefix and original text length. To analyze the impact of its different components on different datasets, we conducted ablation experiments under the same experimental parameters, with BioBERT as the pre-trained model, as shown in Table 4. In this table, “baseline” refers to our model without any attention-guiding mechanism, “PAMDFGA w/o \(Loss_{p}\)” refers to removing \(Loss_{p}\) while retaining \(Loss_{o}\), “PAMDFGA w/o \(Loss_{o}\)” refers to removing \(Loss_{o}\) while retaining \(Loss_{p}\), and “PAMDFGA” refers to the full proposed attention. The results in Table 4 show that each component provides some guidance for BioBERT. PAMDFGA outperforms either single component on the BC2GM, BC5CDR-Disease, BC5CDR-Chem, and NCBI-Disease datasets. On BC4CHEMD, our proposed guidance does not perform as well as using the original attention weights alone. The reasons are two-fold: (1) BC4CHEMD is large in scale, and the sentences in its training set that contain no entities are long and numerous; (2) after random initialization, the prefix distribution is mapped into a high-dimensional space, making the entity distribution too sparse, so the guidance from these tokens is weak. We also find that removing \(Loss_{o}\) has a larger impact than removing \(Loss_{p}\), but in the fused attention mechanism \(Loss_{p}\) provides further guidance. Compared with the baseline, our proposed fusion-guided attention mechanism significantly improves the entity recognition of BioBERT because it can better utilize the grid information of word-pair relationships; removing any component degrades performance on almost all metrics. The ablation results demonstrate that PAMDFGA brings more valuable information to BioBERT, as it pushes each head to focus on different positions of the input and capture diverse information.

Effect of different prefix length for PAMDFGA

The prefix length (P in Eq. 7) is an influential hyper-parameter for PAMDFGA, because the prefix participates in the calculation of self-attention and therefore affects which words PAMDFGA attends to. As shown in Fig. 5, we experimentally find that the best prefix length lies within 20. The prefix length is selected from the set {5, 7, 9, 11, 13, 15, 20} by grid search. Figure 5 shows that prefix length exhibits a similar trend across the datasets, with a prefix length of 11 performing best on all of them; this may be related to the sentence lengths in BioNER corpora. We therefore use a prefix length of 11 in both our ablation experiments and our best-performing models.

Fig. 5
figure 5

Performance of different prefix length. PL denotes prefix length

Effect of PAMDFGA on different layers of BioBERT

To demonstrate that our proposed PAMDFGA can be integrated into any layer of a PLM, we study its effect on each layer of BioBERT using the NCBI-Disease test set. As shown in Fig. 6, most layers of BioBERT benefit from PAMDFGA, and the improvement is most obvious in the last four to five layers. In particular, the F1-score of the eleventh layer increases from 89.87 to 90.82%. The F1-score of the last layer decreases relative to the eleventh layer, which is understandable because PAMDFGA encourages the information of different BioBERT heads to diverge; that is, our attention mechanism has a stronger guiding effect on other layers. It can also be argued that PAMDFGA makes the attention information of different heads more diverse than the patterns attended to by traditional pre-trained models, whose F1 increases monotonically over the last few layers. In all our final experiments, we integrate PAMDFGA into the last four layers of BioBERT; for comparison, the baseline system also uses the outputs of the last four layers of BioBERT.

Fig. 6
figure 6

Performance of each BioBERT layer with PAMDFGA. The red dashed line indicates that we replicated the results of Lee et al. [20] on the NCBI-Disease dataset

Computational cost analysis

Our proposed attention mechanism works in the fine-tuning phase without modifying the self-attention formula, which means we do not need to re-train the PLMs. The entire model keeps the same time complexity as the Transformer, \({\mathcal {O}}(N^{2})\), where N is the length of the input sentence, so PAMDFGA also has merits in terms of time cost. Nevertheless, its calculation takes more time than directly fine-tuning the pre-trained model. Table 5 shows the per-epoch training time of the entire fine-tuned model on the five datasets. As can be seen, the additional time cost of adding PAMDFGA is minor: the extra per-epoch training time is about 2.21 s, 1.11 s, 3.35 s, and 1.68 s on the BC2GM, BC5CDR-Disease, BC5CDR-Chem, and NCBI-Disease datasets, respectively. Our attention mechanism thus adds very little computational cost to training, since it introduces no additional parameters, which makes the advantage of PAMDFGA even more significant. For BC4CHEMD, the per-epoch training time increases by 24.74 s, which we consider acceptable, since this dataset is inherently large and PAMDFGA improves the recognition of chemical terms.

Table 5 Per-epoch training time (in seconds) with or without PAMDFGA

Case study

To demonstrate the validity of our proposed attention mechanism, we plot full attention heatmaps on the NCBI-Disease dataset to explain why the 11-th layer works well in Fig. 6, and we perform a qualitative analysis on the BC5CDR-Chem dataset, comparing the real labels with the labels predicted by a sequence labeling method based on BioBERT and by our model.

Attention visualization

To show the validity of PAMDFGA in Fig. 6 and to demonstrate that our attention attends to more positions, we present examples of full self-attention maps from fine-tuned models with and without PAMDFGA, providing a better illustration of the different head patterns in Figs. 7 and 8. The selected token sequence from NCBI-Disease after the WordPiece tokenizer is “[’[CLS]’, ’the’, ’first’, ’recognized’, ’human’, ’kind’, ’##red’, ’with’, ’hereditary’, ’deficiency’, ’of’, ’the’, ’fifth’, ’component’, ’of’, ’complement’, ’(’, ’c’, ’##5’, ’)’, ’is’, ’described’, ’.’, ’[SEP]’]”. Taking the attention weight change of the first head in layer 11 of BioBERT as an example (Fig. 7), many informative tokens are overlooked by the self-attention without PAMDFGA (Fig. 7a) but captured by our method (Fig. 7b). Looking further at the full attention map in Fig. 8, the repeated attention patterns, such as the diagonal pattern [47], across different heads and layers of BioBERT are significantly reduced after using PAMDFGA, and the heads in the last four layers attend to more diverse words. In this way, the probability of an entity being attended to naturally increases. We conclude that our attention mechanism pushes the diversity of the entire attention map. This also explains why the F1-score of layer 11 in Fig. 6 is higher: layer 11 pays more attention to the words near entity words, for example the tokens ’hereditary’ and ’deficiency’, which together constitute an important biomedical entity that deserves more attention. Most other heads show a similar effect. The comparison of the heatmaps between the eleventh and twelfth layers in Fig. 8b also illustrates why the F1-score of the eleventh layer is higher.
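
The heatmaps in Figs. 7 and 8 can be reproduced along the following lines, assuming the HuggingFace transformers API with attention outputs enabled; the figure styling is illustrative and the example uses the vanilla BioBERT checkpoint rather than our fine-tuned model.

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer

name = "dmis-lab/biobert-base-cased-v1.1"          # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

text = ("The first recognized human kindred with hereditary deficiency "
        "of the fifth component of complement (C5) is described.")
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions         # tuple of 12 x (1, 12, L, L)

layer, head = 10, 0                                 # layer 11, head 1 (0-indexed)
A = attentions[layer][0, head]
labels = tok.convert_ids_to_tokens(inputs["input_ids"][0])
plt.imshow(A.numpy(), cmap="Blues")                 # darker = higher attention
plt.xticks(range(len(labels)), labels, rotation=90, fontsize=6)
plt.yticks(range(len(labels)), labels, fontsize=6)
plt.show()
```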

Fig. 7
figure 7

Visualization of attention scores over the first head in layer 11. This case is selected from the NCBI-Disease test set. Darker colors correspond to greater attention scores

Fig. 8
figure 8

Visualization of attention scores over all heads and all layers. This case is selected from the NCBI-Disease test set. The figure covers the 12 layers of BioBERT, each containing 12 heads. Darker colors correspond to greater attention scores

Qualitative analysis

We randomly sampled one sentence from the BC5CDR-Chem test set and compared the sequence tagging method using BioBERT with our word-pair model with PAMDFGA. Figure 9 shows that our model has clear advantages over BioBERT in learning entity information and alleviating label inconsistency. For example, the sequence tagging BioBERT model usually recognizes only single-word entities such as “calcium”, while multi-word entities such as “CD-832” tend to be identified incorrectly, causing the base model to treat the constituent words as two different entities. “CD-832”, however, is an important and complete chemical entity. Because word-pair relation classification better captures the relationships between adjacent words, our model recognizes that “CD” and “-”, and “-” and “832”, hold the NNW relation, and that “832” and “CD” hold the THW relation; the identified relations are then decoded into a complete entity. To further verify how much attention our model pays to entities, we draw attention heatmaps averaged over all heads and layers in Fig. 10, focusing on the interactions of tokens other than ’[CLS]’ and ’[SEP]’. The selected token sequence after the WordPiece tokenizer is “[’[CLS]’, ’Effects’, ’of’, ’a’, ’new’, ’calcium’, ’antagonist’, ’c’, ’##D’, ’83’, ’##2’, ’[SEP]’]”, and the attention scores are averaged over all heads and layers. This visualization validates the effectiveness of the proposed attention compared with the traditional self-attention pattern. As shown in Fig. 10, many informative tokens are overlooked by the Transformer-based attention (Fig. 10a) but captured by our method (Fig. 10b). For instance, PAMDFGA allows the token “CD” to strongly attend to the tokens “-” and “832”, whereas these tokens receive less attention in the Transformer-based attention. In addition, our model strengthens the attention between sub-words such as ’83’ and ’##2’. These observations explain why our model better captures the semantic information of neighboring words.

Fig. 9
figure 9

Examples of two predicted labels from sequence tagging (Base(BIO)) and our method (Ours). This case is selected from the BC5CDR-Chem test set. Orange indicates the corresponding entities. The blue word “B-Chem” represents the real label. Black “B-Chem” and red “B-Chem” represent labels that are correctly predicted and incorrectly predicted by the model, respectively. The green and yellow arrows represent the labels predicted by our method

Fig. 10
figure 10

Visualization of attention scores averaged over all heads and all layers. This case is selected from the BC5CDR-Chem test set. The blue rectangle indicates higher scores on the right side but lower scores on the left side. Darker colors correspond to greater attention scores

Conclusion

In this work, we address the BioNER problem with a new prediction pattern for the first time. Experiments show that this prediction mode models entity words and recognizes entities better than sequence labeling. To further improve the recognition of biomedical entities, we design a novel and efficient prefix and attention map discrimination fusion guided attention mechanism that enhances BioBERT by changing its attention distribution, and our method outperforms the four existing mainstream categories of methods. This work points to a promising direction and provides a new research angle for BioNER. In future work, we plan to explore the effectiveness of PAMDFGA in different PLMs and different biomedical tasks, and to explore how to incorporate more domain-specific knowledge to guide self-attention learning in other domains, such as biomedical low-resource domains.