1 Introduction

Event detection, as a crucial research area in natural language processing, aims to accurately identify and categorize specific events from text [1]. Event detection tasks are essential for information extraction [2], knowledge graph construction [3], text summarization, and many other applications. Despite extensive research efforts in recent years, this task still faces numerous challenges [4, 5]. Firstly, the semantic complexity of events and the diversity of text require algorithms with high generalization capabilities to capture various possible expressions. Secondly, many events are semantically related or similar, which increases the difficulty of the classification task, as traditional representations such as one-hot encoding cannot capture these subtle semantic differences [6]. Furthermore, since events may be intertwined with multiple entities and relationships, accurate event identification and classification require a deep understanding of context [7].

To the best of our knowledge, existing event detection algorithms in previous research have not adequately leveraged the semantic information contained within event labels [8]. In previous event detection algorithms, during the classification task, labels were typically transformed into one-hot vectors and used as supervisory signals in calculating the loss function. For an N-class classification problem, the cosine similarity between any two one-hot vectors for the labels would be zero. For instance, consider a classification task with three labels, "cat," "dog," and "car." The cosine similarity between the one-hot vectors of these three labels would be zero for all pairs. However, even though "cat" and "dog" are different labels, they both belong to the animal category. Therefore, the similarity between the "cat" and "dog" labels should be higher than the similarity between "cat" and "car" labels.
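As a minimal numerical illustration of this point (the dense label vectors below are invented purely for demonstration and are not produced by any particular model), one-hot label vectors are always mutually orthogonal, whereas dense embeddings can reflect that "cat" is closer to "dog" than to "car":

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors for a three-label task: every pair has cosine similarity 0.
cat_1h, dog_1h, car_1h = np.eye(3)
print(cosine(cat_1h, dog_1h), cosine(cat_1h, car_1h))  # 0.0 0.0

# Hypothetical dense label embeddings: "cat" and "dog" end up closer to each
# other than either is to "car", preserving label semantics.
cat, dog, car = np.array([0.8, 0.6, 0.1]), np.array([0.7, 0.7, 0.2]), np.array([0.1, 0.2, 0.9])
print(cosine(cat, dog) > cosine(cat, car))  # True
```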

Models that treat labels as one-hot vectors encode only the event sentences and do not encode the event labels themselves [9]. As shown in Fig. 1, this type of model is referred to as a single-encoding classification model in this paper. Such an approach results in the loss of the semantic information carried by the labels themselves.

Fig. 1 A comparison of single-encoding classification models and dual-encoding matching models for event detection methods

In the field of information extraction, question-answering-based methods are a common approach [10, 11]. Taking named entity recognition as an example, Li et al. [12] assign a natural language question to each entity label to introduce prior information. For instance, when extracting the "location" entity, the question can be "find the location in the text." This question is concatenated with the sentence to be extracted and fed into an encoder, whose output is then used to extract the boundaries of the named entity. Question-answering-based information extraction works well because the questions already contain prior semantic information [13], which is closely related to the principles of recent prompt learning [14]. Beyond introducing prior knowledge, the question-answering approach extracts only one label at a time, since each question string carries the prior knowledge of a single label; this effectively avoids boundary overlap between the results of different labels [15].

In the field of information retrieval, such as in search engines, the ability of query sentences to retrieve target web pages [16] is due to the high semantic matching between the target web pages and the query sentences [17]. The typical approach to calculate semantic matching involves computing the similarity between the vector representations of the query sentence and the target web page [18]. This model, which utilizes two vectors for matching, is referred to in this paper as the dual-encoding matching model, as illustrated in Fig. 1b.

In the context of event detection tasks, this paper combines ideas from information extraction methods based on question-answering and semantic similarity computation techniques from the information retrieval domain. Firstly, a natural language definition for an event label is introduced as prior knowledge, and then, this definition is used to obtain the semantic information of the event label. Subsequently, this semantic information is matched with the encoding of the event sentence to determine the label for the event sentence based on semantic similarity. To obtain more accurate label semantic information, this paper employs contrastive learning to optimize the semantic representation of event labels.

Specifically, this paper makes the following contributions:

  1)  Abandoning one-hot label vectors in favor of low-dimensional dense label representations. To capture the semantic information of labels, a pre-trained language model is employed to encode the natural language definitions of the labels.

  2)  Leveraging contrastive learning to bring the semantic encodings of event labels closer, in the vector space, to the encodings of event sentences with the same label, thereby enhancing the representational quality of the label semantic encodings through concrete instances.

  3)  Employing multi-layer attention for semantic matching among events, arguments, and event sentences, thereby improving event detection performance.

  4)  Modeling event detection as a multi-label classification task over event sentences, without trigger word extraction. Additionally, a fully connected classifier is defined for each event label, addressing the single-sentence multi-event scenario and mitigating the poor classifier convergence caused by sample imbalance.

2 Related works

Event detection is one of the classic tasks in the field of natural language processing. Its goal is to identify the specific type of event mentioned in a given text. Traditional event detection algorithms adopt a supervised learning paradigm and therefore require varying amounts of annotated samples. Currently, event detection algorithms can be categorized into three main types based on their specific implementations: pattern matching-based methods, machine learning-based methods, and deep learning-based methods.

The pattern matching-based method was the earliest proposed approach to event detection and can be traced back to Riloff's work in 1993 [34]. The AutoSlog system proposed in that study uses a domain-specific trigger word dictionary to detect potential events; the dictionary is constructed automatically from part-of-speech-tagged language templates. This method developed rapidly in specific domains such as biomedicine [35] and finance [36]. However, designing and maintaining the matching templates requires expert intervention, and because the templates depend heavily on domain-specific textual expressions, the generality and generalizability of this method are greatly limited [37].

To alleviate reliance on predefined templates, many scholars have attempted to use machine learning methods for event detection tasks [38, 39]. The basic approach of machine learning-based event detection is to extract features from texts and then train various classifiers using the annotated information of samples. Compared to pattern matching-based methods, the advantage of machine learning lies in reducing the workload of designing and constructing templates, as well as offering better generalizability and universality. However, the classic machine learning models used are not very effective in fitting various complex nonlinear relationships and are also quite sensitive to feature selection.

With deep learning achieving dominant performance across many fields [40], event detection methods based on deep learning have been proposed in succession [41,42,43]. The main characteristic of deep learning methods is that they use higher-dimensional word embeddings and deeper network structures, which compensates for the limited nonlinear fitting capability of classic machine learning models and reduces the dependency on feature engineering. The DMCNN model [44] uses convolutional neural networks to automatically extract features at both the lexical and sentence levels. To better capture the complex relationships between local and global context in documents, Zhao [45] and Liu [47] introduced the DMB-PN model in 2020 for few-shot event detection, combining dynamic memory networks with prototype networks for trigger word identification and event type classification. Existing methods often focus on feature extraction and improvements to the learning algorithm but tend to overlook the modeling of relationships between samples. Furthermore, current few-shot event detection methods still rely on trigger word extraction, which can lead to performance degradation on complex events.

In response to these issues, this paper proposes a new event detection framework, combining question-answering-based information extraction methods with semantic similarity calculation techniques from the field of information retrieval. Our method optimizes the semantic representation of event labels by acquiring semantic information of event labels through natural language definitions and further refining these semantic representations using contrastive learning.

3 Methods

3.1 Overall algorithm design

In this study, we propose an event detection algorithm based on label semantic embedding, aimed at addressing the limitations of traditional event detection methods in few-shot learning and complex semantic understanding. The core idea of the algorithm rests on three theoretical foundations: deep semantic understanding, contrastive learning, and attention mechanisms. Firstly, we recognize that in natural language processing, the precise identification and classification of events depends not only on superficial lexical matching but also on a deep understanding of the underlying semantic structure of language. Our algorithm therefore employs a BERT encoder to extract deep semantic information from text, enabling the model to capture subtle nuances and rich contextual information more accurately. Secondly, we observe that bringing semantically similar samples closer in the vector space while pushing dissimilar samples apart significantly enhances the model's ability to capture semantic information from a limited number of samples. Based on this observation, we incorporate contrastive learning to optimize the semantic representation of event labels, improving the accuracy and robustness of event type determination. Finally, we integrate multi-level attention mechanisms to process the complex relationships among events, arguments, and sentences more finely. The attention mechanisms not only increase the algorithm's flexibility but also allow the model to dynamically allocate weights across different semantic levels, thereby focusing more effectively on the key information for event detection.

Assume the input sentence is \(S = [w_{1}, w_{2}, \ldots, w_{n}]\). The main steps of the event detection algorithm based on label semantic embedding are as follows:

  1.  Employ a BERT encoder to perform context-aware encoding of the text to be extracted.

  2.  Send the definitions of named entity labels, event labels, and argument labels to the LSE encoding module to obtain semantic encodings for each label.

  3.  Feed the text encoding and event label semantic encoding into the contrastive learning fine-tuning module, which fine-tunes with SupConLoss to align the text encoding with the event label semantic encoding in the semantic space.

  4.  Add the text encoding element-wise to the entity label semantic encoding, then apply multi-layer attention-weighted fusion with the argument label semantic encodings and event label semantic encodings to obtain event label semantic encodings that contain contextual and argument semantic information.

  5.  Send the event label semantic encodings to multiple independent binary classifiers to obtain the event detection results.

The overall architecture of the algorithm is depicted in Fig. 2. The algorithm comprises five main modules: the BERT text encoder, the label semantic encoding module, the contrastive learning fine-tuning module, the attention module, and the multi-binary classification module.

Fig. 2 Event detection algorithm based on label semantic encoding
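The following skeleton sketches how these modules could be wired together in PyTorch; the class and method names are illustrative assumptions rather than the authors' released implementation (the contrastive fine-tuning of step 3 happens at training time and is therefore only referenced in a comment).

```python
import torch.nn as nn

class LabelSemanticEventDetector(nn.Module):
    """Illustrative skeleton of the pipeline in Sect. 3.1 (interfaces assumed)."""

    def __init__(self, bert_encoder, lse_module, dual_attention, binary_heads):
        super().__init__()
        self.bert = bert_encoder              # step 1: context-aware text encoding
        self.lse = lse_module                 # step 2: label semantic encoding (Sect. 3.2)
        self.dual_attention = dual_attention  # step 4: two-layer attention fusion (Sect. 3.6)
        self.binary_heads = binary_heads      # step 5: one binary classifier per event label (Sect. 3.7)

    def forward(self, sentence_ids, entity_defs, arg_defs, event_defs):
        X = self.bert(sentence_ids)                 # token encodings of the event sentence
        # Step 3 (contrastive fine-tuning with SupConLoss) aligns X with the
        # event label encodings during training and is omitted here.
        X = X + self.lse(entity_defs)               # fuse entity label semantics (Eq. 12)
        arg_enc = self.lse(arg_defs)                # argument label semantic encodings
        event_enc = self.lse(event_defs)            # event label semantic encodings
        schema = self.dual_attention(X, arg_enc, event_enc)   # Eqs. (15)-(19)
        return self.binary_heads(schema)            # per-label presence logits (Eq. 22)
```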

3.2 Label semantic encoding module

The objective of the label semantic encoding module is to obtain low-dimensional dense vectors that encapsulate semantic information for a given label. These semantic label vectors, which contain semantic information, can be used for semantic similarity matching with other label vectors or sentence vectors, resulting in similarity scores that support downstream calculations of attention scores and classification. The specific workflow of this module is as follows:

Suppose we need to obtain the semantic encoding for label A, which can be an event, entity, or argument label. We are first provided with the natural language definition of label A, which should comprehensively describe the characteristics of that label. For example, for the label "car," a possible definition is "a machine used for transportation." We denote this natural language definition as:

$$Def_{A} = [d_{1} ,d_{2} ,...,d_{m} ]$$
(1)

To obtain the semantic information from this short natural language text, we employ BERT:

$$BERT_{out} = BERT_{ENCODER}([CLS] + Def_{A} + [SEP])$$
(2)
$$BERT_{out} \in R^{12 \times (m + 2) \times d_{emb}}$$
(3)

In the original BERT model, the [CLS] vector is specifically designed for classification tasks. However, recent research has shown that using the [CLS] vector for classification tasks does not yield satisfactory results [22]. Therefore, in this paper, we employ the average pooling representation (RepA) from the last layer of BERT as an approximation of the entire sentence.

$$Rep_{A} = AveragePool(BERT_{out}[-1, :, :])$$
(4)

Hence, for label A, the overall process of obtaining the label semantic embedding, described in Eqs. (1) to (4), can be summarized as follows:

$$LSE(A) = AveragePool(BERT_{ENCODER}([CLS] + Def_{A} + [SEP]))$$
(5)

For event labels, named entity labels, and argument labels, this section provides example definitions. These definitions are sourced from the standard definitions in the data annotation manual, as shown in Table 1.

Table 1 Label definitions (for the complete annotation guidelines, see the ACE2005 dataset documentation: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-events-guidelines-v5.4.3.pdf)
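As a concrete sketch of Eq. (5), the snippet below computes a label semantic encoding with the HuggingFace transformers library; the checkpoint name and the decision to pool over all positions of the last layer are assumptions where the paper does not pin them down.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

def label_semantic_encoding(definition: str) -> torch.Tensor:
    """LSE(A): average-pool the last-layer BERT states of a label definition (Eqs. 2-5)."""
    inputs = tokenizer(definition, return_tensors="pt")    # adds [CLS] and [SEP]
    with torch.no_grad():
        last_hidden = encoder(**inputs).last_hidden_state  # shape (1, m + 2, d_emb)
    return last_hidden.mean(dim=1).squeeze(0)              # AveragePool -> (d_emb,)

# Example: the "car" definition used in the text.
car_vec = label_semantic_encoding("a machine used for transportation")
```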

3.3 Contrastive learning fine-tuning module

The semantic encoding of a specific event label should encompass the semantic information of that label. This semantic information derives not only from the natural language definition of the label but also from the many similar event sentences in the corpus. Because the label semantic encoding is matched against the sentence encoding by vector similarity in the final classification stage, cosine similarity yields better matches when the encodings of sentences of the same type lie close together in the embedding space.

Contrastive learning brings pairs of similar positive samples closer to each other in the vector space while pushing pairs of dissimilar samples farther apart. This property helps the label semantic encoding learn semantics from the corpus. The primary role of the contrastive learning fine-tuning module is to align the label semantic encoding with the encodings of event sentences of the same type in the embedding space. The structure of the contrastive learning fine-tuning module is illustrated in Fig. 3.

Fig. 3 Contrastive learning fine-tuning module

3.4 Label definition encoding

In this section, we opt for the same BERT encoder as used in the LSE module to ensure consistency in the semantic space.

Assuming the input sentence is \(S = [w_{1}, w_{2}, \ldots, w_{n}]\), we first obtain the encoding from the last layer of the BERT encoder:

$$X = [\overrightarrow{CLS}, \overrightarrow{x_{1}}, \overrightarrow{x_{2}}, \ldots, \overrightarrow{x_{n}}, \overrightarrow{SEP}] = BERT_{ENCODER}(S)[-1, :, :]$$
(6)

To obtain the global representation of event sentences, it is necessary to perform average pooling on the sequence, consistent with Sect. 3.2. Here, the output of the last layer of the BERT encoder is used, and global average pooling is applied to obtain the representation of the event sentence:

$$X^{avg} = AveragePool(X[1:-1])$$
(7)

After the average pooling, we obtain the global embedding \(X^{avg}\) for the event sentence. Semantic encodings for event labels are obtained using the LSE module:

$$LSE_{{event_{A} }} = {\text{LSE}}(event_{A} )$$
(8)

3.5 Named entity label information

Similar to event labels, semantic encodings for named entity labels are obtained using the label semantic encoding module. In the ACE2005 dataset, named entity annotations are already provided; the entity information for a sequence is denoted as Entitytype, where ei represents the named entity label of the i-th token. Using the named entity definitions table from Sect. 3.2, we obtain the sequence of entity definitions Entityseq, where entityi is the definition of the i-th entity label. Entityseq is then sent to the LSE module to obtain the entity encodings of the sequence, denoted as Xentity.

$$Entity_{type} = [e_{cls}, e_{1}, e_{2}, \ldots, e_{n}, e_{sep}]$$
(9)
$$Entity_{seq} = [None, entity_{1}, entity_{2}, \ldots, entity_{n}]$$
(10)
$$X_{entity} = LSE(Entity_{seq}), \quad X_{entity} \in R^{(n + 2) \times d_{emb}}$$
(11)

Since we are using the same text encoder for encoding, it can be assumed that both types of vectors have already been aligned to the same semantic space. Therefore, this paper performs feature fusion by directly adding them together:

$$X = X + X_{entity}$$
(12)
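Continuing the LSE sketch above, Eq. (12) can be realized as a simple per-token lookup of entity definition encodings followed by element-wise addition; the per-token tag format is an assumption for illustration.

```python
import torch

def fuse_entity_semantics(X, token_entity_tags, entity_definitions):
    """Eq. (12): add entity label semantic encodings onto the token encodings.

    X: (n + 2, d_emb) last-layer token encodings of the event sentence.
    token_entity_tags: one entity tag per position, e.g. ["None", "PER", ..., "None"].
    entity_definitions: dict mapping each tag to its natural-language definition (Table 1).
    """
    X_entity = torch.stack([
        label_semantic_encoding(entity_definitions[tag]) for tag in token_entity_tags
    ])                       # (n + 2, d_emb), reusing the LSE function defined earlier
    return X + X_entity      # both live in the same BERT semantic space
```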

3.6 Dual-layer attention

When the sequence representation has already incorporated entity semantics, the sequence information needs to be aggregated or filtered. Previous work has often used graph neural networks for this aggregation. For instance, the JMEE model [23] incorporates syntactic information into trigger word representations to enhance classification performance. However, this approach requires knowledge of the syntactic dependency relationships in sentences, which necessitates third-party tools to generate syntactic dependency trees. This introduces additional annotation effort, and errors in the output of the dependency parser can propagate into the model.

Attention mechanisms can be regarded as a form of soft connection in a graph attention network [24], where each word is connected to every other word. Therefore, replacing graph neural networks with attention mechanisms is a reasonable choice.

For closed-domain event detection tasks, each event has fixed argument roles, with arguments being subsets of entities. Therefore, it is possible to initially employ semantic embeddings of argument labels to perform attention-weighted operations on sequence representations. This approach allows leveraging the semantics of arguments to match with the entity semantics contained in the sequence and aggregate entity information onto the argument label semantic encoding. Subsequently, the event label semantic encoding performs a second round of attention-weighted operations on the argument label semantic encoding. A schematic representation of the dual-layer attention mechanism is illustrated in Fig. 4.

Fig. 4 Schematic representation of the dual-layer attention mechanism

First, we utilize the LSE module to obtain semantic encodings for all argument labels, as outlined in the following equation, where q represents the number of argument labels, and argi denotes the definition of the i-th argument label.

$$\arg_{seq} = [None, \arg_{1}, \ldots, \arg_{q}]$$
(13)
$$Arg = LSE(\arg_{seq}) = [LSE(None), LSE(\arg_{1}), \ldots, LSE(\arg_{q})]$$
(14)

Next, we implement the first attention layer by defining parameter matrices \(W_{1}^{Q} \in R^{d_{emb} \times d_{hidden}}\), \(W_{1}^{K} \in R^{d_{emb} \times d_{hidden}}\), \(W_{1}^{V} \in R^{d_{emb} \times d_{value}}\), and \(W_{1}^{O} \in R^{d_{value} \times d_{emb}}\), with \(Arg \in R^{(q + 1) \times d_{emb}}\). The semantic encoding of the argument labels serves as the query, while the sequence encoding serves as the key and value for the attention computation. The formula for a single attention head is given below. After the attention-weighted summation, we obtain the argument label semantic encoding \(\hat{Arg}\), which combines information from the sequence and the entities:

$$\hat{Arg} = \left( softmax\left( \frac{Arg W_{1}^{Q} \cdot (X W_{1}^{K})^{T}}{\sqrt{d_{hidden}}} \right) \cdot X W_{1}^{V} \right) \cdot W_{1}^{O}$$
(15)
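A possible single-head realization of Eq. (15) is shown below; wrapping it in a reusable module lets the same layer type serve as the second attention layer as well. The dimension names follow the paper, but the module itself is a sketch rather than the authors' code.

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head attention of Eq. (15): the projections play the roles of W^Q, W^K, W^V, W^O."""

    def __init__(self, d_emb: int, d_hidden: int, d_value: int):
        super().__init__()
        self.W_q = nn.Linear(d_emb, d_hidden, bias=False)
        self.W_k = nn.Linear(d_emb, d_hidden, bias=False)
        self.W_v = nn.Linear(d_emb, d_value, bias=False)
        self.W_o = nn.Linear(d_value, d_emb, bias=False)
        self.scale = math.sqrt(d_hidden)

    def forward(self, query, key_value):
        # query: label semantic encodings; key_value: the sequence being attended over
        scores = self.W_q(query) @ self.W_k(key_value).transpose(-1, -2) / self.scale
        weights = torch.softmax(scores, dim=-1)
        return self.W_o(weights @ self.W_v(key_value))
```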

The implementation of the second attention layer is essentially the same as that of the first. Firstly, parameter matrices \(W_{2}^{Q} \in R^{d_{emb} \times d_{hidden}}\), \(W_{2}^{K} \in R^{d_{emb} \times d_{hidden}}\), \(W_{2}^{V} \in R^{d_{emb} \times d_{value}}\), and \(W_{2}^{O} \in R^{d_{value} \times d_{emb}}\) are defined. Then, semantic encodings for all event labels, denoted as Event, are obtained through the LSE module.

$$event_{seq} = [None, event_{1}, \ldots, event_{p}]$$
(16)
$$Event = LSE(event_{seq}) = [LSE(None), LSE(event_{1}), \ldots, LSE(event_{p})]$$
(17)

where p represents the number of event labels. Event is used as the query, while \(\hat{Arg}\) serves as the key and value for the attention computation. The attention calculation is given below. Ultimately, we obtain the event label semantic encoding \(schema \in R^{(p + 1) \times d_{emb}}\), which integrates information from the sequence, entity labels, and argument labels. Its shape is identical to that of Event.

$$schema = \left( softmax\left( \frac{Event W_{2}^{Q} \cdot (\hat{Arg} W_{2}^{K})^{T}}{\sqrt{d_{hidden}}} \right) \cdot \hat{Arg} W_{2}^{V} \right) \cdot W_{2}^{O}$$
(18)

The attention mechanism above is implemented for a single attention head and can be extended to multi-head attention. Following the multi-head attention design of the Transformer [25], we incorporate X[0] and Event into schema via residual connections and pass the result to the multi-binary classification module:

$$schema' = [X[0], X[0], \ldots, X[0]] + schema + Event$$
(19)
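Reusing the CrossAttention sketch above, the two layers and the residual combination of Eq. (19) could be wired up as follows (tensor sizes are illustrative placeholders, not values from the paper):

```python
import torch

d_emb, d_hidden, d_value, n, q, p = 768, 256, 256, 20, 5, 8   # illustrative sizes
X = torch.randn(n + 2, d_emb)       # sentence token encodings after entity fusion (Eq. 12)
Arg = torch.randn(q + 1, d_emb)     # argument label semantic encodings (Eq. 14)
Event = torch.randn(p + 1, d_emb)   # event label semantic encodings (Eq. 17)

layer1 = CrossAttention(d_emb, d_hidden, d_value)
layer2 = CrossAttention(d_emb, d_hidden, d_value)

arg_hat = layer1(Arg, X)            # Eq. (15): argument labels attend over the sequence
schema = layer2(Event, arg_hat)     # Eq. (18): event labels attend over the arguments
schema_prime = X[0].unsqueeze(0) + schema + Event   # Eq. (19): broadcast X[0] over all rows
```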

3.7 Multi-binary classification module

An event sentence can contain multiple events, so the traditional multi-class cross-entropy loss function is unsuitable for event detection. Using a Sigmoid output layer with a classification threshold is also problematic: determining the threshold requires extensive experimentation, and the threshold must be re-tuned whenever the training data changes, which limits adaptability. Additionally, because event labels are extremely imbalanced, a single classification network may not train effectively on minority classes. For multi-class multi-label problems, a common approach is to use an independent binary classification network for each label [26].

Therefore, this model’s solution to the problem is to employ a binary classification neural network for each event label. This network determines whether the current event sentence contains a specific event label, outputting 1 if the label is present and 0 if it is not. Each binary classification network is optimized using binary cross-entropy loss, ensuring that each event label is decoded independently without interfering with others.

Different weights can be assigned to each loss function based on the frequency of event occurrences. The computation of the loss function is as follows.

$$L = [l_{0}, l_{1}, \ldots, l_{p}], \quad \text{where } l_{i} = \begin{cases} 1 & \text{if the event sentence contains event } i \\ 0 & \text{otherwise} \end{cases}$$
(20)
$$Weight = [w_{0}, w_{1}, \ldots, w_{p}]$$
(21)
$$logits_{i} = LeakyReLU(w_{i} \cdot schema'[i, :, :] + b_{i})$$
(22)

The true label for each type of event contained in the event sentence is denoted as L, and the binary classification network parameters for the i-th label are represented as \(w_{i} \in R^{{d_{emb} * (p + 1)}}\) and \(b_{i} \in R^{p + 1}\).
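One way to realize the per-label binary heads of Eqs. (20)-(22) is sketched below. Using a single scalar logit per head is a simplifying assumption; the parameter shapes quoted above suggest the authors' exact layout may differ.

```python
import torch
import torch.nn as nn

class MultiBinaryClassifier(nn.Module):
    """One independent fully connected binary head per event label (Sect. 3.7); a sketch."""

    def __init__(self, num_event_labels: int, d_emb: int):
        super().__init__()
        # +1 accounts for the "None" label row of schema'
        self.heads = nn.ModuleList([nn.Linear(d_emb, 1) for _ in range(num_event_labels + 1)])
        self.act = nn.LeakyReLU()

    def forward(self, schema_prime):                  # schema_prime: (p + 1, d_emb)
        logits = [self.act(head(schema_prime[i])) for i, head in enumerate(self.heads)]
        return torch.cat(logits)                      # (p + 1,) one logit per event label
```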

3.8 Loss function

Since the ACE2005 dataset is labeled, to better exploit the characteristics of labeled data, this section adopts SupConLoss [27] as the contrastive learning loss function. SupConLoss is a supervised contrastive loss that maximizes the log-probability of similarity between positive pairs, where the similarity between vectors is measured by cosine similarity. The original SupConLoss formula is as follows:

$$\mathcal{L}_{out}^{sup} = \sum_{i \in I} \mathcal{L}_{out,i}^{sup} = \sum_{i \in I} - \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \left( \frac{\exp(z_{i} \cdot z_{p} / \tau)}{\sum_{a \in A(i)} \exp(z_{i} \cdot z_{a} / \tau)} \right)$$
(23)

where i denotes any sample, P(i) is the set of all samples in the same batch that share the class of i (its positives), z_i is the representation of the current sample, z_p is the representation of a positive sample of the same class, and z_a is the representation of any other sample within the batch. τ is the temperature coefficient; a smaller τ makes it easier to distinguish hard samples.

As shown in Fig. 5, the selection of positive and negative samples follows an intra-batch positive and negative sampling strategy. Data within the same batch are combined to form positive samples if they share the same label, while other sample pairs in the batch are treated as negative samples for loss calculation. At the beginning of each batch, the semantic encoding of the corresponding event label is inserted. Each label semantic encoding can be considered as an instance of that label, contributing to the construction of positive and negative samples alongside other instances within the same batch.

Fig. 5 Positive and negative sample construction method (green represents positive sample pairs)

Adapting the original SupConLoss to the event extraction task yields the following loss calculation:

$$\mathcal{L}_{con} = \sum_{i \in I} - \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \left( \frac{\exp(x_{i}^{avg} \cdot z_{+} / \tau)}{\sum_{a \in A(i)} \exp(x_{i}^{avg} \cdot z_{a} / \tau)} \right)$$
(24)
$$z_{a} = \begin{cases} x_{a}^{avg} & \text{if the index of } a \in [1, \frac{|P(i)|}{2}] \\ STE_{a} & \text{if the index of } a \in (\frac{|P(i)|}{2}, |P(i)|] \end{cases}$$
(25)
$$z_{+} = \begin{cases} x_{+}^{avg} & \text{if the index of } + \in [1, \frac{|P(i)|}{2}] \\ STE_{+} & \text{if the index of } + \in (\frac{|P(i)|}{2}, |P(i)|] \end{cases}$$
(26)
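A compact PyTorch sketch of the supervised contrastive loss of Eqs. (23)-(26) is given below. The batch is assumed to already contain both the sentence encodings \(x^{avg}\) and the inserted label semantic encodings, each tagged with its event label; this is an illustrative re-implementation, not the reference SupConLoss code.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(reps: torch.Tensor, labels: torch.Tensor, tau: float = 0.07):
    """SupConLoss over a mixed batch of sentence and label encodings (Eqs. 23-26).

    reps:   (B, d) representations: the x_i^avg vectors plus the label semantic
            encodings inserted at the start of the batch (Fig. 5).
    labels: (B,) event label index of each row; rows sharing a label are positives.
    """
    reps = F.normalize(reps, dim=-1)                 # dot products become cosine similarities
    sim = reps @ reps.t() / tau                      # (B, B) scaled similarity matrix
    self_mask = torch.eye(len(reps), dtype=torch.bool, device=reps.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))  # A(i) excludes the anchor itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_counts = pos_mask.sum(dim=1).clamp(min=1)    # guard anchors without any positive
    loss_per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts
    return loss_per_anchor.sum()
```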

In addition, deep learning suffers from a significant sample imbalance problem: on datasets with biased distributions, models may overfit or underfit. A common remedy is to balance sample labels through resampling or downsampling so that each label has roughly the same number of samples. However, the ACE2005 event extraction dataset exhibits a severe long-tail distribution; for some labels there are too few training samples, and resampling would only produce many duplicate sentences, which is not meaningful. Therefore, we impose constraints on the per-label loss terms at the level of the loss function.

In this paper, we set a hyperparameter alpha (alpha < 1) for labels with significantly larger data volumes, such as ATTACK, TRANSPORT, DIE, and MEET, denoted as set A, to limit the impact of their excessive data. Simultaneously, we set another hyperparameter beta (beta > 1) for labels with very few samples, such as PARDON, EXTRADITE, and ACQUIT, denoted as set B, to address the underfitting caused by data scarcity.

The loss function uses binary cross-entropy:

$$\mathcal{L}_{event} = \sum_{i \in A} alpha \cdot CELoss(logits_{i}, l_{i}) + \sum_{j \in B} beta \cdot CELoss(logits_{j}, l_{j})$$
(27)
$${\mathcal{L}} = {\mathcal{L}}_{con} + {\mathcal{L}}_{event}$$
(28)

Thus, as given in Eq. (28), the overall loss of the event detection algorithm is the sum of the contrastive loss and the weighted event classification loss.
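As a sketch of Eqs. (27)-(28) (the weight values and the handling of labels outside sets A and B are assumptions), the weighted binary cross-entropy term and the total objective could be computed as:

```python
import torch
import torch.nn.functional as F

def event_classification_loss(logits: torch.Tensor, targets: torch.Tensor,
                              label_weights: torch.Tensor) -> torch.Tensor:
    """Eq. (27): frequency-weighted binary cross-entropy over the per-label heads.

    logits:        (p + 1,) outputs of the multi-binary classification module.
    targets:       (p + 1,) float 0/1 indicators l_i from Eq. (20).
    label_weights: (p + 1,) alpha (< 1) for frequent labels in set A (e.g. ATTACK),
                   beta (> 1) for rare labels in set B (e.g. PARDON), 1.0 otherwise.
    """
    per_label = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (label_weights * per_label).sum()

# Eq. (28): total training objective.
# total_loss = supervised_contrastive_loss(reps, batch_labels) + \
#              event_classification_loss(logits, targets, label_weights)
```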