Introduction

The increasing availability of clinical care data, affordable computing power, and suitable legislation provide the opportunity for (semi-)automated decision support systems in clinical practice. An important step in the development of such a decision support system is the accurate extraction of relevant labels to train the underlying models. These labels are rarely directly available as structured data in electronic health records (EHRs)—and even if they are, they often lack the precision and reliability [41] required for a clinical decision support system. Therefore, extraction of labels from free text in the EHR—which contains the richest information and the appropriate amount of nuance—is needed.

To this end, we need to consider the context in which medical terms are mentioned. One of the most important contextual properties in clinical text is negation, which indicates the absence of findings such as pathologies, diagnoses, and symptoms. As they make up an important part of medical reasoning, negations occur frequently: one study estimated that more than half of all medical terms in certain clinical text are negated [6]. Accurate negation detection is critical when labels and features from free text in the EHR are extracted for use in clinical prediction models. But improving information retrieval through negation detection has many other use cases in healthcare, including administrative coding of diagnoses and procedures, characterizing medication-related adverse effects, and selection of patients for inclusion in research cohorts.

Negation detection is not a trivial task, due to the large variety of ways negations are expressed in natural language. It can be performed either with a rule-based approach or through machine learning. In this paper, we evaluate the performance of one rule-based method (based on ContextD [1]) and two machine learning methods (a bidirectional long short-term memory model implemented in MedCAT [25], and a RoBERTa-based [28] Dutch language model) for the detection of negations in Dutch clinical text.

In their simplest form, traditional rule-based methods consist of a list of regular expressions of negation triggers (e.g. “no evidence for”, “was ruled out”). When a negation trigger occurs just before or after a medical term in a sentence, the medical term is considered negated. Examples include NegEx [5], NegFinder [33], NegMiner [13] and ConText [19, 38]. Some approaches also incorporate the grammatical relationships between the negation and medical terms, for example by incorporating part-of-speech tagging to determine the noun phrases that a negation term could apply to (see e.g. NegExpander [3]), or by using dependency parsing to uncover relations between words (see e.g. NegBio [35], negation-detection [16], DepNeg [40] and DEEPEN [31]). Moreover, distinguishing between the different types of negations (syntactic, morphological, sentential, double negation) as well as adding word distance has proven helpful (see e.g. NegAIT [32] and Slater et al. [39]). While usually tailored for English, some of these methods have been adapted for use in other languages, including French [11], German [8] and Spanish [7, 9], as well as Dutch [1].
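To make this basic mechanism concrete, the following minimal sketch (our own simplification, not one of the cited implementations; the trigger list and window size are illustrative assumptions) flags a term as negated when a trigger phrase occurs within a small token window around it:

```python
# Illustrative sketch of a trigger-window check; not NegEx, NegFinder or ConText.
NEGATION_TRIGGERS = ["no evidence for", "no signs of", "was ruled out"]

def is_negated(sentence: str, term: str, window: int = 5) -> bool:
    tokens = sentence.lower().split()
    term_tokens = term.lower().split()
    for i in range(len(tokens) - len(term_tokens) + 1):
        if tokens[i:i + len(term_tokens)] == term_tokens:
            # text windows just before and just after the medical term
            pre = " ".join(tokens[max(0, i - window):i])
            post = " ".join(tokens[i + len(term_tokens):i + len(term_tokens) + window])
            return any(t in pre or t in post for t in NEGATION_TRIGGERS)
    return False

print(is_negated("there is no evidence for pneumonia", "pneumonia"))  # True
print(is_negated("pneumonia was ruled out", "pneumonia"))             # True
```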

The main advantages of rule-based negation detection methods are that they are transparent, easily adaptable, and do not require any scarce labeled medical training data. Rule-based methods can be surprisingly effective: Goryachev et al. [17] demonstrate that the relatively simple NegEx can be more accurate than machine learning-based methods, such as a Support Vector Machine (SVM) trained on the part-of-speech tags surrounding the term of interest.

The main disadvantage of rule-based methods is that they are by definition unable to detect negations that are not explicitly captured in a rule. Depending on the use case, this can severely hamper their performance. This is where machine learning methods come into play, as they may outperform rule-based methods by picking up rules implicitly from annotated data.

One such machine learning method is the bidirectional long short-term memory model (biLSTM), a neural network architecture that is particularly suited for the classification of sequences such as natural language sentences. This model processes all words in a sentence sequentially, but in contrast to traditional feed-forward neural networks, it takes the output for previous words into account to model relations between the words in the sentence. The processing is bidirectional, meaning that sentences are processed in the (natural) forward direction as well as the reverse direction.
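For illustration, a minimal sketch of such a classifier is shown below (assuming PyTorch; this is not MedCAT’s implementation, and the vocabulary and layer sizes are arbitrary). An embedding layer feeds a bidirectional LSTM, and the hidden state at the position of the medical term is mapped to a negation probability.

```python
import torch
import torch.nn as nn

class BiLSTMNegationClassifier(nn.Module):
    """Illustrative biLSTM negation classifier (a sketch, not MedCAT's model)."""

    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)  # forward + backward states

    def forward(self, token_ids: torch.Tensor, term_index: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(embedded)       # (batch, seq_len, 2 * hidden_dim)
        # use the hidden state at the position of the annotated medical term
        term_states = outputs[torch.arange(outputs.size(0)), term_index]
        return torch.sigmoid(self.classifier(term_states)).squeeze(-1)

model = BiLSTMNegationClassifier(vocab_size=5000)
tokens = torch.randint(0, 5000, (2, 12))  # two dummy sentences of 12 token ids
term_positions = torch.tensor([4, 7])     # position of the medical term per sentence
print(model(tokens, term_positions))      # negation probabilities
```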

Based on a conventional biLSTM (see e.g. Graves and Schmidhuber [18]), Sun et al. [42] developed a hybrid biLSTM Graph Convolutional Network (biLSTM-GCN) method. The biLSTM can also be combined with a conditional random field [20]. Other machine learning approaches include an SVM with access to positional information of the tokens in each sentence (Cruz Díaz et al. [10]) and the use of conditional random fields in the NegScope system by Agarwal and Yu [2].

A more recent machine learning model is RoBERTa [28], a bidirectional neural network architecture that is pre-trained on extremely large corpora using a self-supervised learning task, specifically filling in masked tokens. This masking does not require external knowledge, as the selection of tokens to be masked can be performed automatically. RoBERTa is part of a family of models, which primarily vary in their learning task, that are based on the transformer architecture [43]. Once pre-trained, a transformer model can be fine-tuned on supervised learning tasks such as negation detection or named entity recognition. Lin et al. [27] show that a zero-shot language model such as BERT performs well on the negation task and does not require domain adaptation methods. Khandelwal and Sawant [23] developed NegBERT, a BERT model fine-tuned on open negation corpora such as BioScope [46].
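As a sketch of how such fine-tuning can be set up (assuming the Hugging Face transformers library; the identifier pdelobelle/robbert-v2-dutch-base refers to the publicly released RobBERT v2 checkpoint and should be replaced by the model of choice; this is not our exact training code):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "pdelobelle/robbert-v2-dutch-base"  # assumed public RobBERT v2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# One training example: the sentence containing the medical term of interest;
# label 1 = negated, 0 = not negated.
sentence = "Geen aanwijzingen voor pneumonie."  # "No evidence of pneumonia."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
labels = torch.tensor([1])

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # plug into an optimizer or Trainer for actual fine-tuning
```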

The goal of the current paper is to compare the performance of rule-based and machine learning methods on Dutch clinical data. We conduct an error analysis of the types of errors the individual models make, and also explore whether combining the methods through ensembling offers additional benefits. Python implementations of all the evaluated models are available on GitHub.

Data

We used the Erasmus Medical Center Dutch clinical corpus (DCC) collected by Afzal et al. [1] (published together with ContextD), which contains 7490 anonymized Dutch medical records annotated with medical terms and contextual properties. All text strings that exactly matched an entry in a subset of the Dutch Unified Medical Language System (UMLS, [4]) were considered a medical term. These medical terms were subsequently annotated for three contextual properties: temporality, experiencer, and negation. In this paper we focus on the binary context property negation. The label negated was assigned when the text contained evidence that a specific event or condition did not take place or exist; otherwise the label not negated was assigned.

As illustrated in Fig. 1, we excluded 2125 records from further analysis, primarily because no annotation was present (2078 records), because the file containing the source text or its annotation was corrupted (37 records), or because the annotation did not correspond to a single medical term (10 records; e.g., only a single letter was annotated, or a whole span of text containing multiple medical terms). This left 5365 usable records for analysis, containing a total of 12551 annotated medical terms. A small number of medical terms were not processed by the RoBERTa-based models because of the maximum record length of 512 tokens imposed in our implementation. We excluded these medical terms from the analysis with the other methods as well, resulting in a final set of 12419 annotated medical terms.

The corpus consists of four types of clinical records, which differ in structure and intent (for details, see Afzal et al. [1]). Basic statistics are presented in Table 1 and a representative example of each record type, including various forms of negation, is provided in Supplementary Material A.

Table 1 Basic textual statistics of the selected DCC records, showing the mean value per record and the boundaries of the second and third quartile (top) and the total count in the dataset (bottom)
Fig. 1 Data flow diagram

Methodology

We employed three distinct methods to identify negations: a rule-based approach based on ContextD, the biLSTM from MedCAT, and a fine-tuned Dutch RoBERTa model. All methods were evaluated with cross-validation using the same ten folds.

Rule-based approach

The rule-based ContextD algorithm [1] is a Dutch adaptation of the original ConText algorithm [19].

The backbone of the ConText algorithm is a list of (regular expressions of) negation terms (“negation triggers”). A given medical term is considered to be negated when it falls within the scope of a negation trigger. The default scope in ConText is the remainder of the sentence after/before the trigger. Each negation trigger has either a forward (e.g., “no evidence of ...”) or backward scope (e.g., “... was ruled out”).

ConText has two more types of triggers aside from negation triggers: pseudo-triggers and termination triggers. Pseudo-triggers are phrases that contain a negation trigger but should not be interpreted as such (e.g., “not only” is a pseudo-trigger for the negation trigger “not”). Pseudo-triggers take precedence over negation triggers: when a pseudo-trigger occurs in a sentence, the negation trigger it encompasses is not acted upon. Termination triggers serve to restrict the scope of a negation trigger. For example, a word like “but” in the sentence “No signs of infection, but pneumonia persists” signals that the negation does not apply to the entire sentence. Using “but” as a termination trigger prevents the algorithm from considering “pneumonia” to be negated, while “infection” is still considered negated.
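The following simplified sketch (our own illustration, not the MedSpaCy implementation; the trigger lists are truncated examples) shows how forward scope, pseudo-triggers, and termination triggers interact:

```python
import re

FORWARD_TRIGGERS = ["no signs of", "no evidence of", "not", "no"]
PSEUDO_TRIGGERS = ["not only"]
TERMINATION_TRIGGERS = ["but"]

def negated_terms(sentence: str, terms: list[str]) -> set[str]:
    text = sentence.lower()
    for pseudo in PSEUDO_TRIGGERS:               # pseudo-triggers take precedence
        text = text.replace(pseudo, " ")
    negated = set()
    for trigger in FORWARD_TRIGGERS:
        match = re.search(r"\b" + re.escape(trigger) + r"\b", text)
        if not match:
            continue
        scope = text[match.end():]               # default scope: rest of the sentence
        for termination in TERMINATION_TRIGGERS:
            scope = scope.split(termination)[0]  # termination trigger cuts the scope
        negated.update(t for t in terms if t.lower() in scope)
    return negated

print(negated_terms("No signs of infection, but pneumonia persists",
                    ["infection", "pneumonia"]))  # {'infection'}
```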

We used the Dutch translation of the original ConText triggers, as produced by the ContextD authors. These triggers were used in conjunction with MedSpaCy [14], a Python implementation of the ConText algorithm. Because ConText defines the scope of a negation trigger in terms of a number of words or the boundary of the sentence (the default), the raw text also needs to be tokenized and split into separate sentences. We used the default tokenizer and the dependency-parser-based sentence splitter of the nl_core_news_sm-2.3.0 model in spaCy, a generic Python library for NLP.
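A minimal sketch of this preprocessing step (assuming spaCy with a downloaded Dutch model; version pinning omitted):

```python
import spacy

# Sentence splitting and tokenization with the generic Dutch spaCy model, as a
# preprocessing step before applying the ConText triggers. Assumes the model
# has been downloaded, e.g. via `python -m spacy download nl_core_news_sm`.
nlp = spacy.load("nl_core_news_sm")

text = "Geen aanwijzingen voor pneumonie. Patient hoest wel."
doc = nlp(text)
for sentence in doc.sents:
    print([token.text for token in sentence])  # one token list per sentence
```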

MedCAT’s biLSTM

The open-source multi-domain clinical natural language processing tool Medical Concept Annotation Toolkit (MedCAT) incorporates Named Entity Recognition (NER) and Entity Linking (EL) to extract information from clinical data [25].

To assess the degree of consistency in the categorization by the three reviewers, we computed Cohen’s kappa coefficient on the subset of errors that were shared by multiple models.
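A sketch of this computation (assuming scikit-learn; the category labels below are hypothetical, not the study data):

```python
from sklearn.metrics import cohen_kappa_score

# Pairwise Cohen's kappa on the error categories assigned by two reviewers
# (hypothetical labels for illustration only).
reviewer_1 = ["scope", "speculation", "other", "minus", "scope"]
reviewer_2 = ["scope", "ambiguous", "other", "minus", "speculation"]

print(cohen_kappa_score(reviewer_1, reviewer_2))
```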

Table 2 Definitions and examples of the error categories used for error analysis. Negation terms are underlined; the annotated medical terms are in brackets. Examples translated from the Dutch source text

Results

We consider the model performance quantitatively by looking at overall performance metrics, and more qualitatively by analyzing and categorizing the errors that each model made.

Overall performance

Of the 12419 medical terms in the 5365 medical records, 1748 were marked as negated by the annotators. Of these, 1687 were identified by at least one of the negation detection models.

The precision, recall and F1 score for each negation detection method are reported in Table 3. RobBERT achieved the highest scores overall, followed by the ensemble method for a few metrics and record types.
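For reference, these metrics can be computed as follows (a sketch assuming scikit-learn, with hypothetical labels):

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels: 1 = negated, 0 = not negated.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```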

Table 3 Classification results across methods and data sources

Rule-based

While the performance of the rule-based method here was indeed comparable to the original (see the results in Table 5 of Afzal et al. [1]), some differences can be identified, which probably arise because we use neither exactly the same rules nor exactly the same dataset. First, there are two different variations of ContextD: the “baseline” rules, which were simply a translation of the original ConText method [19], and the “final” rules, which were iteratively adapted using half of the dataset. Our set of rules is most similar to the “baseline” method, as we chose not to implement any of the modifications of the “final” method described in the paper. However, our set of rules likely still contains some elements of the “final” method. Second, while the original ContextD method was evaluated on only half of the dataset (as the other half was used for fine-tuning the rules), we use the full dataset. In our evaluation, performance varied quite strongly over the folds, which indicates that the exact evaluation set that was used influences the obtained results. Performance varied particularly strongly for the GP entries, which would also explain why the difference in performance with ContextD is largest for this category.

Machine learning (biLSTM, RobBERT)

The rule-based method is outperformed by both machine learning methods in almost all cases. Its performance varies strongly over the different record types: it performs worst for the least structured records, particularly the GP entries, and best for the most structured records, particularly the radiology reports. The performance gap between the rule-based and machine learning methods shows the same pattern: the less structured the record, the larger the gap.

The biLSTM model consistently outperformed the rule-based approach, except for the radiology reports category, where performance was approximately equal. In turn, RobBERT outperformed the biLSTM model with a difference of 0.02–0.05 in F1 score across record categories.

Additionally, we saw no consistent differences between the different RobBERT implementations. The smallest 32-token window resulted in slightly reduced accuracy, but drastically reduced the required computational resources (see Additional file 1, Tables S1 and S2).

Note that for the RobBERT and the biLSTM models we did not apply threshold tuning to optimize for precision or recall—we simply took the default threshold of 0.5 (also note that calibration is required if such a probabilistic measure is applied in clinical practice).
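A sketch of what such threshold selection looks like (hypothetical probabilities; assuming NumPy):

```python
import numpy as np

# Hypothetical predicted negation probabilities for five medical terms.
probabilities = np.array([0.91, 0.47, 0.62, 0.08, 0.55])

default_labels = probabilities >= 0.5      # the default threshold used here
high_recall_labels = probabilities >= 0.3  # lower threshold favours recall over precision
print(default_labels, high_recall_labels)
```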

Model ensemble

The RobBERT method outperforms the other models as well as the ensemble method when scored on the complete dataset. On the individual categories the ensemble method performs worse than or similar to the RobBERT method.
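The voting ensemble reduces to a simple majority decision per medical term, as in the following sketch (our own illustration, assuming a majority vote over the three binary predictions):

```python
# Majority vote over the binary predictions of the three models for one term.
def majority_vote(predictions: list[bool]) -> bool:
    return sum(predictions) > len(predictions) / 2

# rule-based: negated, biLSTM: negated, RobBERT: not negated -> ensemble: negated
print(majority_vote([True, True, False]))  # True
```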

Figure 3 shows that the RobBERT method (all bars with “RobBERT”) makes fewer errors than the voting ensemble (all bars with more than two methods). In particular, the number of errors committed by RobBERT alone is smaller than the number of errors that are introduced when adding the BiLSTM and the rule-based method to the ensemble.
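The exclusive intersection counts underlying Fig. 3 can be derived from the per-model error sets, as in this sketch (hypothetical term identifiers):

```python
from collections import Counter

# Hypothetical sets of misclassified term identifiers per model.
errors = {
    "rule-based": {1, 2, 3, 7},
    "biLSTM": {2, 3, 8},
    "RobBERT": {3, 9},
}

# Count, for every misclassified term, exactly which subset of models got it wrong.
counts = Counter(
    frozenset(model for model, errs in errors.items() if term_id in errs)
    for term_id in set().union(*errors.values())
)
for models, n in counts.items():
    print(" & ".join(sorted(models)), n)
```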

Fig. 3 Number of misclassifications for different model combinations. The y-axis shows the number of errors in all possible intersections of the error sets made by the different models. That is, “All” is the number of entities that are misclassified by all three models; “Rule-based” is the number of entities that are misclassified by only the rule-based model; “RobBERT & BiLSTM” is the number of entities misclassified by both the RobBERT and biLSTM models, but not the rule-based model; etc.

Error analysis

We obtained a Cohen’s kappa of 0.48 for agreement on the error categories, which is considered moderate agreement [30]. The categories involved in disagreement are shown in Fig. 4. The uncommon negation meta-category is a source of disagreement, showing that some annotators consider a negation uncommon while others choose a semantic error category. The other category, being a (semantic) catch-all, is also responsible for disagreement. Of the specific categories, the speculation label is most often the subject of disagreement.

The moderate agreement is not surprising given that the error cases represent challenging annotations. It is important to note that multiple categories may apply to the same error due to model-specific interpretations, which would have a negative effect on the observed inter-annotator agreement.

Fig. 4 Confusion matrix of inter-annotator disagreement

Fig. 5 Error category frequencies

Table 4 Overview of error categories per model

We should note that the speculation category is somewhat domain specific: a clinician might have observed that a particular diagnostic test yielded no indications for a particular diagnosis; arguably this is not a negation of the diagnosis but merely the explicit absence of a confirmation. These signals may or may not be considered negations, depending on, for example, whether a test is seen as conclusive or whether additional testing is required for establishing a diagnosis. This reasoning would require the integration of external knowledge, for example through the UMLS. More broadly, the dichotomisation into negated/not-negated is perhaps too coarse given the high prevalence of explicitly speculative qualifications in electronic health records. One clear issue with this dichotomy is that it biases the annotations and models towards the not-negated class, because the negated label requires explicit negations whereas not-negated is everything else, i.e., negations are more strictly constrained. In some cases it is beneficial if the bias is reversed, for instance to obtain affirmations with a low false-positive rate. A mitigation of the non-negation bias is to introduce a model specifically for affirmations/non-affirmations, or indeed a separate label for speculation (see, e.g., Vincze [45]).

Model-independent issues

The distribution of error categories over the methods is shown in Fig. 5 and Table 4. The categories annotation error, speculation, and ambiguous (together around 20–25% across models) may benefit from more specific annotation guidelines. The other errors can be classified as actual mistakes by the models. Of these, scope and punctuation-related errors can potentially be improved through preprocessing. Modality errors can be addressed using special-purpose classifiers similar to the current negation detection classifier. The remaining problematic errors are negation of a different term, uncommon negation, minus, and other errors (around 50% of all errors across models).

Annotation- and uncertainty-related errors

A significant number of annotation errors was found, both for false positives and false negatives. This is consistent with Afzal et al. [1], who report that about \(8\%\) of the false positive negations were due to erroneous annotations (no percentage was reported for false negatives). These errors can be resolved by improving the annotation, either through better guidelines or by stricter application and post-hoc checking of these guidelines.

Note that annotation errors will also be present in true negatives/positives, which will remain undetected in case the models make the same mistake as the annotators. If these annotation errors are randomly distributed this is a form of label smoothing, and as such the errors could be useful to reduce overfitting. However, it is likely that these annotation errors are not random and are indicative of inherent ambiguity.

The speculation and ambiguous categories (together around 20–25% across models) both stem from uncertainty-related issues, either expressed by the clinician (speculation) or in the interpretation of the text (ambiguous). These may present problems both during annotation and during model training, as the examples do not fully specify a negation, yet the intended meaning can often be inferred. More specific annotation guidelines could reduce these issues to some extent, by ensuring that the examples of a certain category are consistently annotated. However, the models may still be unable to capture such inferences, even if the examples are consistently annotated.

Typical examples for speculation are: There is no clear [symptom] or The patient is dubious for [symptom]. The English corpora BioScope [46], GENIA [24] and BioInfer [36] include uncertainty as a separate label, which may be beneficial for the current dataset as well.

Remaining errors

The other categories were more syntactic in nature. The word-level syntactic errors scope, negation of different term, and uncommon negation occur across methods. Other errors are due to the use of a minus to indicate negation or usage of colon and semicolon symbols (punctuation). The minus sign is however also used as a hyphen (to connect two words), which complicates handling of this symbol both in preprocessing and during model training.

As an example of possible mitigation measures, the following sentence produced a false negative (i.e., a negated term classified as non-negated) for the target term redness:

Previously an antibiotics treatment was administered(no redness).


In this example the negation word no is concatenated with an opening parenthesis and the previous word, which poses problems for tokenization. Such errors might be avoided by inserting whitespace during preprocessing.
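A sketch of such a whitespace-repair step (our own illustration, using a simple regular expression around parentheses):

```python
import re

# Insert whitespace around parentheses so that "administered(no redness)" is
# split into separate tokens downstream.
def add_whitespace(text: str) -> str:
    return re.sub(r"\s*([()])\s*", r" \1 ", text).strip()

print(add_whitespace("Previously an antibiotics treatment was administered(no redness)."))
# Previously an antibiotics treatment was administered ( no redness ) .
```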

The following sentence shows a false positive from the biLSTM classifier for the term earache:

since 1 day pain in the right lower lobe andcoughing, mucus, temp to38.2, pulmones no abnormality earache and deaf, oam right?


This is a scope error: the negation of the term “abnormality” is incorrectly extended to the target term “earache”. This issue would be difficult to correct in preprocessing, as it necessitates some syntactic and semantic analysis to normalize the sentence. This would closely resemble the processing by the rule-based system; therefore this is an example where model ensembling could be beneficial. However, given the multiple other textual issues in this sentence (such as missing whitespace, punctuation, and capitalization), a more robust alternative solution might be to maintain a certain standard of well-formedness in reporting, either through automatic suggestions, reporting guidelines, or both.

Rule-based

Almost half of the false positives for the rule-based method fell under the “scope” category (41%, see Table 4). The default scope for a negation trigger extends all the way to the start (backward direction) or end (forward direction) of a sentence. For long sentences, or when sentences are not correctly segmented, the negation trigger may then falsely modify many medical terms. This occurred particularly often for short and unspecific negation triggers, such as no (Dutch: niet, geen). Potential solutions include improving sentence segmentation, restricting the number of medical terms that a single negation trigger can modify, restricting the scope to a fixed number of tokens (the solution used by the ContextD “final” algorithm for certain record types), or restricting the scope by adding termination triggers (the ContextD “final” algorithm added punctuation such as colons and semicolons as termination triggers for some record types). We determined that 32 out of the 140 false positives were caused by a missing termination trigger; adding just the single trigger wel (roughly meaning but) would have prevented 18 errors (5%).

Most other false positives were due to “negation of a different term”. These are perhaps more difficult to fix, but in some cases these could be prevented by adding pseudo-triggers that for instance prevent the trigger “no” from modifying a medical concept when followed by another term (e.g., “no relationship with”).

For the false negatives, the majority are caused by “uncommon negations”, i.e. negation triggers that were missing from the list of rules. A special case that caused many errors was the minus (hyphen) symbol, which in clinical shorthand is often appended to a term to indicate negation. Other missing negation triggers that occurred relatively often were variations on negative (n=18, such as neg), not preceded by (n=8, e.g., niet voorafgegaan door), and argues against (n=5, e.g., pleit tegen). The obvious way to remedy these error categories is to simply add these negation triggers to the rule list, but this may introduce new problems. For instance, adding “-” as a negation trigger (as was done in the ContextD “final” algorithm) would negate any word that occurs before a hyphen (e.g., “infection-induced disease”). More generally, any change aimed at reducing the number of false negatives (such as adding negation triggers) or false positives (such as restricting scope) is likely to induce a commensurate increase in false positives or negatives, respectively. The list of rules—and thereby the trade-off between recall and precision—will have to be adapted and optimized for each individual corpus and application.

BiLSTM

As shown in Table 4, for the biLSTM classifier around 5% of the errors are annotation errors, where the model actually predicted the correct label. For false negatives, one of the largest categories is scope errors, which includes examples where a list of entities is negated using a single negation term, as well as examples where many tokens are present between the negation term and the medical term. For false positives, negation of a different term is a common error, which is problematic for all three methods. However, compared to the other methods, the biLSTM has a more even distribution over the error categories. The overall performance and the distribution over categories show that the biLSTM is more robust against syntactic variation than the rule-based model, but generalizes less well than RobBERT.

RobBERT

Negation of a different term, speculation, uncommon negation and the use of a hyphen (minus) to indicate negation are the largest potentially resolvable contributors to the RobBERT error categories, totalling about \(50\%\) of the errors (Table 4).

RobBERT-base fills in missing punctuation, i.e. it expects punctuation based on the corpus it was trained on, whereas in our clinical texts punctuation is often missing. An illustrative example is “The patient is suffering from palpitations, shortness of breath and udema, this can be an indication of <mask>”, for which RobBERT filled in the mask with a semicolon.

A large percentage of the false negatives were due to RobBERT mishandling hyphens. We also observed varying model output depending on whether negation triggers were lowercase or uppercase. The mishandling of hyphens can potentially be resolved by adapting the tokenizer to treat the hyphen as a separate token, or by adding whitespace.

We observed that the negation estimate can vary over the different tokens that make up a word, as illustrated in Fig. 6. This variance is an artefact of using a sub-word tokenizer. It is potentially problematic for words consisting of many tokens, but it also allows for more flexibility, because we can decide to (for example) take the maximum probability over the tokens of a word. It can also occur that token-specific negations are required, for instance when the negation and the term are concatenated, as in “De patient is tumorvrij” (“The patient is tumorfree”). The possibility of such concatenation is language dependent.
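A sketch of this word-level aggregation (hypothetical tokens and probabilities):

```python
# Aggregate sub-word token probabilities to a single word-level negation
# probability by taking the maximum over the word's tokens
# (hypothetical tokens and probabilities).
word_tokens = {"tumorvrij": ["Ġtumor", "vrij"]}
token_probability = {"Ġtumor": 0.35, "vrij": 0.92}

word_probability = {
    word: max(token_probability[token] for token in tokens)
    for word, tokens in word_tokens.items()
}
print(word_probability)  # {'tumorvrij': 0.92}
```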

Fig. 6 Intra-word negation variance. The token delimiter character Ġ is a result of the tokenization

The categories uncommon negation and negation of a different term can be reduced by expanding the training set with the appropriate samples.

Discussion

We compared a rule-based method based on ContextD, a biLSTM model using MedCAT, and (fine-tuned) Dutch RoBERTa-based models on Dutch clinical text, and found that both machine learning models consistently outperform the rule-based model in terms of F1, precision and recall. Combining the three models was not beneficial in terms of performance. The best performing models achieve an F1 score of 0.95. This is a relatively high score for a cross-validated machine learning approach, and is likely near the upper bound of what is achievable for this dataset, considering the noise in the labeled data (0.90–0.94 inter-annotator agreement).

Applicability

The performance of the assessed methods is well within the acceptable range for use in many information retrieval and data science use-cases in the healthcare domain. Application of these methods can be especially useful for automated tasks where a small number of errors is permitted, such as reducing the number of false positives during cohort selection for clinical trial recruitment. In this task, erroneously excluding a patient is less problematic, and the included patients can be checked manually for eligibility. Other data applications can benefit from this as well, such as text mining for identification of adverse drug reactions, feature extraction for predictive analytics or evaluation of hospital procedures.

However, all models still make classification errors, which means they are not suitable as stand-alone methods for automatically retrieving annotations for medical decision support systems, although they can be used directly to improve existing label extraction processes. Application in a decision support system would require some form of manual interaction with a specialist.

Additional aspects for model comparison

The model comparison (based on precision, recall and F1 score) shows that the RobBERT-based models result in the highest performance. However, additional considerations can play a role in selecting a model, for example computational and human resources. Fine-tuning and subsequently applying a BERT-based model requires significant hardware and domain expertise, which may not be available in clinical practice, or only available outside the medical institution’s domain infrastructure, which introduces security and privacy concerns.

In contrast, the biLSTM and rule-based models can be used on a personal computer, with only a limited performance decrease (\(\sim 0.03\) and \(\sim 0.08\) respectively) on each evaluation metric. The rule-based method has the advantage that model decisions are inherently explainable, by showing the applied rules to the end user. This may lead to faster adoption of such a system compared to black box neural models.

The biLSTM method used here is part of MedCAT, which also incorporates named entity recognition and linking methods. Compared to the other assessed approaches, this is a more complete end-to-end solution for medical NLP, and it is relatively easy to deploy and use, especially in combination with the information retrieval and data processing functionalities of its parent project CogStack [21]. Recently, MedCAT added support for BERT-based models for the identification of contextual properties.

Limitations and future work

The study described in this paper has various limitations for which potential improvements can be identified. Regarding language models, the biLSTM network is trained on a relatively small set of word embeddings obtained from Dutch medical Wikipedia articles. This could be complemented or replaced with a larger and more representative dataset, to be more in line with the language models used in the RobBERT experiments. Alternatively, a corpus of actual electronic health records could be used as the most representative dataset, yet this raises privacy concerns given the high concentration of identifiable protected health information in natural language, even after state-of-the-art pseudonymization.

In the current approach the candidate terms for negation detection in the DCC are generated by performing medical named entity recognition, where each recognized entity is presented to the negation detection models. The context provided to each model, which is assumed to be sufficient for determining the presence of negation, is defined as the sentence around a term as determined by a sentence splitting algorithm. This results in scope-related errors such as incorrect delimitation of the medical term, incorrect sentence splitting, or the negation trigger being in a grammatically different part of the sentence. These issues can be reduced at various stages in the pipeline, either by improving the involved components, or by performing a sanity check on the generated example using part-of-speech tagging or (dependency) parsing. Another approach to reduce the number of problematic candidates is to train additional classifiers on meta-properties like temporality (“patient doesn’t remember previous occurrences of X”) or experiencer (“X is not common in the family of the patient”). Furthermore, the domain-specific structure of the EHR records in the various categories could be leveraged, e.g., to discard non-relevant sections of the health record during processing for specific use cases.

Considering that several error categories are related to the availability of training data, we can to some extent improve the models using synthetic data or a larger set of manually annotated real data. This can alleviate the lack of balance between the negation and non-negation classes in the Dutch Clinical Corpus (currently 14% negations), which is problematic for both the biLSTM and the RobBERT models. Furthermore, we observe a significant number of errors due to, or related to, ambiguity. Such errors are expected; not having any errors related to ambiguity could indicate an overfitted model. This idea of error categorisation could also be extended to create a model for estimating the dominant error types in unseen data, i.e. to facilitate model selection and problem-specific model improvements.

In future work it is of interest to train the methods on a broader set of health record corpora, in order to increase the amount of data in general, making the models less dependent on DCC-specific distributions, and to alleviate the class balance and sparsity issues in particular.

In this work we compared a small number of methods, which may have led to a conservative estimate of the performance of the resulting ensemble method. For future work it may be interesting to investigate a bespoke ensemble method in which rule-based and machine learning-based methods are combined in a complementary fashion. One particularly interesting technique is prompting, which does not require any fine-tuning and thus allows pre-trained language models to be leveraged directly.

Unraveling the semantics of clinical language in written electronic health records is a complex task for both algorithms and human annotators, as we experienced during error analysis. However, the three assessed methods show a good performance on predicting negations in the Dutch Clinical Corpus, with the machine learning methods producing the best results. Given the sparse availability of NLP solutions for the Dutch clinical domain, we hope that our findings and provided implementations of the models will facilitate further research and the development of data-driven applications in healthcare.