1 General introduction

For centuries, physicians have played an important role in ensuring good health. A physician must be well trained and able to manage disease and patient information in order to find the right treatment and make the right decision. In medicine, different types of information are used for treatment and can be found in narrative documents written by humans. For example, to make a medical prescription, the patient record and the information provided by the medication manufacturer are used [91]. Thanks to the development of information technologies and Hospital Information Systems (HIS), medical information is digitized into electronic records named Electronic Medical Records (EMR) [116] or Electronic Health Records (EHR). Digital records can be stored, managed, transmitted and reproduced efficiently. The widespread adoption of HIS has produced billions of records [90], which are recognized as valuable resources for large-scale analysis.

Similarly, the number of diseases and medications is gradually increasing. In medical prescription, for example, the doctor must know all the indications and contraindications before prescribing a drug, and reading a whole unstructured narrative medication leaflet takes a lot of time, especially for new doctors [25, 91, 92]. Today, medical records are available on the Internet, even to the general public. Nevertheless, the physician must manage the large amount of information written in natural language and select the important information from the narrative documents. Many medical researchers are overwhelmed by the huge amount of medical data in their studies [158]. Indeed, one of the main sources of errors in medicine is related to drug prescriptions. In 2006, there were 3900 prescription errors in Germany [91]. Eighteen years later, the Institute of Medicine in the United States reported 237 million medication errors per year in England, which are costly for the country [92], as well as between 1700 and 22303 deaths per year due to adverse drug reactions. In addition, there have been more than 234 million cases and 4 million deaths caused by coronavirus disease (COVID-19) [141]. Information about this virus's symptoms therefore needs to be analyzed with efficient tools in order to inform risk assessment, prevention and treatment strategy development, and outcome estimation.

All these facts make the use of these available electronic documents more than a necessity. Hence, extracting useful information from them would be greatly helpful. Unfortunately, this process is not trivial, mainly due to the huge number of documents, and consequently requires models able to deal with big data. It is further hampered by the unstructured (or semi-structured) nature of such documents. In order to make Information Extraction (IE) feasible and efficient, structuring these documents in a more abstract form that is easily readable by machines and algorithms becomes a fundamental step. The main IE technologies are the recognition of named entities and the extraction of relations between these entities. Entity recognition involves recognizing references to different types of entities such as medical problems, tests, allergies, risks, adverse events, drugs and treatments. In addition, detecting the different sections of a document can improve IE tasks by providing more context. Generally, medical text mining and IE help in medical decision-making and disease risk prediction.

In fact, several reviews related to IE in the medical field have been published. Meystre et al. [98] focused on research about IE from clinical narratives from 1995 to 2008. However, they did not discuss research on IE from the biomedical literature. Liu et al. [34] extract relations between drug and disease entities such as "cure", "no cure", "prevent", "side effect" and "other relations". Their method builds a combination of features for each sentence with the help of the UMLS meta-thesaurus, which provides semantic annotations by configuring the MetaMap system to extract concepts and semantic types. By combining NLP techniques and UMLS knowledge, many features can be extracted, such as frequency, lexical, morphological, syntactic and semantic features. These features are then passed to an SVM model to predict the relation associated with the sentence, using MEDLINE 2001 [115] as a standard training set. As an advantage, this method can extract correct and adequate features that are relevant for discovering interesting relationships between concepts. Furthermore, it outperforms other methods for all types of relations, especially in terms of f1-score and for the "cure" relation. Note that it performs better in a multidimensional context and is suitable for semantic relations in natural language texts. However, the lack of training data for the "no cure" relation leads to low performance for all compared methods. In addition, this method requires an annotated training set.

Kim et al. [60] propose a hybrid sequence labeling method to recognize family member and observation entities in EHR text notes and extract relations between them, in addition to living status. A rule-based system is used to select family member entities by matching relevant noun terms with the help of PoS tags. Then, a number of BiLSTM models are trained on dependency-based embeddings [62] as static embeddings and Embeddings from Language Models (ELMo) [105] as context-dependent embeddings. In addition, MetaMap [9] maps semantic types from UMLS and aligns them with entities to choose relevant ones. The family members and observations recognized by these BiLSTM models are ranked and voted on based on the models' f1-scores. Heuristic rules are used to normalize the family member entities through a simple dictionary-based mapping, and to determine the family side by looking at cue words considering the degree of the relatives. Then, two Online Gradient Descent (OGD) [20] models are trained on lexical features based on the identified entities to determine the living status and observations associated with family members. Hence, alive and healthy scores are assigned to living status phrases using cue words. Likewise, a negation attribute is assigned to observations using the ConText algorithm [22] with customized trigger terms. As an advantage, the voting ensemble of BiLSTM models contributes diversity to achieve better performance, and provides an efficient and convenient integration of the individual LSTM models, which are not deterministic. In addition, this method benefits substantially from the combination of two datasets. Integrating heuristics with advanced IE models leads to a high level of performance. The performance is improved especially on RE and benefits from the large training set and the pre-trained embeddings. However, the voting ensemble threshold that achieves the best performance for one task may not give the highest accuracy for another task. Also, some positive relations that rely on two entities in different sentences can be missed by using a carriage return character to filter examples.

5.3 Section detection

Generally, section detection can improve IE methods. By definition, a section is a segment of text that groups together consecutive clauses, sentences or phrases and shares the description of a patient dimension, a patient interaction or a clinical outcome. In fact, unstructured text has sections defined by the author. There are two types of sections: explicit sections that have titles, and implicit sections that do not [81]. In addition, a section has a level, where it can be a section or a sub-section. The problem is that the author is free, and even precisely defined templates are often ignored, which leads to less uniform titles and to many sections without titles [134]. In addition, there are different types of documents due to the lack of a standardized typology.

Section detection aims to improve the performance of clinical retrieval tasks that deal with natural language, such as entity recognition [76], abbreviation resolution [164], cohort retrieval [37] and temporal RE [66]. Section titles and their order in a prospectus may differ from one source to another; standardizing the sections can help avoid this problem [25]. There are several applications of section detection. Generally, it gives more context to a task by indicating the specific position of a concept. Section detection is used in information retrieval to support cohort selection and the identification of patients with risk factors. In addition, some tasks, like NER and co-reference resolution, can be improved by adding the section as a feature. Another application is distinguishing sensitive terms in de-identification. Also, section detection methods can improve tasks that consider the order of events by identifying temporal sections. This detection can also be applied to document quality assessment. Furthermore, these methods can select supporting educational resources by extracting relevant concepts. The main challenges of section detection are the production of a benchmark dataset, the integration of texts from different institutions, the selection of the types of measures to evaluate, the adaptation to different natural languages and the integration of the most robust methods into high-impact clinical tasks.

According to an analysis of 39 section detection methods [106], several methods adapt pre-built tools for this task. In addition, almost all section detection methods rely on custom dictionaries created by their authors, and most of them use a controlled terminology, the UMLS Metathesaurus being the most common. Precision and recall are very high in these methods, although recall is always lower. Almost all methods detect sections that have titles, while half of them can also detect sections without titles, and more than half of the methods do not detect sub-sections. Concerning the type of narrative, most methods work on discharge summaries in first place, perhaps because of their predefined internal structure or because this type of document is frequently transferred; radiology reports come in second place. Few methods deal with languages other than English. Generally, when the genre of the text, the medical specialty or the institution imposes more restrictions, the methods may achieve better results.

The methods of section detection can be classified according to the technique used into rule-based methods, machine learning-based methods and hybrid methods.

5.3.1 Rule-based methods

In this approach, there are three types of rules: exact matching, regular expressions and probabilistic rules. This approach requires dictionaries as additional information. The dictionaries can be flat title lists containing possible terms used as titles, hierarchical title dictionaries, or synonym and word-variant dictionaries containing modifications or abbreviations of standard titles. In addition, the methods in this approach can handle different features such as the most frequent concepts/terms in each section, probabilities assigned to headings, or formatting features that represent the shape of the section, e.g., number of words, capitalization, a question mark, an enumeration pattern, size, etc. Keep in mind, however, that feature values may vary depending on the corpus, e.g., the size of a section. Most methods use non-probabilistic rules and require titled sections. With exact matching, recall decreases when the level of standardization of headings is low. Regular expressions can search for and match the expected text even with variants. Only a minority of methods is based on probabilistic rules; these rules complement exact matching and regular expressions, are based on text features, give very high results, and are more advanced in that they can detect even unannotated sections. Most methods are based on the rule-based approach.
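As a concrete illustration of exact matching against a flat title dictionary combined with a regular expression for heading-like lines, the following minimal Python sketch segments a note by its headings. The titles, variants and the heading pattern are illustrative assumptions, not taken from any of the cited systems.

```python
import re

# Hypothetical flat title dictionary: canonical section names mapped to known variants.
TITLE_DICTIONARY = {
    "HISTORY OF PRESENT ILLNESS": ["history of present illness", "hpi", "present illness"],
    "MEDICATIONS": ["medications", "current medications", "meds"],
    "ALLERGIES": ["allergies", "drug allergies"],
    "ASSESSMENT AND PLAN": ["assessment and plan", "a/p", "impression and plan"],
}

# Regular expression for a heading-like line: capitalized, fairly short, ending with a colon.
HEADING_PATTERN = re.compile(r"^\s*([A-Z][A-Za-z /&-]{2,40})\s*:\s*$")

def detect_sections(lines):
    """Assign each content line to the most recently matched canonical section."""
    current = "UNKNOWN"
    segmented = []
    for line in lines:
        match = HEADING_PATTERN.match(line)
        if match:
            candidate = match.group(1).strip().lower()
            # Exact match of the candidate heading against the dictionary of variants.
            for canonical, variants in TITLE_DICTIONARY.items():
                if candidate in variants:
                    current = canonical
                    break
            else:
                # Heading not in the dictionary: keep the raw heading as the section label.
                current = candidate.upper()
        else:
            segmented.append((current, line))
    return segmented

if __name__ == "__main__":
    note = ["Medications:", "Aspirin 81 mg daily.", "Allergies:", "No known drug allergies."]
    for section, text in detect_sections(note):
        print(section, "->", text)
```

Real systems typically add synonym expansion, fuzzy matching and probabilistic scoring on top of this basic skeleton.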

Beel et al. [13] have shown that style information, specifically font size, is very useful for detecting titles in scientific PDF documents. The authors used a tool to extract formatting style information from a PDF file, such as font size and text position. They then applied a simple heuristic rule that selects the three largest font sizes on the first page and identifies the texts rendered in these sizes as titles. In fact, this method outperformed an SVM-based approach that uses only text, both in accuracy and in runtime. Moreover, this technique is independent of the text language because it only considers the font size. However, it depends on the font size and requires the existence of formatting information.
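The following sketch illustrates the font-size heuristic under the assumption that the pdfminer.six library is used for layout extraction; it is not the authors' exact tool. It collects the line texts set in the largest font sizes on the first page and treats them as title candidates.

```python
from collections import defaultdict

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

def guess_titles(pdf_path, n_sizes=3):
    """Return (font_size, text) pairs for the lines set in the n largest font
    sizes on the first page: a rough stand-in for the font-size title heuristic."""
    first_page = next(extract_pages(pdf_path))
    size_to_text = defaultdict(list)
    for element in first_page:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            chars = [obj for obj in line if isinstance(obj, LTChar)]
            if not chars:
                continue
            # Use the largest character size on the line as the line's font size.
            size = round(max(c.size for c in chars), 1)
            size_to_text[size].append(line.get_text().strip())
    largest = sorted(size_to_text, reverse=True)[:n_sizes]
    return [(size, text) for size in largest for text in size_to_text[size] if text]

# Example usage (hypothetical file name):
# for size, text in guess_titles("article.pdf"):
#     print(size, text)
```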

The approach of Edinger et al. [37] is to identify sections in medical documents and use them in queries for information retrieval. To do this, the authors prepared a list of variations of all section titles for each document type. Variations in terminology, punctuation and spelling were selected to identify the most common section titles using a set of documents of each type. To identify the titles in a document, a simple exact search finds them through their variations. The headings are then annotated and the document text is segmented according to these headings. As an advantage, using sections in the query instead of searching the whole document increased the accuracy of the search, so this method can avoid retrieving irrelevant documents. However, it has a lower recall, whereas the other method can retrieve more relevant documents. In addition, the exact search proved accurate enough for section detection.

Lupşe and Stoicu-Tivadar [91] have proposed a method that supports prescribing by extracting and structuring information from medical records. The principle of this approach is to detect the sections of the text and unify their titles using regular expressions and a set of section titles. It then removes stop words and applies a stemming algorithm to the sections to reduce words to their roots without touching medical terms. Thus, this method can suggest drugs that match the patient's disease, are not contraindicated and do not conflict with the patient's other diseases, treatments or allergies. Indeed, this approach reduces medical errors in drug prescriptions and structures the necessary drug information. However, some medical terms are still modified by stemming. Also, this method is tested only on the Romanian language.

Zhang et al. [158] have tried to effectively use the temporal information in the text of electronic medical documents to structure them, help medical researchers examine clinical knowledge and facilitate computer-aided analysis. The method relies on rules to perform a few successive steps: correcting pronunciation errors, dividing texts according to grammatical rules, describing medical facts and events, and finally processing temporal expressions. However, these texts contain little temporal information. In addition, the method gives the same weight to different parts of speech.

5.3.2 Machine learning-based methods

This approach uses learning models, the most popular being CRF, SVM and Viterbi-based models. These methods also require training and testing datasets. These data can be created manually, with rules, with an automated method, or with a combination of active learning and distant supervision. The sizes of the datasets used in the analyzed methods range from 60 to 25842 texts. There are also methods that use the same data for training and evaluation through cross-validation. This approach can also handle different types of features. Syntactic features represent the structure of the sentence, grammatical roles and PoS. Semantic features represent the meaning of words and terms in a sentence. Contextual features are relative or absolute features of a line or section in the text, such as the position of the section in the document and layout features. Lexical features are word-level features that can be extracted directly, e.g., the entire word, its prefix, suffix, capitalization, its type (e.g., number or stop word), among others. Indeed, the predictive value of the features has a great positive impact. These methods are not easily adaptable to other contexts. Variation in dataset size impacts performance and adaptability to new data. There is a lack of large datasets for training and evaluation, and the construction of training data is time consuming. Non-standard abbreviations in titles and inappropriate phrases in sections result in incorrect classifications. The use of very specific terms for subheadings degrades accuracy.

The method of Haug et al. [49] annotates each section in a medical document with its main concept. This approach is based on Tree Augmented Naive Bayesian Networks (TAN BN) to associate topics with sections as their semantic features. For this purpose, the method was trained on features generated by extracting N-grams from the text of section titles, in combination with the document type. In fact, identifying the section topic improves accuracy and avoids errors when extracting specific information. Thus, this task can reduce the natural language processing effort and prepares the document for more targeted IE. However, n-grams have limitations with complex and large documents. In addition, this Bayesian model does not consider the consistent sequencing of section topics.

The method of Deléger and Névéol [32] classifies each line of French clinical documents into high-level sections such as header, content and footer. A statistical CRF model is trained on information about each line, taking into account the first token of the surrounding lines, the first two tokens of the current line, whether the first token is in uppercase, the relative position of the line, the number of tokens, and the presence of preceding empty lines, digits and e-mail addresses. As an advantage, the performance is very high, especially for content and header lines. Headers and footers are very common in these documents and should be identified in order to focus on the core medical content. However, the granularity of the sections is very coarse, whereas there are more useful sections within the content that are not identified.
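To make this line-classification setup concrete, here is a minimal sketch with sklearn-crfsuite (an assumed library choice, not the authors' implementation) using a toy document and a few of the line features described above.

```python
import sklearn_crfsuite

def line_features(lines, i):
    """Features for one line, loosely following the description above (first tokens,
    casing, relative position, counts); the exact feature set is an assumption."""
    tokens = lines[i].split()
    feats = {
        "first_tok": tokens[0].lower() if tokens else "",
        "second_tok": tokens[1].lower() if len(tokens) > 1 else "",
        "first_tok_upper": bool(tokens) and tokens[0].isupper(),
        "n_tokens": len(tokens),
        "rel_position": round(i / max(len(lines) - 1, 1), 1),
        "prev_empty": i > 0 and not lines[i - 1].strip(),
        "has_digit": any(ch.isdigit() for ch in lines[i]),
        "has_email": "@" in lines[i],
    }
    if i > 0:
        prev_tokens = lines[i - 1].split()
        feats["prev_first_tok"] = prev_tokens[0].lower() if prev_tokens else ""
    return feats

def doc_to_features(lines):
    return [line_features(lines, i) for i in range(len(lines))]

# Toy training document: one label ("header", "content" or "footer") per line.
doc = ["UNIVERSITY HOSPITAL", "Patient admitted with dyspnea.",
       "Treated with diuretics.", "Contact: ward@example.org"]
labels = ["header", "content", "content", "footer"]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit([doc_to_features(doc)], [labels])
print(crf.predict([doc_to_features(doc)]))
```

In practice the model would be trained on many annotated documents rather than a single toy sequence, but the feature-dictionary-per-line structure stays the same.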

Lohr et al. [86] have trained a logistic regression model on manually annotated German clinical discharge summaries, short summaries and transfer letters to automatically identify sections, using BoW statistics of each sentence as features. As an advantage, this method achieves promising results in terms of f1-score. Furthermore, the authors chose a set of feasible and relevant categories for annotation. In addition, the sentence was chosen as the annotation unit because it has an appropriate granularity. However, the method does not perform well for categories that barely appear in the corpus.
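A bag-of-words sentence classifier of this kind can be sketched in a few lines with scikit-learn; the sentences, labels and category names below are purely illustrative and not taken from the cited corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training sentences and their section labels.
sentences = [
    "The patient was admitted with chest pain.",
    "Aspirin 100 mg once daily.",
    "Follow-up in two weeks.",
]
labels = ["anamnesis", "medication", "plan"]

# Bag-of-words counts per sentence feed a plain logistic regression classifier.
model = make_pipeline(CountVectorizer(lowercase=True, ngram_range=(1, 1)),
                      LogisticRegression(max_iter=1000))
model.fit(sentences, labels)
print(model.predict(["Ibuprofen 400 mg as needed."]))
```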

Lupşe and Stoicu-Tivadar [92] have proposed a method that homogenizes the sections of drug package inserts by standardizing the section names. First, the method collects all section names from all drug package inserts and prepares unique, common reference names that represent the different kinds of sections. Then, machine learning is used to find the appropriate reference for each section name. Through this method, access to drug information is improved for better processing. Moreover, the technique can be used in clinical decision applications to provide the necessary data to physicians, which helps especially young doctors or those starting a new specialty. The neural network achieves the highest results in extracting relevant information and outperforms cosine similarity according to the f1-score metric. Moreover, this model can be generalized to any language or domain. However, it is appropriate only for records where sections are defined by headings.

The method of Chirila et al. [25] supports drug prescription by structuring and categorizing the text into sections. For this, a machine learning model was trained to associate each part of the text with its appropriate section. According to the results, the accuracy of the CNN-based model is superior, especially with uniform section names. Moreover, this method was applied to the Romanian language, for which there is no dataset with fully structured information. However, the execution time of the CNN increases significantly compared with a model based on Naive Bayes classification. Moreover, the method is applied only to the Romanian language.

Goenaga et al. [42] have tested rule-based and machine learning-based methods for section identification on Spanish electronic discharge summaries and found that the machine learning-based method gives the best results. This method relies on transfer learning using the FLAIR model [3] and generates character embeddings for sequences of tokens, which are annotated by a BiLSTM-CRF model. Indeed, the rules obtain the lowest results, especially when an incorrectly marked section affects the surrounding sections, even with carefully designed rules, whose construction is a time-consuming process. In contrast, the FLAIR method can identify sections even with variations or where headings are absent, because it can learn from the headings and the vocabulary inside the sections. Also, training the method on data with more variability helps to maintain high efficiency on different types of data. However, the results degrade when the trained model is tested on different data, and the degradation may be drastic in some cases. Furthermore, errors can be caused by high variability combined with a lack of training data. In addition, implicit and mixed sections are the cause of several errors.
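A sketch of this kind of FLAIR-based BiLSTM-CRF tagger, assuming token-per-line training files with a "section" tag column, could look as follows; the file names, tag column and the Spanish FLAIR checkpoints are assumptions, not the authors' exact setup.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Token-per-line files with a BIO-style "section" tag column (hypothetical layout).
columns = {0: "text", 1: "section"}
corpus = ColumnCorpus("data/", columns,
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dictionary = corpus.make_label_dictionary(label_type="section")

# Stacked forward/backward character language-model embeddings (Spanish checkpoints here).
embeddings = StackedEmbeddings([FlairEmbeddings("es-forward"),
                                FlairEmbeddings("es-backward")])

# BiLSTM-CRF sequence tagger over the contextual character embeddings.
tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=tag_dictionary, tag_type="section", use_crf=True)

trainer = ModelTrainer(tagger, corpus)
trainer.train("models/section-tagger", max_epochs=10)
```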

Nair et al. [100] have proposed a method to classify the sentences of i2b2 2010 clinical notes into the major SOAP sections using a BiLSTM model with a fusion of GloVe, Cui2Vec and ClinicalBERT embeddings. As an advantage, the contextual embeddings and the transfer learning provide an efficient solution to this task. Also, the authors found that 500 sentences per section is a sufficient starting point to achieve high performance. However, they considered only 4 sections and ignored other sections and sub-sections. Moreover, they did not consider the context-sensitivity of clinical sentences.

5.3.3 Hybrid methods

Hybrid methods often use rules for the construction of training data. They tend to be weaker than rule-based methods in raw performance, but are more ambitious than them in dealing with different types of documents, and they generally perform better than pure machine learning methods.

Jancsary et al. [55] have trained a CRF to recognize (sub)sections in report dictations, giving lexical, syntactic-category, BoW, semantic-type and relative-position features for each word. The training data is constructed by aligning the corrected and formatted medical reports with the text from automatic speech recognition, while the annotations are generated by mapping (sub)headings to (sub)section labels using a regular heading grammar. As an advantage, this method can detect various structural elements even without explicit dictated clues. Furthermore, it can automatically assign meaningful types to (sub)sections even in the absence of headings. In addition, it is effective not only under ideal conditions but can also deal with the errors of real-life dictation. However, manual correction is required to fix the errors of the automatically generated annotations, which impact the segmentation results.

Apostolova et al. [7] have constructed a training set with hand-crafted rules to train an SVM that classifies each medical report sentence into a semantic section using multiple sentence features such as orthography, boundary, cosine vector distance to sections and exact header matching. As an advantage, a high-confidence training set is created automatically. Also, the classification of semantically related sections is significantly improved by the boundary and formatting features. Furthermore, the segmentation problem can be solved when NLP techniques are applied. Moreover, the SVM classifier outperforms a rule-based approach. However, it is hard to classify a section when its sentences are often interleaved with other sections.
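The general pattern, rules producing a silver-standard training set that then trains a statistical classifier, can be sketched as follows. The labeling rules, section labels and the choice of a linear SVM over TF-IDF features are illustrative assumptions rather than the authors' feature set.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hand-crafted labeling rules (illustrative): sentences following an exactly
# matched header inherit that header's section label.
HEADER_RULES = {
    "medication": re.compile(r"^\s*(medications?|current medications)\s*:", re.I),
    "allergy": re.compile(r"^\s*allerg(y|ies)\s*:", re.I),
}

def weakly_label(lines):
    """Build a silver-standard training set by propagating rule-matched headers."""
    current, labeled = None, []
    for line in lines:
        for label, pattern in HEADER_RULES.items():
            if pattern.match(line):
                current = label
                break
        else:
            if current and line.strip():
                labeled.append((line.strip(), current))
    return labeled

report = ["Medications:", "Aspirin 81 mg daily.", "Allergies:", "Penicillin causes rash."]
texts, labels = zip(*weakly_label(report))

# The automatically labeled sentences then train a supervised SVM classifier,
# which can generalize to sentences whose section header is missing or varied.
svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(list(texts), list(labels))
print(svm.predict(["Ibuprofen 400 mg as needed."]))
```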

Ni et al. [104] have classified medical document sections into pre-defined section types. The authors applied two advanced machine learning techniques, one based on supervised learning and the other on unsupervised learning. For the supervised technique, a heuristic model pre-trained on previously annotated documents is used to select a number of new candidate documents that are then annotated by people and used for learning. For the unsupervised technique, a mapping method finds and annotates sequences of words that represent section titles with their corresponding section types using a knowledge base. A maximum entropy Markov model was used for section classification; this model is faster to train and allows richer features. In fact, the techniques used can reduce the annotation cost and allow a quick adaptation to new documents for section classification. In addition, both techniques can achieve high accuracy. However, the supervised technique requires a higher annotation cost than the other technique. In addition, the performance of the unsupervised technique is highly dependent on the quality of the knowledge base.

Dai et al. [29] have proposed a token-based sequential labeling method with a CRF model for section heading recognition, using a set of word features such as affix, orthographic, lexicon, semantic and especially layout features. To construct the training data, they employed section heading strings from a terminology to make candidate annotations; three experts then manually corrected the annotations of the topmost section headings. As an advantage, this was the first work to treat section detection as a token-based sequential labeling task, and it outperforms sentence-based formulations and dictionary-based approaches. The method is an integrated solution that avoids the development of heuristic rules to separate headings from content. Also, the layout features improve the results and allow the recognition of section headings that do not appear in the training set. However, it is difficult to recognize rare or nonstandard topmost section headings. In addition, subsections are not taken into consideration. Furthermore, some section headings are not the topmost in some records. Also, the absence of layout information can decrease recall.

Sadoughi et al. [117] have applied section detection to clinical dictations in real time. They used automatic speech recognition to transform the speech into plain text. A unidirectional LSTM model, which tracks short- and long-term dependencies, is then run on the text to annotate its section boundaries, using Word2vec vectors to represent the input words. To build the training data, regular expressions were applied to a set of reports to annotate the headings. Each time a section is completed, a post-processing task transforms its text into a written report. As an advantage, post-processing becomes faster by handling a complete section at a time instead of re-executing after each dictated word. Thus, the post-processing of the previous section happens in parallel with the dictation of the current section without disturbing the user. Moreover, the post-processor can benefit from the full context of the section during the transformation. Real-time section detection thus ensures that the medical report is directly usable by other processes after dictation. However, the detection of a section depends only on the words dictated so far, without seeing the whole document, which prevents it from exploiting all the information in the document to provide a better quality result.

6 Methods dealing with issues related to the nature of entities

In this section, we highlight some solutions proposed by state-of-the-art methods to solve problems related to the nature of medical named entities. We cite and classify some methods into four categories based on the main problems: ambiguity, boundary detection, name variation and composed entities. Table 1 shows some of the methods used to deal with these various problems.

Table 1 Some methods that address the issues related to the nature of named entities. The abbreviation "NI" in this table means Not Included

6.1 Ambiguity

Ambiguity arises when a medical named entity can belong to more than one class depending on the context. Thus, for some entities, we should explore the preceding words and the position of the entity in the text in order to determine its meaning. This is generally the most important problem in entity recognition, and most studies focus on how to provide more context to recognize the entity. As a special case, abbreviations are particularly likely to be ambiguous; it is hard to determine the expansion of an abbreviation with a very small number of characters, and an abbreviation may have different meanings depending on the context. As common solutions, works try to enrich the context information with word or character embeddings, knowledge bases and the word position in the text (section, surrounding words, PoS, etc.), and try to capture contextual dependencies and relations. Lei et al. [76] have reached higher performance by merging word segmentation and section information. Ghiasvand and Kate [41] have benefited from UMLS, which provides many entity terms declared as unambiguous; they perform exact matching of these terms to annotate a maximum number of unambiguous entities, partially solving the ambiguity. Xu et al. [150] have exploited more context to solve the ambiguity problem by using category representations from Word2vec, PoS, dependency relations and semantic correlation knowledge; however, the method may miss some medical entities in the non-medical term filtering step. The method of Zhou et al. [41] trains a classifier on the medical terms found in UMLS to learn how to expand word boundaries; the classifier is applied to all noun phrases in which the detected entity occurs in order to select the entity with the highest score. However, automatically obtaining noun phrases can introduce mistakes, and sometimes named entities are not noun phrases. In order to avoid incorrect identification of entity boundaries, Deng et al. [33] have ensured the integrity and accuracy of the named entity through bidirectional storage of text information, and the IOB labeling scheme is introduced. They also used character-level embeddings, which can avoid poor segmentation. However, the phenomenon of nested entities leads to unclear boundary definitions and results in poor accuracy. Zhao et al. [161] have extracted all noun phrases from each sentence as candidate entity mentions based on a set of PoS patterns. The method of Li et al. [150] can handle nested entities by identifying entity candidates based on the dependency relationships between words; thus, native medical noun phrases, such as single nouns and maximum noun phrases, are extracted. The method of Ghiasvand and Kate [41] is able to detect nested entities by obtaining all noun phrases, including nested ones, using full parsing. Li et al. [126] have proposed a cost-effective and efficient trigger-based graph neural network that casts the problem into a graph node classification task.
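For instance, exact matching of a list of unambiguous terms can be sketched with spaCy's PhraseMatcher; the term lists below are illustrative stand-ins for the unambiguous UMLS terms used in the cited work.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Hypothetical lists of unambiguous terms; in the cited work they come from UMLS.
UNAMBIGUOUS_TERMS = {
    "DISEASE": ["myocardial infarction", "type 2 diabetes mellitus"],
    "TEST": ["complete blood count", "chest x-ray"],
}

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive exact matching
for label, terms in UNAMBIGUOUS_TERMS.items():
    matcher.add(label, [nlp.make_doc(t) for t in terms])

doc = nlp("The patient with type 2 diabetes mellitus had a chest X-ray.")
for match_id, start, end in matcher(doc):
    # Print the entity class and the matched span.
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
```

Such a pass yields high-precision annotations for unambiguous mentions; the remaining, ambiguous mentions are then left to context-aware models.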

7 Knowledge discovery methods

In this section, we cite examples of studies that apply text mining methods to discover new knowledge in the medical field. This knowledge can be used to make medical decisions or to gather useful information and conclusions for some medical tasks. Indeed, knowledge discovery is the final and most important step in medical data processing. It consists of discovering the needed knowledge from the extracted information, which directly supports the medical staff.

The method of Sudeshna et al. [125] is able to predict the probability of having heart disease by using the supervised SVM algorithm to classify the data. This method takes particular symptoms given by the patient together with the patient's health record. Based on these data, it can suggest diseases, treatments, medications and dietary habits to the doctor, who can confirm them and send the information to the patient. As an advantage, this system is reliable, easy to understand and adaptive. Also, the disease is automatically analyzed more efficiently and is easily identified. In addition, this method can identify the best medication for the disease.

Aich et al. [2] have proposed a text mining method to analyze abstracts of articles related to Parkinson's disease in order to automatically find the relationship between walking and Parkinson's disease. The method performs simple pre-processing of the text, then provides a graphical visualization of word frequencies through a word cloud, and categorizes similar terms using hierarchical clustering and K-means based on a distance calculation between words. This approach has great potential to classify texts into different groups. However, it could be improved by providing more articles to analyze or by exploring other clustering approaches.

The review of Al-Dafas et al. [4] has shown that applying data mining algorithms to patients' health data can help doctors make the right decision at the right time. These techniques can detect cancer early without surgical intervention. As an advantage, they can reduce treatment costs and medical errors in diagnosing the disease. The authors suggest adopting machine learning-based data mining in the medical diagnosis process.

Table 2 Benchmark datasets used in IE.

8 Benchmark datasets and supplementary resources

In this section, we cite the most important benchmark datasets used in the IE task, which are manually annotated and considered gold standards. Most of these datasets are used in shared tasks and serve to train and evaluate state-of-the-art IE methods. Table 2 shows some details about these datasets. Generally, the annotation focuses on information related to disease, medication and chemical entities. Many datasets are constructed from PubMed articles, especially from the abstracts, because they are easier to collect. Other datasets are obtained from discharge summaries and clinical reports, which are de-identified to hide personal information. However, the available datasets are in textual form, while the medical documents are originally available in PDF form; thus, useful information such as the formatting style, which is important to define the document structure, is not available. Indeed, there are some recent datasets (Cohen et al. [28], Islamaj et al. [53]) obtained from full-text articles, which can provide richer information than just an abstract.

Also, many methods use supplementary resources such as dictionaries and ontologies. The most popular resources used in IE are the UMLS [19] Metathesaurus and the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) [36] ontology. The UMLS Metathesaurus, in its 2021 update, covers 16543671 terms in different languages (11755677 in English), which are associated with 4536653 medical concepts. This large biomedical thesaurus is collected from different sources such as SNOMED-CT, MEDLINE, MeSH, NCBI, etc. It also provides 127 semantic types for the concepts, such as Disease/Syndrome, Clinical drug, Therapeutic or preventive procedure, Laboratory or test result, etc. Furthermore, it provides 54 semantic relations between these semantic types, such as Process of, Result of, Property of, Part of, Associated with, Complicates, Affects, Causes, etc.

The SNOMED-CT ontology covers more than 1314668 clinical terms with their descriptions, associated with different concepts. There are 19 top-level concepts: body structure, finding, event, environment/location, linkage concept, observable entity, organism, product, physical force, physical object, procedure, qualifier value, record artifact, situation, social context, special concept, specimen, staging scale and substance. In addition, it provides more than 3092749 relation mentions of 122 types of relations such as "is a", "finding site", "associated morphology", "method", "causative agent", etc.

Table 3 Rule-based NER methods
Table 4 Dictionary-based NER methods
Table 5 Machine learning-based NER methods
Table 6 Machine learning-based RE methods
Table 7 Rule-based RE methods

9 Experimental analysis

In this section, we collect the results of some methods for each IE task, namely NER, RE and section detection. We organize these methods according to the technique, dataset and features used, and make an experimental analysis based on their available results. Looking at the results in general, we find that most methods use machine learning techniques. For NER, the best result is achieved by a machine learning-based method, while a rule-based method obtains the best result for RE and a hybrid method is the best for section detection.

Named Entity Recognition: Tables 3, 4 and 5 show the results of rule-based, dictionary-based and machine learning-based named entity recognition methods, respectively. The A, P, R and F symbols refer to Accuracy, Precision, Recall and f1-score, respectively. The best f1-score (96.05%) was obtained by the method of Popovski et al. [107], which uses rules based on computational linguistics and semantic information describing food concepts in order to annotate entities in 1000 food recipes. This method aims to extract food entities, which can be useful in the medical field, for example for food safety. For diseases, symptoms and some other named entity types, the method of Deng et al. [33], based on machine learning and tested on 20% of the characters of 1600 TCM patents' abstracts, obtains the best result in terms of f1-score (94.48%). This method is based on fine-tuning character embeddings with a BiLSTM-CRF model. For the i2b2 2010 benchmark dataset, a machine learning-based model obtains the best result in terms of f1-score (87.45%), using an LSTM-CRF model with contextualized and GloVe embeddings. Thus, machine learning-based models using context-specific embeddings on top of LSTM and CRF can achieve better results. In fact, the unsupervised models [41, 159], predictably, obtain the worst results, but they are still acceptable (51.06% and 59.43% f1-score) and worth considering to avoid manual effort.
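Since these comparisons hinge on entity-level f1-scores, the following snippet shows how such scores are typically computed from BIO label sequences with the seqeval package; the label sequences are a toy example, and exact-match evaluation counts an entity as correct only if both its boundary and its type agree with the gold standard.

```python
from seqeval.metrics import classification_report, f1_score

# Gold and predicted BIO label sequences for two toy sentences.
y_true = [["B-problem", "I-problem", "O", "B-test"], ["O", "B-treatment", "O"]]
y_pred = [["B-problem", "I-problem", "O", "O"],       ["O", "B-treatment", "O"]]

print(f1_score(y_true, y_pred))            # entity-level (exact-match) F1
print(classification_report(y_true, y_pred))
```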

Relation Extraction: Tables 6 and 7 show the results of machine learning-based and rule-based RE methods, respectively. The P, R and F symbols refer to Precision, Recall and f1-score, respectively. The best method [154] obtains the best f1-score (96.06%); it uses rules to generate candidates and classifies them with a clinical BERT model. It was trained and tested on the n2c2 2018 benchmark dataset, and is thus the best method applied on this dataset, which is annotated with relations linking concepts to their medication. In fact, this method can be considered a hybrid method since it combines rules with machine learning. The best purely machine learning-based methods [95, 146] achieved a 94% f1-score and were trained and evaluated on the n2c2 2018 benchmark dataset to link concepts to their medication. These methods used a fine-tuned BERT as embedding. However, BERT embeddings are suitable neither for the i2b2 2010 dataset, which is annotated with relations between medical problem, treatment and test entities, nor for the n2c2 2019 dataset, which is annotated with relations between family members, observations and living status: the methods [146, 153] applied to these datasets with fine-tuned BERT have almost the worst results. Indeed, there is a machine learning-based method [119] applied to the text of tweets collected from the Twitter social media platform. As expected, it obtains the worst result in terms of f1-score (56.7%), since applying medical IE to social media text is a very difficult task. This type of text is less structured than professional documents and articles, as the writers of tweets are often not experts in medicine and do not pay attention to correct writing. Thus, obtaining 54% precision, 63% recall and 56.7% f1-score on this type of text is an appropriate result.
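To give a flavour of how such a BERT-based relation classifier is wired up, here is a minimal sketch using the Hugging Face transformers library; the checkpoint name, entity markers and relation label set are assumptions, and the classification head shown here is untrained, so fine-tuning on an annotated corpus such as n2c2 2018 would still be required.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "emilyalsentzer/Bio_ClinicalBERT"   # example clinical checkpoint
RELATIONS = ["Strength-Drug", "Dosage-Drug", "Reason-Drug", "No-Relation"]

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=len(RELATIONS))

# The candidate pair is marked inline so the model sees which two entities to relate.
sentence = "Start [E1] metoprolol [/E1] [E2] 25 mg [/E2] twice daily."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
print(RELATIONS[int(logits.argmax(dim=-1))])  # meaningless until the head is fine-tuned
```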

Section Detection: Tables 8, 9 and 10 show the results of rule-based, machine learning-based and hybrid section detection methods, respectively. The A, P, R and F symbols refer to Accuracy, Precision, Recall and f1-score, respectively. In fact, the section identification methods are not well experimented, as this task needs a more standardized evaluation methodology, such as a benchmark dataset. Moreover, some methods use different segmentation and granularity levels, which makes comparing the state of the art complicated. According to the available results, the best method by the accuracy metric is a rule-based method [91] that uses a simple regular expression to match titles from a set of section titles and is tested on 3 datasets. The maximum accuracy (90%) was achieved on 1630 prospectuses. This method outperforms a machine learning-based method in terms of accuracy, and this is the only available accuracy result we have. However, when a larger dataset is used the result may decrease significantly, and it may differ from one dataset to another because the regular expressions may be inappropriate for other types of documents. For example, the method applied to 3814 prospectuses obtains 88% accuracy, while the same method applied to 3002 prospectuses from another source obtains 66% accuracy. The best method in terms of f1-score (94.19%) is a hybrid method [29] that uses a terminology to perform a semi-supervised annotation of the training set and trains a CRF model on layout, affix, orthographic, lexicon and semantic features for sentence tagging. This method is trained and evaluated on a dataset that is already considered a benchmark for a different IE task and contains patient records such as discharge summaries, procedural notes, etc. It also has the best f1-score among the methods applied to discharge summaries. In second place, a machine learning-based method achieved a 93.03% f1-score using BiLSTM-CRF with fine-tuned FLAIR character embeddings on hospital discharge reports. Thus, the use of a CRF model can lead to the best results in the section detection task.

Table 8 Rule-based section detection methods
Table 9 Machine learning-based section detection methods
Table 10 Hybrid section detection methods

10 Literature review

In recent years, machine learning methods have been well studied in IE, with many models tested on many types of features. CRF is generally the most popular model; it is a sequence tagging model, and the problems are mostly defined as token labeling, especially in the named entity recognition task. Recent works have shown that the combination of BiLSTM and CRF gives better results [33, 119, 161], because BiLSTM can consider the order in both directions, which makes it able to understand a sequence of features well, and CRF can then perform accurate labeling using the features provided by BiLSTM.

However, manual effort is one of the most important challenges in this field. Many methods are aimed at finding automated or semi-automated techniques to obtain high-quality annotated datasets due to the lack of training data [41, 65, 161]. In addition, there are different fields and different types of data, and the medical field is evolving day after day, which means that a single annotated training set is not sufficient for permanent use. Manually annotating a dataset requires much effort and is time consuming, which is why this step needs to be automated to enable models to easily adapt to new types of data. Compared to rules, a machine learning model can learn implicit and deep patterns using automatically generated features, which makes it possible to explore and predict the best output in many new cases, whereas rules usually only deal with pre-defined cases, although they can smoothly exploit context and knowledge. In fact, rules are more precise when they can be exactly matched on the data, but they cannot discover new knowledge beyond what they define, and they fail in many cases, especially for variable data.

Indeed, some methods try to automate the construction of rules. An important recent study [161] uses a graph propagation method to find new rules given a few seeding manual rules that are easily constructed by experts; thus, even rules can be automatically generated. In addition, graph-based techniques can be exploited, since they are able to treat the relationships inside the data. Synonym and antonym relations, contextual similarity and many other types of relations between words and phrases are very important and can be well exploited using graph techniques. Social media is also a very critical area for this task, where the data is noisier and contains many grammatical mistakes and, especially, false information by which people can be influenced. Some recent works target social media [61, 119]; for example, Shi et al. [119] have used machine learning models with the help of a Concern Graph to apply entity recognition to pandemic concern entities on Twitter. Recently, social media have played an important role during the COVID-19 pandemic, and it is crucial to automatically understand and monitor many people's interactions. Indeed, social media is a rich source of information that mostly contains unstructured and noisy textual data as well as other multimedia data. Thus, some studies extract information from these data in order to perform tasks such as associating tags to posts [71], identifying relevant information [94], etc. It is worth noting that the graph of users' relations, such as following relations, can also be exploited to enhance medical IE. Thus, some important tasks such as community detection and influence identification [45, 56, 63, 99] can be combined with medical IE tasks in social media.

Concerning supplementary resources, it would be good to construct a large resource that tries to cover everything, but this is not possible. Some studies try to automatically update these resources in an online phase, for example by using a search engine [150]. Indeed, the most used additional resources are UMLS [19] and SNOMED-CT [36], which cover many concepts, languages, semantic types and much information about terms, making it possible to annotate raw text or extract more information using matching techniques. Some studies have tried to find a smooth way to use the information inside a dictionary in order to overcome its limited coverage. For example, new relationships can be learned from the synonym relationships provided by the SNOMED-CT ontology using a CNN model [8]. Also, a gazetteer tagger can be built with the power of machine learning, which learns features from a dictionary to provide an output that improves another machine learning model [127]. Indeed, the relation between machine learning and dictionary-based methods is well studied, since using machine learning can make the use of a dictionary more robust and overcome the limits of the resource. Thus, these studies try to find a better way to exploit the knowledge of ontologies. Furthermore, the knowledge inside the dictionary can be used as an additional feature for machine learning models [102], giving useful information about the context.

Concerning features, most recent works focus on distributed representations of words as well as characters, such as BERT and Word2vec, which can extract context information well for machine learning methods. Furthermore, the models that generate these embeddings can be used in different manners: they can be pre-trained on other sources, trained during the whole model training, or fine-tuned. In particular, pre-trained and fine-tuned language models such as BERT have achieved state-of-the-art performance on many natural language processing tasks. For example, the work of Yang et al. [154] showed that clinically pre-trained transformers achieve better performance for RE. In addition, other types of features can be beneficial, such as knowledge and syntactic features. Indeed, some information is not well exploited although it has shown good potential to improve the IE task. Generally, this task can be improved by giving more context. The study of Lei et al. [76] proved that section information is able to improve the entity recognition task, but it is not well exploited in this field. In addition, Tran and Kavuluru [136] have used sub-headings to improve RE. Furthermore, the formatting style of a document can be very important, as it proved its ability to detect section titles in the study of Beel et al. [13] using only font size information. A medical document is generally created in PDF format, which provides more useful information than raw text. However, all available benchmark datasets are provided only as annotated raw text. Furthermore, most datasets that contain medical articles are available only with annotated titles and abstracts. Recently, some researchers [53, 65] have been trying to construct new annotated datasets, especially for NER, with full-text articles, which frequently contain more detailed information. These articles are usually collected from PMC [112].
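As a small illustration of using a pre-trained model as a frozen source of contextual features (as opposed to fine-tuning it end to end), the following snippet extracts one vector per sub-word token; the generic bert-base-uncased checkpoint is just a placeholder for a domain-specific model.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any pre-trained (ideally domain-specific) checkpoint could be substituted here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The patient denies chest pain."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One contextual vector per sub-word token, usable as features for a downstream tagger.
token_embeddings = outputs.last_hidden_state.squeeze(0)   # shape: (n_tokens, 768)
print(token_embeddings.shape)
```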

Concerning named entities, most works focus on the disease entity type; most datasets [35, 79, 108] and the widely used meta-thesaurus [19] provide annotations and information about this type of entity. Thus, extracting information about diseases from medical texts is a very important task in the medical field. Indeed, the named entity boundary problem is another challenge [33, 41, 150]; solving it leads to better results, especially according to the exact matching metric. However, this metric always gives a much lower score than the partial matching metric. Generally, this problem is related to the noun phrase chunking adopted by most works.

Many studies on medical IE are devoted to the Chinese language. Although the English language is more suitable for this task, Chinese researchers are trying to improve it for their language, as they are very interested in the evolution of the medical field in general. However, Chinese medical text is more difficult to process than English, especially regarding segmentation, since Chinese has complicated syntax rules, and there is also a lack of Chinese data. Hence, many studies address these language problems [33, 76, 150]. Segmentation is generally used to provide samples and extract features from them; it can be performed at the word, phrase, sentence or section level. Indeed, Deng et al. [33] found that building features for sequences of characters is more suitable, especially for the Chinese language.

11 Conclusion and Future Research Directions

IE in the medical field is very interesting, especially for finding information about diseases. Generally, this task provides more knowledge, supports people in finding relevant information and helps doctors make the best decision, for example to choose the right treatment, make the appropriate drug prescription or discover the causes and effects of some diseases. The large amount of unstructured medical textual information is overwhelming to analyze manually, yet it is considered a great treasure of information.

In our survey, we conclude that rule-based and hybrid methods are generally the most promising techniques for IE, where they have shown the best results. However, rules depend highly on a specific domain; it is thus difficult to adapt these methods to new types of data, since manual effort by domain experts is needed. Generating rules dynamically to adapt to the data type is therefore a promising direction. Most methods focus on the combination of CRF and BiLSTM models, which is very beneficial for sequence-tagging tasks. However, the CRF model is usually used for flat NER, which is not appropriate for nested and discontinuous named entities. Domain-specific embeddings, especially those provided by BERT models, are used by many methods and give better results. There is a lack of medical data, since these data involve more privacy concerns compared to other fields. Manual annotation is also a big problem for machine learning, and many recent methods aim to automate it. Thus, it is a challenging issue to provide high-quality data for training, as well as other supplementary data that can cover new medical terms, several types of data, all possible cases, multiple languages, different types of labels, etc. Also, all datasets are available in textual format, whereas the formatting style is very important and should be exploited more: it adds very useful information to the text, which is especially helpful to understand the structure of the document and even the meaning of words.

Indeed, section detection is a challenging issue that has shown a positive impact on the performance of several IE tasks. Mainly, the position of a concept in a document can provide more contextual information. However, this task is not well covered by research works and methods; benchmark datasets should therefore be constructed for it. Another issue is combining rule-based, dictionary-based and deep learning approaches so as to benefit from all of them in one hybrid method. Many ideas have been proposed and can be exploited further. Indeed, we can use rules to prepare data for machine learning, or use machine learning to generate rules. Besides, we can improve dictionary matching with machine learning, or annotate a large amount of data with a dictionary to train a machine learning model. Also, we can use rules, by constructing regular expressions, to perform dictionary matching, or use a dictionary as a supplementary resource to support the rules. In addition, we can build features using dictionary- and rule-based techniques.

Some important directions in IE from medical documents are not well covered by our analysis. The research themes in these directions can be further developed since they have provided promising results. Indeed, the difficulty of handling multi-language documents is an interesting issue, which makes it hard to adapt a model to data in different languages. Hence, it would be good if we could benefit from data in multiple languages. For example, we can use transfer learning in order to benefit from the knowledge learned by a pre-trained model; consequently, the model will be able to adapt this knowledge to another language rather than being trained from scratch on one language. Thus, we can combine the knowledge learned from more than one dataset in different languages. Generally, adapting a method to different languages can solve the lack of data and benefit from the knowledge gathered from different data sources. In addition, while most research targets English and some targets Chinese, the other languages have a poor chance of being well treated. Therefore, unifying multiple languages in one method would be a really important achievement.

In fact, document summarization is another major and very challenging issue. This task can depend on NER, as it consists in producing a text summary that contains only the most important information. Hence, we can easily recognize a relevant and very reduced readable part of the text in a document instead of reading the whole text, and the useless parts of the text can be eliminated even for other IE tasks. Due to the diversity and the growing quantity of medical information, people need to quickly assimilate and determine the content of a medical document; document summarization helps them quickly determine its main points. However, this research field has not yet reached maturity, and a variety of challenges still needs to be overcome, such as handling large-scale data, providing sufficient annotated data, etc.

Another important issue and possible research direction, which has become apparent especially during the COVID-19 pandemic, is the analysis of medical information propagated on social media. The content on social media is very different from documents, especially when the information is written by ordinary users rather than medical experts. Thus, natural language processing becomes much more difficult, since we find unstructured texts and many typos. As well, we can easily find a large number of users influenced by false medical information, which represents a critical problem. Likewise, the social media environment is richer in useful information than a single document; we can benefit from the reactions to a post, the owner's profile, contacts, etc. Thus, IE on social media is even more challenging than in medical documents and is needed to detect the spread of false information and to understand people's interactions with medical information.