Introduction

With COVID-19 sweeping across the world, the pandemic has rapidly accelerated the pace of scientific publication [1, 2]. As approximately 10,000 new articles on COVID-19 and SARS-CoV-2 are published every month [3], the ability to accurately extract the crucial semantic topics from the large and rapidly growing COVID-19 literature has become of great importance to many biomedical applications [4,5,6,7].

In recent decades, curators at the National Library of Medicine (NLM) have been employing Medical Subject Headings (MeSH) to manually identify and curate semantic topics for scientific articles [8,9,10], a process also known as semantic indexing. However, manually curating such a substantial volume of biomedical articles is non-trivial and relies heavily on intensive labour and considerable investment. In this scenario, experts have to examine the full body of each biomedical article and manually assign it a series of suitable pre-defined semantic topic terms from the large vocabulary of MeSH headings. Although this manual topic assignment is relatively reliable, it is inevitably time-consuming and prohibitively expensive [11,12,13]. In addition, because new COVID-19 hotspots keep emerging, such manual topic curation is much more difficult to keep up to date. Moreover, the lack of a pertinent biomedical taxonomy further increases the challenges of topic curation for COVID-19. Hence, there is an urgent need for automatic semantic indexing techniques that are able to efficiently and robustly identify biomedical topics in a newly emerged topical field, such as the COVID-19 domain. Figure 1 shows an example that illustrates the challenges of the semantic indexing task for the COVID-19 domain. In the figure, the article (PMID: 32373993) has already been curated and indexed by MEDLINE experts with nine different MeSH semantic topics.

Fig. 1 An example of MeSH semantic indexing (taken from PubMed)

From a machine learning perspective, automatic semantic topic indexing with MeSH terminologies is considered a large-scale multi-label topic identification problem. Despite the promising results from early efforts [14,15,16,17], there is still a significant gap between such automatic methods and their application to effective searching and querying in the COVID-19 domain. On the one hand, there is no specialized biomedical taxonomy for COVID-19, as traditional MeSH indexing research concentrates on general scientific domains. Even worse, with tens of thousands of topic terms in the large-scale vocabulary of MeSH headings, the ground-truth semantic topics almost inevitably follow an extremely imbalanced label distribution [17]. On the other hand, there is also a severe lack of benchmark datasets for COVID-19 semantic indexing research. At present, fighting the COVID-19 pandemic poses an extreme scenario that highlights the importance of automated semantic indexing techniques, as professionals and practitioners urgently require a well-structured knowledge base to acquire new insights from recent coronavirus findings [18,19,20]. However, the lack of such a standard dataset drastically limits the development of topic identification techniques for the COVID-19 domain. Therefore, constructing a universal dataset for COVID-19 semantic indexing is of great importance.

In light of these concerns, this article is devoted to the topic identification problem of COVID-19 semantic indexing. Theoretically, COVID-19 semantic indexing can be conceptualized as a typical case of labeling texts from heterogeneous sources with a range of centralized topics. This kind of semantic labeling is crucial for an emerging thematic area. Typically, neither a consensus domain taxonomy nor sufficient annotated training data is available in such emerging topical areas. In addition, such an emerging domain also lacks a conventional venue for publications, so related publications are likely to be scattered across neighboring fields. In this regard, we first introduce a new COVID-19 Semantic Indexing (CovSI) corpus constructed from a wide range of COVID-19 related biomedical articles, which addresses the absence of data in such an emerging domain. We then propose a novel deep neural network adopting a multi-probe attention mechanism to address the challenges of semantic indexing from heterogeneous data for the specific field, i.e., COVID-19. Since there is no specialized topic taxonomy for COVID-19 so far, the classic and widely used MeSH controlled vocabulary is employed for the study. To construct the CovSI corpus, we extract the metadata from multiple authoritative resources, including MEDLINE [12], PubMed Central (PMC) [21], and the COVID-19 Open Research Dataset (CORD-19) [1]. All extracted metadata is then merged to build the CovSI corpus. On top of the CovSI corpus, we propose a novel semantic indexing framework based on a multi-probe attention neural network (MPANN) to address the fundamental problem of semantic indexing for the emerging domain of COVID-19. 
The proposed method begins by ranking all MeSH topic terms for each article through a k-nearest neighbor (KNN) based masking approach, which selects the most relevant candidate topics and significantly reduces the complexity of the MeSH controlled vocabulary without any prior knowledge of the domain. It then represents multiple context-aware inputs for potential biomedical clues with a transformer encoder and subsequently feeds the encoded representations to the downstream attention-based neural network for further feature extraction. Specifically, four different semantic probes, namely Context Probe, Candidate Term Probe, Journal Probe, and Dynamic Topic Probe, are exploited during the feature extraction phase in order to address the heterogeneous nature of the data sources. The basic idea of these probes is that context-aware textual information carries meaningful biomedical background knowledge from different semantic aspects, which provides informative features to discriminate topics for the input article. For instance, COVID-19 related literature is likely to use the conceptual terminology of Coronavirus and SARS-CoV-2, which are suggestive indicators for topic selection. In this view, associating the expressive contexts with the sieved candidate topic terms can help the MPANN model pay more attention to the possible target topics during classification. Moreover, given the wide variety of publication sources, the model can attend directly to the journals that are most likely related to a specific topic, such as journals on respiratory diseases for COVID-19. After extracting the feature representations at both term level and document level, MPANN adopts a linear multi-view classifier to conduct the final MeSH recommendation. 
To improve the overall performance, the proposed method is pre-trained using a large number of MEDLINE articles to learn the general biomedical representation, and further fine-tuned on the CovSI dataset to better obtain COVID-19 related knowledge.
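The KNN-based masking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words cosine similarity stands in for whatever document representation MPANN actually uses, and the function names, the toy articles, and the parameters `k` and `top_n` are all hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased token counts; an assumed stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_candidate_terms(query_text, indexed_articles, k=2, top_n=5):
    """Rank MeSH terms by similarity-weighted votes from the k nearest articles.

    indexed_articles is a list of (text, mesh_terms) pairs that already
    carry curated MeSH annotations.
    """
    query = bag_of_words(query_text)
    scored = sorted(
        ((cosine(query, bag_of_words(text)), terms)
         for text, terms in indexed_articles),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, terms in scored[:k]:
        for term in terms:
            votes[term] += similarity
    # Only terms voted for by a neighbor survive the mask.
    return [term for term, _ in votes.most_common(top_n)]


# Toy, made-up articles for illustration only.
indexed = [
    ("sars-cov-2 infection in hospital patients",
     ["COVID-19", "SARS-CoV-2", "Humans"]),
    ("coronavirus pneumonia imaging in patients",
     ["COVID-19", "Pneumonia, Viral", "Humans"]),
    ("diabetes insulin therapy trial",
     ["Diabetes Mellitus", "Insulin"]),
]
candidates = knn_candidate_terms("sars-cov-2 pneumonia in patients", indexed, k=2)
```

In this toy run the unrelated diabetes article is not among the two nearest neighbors, so its terms are masked out and only the terms of the two coronavirus articles remain as candidates.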

Our primary goal is to construct a publicly available dataset for COVID-19 semantic indexing research and to develop a versatile machine learning approach with robustness and generalizability, which can be easily applied to COVID-19 and robustly scaled up to other biomedical domains, especially newly emerging topics. Experimental results on the dataset show the merit and effectiveness of our proposed approach in the specific domain of COVID-19. The main contributions of this work are summarized as follows:

  (a) We construct a pertinent and comprehensive corpus targeting COVID-19 semantic indexing research. We believe such a corpus could greatly benefit related work on COVID-19 and foster the development of biomedical text mining technologies.

  (b) We propose a novel semantic indexing approach that is able to effectively scale up to the COVID-19 domain. Our study demonstrates that the proposed method outperforms the current state-of-the-art methods.

  (c) We make the related resources of the proposed method publicly available to the research community. We believe that our work can offer some essential foundations for researchers during the current pandemic crisis.

Related work

In recent decades, to facilitate the research of biomedical topic curation, a series of automated methods [22,23,24,25,26,27,28,29,30,31,32] and challenging competitions [33, 34] have been developed to improve the time-consuming, costly, and labor-intensive semantic indexing process.

Learning-to-rank (LTR) is one of the most popular information retrieval approaches developed for semantic indexing [35]. The main idea of LTR is to model the topic identification problem as a ranking problem, where the top-ranked semantic topics are recommended as true labels. To this end, NLM developed the well-known retrieval tool Medical Text Indexer (MTI) [13, 22], which has been assisting NLM human curators since 2002. Specifically, MTI has two separate components: MetaMap Indexing and PubMed Related Citations. Once texts from a biomedical article are fed into MTI, it automatically recommends suitable MeSH topics to the human curators.
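The ranking view of topic identification can be sketched in a few lines. This is an illustration only and does not reproduce MTI or any cited system: the two per-topic features, the fixed weights (which a real LTR model would learn from curated data), and the function name `rank_topics` are assumptions.

```python
def rank_topics(candidate_features, weights, top_k=2):
    """Score each candidate topic with a linear model and keep the top_k.

    candidate_features maps a topic name to its feature values; weights
    would be learned by an LTR algorithm but are fixed here by assumption.
    """
    scores = {
        topic: sum(w * f for w, f in zip(weights, feats))
        for topic, feats in candidate_features.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]


# Hypothetical features per topic: [neighbor vote, text match].
features = {
    "COVID-19": [0.9, 0.8],
    "Humans": [0.7, 0.2],
    "Insulin": [0.1, 0.0],
    "SARS-CoV-2": [0.6, 0.9],
}
recommended = rank_topics(features, weights=[0.5, 0.5], top_k=2)
```

The cut-off `top_k` decides how many of the ranked topics are recommended as labels; in practice it might be tuned on held-out data or replaced by a score threshold.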

To encourage worldwide research on biomedical topic curation, a series of semantic indexing competitions have been held annually by the BioASQ community since 2013 [33]. Participants are required to annotate new MEDLINE articles with relevant MeSH topics. As the competitions have provided large-scale practical and realistic benchmarks, many effective studies have emerged since then. MeSHLabeler [23] developed an LTR-based hybrid system with textual representations for multiple integrated classifiers. To handle the prediction bias generated by the integrated classifiers, MeSHLabeler adopted a normalization schema to improve prediction accuracy and won first place in the BioASQ 2014 competition. MeSHNow [24] proposed another hybrid machine learning approach, which combined multi-label classification, KNN, and MTI to generate the set of candidate MeSH terms for each article. Building on the LTR-based framework, MeSHNow successfully extracted the highest-ranked semantic topics and reached state-of-the-art performance on the BioASQ 2014 dataset.

With the success of deep neural networks [36,37,38,39,40], deep learning-based approaches have brought remarkable breakthroughs in various biomedical semantic indexing tasks [25,26,27,28,29,30]. DeepMeSH [27] proposed a neural semantic representation method to address the BioASQ 2015 semantic indexing task. It first utilized the feature representations of ‘document to vector’ (D2V) and ‘term frequency with inverse document frequency’ (TFIDF) to tackle the topic selection problem. It then ranked the identified topics via an LTR-style framework to determine the final MeSH recommendation. FullMeSH [28] took advantage of an Attention-based Convolutional Neural Network (AttentionCNN) to tackle the large-scale semantic indexing problem. Specifically, it combined the AttentionCNN with traditional machine learning methods (including KNN, SVM, etc.) to generate semantic evidence for the topic selection problem. In place of manual feature engineering, the attention mechanism showed remarkable potential for learning feature representations automatically with little human intervention. Benefiting from the AttentionCNN structure, all evidence extracted from the full text is fused into the downstream LTR module to conduct the final MeSH recommendation. AttentionMeSH [29] was another effective attention-based neural model. It utilized a bidirectional Recurrent Neural Network (RNN) with an attention mechanism to index MeSH topics for biomedical articles. It first narrowed down the large MeSH vocabulary through a masking method and then employed the RNN to derive deeper contextual representations. Thanks to the capability of the deep neural representation, AttentionMeSH was able to associate more textual evidence with plausible MeSH topics. MeSHProbeNet [25] and MeSHProbeNet-P [26] are two closely related deep learning methods, which incorporated both RNN and attention mechanisms. 
The main difference between the two methods is that MeSHProbeNet-P, building on MeSHProbeNet, takes multiple semantic probes as inputs, which allows it to acquire deeper semantic insights into biomedical knowledge from the original plain texts. In contrast to the LTR-based models, MeSHProbeNet and MeSHProbeNet-P take the entire topic vocabulary of MeSH headings to perform unified multi-label classification without any ranking solutions. Both MeSHProbeNet and MeSHProbeNet-P reached state-of-the-art performance on the dataset of BioASQ 2018 Task8a, and MeSHProbeNet won first place during the online competition.
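To make the idea of probe-based attention more concrete, the following sketch shows one common form of it. The text above does not give the probe formula, so this assumes plain dot-product attention in which each probe is a query vector that pools the token representations into one feature vector; the toy vectors are made up, and in a trained model both the token states and the probes would be learned.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def probe_attention(hidden_states, probe):
    """Attend over token vectors with one probe.

    Returns the attention weights and the weighted sum of token vectors.
    """
    scores = [sum(h * p for h, p in zip(state, probe))
              for state in hidden_states]
    weights = softmax(scores)
    pooled = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(len(probe))
    ]
    return weights, pooled


def multi_probe_features(hidden_states, probes):
    """Concatenate the pooled vectors produced by every probe."""
    features = []
    for probe in probes:
        _, pooled = probe_attention(hidden_states, probe)
        features.extend(pooled)
    return features


# Three toy token vectors and two toy probes (assumed values).
hidden = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
probes = [[2.0, 0.0], [0.0, 2.0]]
weights, _ = probe_attention(hidden, probes[0])
features = multi_probe_features(hidden, probes)
```

Each probe focuses on a different aspect of the input, so the concatenated feature vector grows with the number of probes and can then be passed to a multi-label classifier.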

Recently, in response to the worldwide pandemic, the focus of research has drastically shifted towards the specific concepts and sub-concepts of coronavirus. The BioCreative-VII community proposed the challenging task of the LitCovid Track [

Fig. 2 The construction framework of the CovSI corpus

Corpus analysis

Table 1 presents the statistical information of the constructed CovSI corpus. After the metadata merging, 87,207 COVID-19 related biomedical articles are retained in the CovSI corpus. Each article contains 15 different attribute fields, such as PMID, title, abstract, body text, journal name, and MeSH terms. These abundant attributes ensure comprehensive coverage for research on COVID-19 topics. Most of the curated articles contain valid content for the title, abstract, journal name, and MeSH annotations, which provides the indispensable information for downstream semantic indexing research. A total of 1,161,962 MeSH topic terms with more than 10 thousand unique term types are kept as annotations in the corpus. However, despite our best efforts to fill the attributes, approximately 50% of body texts, keywords, and chemical information are still missing due to the incompleteness of the online information. Articles have around 13 indexed MeSH terms on average, which indicates an extremely imbalanced term distribution, since the vast majority of MeSH terms are absent from any given article.
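The average of around 13 terms per article follows directly from the two totals reported above, as the short calculation below shows. The vocabulary size is only given as "more than 10 thousand", so the 10,000 used here is a lower bound, which makes the resulting label density an upper bound; the variable names are illustrative.

```python
total_annotations = 1_161_962  # MeSH term annotations in the corpus
num_articles = 87_207          # articles retained after merging
vocab_lower_bound = 10_000     # "more than 10 thousand unique term types"

# Average number of MeSH terms attached to one article.
mean_terms_per_article = total_annotations / num_articles

# Upper bound on the fraction of the vocabulary an average article uses.
max_label_density = mean_terms_per_article / vocab_lower_bound

print(f"{mean_terms_per_article:.2f} terms per article")
print(f"at most {max_label_density:.2%} of the vocabulary per article")
```

An average article therefore carries well under 1% of the observed vocabulary, which is the sparsity behind the imbalance noted above.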

Table 1 The attribute statistics in the CovSI corpus

After the data construction, the CovSI corpus is randomly divided into three subsets at a ratio of 8:1:1, which form the training set, development set, and test set, respectively. Table 2 shows the statistics of the three subsets. Note that each article carries around 13 MeSH terms on average in every subset, which keeps the term distribution similar across all subsets. The CovSI corpus will be freely available to global research communities for applying recent advances in natural language processing and other artificial intelligence techniques to generate new insights in support of the ongoing fight against the pandemic.
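A random 8:1:1 split of this kind can be sketched as follows. The seed, the rounding rule, and the function name `split_corpus` are assumptions, so the subset sizes produced here need not match those reported in Table 2.

```python
import random


def split_corpus(article_ids, ratios=(8, 1, 1), seed=42):
    """Shuffle the ids and cut them into train, dev, and test subsets."""
    ids = list(article_ids)
    random.Random(seed).shuffle(ids)  # fixed seed for reproducibility
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_dev = len(ids) * ratios[1] // total
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test


# Integer stand-ins for the 87,207 article identifiers.
train, dev, test = split_corpus(range(87_207))
print(len(train), len(dev), len(test))
```

Because the split is purely random rather than stratified, similar per-subset term statistics are expected only on average, which is reasonable for a corpus of this size.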

Table 2 The statistical information of the different CovSI datasets