1 Introduction

Globally, the SARS-CoV-2 virus has had a destructive effect on communities since the outbreak of the COVID-19 pandemic in November 2019 [1]. Because of the rapid growth in the number of research articles, medical communities and researchers are under increased pressure to remain current with the literature [3]. Automatic text summarization addresses this problem: it aims to condense an article or document into a short text that retains the relevant facts, so that crucial information can be acquired in a short time. Text summarization is used in various application domains, such as generating news summaries, email summaries, financial reports, research article summaries, and medical reports that track a patient’s treatment. Condensing vital knowledge into a summary is beneficial because the internet abounds with both relevant and irrelevant information on any topic. It is challenging, as well as time-consuming, for humans to summarize such a large amount of data. This has given rise to the demand for powerful and sophisticated summarizers.

Generally, there are two categories of text summarization approaches: extractive and abstractive [4]. Extractive summarization techniques generate a summary using the most relevant sentences from the given document, whereas abstractive summarization techniques construct new phrases from the source document. The summaries generated by abstractive models are more similar to those written by humans. Although summaries from abstractive models tend to be more meaningful and grammatically correct, they may not be factually accurate, as shown by several researchers [5, 6]. This makes abstractive summarization inadequate for domains where factual consistency is crucial, such as medical research articles [5]. In addition, Zhong et al. demonstrated that extractive approaches outperformed their abstractive equivalents in human assessment [7]. A third category, hybrid summarization, combines the concepts of extractive and abstractive models. See et al. [8] acknowledge the benefit of hybrid summarization approaches and employ a pointer-generator technique in which the model is primarily abstractive but detects and copies significant facts directly from the source material to reduce factual inaccuracy. In the present study, transformer-based methods are employed to create salient features, and an extractive summary is then constructed to ensure factual consistency.

The pre-trained transformer architecture has evolved over the past few years. The T5 language model [9] has shown significant performance improvements over the baseline transformer model and is used for generative linguistic tasks such as abstractive summarization [10] and query generation. Furthermore, pre-trained Bidirectional Encoder Representations from Transformers (BERT) language models have shown positive results on various NLP tasks, including text summarization. Transformers have become increasingly proficient at capturing semantic knowledge, but they pose new challenges. Their processing capacity is limited by the number of tokens they can handle at once. Furthermore, fine-tuning the transformer’s attention mechanisms incurs a high computational cost. As a result of these constraints, recent text summarization methods often analyze a truncated version of documents [11, 12]. In this study, an unsupervised technique was chosen because it does not require manually labeled datasets for training. By utilizing graph-based knowledge and leveraging correlations between sentences, the most appropriate sentences are extracted to construct the summary.

The notable advantage of the proposed technique over established methods is its novel combination of a graph-based technique and a pre-trained model for unsupervised automatic text summarization. This hybrid approach is referred to as CovSumm. Using language models that are pre-trained on large corpora allows for a better interpretation of the textual content. These language models have been trained on huge amounts of data, enabling them to capture the semantics of natural language proficiently. The integration of graph-based methods in the proposed framework also allows for a more exhaustive investigation of the relationships between sentences. This, in turn, facilitates the recognition of the most critical knowledge to be incorporated into the summary. Overall, the proposed approach presents a new and effective solution for addressing the challenges posed by information overload and data redundancy in COVID-19 scientific literature, providing a comprehensive summary that enables researchers to stay informed on the latest studies and advancements in the field.

The main contributions of this study are as follows:

  1. An innovative hybrid approach for unsupervised automatic text summarization is proposed that extracts salient sentences in COVID-19-related scientific literature for generating the summary.

  2. The hybrid approach is an integration of graph-based techniques and pre-trained language models that achieves text summarization without requiring large quantities of manually labeled data.

  3. Based on the evaluation outcomes, the hybrid approach presented in this study surpasses current unsupervised techniques for automatic text summarization that are considered state of the art.

This article comprises several sections, namely: Sect. 2, which provides an outline of previous research in the field; Sect. 3, which describes the proposed approach; Sect. 4, which presents the experimental results; and Sect. 5, which concludes the paper with a discussion of future directions.

2 Related work

This section offers an overview of the existing literature on extractive text summarization. Document summarization has been widely investigated over the last few years. To summarize a document, extractive summarization algorithms select a set of sentences from the source text. Sentences are extracted from the original corpus, scored, and arranged in the order in which they appeared in the original document to generate a meaningful summary. These methods require the sentences to be converted into feature vectors so that their similarity can be estimated. Term frequency-inverse document frequency (TF-IDF) [13], skip-thought vectors [14], Bidirectional Encoder Representations from Transformers (BERT) [15], Global Vectors for word representation (GloVe), and Word2Vec embeddings [16] are widely used techniques for encoding sentences as vectors.

In 1958, Luhn [17] first presented a concept for text summarization that scores sentences based on the frequency of the significant words they contain. This idea was subsequently refined by Edmundson and Wyllys in 1961 [18]. Mihalcea and Tarau [19] presented the TextRank algorithm, a graph-based system derived from Google’s PageRank algorithm. The LexRank algorithm, presented by Erkan and Radev [20], is a related graph-based approach. Bishop et al. [21] used two baselines: the LEAD method, which takes the first N sentences as the summary, and the RANDOM method, which selects the summary sentences at random. Latent semantic analysis (LSA) uses singular value decomposition (SVD) to retrieve semantically rich phrases and generate a summary [22]. Nenkova and Vanderwende [23] proposed a system known as SumBasic that generates generic summaries from multiple documents.

A wide range of NLP applications show enhanced performance by using pre-trained models [24, 38]. In the medical domain, the authors of [26] presented an unsupervised extractive summarization technique that uses hierarchical clustering to group the contextual sentence embeddings produced by BERT encoders; the most appropriate sentences were then extracted from each group to generate the summary. Furthermore, [27] presented an unsupervised extractive technique based on the GPT-2 transformer model. The authors used pointwise mutual information for sentence encoding to determine whether sentences and documents are semantically similar. On the medical journal dataset, the presented technique outperformed previous benchmarks. In a recent work [28], Ju et al. introduced an unsupervised extractive method for scientific documents using the pre-trained Sci-BERT model; experiments were performed on arXiv, PubMed, and COVID-19 datasets. CAiRE-COVID was proposed in [29] for mining scientific literature in response to an input query; it combines a question-answering system with multi-document summarization. Medical researchers assessed the performance of the model using the Kaggle CORD-19 dataset and verified its efficacy, as reported in that study. Generative Pre-Training (GPT), introduced in [24], improves language understanding through unsupervised pre-training followed by fine-tuning. BERT, proposed in [15], learns bidirectional language representations. A hybrid approach was introduced by Bishop et al. [21], who presented a method called GenCompareSum. Evaluated on scientific datasets, this method outperforms both unsupervised and supervised models.

There are many more examples of unsupervised extractive summarization in the literature. The Learning Free Integer Programming Summarizer (LFIP-SUM) is described in [30]. This technique formulates an integer programming problem using pre-trained sentence embedding vectors. It also employs principal component analysis to identify the optimal number of sentences to extract and to evaluate their significance. What sets LFIP-SUM apart from conventional models is that it does not require labeled training data. The study concludes that this approach offers significant advantages over existing methods and has potential applications in various domains. Belwal et al. [31] presented a graph-based method for extractive text summarization that assigns weights to graph edges for ranking sentences. The weights assigned to the edges depend on the correlation between the sentences, which is determined through a vector space model and topic modeling. The technique employs topic vector generation to obtain the topic of interest in a given document. In addition, a semantic similarity measure is incorporated to determine sentence relevance, resulting in two approaches for creating the topic vector: combined and individual. The method’s primary contribution is a general mechanism that reduces the input document’s dimension to the topic vector, enabling the comparison of sentences with the vector and achieving impressive results in terms of ROUGE metrics. In [32], the authors introduced an unsupervised method for extractive summarization that combines K-Medoids clustering and Latent Dirichlet Allocation (LDA) topic modeling to minimize topic bias. The findings of that study demonstrate that this approach, with its stronger emphasis on subtopics, outperforms conventional topic modeling and deep learning approaches in unsupervised extractive summarization.
The graph-based summarization technique proposed in the study in [33] takes into account both the resemblance among individual statements and their relation to the entire document. This approach employs topic modeling to determine the pertinence of specific edges to the topics discussed in the text, as well as a semantic measure to evaluate the similarity between nodes. The method EdgeSumm [34] employs four distinct methods to form a summary of a document. In the first method, a unique graph-based model is constructed to represent the document. The subsequent two techniques are responsible for identifying pertinent sentences from the text graph. Finally, when the model-generated summary surpasses the necessary word limit, a fourth technique is employed to get the crucial sentences for the summary. EdgeSumm fuses extractive techniques to leverage their strengths and mitigate their limitations. A distance-augmented sentence graph is used in [35] to model sentences with greater granularity, resulting in the improved characterization of document structures. Additionally, the model is adapted to the multi-document setting by linking the sentence graphs of input documents using proximity-based cross-document edges. Ranksum [36] is an unsupervised extractive text summarization methodology for single documents. The technique employs four features, namely, the topic information, semantic content, significant keywords, and position of each sentence, to generate rankings indicating their degree of saliency. The rankings are then weighted and fused to produce a final score for each sentence. Ranksum employs probabilistic topic models, Siamese networks, and a graph-based method to derive the rankings and eliminate redundant sentences.

3 Proposed methodology

The proposed method, CovSumm, is an unsupervised hybrid extractive approach for COVID-19 document summarization. It fuses the sentence scores of GenCompareSum [21], which relies on a generative transformer-based model (T5) and BERT for sentence scoring, with those of TextRank [19], a graph-based model built on the cosine similarity measure. These are two distinctive and complementary lines of research for unsupervised extractive text summarization. The process flow is depicted in Fig. 1. The proposed framework consists of two branches: the left branch is GenCompareSum and the right branch is TextRank. The two methods generate sentence scores using different methodologies; these scores are ultimately fused in the proposed hybrid framework to generate the final summary. The detailed steps are explained in the subsections below.

Fig. 1 A comprehensive architecture of the proposed model CovSumm

3.1 Data-preprocessing and heuristic sentence extraction for dataset generation

The CORD-19 dataset containing COVID-19 scientific literature [25, 37] was used for evaluation purposes. The documents published between January 1, 2021 and December 31, 2021 were extracted to construct the summarization corpus. For this dataset, the original abstract of each paper is used as the gold summary and the body of the paper is used as the input document. To evaluate the proposed model’s performance, the gold summary is compared with the summary produced by the proposed model. After retrieving the research articles within the specified time frame, the documents with empty abstract fields were removed. Furthermore, non-English documents were removed, and duplicates were removed based on the paper title. After all the pre-processing steps mentioned above, 840 scientific articles were left. The BERT model is restricted to inputs of 512 tokens, a constraint typical of transformer models. CORD-19 documents average 6970 words in length, which exceeds the input length available to existing pre-trained language models. Working with long sequences requires high computational power. Since the majority of preceding studies evaluating transformer-based methods use truncated documents [7, 10, 11, 12, 21, 37], a corpus was also constructed for evaluating the proposed method by truncating long articles to 512 words. Considering that most of the salient knowledge of a research article is documented at its start, sentences are accumulated successively from the top of the article to form one paragraph until a length of 512 words is reached [1].
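The truncation step can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `truncate_to_word_limit`, the regular-expression sentence splitter, and the choice to stop before a sentence that would exceed the limit (rather than cutting it mid-sentence) are all assumptions made for illustration.

```python
import re


def truncate_to_word_limit(text, max_words=512):
    """Accumulate whole sentences from the start of `text` until adding the
    next sentence would exceed `max_words` words."""
    # Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    kept, count = [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if count + n_words > max_words:
            break  # assumption: never cut inside a sentence
        kept.append(sentence)
        count += n_words
    return " ".join(kept)
```

A production pipeline would more likely use a trained sentence tokenizer, since abbreviations such as "et al." confuse a punctuation-based splitter.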

3.2 Sentence scores generation using the pre-trained T5 model

This section describes how sentence scores are generated using the pre-trained T5 model of GenCompareSum. The input to the model is a corpus C of k documents, \(C=\left\{{D}_{1},{D}_{2},{D}_{3},\dots ,{D}_{k}\right\}\). Each document has several sentences \(D=\left\{{q}_{1},{q}_{2},{q}_{3},\dots ,{q}_{n}\right\}\), and the output of the system is \({D}^{{{\prime}}}=\left\{{{q}^{{{\prime}}}}_{1},{{q}^{{{\prime}}}}_{2},{{q}^{{{\prime}}}}_{3},\dots ,{{q}^{{{\prime}}}}_{m}\right\}\), where m < n. First, each document is split into sentences, which are then grouped into several sections. The T5 transformer model receives these sections as its input and generates a fixed number of text fragments for each section. The output of the T5 text generation model is a set of text fragments \(F=\left\{{f}_{1},{f}_{2},{f}_{3},\dots ,{f}_{l}\right\}\) for each section. To remove redundancy, the generated text fragments are combined and N-gram blocking is applied. The weights \(W=\left\{{w}_{1},{w}_{2},{w}_{3},...,{w}_{p}\right\}\) are obtained for the top p text fragments. Using BERTScore, the similarity between each sentence of the source document and each fragment chosen in the previous step is calculated.
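N-gram blocking, as commonly used in summarization, can be sketched as follows. This is a generic illustration rather than the GenCompareSum implementation: the function names, the whitespace tokenization, and the rule "drop a fragment if it shares any word n-gram with a fragment already kept" are assumptions.

```python
def word_ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def block_redundant_fragments(fragments, n=3):
    """Keep a fragment only if it shares no word n-gram with fragments kept so far.

    Fragments are visited in the given order, so earlier (for example,
    higher-weighted) fragments take priority over later ones.
    """
    kept, seen = [], set()
    for fragment in fragments:
        grams = word_ngrams(fragment.lower().split(), n)
        if grams & seen:
            continue  # overlaps an already kept fragment: treat as redundant
        kept.append(fragment)
        seen |= grams
    return kept
```

With `n=3`, "how the virus spreads through air" is dropped after "the virus spreads through droplets" has been kept, because both contain the trigram "the virus spreads".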

A similarity matrix is derived based on the scores calculated above. The similarity scores are multiplied by the weights generated for each fragment and summed up over the text fragments to get the final sentence scores using the T5 model. Final scores from the T5 model are summarized as \(S=\left\{{s}_{1},{s}_{2},{s}_{3},\dots ,{s}_{n}\right\}.\)

The equation to calculate the final scores for every sentence i is as stated below.

$${s}_{i} \left| {T_{5} } \right. =\sum_{t=1}^{p}{w}_{t}*\mathrm{BERTScore}\left({q}_{i},{f}_{t}\right)$$
(1)
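The weighted sum in Eq. (1) can be sketched in a few lines of Python. The similarity function is passed in as a parameter; `token_overlap` below is only a toy stand-in for BERTScore (an assumption made so that the example runs without a pre-trained model), and all function names are hypothetical.

```python
def sentence_scores(sentences, fragments, weights, similarity):
    """Eq. (1): score of a sentence = sum over fragments of weight * similarity."""
    if len(fragments) != len(weights):
        raise ValueError("each fragment needs exactly one weight")
    return [
        sum(w * similarity(sentence, fragment)
            for fragment, w in zip(fragments, weights))
        for sentence in sentences
    ]


def token_overlap(a, b):
    """Toy stand-in for BERTScore (assumption): Jaccard overlap of word sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0
```

In the actual pipeline, `similarity` would be replaced by a BERTScore computation with one of the BERT variants listed in Sect. 4.4.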

3.3 Sentence scores generation using the graph-based model

In this section, the TextRank method of text summarization is described. The input to the model is a corpus C of k documents, \(C=\left\{{D}_{1},{D}_{2},{D}_{3},\dots ,{D}_{k}\right\}\). Each document contains several sentences \(D=\left\{{q}_{1},{q}_{2},{q}_{3},\dots ,{q}_{n}\right\}\). In the data preprocessing step, sentence tokenization and word tokenization are performed. Subsequently, punctuation and stopwords are removed, and the text is converted to lowercase. Pre-trained GloVe embeddings are used to obtain the sentence vectors \(Vec=\left\{{vec}_{1},{vec}_{2},{vec}_{3},\dots ,{vec}_{n}\right\}\) for each document. The similarity between sentences is computed using cosine similarity, which is defined as follows.

$$\mathrm{cos}\left(\theta \right)=\frac{A\cdot B}{\parallel A\parallel \cdot \parallel B\parallel }$$
(2)

Here, A and B denote the embedding vectors of two sentences, and Eq. (2) measures the correlation between them. The degree of correlation can be measured on a scale from 0 to 1, with a value of 0 indicating minimal similarity and a value of 1 indicating maximal similarity. For calculating scores, graphs are constructed from the cosine similarity matrices. The graph \(G\left(V,E\right)\) comprises vertices and edges, where the vertices denote the nodes and the edges denote the links between them. The nodes in the graph correspond to the sentences present in a document, whereas the edges signify the degree of similarity among the sentences within the document. The graph is generated by representing sentences as nodes and calculating the weights of the edges connecting the nodes using Eq. (2).
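As an illustration of Eq. (2) and of the matrix from which the sentence graph is built, the following minimal sketch computes pairwise cosine similarities for a list of sentence vectors. The function names and the convention of returning 0 for a zero vector are assumptions; the diagonal is left at 0 so that the resulting graph has no self-loops.

```python
import math


def cosine_similarity(a, b):
    """Eq. (2): dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Symmetric matrix of pairwise cosine similarities; entry [i][j] is the
    weight of the edge between sentence i and sentence j."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = cosine_similarity(vectors[i], vectors[j])
    return matrix
```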

The score of a node can be computed using the given formula:

$${s{^{\prime}}}_{i} \left| {Graph} \right. =\left(1-d\right)+d\times \sum_{j\in \mathrm{In}\left({V}_{i}\right)}\frac{1}{\left|\mathrm{out}\left({V}_{j}\right)\right|}\,{s{^{\prime}}}_{j} \left| {Graph} \right.$$
(3)

In a directed graph, the set of nodes that point to a specific node \({V}_{j}\) is denoted by \(\mathrm{In}\left({V}_{j}\right)\), whereas the set of nodes that \({V}_{j}\) points to is denoted by \(\mathrm{out}\left({V}_{j}\right)\). Here, a damping factor \(d\) is used to control the chances of randomly hopping from one node to another node. This value is set between 0 and 1 and determines the weight given to the random walk component in the overall score calculation. By adjusting the damping factor, one can fine-tune the relative importance of local and global graph structure in the ranking process. To generate graphs, the Networkx library is utilized. The default value of the damping factor is 0.85. The PageRank function of the Networkx library is employed to generate the sentence scores: every node is assigned a score that reflects its importance within the network. This score is used to rank the nodes in order of importance, with the most highly ranked nodes typically being those that are most central or influential in the network. The sentence scores are denoted by \(S{^{\prime}}=\left\{s{^{\prime}}_{1},s{^{\prime}}_{2},s{^{\prime}}_{3},\dots ,s{^{\prime}}_{n}\right\}.\)

Although the algorithm was initially intended for use with directed graphs, it can also be applied to undirected graphs. In that case, every edge is treated as both an incoming and an outgoing edge, so each vertex has the same number of incoming and outgoing edges.
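To make the recursion in Eq. (3) concrete, the sketch below iterates it to a fixed point on an undirected, unweighted graph stored as an adjacency dictionary. It is a plain-Python illustration, not the Networkx routine used in the study: the function name, the stopping rule, and the iteration cap are assumptions. Note that the Networkx PageRank function normalizes scores so that they sum to 1, whereas scores obtained directly from Eq. (3) sum to the number of nodes; the resulting ranking is what matters for sentence selection.

```python
def rank_nodes(neighbours, d=0.85, tol=1e-8, max_iter=200):
    """Iterate Eq. (3) on an undirected graph given as {node: set_of_neighbours}.

    In an undirected graph the neighbours of a node act as both its incoming
    and its outgoing links, so |out(V_j)| is simply the degree of V_j.
    """
    if not neighbours:
        return {}
    scores = {v: 1.0 for v in neighbours}
    for _ in range(max_iter):
        updated = {}
        for v in neighbours:
            incoming = sum(scores[u] / len(neighbours[u]) for u in neighbours[v])
            updated[v] = (1 - d) + d * incoming
        delta = max(abs(updated[v] - scores[v]) for v in neighbours)
        scores = updated
        if delta < tol:  # assumption: stop once scores change by less than tol
            break
    return scores
```

On a star graph, for example, the central node receives the highest score because every other node passes its whole score to it.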

3.4 Extracting relevant sentences based on scores for the final summary

The proposed method presents two sentence scores, one from the T5 model denoted as \(S=\left\{{s}_{1},{s}_{2},{s}_{3},\dots ,{s}_{n}\right\}\) and another from the graph-based model denoted as \(S{^{\prime}}=\left\{s{^{\prime}}_{1},s{^{\prime}}_{2},s{^{\prime}}_{3},\dots ,s{^{\prime}}_{n}\right\}.\) The hybrid methodology proposed in this study combines these two scores to generate the final sentence scores. To combine these scores, the following formulation is used to integrate the sentence scores shown in (1) and (3).

$$ {\text{Score}}_{i} = \left( {1 - \beta } \right)*\left( {s_{i} \left| {T5} \right.} \right) + \left( \beta \right)*\left( {s_{i}^{\prime } \left| {{\text{Graph}}} \right.} \right) $$
(4)

Here \(\beta \) (Beta) is the influence factor, and its value ranges from 0 to 1. This study shows it is possible to shift more weight to the model that performs better. After obtaining the final scores from the proposed method, the top N sentence excerpts are chosen for the final summary.
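The fusion in Eq. (4) and the subsequent top-N selection can be sketched as follows. The default values of `beta` and `top_n` are the best-performing settings reported in Sect. 4.5; the function name and the decision to return the selected sentences in their original document order are assumptions. The sketch applies Eq. (4) to the scores as given, since no rescaling step is described in the text.

```python
def fuse_and_select(t5_scores, graph_scores, beta=0.3, top_n=7):
    """Eq. (4): fused_i = (1 - beta) * t5_i + beta * graph_i, then keep the top N.

    Returns (indices of the selected sentences in document order, fused scores).
    """
    if len(t5_scores) != len(graph_scores):
        raise ValueError("both score lists must cover the same sentences")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    fused = [(1 - beta) * s + beta * g for s, g in zip(t5_scores, graph_scores)]
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    selected = sorted(ranked[:top_n])  # assumption: restore document order
    return selected, fused
```

Setting `beta=0` reduces the method to the T5 branch alone, and `beta=1` to the graph branch alone, which is a convenient sanity check when tuning the influence factor.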

4 Experimental result and analysis

4.1 Dataset used

The updated CORD-19 [37] corpus was downloaded from Kaggle on March 31, 2022. The White House, in partnership with various research associations, assembled the CORD-19 corpus of scholarly articles about COVID-19. This database contains scientific papers from PubMed Central, PubMed, the WHO database, and pre-print servers such as bioRxiv, arXiv, and medRxiv. Although the unsupervised methods require no training, working with a large number of documents requires high computational power. The proposed methodology was therefore tested on the scientific papers in the database dated from January 1, 2021 to December 31, 2021, consisting of 840 documents in total.

4.2 State-of-the-art models for extractive text summarization

Text summarization is the mechanism of producing a brief overview of a document by extracting and linking the salient phrases within it. This task has garnered significant attention from the Natural Language Processing (NLP) community, and several unsupervised approaches have been developed to address it. These techniques utilize various heuristics, such as term or phrase frequency, sentence similarity, and sentence placement, to determine the most salient sentences. Recently, neural network-based approaches have also emerged, which typically train a model to directly predict the most influential sentences in an article. The proposed methodology is compared with the following powerful unsupervised summarization techniques. In Table 1, state-of-the-art techniques for unsupervised extractive-based text summarization are displayed.

Table 1 State-of-the-art models for extractive text summarization

4.3 Evaluation metric

In [39], the authors presented the ROUGE score metric. Machine translation and summarization are generally evaluated using this set of metrics. The metric measures the overlap between the summary generated by the proposed methodology and the gold summary. ROUGE-1 counts the unigrams that match between the method output and the reference output. ROUGE-2 is analogous to ROUGE-1, but it counts matching bigrams. ROUGE-L is based on the longest common subsequence (LCS) between a system summary and the ground truth summary. Based on F1 scores computed using Pyrouge version 1.5.5, the performances of the proposed method and the state-of-the-art approaches were analyzed and compared in terms of the ROUGE-1, ROUGE-2, and ROUGE-L metrics.

The following formulae are used to calculate recall, precision, and F1 score for ROUGE-1 and ROUGE-2:

Recall score: The ratio of total counts of co-occurring n-grams encountered in both the system-generated summary and gold summary to total counts of n-grams in the reference summary.

$$\mathrm{Recall}=\frac{{\sum }_{q\in {Summary}_{model}}{\sum }_{{gram}_{n}\in q}{Count}_{match}\left({gram}_{n}\right)}{{\sum }_{q\in {Summary}_{ref}}{\sum }_{{gram}_{n}\in q}Count\left({gram}_{n}\right)}$$
(5)

Precision score: The ratio of total counts of co-occurring n-grams encountered in both the model-generated summary and gold summary to total counts of n-grams in the model-generated summary.

$$Precision=\frac{{\sum }_{q\in {Summary}_{model}}{\sum }_{{gram}_{n}\in q}{Count}_{match}\left({gram}_{n}\right)}{{\sum }_{q\in {Summary}_{model}}{\sum }_{{gram}_{n}\in q}Count\left({gram}_{n}\right)}$$
(6)

F1 score: The F1 score can therefore be estimated as follows using the recall and precision values.

$$F1=2*\frac{\mathrm{Precision}*\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
(7)
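Equations (5) to (7) can be illustrated with a small self-contained sketch. It is not the Pyrouge implementation used in the experiments: tokenization by lowercasing and whitespace splitting, the absence of stemming, and the function names are assumptions. Matches are clipped, meaning that an n-gram is counted at most as often as it occurs in the other summary.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of word n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) for n-gram overlap, following Eqs. (5)-(7)."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped count of co-occurring n-grams
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For the candidate "the cat sat on the mat" and the reference "the cat lay on the mat", five of the six unigrams match, so precision, recall, and F1 are all 5/6; three of the five bigrams match, giving 0.6 for ROUGE-2.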

The following formulae are used to calculate recall, precision, and F1 score for ROUGE-L.

Recall score: In the system-generated and gold summary, the total number of LCS n-grams encountered is divided by the total number of n-grams in the gold summary.

$$Recall=\frac{{\sum }_{q\in {Summary}_{model}}{\sum }_{{gram}_{n}\in q}{LCS}_{match}\left({gram}_{n}\right)}{{\sum }_{q\in {Summary}_{ref}}{\sum }_{{gram}_{n}\in q}Count\left({gram}_{n}\right)}$$
(8)

Precision score: In the system-generated and gold summary, the total number of LCS n-grams encountered is divided by the total number of n-grams in the system-generated summary.

$$Precision=\frac{{\sum }_{q\in {Summary}_{model}}{\sum }_{{gram}_{n}\in q}{LCS}_{match}\left({gram}_{n}\right)}{{\sum }_{q\in {Summary}_{model}}{\sum }_{{gram}_{n}\in q}Count\left({gram}_{n}\right)}$$
(9)

F1 Score: To estimate the F1 score for ROUGE-L, one can use the equation that is shown in Eq. (7).
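A corresponding sketch for ROUGE-L computes the length of the longest common subsequence with dynamic programming and then applies Eqs. (8), (9), and (7). As before, this is an illustration under stated assumptions rather than the Pyrouge implementation: each summary is treated as a single token sequence (the official toolkit can also aggregate LCS matches sentence by sentence), and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, F1) based on the LCS of the two summaries."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```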

4.4 Parameter settings

The models were executed on Google Colab Pro with an NVIDIA Tesla T4 GPU and high RAM. To find the best parameter setting for the proposed model, a detailed performance analysis was conducted with different parameter settings. The optimal parameter settings for the model are listed in Table 2. The stride determines the chunks, or sections, given as input to the T5 transformer; values between 2 and 10 were tested. The temperature was tested in the range of 0.2 to 1. The number of salient texts is the parameter that selects the top-weighted fragments; values from 5 to 12 were tested. N-gram blocking, which is used to reduce redundancy, was tested for the values 3 and 4. BERTScore is employed to compute the degree of similarity between the input text and the fragments produced by the T5 transformer; it was tested with two base models, bert-base-uncased and allenai/scibert_scivocab_cased. The count of sentences parameter sets how many top-scoring sentences are included in the final summary; values from 4 to 10 were tested. The β value used to calculate the scores was tested in the range of 0 to 1.

Table 2 Optimal parameter setting for the proposed methodology

4.5 Discussion of results

Unsupervised extractive summarization models are compared and analyzed on the truncated version of the CORD-19 corpus using the ROUGE metric. According to Table 3 and Fig. 2, the proposed hybrid methodology surpassed the listed techniques, with F1 scores of 40.14%, 13.25%, and 36.32% for ROUGE-1, ROUGE-2, and ROUGE-L, respectively. Among the three metrics, ROUGE-1 attains the highest F1 score (40.14%). The lowest-performing method for ROUGE-1 and ROUGE-2 is RANDOM, with F1 scores of 33.26% and 8.17%, respectively. SumBasic, with a score of 27.63%, is the lowest-performing model for ROUGE-L. In conclusion, the proposed CovSumm model performs well on all metrics, namely ROUGE-1, ROUGE-2, and ROUGE-L.

Table 3 Performance of powerful summarization models by F1 scores on CORD-19 corpus

Fig. 2 The F1 scores for ROUGE-1, ROUGE-2, and ROUGE-L were utilized to assess the performance of various unsupervised summarization models on the CORD-19 corpus

The efficacy of the proposed approach is analyzed with a distinct number of sentences, and the outcomes are illustrated in Table 4 and Fig. 3.

Table 4 Experimental analysis of the presented approach as per the parameter count of sentences against ROUGE scores

Fig. 3 The graph depicts how the total count of sentences influences the efficacy of the proposed technique when evaluating the CORD-19 corpus using the ROUGE metric: (a) the number of sentences vs ROUGE-1 score; (b) the number of sentences vs ROUGE-2 score; (c) the number of sentences vs ROUGE-L score; (d) the number of sentences vs ROUGE-1, ROUGE-2, and ROUGE-L scores

Table 4 illustrates how the number of sentences used to produce the final summary affects the performance of the proposed hybrid model. The experiment was performed on values ranging from 4 to 10. According to the analysis, the model performance increases as the number of sentences grows up to 7 and declines thereafter. With 7 sentences selected for the final summary, ROUGE-1, ROUGE-2, and ROUGE-L reach their highest F1 values of 40.14%, 13.25%, and 36.32%, respectively. With 4 sentences, ROUGE-1, ROUGE-2, and ROUGE-L have their lowest F1 scores of 35.78%, 11.87%, and 32.10%, respectively. Fig. 3 shows that the proposed hybrid model performs best when seven sentences are selected for the summary, for all the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L.

Additionally, the performance of the proposed model for different values of \(\beta \) is presented in Table 5 and Fig. 4. Table 5 compares the efficacy of the proposed technique according to the parameter \(\beta \). An experiment was conducted with values ranging from 0 to 1. It was found that the model performance increases with increasing \(\beta \) up to 0.3 and gradually declines thereafter. The highest F1 scores for ROUGE-1, ROUGE-2, and ROUGE-L were achieved at \(\beta \) = 0.3. The lowest F1 scores for ROUGE-1, ROUGE-2, and ROUGE-L, namely 39.92%, 13.05%, and 36.06%, respectively, were obtained at \(\beta \) = 0.6. From Fig. 4, it can be inferred that the proposed methodology performs best when the \(\beta \) value is set to 0.3, for all the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L.

Table 5 Experimental analysis of the developed approach based on the parameter \(\beta \) value against the ROUGE values

Fig. 4 The graph illustrates how the \(\beta \) values influence the efficacy of the proposed methodology when evaluating the CORD-19 corpus using the ROUGE metric: (a) \(\beta \) values vs ROUGE-1 score; (b) \(\beta \) values vs ROUGE-2 score; (c) \(\beta \) values vs ROUGE-L score; (d) \(\beta \) values vs ROUGE-1, ROUGE-2, and ROUGE-L scores

Figure 5 illustrates a sample of the original text utilized as input for the CovSumm model. The reference summary used to calculate the ROUGE scores is the gold summary, while the summary produced by the proposed model is the generated summary.

Fig. 5 Summary sample: (a) original text as input; (b) gold summary used as a reference summary; (c) summary generated from the proposed system

5 Conclusion

In this study, the CovSumm model was proposed for summarizing scientific papers related to COVID-19. CovSumm is an unsupervised hybrid approach that fuses the recently introduced GenCompareSum method, which uses the pre-trained T5 model for language generation, with the graph-based TextRank algorithm to produce summaries of scientific research articles. The major advantage of CovSumm is that it combines the strengths of two distinctive and complementary lines of research for unsupervised extractive text summarization: the T5 language model and a graph-based algorithm. This combination allows for a more comprehensive and non-redundant summary of COVID-19-relevant scientific publications, as demonstrated by higher ROUGE scores compared to the state-of-the-art methodologies. Another benefit of the proposed approach is that it can effectively handle the information overload that is prevalent in the massive volume of COVID-19-relevant scientific studies. This would help to keep researchers and medical associations informed on the latest COVID-19-related information. The experimental results and analysis show that the proposed CovSumm model outperforms various unsupervised summarization methods on the CORD-19 corpus. The proposed model may also give direction to researchers in further studies on effective summarization of scientific literature. The use of domain-specific pre-trained transformer models [42] may help to improve the performance scores. Further, this model can be tested with other transformer-based language models for language generation, such as PEGASUS, GPT-2, and BART, as well as with different graph-based approaches, and can be evaluated on various other biomedical datasets.