A hybridized red deer and rough set clinical information retrieval system for hepatitis B diagnosis

Mishra, Madhusmita; Acharjya, D. P.

doi:10.1038/s41598-024-53170-5

Article
Open access
Published: 15 February 2024

Volume 14, article number 3815, (2024)

Scientific Reports


Madhusmita Mishra & D. P. Acharjya


Abstract

Healthcare is a major concern for a rapidly growing population. Many approaches to improving health are pursued, such as early disease identification, treatment, and prevention. Knowledge acquisition is therefore essential at different stages of decision-making. One technique to address this need is inferring knowledge from an information system, which requires multiple steps for extracting useful information. Handling uncertainty throughout data analysis is another challenging task. Computational intelligence is a step forward to this end in feature selection, classification, clustering, and developing clinical information retrieval systems. According to recent studies, swarm optimization is a useful technique for discovering key features while resolving real-world issues. However, it is ineffective in managing uncertainty. Conversely, a rough set helps a decision system generate decision rules without any additional information. In order to assess real-world information systems while managing uncertainties, this research presents a hybrid strategy that combines a rough set and the red deer algorithm. Within the red deer optimization algorithm, the suggested method selects the optimal features in terms of the rough set degree of dependency. A rough set is then used to determine the decision rules. The efficiency of the suggested model is compared with that of the decision tree algorithm and the conventional rough set. An empirical study on hepatitis disease illustrates the viability of the proposed research as compared to the decision tree and crisp rough set. The proposed hybridization of rough set and red deer algorithm achieves an accuracy of 91.7%. The accuracy obtained for the decision tree and rough set methods is 82.9% and 88.9%, respectively. This suggests that the proposed research is viable.

Introduction

Data generation and its use have progressively increased in various sectors because of the widespread adoption of information and communication technology. The healthcare sector is one of many sectors that produce data at an exponential rate. These sectors employ technologies that compress enormous volumes of data per microsecond. As data volumes increase, data analysis techniques including feature extraction, rule generation, and data reduction become more important¹. On this basis, many knowledge inference techniques have been developed, and computational intelligence is widely used to this end. Dealing with ambiguity and imprecision in decision-making processes is the core objective of computational intelligence. Knowledge gained from information systems ought to be precise, intelligible, transparent, and visually conveyed. Machine learning is frequently used for feature selection, rule generation, classification, and grouping. Healthcare applications must retain the data that matter for a given condition by selecting features and creating decision rules. Both feature selection and rule generation are therefore driven by the healthcare application.

Besides, knowledge inference’s primary issue is choosing important attributes. Recent times have seen the use of swarm algorithms to select key features to classify a system. These techniques produce outputs that are more approximate than accurate. In the literature, a variety of swarm algorithm techniques are discussed. Engineering applications employ a variety of methods, such as Particle Swarm Optimization (PSO)², fish algorithm³, bat-inspired algorithm⁴, and whale optimization algorithm⁵ for feature selection. However, the use of these techniques in healthcare applications is limited and not widespread.

This research considers the Red Deer (RD) optimization technique⁶. The RD algorithm is a meta-heuristic that imitates the natural behavior of red deer. It was originally presented in 2018 and has demonstrated promising outcomes in resolving a variety of optimization issues. RD optimization is a population-based algorithm that uses the collective intelligence of a herd of RD to discover the best answer to a problem. It deviates from PSO in some ways, including how the population moves and how the leader is chosen. However, it also incorporates a few PSO characteristics, including the steering of search spaces, velocity control to regulate population movement, and the application of a fitness function to assess the quality of potential solutions⁷.

For knowledge inference, on the other hand, advanced computing techniques are being developed. To begin with, the fuzzy set⁸ was introduced to deal with uncertainty. The soft set⁹, Rough Set (RS)¹⁰, and various other concepts were also introduced for knowledge inference while controlling uncertainties. Among these comparable approaches, RS is utilized in a variety of technical and scientific domains¹¹. Variations of RS and their applications are found in emerging areas of science and engineering¹²,¹³. An equivalence relation is the prime concept of RS. Further, binary relations, fuzzy equivalence relations, and intuitionistic fuzzy equivalence relations have been presented as alternatives to the equivalence relation¹⁴,¹⁵. Clinical information retrieval systems and knowledge inference both make use of these improvements¹⁶,¹⁷. At the same time, RS has been extended with numerous concepts. For instance, RS has been combined with several algorithms, such as the neural network¹⁸,¹⁹, genetic algorithm²⁰, fish swarm, cuckoo search, and shuffled frog leaping algorithm²¹. Even so, the literature shows that the fusion of RS with swarm optimization is quite rare.

[…] $f:(Q\times P)\rightarrow V$. If $P=(C\cup D)$, then the information system is referred to as a decision system. Here, C is the set of conditional attributes and D is the set of decisions. An equivalence relation, IND(R), as defined in Eq. (1), is the prime notion of RS, where $R\subseteq (Q\times Q)$.

$$\begin{aligned} IND(R)=\{(q_i, q_j): f_p(q_i) = f_p(q_j)\hspace{0.2cm} \forall \hspace{0.2cm} p\in P\} \end{aligned}$$

(1)

The equivalence relation R partitions the set Q into classes, expressed as Q/R. Consider a target set $X\subseteq Q$ about which knowledge is to be inferred. The lower and upper approximations, represented by ${\underline{R}}X$ and ${\overline{R}}X$, respectively, approximate X. Equation (2) defines the lower approximation, while Eq. (3) defines the upper approximation.

$$\begin{aligned} {\underline{R}}X = \cup \{Y\in Q/R: Y\subseteq X\} \end{aligned}$$

(2)

$$\begin{aligned} {\overline{R}}X = \cup \{Y\in Q/R: Y\cap X\ne \phi \} \end{aligned}$$

(3)

Two situations result from the lower and upper approximations: either ${\underline{R}}X = {\overline{R}}X$ or ${\underline{R}}X \ne {\overline{R}}X$. In the first case, X is a crisp set, whereas in the second X is a rough set. The boundary line objects in the latter scenario are designated as $BN_R(X)= {\overline{R}}X - {\underline{R}}X$. Suppose $A\subseteq P$ and $B\subseteq P$ induce two equivalence relations on Q. The A-positive region of B is described in Eq. (4).

$$\begin{aligned} POS_A(B)=\cup _{X\in Q/B}\hspace{0.1cm}{\underline{A}}(X) \end{aligned}$$

(4)

The measure of $B$’s dependence on A, $k=\gamma _A(B)$, is defined in Eq. (5). B is independent of A if $k=0$. Likewise, if $k=1$, then B completely depends on A. Otherwise, $0<k<1$ and B depends partially on A.

$$\begin{aligned} \gamma _A(B) = k = \frac{|POS_A(B)|}{|Q|} \end{aligned}$$

(5)
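As a rough illustration of Eqs. (4) and (5), the following Python sketch computes the degree of dependency for a small decision table. The table `rows`, its attribute names, and the helper names `partition` and `dependency` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group object ids by their values on attrs (the classes of Q/attrs)."""
    blocks = defaultdict(set)
    for obj, values in rows.items():
        blocks[tuple(values[a] for a in attrs)].add(obj)
    return list(blocks.values())

def dependency(rows, cond, dec):
    """gamma_A(B) = |POS_A(B)| / |Q|, following Eqs. (4) and (5)."""
    cond_blocks = partition(rows, cond)
    positive = set()
    for dec_block in partition(rows, dec):
        for block in cond_blocks:
            if block <= dec_block:   # block lies in the lower approximation
                positive |= block
    return len(positive) / len(rows)

# Hypothetical toy decision table (not Table 1 of the paper).
rows = {
    "q1": {"p1": 1, "p2": 0, "d": "yes"},
    "q2": {"p1": 1, "p2": 0, "d": "no"},
    "q3": {"p1": 0, "p2": 1, "d": "no"},
    "q4": {"p1": 0, "p2": 0, "d": "yes"},
}
k = dependency(rows, ["p1", "p2"], ["d"])
print(k)
```

In this toy table, $q_1$ and $q_2$ agree on both conditions but differ in the decision, so only $q_3$ and $q_4$ fall in the positive region and the decision depends on the conditions only partially.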

The notation $\psi \rightarrow \phi $ is the common form of a decision rule. In the decision rule, the conditions are denoted by $\psi $ and the decision by $\phi $. Support, strength, and precision are three essential characteristics of decision rules. The support of a decision rule is defined as $S(\psi ,\phi ) = card(||\psi \wedge \phi ||)$. Likewise, the strength of the decision rule is defined as $\sigma (\psi ,\phi ) = S(\psi ,\phi )/card(||\psi ||_{\phi })$. The accuracy of a decision rule is defined in Eq. (6), where $NS(\psi , \phi )$ stands for the non-support of the rule.

$$\begin{aligned} Accy = \frac{|S(\psi , \phi )|}{|NS(\psi , \phi ) + S(\psi , \phi )|} \end{aligned}$$

(6)
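A minimal sketch of the support and accuracy computation in Eq. (6) is given below. It assumes that the non-support of a rule counts the records that satisfy the condition $\psi $ but carry a different decision; this reading, the function name `rule_stats`, and the sample records are assumptions made for illustration.

```python
def rule_stats(records, condition, decision, decision_key="d"):
    """Return (support, non_support, accuracy) for the rule condition -> decision."""
    matched = [r for r in records
               if all(r[attr] == val for attr, val in condition.items())]
    support = sum(1 for r in matched if r[decision_key] == decision)
    non_support = len(matched) - support       # assumed meaning of NS
    accuracy = support / len(matched) if matched else 0.0
    return support, non_support, accuracy

# Hypothetical records, not taken from the paper's tables.
records = [
    {"p1": "yes", "p4": "high", "d": "disease"},
    {"p1": "yes", "p4": "high", "d": "disease"},
    {"p1": "yes", "p4": "high", "d": "healthy"},
    {"p1": "no", "p4": "low", "d": "healthy"},
]
support, non_support, accuracy = rule_stats(
    records, {"p1": "yes", "p4": "high"}, "disease")
print(support, non_support, accuracy)
```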

A brief explanation of the rough set rule generation process is given here. A categorical information system is analysed as part of the rule generation algorithm to obtain candidate rules. The computational procedure involved in rough set rule generation is presented below.

An analytical interpretation

Consider a liver disease diagnosis system, as indicated in Table 1. It contains information on ten patients. The five symptoms of liver disease considered are ascites ($p_1$), spiders ($p_2$), edema ($p_3$), bilirubin ($p_4$), and albumin ($p_5$). For example, patient $q_1$, who has ascites, spiders, no edema, very high bilirubin, and high albumin, is classified as having liver disease. The remaining patients in Table 1 are interpreted in a similar way.

Table 1 A sample decision table of liver disease.

We obtain $Q/R=\{\{q_1, q_3\}, \{q_2, q_{10}\}, \{q_4, q_7\},\{q_5, q_8, q_9\}, \{q_6\}\}$ by applying the equivalence relation on the features $P = \{p_1, p_2, p_3, p_4, p_5\}$. Taking $X = \{q_1, q_3, q_5, q_6, q_8, q_{10}\}$ into account, we obtain ${\underline{R}}X = \{q_1, q_3, q_6\}$ and ${\overline{R}}X = \{q_1, q_2, q_3, q_5, q_6, q_8, q_9, q_{10}\}$. As a result, the boundary line objects are $BN_R(X) = \{q_2, q_5, q_8, q_9, q_{10}\}$. The boundary region and the lower and upper approximations of the RS are outlined in Fig. 1 from a broad perspective.
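The worked example above can be reproduced with a few lines of Python. The sketch below applies Eqs. (2) and (3) directly to the partition $Q/R$ and the set $X$ stated in the text; the function name `approximations` is an illustrative choice.

```python
def approximations(partition, target):
    """Return (lower, upper, boundary) of target with respect to Q/R."""
    lower, upper = set(), set()
    for block in partition:
        if block <= target:    # block entirely inside X: Eq. (2)
            lower |= block
        if block & target:     # block overlaps X: Eq. (3)
            upper |= block
    return lower, upper, upper - lower

# Equivalence classes Q/R and target set X as stated in the text.
partition = [{"q1", "q3"}, {"q2", "q10"}, {"q4", "q7"},
             {"q5", "q8", "q9"}, {"q6"}]
X = {"q1", "q3", "q5", "q6", "q8", "q10"}

lower, upper, boundary = approximations(partition, X)
print(lower, upper, boundary)
```

Since the lower and upper approximations differ, $X$ is a rough set with respect to $R$.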

Convictions of red deer optimization

Since ancient times, Scotland has supported populations of Red Deer (RD). Male stags and female hinds are the two main categories of RD. This animal exhibits remarkable behavior during the breeding season. Stags roar loudly and often during this time of year to attract hinds, and hinds mostly prefer males that roar frequently. The mating phenomenon is the foundation of Red Deer Optimization (RDO). It is a population-based meta-heuristic algorithm in which the Male RDs (MRDs) are chosen first and the rest are regarded as hinds. The MRDs begin by roaring and are then split into two groups known as commanders and stags. These two groups fight for control of the harems. The number of hinds in a harem is related to the roaring and fighting abilities of its commander. As a result, the commanders mate with numerous hinds in their harems. Further, each stag mates with the closest hind⁶.

Exploration and exploitation are two stages of the algorithm’s operation. The roaring of MRDs in the search space promotes local search exploitation. Likewise, the fighting between stags and commanders acts as a local search to provide improved solutions. In the exploration stage, harems are created and assigned to the commanders. During this phase, the commanders mate with the hinds of their own harem and of another harem. The mating phase of the algorithm, which creates RD offspring, is another stage of this optimization³⁵.

Let us consider a population of RD’s defined in Eq. (7). Further, the fitness of each RD is obtained using Eq. (8), where m is the number of features.

$$\begin{aligned} q^{RD} = \{p_1, p_2, p_3, \cdots , p_m\} \end{aligned}$$

(7)

$$\begin{aligned} Fitness = f(q^{RD}) = f(p_1, p_2, p_3, \cdots , p_m) \end{aligned}$$

(8)

The procedure starts with an elementary population of size n. Further, the population is categorized into MRD and Hind RD (HRD). While HRDs are thought of as diversification, MRDs have intensification characteristics in the population. Besides, the MRDs enhance their ranks using the Eq. (9), where UB refers to upper bound and LB refers to lower bound of the search space. The constants $s_1$, $s_2$, and $s_3$ are the random numbers between 0 and 1 and refer to the three stages of roaring.

$$\begin{aligned} q_{new}^{MRD} = \left\{ \begin{array}{ccc} q_{old}^{MRD} + s_1\times ((UB - LB)\times s_2 + LB) &{} if &{} s_3\ge 0.5\\ &{} &{} \\ q_{old}^{MRD} - s_1\times ((UB - LB)\times s_2 + LB) &{} if &{} s_3 < 0.5 \end{array} \right. \end{aligned}$$

(9)
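The roaring update of Eq. (9) can be sketched as follows. This is only an illustration under stated assumptions: the same scalar step is applied to every dimension, and the result is clipped to $[LB, UB]$, which the equation itself does not specify. The function name `roar` and the `rng` parameter are hypothetical.

```python
import random

def roar(position, lb, ub, rng=random):
    """Candidate new position of one male deer following Eq. (9)."""
    s1, s2, s3 = rng.random(), rng.random(), rng.random()
    step = s1 * ((ub - lb) * s2 + lb)
    sign = 1.0 if s3 >= 0.5 else -1.0
    # Assumption: clip each coordinate so the deer stays inside [lb, ub].
    return [min(ub, max(lb, x + sign * step)) for x in position]

rng = random.Random(42)          # fixed seed only to make the demo repeatable
candidate = roar([0.2, 0.8], 0.0, 1.0, rng)
print(candidate)
```

Passing a seeded `random.Random` instance keeps the sketch reproducible; any object with a `random()` method returning values in $[0, 1)$ would work in its place.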

Further using Eq. (10), we calculate the number of commanders, where $\alpha \in [0, 1]$ refers to a random number and $N^{MRD}$ refers to the total number of MRDs. Similarly, $N^{Com}$ refers to the number of commanders. Equivalently, the number of stags $N^{Stag}$ is defined as $N^{Stag} = N^{MRD} - N^{Com}$.

$$\begin{aligned} N^{Com} = round(\alpha . N^{MRD}) \end{aligned}$$

(10)

The fighting behavior between commanders and stags that leads to two offspring is expressed analytically using Eqs. (11) and (12) respectively. Please take note that $b_1$ and $b_2$ are uniformly distributed random numbers between 0 and 1.

$$\begin{aligned} q_{new1} = \frac{(Com + Stag)}{2} + b_1\times ((UB-LB)\times b_2 + LB) \end{aligned}$$

(11)

$$\begin{aligned} q_{new2} = \frac{(Com + Stag)}{2} - b_1\times ((UB-LB)\times b_2 + LB) \end{aligned}$$

(12)
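A short sketch of Eqs. (10) to (12) follows. It assumes the males are already sorted from best to worst so that the first $N^{Com}$ become commanders; that ordering, and the names `split_males` and `fight`, are assumptions for illustration. Note that Python's built-in `round` rounds exact halves to the nearest even integer.

```python
import random

def split_males(males, alpha):
    """Split males into commanders and stags using Eq. (10)."""
    n_com = round(alpha * len(males))
    return males[:n_com], males[n_com:]   # assumes best-first ordering

def fight(commander, stag, lb, ub, rng=random):
    """Two candidate solutions from one fight, Eqs. (11) and (12)."""
    b1, b2 = rng.random(), rng.random()
    step = b1 * ((ub - lb) * b2 + lb)
    mid = [(c + s) / 2 for c, s in zip(commander, stag)]
    return [m + step for m in mid], [m - step for m in mid]

commanders, stags = split_males(list(range(10)), alpha=0.3)
new1, new2 = fight([1.0, 0.0], [0.0, 1.0], 0.0, 1.0, random.Random(1))
print(len(commanders), len(stags), new1, new2)
```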

Further, it develops a harem. It is a herd of hinds that a male commander captured. The Objective Fitness (OF) of the male commander determines the number of hinds in a harem. Therefore, using $V_n = \nu _n - Max_{i}{\nu _i}$, hinds are distributed among commanders, where $\nu _n$ is the power of the $n^{th}$ commander and $V_n$ is its normalized value. Using Eq. (13), the normalized power of the commander is calculated.

$$\begin{aligned} Pow_{n} = \left| \frac{V_n}{\sum _{i=1}^{N^{Com}}V_i}\right| \end{aligned}$$

(13)

Let $N^{Hind}$ be the total number of hinds; the number of hinds in the $n^{th}$ harem is then calculated as $N_{n}^{Harem} = round(Pow_{n}\times N^{Hind})$. Furthermore, a commander mates with $N_{n}^{Harem_{mate}} = round(\delta _1 \cdot N_{n}^{Harem})$ hinds of its own harem, where $\delta _1\in [0, 1]$ is an initial parameter giving the percentage of hinds in the same harem that become parents. The offspring produced by the mating process is defined in Eq. (14), where $c\in [0, 1]$ is a uniformly distributed random number.

$$\begin{aligned} q_{off} = \frac{(Com + Hind)}{2} + c\times (UB - LB) \end{aligned}$$

(14)
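The harem allocation of Eq. (13) and the mating step of Eq. (14) might be sketched as below. The equal-share fallback when all commanders have the same power is an assumption added to avoid division by zero, and the function names and sample values are hypothetical. Because each harem size is rounded separately, the sizes need not add up exactly to $N^{Hind}$.

```python
import random

def harem_sizes(powers, n_hinds):
    """Number of hinds per commander from V_n and Eq. (13)."""
    v = [p - max(powers) for p in powers]
    total = sum(v)
    if total == 0:               # assumption: equal powers share evenly
        return [round(n_hinds / len(powers))] * len(powers)
    return [round(abs(x / total) * n_hinds) for x in v]

def mate(commander, hind, lb, ub, rng=random):
    """Offspring of a commander and a hind following Eq. (14)."""
    c = rng.random()
    return [(a + b) / 2 + c * (ub - lb) for a, b in zip(commander, hind)]

sizes = harem_sizes([1.0, 2.0, 4.0], n_hinds=10)
child = mate([0.2, 0.4], [0.6, 0.8], 0.0, 1.0, random.Random(7))
print(sizes, child)
```

With this normalization the commander holding the maximum $\nu $ receives $V_n = 0$ and therefore no hinds, which reads naturally when a lower objective value is better.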

Moreover, a commander mates with $N_{k}^{Harem_{mate}} = round(\delta _2 \cdot N_{k}^{Harem})$ hinds of another harem, where $\delta _2\in [0, 1]$ is an initial parameter giving the percentage of hinds in that harem that become parents. Here, k refers to a randomly chosen harem. Finally, each remaining stag mates with its nearest hind. The distance function defined in Eq. (15) is used by a stag to find the nearest hind, where $d_i$ denotes the distance between the $i^{th}$ hind and the stag, and J refers to the set of dimensions of the search space. The flow diagram of RDO is presented in Fig. 2.

$$\begin{aligned} d_i = \left\{ \sum _{j\in J}\left( q_j^{Stag} - q_j^{Hind_i}\right) ^2\right\} ^{1/2} \end{aligned}$$

(15)
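Equation (15) is the Euclidean distance, so the nearest hind can be found with the standard library function `math.dist` (available from Python 3.8). The helper name `nearest_hind` and the sample coordinates are illustrative only.

```python
import math

def nearest_hind(stag, hinds):
    """Index of the hind with the smallest Euclidean distance d_i, Eq. (15)."""
    return min(range(len(hinds)), key=lambda i: math.dist(stag, hinds[i]))

# Hypothetical positions in a two-dimensional search space.
stag = [0.0, 0.0]
hinds = [[3.0, 4.0], [1.0, 1.0], [0.0, 2.0]]
idx = nearest_hind(stag, hinds)
print(idx, math.dist(stag, hinds[idx]))
```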

Overview of proposed research

This section outlines the four-phased research design that is anticipated. At the early stage, a medicinal record system for hepatitis B is gathered. The medicinal record system demonstrates that independent of decisions, the conditional parameter values of different patients hold the same. Physicians’ differing opinions are the main reason and it ultimately results in uncertain data analysis. As clinical information systems involve uncertainty, uncovering hidden information can be difficult. As a result, when analyzing data, it is crucial to cope with ambiguous and partial information in classification. Hence, the main goal of this phase is to remove missing data and noise. The proposed Rough Set Red Deer Optimization (RSRDO) algorithm is used to further examine the processed decision system in the next phase. It aids in locating the prime features that influence the decision system. In the third phase, RS is used to develop the clinical information retrieval system. The decision rules are further validated during the fourth phase of the research design. Figure 3 displays the planned research’s block diagram.

In order to infer knowledge and develop a clinical information retrieval system, this study uses an integrated data analysis procedure that combines RS and Red Deer Optimization (RDO) algorithms. The proposed RSRDO controls the uncertainties that occur in the clinical decision system. Another goal of this study is to achieve high classification accuracy with fewer conditional features. The medicinal record system for hepatitis B disease is used to build a clinical information retrieval system using the proposed integrated technique RSRDO. Each deer is assumed to search for the best parameter subset, encoded as a binary bit string of length m. Here, m is the number of conditional parameters in the medicinal record system. If a component value in the solution vector is 1, the related conditional attribute is chosen. Similarly, if the component value is 0, the conditional attribute is not chosen for developing the clinical information retrieval system. The fitness of each solution vector is determined using the fitness function stated in Eq. (16).

$$\begin{aligned} Fitness\hspace{0.1cm} f(q) = \alpha \gamma (q) + \beta \left( 1 - \frac{m_{s}}{m_c}\right) \end{aligned}$$

(16)

In Eq. (16), $\gamma (q)$ is the degree of dependency described in Eq. (5). The terms $m_s$ and $m_c$ stand for the number of selected parameters and the total number of parameters, respectively. The weight $\alpha $ is applied to the degree of dependency, whereas $\beta $ is the weight of the feature-reduction term. It is to be noted that $\alpha + \beta = 1$. Besides, the value of $\alpha $ must be high. The fitness function identifies the most relevant attributes pertaining to hepatitis B. The building blocks of the suggested RSRDO algorithm are defined in the earlier sections.
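A minimal sketch of the fitness function in Eq. (16) is shown below. The degree of dependency is passed in as a callable `gamma`, and the value $\alpha = 0.9$ is an assumed example of a "high" weight; the text does not report the value actually used. The stand-in `toy_gamma` is purely hypothetical and replaces a real rough set computation.

```python
def fitness(mask, gamma, alpha=0.9):
    """Eq. (16): alpha * gamma(q) + beta * (1 - m_s / m_c), with beta = 1 - alpha."""
    beta = 1.0 - alpha
    m_s, m_c = sum(mask), len(mask)   # selected and total conditional features
    return alpha * gamma(mask) + beta * (1 - m_s / m_c)

def toy_gamma(mask):
    """Hypothetical stand-in: full dependency once features 0 and 3 are kept."""
    return 1.0 if mask[0] and mask[3] else 0.5

small = fitness([1, 0, 0, 1], toy_gamma)
large = fitness([1, 1, 1, 1], toy_gamma)
print(small, large)
```

Both masks reach the same dependency here, so the smaller subset scores higher, which is the behavior the second term of Eq. (16) is meant to encourage.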

The clinical information retrieval system for hepatitis B divides the condition into two groups: acute and chronic. This exploratory research uses the proposed RSRDO technique to find the important features, and RS to build the clinical information retrieval system from the values of those features. The decision rules produced by the integrated approach thus assist doctors in making the correct decision. At the same time, the approach can save time and money and may help save the life of a patient.

The primary conditional features in the hepatitis B disease decision system are established using the RSRDO technique. Irrelevant features are then eliminated from the decision table and decision rules are generated. Domain experts also confirm that the reduced decision system is suitable for building a clinical information retrieval system. The reduced medicinal record system is partitioned into a training section and a testing section. A total of 70% of the data are in the training section, while 30% are in the testing section. RS decision rule generation is applied to the training section. Using Eq. (6), each decision rule’s accuracy is calculated. Moreover, each decision rule’s support, non-support, and strength are calculated. For the creation of the clinical information retrieval system, a threshold of 65% is considered. Any decision rule whose accuracy is less than 65% is discarded.
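The partitioning and thresholding steps can be illustrated with a short sketch. The text does not state how the 70/30 partition is drawn, so the seeded random shuffle below is an assumption, as are the function names and the dictionary layout of a rule.

```python
import random

def split_records(records, train_fraction=0.7, seed=0):
    """Shuffle and split records into training and testing sections."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)   # assumed: random partition
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def keep_rules(rules, threshold=0.65):
    """Discard decision rules whose accuracy is below the threshold."""
    return [r for r in rules if r["accuracy"] >= threshold]

train, test = split_records(list(range(643)))
# Hypothetical rules with made-up accuracies.
rules = [{"id": 1, "accuracy": 0.90},
         {"id": 2, "accuracy": 0.60},
         {"id": 3, "accuracy": 0.65}]
kept = keep_rules(rules)
print(len(train), len(test), [r["id"] for r in kept])
```

For 643 records, a 70% cut rounds to 450 training records and leaves 193 for testing.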

The clinical information retrieval system is further examined using 30% of testing data in the validation phase. Various measures like recall (Recl.), precision (Precn.), accuracy (Acc.), and F-score are considered for obtaining the accuracy of the model. The F-score is computed to balance precision, and recall, and to analyze uneven data classification. These numerous measures are defined mathematically using Eqs. (17), (18), (19), and (20) respectively. It employs the terms $t^p$, $f^p$, $t^n$, and $f^n$ for true positive, false positive, true negative, and false negative respectively.

$$\begin{aligned} Recall\hspace{0.1cm} (Recl.) = \frac{|t^p|}{|t^p + f^n|} \end{aligned}$$

(17)

$$\begin{aligned} Precision\hspace{0.1cm} (Precn.) = \frac{|t^p|}{|t^p + f^p|} \end{aligned}$$

(18)

$$\begin{aligned} F\text{-}score = 2 \times \left( \frac{Precn. \times Recl.}{Precn. + Recl.}\right) \end{aligned}$$

(19)

$$\begin{aligned} Accuracy\hspace{0.1cm} (Acc.) = \frac{|t^p + t^n|}{|t^p + f^p + t^n + f^n|} \end{aligned}$$

(20)
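Equations (17) to (20) translate directly into code. In the sketch below the confusion-matrix counts are hypothetical and do not come from the paper's tables; the function does not guard against zero denominators.

```python
def classification_metrics(tp, fp, tn, fn):
    """Recall, precision, F-score and accuracy as in Eqs. (17) to (20)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * (precision * recall) / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"recall": recall, "precision": precision,
            "f_score": f_score, "accuracy": accuracy}

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```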

Experimental research on hepatitis B

In this section, a clinical investigation of hepatitis B is described. The hepatitis B virus (HBV) is the cause of hepatitis B, a deadly liver illness that significantly affects world health. It can lead to persistent illness and significantly increases the risk of death from cirrhosis and liver cancer. It is found in the liver. In addition to managing blood sugar levels and detoxifying the body, the liver also manages digestion, energy production, glycogen storage, and other bodily functions. Hepatitis affects the cells in the liver tissue and compromises their functionality. There are several types of hepatitis, including hepatitis A, B, C, D, and E. Of these, hepatitis B is the most common liver infection worldwide. It is mainly spread by using razors that have been used by an infected person, injecting drugs with an infected syringe, and contact with infectious bodily fluids. Common symptoms include jaundice, fever, skin itching, lack of appetite, weakness, ascites, abnormal blood clotting, dark urine, headache, pale stools, joint pain, and stomach bleeding. Figure 4 displays the hepatitis B signs and symptoms.

The features of the hepatitis B disease and their categories are listed in Table 2 below. The data set was gathered from the UCI repository³⁶. Additionally, medical records from some primary health centers in West Bengal, India, are taken into account for the analysis. The decision system has one decision parameter $a_d$ and 19 conditional features $p_1, p_2, \ldots , p_{19}$. These conditional features are further categorized with assistance from expert physicians. For instance, alk phosphate is divided into four groups: 26–96; 96.1–147; 147.1–194; and 194.1–295. These groups are labeled 1, 2, 3, and 4, respectively, for analysis. The data analysis is unaffected by this representation.

Table 2 Features of hepatitis B and its categorization.

The hepatitis B medicinal records are divided into two classifications, chronic and acute, according to the judgment. In this experimental study to create a clinical information retrieval system, an integrated RSRDO approach is used. Based on the values of the conditional features, a specific judgment is taken into consideration for each patient. Therefore, using the RSRDO feature selection algorithm and rough set is crucial to produce decision rules. This aids the clinical information retrieval system in making a preliminary diagnosis of an illness. A sample medicinal record system is illustrated in Table 3.

Table 3 Medicinal record system of hepatitis B disease.

Result analysis of proposed model

The investigation is conducted on a computer system with an Intel Core i5-4200U CPU running at 1.60 GHz and 2.30 GHz, Windows 10, and 32GB of RAM. The evaluation is implemented in Python. The proposed RSRDO procedure is used to find the significant features in the data of 643 patients. A total of 1000 iterations are considered for the investigation. Besides, 10 runs are carried out and each feature’s significance is calculated in every run. The average over all runs is then computed to obtain the overall significance of each feature. The significance of each feature for all 10 runs is presented in Table 4. The primary features are those whose significance values lie above the trend line. Figure 5 exhibits the feature significance obtained by the proposed RSRDO algorithm. Nine features in all have been chosen for analysis: gender ($p_2$), steroid ($p_3$), fatigue ($p_5$), anorexia ($p_7$), palpable spleen ($p_{10}$), histology ($p_{14}$), bilirubin ($p_{15}$), SGOT ($p_{17}$), and albumin ($p_{18}$). The remaining ten features are removed from the information system: age ($p_1$), antivirals ($p_4$), malaise ($p_6$), liver big ($p_8$), liver firm ($p_9$), spiders ($p_{11}$), ascites ($p_{12}$), varices ($p_{13}$), alk phosphate ($p_{16}$), and protime ($p_{19}$).

Table 4 Significance of characteristics in each run.

The reduced medicinal record system is split into 70% (450) training data and 30% (193) testing data for building a clinical information retrieval system with the RS. The training set includes 137 acute instances and 313 chronic instances of hepatitis B. The RS technique is applied to these 450 training records to generate decision rules. The decision rules of the hepatitis B decision system, created from the training data, are shown in Tables 5 and 6. Decision rules with an accuracy of less than 65% are discarded as candidate rules. The remaining decision rules are then validated using the 193 testing records.

Table 5 Training decision rules of hepatitis B confining proposed RSRDO.

Table 6 Training decision rules of hepatitis B confining proposed RSRDO (continued).

The training data yield 67 decision rules. It is evident from Tables 5 and 6 that 5 rules have an accuracy below the specified threshold of 65%. Hence, these 5 rules are removed. The remaining 62 decision rules are then analyzed using 193 (30%) medical records, comprising 84 records of acute cases and 109 records of chronic cases. The testing analysis is presented in Tables 7 and 8.

Table 7 Testing decision rules of hepatitis B confining proposed RSRDO.

Table 8 Testing decision rules of hepatitis B confining proposed RSRDO (continued).

Tables 7 and 8 show that the rules 3, 9, 10, 19, 29, 32, 44, 60, and 62 are removed because of accuracy lower than 65%. It is evident that decision rules are reduced by 14.52% as a result of the testing study. The confusion matrix is also developed in order to assess the correctness of the suggested RSRDO procedure. The confusion matrix for the RSRDO procedure is shown in Table 9. It is seen that the RSRDO procedure has acquired a 91.7% accuracy level.
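The reduction figure follows from the rule counts reported above: 9 of the 62 candidate rules are discarded during testing, which leaves 53 rules.

$$\begin{aligned} \frac{9}{62}\times 100\% \approx 14.52\%, \qquad 62 - 9 = 53 \end{aligned}$$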

Table 9 Performance measure of RSRDO over hepatitis B disease.

Comparison analysis

This section compares the RSRDO procedure with the Decision Tree (DT) procedure, the RS procedure, and the Red Deer Optimization-Decision Tree (RDODT) procedure. The individual comparisons are given in the subsections that follow.

Comparison analysis with RS

Using the 450 patient records of the training data, RS data analysis is performed, taking into account all hepatitis B features. The RS data analysis generates 34 decision rules³⁷. Since every rule has an accuracy of at least 65%, each is regarded as a candidate rule. The decision rules created using the RS are shown in Table 10. The RS model thus produces fewer decision rules (34) than the RSRDO model, which yields 62 candidate rules from the same training data.

Table 10 Training decision rules of hepatitis B confining RS.

Using the data of 193 patients from the testing dataset, the 34 decision rules are examined in further detail. All the decision rules have an accuracy of more than 65% and hence are retained. Moreover, the confusion matrix of the RS model, which takes into account all classes, is computed and shown in Table 11. Table 11 demonstrates that the RS model provides an accuracy of 88.9%, whereas the RSRDO predictive accuracy is 91.7%. The proposed RSRDO approach is therefore 2.8 percentage points more accurate than the RS model. Figure 6 shows the different measures for the two models, RSRDO and RS, to make them easier to compare.

Table 11 Performance measure of RS over hepatitis B disease.

Comparison analysis with DT

The outcomes of the proposed RSRDO approach and the Decision Tree (DT) approach are compared in this section³⁸. The DT approach is used to produce the decision rules while taking into account all 19 features. The DT algorithm attains an accuracy of 82.9%. Figure 7 displays the decision rules that the DT procedure generated. It follows that the RSRDO model is 8.8 percentage points more accurate than the DT procedure. Similarly, the RS procedure is 6.0 percentage points more accurate than the DT procedure.

Comparison analysis with RDO-DT approach

The outcomes of the proposed RSRDO approach and the RDO-Decision Tree (RDODT) procedure are compared in this section. The DT approach is used to construct the decision rules while taking into account only the features chosen by the RDO approach. Using the RDODT technique, an accuracy of 88.6% is attained. The decision rules produced by the RDODT procedure are shown in Fig. 8. The accuracy of the DT approach (82.9%) is thus about 5.7 percentage points lower than that of the RDODT procedure. In a similar vein, the RDODT model is 0.3 percentage points less accurate than the RS approach. However, the proposed RSRDO procedure is 3.1 percentage points more accurate than the RDODT procedure.

A comparative analysis of all models relating to recall, precision, f-measure, accuracy is presented in Table 12. From the analysis, it is clear that the proposed model RSRDO performs better across all the measures.

Table 12 Comparative performance measure of all models over hepatitis B disease.

Research contributions and limitations of the study

This section highlights the research contributions and limitations of this work. In this study, the following contributions have been made.

1. To help doctors diagnose hepatitis B, a novel clinical information retrieval system that combines the RS and RD algorithms has been outlined.
2. Using a hepatitis B clinical information system, a novel feature selection approach that integrates RD optimization and the degree of dependency of the RS is presented and examined.
3. The experimental effectiveness of the proposed RSRDO model is assessed on the diagnosis of hepatitis B.
4. In terms of accuracy, the suggested RSRDO approach is also compared with the RS, DT, and RDODT models. RSRDO is found to outperform the other models despite having the fewest features in the decision system.
5. In comparison to the RS model, the suggested RSRDO model produces 55.9% more decision rules while achieving a high accuracy of 91.7%.

Limitations of the experimental research

A clinical information retrieval system is developed by integrating an RS and the red deer algorithm in the suggested research study. RS data analysis supports qualitative data. Thus, with the assistance of specialized professionals, an attempt has been made to convert the information system’s continuous data values to qualitative information. A fuzzy RS might handle this more effectively without consulting specialized professionals. Similarly, the RSRDO algorithm does not balance local and global search in feature selection. This is because the roaring of MRDs in the search space promotes local search exploitation, and the fighting between stags and commanders likewise acts as a local search to provide improved solutions. These two points are the main limitations of the research work and may be addressed in further studies.

The proposed RSRDO algorithm is a meta-heuristic. By employing a data partition approach, the problem can be scaled to larger datasets. In addition, RS supports parallel processing. However, no meta-heuristic algorithm provides an optimal solution to every problem; in general, meta-heuristic algorithms are problem specific. The proposed RSRDO algorithm therefore may not provide an optimal solution to all problems. This can be studied in future research and, at present, is considered a limitation.

Conclusions

Healthcare data are being gathered at an ever-increasing rate, and such data frequently exhibit uncertainty. Analyzing these data and extracting useful information takes considerable effort. To this end, this work introduces the RSRDO clinical information retrieval system, which combines RS and RDO for disease diagnosis. The integrated RSRDO system is examined on a hepatitis B diagnosis system and compared with the classic RS, DT, and RDODT models. The proposed RSRDO procedure outperforms all of these with an accuracy of 91.7%; the RS, DT, and RDODT procedures obtain 88.9%, 82.9%, and 88.6%, respectively. RSRDO is thus 2.8 percentage points more accurate than RS, 8.8 percentage points more accurate than DT, and 3.1 percentage points more accurate than RDODT. Furthermore, the proposed RSRDO model generates 55.9% more rules than the conventional RS model.
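The accuracy comparison above can be checked with a few lines of arithmetic. The sketch uses only the four accuracies reported in this section; the helper names are assumptions made for illustration. It also shows that the reported gains are absolute differences in accuracy rather than relative improvements.

```python
# Accuracies (%) as reported in the conclusions.
accuracy = {"RSRDO": 91.7, "RS": 88.9, "DT": 82.9, "RDODT": 88.6}

def gain_in_points(model, baseline, scores=accuracy):
    """Absolute accuracy difference, rounded to one decimal place."""
    return round(scores[model] - scores[baseline], 1)

def relative_gain(model, baseline, scores=accuracy):
    """Accuracy difference as a percentage of the baseline accuracy."""
    return (scores[model] - scores[baseline]) / scores[baseline] * 100

# Absolute gains of RSRDO over each baseline model.
gains = {name: gain_in_points("RSRDO", name) for name in accuracy if name != "RSRDO"}
relative_vs_rs = relative_gain("RSRDO", "RS")
```

Here `gains` reproduces the 2.8, 8.8, and 3.1 figures quoted in the text, while the relative gain over RS is slightly larger than the absolute one because it is divided by a baseline below 100%.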

This research has several practical advantages. The proposed clinical information retrieval system uses a minimum number of features for the diagnosis of hepatitis B. This helps reduce the cost to patients of diagnosing hepatitis B. In addition, the system generates more rules with fewer features. These rules help the physician look deeper into the hepatitis B diagnosis. As a result, the disease can be detected at an early stage, which can save the life of a patient. It is anticipated that the findings of this research will help doctors choose the best course of action.

Several features in the information system have continuous values. These continuous values were categorized with the assistance of domain experts. Improved accuracy could therefore result from hybridizing fuzzy RS and RDO. Furthermore, the acquired decision rules can be subjected to formal concept analysis to identify the primary influencing factors. This is intended as a future line of research.

Data availability

The supporting data set was gathered from the UCI repository³⁶. Additionally, medical records from some primary health centers in West Bengal, India, are taken into account for the analysis. The UCI repository data is available in a publicly accessible repository, whereas the data collected from primary health centers in West Bengal, India will be made available on request due to restrictions such as privacy or ethics.

References

1. Lavrač, N. Selected techniques for data mining in medicine. Artif. Intell. Med. 16, 3–23. https://doi.org/10.1016/S0933-3657(98)00062-1 (1999).
2. Poli, R., Kennedy, J. & Blackwell, T. Particle swarm optimization: An overview. Swarm Intell. 1, 33–57. https://doi.org/10.1007/s11721-007-0002-0 (2007).
3. Li, X.-L. An optimizing method based on autonomous animats: Fish-swarm algorithm. Syst. Eng. Theory Pract. 22, 32–38 (2002).
4. Yang, X.-S. & He, X. Bat algorithm: Literature review and applications. Int. J. Bio-inspir. Comput. 5, 141–149. https://doi.org/10.1504/IJBIC.2013.055093 (2013).
5. Mirjalili, S. & Lewis, A. The whale optimization algorithm. Adv. Eng. Softw. 95, 51–67. https://doi.org/10.1016/j.advengsoft.2016.01.008 (2016).
6. Fathollahi-Fard, A. M., Hajiaghaei-Keshteli, M. & Tavakkoli-Moghaddam, R. Red deer algorithm (RDA): A new nature-inspired meta-heuristic. Soft. Comput. 24, 14637–14665. https://doi.org/10.1007/s00500-020-04812-z (2020).
7. Jain, M., Saihjpal, V., Singh, N. & Singh, S. B. An overview of variants and advancements of PSO algorithm. Appl. Sci. 12, 8392. https://doi.org/10.3390/app12178392 (2022).
8. Dubois, D. & Prade, H. Fuzzy sets, probability and measurement. Eur. J. Oper. Res. 40, 135–154. https://doi.org/10.1016/0377-2217(89)90326-3 (1989).
9. Molodtsov, D. Soft set theory-first results. Comput. Math. Appl. 37, 19–31. https://doi.org/10.1016/S0898-1221(99)00056-5 (1999).
10. Pawlak, Z. Rough set theory and its applications to data analysis. Cybern. Syst. 29, 661–688. https://doi.org/10.1080/019697298125470 (1998).
11. Pawlak, Z. & Skowron, A. Rudiments of rough sets. Inf. Sci. 177, 3–27. https://doi.org/10.1016/j.ins.2006.06.003 (2007).
12. Wang, G.-Y. et al. A survey on rough set theory and applications. Chin. J. Comput. 32, 1229–1246 (2009).
13. Pawlak, Z. AI and intelligent industrial applications: The rough set perspective. Cybern. Syst. 31, 227–252. https://doi.org/10.1080/019697200124801 (2000).
14. Morsi, N. N. & Yakout, M. M. Axiomatics for fuzzy rough sets. Fuzzy Sets Syst. 100, 327–342. https://doi.org/10.1016/S0165-0114(97)00104-8 (1998).
15. Dubois, D. & Prade, H. Rough fuzzy sets and fuzzy rough sets. Int. J. General Syst. 17, 191–209. https://doi.org/10.1080/03081079008935107 (1990).
16. Kong, G., Xu, D.-L. & Yang, J.-B. Clinical decision support systems: A review on knowledge representation and inference under uncertainties. Int. J. Comput. Intell. Syst. 1, 159–167. https://doi.org/10.1080/18756891.2008.9727613 (2008).
17. Pawlak, Z. Rough set approach to knowledge-based decision support. Eur. J. Oper. Res. 99, 48–57. https://doi.org/10.1016/S0377-2217(96)00382-7 (1997).
18. Li, R. & Wang, Z.-O. Mining classification rules using rough sets and neural networks. Eur. J. Oper. Res. 157, 439–448. https://doi.org/10.1016/S0377-2217(03)00422-3 (2004).
19. Jelonek, J., Krawiec, K. & Slowiński, R. Rough set reduction of attributes and their domains for neural networks. Comput. Intell. 11, 339–347. https://doi.org/10.1111/j.1467-8640.1995.tb00036.x (1995).
20. Khoo, L.-P. & Zhai, L.-Y. A prototype genetic algorithm-enhanced rough set-based rule induction system. Comput. Ind. 46, 95–106. https://doi.org/10.1016/S0166-3615(01)00117-8 (2001).
21. Acharjya, D. P. & Abraham, A. Rough computing-a review of abstraction, hybridization and extent of applications. Eng. Appl. Artif. Intell. 96, 103924. https://doi.org/10.1016/j.engappai.2020.103924 (2020).
22. Wang, X., Yang, J., Teng, X., Xia, W. & Jensen, R. Feature selection based on rough sets and particle swarm optimization. Pattern Recogn. Lett. 28, 459–471. https://doi.org/10.1016/j.patrec.2006.09.003 (2007).
23. Inbarani, H. H., Azar, A. T. & Jothi, G. Supervised hybrid feature selection based on PSO and rough sets for medical diagnosis. Comput. Methods Programs Biomed. 113, 175–185. https://doi.org/10.1016/j.cmpb.2013.10.007 (2014).
24. Lakhan, A., Mohammed, M. A., Abdulkareem, K. H., Hamouda, H. & Alyahya, S. Autism spectrum disorder detection framework for children based on federated learning integrated CNN-LSTM. Comput. Biol. Med. 166, 107539 (2023).
25. Al-Fahdawi, S. et al. Fundus-deepnet: Multi-label deep learning classification system for enhanced detection of multiple ocular diseases through data fusion of fundus images. Inf. Fusion 102, 102059 (2024).
26. Mohammed, M. A., Lakhan, A., Abdulkareem, K. H. & Garcia-Zapirain, B. Federated auto-encoder and xgboost schemes for multi-omics cancer detection in distributed fog computing paradigm. Chemom. Intell. Lab. Syst. 241, 104932 (2023).
27. Guyon, I. & Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003).
28. Chandrashekar, G. & Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 40, 16–28 (2014).
29. Cai, J., Luo, J., Wang, S. & Yang, S. Feature selection in machine learning: A new perspective. Neurocomputing 300, 70–79 (2018).
30. Świniarski, R. W. Rough sets methods in feature reduction and classification. Int. J. Appl. Math. Comput. Sci. 11, 565–582 (2011).
31. Nahato, K. B., Harichandran, K. N. & Arputharaj, K. Knowledge mining from clinical datasets using rough sets and backpropagation neural network. Comput. Math. Methods Med. 2015 (2015).
32. Kim, K.-J. & Jun, C.-H. Rough set model based feature selection for mixed-type data with feature space decomposition. Expert Syst. Appl. 103, 196–205 (2018).
33. Lu, Z., Qin, Z., Zhang, Y. & Fang, J. A fast feature selection approach based on rough set boundary regions. Pattern Recogn. Lett. 36, 81–88 (2014).
34. Zhang, Q., Xie, Q. & Wang, G. A survey on rough set theory and its applications. CAAI Trans. Intell. Technol. 1, 323–333. https://doi.org/10.1016/j.trit.2016.11.001 (2016).
35. Zitar, R. A., Abualigah, L. & Al-Dmour, N. A. Review and analysis for the red deer algorithm. J. Ambient. Intell. Humaniz. Comput. 14, 8375–8385. https://doi.org/10.1007/s12652-021-03602-1 (2023).
36. Hepatitis. UCI Machine Learning Repository, https://doi.org/10.24432/C5Q59J (1988).
37. Myles, A. J., Feudale, R. N., Liu, Y., Woody, N. A. & Brown, S. D. An introduction to decision tree modeling. J. Chemom. 18, 275–285. https://doi.org/10.1002/cem.873 (2004).
38. Quinlan, J. R. Decision trees and decision-making. IEEE Trans. Syst. Man Cybern. 20, 339–346. https://doi.org/10.1109/21.52545 (1990).

Funding

The research was funded by the School of Computer Science and Engineering, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India.

Author information

These authors contributed equally: Madhusmita Mishra and D. P. Acharjya.

Authors and Affiliations

Vellore Institute of Technology, School of Computer Science and Engineering, Vellore, 632014, India
Madhusmita Mishra & D. P. Acharjya

Contributions

M.M. conceptualized the problem and the methodology based on her research work. The hybridization of rough computing with bio-inspired computing is a challenging task and was carried out by M.M. The implementation and analysis were carried out by M.M. under the supervision of D.P.A. In addition, D.P.A. thoroughly reviewed the paper from the drafting stage onward. The figures, tables, and their presentation were prepared by D.P.A., who also provided guidance on expanding the comparative study with other techniques.

Corresponding author

Correspondence to D. P. Acharjya.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Mishra, M., Acharjya, D.P. A hybridized red deer and rough set clinical information retrieval system for hepatitis B diagnosis. Sci Rep 14, 3815 (2024). https://doi.org/10.1038/s41598-024-53170-5

Received: 11 October 2023
Accepted: 29 January 2024
Published: 15 February 2024
DOI: https://doi.org/10.1038/s41598-024-53170-5
Springer Nature Limited
