Key Points

Machine learning (ML) has the potential to assist with signal detection and supplement traditional pharmacovigilance (PV) surveillance methods

The objectives of this pilot study were to apply a proof-of-concept ML model to detect any potential new safety signals for two of AbbVie’s marketed products and to determine whether the model is capable of detecting safety signals earlier than humans

Our model was able to detect one true safety signal for Drug X six months earlier than humans; model optimization to further improve predictive performance may be needed before implementation

Pharmaceutical companies and health authorities should continue to explore the utilization of artificial intelligence (AI)/ML tools in their PV surveillance methods to aid with the vastly increasing amounts of complex safety data and contribute to the literature

Introduction

Pharmacovigilance (PV) is the science of identifying, evaluating, and minimizing the risk of adverse events (AEs) and providing timely risk mitigation measures to maintain a favorable benefit-risk profile of a drug product [1]. Safety surveillance is an integral part of PV that entails qualitative and quantitative analyses designed to identify previously unknown AEs or a new aspect of known AEs to ensure the ongoing safety of a drug product.

The safety signal management process is a set of activities performed to determine whether new risks associated with a drug product have emerged or whether known risks have changed their characteristics based on all available evidence stemming from relevant information sources [2]. Information arising from one or multiple sources, including observations and experiments, that may suggest a new potentially causal association or a new aspect of a known association between an intervention and an AE or set of related AEs is considered a safety signal that must undergo assessment for validation or refutation as part of the signal management process [3]. Contributing information sources may include, but are not limited to, individual case safety reports (ICSRs), aggregate data from active surveillance systems or studies, data mining scores, and scientific literature reviews. Figure 1 outlines the signal management process.

Fig. 1 Workflow of the signal management process

However, challenges to performing PV surveillance activities for signal detection, such as the rapidly growing volume of data, increasing data sources (e.g., social media), delay in real-time availability of data, and systems’ susceptibility to human errors, continue to pervade the pharmaceutical industry. With the public release and increasing utilization of machine learning (ML) or artificial intelligence (AI) technologies like OpenAI’s ChatGPT, Google’s Bard, and Microsoft’s Bing Chat to help with efficiency-related tasks, these PV challenges have also transformed into opportunities within the pharmaceutical industry to use AI technologies as smart tools to assist with surveillance activities.

There are limited industry-wide standards for applying ML technologies in PV. Nonetheless, the number of ICSRs in PV databases is increasing, and given that individual case review is time-consuming, new approaches should be explored to help prioritize which reports require further evaluation by safety teams [4]. A previously published article revealed that ML algorithms have better accuracy in detecting new signals than traditional disproportionality analysis methods, which typically encounter issues such as background noise and the possibility of generating false-positive results [5]. In other words, the promising results delivered by the available ML technologies in the public domain reinforce the potential that AI has in assisting with safety signal detection and in supplementing traditional PV surveillance methods. Additionally, the results of a systematic literature review of recently published industry-based research on the use of ML for safety purposes highlighted the low number of papers verified to be industry contributions (8%, n = 33/393). Of these industry papers, only 18% (n = 6/33) described signal detection and data ingestion [6]. Although there is exponentially growing interest in using ML approaches to overcome barriers in the pharmaceutical industry, there is not enough published guidance and precedent for exploring the increasing potential of ML technologies to overcome those barriers.

The objectives of this pilot study were to apply a proof-of-concept ML model to detect any potential new safety signals for two of AbbVie’s marketed products and to determine whether the model is capable of detecting safety signals earlier than humans.

Methods

Definitions

A “true signal” is defined as any AE determined to be causally associated with the drug product and included in the drug product label at the time of data analysis (i.e., an adverse drug reaction [ADR]). A “potential new signal” is defined as an AE that might be caused by the drug product and requires further assessment. This assessment is a manual process that includes the review of all cases (i.e., ICSRs) that reported that specific AE with the use of that specific drug product. From this assessment, the signal is either confirmed (when a causal association between the AE and the drug is established) or refuted (when a causal association between the AE and the drug cannot be established).

Individual AEs are coded at the preferred term (PT) level in the Medical Dictionary for Regulatory Activities (MedDRA) [7].

Scheme of Machine Learning Pipeline

The workflow of our ML pipeline began with splitting the whole dataset for each drug into a training set and a test set. The training set was used for model training and hyperparameter tuning, and the test set was held out separately to evaluate model performance after the final model was obtained from the training set. On the training set, threefold cross-validation was conducted to select the best model algorithm and hyperparameter setting; cross-validation also helped control over-fitting arising from the high complexity of ML algorithms and the large number of features provided during model training and selection. The final model was then trained on the whole training set with the selected best hyperparameter setting. Finally, model performance was evaluated on the held-out test set. Figure 2 illustrates the workflow of our ML pipeline. For each drug, this pipeline was applied separately to generate model performance results and any potential new signals for further manual assessment by humans.

Fig. 2 Workflow of the machine learning pipeline for potential new signal detection. ADR adverse drug reaction (i.e., true signal), PT preferred term

Data Source and Train-Test Splitting

For Drug X (a mature product), post-marketing (PM) data from 2017 to 2018 were extracted from AE reports at the PT level, including demographic features of the patients (e.g., age, race, country) and characteristics of AEs (e.g., event seriousness, event outcome, time to event), and used for both the training and testing sets. For Drug Y, clinical trial data from Phase 3 trials were extracted for the training set, and PM data from 2021 to 2022 were extracted for the testing set. During the collection of PTs for signal classification, we retained PTs with ≥ 5 occurrences, as well as PTs with < 5 occurrences that included at least one serious report. As the patient data reviewed did not contain identifiable information, no informed consent, ethics committee, or Institutional Review Board approval was sought or required.

Given that raw safety data are retrieved at the case level, they were transformed to the granularity of PTs by aggregating the data of all cases associated with each PT. The following features were included for ML modeling after review by internal safety experts: number of occurrences of a PT, number of occurrences of all other PTs, gender, age group, country, event seriousness, event outcome, change in dose due to AE, reporter event causality to drug, AE time-to-onset, and MedDRA System Organ Class. For each feature, aggregation was done by summing the number of cases or subjects (depending on whether the feature was at the case level or subject level) that belonged to each category of the feature for each PT. For missing data in a case, a new “unknown” category was created and assigned to the case [8].
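As a minimal sketch of this case-to-PT aggregation (the field names and values below are invented for illustration and are not the study's actual data schema), each PT accumulates, per feature, the count of cases falling into each category, with missing values mapped to an explicit “unknown” category:

```python
from collections import defaultdict

# Hypothetical case-level records; None marks a missing value.
cases = [
    {"pt": "Headache", "seriousness": "serious",     "outcome": "recovered"},
    {"pt": "Headache", "seriousness": None,          "outcome": "recovered"},
    {"pt": "Nausea",   "seriousness": "non-serious", "outcome": None},
]

# One feature vector of category counts per PT.
pt_features = defaultdict(lambda: defaultdict(int))
for case in cases:
    pt = case["pt"]
    pt_features[pt]["n_occurrences"] += 1
    for feature in ("seriousness", "outcome"):
        # Missing data are assigned to a dedicated "unknown" category.
        category = case[feature] if case[feature] is not None else "unknown"
        pt_features[pt][f"{feature}={category}"] += 1
```

The resulting per-PT count vectors are the kind of aggregated input the classifier consumes.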

The train-test splitting of the whole dataset for each drug was conducted sequentially, such that the data used for model training were collected prior to the data used for model testing. This sequential split mimics the process in which the model is trained on labeled true signals (i.e., labeled ADRs) and then applied to future AE data for safety signal monitoring. Moreover, this approach prevented information leakage from the use of future true new signals. Table 1 shows the breakdown of true signal PTs and non-signal PTs in the training and test sets. Because PTs in the training and test sets were retained according to the aforementioned occurrence and seriousness criteria, only a subset of the true signal PTs in the test set was available to the ML models during training, which made the assessment of model performance conservative.
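A sequential split of this kind can be sketched as follows (the cutoff date and PT names are invented; the point is only that every training report predates every test report):

```python
# Sequential (temporal) train-test split: reports received before the
# cutoff date train the model; later reports form the held-out test set.
# ISO-format date strings compare correctly as plain strings.
cutoff = "2018-01-01"
reports = [
    ("2017-03-10", "pt_a"),
    ("2017-11-02", "pt_b"),
    ("2018-06-20", "pt_c"),
]
train = [r for r in reports if r[0] < cutoff]
test = [r for r in reports if r[0] >= cutoff]
```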

Table 1 Number of signal vs. non-signal preferred terms in datasets for Drugs X and Y

When the test set was applied, each PT was predicted under one of the two categories: signal PT or non-signal PT. The signal PTs included either a true signal (i.e., labeled ADRs) or a potential new safety signal that would require further assessment by the safety team to confirm or refute the new signal.

Machine Learning Algorithms

Gradient boosting-based ML approaches [9,10,11,12] were chosen as the main modeling methodology because of their superior performance in a variety of data science prediction tasks. Boosting, based on the idea of combining a committee of iteratively trained weak classifiers (e.g., decision tree stumps) into a powerful final model, has been one of the most powerful learning ideas in ML over the last two decades. In contrast to the recently popularized deep learning approaches (e.g., deep neural networks), which often require training on very large datasets to achieve good performance, gradient boosting-based approaches still perform very well on smaller datasets. We used the implementation of gradient boosting algorithms in XGBoost [10], an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.

The tuning of the XGBoost algorithm was conducted through cross-validation, using a grid search over the following hyperparameters for Drug X: n_estimators with the grid [10, 20, 30, 50, 60, 75], max_depth with the grid [4, 6, 8, 10, 15, 20], and learning_rate with the grid [0.05, 0.10, 0.25, 0.50]. For Drug X, the final selected values were n_estimators = 50, max_depth = 4, and learning_rate = 0.10. The same hyperparameters and search grids were used for Drug Y, with final selected values of n_estimators = 30, max_depth = 4, and learning_rate = 0.10.
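The tuning loop can be sketched as follows. This is an illustrative reconstruction, not the study's code: scikit-learn's GradientBoostingClassifier stands in for XGBoost (it shares the n_estimators, max_depth, and learning_rate hyperparameters), the data are synthetic, and the grids are trimmed to subsets of those listed above to keep the sketch fast.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the aggregated PT-level data; 1 = signal PT.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 1.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Threefold cross-validation on the training set selects the setting;
# these grids are subsets of the full grids reported in the text.
param_grid = {
    "n_estimators": [10, 30, 50],
    "max_depth": [4, 8],
    "learning_rate": [0.05, 0.10, 0.25],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0), param_grid, cv=3
)
search.fit(X_train, y_train)  # refits the best setting on the full training set

# A single evaluation on the held-out test set concludes the pipeline.
test_accuracy = search.score(X_test, y_test)
```

GridSearchCV's default refit behavior matches the pipeline described in the text: after cross-validation, the best hyperparameter setting is retrained on the whole training set before the one-time test-set evaluation.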

Model Performance Evaluation

Model performance was first evaluated with objective quantitative metrics. Prediction accuracy is a good metric for calibrating classification performance on a non-rare outcome. However, accuracy can look overly optimistic for a poorly performing model when events are rare. For example, accuracy would be extremely high for a useless model that predicts every case to be a non-signal. We therefore used a pair of classification metrics, sensitivity (also known as true positive rate or recall) and positive predictive value (also known as precision), to measure model performance. These two metrics are defined in Table 2 based on the four types of classification results often presented in a so-called confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Sensitivity and positive predictive value (PPV) are defined as TP/(TP + FN) and TP/(TP + FP), respectively. Specificity and negative predictive value (NPV) are defined analogously as TN/(TN + FP) and TN/(TN + FN), respectively. For instance, for a rare signal detection problem with a 5% occurrence rate, a simple model based on a biased coin flip (with 5% vs. 95% probabilities of predicting signal vs. non-signal) yields only 5% for both sensitivity and PPV. In addition to the assessment based on quantitative classification metrics, a manual assessment of potential new signals was conducted by humans to evaluate the generalizability of the ML models.

Table 2 Binary classification confusion matrix with four types of results
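These definitions can be checked numerically. In the snippet below, the confusion-matrix counts are chosen to match the Drug X test-set figures reported in the Results section (4 of 8 true signals found among 12 flagged PTs, out of 694 PTs in total); the code itself is only an illustration of the formulas in Table 2.

```python
# Confusion-matrix counts (chosen to match the Drug X test-set results
# reported later in the paper).
tp, fp, tn, fn = 4, 8, 678, 4

sensitivity = tp / (tp + fn)            # recall / true positive rate
ppv = tp / (tp + fp)                    # precision
specificity = tn / (tn + fp)
npv = tn / (tn + fn)
accuracy = (tp + tn) / (tp + fp + tn + fn)
# sensitivity = 0.500, ppv ≈ 0.333, specificity ≈ 0.988, accuracy ≈ 0.983
```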

Strategy to Handle Imbalanced Labels

It is a well-known phenomenon that ML binary classifiers tend to be overwhelmed by the event making up the major portion of the training data and hence to overpredict the majority event [13]. An ML binary classifier can predict an event of interest (e.g., being a true signal) for any new instance given its covariate information. Behind the scenes, many ML binary classifiers are trained to predict a quantitative outcome on a scale between 0 and 1, representing the probability of the event of interest. To convert the predicted probability to a predicted binary outcome, it is common practice to apply a probability decision threshold above which the case is predicted as the event of interest. By default, this threshold is often chosen to be 0.5 for binary classification problems [14]; however, with imbalanced events, this leads to very low sensitivity for detecting the minority event, which is often the target of interest in a scientific investigation. In addition, with the default threshold, very few new signals tend to be generated, which undermines the purpose of a predictive surveillance ML system. Adjusting the probability decision threshold is a viable and widely used approach with good performance [15] that requires no model retraining with sample weighting or resampling of the training set. While the threshold can be selected by maximizing a mathematical criterion such as the F1 score [16], in our application setting we recommend choosing it based on a practical operational constraint, namely the limited resources devoted to manual review. In practice, because the goal is often to minimize the chance of missing a true safety signal, we recommend choosing the lowest probability threshold that generates as many potential signals as the manual review resource limit permits.
In our pilot study, the probability decision threshold was chosen to generate about ten potential new signals for manual review, for demonstration purposes, while achieving about 50% sensitivity and a 35% PPV rate.
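A minimal sketch of this budget-driven threshold choice (the scores below are random stand-ins for model-predicted probabilities; the budget of ten mirrors the pilot's setting):

```python
import numpy as np

# Random stand-ins for the model's per-PT predicted signal probabilities.
rng = np.random.default_rng(1)
predicted_probs = rng.uniform(size=300)

review_budget = 10  # number of PTs the safety team can manually assess

# Choose the threshold as the budget-th largest predicted probability,
# so the top `review_budget` PTs are flagged for manual review.
threshold = np.sort(predicted_probs)[-review_budget]
flagged = predicted_probs >= threshold
```

Lowering the budget raises the threshold (fewer, higher-confidence flags); raising it lowers the threshold, trading PPV for sensitivity as described above.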

Results

The ML model test set results for detection of true signals (i.e., labeled ADRs) for Drugs X and Y using PM data are displayed in Table 3.

Table 3 Test dataset performance for Drugs X and Y to detect true signals

For Drug X, the ML model identified four of eight true signals with 98.3% (n = 682/694) accuracy. Using six months of high-volume PM data, the model achieved 50.0% (n = 4/8) sensitivity and 98.8% (n = 678/686) specificity. The area under the receiver operating characteristic (ROC) curve (AUC) was 0.84, compared with 0.5 for an uninformative classifier. The ML model generated 12 potential new signals, of which four were true signals, for a PPV rate of 33.3%. Among the remaining eight potential new signals, one was confirmed as a signal and was detected six months earlier by the ML model than by humans performing standard surveillance activities.

For Drug Y, the ML model identified five of nine true signals with 95.9% (n = 284/296) accuracy. Using 12 months of relatively lower-volume PM data when compared to Drug X, the model achieved 55.6% (n = 5/9) sensitivity and 97.2% (n = 279/287) specificity. The AUC value for the ROC curve was 0.81, which is close to the value for Drug X. The ML model generated 13 potential new signals, out of which five were true signals, with a PPV rate of 38.5%. Among the remaining eight potential new signals, none were confirmed as true safety signals upon review by humans.

Among the important features in our trained gradient boosting models for true signal prediction, AE outcome and reporter causality ranked highest for both Drugs X and Y. Other features of interest, such as event seriousness and AE time to onset, were noted as top features for one drug or the other.

Upon manual review of the potential new signals identified by the model for both Drugs X (n = 8) and Y (n = 8), all potential new signals were refuted except for the one confirmed for Drug X. The refuted potential new signals were confounded by the drug’s indications, concurrent events or diseases, and/or concomitant medications.

To provide a more comprehensive understanding of XGBoost’s relative performance, we conducted a comparison with logistic regression, using forward variable selection to address multicollinearity among predictors. For Drug X, the logistic regression approach achieved 25.0% sensitivity and 33.3% specificity, with an AUC of 0.67. For Drug Y, the logistic regression approach could not achieve a precision > 30.0% because of poor performance. This comparison shows that additional predictive power can be achieved with a state-of-the-art machine learning approach over a traditional statistical approach.

Discussion

Signal detection describes a series of analytical approaches intended to identify changes in the pattern of incoming safety data. The analyses can be described as a comparison of an expected value to some observed value. Conventional signal detection methods already include quantitative assessments such as data mining, but the volume of information to be reviewed can be challenging, especially with individual case review. As such, AI/ML models can supplement existing PV processes to identify potential new signals in a timely manner.

This pilot study was undertaken to improve currently used routine PV surveillance methods, which are hampered by susceptibility to human error, further driven by several well-characterized limitations in the PM space, including the volume of data and increasing number of data sources. In contrast, in early product development, detection of signals may be limited because of the low volume of human data to inform on a safety issue. Furthermore, there may be few other marketed products in the same class (or none if the asset is first in class) and therefore limited published literature about ongoing studies and safety issues for related products. The current process, if a potential new signal is detected during ongoing clinical trials, is to convene a cross-disciplinary team usually composed of experts in safety science, clinical development, clinical pharmacology, and pharmacology/toxicology who must then integrate a variety of data sources to evaluate an emerging signal.

As the application of ML models in PV surveillance methods is broadly explored, ML models have the potential to improve operational efficiency. Given the robustness of the data evaluation and the ability to account for multiple parameters, an ML model may help identify potential new signals earlier than conventional methods and prioritize their evaluation by safety teams [17]. ML models can also be used in the processing and evaluation of ICSRs for causality assessment. A recent article assessed the use of an ML algorithm to predict whether a report has the information needed to make a causality assessment, including key elements such as the drug, AE outcome, temporal relationship, and alternative explanations for the AE [18]. Efficiency could be gained if such an algorithm, with an appropriate threshold, could identify the many reports lacking critical case information that would not need human expert review during surveillance. ML models could also surface emerging patterns in the accruing safety data not yet identified through conventional surveillance methods for safety scientists to explore further. Furthermore, in the clinical trial space, it is often difficult to identify patterns in some subgroups given the limited number of subjects enrolled, lack of diversity, and exclusion criteria for underlying conditions. In the post-approval setting, a more diverse patient population may use a product, and patterns by subgroup, such as pediatric, elderly, and renally or hepatically impaired populations, are generally identified by a manual review of the emerging safety data. The sensitivity of an ML algorithm might account for all these underlying factors and present the most compelling patterns for further review by safety scientists.

This pilot, predictive surveillance model using AI/ML in signal detection for PV demonstrated acceptable sensitivity (50.0% and 55.6% for Drugs X and Y, respectively) for the detection of safety signals in the PM arena. Although the ML model showed relatively low PPV values (33.3% and 38.5% for Drugs X and Y, respectively), this may be preferable since a predictive surveillance model with a high PPV rate may incur the risk of not identifying new signals early in the process. This is especially true for products that are recently marketed and have a low volume of PM data.

Our ML model demonstrated some potential for earlier detection of signals when compared to humans: the model identified eight potential new signals for Drug X, of which one was confirmed as a true signal and detected six months earlier than by humans through standard surveillance methods. For Drug Y, none of the potential new signals were confirmed as true signals to date. Even so, these ML results prompted an assessment of these potential new signals that might not have been identified through conventional surveillance methods. As such, the ML model could guide and prioritize the assessment of specific potential new signals.

Despite the substantive potential of our AI/ML model, this pilot study had several limitations. First, the methodology used in our model has not been validated by any regulatory authorities for signal detection. It is important to note that regulators have not clearly defined their expectations on the use of these technologies in PV, and this gap will hopefully be addressed in the future [17]. Second, the data currently used in the training and testing sets require extensive data processing time to convert from standard data fields for safety reporting into formats more conducive to modeling. This data processing step together with the fact that this ML model was not linked to the global safety database limited the use of the model in a continuous manner that would allow the detection of safety signals in real time. Third, the testing model utilized data exclusively from PM sources, which can be subject to biases in reporting, can underestimate AE occurrence in real world use, and may be missing relevant case information for analysis. To extend the generalizability of the model, our pilot study investigated two products from different therapeutic areas and with differing lengths of time on the market. Nonetheless, the quality of these PM reports may have negatively impacted the model’s sensitivity and PPV values. Lastly, the applicability of the methodology used in this model cannot yet be generalized to the clinical trial development phase, where detecting signals early is a top priority not only for business development decisions or risk mitigation for clinical trial subjects but also for ensuring safe use in patients after drug approval. Applying the methodology used in this model to the clinical trial development phase is the next step in our exploration of this methodology.

While ML models like this one may serve as powerful tools to support conventional surveillance methods, these results should be interpreted with caution, as expert judgment, flexibility, and critical thinking are still essential human skills required for the final, accurate assessment of cases [17]. Furthermore, the Council for International Organizations of Medical Sciences (CIOMS) Working Group XIV on Artificial Intelligence in Pharmacovigilance recently highlighted that the complex algorithms used in these ML models require large amounts of data, and often there may not be enough training data available. In addition, the training data may not be fully representative of the issue under review, leading to misalignment in the resulting algorithm. ML algorithms may also perpetuate historical systematic errors present in training data [19]. When the algorithms cannot easily be understood by PV experts, it is even more important that the training data on which they are optimized be transparently communicated. The CIOMS XIV Working Group will ultimately propose guidance on the use of AI in PV. In the interim, the pharmaceutical industry should be aware of the limitations and be prudent about the implementation of AI, as different processes in PV present inherent and distinct risks and may need rigorous and independent consideration.

Conclusions

The identification of a new adverse event caused by a drug product is one of the most important activities in the pharmaceutical industry for maintaining a favorable safety profile of a drug product. This pilot ML model showed promising results as an additional tool to supplement traditional pharmacovigilance activities in signal detection, exhibiting acceptable accuracy (> 95%) for the detection of safety signals in the PM space and the potential for earlier detection than humans. Model optimization to improve performance may be warranted before implementation. Future areas of research include linking the ML model to a safety database to allow continuous monitoring of new safety signals in real time. Pharmaceutical companies and health authorities should continue to explore the utilization of AI/ML tools in their PV surveillance methods to cope with the vastly increasing amounts of complex safety data and to contribute to the literature. These emerging tools will aid in the integration of data, technology, and human expertise to protect public health in a rapid, efficient manner.