Introduction

The rising number of co-infections of COVID-19 and influenza (flu)1,2 has raised serious public concern about a potential “twindemic”3. The concern is reinforced by the remarkable similarity between the epidemic trends of flu and COVID-19 (Fig. 1). The fast-developing COVID-19 pandemic, coupled with a severe flu season, would overwhelm already heavily burdened health care systems, causing further inconceivable losses4. There is therefore an urgent need for an accurate and robust bi-disease tracking and forecasting system that gives public health officials reliable, timely information for making informed decisions to control and prevent the onset of a “twindemic”. To this end, we propose ARGOX-Joint-Ensemble, a principled framework that utilizes the connectivity between flu and COVID-19 to integrate previously proposed forecasting models and adapt them to a new era in which flu and COVID-19 co-evolve.

Fig. 1: Growth of real-time COVID-19 cases/deaths in Georgia (GA) (black) in comparison with GA’s own lagged %ILI (yellow) and the lagged %ILI of the neighboring states, from 2020-07-04 to 2022-08-13.

(a) Real-time COVID-19 cases in GA (thick black curve) vs. 1-week-lagged %ILI in GA (yellow curve) and in the neighboring states; (b) real-time COVID-19 deaths in GA (thick black curve) vs. 3-week-lagged %ILI in GA (yellow curve) and in the neighboring states. The underlying data can be found in Supplementary Data 1.

Accurate tracking of flu outbreaks and trends is important but non-trivial. In the United States, flu affected 9–41 million people annually over the 2010–2020 seasons, resulting in between 12 and 52 thousand deaths5. For decades, the U.S. Centers for Disease Control and Prevention (CDC) has monitored flu activity through the Influenza-like Illness Surveillance Network (ILINet), which collects the number of outpatients with Influenza-like Illness (ILI) from thousands of healthcare providers and publishes the weekly ILI percentages (%ILI, i.e., the percentages of outpatients with ILI) at the national, regional (10 Health and Human Services (HHS) regions in the US), and state levels. However, due to the time required for data collection and administrative processing, the ILI reports from CDC lag behind real time by 1–2 weeks, and are thus unable to provide the most accurate and timely information on disease development. Numerous ILI tracking approaches have therefore been proposed, utilizing statistical models6,7, mechanistic models such as compartmental models8,9,10,11, ensemble approaches12, and deep learning models13,14. Several approaches rely on external signals such as environmental conditions and weather reports15,16; social media, such as Twitter posts17,18 and Wikipedia article views19,20; and search engine data, such as Google21,22,23,24,25, Yahoo26, and Baidu internet searches27.

Similarly, many ILI forecasting approaches have been adapted and modified to predict the newly emerged COVID-19 pandemic8,28. In particular, machine learning (data-driven) methods28,29,30 and compartmental models31,32,33 are the most prevalent approaches among the publicly available COVID-19 spread forecasts, according to the weekly forecast reports compiled by CDC34. Yet these methods do not capture the inter-correlation between the two diseases, which could be a crucial factor as both infectious diseases co-evolve.

Evidently, COVID-19 is very likely to circulate for a long period of time and co-evolve with ILI, especially as COVID-19 variants continue to emerge1. Hence, a unified and robust forecasting framework for both diseases is essential.

Despite advances in methods for tracking individual diseases, joint tracking of flu and COVID-19 remains challenging. In the midst of the ongoing COVID-19 pandemic, %ILI collected by CDC may get “contaminated” in the current season, due to symptomatic similarities with COVID-19 as well as various biological and demographic factors. On the other hand, ILI outbreaks can potentially assist the prediction of COVID-19 cases and deaths, due to the proximity between the two diseases. However, the inter-correlation between COVID-19 and ILI is latent and varies across geographical areas, which makes it challenging to capture and utilize for forecasts.

Few attempts have been made to study the connection between COVID-19 and ILI trends, or to incorporate their simultaneous growth for forecasting, while considering the geographical dependence structure (at the state level). Most existing works adapt an ILI forecasting framework and apply it to COVID-19 predictions, or vice versa. For example, ref. 35 studies the correlation of ILI vaccination rates with COVID-19 deaths, and notes their potential power for predicting death trends. Ref. 36 extends this study to identify associations between vaccination rates and COVID-19 infections, deaths and hospitalizations, and also argues for their forecasting potential. Ref. 37 uses incidence patterns from past flu seasons, COVID-19 time series information, and demographic covariates in a Generalized Linear Model to forecast next week’s county-level case counts, under mild assumptions on the similarity of the transmission mechanisms between COVID-19 and flu. Ref. 38, on the other hand, explores seasonal similarities between historical flu seasons and current COVID-19 related signals using a deep clustering module (which learns a lower-dimensional representation of the signals and reconstructs it for forecasting using attention), and produces 1-week-ahead independent state-level ILI forecasts.

Inspired by the affinity between the growth trends of ILI and COVID-19 (Fig. 1), we propose to leverage external COVID-related signals (confirmed cases), along with relevant public search information, for Influenza-like Illness (%ILI) forecasts, and vice versa for COVID-19 cases and deaths predictions. Yet, to build a COVID-ILI joint prediction model with online search data, many challenges remain to be addressed. For example, the COVID-ILI co-evolution is a new phenomenon with limited external signals, while relevant internet search information can be noisy and unstable; hence, it is a great challenge to learn the model efficiently under data paucity and data instability.

Here we propose ARGOX-Joint-Ensemble, a principled way to integrate and adapt previously proposed flu and COVID-19 forecasting models to “unseen” scenarios where flu and COVID-19 co-exist. In particular, we modify previously proposed forecasting models by incorporating COVID-19 signals for flu predictions and vice versa for COVID-19 forecasts. We consolidate the models for the two diseases in a spatial-temporal fashion to efficiently capture and incorporate COVID-ILI signals for state-level forecasts, while maintaining model features for national-level forecasts. Finally, we employ an ensemble approach to combine COVID-19 and flu forecasting methods into one joint framework, which is able to shift focus between COVID-19 and ILI signals for both diseases’ forecasts, and to produce robust forecasts despite unstable search information signals as inputs. The ensemble framework is systematic and comprehensive. Each data-driven sub-model within the framework is intentionally straightforward and unified to prevent over-fitting. Numerical comparisons show that our method performs competitively with other publicly available single-disease forecasting methods. This study further emphasizes the general applicability and the predictive power of online search data for various tasks in disease surveillance.

Methods

Data acquisition and pre-processing

For COVID-19 cases and deaths forecasting, this paper focuses on the 50 states of the United States plus Washington D.C. For %ILI forecasting, we exclude Florida (whose ILI data is not available from CDC) and include New York City and Washington D.C. For COVID-19 cases and deaths forecasting, we use confirmed cases, confirmed deaths, confirmed new hospital admissions (hospitalization), ILI and Google search query frequencies as inputs. For %ILI forecasting, we use lagged %ILI, COVID-19 cases, and Google search query frequencies as inputs.

COVID-19 reporting data

We use reported COVID-19 confirmed cases and deaths of the United States from the New York Times (NYT)39 as features in our model. We also use COVID-19 confirmed new hospital admissions (hospitalization) released by the U.S. Department of Health and Human Services (HHS)40 as features for our COVID-19 death forecasts. When comparing against other benchmark methods published in the CDC COVID-19 Forecast Hub34, we use COVID-19 confirmed cases and deaths from the JHU CSSE COVID-19 dataset41, a curated dataset used by the CDC on their official website, as the ground truth. We do not use the JHU COVID-19 dataset as input features in our model because it retrospectively corrects past confirmed cases and deaths due to reporting errors or changes in federal and state policies. The NYT dataset, on the other hand, does not revise past data, which gives more realistic forecasts based on the information available in real time. All data sources are collected from January 21, 2020 to August 13, 2022.

CDC’s ILINet data

CDC releases a report of %ILI for the previous week every Friday, which contains the percent of outpatient visits with influenza-like illness for the whole nation, 10 HHS regions, 50 states (except Florida), Washington DC, and New York City (separated from New York State)42. CDC’s %ILI data for this study are collected from January 21, 2020 to August 13, 2022.

Google search data

The online search data used in this paper is obtained from Google Trends43, where one can obtain the search frequencies of a term of interest in a specific region, time frame, and time frequency by typing in the search query on the website. With the Google Trends API, we are able to obtain a daily time series of the search frequencies for the term of interest, including all searches that contain all of its words (un-normalized)43.

We use 23 highly correlated COVID-19 related Google search queries identified in a prior study44 (in daily frequency) for COVID-19 cases and deaths forecasts, while using ILI related queries (weekly frequency) from previous studies22,24 for %ILI forecasts. We obtain the search queries at the national, regional (summation over states) and state levels. For COVID-19 forecasts, we follow the prior work’s data cleaning procedures44, and find the optimal lag of each Google search query relative to COVID-19 cases/deaths44 (shown in Table S3 in Supplementary Tables) for use as inputs to the forecasting models. Figure S1a and S2b (Supplementary Figures) show that the peak of COVID-19 search volume for the query “loss of taste” precedes the peaks in reported cases and deaths, confirming strong connections between people’s search behaviors and COVID-19 trends.
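As an illustration of lag selection, the following minimal sketch picks, for one query, the delay at which the lagged search series best tracks the target series. It assumes that "optimal" means maximizing the Pearson correlation over a grid of candidate lags; this criterion and the names `pearson` and `optimal_lag` are assumptions made here for illustration, and the procedure in the cited prior work may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def optimal_lag(search, target, max_lag):
    """Return (lag, correlation) maximizing corr(search[t - lag], target[t]).

    `search` and `target` are daily series of equal length; `lag` is in days.
    """
    best_lag, best_corr = 0, float("-inf")
    for lag in range(max_lag + 1):
        lagged_search = search[: len(search) - lag]  # search values at day t - lag
        aligned_target = target[lag:]                # target values at day t
        corr = pearson(lagged_search, aligned_target)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr


# Toy data: the target repeats the search signal three days later.
search = [0, 0, 1, 4, 9, 4, 1, 0, 0, 0, 0, 0, 0, 0]
target = [0, 0, 0] + search[:-3]
lag, corr = optimal_lag(search, target, max_lag=7)
```

On the toy data, where the target is the search series delayed by three days, the selected lag is three days.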

%ILI data imputation

%ILI is weekly indexed while COVID-19 cases and deaths are daily indexed. As we propose a joint forecast framework for both COVID-19 cases/deaths and %ILI in this study, the discrepancy in time stamps between the two needs to be resolved. For this study, we set the imputed %ILI of each day equal to the %ILI of the week containing it, assuming the daily proportion of patients with ILI symptoms is consistent with the weekly number. Imputing daily data also enables larger training sets. We also include a sensitivity analysis in Table S10 (Supplementary Tables).
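The imputation can be sketched in a few lines. The example below assumes each weekly %ILI value is indexed by its week-ending date (a Saturday, as with the study dates 2020-07-04 and 2022-08-13) and covers the seven days ending on that date; the indexing convention and the function name `impute_daily_ili` are assumptions for illustration.

```python
from datetime import date, timedelta


def impute_daily_ili(weekly_ili):
    """Expand weekly %ILI into a daily series by repeating each weekly value.

    `weekly_ili` maps a week-ending date to that week's %ILI; every one of
    the seven days ending on that date receives the same value.
    """
    daily = {}
    for week_end, value in sorted(weekly_ili.items()):
        for offset in range(7):
            daily[week_end - timedelta(days=offset)] = value
    return daily


# Toy values, not real surveillance data.
weekly = {date(2020, 7, 4): 1.2, date(2020, 7, 11): 0.9}
daily = impute_daily_ili(weekly)
```

Each week contributes seven daily observations, which is the source of the larger training set mentioned above.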

Forecasting methods

National level

We propose a joint framework for national-level COVID-19 cases and deaths prediction, by additionally incorporating flu information in the previously proposed national COVID-19 forecast model44. Similarly, we also include COVID-19 cases information for %ILI predictions in the Influenza-like Illness forecast model22. Both the COVID-19 and ILI models are based on the ARGO (AutoRegressive with exogenous GOogle search) method.

Specifically, motivated by the robust performance of the ARGO method44 and the connection between COVID-19 cases/deaths and lagged %ILI (Fig. 1), we add lagged daily imputed %ILI information to the L1-penalized (LASSO) regression as extra exogenous variables to produce predictions of COVID-19 cases and deaths for the next 28 days. That is, we use lagged cases, Google search and ILI information as exogenous variables for COVID-19 cases forecasts, and use lagged hospitalization, deaths, Google search and ILI information for COVID-19 death forecasts. Then, we aggregate the daily predictions into 1–4 weeks ahead forecasts for reporting and evaluation, consistent with other publicly available benchmark methods. Meanwhile, for ILI, we obtain accurate estimates of 1–2 weeks ahead national %ILI using the ARGO method22, by additionally incorporating national COVID-19 cases (weekly aggregated) as exogenous variables. Detailed regression formulations are included in the Supplementary Methods section “ARGO-Nat Prediction”. We denote this method as the bi-disease “ARGO-Nat” method, where “Nat” means national-level.
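To make the regression structure concrete, the sketch below builds a design matrix from autoregressive lags plus exogenous series (for example, lag-aligned search frequencies and imputed %ILI) and fits a LASSO by coordinate descent. It is a generic illustration, not the paper's formulation, which is given in the Supplementary Methods. The objective scaling, the lag convention, and the names `build_design` and `lasso_coordinate_descent` are assumptions.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def build_design(target, exogenous, lags, horizon):
    """Pair features known on day t with the target on day t + horizon.

    `lags` lists autoregressive lags (0 means the value on day t);
    `exogenous` is a list of series already aligned to day t.
    """
    rows, response = [], []
    for t in range(max(lags), len(target) - horizon):
        features = [target[t - lag] for lag in lags]
        features += [series[t] for series in exogenous]
        rows.append(features)
        response.append(target[t + horizon])
    return rows, response


def lasso_coordinate_descent(X, y, lam, n_iter=500):
    """Minimize (1/2n) * ||y - b0 - X b||^2 + lam * ||b||_1.

    Features and response are centered internally, so the intercept b0
    is not penalized. Returns (intercept, coefficients).
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [v - y_mean for v in y]  # residual when all coefficients are zero
    col_sq = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue  # constant feature carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            updated = soft_threshold(rho, lam) / col_sq[j]
            delta = updated - beta[j]
            if delta != 0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = updated
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


# Toy check: y = 2 * x1 + 1, with x2 irrelevant.
X = [[0, 1], [1, 0], [2, 1], [3, 0], [4, 1], [5, 0]]
y = [1, 3, 5, 7, 9, 11]
intercept, beta = lasso_coordinate_descent(X, y, lam=0.0)
```

With a sufficiently large penalty, all coefficients shrink exactly to zero and the fit reduces to the mean of the response, which is how the L1 penalty selects among many lagged and exogenous predictors.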

State level

To handle the complicated disease dynamics when COVID-19 and ILI co-evolve, we propose a new ensemble framework, “ARGOX-Joint-Ensemble”, which uses joint COVID-ILI information to guide previously proposed disease forecasting methods for unified COVID-19 and %ILI state-level forecasting.

A high-level illustration of our proposed method is shown in Fig. 2, where ARGOX-Joint-Ensemble operates in 3 steps.

Fig. 2: Flow Chart of the proposed ARGOX-Joint-Ensemble.

The top-to-bottom procedure of ARGOX-Joint-Ensemble is presented, starting with raw data input and ending with state-level forecasting outputs. The procedures to forecast COVID-19 cases/deaths are shown in blue (left), and the procedures to forecast %ILI are shown in red (right). Google: Google search data; NYT: New York Times published COVID-19 data.

In the first step, we gather the raw estimates of COVID-19 cases/deaths (left of Fig. 2) and raw estimates of %ILI (right of Fig. 2) at different geographical resolutions. For COVID-19, our raw estimates of the week-\(\tau\) cases/deaths \({y}_{\tau ,m}\) for state \(m\) are \({\hat{y}}_{\tau ,m}^{GT}\), \({\hat{y}}_{\tau ,{r}_{m}}^{reg}\), \({\hat{y}}_{\tau }^{nat}\), and \({y}_{\tau -1,m}\), where \({r}_{m}\) is the region number for state \(m\). Here, GT and reg denote the state and regional estimates, respectively, based on internet search information only, and nat denotes the national estimates (same as prior study44). Similarly, we obtain the raw estimates of the weekly %ILI \({p}_{\tau ,m}\) for state \(m\): \({\hat{p}}_{\tau ,m}^{GT}\)24, \({\hat{p}}_{\tau ,{r}_{m}}^{reg}\)23, \({\hat{p}}_{\tau }^{nat}\)22, and \({p}_{\tau -1,m}\).

In the second step, we fit two models separately using the raw estimates from step 1 as inputs. Motivated by the connection between lagged neighboring states’ %ILI and real-time COVID-19 trends (Fig. 1), we first propose the bi-disease “ARGOX-Local” method. For COVID-19 cases/deaths predictions, bi-disease ARGOX-Local incorporates neighboring states’ %ILI information; similarly, for %ILI predictions, bi-disease ARGOX-Local includes neighboring states’ COVID-19 cases. Besides bi-disease ARGOX-Local, we also directly employ the previously proposed single-disease forecasting models for COVID-1944 and %ILI24 in the second step, since they have already demonstrated robust results prior to the newly emerged bi-disease dynamics.

In the third (last) step, we combine the two methods from step 2 to produce the final winner-takes-all ensemble predictions for the next 4 weeks of COVID-19 cases/deaths and the next 2 weeks of %ILI. In particular, over a training period of 15 (overlapping) weeks, we evaluate both predictors (from the two models in the second step) by mean squared error (MSE) and select the one with the lowest MSE as the ensemble predictor for future weeks.
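The winner-takes-all selection can be sketched as follows. The example assumes each candidate's past forecasts are aligned week by week with the observed values and that the most recent 15 weeks form the training window; the candidate labels and the function names are hypothetical.

```python
def mean_squared_error(predictions, truth):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, truth)) / len(truth)


def winner_takes_all(candidates, truth, window=15):
    """Select the candidate with the lowest MSE over the last `window` weeks.

    `candidates` maps a method label to its past forecasts, aligned with
    `truth`. Returns the winning label and the per-method MSE scores.
    """
    recent_truth = truth[-window:]
    scores = {
        name: mean_squared_error(forecasts[-window:], recent_truth)
        for name, forecasts in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Toy example with hypothetical labels for the two step-2 predictors.
truth = [float(week) for week in range(20)]
candidates = {
    "argox_local": [value + 0.1 for value in truth],
    "single_disease": [value + 1.0 for value in truth],
}
best, scores = winner_takes_all(candidates, truth)
```

Because the window slides forward each week, the selected predictor can change over time, which lets the ensemble shift between the bi-disease and single-disease models as their recent accuracy changes.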

Implementation details of bi-disease ARGOX-Local and the final ensemble step ARGOX-Joint-Ensemble, as well as the modifications to the previously proposed single-disease forecasting models for COVID-1944 and %ILI24, are presented in the Supplementary Methods section “Newly Proposed Bi-disease ARGOX-Local”. Details of the prediction interval calculation for ARGOX-Joint-Ensemble are also included in the Supplementary Methods section “ARGOX-Joint-Ensemble”.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

In this section, we conduct retrospective estimation of the 1–4 weeks ahead COVID-19 cases and deaths, and 1–2 weeks ahead %ILI, at the US national and state levels for the period of July 4, 2020 to August 13, 2022. We analyze the performance of our joint framework by comparing it with our own methods, as well as with other publicly available methods from CDC Forecast Hub34,45,46,47,48,49,62. In addition, CDC FluSight is also investigating additional surveillance components to track seasonal influenza activity, including laboratory-confirmed influenza hospital admissions63. Therefore, considering alternative indicators of influenza activity as forecasting targets and/or exogenous information in the model could be an important future direction.

In light of recurrent Influenza-like Illness waves and the prolonged COVID-19 pandemic, accurate joint-disease tracking of epidemic activity at different geographical levels has become more important than ever. Our ARGO-Nat and ARGOX-Joint-Ensemble provide high-precision national and state-level surveillance information, which would enable timely decision making and optimal resource reallocation in the face of a potential twindemic. The reliable estimates from our joint COVID-ILI framework give the public more insight into both diseases and can serve as a valuable resource for public health officials.