Introduction

Coronaviruses are single-stranded, enveloped and positive-sense RNA viruses, which are spherical in shape and have petal-like spines1. Firstly discovered and identified in 19652, coronaviruses have not caused large-scale outbreaks until the 2003 SARS epidemic in China, followed by 2012 MERS in Saudi Arabia and 2015 MERS in South Korea3. Although the exact origin remains debatable4, the fourth outbreak has taken place in Hubei province of China in December 2019 and rapidly spread out nationally5,6,7,8,9,10. On January 10, 2020, the World Health Organization (WHO) temporarily named the new coronavirus as the 2019 novel coronavirus (2019-nCoV). Around one month later, the WHO officially renamed it to coronavirus disease 2019 or COVID-19 on 11 February, 2020 (see https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200211-sitrep-22-ncov.pdf?sfvrsn=fb6d49b1_2 for details) and released a comprehensive interim guidance on dealing with this new virus for all countries11. On March 11, 2020, the WHO declared COVID-19 a global pandemic12. Since China reported its first cases to the WHO in December 2019, COVID-19 has been spreading rapidly around the world. As of June 15, 2020, about 7.9 million confirmed cases and 433 thousand deaths have been reported by authorities in 214 countries and territories13.

To combat against the rapid spread of the COVID-19, since mid-January 2020, the central government of China and all local governments have implemented intensive preventions. Examples include tracing close contacts and quarantining infected cases, promoting social consensus on self-protection like wearing face mask in public area, among others14. With the unexpectedly rapidly growing number of confirmed cases, more extreme and unprecedented measures have taken places. On 23 January, 2020, the Chinese authorities introduced travel restrictions on five cities (Wuhan, Huanggang, Ezhou, Chibi and Zhijiang) of Hubei, shutting down the movement of more than 40 million people15. Among existing research, most argues that those interventions have effectively halted the spread of the COVID-1915,16,17,18,19,20,21,22,23,24,25.

The feasibilities of aforementioned stringent measures adopted by the Chinese government have been widely discussed, and existing studies have preliminarily examined the effectiveness of these local containment measures (e.g., the lockdown of Wuhan26,27,28, airport screening and travel restrictions19,29,30,31, and isolation of cases and quarantine of contacts32,33). Moreover, different types of infection control measures are enforced in many other countries to prevent and constrain the spread of the COVID-19, and the effects of these measures have been analyzed and compared over a rage of affected countries (e.g., Australia34,34, Germany35, Italy36,37 and South Korea38). Among those emerging large volume of studies, mathematical and statistical modeling plays a non-negligible role. Also, the classical susceptible exposed infectious recovered (SEIR) model with its various extensions is the most popular method39,40,41,42,43,44,45,46,47,48,49,50,51,52. SEIR family models are effective in exploring the epidemic characteristics of the outbreak, forecasting the inflection point and ending time, and deciding the measures to curb the spreading. Despite this, they are less appropriate in identifying transmission routes of the COVID-19 outbreak, which is also not thoroughly investigated in existing literature.

In this paper, we fill in this gap and perform a retrospective analysis using the publicly available data53. Rather than employing the SEIR, we develop a time-varying coefficient sparse vector autoregressive (VAR) model. Using the least absolute shrinkage and selection operator (lasso)54,55 and the local constant kernel smoothing estimator56, our model is capable of estimating the dynamic high-dimensional Granger causality coefficient matrices. This enables the detection and visualization of time-varying inter-location and self-transmission routes of the COVID-19 on the daily basis. The resulting “road-map” can help policy-markers and public-health officers retrospectively evaluate both the effectiveness and unexpected outcomes of their interventions. Such an evaluation is critical to winning the current battle against COVID-19 in China, providing useful experience for other countries facing the emerging threat of this new coronavirus, and saving lives when a new epidemic occurs in the future.

Methods

Model

Throughout this study, we are interested in the growth rate \({y}_{i,t}\) such that:

$$\begin{aligned} {y}_{i,t+1}=\ln ({x}_{i,t+1})-\ln ({x}_{i,t}), \end{aligned}$$
(1)

where \({x}_{i,t}\) is the accumulated confirmed cases in the provincial-level administrative region (location, hereafter) i on day t (\(i=1,\ldots ,N\) and \(t=1,\ldots ,T\)). T and N define the number of days and number of locations under consideration, respectively. We then define \(\varvec{y}_{t}=(y_{1,t},\ldots ,y_{N,t})'\), an \(N\times 1\) vector of the growth rate on day t. To investigate a dynamic direct transmission of the growth rate among locations, we propose a time-varying coefficient sparse VAR model, namely the tvSVAR model, which assumes that Granger causality coefficients are functions of time, such that:

$$\begin{aligned} \varvec{y}_{t+1}&=\varvec{\alpha }_t+\varvec{B}_t \varvec{y}_{t}+\varvec{\epsilon }_t, \end{aligned}$$
(2)

where \(\varvec{\alpha }_t\) is an N-dimensional intercept vector at time t. \(\varvec{B}_t\) is an \(N\times N\) Granger causality matrix at time t with a dynamic sparse structure, for which entries can be exactly zero and the locations of zeros can vary with time. \(\varvec{\epsilon }_t\) is an \(N\times 1\) vector of error terms. The sparsity of \(\varvec{B}_t\) is assumed because N could be even larger than T in our case, which leads to very unstable estimations and problematic interpretations of \(\varvec{B}_t\).

One important benefit of using the proposed tvSVAR to model the transmissions is that the Granger causality matrix, \(\varvec{B}_t\), can provide both the direction and strength of the route on day t. For example, the \(ij{\text {th}}\) entry in \(\varvec{B}_t\) measures the strength of the transmission from location i to location j on day t. The \(i{\text {th}}\) diagonal of \(\varvec{B}_t\) represents the self-transmission in location i that captures the relationship between the growth rate in the current and previous days. More critically, the sparse structure eases the interpretation of \(\varvec{B}_t\) because many weak transmissions may be of a random nature. The corresponding coefficients, therefore, can be treated as noises and are shrunk to zeros exactly. Moreover, a time-varying design of \(\varvec{B}_t\) allows us to investigate changes in the identified transmissions over time. For instance, let 11, 14 and 17 indicate Hubei, Jiangxi and Shanghai, respectively. On day \(t=1\), the estimated \(\beta _{11,14,1}\) and \(\beta _{14,17,1}\) are 0.52 and 0.35, respectively. This suggests on that day, moderately strong transmission routes of confirmed COVID-19 cases are detected from Huber to Jiangxi and from Jiangxi to Shanghai, respectively. Further, estimated \(\beta _{17,i,1}\) for all \(i=1, \ldots ,20\) are zeros, suggesting that the confirmed cases in Shanghai cannot spread to other locations on day 1. On day \(t=2\), we observe estimated \(\beta _{11,14,1}=0.61\), \(\beta _{14,17,1}=0.41\) and all \(\beta _{17,i,1}=0\). Thus, the two detected routes from Huber to Jiangxi and Jiangxi to Shanghai have become more influential, whereas the cases in Shanghai are still yet to spread out on day 2. The above results cannot be derived using the classic epidemiological SEIR model.

To capture both dynamic and sparse structure of the Granger causality coefficients, we solve the following optimization problem:

$$\begin{aligned} (\widehat{\varvec{\alpha }}_t,\widehat{\varvec{B}}_t) =\arg \min _{\varvec{\alpha }_t,\varvec{B}_t}\left\{ \sum ^{T-1}_{s=1}\left( \varvec{y}_{s+1}-\varvec{\alpha }_t-\varvec{B}_t\varvec{y}_{s}\right) ^T\varvec{W}_s\left( \varvec{y}_{s+1}-\varvec{\alpha }_t-\varvec{B}_t\varvec{y}_{s}\right) + \lambda \sum _{i, j}^{N}w_{i,j,t}|\beta _{i,j,t}|\right\} , \end{aligned}$$
(3)

where \(\varvec{W}_s=\text {diag}\left( K_{b_1}(\tau _s-\tau ),\ldots ,K_{b_N}(\tau _{s}-\tau )\right)\) is the matrix of kernel weights calculated based on the bandwidth \(b_i\), \(i=1,\ldots ,N\), and \(K_{b_i}(\tau _s-\tau )=K(\frac{\tau _s-\tau }{b_i})/{b_i}\) with \(\tau _s\) defined as a scaled time \(\frac{s}{T-1}\). We use the Epanechnikov kernel \(K(x)=0.75(1-x^2)_+\) and a unified bandwidth for each i (\(b_i\equiv b\)) to avoid a large number of tuning parameters. The coefficients \(\beta _{i,j,t}\) denotes the \(ij{\text {th}}\) entry of the Granger causality matrix \(\varvec{B}_t\), and \(\lambda\) is the tuning parameter that aims to shrink insignificant \(\beta _{i,j,t}\) to zero and thus controls the sparsity of \(\varvec{B}_t\). Another essential feature of our proposed model is that the adaptive weights \(w_{i,j,t}\) are employed to penalize \(\beta _{i,j,t}\) differently in the lasso (L1) penalty54,55. The choice of weights \(w_{i,j,t}\) takes account of the prior knowledge about the transmissions and can be specified by the users. In this study, we consider \(w_{i,j,t}\) as the reciprocal of the accumulated confirmed in location i on day \(t-1\). That is, the growth rate of a location with a smaller accumulated confirmed cases is less likely to influence the growth rates of others, and thus, more likely to be shrunk to zero. The final sparsity structure of \(\varvec{B}_t\) is still data-driven.

The estimators as in (3) can also be viewed as a penalized version of local constant kernel smoothing estimator56. We utilize a modified version of the fast iterative soft thresholding algorithm (FISTA)57 to solve the optimization problem (3).

Given a bandwidth b and a penalty parameter \(\lambda\), we can find the estimator \((\widehat{\varvec{\alpha }}_t,\widehat{\varvec{B}}_t)\) for each day t and observe the dynamic patterns of the transmission over time t for each pair of locations. The selection of b is critical to detecting the influential routes, which depends on the chosen criterion. Among the existing literature, a popular approach is to adopt the cross-validation strategy, such that based on the estimated \(({\varvec{\alpha }}_t,{\varvec{B}}_t)\), the model will not ‘overfit’ \(\varvec{y}_{t}\). As for the time-series analysis, we use an expanding-window sample to implement the cross-validation58. This requires that the chosen b will minimize the cross-validated forecast error, which is measured by the one-step-ahead root mean squared forecast error (RMSFE), such that

$$\begin{aligned} \text {RMSFE}(b, \lambda )=\sqrt{\frac{1}{N(T_1-T_0-1)}\sum _{i=1}^{N}\sum _{t=T_0}^{T_1-1}\left( \widehat{y}^{(b, \lambda )}_{i,t+1}-{y}_{i,t+1}\right) ^2}, \end{aligned}$$
(4)

where \([T_0,T_1]\) is the evaluation period, which is given by the last third of the data in our study, \(\widehat{y}^{(b, \lambda )}_{i,t+1}\) denotes the one-step-ahead forecast for location i based on the data up to day t, and \({y}_{i,t+1}\) defines the observed growth rate at day \(t+1\) for location i. Note that RMSFE is analogous to the square root of the popular least squared errors. An interpretation is that the chosen b will lead to the minimized total out-of-sample forecast errors of the growth rates of confirmed cases over the last third of the sample period.

Data and results

Data

The data studied in this paper include confirmed COVID-19 cases which occurred in mainland China. The data are publicly available and sourced from the website of the National Health Commission of the People’s Republic of China53. The data-coverage ranges from January 29, 2020 to February 19, 2020, during which no missing data were recorded at location-level. The accumulated cases and the associated growth rates, grouped by the total national number, cases in Hubei and cases in all other locations, are plotted in Fig. 1a,b, respectively. The total national (Hubei) accumulated confirmed cases increased rapidly from 7,736 (4,586) on January 31, 2020 to 75,101 (62,457) on February 19, 2020. Note that on February 12, 2020, confirmed cases in Hubei included those confirmed by both laboratory and clinical diagnosis, leading to a one-time hump of the accumulated number. Compared to those of Hubei, confirmed cases of other locations took up a smaller proportion of the total national number, ranging from 40.7% on January 29, 2020 to 16.8% on February 19, 2020. This suggests that the growth rate of other locations should be lower than that of Hubei, which is consistent with Fig. 1b. Throughout our investigation period, except for the one-time hump on February 12, 2020, growth rates of Hubei and the rest steadily declined, from 33% and 25% to 5% and 1%, respectively.

Figure 1
figure 1

Accumulated confirmed cases and growth rate: 31/1/2020–19/2/2020.

Estimation results: transmission routes

By taking the difference of the logged accumulated cases and applying one lag, our estimated transmission routes are available from January 31 to February 19, 2020 (two observations are lost). To avoid potential noises caused by small numbers, we only include data of locations, which had at least 150 accumulated confirmed cases as of February 19, 2020. Altogether, our modeled sample contains 20 location-level confirmed cases. We firstly test the stationarity of the 20 growth rates separately. Based on the Augmented Dickey–Fuller test, only the rates of three locations (Bei**g, Hainan and Heilongjiang) are insignificant, which is in-line with the employed 10% significance level. The detailed results are available upon request. The model explained in “Methods” is then fitted incorporating all the 20 growth rates. A non-zero estimate of \(\beta _{i,j,t}\), the \(ij{\text {th}}\) entry of \(\varvec{B_t}\) in (2), indicates that on the \(t{\text {th}}\) day, the growth rate of location j is Granger caused by that of location i. In other words, there is a transmission route from location i to location j. Among the 20-day results, we noticed that the estimated transmission routes on days 1–5 changed considerably on daily basis. From the sixth day onwards, however, those estimated routes were more steady. Hence, we plot the estimates on days 1–5 and those on the every fifth day thereafter, on Fig. 2. Be noted that estimates smaller than 0.2 (none-influential) are not presented a better visual illustration purpose. Also, this research focuses on the analysis of mainland China only, which excludes Taiwan, Macau, Hong Kong and all important islands of China’s territory, such as those located in the South China Sea. The plots presented in Fig. 2 therefore do not present a complete map of the territory of China, nor should they be used for purposes other than displaying identified transmission routes of COVID-19 in mainland China. The readers are directed to The National Administration of Surveying, Map** and Geographic Information of the People’s Republic of China, should they need to precisely explore the scope of maps, national boundaries and the drawing of important islands of the Chinese territory.

Figure 2
figure 2

Estimated routes of transmission among locations of China: 31/1/2020–19/2/2020. Note: This figure is created in software R version 3.6.3 (https://www.r-project.org) using packages ‘ggplot2’ and ‘ggmap.’

In Fig. 2, we use color of light orange (small) to dark red (large) indicating the accumulated confirmed cases in each location, up to time t. Estimated transmission routes are colored in blue. Self-transmissions (indicated by \(\beta _{i,i,t}\)) are denoted by dots, and a larger size of dot suggests a larger estimated \(\beta _{i,i,t}\). Inter-location transmission (indicated by \(\beta _{i,j,t}\), where \(i \ne j\)) is represented by arrows, with the transparency indicating the magnitude of estimated \(\beta _{i,j,t}\). On the first day (January 31, 2020), there were influential inter-location transmissions from Hubei to Jiangxi, Heilongjiang, Zhejiang, Henan, Shandong, Jiangsu and Shaanxi, sorted by the magnitudes of strength (big to small). There were a few additional detected such transmissions on the second day, including those from Hubei to Guangxi, from Jiangxi to Fujian, and from Guangdong to Anhui, Yunnan and Hunan. The number of such identified inter-location routes, however, reduced rapidly over the next three days. On the fifth day (February 4, 2020), no influential transmission routes were found from Hubei to directly affect other locations, and there were only three influential routes identified nationally, including Zhejiang–Shaanxi, Zhejiang–Jiangxi and Jiangxi–Shanghai. The number of those detected inter-location routes declined again in the next few days, and on day 13, only Henan–Heilongjiang was found influential. On days 19 and 20 (February 18 and 19, 2020), there were no influential inter-location transmissions identified. The above findings suggest that the number of influential inter-location transmissions overall dropped quickly in the first five days and then reduced steadily for the rest 15 days. This is consistent with the observations of Fig. 3a, where the time-varying estimates of the Granger causality of Hubei on other locations are plotted. On each day, we report the mean, standard deviation (Std. Dev.), the 25% quantile (\(Q_1\)) and 75% quantile (\(Q_3\)) of those estimates in Table 1, which also leads to consistent findings.

Figure 3
figure 3

Estimated time-varying coefficients: 31/1/2020–19/2/2020.

Table 1 Summary of daily estimated transmission routes.

As for the self-transmission, we firstly examine (Fig. 2b). It can bee seen that there were quite a few detected influential self-transmissions on the first two days. However, this number dropped quickly over days 3–5, and only self-transmissions of Heilongjiang, Guangdong and Zhejiang were found influential on day 5. Since then, the number of influential self-transmissions increased quickly with growing magnitudes (influence). On the sixteenth day (February 15, 2020), 16 out of the 20 examined locations had an estimated \(\beta _{i,i,20}\) of at least 0.2. Those large self-transmissions, however, disappeared rapidly again in the next 3 days. On February 18 and 19, 2020, there were no influential self-transmissions identified. This is consistent with our findings on Fig. 3b, where time-varying estimated \(\beta _{i,i,t}\) are plotted for each location. We report daily descriptive statistics of those estimates in Table 1, which also results in consistent conclusions.

Discussions

Since 23 January, 2020, many cities on mainland China started to introduce travel restrictions, including five cities (Wuhan, Huanggang, Ezhou, Chibi and Zhijiang) of Hubei15. According to WHO’s situation report59, the average incubation period of COVID-19 is up to 10 days. Thus, our estimated dynamic transmission routes supports the significant effectiveness of the interventions taken by the Chinese authorities15,16,17,18,19,20,21,22,23,24,25. This is evidenced by Fig. 2a–e, where the number of influential inter-location transmissions from Hubei to other locations reduced very quickly. Compared to multiple influential routes originating in Hubei detected on the first two days (January 31 and February 1, 2020), by February 4, 2020 (around 10 days after the travel restrictions), there were already no such transmissions identified. On the other hand, from February 5 to 16, 2020, Table 1 suggests that the averaged magnitudes of self-transmission on each day were strengthening steadily. This may also be explained by the interventions, which have effectively blocked inter-location transmissions, such that the growth rate of each location could only be caused by its internal transmissions.

Table 2 Top five locations of the inter-location transmissions.

We now focus on the inter-location transmission routes. Since influential routes from Hubei were no longer detected since day 5, we calculate the average \(\beta _{i,j,t}\) of the 19 locations affected by Hubei over the first four days (January 31–February 3, 2020). The top five destinations are presented in Panel A of Table 2. Apart from its geographic neighbors Jiangxi and Henan, Hubei has lead to influential routes to Heilongjiang, Zhejiang and Shandong directly. This may be explained by the substantial floating population working and living in Hubei from those locations. Excluding routes originating in Hubei, the Panel B of Table 2 suggests that the top destinations of transmission routes have not changed much over the first four days and the rest 16 days. Despite minor differences in ranking, Shaanxi, Heilongjiang, Jiangxi, Anhui, Henan and Jiangsu appear to be the destinations suffered most from inter-location transmissions from origins other than Hubei. Similarly, the top five sources (excluding Hubei) of those transmissions are basically identical over the days 1–4 and days 5–20, as shown in Panel C of Table 2. This is consistent with the fact that travel restrictions in Hubei should not affect the connections among other locations. In all cases, Jiangxi, Henan, Guangdong, Zhejiang and Anhui are the most influential origins other than Hubei.

It is worth noting that Jiangxi, Henan and Anhui belong to both the top origins and destinations of the inter-location transmissions, excluding Hubei. Since the impact of Hubei is not considered, this cannot be explained by the two influential transmission routes of Hubei–Jiangxi and Hubei–Henan listed in Panel A of Table 2. To see this, over days 5–20, the transmissions out of Hubei are no longer significant and thus should not affect routes from Jiangxi and Henan to another location. In contrast, one explanation is the large migrant workers from Jiangxi, Henan and Anhui to other locations (excluding Hubei). According to the Report on China’s migrant population development of 201760, Jiangxi (7.25%), Henan (6.30%) and Anhui (6.27%) are among the top five locations in mainland China, ranked by the percentages of migrant workers in 2017.

Conclusions

Coronaviruses have lead to three major outbreaks ever since the SARS occurred in 2003. Although the exact origin is still debatable, the current shock, namely COVID-19, has taken place in Wuhan, the capital city of Hubei province in mainland China. As the fourth large-scale outbreak of coronaviruses, COVID-19 is spreading quickly to all provincial-level administrative regions (locations, hereafter) in China and has recently become a world-wide epidemic. As a significant complement to existing research, this study employs a tvSVAR model and retrospectively investigates and visualizes the transmission routes in mainland China.

Demonstrated in Fig. 2, our baseline results review both the dynamic inter-location and self-transmission routes. Since February 4, 2020, the spread out of Hubei was largely reduced, leading to no identifiable routes to other locations. Simultaneously, the self-transmissions started to accelerate and peaked on around February 15, 2020 for most locations. Given an average incubation period of 10 days, those results support the argued effectiveness of the travel restrictions to control the spread of COVID-19, which took place in multiple cities of Hubei on January 23, 2020. On February 18–19, 2020, there existed no influential inter-location or self-transmission routes. Thus, the growth rates of confirmed cases are of a more random nature in all locations thereafter, implying that the spread of COVID-19 has been under control. For the detected inter-location transmissions, our findings demonstrate that Jiangxi, Heilongjiang, Zhejiang, Henan and Shandong are the top 5 locations affected mostly via routes directly from Hubei. When the influence of Hubei is excluded, Jiangxi, Henan and Anhui are among both the top origins and destinations of transmission routes.

Our results have major practical implications for public health decision- and policy-makers. For one thing, the implemented timely ad-hoc public health interventions are proven effective, including contact tracing, quarantine and travel restrictions. For another, apart from the origin of the virus, as locations with largest migrant workers percentages, virus preventions are also of crucial importance in Jiangxi, Henan and Anhui to controlling the epidemics like the outbreak of COVID-19 in the future. With limited resources, taking ad-hoc interventions in such locations may most effectively help stop the spread of a new virus, from an economic perspective.