1 Introduction

Spatial scan statistics are a method for detecting and evaluating localized spatial clusters. The method is widely used in fields such as spatial epidemiology, criminology, and disease surveillance [1]. Joseph Naus, known as the “father of the scan statistic”, first studied the application of the traditional scan statistic in a two-dimensional setting [2]. A rectangular scanning window of fixed shape and size was moved over the entire study region to identify a region, if any, with a higher concentration of points than would be expected by chance. Extending the idea, Kulldorff [3] proposed a spatial scan statistic, based on the likelihood ratio test, for variable-size circular scanning windows. Numerous scanning windows of varying sizes are imposed over the entire study area, and the likelihood ratio test statistic is calculated by comparing the regions inside and outside each scanning window. The most likely cluster corresponds to the window with the maximum value of the test statistic. Abolhassani and Prates [4] have reviewed the development and expansion of spatial scan statistics in the last three decades. Traditionally, spatial scan statistics are based on likelihood ratio tests for probability distributions such as the binomial, Poisson, and normal. Spatial scan statistics for continuous data, apart from survival data, are usually modeled assuming a univariate or multivariate normal distribution, as the case may be [5,6,7,8].

Cucala [9] introduced a distribution-free scan statistic based on a concentration index for any type of data, be it discrete or continuous. The scan statistic relying on the standardized difference in means was shown equivalent to the scan statistic introduced by Kulldorff et al. [7] that assumes a normal distribution. A nonparametric scan statistic based on the Mann–Whitney (MW)/Wilcoxon rank-sum test method was also introduced for continuous outcomes. The nonparametric scan statistic proposed by Jung and Cho [10] was defined as the minimum of p-values from the Wilcoxon rank-sum tests over multiple circular scanning windows. P-values were obtained using the normal approximation of the Wilcoxon rank-sum test. The MW scan statistic proposed by Cucala [11] was defined as the maximum of the standardized MW statistic over numerous circular scanning windows.

Empirical likelihood, introduced by Owen [12, 13], is a nonparametric method for statistical inference based on the empirical distribution of the available data. Review papers exist on theoretical advancement in empirical likelihood and variants [14, 15], empirical likelihood methods for regression [16], and time series data [17]. The empirical likelihood method and its variants have long been applied in a spatial setting, such as for spatial lattice data and spatial regression [18, 19], irregularly located spatial data [20], variogram estimation [21], spatial Markov models [22], spatial quantile regression [23], spatial panel data models [24], small area estimation [25,26,27,28], and spatial autoregressive models [29, 30]. With regard to cluster detection, a spatial scan statistic based on empirical likelihood has been recently introduced for zero-inflated data [31].

The present study proposes an empirical likelihood approach to spatial scan statistics for continuous outcomes. The existing spatial scan statistics are based either on distribution-free methods or on likelihood ratio tests that assume a probability distribution. The proposed spatial scan statistic is based on empirical likelihood, which has the advantage of remaining distribution-free while allowing the use of likelihood methods. In situations where the underlying distribution of the real data is unknown or skewed, it might be better to choose a spatial scan statistic that performs well on spatial accuracy measures such as sensitivity and positive predictive value (PPV) while maintaining high statistical power. Therefore, the proposed empirical likelihood spatial scan (ELiSS) statistic is also compared with the MW scan statistic and the normal scan statistic [7] in terms of statistical power and accuracy measures. The rest of the paper is structured as follows. Section 2 briefly describes the normal scan statistic and presents the ELiSS method and the simulation scenarios used to evaluate the performance of the considered scan statistics. Results of the application of the considered methods in numerical studies are reported in Sect. 3. In Sect. 4, the application of the methods to a real data set is described, and the paper ends with a discussion in Sect. 5 followed by a conclusion in Sect. 6.

2 Methods

2.1 The normal scan statistic

The normal scan statistic for continuous data [7] is described as follows. The null hypothesis H0 states that all observations come from a normal distribution. The alternative hypothesis Ha states that there is one cluster location where the observations have either a larger or smaller mean than outside that cluster. Let there be N continuous observations with values \({x}_{i}\), i = 1, …, N. Corresponding to each observation, there is a spatial location s, s = 1, …, S, with specified latitude and longitude, and S ≤ N. For each location s, the sum of the observed values is \({x}_{s }=\sum_{i\in s}{x}_{i}\) and the number of observations in the location is \({n}_{s}\). The sum of all the observed values is \(X=\sum_{i}{x}_{i}\). For each circular scanning window z (with more than one observation), let \({n}_{z}={\sum }_{s\in z}{n}_{s}\) be the number of observations in circle z, and let \({x}_{z}={\sum }_{s\in z}{x}_{s}\) be the sum of the observed values in circle z. The likelihood function for deriving the likelihood ratio test for the normal model with mean μ and variance \({\sigma }^{2}\) is expressed as,

$$L={\prod }_{i}\frac{1}{\sigma \sqrt{2\pi }}\,{e}^{-\frac{{({x}_{i}-\mu )}^{2}}{2{\sigma }^{2}}}$$

The spatial scan statistic is defined as the maximum of the log-likelihood ratio LLR(z) over all the circular windows, i.e., \({\max }_{z}\ln ({L}_{z}/{L}_{0})\), where \({L}_{z}\) is the maximized likelihood for the circle z under Ha and \({L}_{0}\) is the maximized likelihood under H0. That is, the likelihood ratio test statistic is calculated for each of the multiple circular scanning windows, and the zone (associated region) that maximizes the test statistic is considered the most likely cluster. The radius of the circular scanning window, centered on an observation, is allowed to vary continuously from zero up to an upper limit, often set so that the window contains at most 50 percent of all observations. For assessing the statistical significance of the most likely cluster, a random permutation-based Monte Carlo hypothesis testing procedure is used.
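The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: it assumes one observation per location, the usual maximum likelihood estimates for the normal model (a single mean under H0, separate means inside and outside the window under Ha, and a common variance in both cases), under which the log-likelihood ratio reduces to \((N/2)\ln ({\widehat{\sigma }}_{0}^{2}/{\widehat{\sigma }}_{z}^{2})\). The function names `normal_llr`, `candidate_windows`, and `normal_scan` are hypothetical.

```python
import math


def normal_llr(inside, outside):
    """Log-likelihood ratio of the normal model for one window (sketch)."""
    inside, outside = list(inside), list(outside)
    n = len(inside) + len(outside)
    grand_mean = (sum(inside) + sum(outside)) / n
    mean_in = sum(inside) / len(inside)
    mean_out = sum(outside) / len(outside)
    # Variance MLE under H0: one mean for all observations.
    var0 = sum((x - grand_mean) ** 2 for x in inside + outside) / n
    # Variance MLE under Ha: separate means inside and outside the window.
    var_z = (sum((x - mean_in) ** 2 for x in inside)
             + sum((x - mean_out) ** 2 for x in outside)) / n
    # Assumes var_z > 0, which holds for continuous data without ties.
    return 0.5 * n * math.log(var0 / var_z)


def candidate_windows(points, max_fraction=0.5):
    """Yield (inside, outside) index lists for circles centred on each point."""
    n = len(points)
    max_size = int(max_fraction * n)
    for cx, cy in points:
        order = sorted(range(n),
                       key=lambda k: math.hypot(points[k][0] - cx,
                                                points[k][1] - cy))
        # Windows must contain more than one observation.
        for size in range(2, max_size + 1):
            yield order[:size], order[size:]


def normal_scan(points, values, max_fraction=0.5):
    """Return (max LLR, indices of the most likely cluster)."""
    best_llr, best_inside = float("-inf"), None
    for inside, outside in candidate_windows(points, max_fraction):
        llr = normal_llr([values[k] for k in inside],
                         [values[k] for k in outside])
        if llr > best_llr:
            best_llr, best_inside = llr, inside
    return best_llr, best_inside
```

Growing each window one nearest neighbour at a time is a discrete stand-in for letting the radius vary continuously, since the statistic only changes when a new observation enters the circle.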

2.2 The empirical likelihood spatial scan (ELiSS) statistic

The normal scan statistic as well as the proposed empirical likelihood scan statistic are based on the likelihood ratio test between well-defined hypotheses. The proposed ELiSS statistic is derived based on the empirical likelihood method for testing the equality of means in a two-sample problem [32,33,34]. The asymptotic null distribution of the empirical likelihood test statistic for comparing means is chi-square with one degree of freedom. The description of the proposed ELiSS method is along the lines of the normal scan statistic differing only with respect to the test statistic.

The proposed nonparametric scan statistic for spatial clusters makes no distributional assumptions regarding the continuous data. Since the observations are continuous, we assume there are no ties. Under the null hypothesis H0, the independent and identically distributed (iid) continuous observations come from an unknown univariate distribution E with unknown mean μ and unknown common variance \({\sigma }^{2}\). The alternative hypothesis is that there is at least one cluster location where the observations inside and outside the scanning window have unequal means. We define the ELiSS statistic as the maximum value of the empirical likelihood ratio test (ELRT) statistic comparing inside versus outside the scanning window over multiple windows.

With respect to each scanning window, two samples of observations are created, and we treat them as independent of each other. That is, we have \({n}_{z}\) observations inside and \(n-{n}_{z}\) observations outside the scanning window z, giving a total of n observations for the entire study region. We then have two probability vectors \(({p}_{1},\dots ,{p}_{n_{z}})\) and \(({q}_{1},\dots ,{q}_{n-n_{z}})\) corresponding to the two samples of observations inside and outside the scanning window. That is, \({p}_{i},{q}_{j}\ge 0\) and \({\sum }_{i=1}^{n_{z}}{p}_{i}={\sum }_{j=1}^{n-n_{z}}{q}_{j}=1\), with \({p}_{i}\) and \({q}_{j}\) being the probabilities associated with the observations \({x}_{i}\) inside and \({x}_{j}\) outside the scanning window z, respectively.

The empirical likelihood function (EL) under H0 is

$${\prod }_{i=1}^{n_{z}}{p}_{i} {\prod }_{j=1}^{n-n_{z}}{q}_{j},$$
(1)

which needs to be maximized subject to constraints \({\sum }_{i=1}^{n_{z}}{p}_{i}\) = \({\sum }_{j=1}^{n-n_{z}}{q}_{j}\)  = 1 and \(\sum {p}_{i}{x}_{i} -\sum {q}_{j}{x}_{j}=0.\)

The EL under the alternative Ha, is

$${\prod }_{i=1}^{n_{z}}{p}_{i} {\prod }_{j=1}^{n-n_{z}}{q}_{j},$$
(2)

which needs to be maximized subject to constraints \({\sum }_{i=1}^{n_{z}}{p}_{i}\) = \({\sum }_{j=1}^{n-n_{z}}{q}_{j}\)=1.

Then, the empirical likelihood ratio function is

$${\text{ELR}}=\frac{{EL}_{H_{0}}}{{EL}_{H_{a}}},$$
(3)

where EL under H0 and Ha is maximized using the Lagrange multiplier method [35].

We have \({\text{ELRT }} = - {\text{2 log ELR}},\)

which becomes,

$$2\left[{\sum }_{i=1}^{n_{z}}log\left\{1+{\lambda }_{1}({x}_{i} -\upmu )\right\}+{\sum }_{j=1}^{n-n_{z}}log\left\{1+{\lambda }_{2}({x}_{j} -\upmu )\right\}\right].$$
(4)

Here \({\lambda }_{1}, {\lambda }_{2}\) (the Lagrange multipliers) are solutions to

$$\sum \frac{({x}_{i} -\mu )}{n_{z}(1+{\lambda }_{1}\left({x}_{i} -\mu \right))}= 0 \text{ and }\sum \frac{({x}_{j} -\mu )}{(n-n_{z})(1+{\lambda }_{2}({x}_{j} -\mu ))}= 0$$
(5)

which are nonlinear equations in \({\lambda }_{1} \text{ and } {\lambda }_{2}\).
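For reference, each equation in Eq. (5) can be solved numerically for a single multiplier at a fixed μ. The following is a minimal Python sketch using bisection; it is a generic illustration and not the routine used in this study, and the function name `el_lambda` and its tolerance arguments are assumptions. It relies on the left-hand side being strictly decreasing in λ on the interval where every \(1+\lambda ({x}_{i}-\mu )\) is positive, so the root is unique there. The constant factor \(1/{n}_{z}\) does not affect the root and is omitted.

```python
def el_lambda(values, mu, tol=1e-12, max_iter=200):
    """Solve sum(d / (1 + lam * d)) = 0 for lam by bisection, d = x - mu."""
    d = [x - mu for x in values]
    d_min, d_max = min(d), max(d)
    if not (d_min < 0 < d_max):
        raise ValueError("mu must lie strictly inside the range of the data")

    def g(lam):
        return sum(di / (1.0 + lam * di) for di in d)

    # Feasible interval: every 1 + lam * d stays positive.
    lo, hi = -1.0 / d_max, -1.0 / d_min
    eps = 1e-10 * (hi - lo)
    lo, hi = lo + eps, hi - eps
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        # g is decreasing, so a positive value means the root lies to the right.
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Repeating such a search for every scanning window and every Monte Carlo replicate is what makes the general approach expensive.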

Even for a two-sample problem with a small sample size, the nonlinear optimization required by the general empirical likelihood approach is complex and time-consuming. Because each scanning window gives rise to a two-sample problem, adapting the general ELRT as a spatial scan statistic was not computationally feasible for us. Hence, we use the numerical solutions to empirical likelihood methods found in the literature [36,37,38] to develop the one-step or first-order EL spatial scan statistic, i.e., the ELiSS statistic. The description is similar to that of the general EL spatial scan statistic, the difference being that we rely on these numerical solutions rather than on computerized optimization routines to obtain the ELRT.

We find the numeric solutions using the first-order Taylor series approximation method around \({\lambda }_{1},{\lambda }_{2}=0\) and we get,

$${\lambda }_{1}=\frac{\sum \left({x}_{i} -\upmu \right)}{\sum {\left({x}_{i} -\upmu \right)}^{2}} \left(\text{or }\frac{{\overline{x} }_{in}-\upmu }{S}\right)\text{ and }{\lambda }_{2}=\frac{\sum \left({x}_{j} -\upmu \right)}{\sum {\left({x}_{j} -\upmu \right)}^{2}}\left(\text{or }\frac{{\overline{x} }_{out}-\upmu }{S}\right),$$
(6)

which we substitute into Eq. (4) for the ELRT [here \(\sum {\left({x}_{i} -\upmu \right)}^{2},\sum {\left({x}_{j} -\upmu \right)}^{2}\) or S carry the information about variance, while \({\overline{x} }_{in}\) and \({\overline{x} }_{out}\) denote the sample means inside and outside the scanning window]. We then take the maximum ELRT over the numerous scanning windows to obtain the most likely cluster.
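The first-order step leading to Eq. (6) can be written out explicitly. The following is a sketch for the inside sample, using the shorthand \({d}_{i}={x}_{i}-\upmu\), which is introduced here only for brevity; the outside sample is handled in the same way with \({\lambda }_{2}\).

$$\frac{{d}_{i}}{1+{\lambda }_{1}{d}_{i}}\approx {d}_{i}\left(1-{\lambda }_{1}{d}_{i}\right)\;\Rightarrow\;\sum_{i=1}^{n_{z}}{d}_{i}-{\lambda }_{1}\sum_{i=1}^{n_{z}}{d}_{i}^{2}\approx 0\;\Rightarrow\;{\lambda }_{1}\approx \frac{\sum_{i=1}^{n_{z}}{d}_{i}}{\sum_{i=1}^{n_{z}}{d}_{i}^{2}}=\frac{{\overline{x} }_{in}-\upmu }{\frac{1}{n_{z}}\sum_{i=1}^{n_{z}}{d}_{i}^{2}}$$

Read this way, the S in the alternative form of Eq. (6) corresponds to the mean squared deviation of the window's observations about μ.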

Our proposed test statistic is then defined as follows:

\({\text{ELiSS}}={\max }_{z}{\text{ELRT}}_{z}={\max }_{z}\left(-2\log {\text{ELR}}_{z}\right)\), i.e.,

$${\max }_{z}\,2\left[{\sum }_{i=1}^{n_{z}}\log \left\{1+{\lambda }_{1}({x}_{i} -\upmu )\right\}+{\sum }_{j=1}^{n-n_{z}}\log \left\{1+{\lambda }_{2}({x}_{j} -\upmu )\right\}\right],$$
(7)

where \({x}_{i}\in Z\) and \({x}_{j}\notin Z\).

We also used a constrained optimization routine to optimize the objective function while keeping the arguments of the log terms positive [35, 39]. It is necessary that \(0\le {p}_{i}\le 1\), which implies that \(\lambda\) and \(\upmu\) must satisfy \(1+\lambda ({x}_{i} -\upmu )\ge 1/n\) (or δ, where δ is a small number chosen by the researcher) for each i. We have imposed the conditions \(1+{\lambda }_{1}({x}_{i} -\upmu )\ge 1/{n}_{z}\) and \(1+{\lambda }_{2}({x}_{j} -\upmu )\ge 1/(n-{n}_{z})\), to be satisfied for each i and j.
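The one-step ELRT for a single window can be sketched as follows, as a minimal Python illustration of Eqs. (4) and (6). Two points are assumptions rather than details given here: the lower bounds \(1/{n}_{z}\) and \(1/(n-{n}_{z})\) are enforced by simply flooring each log argument, which is only one possible way to respect the constraint, and the function names `one_step_lambda` and `eliss_elrt` are hypothetical. The value of μ is supplied by the caller.

```python
import math


def one_step_lambda(values, mu):
    """First-order multiplier: sum(x - mu) / sum((x - mu)^2)."""
    d = [x - mu for x in values]
    # Assumes not all values equal mu, so the denominator is positive.
    return sum(d) / sum(di * di for di in d)


def eliss_elrt(inside, outside, mu):
    """One-step ELRT for one window, with floored log arguments (assumption)."""
    lam1 = one_step_lambda(inside, mu)
    lam2 = one_step_lambda(outside, mu)
    floor_in = 1.0 / len(inside)    # bound 1 / n_z
    floor_out = 1.0 / len(outside)  # bound 1 / (n - n_z)
    s_in = sum(math.log(max(1.0 + lam1 * (x - mu), floor_in)) for x in inside)
    s_out = sum(math.log(max(1.0 + lam2 * (x - mu), floor_out)) for x in outside)
    return 2.0 * (s_in + s_out)
```

For example, `eliss_elrt([5, 6, 7], [0, 1, 2], mu=3.5)` evaluates the statistic for a window holding the three largest of six values, with μ set to the overall mean of the six observations.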

We consider four choices of information regarding variance for reaching the value of lambda in Eq. (6).

A. We consider the original formula where \({\lambda }_{1}=\frac{\sum \left({x}_{i} -\widehat{\upmu}\right)}{\sum {\left({x}_{i} -\widehat{\upmu}\right)}^{2}}\) and \({\lambda }_{2}=\frac{\sum \left({x}_{j} -\widehat{\upmu }\right)}{\sum {\left({x}_{j} -\widehat{\upmu }\right)}^{2}}\).

Here, \(\widehat{\upmu }= \frac{X}{n}\) (maximum likelihood estimate for mean) where X and n are the sum and number of total observations in the entire study region, respectively.

B. \({\lambda }_{1}=\frac{{\overline{x} }_{in}-\widehat{\upmu }}{S}\) and \({\lambda }_{2}=\frac{{\overline{x} }_{out}-\widehat{\upmu }}{S}\). Here we consider S to be the common variance under H0.

Hence S= \(\frac{1}{n}\left\{\sum_{i\in Z}{\left({x}_{i} -\widehat{\upmu }\right)}^{2}+\sum_{j\notin \text{ Z}}{\left({x}_{j} -\widehat{\upmu }\right)}^{2}\right\}\) which is the same as the maximum likelihood estimate for variance,\({\sigma }^{2}=\frac{\sum {\left({x}_{k} -\widehat{\upmu }\right)}^{2}}{n},\text{ k }= 1, 2,\dots \text{n}\)

C. For the \({\lambda }_{1}\) and \({\lambda }_{2}\) defined in B, we consider pooled variance as an estimate of S.

S \(=\frac{\left({n}_{1}-1\right){S}_{1}^{2}+({n}_{2}-1){S}_{2}^{2}}{{n}_{1}+{n}_{2}-2}\) where \({S}_{1}^{2}=\frac{\sum {\left({x}_{i} -{\overline{x} }_{in}\right)}^{2}}{{n}_{1}-1}\) and \({S}_{2}^{2}=\frac{\sum {\left({x}_{j} -{\overline{x} }_{out}\right)}^{2}}{{n}_{2}-1}\)

Here, n1 and n2 refer to the number of observations inside and outside each scanning window, respectively.

D. O’Neill [40] studied the standard sampling problem, where a sample of n observations was taken from a finite population of size N using simple random sampling without replacement. It was reported that the overall variance can be decomposed into statistical quantities corresponding to the sampled (n observations) and unsampled (N–n observations) parts of the population. A distance measure comparing the means of the sampled and unsampled parts was also defined. Similarly, for the spatial scan statistic, each scanning window has nz observations inside and n–nz observations outside, which creates a similar sampling problem. Hence, for the \({\lambda }_{1}\) and \({\lambda }_{2}\) defined in B, we incorporate \({S}{\prime}\), the variance decomposition as per O’Neill [40], as an estimate of the variance S.

$${S}{\prime} =\frac{1}{{n}_{1}+{n}_{2}-1} \left[\left({n}_{1}-1\right){S}_{1}^{2}+\left({n}_{2}-1\right){S}_{2}^{2}+\frac{{n}_{1}{n}_{2}}{{n}_{1}+{n}_{2}} {\left({\overline{x} }_{in} -{\overline{x} }_{out}\right)}^{2}\right]$$
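The variance estimates used in choices B, C, and D can be compared side by side with a short Python sketch. It is only an illustration of the formulas above; the function names `variance_choices` and `lambdas_from_s` are hypothetical, and each window is assumed to have at least two observations on each side so that \({S}_{1}^{2}\) and \({S}_{2}^{2}\) are defined.

```python
def variance_choices(inside, outside):
    """Return the variance estimate S for choices B, C and D."""
    n1, n2 = len(inside), len(outside)
    n = n1 + n2
    mu_hat = (sum(inside) + sum(outside)) / n
    mean_in = sum(inside) / n1
    mean_out = sum(outside) / n2
    # Within-sample sums of squares about each sample's own mean.
    ss_in = sum((x - mean_in) ** 2 for x in inside)
    ss_out = sum((x - mean_out) ** 2 for x in outside)
    # Total sum of squares about the overall mean.
    ss_total = sum((x - mu_hat) ** 2 for x in list(inside) + list(outside))
    s_b = ss_total / n                      # B: MLE of the common variance
    s_c = (ss_in + ss_out) / (n1 + n2 - 2)  # C: pooled variance
    between = n1 * n2 / (n1 + n2) * (mean_in - mean_out) ** 2
    s_d = (ss_in + ss_out + between) / (n1 + n2 - 1)  # D: decomposition S'
    return {"B": s_b, "C": s_c, "D": s_d}


def lambdas_from_s(inside, outside, s):
    """Multipliers (mean_in - mu_hat) / S and (mean_out - mu_hat) / S."""
    n = len(inside) + len(outside)
    mu_hat = (sum(inside) + sum(outside)) / n
    return ((sum(inside) / len(inside) - mu_hat) / s,
            (sum(outside) / len(outside) - mu_hat) / s)
```

Because the within-sample and between-sample sums of squares add up to the total sum of squares about \(\widehat{\upmu }\), the estimate in D equals the one in B multiplied by n/(n − 1), so the two differ only by that factor for any window.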

2.3 Simulation study

A simulation study was conducted to understand the performance of the proposed ELiSS method with respect to the four different choices of incorporating variance (versions of ELiSS correspond directly to choices A, B, C, and D as described earlier). To evaluate the performances of the considered approach, the four versions of the ELiSS statistic were also compared to the Mann–Whitney test-based nonparametric spatial scan statistic and the normal scan statistic with respect to statistical power, accuracy measures of sensitivity and PPV using simulated continuous data.

Under the simulation study using R software, a true cluster having higher outcomes in comparison with the rest of the study region was defined on a 0:100 × 0:100 rectangular grid. The true cluster was centered at the location coordinates (50,20) and had a radius of 10 units (L1). Four probability distributions namely normal, logistic, lognormal, and gamma distributions were considered with different values for the location parameters within and outside the defined true cluster, whereas the scale parameter was fixed at one for all four distributions. The mean of the distribution inside the true cluster was kept as c√2 for normal and logistic distributions, and 2 + c√2 for lognormal and gamma distributions where c = 1, 2, 3. The mean outside the cluster was kept as 0 for the symmetric distributions and as 2 for the asymmetric distributions. The location and scale parameters were defined as per Huang et al. [6] and Jung and Cho [10] with different values for c.

We generated 1000 random datasets for one baseline scenario and its variations. For the baseline scenario, the true cluster was at L1, the sample size was 100, and the maximum size of the scanning window r was kept as 50 percent of all observations. With respect to the true cluster, the value of c pertaining to the difference in location parameters was set as 3. The varied scenarios in the simulation study include a different sample size (n = 200), varying size of the scanning window (r = 10, 30), and differences in location parameters (c = 1, 2). Another variation in the baseline scenario was a bigger true cluster with a radius of 20 units, L2. The number of permutations for each simulated dataset was fixed at 999.
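One dataset of the baseline scenario for the normal distribution can be generated along the following lines. This is a Python sketch for illustration only (the study itself used R). It assumes that observation locations are drawn uniformly at random over the 0:100 × 0:100 region, which is not stated above, and the function name `simulate_normal_dataset` is hypothetical.

```python
import math
import random


def simulate_normal_dataset(n=100, c=3, center=(50.0, 20.0), radius=10.0,
                            seed=None):
    """Return a list of (x, y, value, in_cluster) tuples for one dataset."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Assumption: locations are uniform over the study region.
        x, y = rng.uniform(0.0, 100.0), rng.uniform(0.0, 100.0)
        in_cluster = math.hypot(x - center[0], y - center[1]) <= radius
        # Mean c * sqrt(2) inside the true cluster, 0 outside; scale fixed at 1.
        mean = c * math.sqrt(2.0) if in_cluster else 0.0
        data.append((x, y, rng.gauss(mean, 1.0), in_cluster))
    return data
```

Calling `simulate_normal_dataset(n=200)` or `simulate_normal_dataset(radius=20.0)` would correspond to the larger sample size and the bigger true cluster L2, respectively.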

The statistical power was defined as “the proportion of the 1000 random data sets for which the null hypothesis is rejected at the significance level of 0.05” and was expressed as a percentage. Sensitivity was defined “as the proportion of the number of cells correctly detected among the cells in the true cluster”. PPV was defined “as the proportion of the number of cells belonging to the true cluster among the cells in the detected cluster”. As the measures of accuracy varied between the 1000 random data sets, only the average value of the measures over the datasets for which the null hypothesis was rejected was considered.
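The three performance measures defined above translate directly into code. The following Python sketch uses hypothetical function names, represents clusters as collections of cell identifiers, and assumes that the null hypothesis is rejected when the Monte Carlo P value is below the significance level.

```python
def sensitivity_ppv(detected, true_cluster):
    """Return (sensitivity, PPV) for one detected cluster."""
    detected, true_cluster = set(detected), set(true_cluster)
    hits = len(detected & true_cluster)
    sensitivity = hits / len(true_cluster)  # share of true cells detected
    ppv = hits / len(detected)              # share of detected cells that are true
    return sensitivity, ppv


def power_percent(p_values, alpha=0.05):
    """Percentage of datasets for which H0 is rejected at level alpha."""
    rejected = sum(1 for p in p_values if p < alpha)
    return 100.0 * rejected / len(p_values)
```

For instance, a detected cluster {1, 2, 3, 4} compared with a true cluster {3, 4, 5} shares two cells, giving a sensitivity of 2/3 and a PPV of 1/2.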

3 Results

The estimated accuracy measures and power of the considered methods under varying sample size, true cluster size, scanning window size, and parameter values are depicted in Figs. 1, 2, 3 and 4 (see also Tables 1, 2, 3 and 4 in Online Resource 1). The ELiSS A method had the largest difference between the accuracy measures, while the normal method had the smallest. The PPV of the methods was lower than the sensitivity for the smaller true cluster. The power of all the methods either increased to or remained constant at 100 percent as the true cluster became larger. The methods had considerably lower power at c = 1 than in the other scenarios.

Fig. 1 Estimated accuracy measures and power of the scan statistics for detecting clusters when there are differences in sample size. a Normal, b Logistic, c Lognormal and d Gamma distribution

Fig. 2 Estimated accuracy measures and power of the scan statistics for detecting clusters when there are differences in the size of the true cluster. a Normal, b Logistic, c Lognormal and d Gamma distribution

Fig. 3 Estimated accuracy measures and power of the scan statistics for detecting clusters using different maximum sizes of the scanning window. a Normal, b Logistic, c Lognormal and d Gamma distribution

Fig. 4 Estimated accuracy measures and power of the scan statistics for detecting clusters for different choices of the parameter. a Normal, b Logistic, c Lognormal and d Gamma distribution

3.1 Sensitivity

3.1.1 Method-wise performance of scan statistics across the varied scenarios with respect to the distributions

There was almost no difference in the sensitivity of the ELiSS B and D methods irrespective of the distribution. These two methods mostly performed better than the normal and ELiSS C methods across all scenarios, although the differences, if any, were small. Overall, the MW method performed slightly better than the ELiSS B and D methods, with the differences being more evident in the case of asymmetric distributions. In the case of symmetric distributions, the methods had approximately the same sensitivity across most scenarios (Figs. 5 and 6).

Fig. 5 Method-wise performance of scan statistics across the varied scenarios with respect to the distributions. a Normal, b Logistic, c Lognormal and d Gamma

BS: baseline scenario (True cluster at L1, n=100, c=3 and r=50); Si, where i= 1, 2..6 refers to one variation from the baseline scenario. S1: c=1, S2: c=2, S3: r=10, S4: r=30, S5: n=200 and S6: True cluster at L2

Fig. 6 Similarity/differences in the performance of the methods across the distributions. a MW, b Normal, c ELiSS A, d ELiSS B, e ELiSS C and f ELiSS D

BS: baseline scenario (True cluster at L1, n=100, c=3 and r=50); Si, where i= 1, 2..6 refers to one variation from the baseline scenario. S1: c=1, S2: c=2, S3: r=10, S4: r=30, S5: n=200 and S6: True cluster at L2

The normal method had the lowest sensitivity (except for the larger true cluster) which was considerably less at c = 1, especially in the case of asymmetric distributions. For the smaller true cluster, the normal and ELiSS C methods had approximately similar sensitivity across r, except for the lognormal distribution. The normal method performed slightly better than the latter across r for asymmetric distributions. For c = 1, 2 and n = 200, the ELiSS C performed better than normal except for asymmetric distributions at c = 2 where the two methods had approximately the same sensitivity. For n = 200, all except the normal method had almost similar sensitivity with differences being more evident for asymmetric distributions.

For the smaller true cluster, ELiSS A mostly performed slightly better or approximately the same as the MW method, with the differences being more evident in the case of symmetric distributions.

All methods performed considerably better compared to the ELiSS C method for the bigger true cluster irrespective of the distribution. For the larger true cluster in the case of symmetric distributions, the ELiSS A method had the highest sensitivity followed by the MW method. In the case of asymmetric distributions, the MW method had the highest sensitivity and performed better than the former. For lognormal distribution, the latter method performed better than just the ELiSS C method, though only slightly less compared to the normal method. Except for ELiSS C (for all) and ELiSS A (except for gamma distribution) methods, the other methods had approximately the same sensitivity for the larger true cluster.

Except for the normal method at c = 1, ELiSS A (only for lognormal), and ELiSS C methods for the larger true cluster, the sensitivity of methods ranged from 0.9 to 1.

3.1.2 Similarity/differences in the performance of the methods across the distributions

The sensitivity of the MW method was almost the same across the distributions in all scenarios except for when c = 1. The MW method had the highest sensitivity for lognormal distribution with only small differences across others. The sensitivity of the normal method was more or less constant across the distributions and scenarios except when c = 1. The sensitivity was considerably higher for the symmetric distributions.

There was only a slight difference, if any, in the sensitivity of the ELiSS A method with respect to the sample size and value of c across the distributions. Irrespective of the variable window size (r), the ELiSS A method had 100 percent sensitivity for symmetric distributions. The differences in sensitivity across the symmetric and asymmetric distributions decreased as r increased, with the maximum being at r = 10 percent. For the larger true cluster, there were considerable differences in the sensitivity between symmetric and asymmetric distributions, with sensitivity being higher for the former.

The ELiSS C method had higher sensitivity for symmetric distributions irrespective of the scenario. As the sample size increased the differences between symmetric and asymmetric distributions decreased. Though there were only small differences in sensitivity across the distributions for values of c, the difference in the case of symmetric distributions decreased to nil as c increased. For asymmetric distributions, the difference increased from zero as c increased, with lognormal distribution having the least sensitivity. For all distributions, the sensitivity remained almost constant across r.

There are more similarities than differences in the performance of the ELiSS B and D scan statistics and therefore they are considered together. The sensitivity was more or less similar across the distributions and scenarios. The sensitivity was slightly higher for the symmetric distributions in most scenarios. Both methods had slightly better sensitivity for logistic distributions across c.

3.2 PPV

3.2.1 Method-wise performance of scan statistics across the varied scenarios with respect to the distributions

The ELiSS A method had the least while the normal method had the highest PPV irrespective of the distribution or the scenario. For all the scenarios involving the smaller true cluster, the MW method had the second lowest PPV, though with a large difference from ELiSS A (smallest difference at r = 10 percent). The ELiSS B and D methods followed next with the methods having identical PPV except in the case of the normal distribution at r = 30, 50 percent. The two methods had approximately the same PPV even then with ELiSS B method performing only slightly better. The two methods when compared to the MW method performed much better in the case of asymmetric distributions. Though there were considerable differences between the methods across the distributions, in the case of symmetric distributions, the PPV of the MW method was only slightly different from that of the two methods at r = 10 percent. The ELiSS C method had the second highest PPV with only small differences from that of the ELiSS B and D methods. The PPV of the ELiSS C method was much closer to that of the normal method in the case of asymmetric distributions.

The only difference in the ordering of the methods in the case of the larger true cluster was for the PPV of the ELiSS C method. The method only performed slightly better than the MW method in the case of asymmetric distributions. It was the opposite in the case of symmetric distributions, though the PPV was approximately the same.

The ELiSS A method had the highest PPV at r = 10 while the lowest was at c = 1 or n = 200 depending on the type of distribution. The PPV of the other methods was the highest for the larger true cluster and the lowest at c = 1.

3.2.2 Similarity/differences in the performance of the methods across the distributions

The PPV of the MW method was mostly similar across the distributions, being slightly higher for symmetric distributions, especially when c = 1, 2 and for the bigger sample size. The normal method had a more or less constant PPV, with only the slightest differences, if any, across the distributions and scenarios. For each of the ELiSS versions, the PPV across symmetric distributions was mostly similar. They had better PPV for asymmetric distributions (highest for lognormal distribution) across almost all scenarios.

3.3 Power

3.3.1 Method-wise performance of scan statistics across the varied scenarios with respect to the distributions

All methods had 100 percent power for the larger true cluster and the least power at c = 1 irrespective of the type of distribution. The normal, ELiSS B, C and D methods had power at or near 100 percent except for c = 1, 2. For n = 200, ELiSS A followed by the MW method had the largest difference in power from the other methods. All methods had power at or near 100 percent across r except for the ELiSS A method, which maintained this pattern for r = 10 percent but had slightly lower power for r = 30, 50 percent, especially in the case of the lognormal distribution.

ELiSS A (followed by the MW method) had considerably lower power for c = 1 (symmetric distributions) and c = 2 (all distributions) when compared to the other methods. At c = 1, the ELiSS A method was only slightly better than the normal method in the case of the lognormal distribution. For the gamma distribution, the opposite was true, with the MW method performing slightly better than the normal method.

At c = 1, the normal method and the ELiSS C method had the highest power for the normal and logistic distributions, respectively. Though the difference between these two methods was slightly higher for c = 1, the power was much closer in the case of symmetric distributions. The power of the ELiSS B and D methods was mostly the same across the scenarios, with a few exceptions where the ELiSS B method was only slightly better. With respect to the above two methods at c = 1, the ELiSS C method had slightly better power for the normal distribution, while the normal method had approximately the same power for the logistic distribution.

For c = 2, the power of the ELiSS B, C, and D methods was approximately the same, with these methods having the higher power for asymmetric distributions. In the case of symmetric distributions, the normal method performed only slightly better than these methods. The differences in power between the normal method and the three ELiSS methods were much larger for asymmetric distributions, with the normal method having lower power.

3.3.2 Similarity/differences in the performance of the methods across the distributions

The MW method had similar power across the distributions in most scenarios, with only small differences in the exceptions (when c = 1, 2 and for the larger sample size). The normal method had better power for symmetric distributions at c = 1 and 2, and in other scenarios, the power was almost the same across the distributions. The power of the ELiSS A method was slightly better for asymmetric distributions at c = 1, 2 and exceptionally better for the larger sample size. The method had comparatively lower power for the lognormal distribution while maintaining the highest power for the normal distribution across r. The ELiSS B, C and D methods had constant power across most scenarios except for c = 1, 2. For such exceptions, these ELiSS methods had slightly better power for symmetric distributions.

4 Application to rape crime data

We considered the detection of the most likely clusters of high rape rates in India for the year 2011 using rape data from the 640 districts as per the 2011 census [41]. The district-level rape data were extracted from the National Crime Records Bureau [42], and district-wise female population data for the year 2011 were extracted from the Census of India [41]. The rape crime rate was calculated as the number of rapes per 100,000 of the female population. The objective was to apply the ELiSS A, B, C, D statistics, the MW scan statistic, and the normal scan statistic to the rape crime data to detect the likely clusters of high rape rates.

The analyses were performed in R software. The maximum size of the circular scanning window was chosen as 100 km, with the window including at most five percent of the observations (the unit of observation being districts, varying in size and totalling 640 in number). Statistical significance was based on the P value calculated through 999 Monte Carlo replications. A P value < 0.05 was considered statistically significant. Among the statistically significant clusters identified based on the P value, the most likely (primary) cluster was the one with the maximum likelihood ratio. Secondary clusters detected using the standard version of the normal scan statistic were reported based on the criterion of ‘no geographical overlap’ [43]. The district-level shapefile of India based on the 2011 census was used under the Creative Commons license and was obtained from GitHub [44]. The statistically significant spatial clusters, detected using the various scan statistics, were mapped using GeoDa 1.20.0.8 software [45]. In addition, one non-significant cluster was mapped.
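The Monte Carlo P value can be obtained by ranking the observed scan statistic among statistics recomputed on randomly permuted data. The following Python sketch shows the usual rank-based form, (1 + number of replicates at least as large as the observed value) / (replications + 1). It is an illustration rather than the code used in this analysis (which was done in R): `monte_carlo_p_value` is a hypothetical name, and `statistic` stands for any function that takes the values in location order and returns the maximum of the chosen scan statistic over all windows.

```python
import random


def monte_carlo_p_value(values, statistic, n_replications=999, seed=None):
    """Rank-based permutation P value for a scan statistic (sketch)."""
    rng = random.Random(seed)
    observed = statistic(values)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_replications):
        # Randomly reassign the observed values to the fixed locations.
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            at_least_as_large += 1
    return (1 + at_least_as_large) / (n_replications + 1)
```

With 999 replications the smallest attainable P value is 0.001, and each call to `statistic` repeats the full scan over all windows, which is where most of the computing time goes.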

The most likely clusters detected by the various methods were spread across the North Eastern Zone, with their location and extent varying across the methods (Table 1 and Fig. 7). The location of the primary cluster detected by the MW and ELiSS A methods was that of the secondary cluster detected by the remaining methods, and vice versa. The most likely cluster detected by the MW and ELiSS A methods covered 11 districts of Assam and two districts of Meghalaya, with the ELiSS A statistic alone additionally including the East Garo Hills and East Khasi Hills districts. The primary cluster detected by the ELiSS B, C and D methods covered six districts of Mizoram, two districts of Tripura, and two districts of Assam; the normal method detected the same cluster minus the Karimganj district. (See Table 1 in Online Resource 2 for more information.)

Table 1 High rape spatial clusters detected using the various spatial scan statistics for 2011 all India rape data
Fig. 7 Spatial clusters of high rape rates detected across India for the year 2011 using the various scan statistics. a Normal, b MW, c ELiSS A, d ELiSS B, e ELiSS C and f ELiSS D

The number of districts contributing to the potential spatial clusters was largest for the MW method and smallest for the normal method (Table 2). Overall, the ELiSS B and C methods had the most districts in common with every method except ELiSS A, which shared more districts with the MW method. The ELiSS A method had the fewest districts in common with every method except MW, and the normal method had the fewest districts in common with the MW and ELiSS A methods.
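A pairwise comparison of this kind can be sketched with set operations. The denominator is an assumption made here for illustration (the share of one method's cluster districts that the other method also detects); the method labels and district codes are hypothetical placeholders, not the detected clusters.

```python
def shared_proportion(districts_a, districts_b):
    """Share of method A's cluster districts that method B also detected.

    Assumption: the proportion is taken relative to method A's districts.
    """
    a, b = set(districts_a), set(districts_b)
    if not a:
        raise ValueError("method A detected no districts")
    return len(a & b) / len(a)


# Hypothetical cluster districts per method (placeholder codes, not real results).
detected = {
    "method_1": {"d1", "d2", "d3", "d4"},
    "method_2": {"d1", "d2", "d3", "d4", "d5", "d6"},
    "method_3": {"d5", "d6", "d7"},
}

# Asymmetric matrix: the entry (a, b) is relative to the districts of method a.
overlap = {
    (a, b): shared_proportion(detected[a], detected[b])
    for a in detected
    for b in detected
    if a != b
}
```

Because the proportion is relative to the first method, the matrix is not symmetric: a method with small clusters nested inside those of another method scores 1.0 in one direction and less in the other.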

Table 2 Proportion of common clusters detected by the various scanning methods for rape in India, 2011

The likely clusters detected by the ELiSS B, C and D methods were identical in location and extent. These three methods detected more clusters than the normal method and covered more districts even where the clusters were at similar locations. That is, all districts included in the clusters detected by the normal method were also covered by the ELiSS B, C and D methods, but not vice versa. The additional clusters detected by the ELiSS B, C and D methods were also covered by the MW method. The MW and ELiSS B methods each had two clusters not detected by the other, and the two methods also differed in the extent of a cluster at a similar location. Except for the change in coverage of districts from Meghalaya in the primary cluster and of districts from Arunachal Pradesh and Assam in a secondary cluster, the cluster districts detected by ELiSS A were also detected by the MW method. Apart from the first two most likely clusters, the location and extent of the detected clusters differed substantially between ELiSS A and the other methods (excluding the MW method).

5 Discussion

We have proposed a nonparametric spatial scan statistic, based on an empirical likelihood approach, for continuous data other than survival data. The proposed ELiSS statistic combines the distribution-free and likelihood-based aspects of the existing spatial scan statistics.

All compared methods and their versions consistently showed greater sensitivity than PPV for the smaller cluster, irrespective of the sample size, the difference in location parameters, and the size of the scanning window. While the sensitivity of the MW and ELiSS A methods was greater than PPV for the bigger cluster as well, the opposite held for the normal and ELiSS C methods. For both the ELiSS B and D methods, sensitivity was slightly lower than PPV for the asymmetric distributions, while the two measures were equal for the symmetric distributions. All methods had 100 percent power for the bigger cluster irrespective of the distribution. Even then, the accuracy measures of the compared methods were not 100 percent, indicating that spatial scan statistics determine only the general location of a localized spatial cluster and not its exact boundaries. The size of the detected clusters affects the accuracy measures of the spatial scan statistics. For the smaller true cluster, the cluster detected by every method was somewhat bigger than the true cluster. For the larger true cluster, the clusters detected by methods other than MW and ELiSS A were somewhat smaller than the true cluster.
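The link between the size of the detected cluster and the two accuracy measures can be illustrated with a short sketch. It assumes the usual definitions, which may differ in detail from those used in the simulation study: sensitivity is the share of true-cluster locations that are detected, and PPV is the share of detected locations that belong to the true cluster. The clusters below are hypothetical sets of location indices.

```python
def sensitivity_and_ppv(true_cluster, detected_cluster):
    """Return (sensitivity, PPV) for a detected cluster against the true one.

    Assumed definitions:
      sensitivity = |true AND detected| / |true|
      PPV         = |true AND detected| / |detected|
    """
    true_set, detected_set = set(true_cluster), set(detected_cluster)
    if not true_set or not detected_set:
        raise ValueError("both clusters must be non-empty")
    hits = len(true_set & detected_set)
    return hits / len(true_set), hits / len(detected_set)


# Hypothetical true cluster of 10 locations (indices 0..9).
true_cluster = set(range(10))

# Detected cluster larger than the truth: all true locations plus two extras.
sens_over, ppv_over = sensitivity_and_ppv(true_cluster, set(range(12)))

# Detected cluster smaller than the truth: only eight of the true locations.
sens_under, ppv_under = sensitivity_and_ppv(true_cluster, set(range(8)))
```

An oversized detected cluster gives sensitivity above PPV, and an undersized one gives PPV above sensitivity, which is the pattern described for the smaller and larger true clusters.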

Among the four ELiSS versions, the B and D methods are recommended because they have good, and the most stable, performance in terms of power and accuracy measures irrespective of the distribution. While the ELiSS A method had the highest sensitivity, it had considerably lower power and PPV, and its performance varied across distributions. The sensitivity of the ELiSS C method decreased drastically for the larger true cluster, with the reduction being slightly greater for asymmetric distributions. In particular, the ELiSS C method had lower sensitivity for the lognormal distribution than for the others.

In terms of power, the normal scan statistic performed considerably worse than the ELiSS B and D statistics for the lognormal distribution at the smaller differences in means (c = 1, 2). In other instances, the methods had little to no difference in power. The ELiSS B and D statistics had either equivalent or much better power than the MW scan statistic. The normal method had better PPV than the ELiSS B and D methods, with only small differences in the case of asymmetric distributions. The PPV of the normal method was only slightly better for the larger true cluster. The MW method had lower PPV than the ELiSS B and D methods, with the latter performing considerably better for asymmetric distributions. The ELiSS B and D methods had better sensitivity than the normal method, which had considerably lower sensitivity for c = 1 in the case of asymmetric distributions. In terms of sensitivity, the MW method performed slightly better than, or almost the same as, the ELiSS B and D methods. Thus, the B and D versions of the ELiSS statistic are excellent alternatives to both the normal and MW scan statistics for continuous data from skewed or unknown distributions.

The application to the India rape data for 2011 shows that there are similarities as well as differences in the results of the various scan statistics in terms of the number, order, location, and extent of the potential clusters detected. There is also considerable overlap between the clusters detected using the different scan statistics. The ELiSS B, C and D methods had more cluster districts in common with the other methods. They also detected a few additional clusters when compared with the normal and MW methods. Hence, considering more than one method when analysing real data could help in better understanding the high-risk areas.

6 Conclusion

In this paper, the ELiSS statistic was proposed for detecting purely spatial clusters in univariate continuous data using circular windows. The methods have the advantage of being distribution-free while still allowing the use of likelihood methods, which is beneficial in real-world applications where the data might come from unknown or skewed distributions. The ELiSS B and D methods are recommended for applications as an alternative and/or complementary method to the other spatial scan statistics.

The extension of the method to a space–time setting, detection of spatial clusters after adjusting for covariates, or the use of flexible scanning windows to improve the precision of the detected clusters might be of future interest.