An empirical likelihood approach for detecting spatial clusters of continuous data

Mathews, Maria; Guddattu, Vasudeva; Binu, V. S.; Rao, K. Aruna

doi:10.1007/s41324-024-00592-y

Open access
Published: 27 June 2024

Spatial Information Research

Abstract

Spatial scan statistics are an important tool for detecting and evaluating the statistical significance of spatial clusters and have widespread applications in various fields. The study proposes a new nonparametric spatial scan statistic based on the empirical likelihood method as an alternative to existing methods for detecting clusters of continuous outcomes from unknown or skewed probability distributions. The existing methods are either based on distribution-free methods or on likelihood ratio tests assuming a probability distribution. The proposed spatial scan statistic is based on the empirical likelihood method, which remains distribution-free while allowing the use of likelihood methods. The performance of the proposed method was compared to the Mann–Whitney-based nonparametric scan statistic and the normal model-based scan statistic through a simulation study under varied scenarios as well as an application to real data. The proposed method had better positive predictive value compared to the Mann–Whitney-based scan statistic, and better sensitivity than the normal-based scan statistic. The methods had little to no difference in terms of power, with the proposed method performing slightly better in most scenarios. The number, order, location, and extent of the potential clusters detected from the rape crime data from India for the year 2011 varied across methods, with certain similarities and differences. The Mann–Whitney and normal scan statistics detected more clusters in common with the proposed method than with each other. The proposed method serves as a good alternative and/or complementary method to existing spatial scan statistics for continuous outcomes when the underlying distribution is unknown or asymmetric.

1 Introduction

Spatial scan statistics are a method for the detection and evaluation of localized spatial clusters. The method is widely utilized in fields such as spatial epidemiology, criminology, and disease surveillance, among others [1]. Joseph Naus, known as the “father of the scan statistic”, first studied the application of the traditional scan statistic in a two-dimensional setting [2]. A rectangular scanning window of fixed shape and size was moved over the entire study region to identify a region, if any, with a higher concentration of points than would be expected by chance. Extending the idea, Kulldorff [3] proposed a spatial scan statistic, based on the likelihood ratio test, for variable-size circular scanning windows. Numerous scanning windows of varying sizes are imposed over the entire study area, and the likelihood ratio test statistic is calculated by comparing the inside versus outside regions of each scanning window. The most likely cluster corresponds to the window with the maximum value of the test statistic. Abolhassani and Prates [4] have reviewed the development and expansion of spatial scan statistics over the last three decades. Traditionally, spatial scan statistics are based on likelihood ratio tests for probability distributions such as the binomial, Poisson, and normal. Spatial scan statistics for continuous data, apart from survival data, are usually modeled assuming a univariate or multivariate normal distribution, as the case may be [5,6,7,8].

Cucala [9] introduced a distribution-free scan statistic based on a concentration index for any type of data, be it discrete or continuous. The scan statistic relying on the standardized difference in means was shown equivalent to the scan statistic introduced by Kulldorff et al. [7] that assumes a normal distribution. A nonparametric scan statistic based on the Mann–Whitney (MW)/Wilcoxon rank-sum test method was also introduced for continuous outcomes. The nonparametric scan statistic proposed by Jung and Cho [10] was defined as the minimum of p-values from the Wilcoxon rank-sum tests over multiple circular scanning windows. P-values were obtained using the normal approximation of the Wilcoxon rank-sum test. The MW scan statistic proposed by Cucala [11] was defined as the maximum of the standardized MW statistic over numerous circular scanning windows.

Empirical likelihood, introduced by Owen [12, 13], is a nonparametric method for statistical inference based on the empirical distribution of the available data. Review papers exist on theoretical advancement in empirical likelihood and variants [14, 15], empirical likelihood methods for regression [16], and time series data [17]. The empirical likelihood method and its variants have long been applied in a spatial setting, such as for spatial lattice data and spatial regression [18, 19], irregularly located spatial data [20], variogram estimation [21], spatial Markov models [22], spatial quantile regression [23], spatial panel data models [24], small area estimation [25,26,27,28], and spatial autoregressive models [29, 30]. With regard to cluster detection, a spatial scan statistic based on empirical likelihood has been recently introduced for zero-inflated data [31].

The present study proposes an empirical likelihood approach to spatial scan statistics for continuous outcomes. The existing spatial scan statistics are either based on distribution-free methods or on likelihood ratio tests assuming a probability distribution. The proposed spatial scan statistic is based on empirical likelihood, which has the advantage of remaining distribution-free while allowing the use of likelihood methods. In situations where the underlying distribution of the real data is unknown or skewed, it might be better to choose a spatial scan statistic that performs well on spatial accuracy measures such as sensitivity and positive predictive value (PPV) while maintaining high statistical power. Therefore, the proposed empirical likelihood spatial scan (ELiSS) statistic is also compared to the MW scan statistic and the normal scan statistic [7] in terms of statistical power and accuracy measures. The rest of the paper is structured as follows. Section 2 briefly describes the normal scan statistic and presents the ELiSS method and the simulation scenarios for evaluating the performance of the considered scan statistics. Results of the application of the considered methods in numerical studies are reported in Sect. 3. In Sect. 4, the application of the methods to a real data set is described, and the paper ends with a discussion in Sect. 5 followed by the conclusion in Sect. 6.

2 Methods

2.1 The normal scan statistic

The normal scan statistic for continuous data [7] is described as follows. The null hypothesis H₀ states that all observations come from the same normal distribution. The alternative hypothesis H_a states that there is one cluster location where the observations have either a larger or smaller mean than outside that cluster. Let there be N continuous observations with values ${x}_{i}$, i = 1, …, N. Corresponding to each observation, there is a spatial location s, s = 1, …, S, with specified latitude and longitude, and S ≤ N. For each location s, the sum of the observed values is ${x}_{s}=\sum_{i\in s}{x}_{i}$ and the number of observations in the location is ${n}_{s}$. The sum of all the observed values is $X=\sum_{i}{x}_{i}$. For each circular scanning window z (with more than one observation), let ${n}_{z}=\sum_{s\in z}{n}_{s}$ be the number of observations in circle z, and let ${x}_{z}=\sum_{s\in z}{x}_{s}$ be the sum of the observed values in circle z. The likelihood function for deriving the likelihood ratio test for the normal model with mean μ and variance ${\sigma }^{2}$ is expressed as,

$$L=\prod_{i}\frac{1}{\sigma \sqrt{2\pi }}\,e^{-\frac{({x}_{i}-\mu )^{2}}{2{\sigma }^{2}}}$$

The spatial scan statistic is defined as the maximum of the log-likelihood ratio LLR(z) over all the circular windows, i.e. max_z ln(L_z / L₀) = max_z (ln L_z − ln L₀), where L_z is the maximized likelihood for the circle z under H_a and L₀ is the maximized likelihood under H₀. That is, the likelihood ratio test statistic is calculated for each of the multiple circular scanning windows, and the zone (associated region) that maximizes the value of the test statistic is considered the most likely cluster. The radius of the circular scanning window, centered on an observation, is allowed to vary continuously from zero up to an upper limit, often chosen so that the window contains at most 50 percent of all observations. For assessing the statistical significance of the most likely cluster, a random permutation-based Monte Carlo hypothesis testing procedure is used.
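The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`circular_windows`, `normal_llr`, `normal_scan`) are hypothetical, windows are built by adding nearest neighbours one at a time (ties in distance are ignored), and the closed form (N/2)·ln(σ̂₀²/σ̂_z²) is what the normal likelihood above reduces to when the maximum likelihood estimates are plugged in under a common variance.

```python
import numpy as np

def circular_windows(coords, max_frac=0.5):
    """Yield boolean masks for circles centred on each observation.

    Assumption: a circle is represented by its k nearest observations,
    k = 2, ..., floor(max_frac * N); tied distances are not merged.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    max_size = int(np.floor(max_frac * n))
    for centre in range(n):
        dist = np.linalg.norm(coords - coords[centre], axis=1)
        order = np.argsort(dist, kind="stable")
        for k in range(2, max_size + 1):
            mask = np.zeros(n, dtype=bool)
            mask[order[:k]] = True
            yield mask

def normal_llr(x, inside):
    """ln(L_z / L_0) for one window under the normal model."""
    x = np.asarray(x, dtype=float)
    inside = np.asarray(inside, dtype=bool)
    n = x.size
    # H0: one mean and one variance for the whole region (MLEs).
    var0 = np.mean((x - x.mean()) ** 2)
    # Ha: separate means inside/outside, common variance (MLE).
    res_in = x[inside] - x[inside].mean()
    res_out = x[~inside] - x[~inside].mean()
    var_z = (np.sum(res_in ** 2) + np.sum(res_out ** 2)) / n
    # Constant terms cancel, leaving (N/2) * ln(var0 / var_z).
    return 0.5 * n * np.log(var0 / var_z)

def normal_scan(x, coords, max_frac=0.5):
    """Return the maximum LLR and the mask of the most likely cluster."""
    best, best_mask = -np.inf, None
    for mask in circular_windows(coords, max_frac):
        llr = normal_llr(x, mask)
        if llr > best:
            best, best_mask = llr, mask
    return best, best_mask
```

Because this LLR depends only on how the observations are split, it does not distinguish a high-mean cluster from a low-mean one; restricting the search to windows whose inside mean exceeds the outside mean would be one way to scan for high clusters only.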

2.2 The empirical likelihood spatial scan (ELiSS) statistic

The normal scan statistic as well as the proposed empirical likelihood scan statistic are based on the likelihood ratio test between well-defined hypotheses. The proposed ELiSS statistic is derived based on the empirical likelihood method for testing the equality of means in a two-sample problem [32,33,34]. The asymptotic null distribution of the empirical likelihood test statistic for comparing means is chi-square with one degree of freedom. The description of the proposed ELiSS method is along the lines of the normal scan statistic differing only with respect to the test statistic.

The proposed nonparametric scan statistic for spatial clusters makes no distributional assumptions regarding the continuous data. Since the observations are continuous, we assume there are no ties. Under the null hypothesis H₀, the independent and identically distributed (iid) continuous observations come from an unknown univariate distribution E with unknown mean μ and unknown common variance ${\sigma }^{2}$. The alternative hypothesis is that there is at least one cluster location where the observations inside and outside the scanning window have unequal means. We define the ELiSS statistic as the maximum value of the empirical likelihood ratio test (ELRT) statistic comparing inside versus outside the scanning window over multiple windows.

With respect to each scanning window, two samples of observations are created, and we treat them as independent of each other. That is, we have $n_{z}$ observations inside and $(n-n_{z})$ observations outside the scanning window z, giving a total of n observations for the entire study region. Then we have two probability vectors $({p}_{1},\dots ,{p}_{n_{z}})$ and $({q}_{1},\dots ,{q}_{n-n_{z}})$ corresponding to the two samples of observations inside and outside the scanning window. That is, ${p}_{i},{q}_{j}\ge 0$ and ${\sum }_{i=1}^{n_{z}}{p}_{i}={\sum }_{j=1}^{n-n_{z}}{q}_{j}=1$, with ${p}_{i}$ and ${q}_{j}$ being the probability masses associated with the observations ${x}_{i}$ inside and ${x}_{j}$ outside the scanning window z, respectively.

The empirical likelihood function (EL) under H₀ is

$${\prod }_{i=1}^{n_{z}}{p}_{i} {\prod }_{j=1}^{n-n_{z}}{q}_{j},$$

(1)

which needs to be maximized subject to constraints ${\sum }_{i=1}^{n_{z}}{p}_{i}$ = ${\sum }_{j=1}^{n-n_{z}}{q}_{j}$ = 1 and $\sum {p}_{i}{x}_{i} -\sum {q}_{j}{x}_{j}=0.$

The EL under the alternative H_a, is

$${\prod }_{i=1}^{n_{z}}{p}_{i} {\prod }_{j=1}^{n-n_{z}}{q}_{j},$$

(2)

which needs to be maximized subject to constraints ${\sum }_{i=1}^{n_{z}}{p}_{i}$ = ${\sum }_{j=1}^{n-n_{z}}{q}_{j}$=1.

Then, the empirical likelihood ratio function is

$$\text{ELR}=\frac{EL_{H_{0}}}{EL_{H_{a}}},$$

(3)

where EL under H₀ and H_a is maximized using the Lagrange multiplier method [35].

We have ${\text{ELRT }} = - {\text{2 log ELR}},$

which becomes,

$$2\left[{\sum }_{i=1}^{n_{z}}log\left\{1+{\lambda }_{1}({x}_{i} -\upmu )\right\}+{\sum }_{j=1}^{n-n_{z}}log\left\{1+{\lambda }_{2}({x}_{j} -\upmu )\right\}\right].$$

(4)

Here ${\lambda }_{1}, {\lambda }_{2}$ (the Lagrange multipliers) are solutions to

$$\sum \frac{({x}_{i} -\mu )}{n_{z}(1+{\lambda }_{1}\left({x}_{i} -\mu \right))}= 0 \text{ and }\sum \frac{({x}_{j} -\mu )}{(n-n_{z})(1+{\lambda }_{2}({x}_{j} -\mu ))}= 0$$

(5)

which are nonlinear equations in ${\lambda }_{1} \text{ and } {\lambda }_{2}$.
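For orientation, the link between the constrained maximization and Eqs. (4) and (5) can be sketched as follows. This is the standard Lagrange-multiplier argument for empirical likelihood with a mean constraint, written here under the assumption that both samples share a given common mean $\upmu$ under H₀; it is not spelled out in this form in the text above. Maximizing each product subject to its sum-to-one constraint and to $\sum {p}_{i}({x}_{i}-\upmu )=0$, $\sum {q}_{j}({x}_{j}-\upmu )=0$ gives the weights

$$p_{i}=\frac{1}{n_{z}\left\{1+\lambda_{1}(x_{i}-\upmu)\right\}},\qquad q_{j}=\frac{1}{(n-n_{z})\left\{1+\lambda_{2}(x_{j}-\upmu)\right\}},$$

whereas under H_a, with only the sum-to-one constraints, the maximizers are $p_{i}=1/n_{z}$ and $q_{j}=1/(n-n_{z})$. Taking the ratio,

$$\text{ELR}=\prod_{i=1}^{n_{z}}n_{z}p_{i}\prod_{j=1}^{n-n_{z}}(n-n_{z})q_{j}=\prod_{i=1}^{n_{z}}\frac{1}{1+\lambda_{1}(x_{i}-\upmu)}\prod_{j=1}^{n-n_{z}}\frac{1}{1+\lambda_{2}(x_{j}-\upmu)},$$

so that $-2\log \text{ELR}$ is the expression in Eq. (4), and substituting the weights back into the mean constraints yields the two equations in (5).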

Even for a two-sample problem with a small sample size, the nonlinear optimization required by the general empirical likelihood approach is complex and time-consuming. The adaptation of the general ELRT as a spatial scan statistic was therefore not computationally feasible for us, since each scanning window gives rise to a two-sample problem. Hence, we use the numerical solutions to empirical likelihood methods found in the literature [36,37,38] to develop the one-step or first-order EL spatial scan statistic, i.e., the ELiSS statistic. The description is similar to that of the general EL spatial scan statistic, the difference being that we rely on these numerical solutions rather than on computerized optimization routines to obtain the ELRT.

We find the numeric solutions using the first-order Taylor series approximation method around ${\lambda }_{1},{\lambda }_{2}=0$ and we get,

$$\lambda_{1}=\frac{\sum \left(x_{i}-\upmu\right)}{\sum \left(x_{i}-\upmu\right)^{2}}\ \left(\text{or }\frac{\overline{x}_{in}-\upmu}{S}\right)\quad \text{and}\quad \lambda_{2}=\frac{\sum \left(x_{j}-\upmu\right)}{\sum \left(x_{j}-\upmu\right)^{2}}\ \left(\text{or }\frac{\overline{x}_{out}-\upmu}{S}\right),$$

(6)

which we substitute in Eq. (4) for the ELRT [here $\sum {\left({x}_{i} -\upmu \right)}^{2},\sum {\left({x}_{j} -\upmu \right)}^{2}$ or S carries the information about variance, while ${\overline{x} }_{in}$ and ${\overline{x} }_{out}$ denote the sample means inside and outside the scanning window]. We then obtain the maximum ELRT over the numerous scanning windows to get the most likely cluster.
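The step from Eq. (5) to Eq. (6) can be written out as a short worked derivation; this is a sketch of the usual first-order argument, shown for $\lambda_{1}$ (the case of $\lambda_{2}$ is identical with $x_{j}$ and $n-n_{z}$). Expanding around $\lambda_{1}=0$,

$$\frac{1}{1+\lambda_{1}(x_{i}-\upmu)}\approx 1-\lambda_{1}(x_{i}-\upmu),$$

so the first equation in (5) becomes

$$\sum_{i=1}^{n_{z}}(x_{i}-\upmu)\left\{1-\lambda_{1}(x_{i}-\upmu)\right\}=\sum_{i=1}^{n_{z}}(x_{i}-\upmu)-\lambda_{1}\sum_{i=1}^{n_{z}}(x_{i}-\upmu)^{2}=0\quad\Longrightarrow\quad \lambda_{1}=\frac{\sum (x_{i}-\upmu)}{\sum (x_{i}-\upmu)^{2}}.$$

Dividing numerator and denominator by $n_{z}$ gives $\lambda_{1}=(\overline{x}_{in}-\upmu)\big/\left\{\tfrac{1}{n_{z}}\sum (x_{i}-\upmu)^{2}\right\}$, which shows why the denominator can be read as a variance-type quantity S.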

Our proposed test statistic is then defined as follows:

$$\text{ELiSS}=\max_{z}\,\text{ELRT}_{z}=\max_{z}\left(-2\log \text{ELR}_{z}\right)=\max_{z}\,2\left[\sum_{i=1}^{n_{z}}\log\left\{1+\lambda_{1}(x_{i}-\upmu)\right\}+\sum_{j=1}^{n-n_{z}}\log\left\{1+\lambda_{2}(x_{j}-\upmu)\right\}\right],$$

(7)

where ${x}_{i}\in Z$ and ${x}_{j}\notin Z$.

We also used a constrained optimization routine to optimize the objective function while keeping the arguments of the log terms positive [35, 39]. It is necessary that $0\le {p}_{i}\le 1$, which implies that $\lambda$ and $\upmu$ must satisfy $1+\lambda ({x}_{i} -\upmu )\ge 1/n$ (or δ, where δ is a small number chosen by the researcher) for each i. We have imposed the conditions $1+{\lambda }_{1}({x}_{i} -\upmu )\ge 1/n_{z}$ and $1+{\lambda }_{2}({x}_{j} -\upmu )\ge 1/(n-n_{z})$, to be satisfied for each i and j.
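To make Eqs. (6) and (7) concrete, the first-order ELRT for one window and its maximum over a set of windows might be computed as below. This is a hedged sketch rather than the authors' code: the names `elrt_first_order` and `eliss` are hypothetical, $\upmu$ is replaced by the overall sample mean, the multipliers use the sum-of-squares form of Eq. (6), and the positivity conditions are imposed by simply flooring each log argument at 1/n_z or 1/(n − n_z), which is only one possible way to enforce them.

```python
import numpy as np

def elrt_first_order(x, inside, floor=True):
    """First-order ELRT (Eq. 4 with the lambdas of Eq. 6) for one window.

    x      : 1-D array of all n observations
    inside : boolean mask, True for observations inside the window
    floor  : if True, clamp each log argument from below at 1/group size
             (an assumed way of imposing the positivity conditions)
    """
    x = np.asarray(x, dtype=float)
    inside = np.asarray(inside, dtype=bool)
    mu = x.mean()  # overall mean used as the estimate of mu
    total = 0.0
    for group in (x[inside], x[~inside]):
        d = group - mu
        # One-step solution; undefined if every value in the group equals mu.
        lam = d.sum() / np.sum(d ** 2)
        arg = 1.0 + lam * d
        if floor:
            arg = np.maximum(arg, 1.0 / group.size)
        total += np.sum(np.log(arg))
    return 2.0 * total

def eliss(x, masks):
    """Maximum first-order ELRT over candidate windows, with its index."""
    values = [elrt_first_order(x, m) for m in masks]
    idx = int(np.argmax(values))
    return values[idx], idx
```

Here `masks` is any list of boolean window masks, for example nearest-neighbour circles around each location. Without the floor, a log argument can become negative for skewed groups, in which case the statistic is undefined.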

We consider four choices of information regarding variance for reaching the value of lambda in Eq. (6).

A. We consider the original formula where ${\lambda }_{1}=\frac{\sum \left({x}_{i} -\widehat{\upmu}\right)}{\sum {\left({x}_{i} -\widehat{\upmu}\right)}^{2}}$ and ${\lambda }_{2}=\frac{\sum \left({x}_{j} -\widehat{\upmu }\right)}{\sum {\left({x}_{j} -\widehat{\upmu }\right)}^{2}}$.

Here, $\widehat{\upmu }= \frac{X}{n}$ (maximum likelihood estimate for mean) where X and n are the sum and number of total observations in the entire study region, respectively.

B. ${\lambda }_{1}=\frac{{\overline{x} }_{in}-\widehat{\upmu }}{S}$ and ${\lambda }_{2}=\frac{{\overline{x} }_{out}-\widehat{\upmu }}{S}$. Here we consider S to be the common variance under H₀.

Hence $S=\frac{1}{n}\left\{\sum_{i\in Z}{\left({x}_{i} -\widehat{\upmu }\right)}^{2}+\sum_{j\notin Z}{\left({x}_{j} -\widehat{\upmu }\right)}^{2}\right\}$, which is the same as the maximum likelihood estimate of the variance, ${\widehat{\sigma }}^{2}=\frac{1}{n}\sum_{k=1}^{n}{\left({x}_{k} -\widehat{\upmu }\right)}^{2}$.

C. For the ${\lambda }_{1}$ and ${\lambda }_{2}$ defined in B, we consider pooled variance as an estimate of S.

S $=\frac{\left({n}_{1}-1\right){S}_{1}^{2}+({n}_{2}-1){S}_{2}^{2}}{{n}_{1}+{n}_{2}-2}$ where ${S}_{1}^{2}=\frac{\sum {\left({x}_{i} -{\overline{x} }_{in}\right)}^{2}}{{n}_{1}-1}$ and ${S}_{2}^{2}=\frac{\sum {\left({x}_{j} -{\overline{x} }_{out}\right)}^{2}}{{n}_{2}-1}$

Here, n₁ and n₂ refer to the number of observations inside and outside each scanning window, respectively.

D. O’Neill [40] studied the standard sampling problem, where a sample of n observations is taken from a finite population of size N using simple random sampling without replacement. It was reported that the overall variance can be decomposed into statistical quantities corresponding to the sampled (n observations) and unsampled (N − n observations) parts of the population. A distance measure comparing the means of the sampled and unsampled parts was also defined. Similarly, for the spatial scan statistic, with respect to each scanning window, we have n_z observations inside and n − n_z observations outside, creating a similar sampling problem. Hence, for the ${\lambda }_{1}$ and ${\lambda }_{2}$ defined in B, we incorporate $S'$, the variance decomposition as per O’Neill [40], as an estimate of the variance S.

$$S'=\frac{1}{{n}_{1}+{n}_{2}-1}\left[\left({n}_{1}-1\right){S}_{1}^{2}+\left({n}_{2}-1\right){S}_{2}^{2}+\frac{{n}_{1}{n}_{2}}{{n}_{1}+{n}_{2}}{\left({\overline{x} }_{in}-{\overline{x} }_{out}\right)}^{2}\right]$$
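The three variance estimates used in choices B, C, and D can be compared side by side with a small sketch. The function names `variance_estimates` and `first_order_lambdas` are hypothetical, and the code assumes at least two observations on each side of the window so that the within-group sample variances exist.

```python
import numpy as np

def variance_estimates(x, inside):
    """Return S for choices B, C and D for one scanning window."""
    x = np.asarray(x, dtype=float)
    inside = np.asarray(inside, dtype=bool)
    x_in, x_out = x[inside], x[~inside]
    n1, n2 = x_in.size, x_out.size
    s1_sq = x_in.var(ddof=1)   # sample variance inside the window
    s2_sq = x_out.var(ddof=1)  # sample variance outside the window
    within = (n1 - 1) * s1_sq + (n2 - 1) * s2_sq
    between = n1 * n2 / (n1 + n2) * (x_in.mean() - x_out.mean()) ** 2
    return {
        "B": np.mean((x - x.mean()) ** 2),    # MLE of the common variance
        "C": within / (n1 + n2 - 2),          # pooled variance
        "D": (within + between) / (n1 + n2 - 1),  # decomposition S'
    }

def first_order_lambdas(x, inside, S):
    """Lambda_1 and lambda_2 of choice B for a given variance estimate S."""
    x = np.asarray(x, dtype=float)
    inside = np.asarray(inside, dtype=bool)
    mu_hat = x.mean()
    return (x[inside].mean() - mu_hat) / S, (x[~inside].mean() - mu_hat) / S
```

Since the within-group and between-group sums of squares add up to the total sum of squares about the overall mean, the value returned for D equals the overall sample variance with denominator n − 1, so B and D differ only by the factor n/(n − 1); C, in contrast, excludes the between-group term.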

2.3 Simulation study

A simulation study was conducted to understand the performance of the proposed ELiSS method with respect to the four different choices of incorporating variance (versions of ELiSS correspond directly to choices A, B, C, and D as described earlier). To evaluate the performances of the considered approach, the four versions of the ELiSS statistic were also compared to the Mann–Whitney test-based nonparametric spatial scan statistic and the normal scan statistic with respect to statistical power, accuracy measures of sensitivity and PPV using simulated continuous data.

Under the simulation study using R software, a true cluster having higher outcomes in comparison with the rest of the study region was defined on a 0:100 × 0:100 rectangular grid. The true cluster was centered at the location coordinates (50,20) and had a radius of 10 units (L1). Four probability distributions namely normal, logistic, lognormal, and gamma distributions were considered with different values for the location parameters within and outside the defined true cluster, whereas the scale parameter was fixed at one for all four distributions. The mean of the distribution inside the true cluster was kept as c√2 for normal and logistic distributions, and 2 + c√2 for lognormal and gamma distributions where c = 1, 2, 3. The mean outside the cluster was kept as 0 for the symmetric distributions and as 2 for the asymmetric distributions. The location and scale parameters were defined as per Huang et al. [6] and Jung and Cho [10] with different values for c.

We generated 1000 random datasets for one baseline scenario and its variations. For the baseline scenario, the true cluster was at L1, the sample size was 100, and the maximum size of the scanning window r was kept as 50 percent of all observations. With respect to the true cluster, the value of c pertaining to the difference in location parameters was set as 3. The varied scenarios in the simulation study include a different sample size (n = 200), varying size of the scanning window (r = 10, 30), and differences in location parameters (c = 1, 2). Another variation in the baseline scenario was a bigger true cluster with a radius of 20 units, L2. The number of permutations for each simulated dataset was fixed at 999.
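One dataset of the baseline scenario could be generated along the following lines. The simulations in the study were run in R; the Python sketch below is only illustrative, and several details are assumptions not stated above: locations are drawn uniformly on the 100 × 100 region, only the normal distribution is shown, and the function name `simulate_dataset` is hypothetical.

```python
import numpy as np

def simulate_dataset(n=100, c=3, centre=(50.0, 20.0), radius=10.0, seed=None):
    """Simulate locations and normal outcomes with one high-mean cluster.

    Inside the true cluster the mean is c * sqrt(2); outside it is 0.
    The scale parameter is fixed at one, as in the described scenarios.
    """
    rng = np.random.default_rng(seed)
    # Assumption: observation locations are uniform on [0, 100] x [0, 100].
    coords = rng.uniform(0.0, 100.0, size=(n, 2))
    dist = np.hypot(coords[:, 0] - centre[0], coords[:, 1] - centre[1])
    in_cluster = dist <= radius
    mean = np.where(in_cluster, c * np.sqrt(2.0), 0.0)
    x = rng.normal(loc=mean, scale=1.0)
    return coords, x, in_cluster
```

The larger true cluster L2 would correspond to `radius=20.0`, and the other variations to different values of `n` and `c`.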

The statistical power was defined as “the proportion of the 1000 random data sets for which the null hypothesis is rejected at the significance level of 0.05” and was expressed as a percentage. Sensitivity was defined “as the proportion of the number of cells correctly detected among the cells in the true cluster”. PPV was defined “as the proportion of the number of cells belonging to the true cluster among the cells in the detected cluster”. As the measures of accuracy varied between the 1000 random data sets, only the average value of the measures over the datasets for which the null hypothesis was rejected was considered.
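These three measures translate directly into code. The sketch below follows the definitions quoted above; the function names and the representation of clusters as boolean masks over cells are assumptions made for illustration.

```python
import numpy as np

def sensitivity_ppv(detected, truth):
    """Sensitivity and PPV of one detected cluster against the true cluster."""
    detected = np.asarray(detected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    true_pos = np.sum(detected & truth)
    sensitivity = true_pos / truth.sum()  # share of true-cluster cells found
    ppv = true_pos / detected.sum()       # share of detected cells that are true
    return sensitivity, ppv

def summarize(p_values, detected_list, truth, alpha=0.05):
    """Power (percent) and mean accuracy over the datasets where H0 is rejected."""
    rejected = np.asarray(p_values) < alpha
    power = 100.0 * rejected.mean()
    pairs = [sensitivity_ppv(d, truth)
             for d, r in zip(detected_list, rejected) if r]
    if pairs:
        mean_sens, mean_ppv = np.mean(pairs, axis=0)
    else:
        mean_sens = mean_ppv = float("nan")
    return power, mean_sens, mean_ppv
```

Averaging only over rejected datasets mirrors the last sentence of the paragraph above; a method that rejects rarely therefore reports accuracy from few datasets.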

3 Results

The estimated accuracy measures and power of the considered methods under varying sample size, true cluster size, scanning window size, and values of the parameters are depicted in Figs. 1, 2, 3 and 4 (see also Tables 1, 2, 3 and 4 in Online Resource 1). The ELiSS A method had the largest difference between the accuracy measures while the normal method had the smallest. The PPV of the methods was lower than the sensitivity for the smaller true cluster. The power of all the methods either increased to or remained constant at 100 percent as the true cluster became larger. The methods had considerably lower power at c = 1 compared to other scenarios.

3.1 Sensitivity

3.1.1 Method-wise performance of scan statistics across the varied scenarios with respect to the distributions

There was almost no difference in the sensitivity of the ELiSS B and D methods irrespective of the distribution. These two methods mostly performed better than the normal and ELiSS C methods across all scenarios, though the differences, if any, were small. Overall, the MW method performed slightly better than the ELiSS B and D methods, with the differences being more evident in the case of asymmetric distributions. In the case of symmetric distributions, the methods had approximately the same sensitivity across most scenarios (Figs. 5 and 6).

The normal method had the lowest sensitivity (except for the larger true cluster) which was considerably less at c = 1, especially in the case of asymmetric distributions. For the smaller true cluster, the normal and ELiSS C methods had approximately similar sensitivity across r, except for the lognormal distribution. The normal method performed slightly better than the latter across r for asymmetric distributions. For c = 1, 2 and n = 200, the ELiSS C performed better than normal except for asymmetric distributions at c = 2 where the two methods had approximately the same sensitivity. For n = 200, all except the normal method had almost similar sensitivity with differences being more evident for asymmetric distributions.

For the smaller true cluster, ELiSS A mostly performed slightly better or approximately the same as the MW method, with the differences being more evident in the case of symmetric distributions.

All methods performed considerably better compared to the ELiSS C method for the bigger true cluster irrespective of the distribution. For the larger true cluster in the case of symmetric distributions, the ELiSS A method had the highest sensitivity followed by the MW method. In the case of asymmetric distributions, the MW method had the highest sensitivity and performed better than the former. For lognormal distribution, the latter method performed better than just the ELiSS C method, though only slightly less compared to the normal method. Except for ELiSS C (for all) and ELiSS A (except for gamma distribution) methods, the other methods had approximately the same sensitivity for the larger true cluster.

Except for the normal method at c = 1, ELiSS A (only for lognormal), and ELiSS C methods for the larger true cluster, the sensitivity of methods ranged from 0.9 to 1.

3.1.2 Similarity/differences in the performance of the methods across the distributions

The sensitivity of the MW method was almost the same across the distributions in all scenarios except for when c = 1. The MW method had the highest sensitivity for lognormal distribution with only small differences across others. The sensitivity of the normal method was more or less constant across the distributions and scenarios except when c = 1. The sensitivity was considerably higher for the symmetric distributions.

There was only a slight difference, if any, in the sensitivity of the ELiSS A method with respect to the sample size and value of c across the distributions. Irrespective of the variable window size (r), the ELiSS A method had 100 percent sensitivity for symmetric distributions. The differences in sensitivity across the symmetric and asymmetric distributions decreased as r increased, with the maximum being at r = 10 percent. For the larger true cluster, there were considerable differences in the sensitivity between symmetric and asymmetric distributions, with sensitivity being higher for the former.

The ELiSS C method had higher sensitivity for symmetric distributions irrespective of the scenario. As the sample size increased the differences between symmetric and asymmetric distributions decreased. Though there were only small differences in sensitivity across the distributions for values of c, the difference in the case of symmetric distributions decreased to nil as c increased. For asymmetric distributions, the difference increased from zero as c increased, with lognormal distribution having the least sensitivity. For all distributions, the sensitivity remained almost constant across r.

There are more similarities than differences in the performance of the ELiSS B and D scan statistics and therefore they are considered together. The sensitivity was more or less similar across the distributions and scenarios. The sensitivity was slightly higher for the symmetric distributions in most scenarios. Both methods had slightly better sensitivity for logistic distributions across c.

3.2 PPV

3.2.1 Method-wise performance of scan statistics across the varied scenarios with respect to the distributions

The ELiSS A method had the least while the normal method had the highest PPV irrespective of the distribution or the scenario. For all the scenarios involving the smaller true cluster, the MW method had the second lowest PPV, though with a large difference from ELiSS A (smallest difference at r = 10 percent). The ELiSS B and D methods followed next with the methods having identical PPV except in the case of the normal distribution at r = 30, 50 percent. The two methods had approximately the same PPV even then with ELiSS B method performing only slightly better. The two methods when compared to the MW method performed much better in the case of asymmetric distributions. Though there were considerable differences between the methods across the distributions, in the case of symmetric distributions, the PPV of the MW method was only slightly different from that of the two methods at r = 10 percent. The ELiSS C method had the second highest PPV with only small differences from that of the ELiSS B and D methods. The PPV of the ELiSS C method was much closer to that of the normal method in the case of asymmetric distributions.

The only difference in the ordering of the methods in the case of the larger true cluster was for the PPV of the ELiSS C method. The method only performed slightly better than the MW method in the case of asymmetric distributions. It was the opposite in the case of symmetric distributions, though the PPV was approximately the same.

The ELiSS A method had the highest PPV at r = 10 while the lowest was at c = 1 or n = 200 depending on the type of distribution. The PPV of the other methods was the highest for the larger true cluster and the lowest at c = 1.

3.2.2 Similarity/differences in the performance of the methods across the distributions

The PPV of the MW method was mostly similar across the distributions, with PPV being slightly higher for symmetric distributions, especially when c = 1, 2 and for the bigger sample size. The normal method had more or less constant PPV, with only the slightest differences, if any, across the distributions and scenarios. For each of the ELiSS versions, the PPV across symmetric distributions was mostly similar. They had better PPV for asymmetric distributions (highest for lognormal distribution) across almost all scenarios.

3.3 Power

3.3.1 Method-wise performance of scan statistics across the varied scenarios with respect to the distributions

All methods had 100 percent power for the larger true cluster and the least power at c = 1 irrespective of the type of distribution. The normal, ELiSS B, C and D methods had approximately or equal to 100 percent power except for c = 1, 2. For n = 200, ELiSS A followed by the MW method had the largest difference in power from other methods. All methods had approximately or equal to 100 percent power across r except for the ELiSS A method. The method, while maintaining the above pattern for r = 10 percent, had slightly lower power for r = 30, 50 percent, especially in the case of lognormal distribution.

ELiSS A (followed by the MW method) had considerably lower power for c = 1 (symmetric distributions) and c = 2 (all distributions) when compared to other methods. At c = 1, the ELiSS A method was only slightly better compared to the normal method in the case of lognormal distribution. For gamma distribution, the opposite was true, with the MW method performing slightly better than the normal method.

At c = 1, the normal method and the ELiSS C method had the highest power for normal and logistic distributions, respectively. Though the difference between these two methods was slightly higher for c = 1, the power was much closer in the case of symmetric distributions. The power of the ELiSS B and D methods was mostly the same across the scenarios, with few exceptions where the ELiSS B method was only slightly better. With respect to these two methods at c = 1, the ELiSS C method had slightly better power for normal distribution, while the normal method had approximately the same power for logistic distribution.

For c = 2, the power of the ELiSS B, C, and D methods was approximately the same, and was higher for asymmetric distributions. In the case of symmetric distributions, the normal method performed only slightly better than these methods. The differences in power between the normal and the three methods were much larger for asymmetric distributions, with the normal method having lower power.

3.3.2 Similarity/differences in the performance of the methods across the distributions

The MW method had similar power across the distributions in most scenarios, with only small differences in the exceptions (when c = 1, 2 and for the larger sample size). The normal method had better power for symmetric distributions at c = 1 and 2, and in other scenarios, the power was almost the same across the distributions. The power of the ELiSS A method was slightly better for asymmetric distributions at c = 1, 2 and exceptionally better for the larger sample size. The method had comparatively lower power for lognormal distribution while maintaining the highest power for normal distribution across r. The ELiSS B, C and D methods had constant power across most scenarios except for c = 1, 2. For such exceptions, these ELiSS methods had slightly better power for symmetric distributions.

4 Application to rape crime data

We considered the detection of the most likely clusters of high rape rates in India for the year 2011 using the rape data from the 640 districts as per the 2011 census [41]. The district-level rape data were extracted from the National Crime Records Bureau [42], and district-wise female population data for the year 2011 were extracted from the Census of India [41]. The rape crime rate was calculated as the number of rapes per 100,000 of the female population. The objective was to apply the ELiSS A, B, C, D statistics, the MW scan statistic, and the normal scan statistic to the rape crime data to detect the likely high clusters of rape.

The analyses were performed in R software. The circular scanning window was limited to a maximum size of 100 km and to at most five percent of the observations (the unit of observation being the district; districts vary in size and total 640 in number). Statistical significance was based on the P value calculated through 999 Monte Carlo replications, and a P value < 0.05 was considered statistically significant. Among the statistically significant clusters, the most likely (primary) cluster was the one with the maximum likelihood ratio. Secondary clusters detected using the standard version of the normal scan statistic were reported based on the criterion of ‘no geographical overlap’ [43]. The district-level shapefile of India based on the 2011 census was used under the Creative Commons license and was obtained from GitHub [44]. The statistically significant spatial clusters detected using the various scan statistics were mapped using GeoDa 1.20.0.8 software [45]. In addition, one non-significant cluster was mapped.
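The window constraints and the Monte Carlo P value described above can be sketched as follows. This is not the authors' R code. It assumes that the 100 km limit is a radius, that district centroids are available as planar coordinates in kilometres, and that the P value uses the usual rank-based Monte Carlo formula; the function names `candidate_windows` and `monte_carlo_p_value` are hypothetical.

```python
import math


def candidate_windows(coords_km, max_radius_km=100.0, max_fraction=0.05):
    """Return the distinct circular candidate windows as frozensets of unit indices.

    coords_km holds (x, y) centroids in a planar projection, in kilometres
    (an assumption; the text does not say how distances were computed).
    """
    n = len(coords_km)
    max_units = max(1, math.floor(max_fraction * n))
    windows = set()
    for xi, yi in coords_km:
        # Order all units by distance from the current centre.
        by_distance = sorted(
            (math.hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(coords_km)
        )
        members = []
        # Grow the circle one unit at a time (ties in distance are not merged).
        for dist, j in by_distance:
            if dist > max_radius_km or len(members) == max_units:
                break
            members.append(j)
            windows.add(frozenset(members))
    return windows


def monte_carlo_p_value(observed, simulated):
    """Rank-based Monte Carlo P value: (1 + #{simulated >= observed}) / (1 + R)."""
    exceed = sum(1 for s in simulated if s >= observed)
    return (1 + exceed) / (1 + len(simulated))
```

Under these assumptions, 640 districts and a five percent cap give windows of at most 32 districts, and 999 replications give a smallest attainable P value of 1/1000.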

The most likely clusters detected by the various methods were spread across the North Eastern Zone, although their location and extent varied across the methods (Table 1 and Fig. 7). The location of the primary cluster detected by the MW and ELiSS A methods was that of the secondary cluster detected by the remaining methods, and vice versa. The most likely cluster detected by the MW and ELiSS A methods covered 11 districts of Assam and two districts of Meghalaya, with the East Garo Hills and East Khasi Hills districts added only in the case of the ELiSS A statistic. The primary cluster detected by the ELiSS B, C, and D methods covered six districts of Mizoram along with two districts of Tripura and two districts of Assam (minus Karimganj district in the case of the normal method). (See Table 1 in Online Resource 2 for more information.)

Table 1 High rape spatial clusters detected using the various spatial scan statistics for 2011 all India rape data

The number of districts contributing to the potential spatial clusters was largest for the MW method and smallest for the normal method (Table 2). Overall, the ELiSS B and C methods had the most districts in common with every method except ELiSS A, which shared more districts with the MW method. The ELiSS A method had the fewest districts in common with every method except MW, and the normal method had the fewest districts in common with the MW and ELiSS A methods.

Table 2 Proportion of common clusters detected by the various scanning methods for rape in India, 2011

The likely clusters detected by the ELiSS B, C, and D methods were the same in location and extent. These three methods detected more clusters than the normal method and covered more districts even where the location was similar. That is, all the districts included in the clusters detected by the normal method were also covered by the ELiSS B, C, and D methods, although the converse is not true. The additional clusters detected by the ELiSS B, C, and D methods were also covered by the MW method. The MW and ELiSS B methods each had two clusters not detected by the other, and the two methods differed in the extent of a cluster at a similar location. Except for the change in coverage of districts from Meghalaya in the primary cluster and of districts from Arunachal Pradesh and Assam in a secondary cluster, the cluster districts detected by ELiSS A were also detected by the MW method. Except for the first two most likely clusters, the location and extent of the detected clusters differed largely between ELiSS A and the other methods (excluding the MW method).
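A comparison of this kind can be sketched with set operations on the districts each method assigns to clusters. The district labels and the helper `shared_fraction` below are hypothetical, and the choice of denominator (the first method's districts) is an assumption, since the exact definition behind Table 2 is not restated here.

```python
def shared_fraction(a, b):
    """Fraction of the districts in `a` that also appear in `b`."""
    if not a:
        raise ValueError("reference set is empty")
    return len(a & b) / len(a)


# Hypothetical cluster districts per method, for illustration only.
detected = {
    "normal": {"D1", "D2", "D3"},
    "ELiSS B": {"D1", "D2", "D3", "D4", "D5"},
}

# Share of the normal method's districts also found by ELiSS B, and the reverse.
normal_in_eliss = shared_fraction(detected["normal"], detected["ELiSS B"])
eliss_in_normal = shared_fraction(detected["ELiSS B"], detected["normal"])

# True when every district found by the normal method is also found by ELiSS B.
is_nested = detected["normal"] <= detected["ELiSS B"]
```

The two fractions differ whenever one method's districts are a proper subset of the other's, which is the one-directional coverage described above.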

5 Discussion

We have proposed a nonparametric spatial scan statistic, based on an empirical likelihood approach, for continuous data other than survival data. The proposed ELiSS statistic combines the distribution-free and likelihood-based aspects of the existing spatial scan statistics.

All compared methods and their versions consistently showed greater sensitivity than PPV for the smaller cluster, irrespective of the sample size, the difference in location parameters, and the size of the scanning window. While the sensitivity of the MW and ELiSS A methods was greater than their PPV for the bigger cluster as well, the opposite held for the normal and ELiSS C methods. For both the ELiSS B and D methods, sensitivity was slightly lower than PPV for the asymmetric distributions, while the two measures were equal for the symmetric distributions. All methods had 100 percent power for the bigger cluster irrespective of the distribution. Even then, the accuracy measures of the compared methods were not 100 percent, indicating that spatial scan statistics determine only the general location of a localized spatial cluster and not its exact boundaries. The size of the detected cluster affects the accuracy measures of the spatial scan statistics. For the smaller true cluster, the cluster detected by every method was somewhat bigger than the true cluster. For the larger true cluster, the clusters detected by methods other than MW and ELiSS A were somewhat smaller than the true cluster.
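The effect of detected cluster size on the two accuracy measures can be illustrated with a short sketch. It assumes the unit-level definitions commonly used for scan statistics (sensitivity as the share of true-cluster units that are detected, PPV as the share of detected units that belong to the true cluster); the unit indices are hypothetical.

```python
def sensitivity_ppv(detected, true_cluster):
    """Return (sensitivity, PPV) for one detected cluster against the true cluster."""
    if not detected or not true_cluster:
        raise ValueError("both sets must be non-empty")
    hits = len(detected & true_cluster)
    return hits / len(true_cluster), hits / len(detected)


# Detected cluster larger than a small true cluster:
# every true unit is found, but two extra units are included.
small_true = {1, 2, 3, 4}
sens_small, ppv_small = sensitivity_ppv({1, 2, 3, 4, 5, 6}, small_true)

# Detected cluster smaller than a large true cluster:
# two true units are missed, but no extra units are included.
large_true = set(range(1, 11))
sens_large, ppv_large = sensitivity_ppv(set(range(1, 9)), large_true)
```

In the first case sensitivity exceeds PPV, and in the second case PPV exceeds sensitivity, which matches the pattern reported for over-sized and under-sized detected clusters.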

Among the four ELiSS versions, the B and D methods are recommended because they have good and the most stable performance in terms of power and accuracy measures, irrespective of the distribution. While the ELiSS A method had the highest sensitivity, it had considerably lower power and PPV, and its performance varied across distributions. The sensitivity of the ELiSS C method decreased drastically for the larger true cluster, with the reduction being slightly greater for asymmetric distributions. In particular, the ELiSS C method had lower sensitivity for the lognormal distribution than for the others.

In terms of power, the normal scan statistic had considerably lower power than the ELiSS B and D statistics for the lognormal distribution at the smaller differences in means (c = 1, 2). In other instances, the methods had little to no difference in power. The ELiSS B and D statistics had either equivalent or much better power than the MW scan statistic. The normal method had better PPV than the ELiSS B and D methods, with only small differences in the case of asymmetric distributions. The PPV of the normal method was only slightly better for the larger true cluster. The MW method had lower PPV than the ELiSS B and D methods, with the latter performing considerably better for asymmetric distributions. The ELiSS B and D methods had better sensitivity than the normal method, which had considerably lower sensitivity for c = 1 in the case of asymmetric distributions. In sensitivity, the MW method performed slightly better than the ELiSS B and D methods or showed little to no difference. Thus, the B and D versions of the ELiSS statistic are excellent alternatives to both the normal and MW scan statistics for continuous data from skewed or unknown distributions.

The application to the 2011 India rape data shows both similarities and differences among the scan statistics in the number, order, location, and extent of the potential clusters detected. There is also considerable overlap between the clusters detected using the different scan statistics. In the application, the ELiSS B, C, and D methods had more cluster districts in common with the other methods. They also detected a few additional clusters compared with the normal and MW methods. Hence, considering more than one method when analysing real data could help in better understanding the high-risk areas.

6 Conclusion

In this paper, the ELiSS statistic was proposed for detecting purely spatial clusters in univariate continuous data using circular windows. The methods have the advantage of being distribution-free while still allowing the use of likelihood methods, which is beneficial in real-world applications where the data might come from unknown or skewed distributions. The ELiSS B and D methods are recommended for applications as an alternative and/or complementary method to the other spatial scan statistics.

The extension of the method to a space–time setting, detection of spatial clusters after adjusting for covariates, or the use of flexible scanning windows to improve the precision of the detected clusters might be of future interest.

Data availability

The codes for the Mann–Whitney scan statistic were obtained from their author, Dr. Lionel Cucala, on request. The codes for the ELiSS statistic are not publicly available, as they are part of ongoing doctoral research work, but are available from the corresponding author on reasonable request. All data generated or analyzed during this study are either included in this published article (and its supplementary information files) or available in the public domain.

References

Costa, M. A., & Kulldorff, M. (2009). Applications of spatial scan statistics: A review. In J. Glaz, V. Pozdnyakov, & S. Wallenstein (Eds.), Scan statistics (pp. 129–152). UK: Birkhäuser Boston.
Naus, J. I. (1965). Clustering of random points in two dimensions. Biometrika, 52(1/2), 263. https://doi.org/10.2307/2333829
Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics: Theory and Methods, 26(6), 1481–1496. https://doi.org/10.1080/03610929708831995
Abolhassani, A., & Prates, M. O. (2021). An up-to-date review of scan statistics. Statistics Surveys, 15, 111–153. https://doi.org/10.1214/21-SS132
Cucala, L., Genin, M., Lanier, C., & Occelli, F. (2017). A multivariate Gaussian scan statistic for spatial data. Spatial Statistics, 21, 66–74. https://doi.org/10.1016/j.spasta.2017.06.001
Huang, L., Tiwari, R. C., Zou, Z., Kulldorff, M., & Feuer, E. J. (2009). Weighted normal spatial scan statistic for heterogeneous population data. Journal of the American Statistical Association, 104(487), 886–898. https://doi.org/10.1198/jasa.2009.ap07613
Kulldorff, M., Huang, L., & Konty, K. (2009). A scan statistic for continuous data based on the normal probability model. International Journal of Health Geographics, 8(1), 58. https://doi.org/10.1186/1476-072X-8-58
Shen, X., & Jiang, W. (2014). Multivariate normal spatial scan statistic for detecting the most severe cluster of a disease. Journal of Management Analytics, 1(2), 130–145. https://doi.org/10.1080/23270012.2014.915130
Cucala, L. (2014). A distribution-free spatial scan statistic for marked point processes. Spatial Statistics, 10, 117–125. https://doi.org/10.1016/j.spasta.2014.03.004
Jung, I., & Cho, H. J. (2015). A nonparametric spatial scan statistic for continuous data. International Journal of Health Geographics, 14(1), 30. https://doi.org/10.1186/s12942-015-0024-6
Cucala, L. (2016). A Mann-Whitney scan statistic for continuous data. Communications in Statistics: Theory and Methods, 45(2), 321–329. https://doi.org/10.1080/03610926.2013.806667
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2), 237–249. https://doi.org/10.1093/biomet/75.2.237
Owen, A. B. (2001). Empirical likelihood (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781420036152
Lazar, N. A. (2021). A review of empirical likelihood. Annual Review of Statistics and Its Application, 8(1), 329–344. https://doi.org/10.1146/annurev-statistics-040720-024710
Liu, P., & Zhao, Y. (2023). A review of recent advances in empirical likelihood. WIREs Computational Statistics, 15(3), e1599. https://doi.org/10.1002/wics.1599
Chen, S. X., & Van Keilegom, I. (2009). A review on empirical likelihood methods for regression. TEST, 18(3), 415–447. https://doi.org/10.1007/s11749-009-0159-5
Nordman, D. J., & Lahiri, S. N. (2014). A review of empirical likelihood methods for time series. Journal of Statistical Planning and Inference, 155, 1–18. https://doi.org/10.1016/j.jspi.2013.10.001
Nordman, D. J. (2008). A blockwise empirical likelihood for spatial data. Statistica Sinica, 18(3), 1111–1129.
Nordman, D. J. (2008). An empirical likelihood method for spatial regression. Metrika, 68(3), 351–363. https://doi.org/10.1007/s00184-007-0167-y
Van Hala, M., Nordman, D. J., & Zhu, Z. (2015). Empirical likelihood for irregularly located spatial data. Statistica Sinica, 25(4), 1399–1420. https://doi.org/10.5705/ss.2013.385
Nordman, D. J., & Caragea, P. C. (2008). Point and interval estimation of variogram models using spatial empirical likelihood. Journal of the American Statistical Association, 103(481), 350–361. https://doi.org/10.1198/016214507000001391
Kaiser, M. S., & Nordman, D. J. (2012). Blockwise empirical likelihood for spatial Markov model assessment. Statistics and Its Interface, 5(3), 303–318.
Kostov, P. (2013). Empirical likelihood estimation of the spatial quantile regression. Journal of Geographical Systems, 15(1), 51–69. https://doi.org/10.1007/s10109-012-0162-3
Li, Y., & Qin, Y. (2022). Empirical likelihood for spatial dynamic panel data models. Journal of the Korean Statistical Society, 51(2), 500–525. https://doi.org/10.1007/s42952-021-00150-4
Chaudhuri, S., & Ghosh, M. (2011). Empirical likelihood for small area estimation. Biometrika, 98(2), 473–480. https://doi.org/10.1093/biomet/asr004
Porter, A. T., Holan, S. H., & Wikle, C. K. (2014). Bayesian semiparametric hierarchical empirical likelihood spatial models. Journal of Statistical Planning and Inference, 165, 78–90. https://doi.org/10.48550/ARXIV.1405.3880
Porter, A. T., Holan, S. H., & Wikle, C. K. (2015). Multivariate spatial hierarchical Bayesian empirical likelihood methods for small area estimation: Multivariate semiparametric small area estimation. Stat, 4(1), 108–116. https://doi.org/10.1002/sta4.81
Jahan, F., Kennedy, D. W., Duncan, E. W., & Mengersen, K. L. (2022). Evaluation of spatial Bayesian empirical likelihood models in analysis of small area data. PLoS ONE, 17(5), e0268130. https://doi.org/10.1371/journal.pone.0268130
Qin, Y. (2021). Empirical likelihood for spatial autoregressive models with spatial autoregressive disturbances. Sankhya A, 83(1), 1–25. https://doi.org/10.1007/s13171-019-00166-3
Qin, Y. (2021). Empirical likelihood and GMM for spatial models. Communications in Statistics: Theory and Methods, 50(18), 4367–4385. https://doi.org/10.1080/03610926.2020.1716252
de Matos, C. D., do AmorimAmaral, G. J., & de Bastiani, F. (2021). Spatial scan statistics based on empirical likelihood. Communications in Statistics: Simulation and Computation, 52(8), 3897–3911. https://doi.org/10.1080/03610918.2021.1949470
Liu, Y., Zou, C., & Zhang, R. (2008). Empirical likelihood for the two-sample mean problem. Statistics & Probability Letters, 78(5), 548–556. https://doi.org/10.1016/j.spl.2007.09.006
Rao, A. K., & Udupi, A. (2011). On Empirical Likelihood Ratio Test for Equality of Means. InterStat - Statistics on the Internet. https://www.researchgate.net/publication/265518919_On_Empirical_Likelihood_Ratio_Test_for_Equality_of_Means
Wu, C., & Yan, Y. (2012). Empirical likelihood inference for two-sample problems. Statistics and Its Interface, 5(3), 345–354. https://doi.org/10.4310/SII.2012.v5.n3.a7
Qin, J., & Lawless, J. (1994). Empirical likelihood and general estimating equations. The Annals of Statistics. https://doi.org/10.1214/aos/1176325370
Diciccio, T. J., Hall, P., & Romano, J. P. (1989). Comparison of parametric and empirical likelihood functions. Biometrika, 76(3), 465–476. https://doi.org/10.1093/biomet/76.3.465
Imbens, G. W. (1997). One-step estimators for over-identified generalized method of moments models. The Review of Economic Studies, 64(3), 359. https://doi.org/10.2307/2971718
Lehmann, E. L., & Casella, G. (1998). Asymptotic optimality. Theory of point estimation (2nd ed., pp. 429–519). Springer. https://doi.org/10.1007/b98854
Kitamura, Y. (2007). Empirical likelihood methods in econometrics: theory and practice. In R. Blundell, W. Newey, & T. Persson (Eds.), Advances in economics and econometrics (pp. 174–237). Cambridge University Press. https://doi.org/10.1017/CBO9780511607547.008
O’Neill, B. (2014). Some useful moment results in sampling problems. The American Statistician, 68(4), 282–296. https://doi.org/10.1080/00031305.2014.966589
Office of the Registrar General & Census Commissioner. (2022). Census of India 2011. Government of India. Retrieved November 6, 2022, from http://censusindia.gov.in
National Crime Records Bureau. (2022). Crime in India. Retrieved October 21, 2022, from https://ncrb.gov.in/en/crime-india
Kulldorff, M. (2022). SaTScan user guide v10.1. https://www.satscan.org/
GitHub. (2014). India—District map (2011 census). https://github.com/datameet/maps/tree/master/Districts/Census_2011
Anselin, L., Syabri, I., & Kho, Y. (2006). GeoDa: An introduction to spatial data analysis. Geographical Analysis, 38(1), 5–22. https://doi.org/10.1111/j.0016-7363.2005.00671.x

Funding

Open access funding provided by Manipal Academy of Higher Education, Manipal.

Author information

Authors and Affiliations

Department of Data Science, Prasanna School of Public Health, Manipal Academy of Higher Education, Manipal, Karnataka, 576104, India
Maria Mathews & Vasudeva Guddattu
Department of Biostatistics, Dr. M.V. Govindaswamy Centre, National Institute of Mental Health and Neuro Sciences, Bengaluru, Karnataka, 560029, India
V. S. Binu
Department of Statistics, Mangalore University, Mangalagangotri, Karnataka, 574199, India
K. Aruna Rao

Authors

Maria Mathews
Vasudeva Guddattu
V. S. Binu
K. Aruna Rao

Contributions

Miss Maria designed the study, wrote the codes, performed statistical analyses, prepared the manuscript, and approved the final manuscript as submitted. Dr. Vasudeva conceptualized, designed, and supervised the study, reviewed and revised the manuscript, and approved the final manuscript as submitted. Dr. Binu conceptualized and supervised the study, reviewed and revised the manuscript, and approved the final manuscript as submitted. Dr. Aruna Rao supervised the study, reviewed and revised the manuscript, and approved the final manuscript as submitted.

Corresponding author

Correspondence to Vasudeva Guddattu.

Ethics declarations

Conflict of interest

None of the authors have any financial or non-financial interests to disclose that are relevant to this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 42 KB)

Supplementary file2 (DOCX 29 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Mathews, M., Guddattu, V., Binu, V.S. et al. An empirical likelihood approach for detecting spatial clusters of continuous data. Spat. Inf. Res. (2024). https://doi.org/10.1007/s41324-024-00592-y

Received: 12 February 2024
Revised: 14 June 2024
Accepted: 16 June 2024
Published: 27 June 2024
DOI: https://doi.org/10.1007/s41324-024-00592-y
