Introduction

Reference context

The increasing urbanization of recent years is transforming every aspect of urban society and affecting its sustainable development [1,2,3,4]. In fact, as urbanization continues to grow, it brings significant social and economic benefits (e.g., additional urban services and employment opportunities), while also presenting challenges in city management, such as resource planning (water, electricity), traffic, air and water quality, public policy, and public safety services.

Among the main urban issues, criminal activity is one of the most important social problems in metropolitan areas: it can severely affect public safety, harm the economy and the sustainable development of a society, and reduce the quality of life and well-being of citizens. For this reason, improving strategies to effectively manage and utilize limited public security resources has become a crucial issue for policymakers and urban management departments.

At the same time, ICT and sensor infrastructures are enabling public organizations and police departments to gather and store increasing volumes of crime-related data, enriched with spatial and temporal information. This offers the opportunity to apply data analytics methodologies to extract useful knowledge models, which can effectively detect spatial and temporal patterns of crime events. By extracting useful predictive models and applying appropriate methods for data analysis, police departments can better utilize their limited resources and implement more effective strategies for crime prevention.

Motivations and contributions

Several criminal justice studies show that the incidence of criminal events is not uniformly distributed within a city [2, 3, 5, 6]. In fact, crime trends are strongly affected by the geographic location of the area (there are low-risk and high-risk areas), and they can vary with the period of the year (there can be seasonal patterns, peaks, and dips). For this reason, an effective predictive model must be able to automatically determine which city neighborhoods are most affected by crime-related incidents, namely crime hotspots, as well as how the crime rate in each particular hotspot evolves over time. This knowledge can allow police departments to allocate their resources more efficiently over the urban territory, deploying officers to high-risk areas or moving them away from areas expecting a decline in criminal activity, thus more efficiently preventing or more promptly responding to crimes.

In the literature, classic density-based clustering algorithms have been widely exploited to discover spatial hotspots [7,8,9,10,11]. However, due to the adoption of global parameters, they fail to identify multi-density hotspots (i.e., different regions having various densities [12, 13]) unless the clusters (or hotspots) are clearly separated by sparse regions [14]. This is a key issue when analyzing crime data and thus correctly detecting the real crime hotspots: the density of population, traffic, or events in large cities can vary widely from one area to another [5], which also makes the incidence of crime events extremely dissimilar in terms of density.

Such a spatial density variation in crime events challenges the discovery of proper hotspots when classic density-based algorithms perform the analysis. For example, the well-known DBSCAN [14] receives two global input parameters (\(\epsilon\) and \(min\_points\)), which result in a single minimum density threshold \(\delta _{min}\) exploited for clustering the whole dataset. The chosen value of \(\delta _{min}\) determines the densities of the discovered hotspots and cannot accommodate large density variations in urban data. Indeed, if \(\delta _{min}\) is too small, the algorithm can discover several small non-significant hotspots that do not actually represent dense crime regions, while if \(\delta _{min}\) is too large, it can discover a few large regions with high intra-cluster density variations. Thus, classic density-based clustering algorithms fail to identify proper hotspots characterized by different density levels, and their application to crime hotspot discovery can produce inaccurate results, particularly in urban environments. A recent study in Cesario et al. [5] shows that multi-density clustering achieves higher performance than classic approaches for discovering hotspots in multi-density urban environments.
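To make this concrete, the following minimal Python sketch (ours, purely illustrative; the synthetic data and parameter values are made up) shows how DBSCAN's single global \(\epsilon\) forces a trade-off on multi-density data: a small radius shatters the sparse region into noise, while a large radius blurs the dense one.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
dense = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(300, 2))   # high-density region
sparse = rng.normal(loc=(3.0, 3.0), scale=0.60, size=(300, 2))  # low-density region
points = np.vstack([dense, sparse])

for eps in (0.05, 0.30):  # a small and a large global radius
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(points)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```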

This paper presents the design and implementation of MD-CrimePredictor (Multi-Density Crime Predictor), an approach based on multi-density crime hotspots and regressive models to automatically detect high-risk crime areas in urban environments and to reliably forecast crime trends in each area. The algorithm is composed of three main steps. First, multi-density crime hotspots are detected by applying a multi-density clustering algorithm (i.e., CHD, proposed in Cesario et al. [5]), where the densities, shapes, and number of the detected regions are automatically computed by the algorithm without any pre-fixed division into areas. Then, a specific regressive model is discovered from each detected hotspot, analyzing the partitions discovered during the previous step; in this paper, this is done by exploiting both SARIMA [15] and LSTM [16] models, and a comparative experimental analysis is presented in terms of error measures. The final result of the algorithm is a spatio-temporal crime forecasting model, composed of a set of crime hotspots, their densities, and a set of associated crime predictors, each one representing a predictive model to forecast the number of crimes estimated to happen in its specific hotspot. The experimental evaluation of the proposed approach has been performed by analyzing a large area of Chicago, involving more than two million crime events over a period of 19 years. The evaluation, aimed at assessing the effectiveness of the approach over rolling prediction horizons, presents a comparative analysis between SARIMA and LSTM regression models, showing the higher accuracy of the former with respect to the latter. We also provide a comparative assessment of the proposed approach against other studies proposed in the literature, drawing a comparison in terms of hotspot detection and crime forecasting accuracy. Overall, the results show the effectiveness of the approach, achieving good accuracy in spatial and temporal crime forecasting over rolling time horizons.

Plan of the paper

The rest of the paper is organized as follows. Section "Related work" reports the most important approaches proposed in the literature for crime hotspot detection and crime forecasting. Section "Problem definition and proposed approach" outlines the problem statement, describes the approach proposed in the paper, and reports its steps in detail. Section "Experimental evaluation and results" provides the experimental evaluation of the proposed approach on a real-world scenario, showing a comparative analysis between SARIMA and LSTM performances, as well as a comparison between the results achieved by the presented approach and other methodologies proposed in the literature. Finally, Sect. "Conclusion" concludes the paper and outlines future research directions.

Related work

Recently, crime hotspot detection and crime forecasting have emerged as hot topics within the research community. This section briefly reviews the most representative research works in both areas.

Crime forecasting

One of the first frameworks proposed in the literature for crime data analysis is CrimeTracer [17], which is based on a probabilistic approach to model the spatial behavior of known offenders within the areas they frequent, called activity spaces. The framework relies on the assumption, derived from crime pattern theories, that offenders frequently commit serial violent crimes in places they are most familiar with (namely, their activity space). Also, the authors claim that taxi flows can provide useful information to correlate activity spaces, even if they are not geographically connected. Experiments carried out on real-world crime data have shown that criminals frequently commit crimes within their activity spaces, rather than venturing into unknown territories. CrimeTracer is indeed able to predict the location of the next crime committed by known offenders, but it does not provide information about the time window of the next crime events. Also, it requires a dataset with information related to specific offenders, which may not be available in general.

The work in Catlett et al. [7] presents a predictive approach based on spatial analysis and auto-regressive models to detect high-risk regions in urban areas and to forecast crime trends in each region. The approach exploits the DBSCAN algorithm to detect high-risk regions and ARIMA models to fit crime predictors. It has been validated on two crime datasets (i.e., the Chicago and New York City areas) comprising crime events spanning from 2001 to 2016. The study shows good performance on both datasets, considering a three-year-ahead forecasting window, which is a long-term time horizon. The approach is capable of detecting crime-dense regions of any shape; however, its main drawback is that DBSCAN detects overly wide regions or a large number of outliers, as it cannot tackle the multi-density nature of urban datasets.

The study described in Zhu et al. [3] proposes a hierarchical crime prediction framework, which integrates a modified gated GCN (Graph Convolutional Network) and VMD (variational mode decomposition), to holistically predict short-term crime patterns in different communities and support proactive policing. The approach is composed of several steps. First, the temporal dependency is decomposed in the frequency domain, and a network is constructed to capture the spatial relationships within the sub-frequencies. Then, human mobility traces are exploited to characterize the dynamic relationships within the network. The experimental evaluation has been focused on the evolution of crime distribution in Chicago, to holistically predict short-term criminal events in the different communities. The study concludes that social interactions based on human activity data can characterize dynamic crime distribution relationships, as well as spatial crime distribution evolutions. The main strength of the research study proposed in Zhu et al. [3] lies in leveraging the dynamic relationships between human mobility and crimes, which represents a relevant methodological difference with respect to other approaches proposed in the literature; in particular, the analysis of human mobility also allows characterizing the dynamic distribution and evolution of crimes within and across areas, which is strongly affected by social interactions among individuals. However, while the approach exhibits reasonable effectiveness in taking a relationship-based perspective for crime forecasting, the theoretical description needs further verification (as also claimed by the authors): in fact, as human activity data is multi-source, multi-granular, and multi-mode, and involves complex relationships, a more refined classification of human mobility trends is needed to understand their effects on different crime evolutions.

A general framework for crime data mining, exploited for several analysis tasks in collaboration with the Tucson and Phoenix Police departments, is presented in Chen et al. [18]. In particular, the paper describes three examples of its use in practice. First, entity extraction algorithms have been used to automatically identify persons, addresses, vehicles, and personal characteristics from police narrative reports (usually containing many typos, spelling errors, grammatical mistakes, etc.). Second, a text mining algorithm has been explored for deceptive identity detection, to discover the real identity of suspects who have given false names, faked birth dates, or false addresses. Third, a concept-based approach has been exploited to identify subgroups or key members in criminal networks, and to study interaction patterns among them. In our opinion, the main strength of this study is its innovativeness in providing investigators with a framework for automatically applying crime entity-extraction techniques on crime data, aiming to extract serial offenders' behavioral patterns. However, using only crime department data could limit the applicability and effectiveness of the framework; as also observed in Chen et al. [18], additional heterogeneous data (e.g., citizenship, secret services, immigration, web, and social data) could enable the development of more intuitive techniques for crime pattern and network visualization, and higher accuracy in criminal activity predictions.

Authors of Liang et al. [19] propose a framework, named CrimeTensor, to predict the number of crime incidents belonging to different categories within each target region. The framework, based on tensor learning with spatio-temporal consistency techniques, aims to offer fine-scale prediction results by considering the spatio-temporal and categorical correlations of crime events. Crime data is modeled as a tensor, and an objective function leveraging spatial, temporal, and categorical information is defined. The prediction task is performed by applying CANDECOMP/PARAFAC decomposition to find an optimal solution for the defined objective function. The approach is validated by conducting experiments on two real-world crime datasets.

Crime hotspot detection

Concerning hotspot detection, the work in Cesario et al. [25] compares several density-based clustering algorithms on two artificial multi-density datasets (Zahn Compound and Ordered Chess). The results of this experimental evaluation are reported in Tables 2 and 3, where the clustering results are compared through several performance indexes (for each index, the best achieved result is reported in bold). The analysis shows that the HDBSCAN and CHD algorithms are the most effective in detecting clusters in multi-density datasets, and that CHD performs better than HDBSCAN on the second dataset (see Table 3). Other approaches presented in the literature are specifically tailored for clustering spatio-temporal data. The work in Nanni et al. [26] presents the TF-OPTICS algorithm, designed for time-focused clustering. The algorithm processes a set of spatio-temporal objects, each one represented by a trajectory of values as a function of time, and computes distances between trajectories by searching for the best possible time interval. This algorithm, as well as those tailored for clustering trajectories of moving objects, does not suit the proposed use case, because we focus on crime events characterized both in time and space, which cannot be aggregated into a set of well-defined trajectories. A more fitting algorithm for clustering spatio-temporal data is presented in Agrawal et al. [27]. The algorithm, called ST-OPTICS, is density-based and exploits two different \(\epsilon\) parameters, one for clustering points in space and the other for clustering points in time. A comparison between the proposed approach, based on CHD, and an alternative one, based on the ST-OPTICS algorithm, is provided in Sect. "Comparative analysis with ST-OPTICS on hotspots detection and crime forecasting".

Table 2 Performance comparison between different density-based clustering algorithms on the Zahn Compound dataset [25]
Table 3 Performance comparison between different density-based clustering algorithms on the Ordered Chess dataset [25]

Main differences and novelty of MD-CrimePredictor

With respect to the summarized works, this paper presents two main novelties. First, it introduces MD-CrimePredictor, where a multi-density clustering algorithm (i.e., CHD) is exploited for crime hotspot detection; to the best of our knowledge, this is the first research study in the crime data analysis domain showing results on multi-density crime hotspots. The exploited CHD approach is able to automatically detect multi-density (and multi-shape) crime hotspots, which differentiates it w.r.t. all the other approaches reviewed here and brings important benefits to urban data analysis. MD-CrimePredictor relies on both seasonal regressive (SARIMA) and deep-learning (LSTM) models for crime forecasting in each discovered hotspot and, as a second contribution, the paper provides an extensive comparative evaluation of the results produced by the two forecasting algorithms. Also, to assess the effectiveness of the CHD-based approach for hotspot detection, we show a comparative analysis of the proposed approach with other studies proposed in the literature, drawing a comparison in terms of hotspot detection and crime forecasting accuracy.

Problem definition and proposed approach

This section presents the problem formulation and the approach proposed in the paper to forecast crime events in multi-density crime hotspots. Specifically, Sect. "Problem definition and goals" describes the problem under investigation and its goals, whereas Sect. "The multi-crime-predictor approach" details the algorithm proposed in the paper.

Problem definition and goals

We begin by fixing a proper notation to be used throughout the paper. Let \(T=\langle t_1,t_2,\ldots ,t_H\rangle\) be an ordered timestamp list, such that \(t_h<t_{h+1}, \forall _{ 0\le h<H}\), where all \(t_h\) are at equal time intervals (e.g., every hour, day, or week). Let \(\mathcal{C}\mathcal{D}\) be a crime dataset collecting crime events, \(\mathcal{C}\mathcal{D}=\langle CD_1,CD_2,\ldots ,CD_N\rangle\), where each \(CD_i\) is a data instance described by \(\langle latitude,longitude,t\rangle\), i.e., the coordinates of the place and the time (with \(t \in T\)) at which the event occurs. Now, let us consider a future temporal horizon, \(S=\langle t_s, t_{s+1}, \ldots\rangle\), with \(s>H\). The goal of the analysis is to discover a set of crime hotspots in the city (which can have a multi-density distribution of events) and predictive models for reliably forecasting the number of crimes in each hotspot at a given timestamp \(t_s \in S\). More specifically, the proposed approach aims at achieving the following goals (mirrored by the data-model sketch after the list):

  1. Discover a set \(\mathcal{C}\mathcal{H}\) of crime hotspots, \(\mathcal{C}\mathcal{H} = \{CH_1, \ldots , CH_K\}\), where a crime hotspot \(CH_k\) is a spatial area in which criminal events occur with a higher density than in other areas of the city;

  2. Compute a set \(\Sigma\) of crime hotspot densities, \(\Sigma =\{\sigma _1,\sigma _2,\ldots ,\sigma _K\}\), where each \(\sigma _k\) is the spatial density of the events occurring in the hotspot \(CH_k\);

  3. Extract a set \(\mathcal {F}_{crimes}\) of crime predictors, \(\mathcal {F}_{crimes} = \{\mathcal {F}^1_{crimes}, \dots , \mathcal {F}^K_{crimes}\}\), where each function \(\mathcal {F}^k_{crimes}:S\rightarrow \mathcal {R}\), given a timestamp \(t_s \in S\), returns the number of crimes \(N \in \mathcal {R}\) predicted to happen in the crime hotspot \(CH_k \in \mathcal{C}\mathcal{H}\) at timestamp \(t_s\).
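To make the notation concrete, the following minimal Python sketch (ours; all names are hypothetical and purely illustrative) mirrors the three goals as data types:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class CrimeEvent:                 # one record CD_i of the crime dataset CD
    latitude: float
    longitude: float
    t: int                        # index of its timestamp t_h in T

HotspotId = int
Hotspots = Dict[HotspotId, List[CrimeEvent]]   # CH_1, ..., CH_K
Densities = Dict[HotspotId, float]             # sigma_1, ..., sigma_K
CrimePredictor = Callable[[int], float]        # F^k_crimes: timestamp index -> crimes
Predictors = Dict[HotspotId, CrimePredictor]   # F^1_crimes, ..., F^K_crimes
```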

Fig. 1 The multi-crime-predictor algorithm workflow

The multi-crime-predictor approach

The approach proposed in this paper is sketched in Fig. 1, and its meta-code is reported in Algorithm 1. The algorithm is composed of three main steps, as described in the following.

Step 1. Multi-density Crime Hotspots detection. The first step consists in the detection of multi-density crime hotspots from the original dataset, that is, areas where crime events occur with greater density than in adjacent areas. The goal of this step is to detect the spatial areas of interest for crime forecasting, in order to conduct the subsequent analysis over areas rather than single points. This step is performed by the DiscoverCrimeHotspots(\(\mathcal {D}\)) method (line 1 of Algorithm 1), which returns the set \(\mathcal{C}\mathcal{H}=\{CH_1,\ldots ,CH_K\}\) of crime hotspots and their corresponding densities \(\Sigma =\{\sigma _1,\sigma _2,\ldots ,\sigma _K\}\). This task has been modeled as a geo-spatial clustering instance and performed, as described in Sect. "Detection of multi-density crime hotspots", using the City Hotspot Detector (CHD) multi-density clustering algorithm [5]. The number of hotspots is automatically determined by the algorithm, and their shapes are traced without any pre-fixed division into areas. The parameter setting for CHD is chosen by adopting a parameter-sweeping methodology, that is, by running several instances of the CHD algorithm with varying input parameters and choosing the parameter setting that maximizes a set of internal indexes comprising Silhouette [28], DBCV [29], CDBW [30], Calinski-Harabasz [31], and Davies-Bouldin [32], as sketched below.
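A minimal sketch of the parameter sweep (ours; chd() is a hypothetical wrapper around the CHD implementation, and only the Silhouette index is shown among the five adopted):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def sweep_chd_parameters(points, omega_grid, k=64, s=5000):
    """Return the omega value maximizing the Silhouette of the CHD clustering."""
    best_omega, best_score = None, -np.inf
    for omega in omega_grid:
        labels = chd(points, k=k, s=s, omega=omega)   # hypothetical CHD wrapper
        mask = labels != -1                           # exclude noise points
        if len(set(labels[mask])) < 2:                # silhouette needs >= 2 clusters
            continue
        score = silhouette_score(points[mask], labels[mask])
        if score > best_score:
            best_omega, best_score = omega, score
    return best_omega, best_score
```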

Step 2. Crime Time Series Extraction. The second step consists in spatially splitting the original crime data, based on the clustering model discovered in the previous step. In other words, the crime events assigned to the \(i^{th}\) hotspot are transformed into a time series and gathered in the \(i^{th}\) output dataset, for \(i = 1,\ldots,K\). At the end of this step, K different time series datasets are available, each one containing the time series of the crimes occurred in its associated dense region, aggregated on a weekly basis.
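A minimal sketch of this step (ours, assuming the events are held in a pandas DataFrame with a datetime 'timestamp' column and the hotspot labels computed in Step 1):

```python
import pandas as pd

def extract_weekly_series(events: pd.DataFrame) -> dict:
    """One weekly crime-count series per hotspot (noise label -1 excluded)."""
    events = events[events["hotspot"] != -1]
    series = {}
    for hotspot_id, group in events.groupby("hotspot"):
        # count the events of this hotspot falling in each week
        series[hotspot_id] = group.set_index("timestamp").resample("W").size()
    return series
```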

Step 3. Predictive Crime Models extraction. The third step is aimed at extracting a specific crime prediction model \(\mathcal {F}^i_{crimes}\) for each \(i^{th}\) crime hotspot, analyzing the crime data split in the previous step. This task can be accomplished by applying different regression techniques; in our approach it has been implemented by exploiting both SARIMA and LSTM techniques (which proved to be the most effective approaches for this purpose), as described in Sect. "Extraction of crime predictors".

Algorithm 1 MultiCrimePredictor
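Since the meta-code of Algorithm 1 is rendered as an image, the following sketch (ours) reconstructs its structure from the three steps described above; the Python names mirror the methods cited in the text:

```python
def multi_crime_predictor(crime_data):
    # Step 1: multi-density hotspot detection via CHD (line 1 of Algorithm 1)
    hotspots, densities = discover_crime_hotspots(crime_data)
    predictors = {}
    for k, hotspot in hotspots.items():
        # Step 2: weekly crime time series of the k-th hotspot
        ts = extract_weekly_series_for(crime_data, hotspot)
        # Step 3: one regressive model (SARIMA or LSTM) per hotspot
        predictors[k] = discover_local_crime_predictor(ts)  # line 4 of Algorithm 1
    return hotspots, densities, predictors
```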

Detection of multi-density crime hotspots

The detection of crime hotspots has been done by exploiting the CHD algorithm [5], a multi density-based clustering algorithm that has been purposely designed for processing urban spatial data and discover multi-density hotspots. The algorithm is composed of several steps, as reported in Algorithm 2. First, given a fixed k variable, the k-nearest neighbors distance for each point is computed and exploited as an estimator of the density of each data point (line 1). Then, the points are sorted with respect to their estimated density, and the density variation between each consecutive couple of points in the ordered list is computed (line 2). The obtained density variation list can show very frequent fluctuations between subsequent values (in particular, in the analysis of real-wold urban data), thus a moving average filtering over windows of size s is applied to smooth out such fluctuations and highlight main trends (line 3). The data points are then partitioned into several density level sets (each one characterized by homogeneous density distributions), on the basis of the smoothed density variations (line 4). Then, a different \(\epsilon\) value is estimated for each density level set (line 5). Finally, each set is analyzed by the DBSCAN algorithm (lines 7–12). Specifically, each instance takes as input the specific \(\epsilon\) value computed for the analyzed density level set. The set of clusters detected for each partition constitutes the final result of the CHD algorithm. More details about CHD can be found in [5]. Moreover, in Cesario et al. [25] CHD has been proven to be effective in detecting clusters characterized by different densities in urban spatial datasets.

Algorithm 2 The CityHotspotDetector algorithm
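Since Algorithm 2 is also rendered as an image, the following compact sketch (our reconstruction from the description above, not the reference implementation of [5]; the cut rule and the per-level \(\epsilon\) estimate are simplifying assumptions) illustrates the CHD pipeline:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def chd(points, k, s, cut_threshold):
    # line 1: k-nearest-neighbor distance as a per-point density estimate
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(points).kneighbors(points)
    density = dist[:, -1]                          # distance to the k-th neighbor
    # line 2: sort by estimated density, compute consecutive density variations
    order = np.argsort(density)
    variation = np.diff(density[order])
    # line 3: moving-average smoothing over windows of size s
    smoothed = np.convolve(variation, np.ones(s) / s, mode="same")
    # line 4: partition into density level sets where the smoothed variation is large
    cuts = np.where(smoothed > cut_threshold)[0] + 1
    level_sets = np.split(order, cuts)
    # lines 5-12: estimate a per-level eps and run DBSCAN on each level set
    labels = -np.ones(len(points), dtype=int)
    next_label = 0
    for level in level_sets:
        if len(level) <= k:
            continue
        eps = float(np.median(density[level]))     # per-level eps estimate (assumed)
        sub = DBSCAN(eps=eps, min_samples=k).fit_predict(points[level])
        for c in set(sub) - {-1}:
            labels[level[sub == c]] = next_label
            next_label += 1
    return labels
```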

Extraction of crime predictors

Given a specific crime hotspot, the DiscoverLocalCrimePredictor() method (line 4 in Algorithm 1) extracts a regressive model to forecast the number of crimes that will happen in its specific area. In this paper, this has been performed by exploiting SARIMA (Seasonal AutoRegressive Integrated Moving Average) and LSTM (Long Short-Term Memory) models. Such models and their principles are briefly summarized in the following.

SARIMA models

Multiple regression models have been defined with the goal of forecasting a variable of interest using a linear combination of predictors [33]. In particular, in an auto-regression model the variable of interest is forecasted using a linear combination of its past values (the term auto-regression indicates that it is a regression of the variable against itself), while a moving average model uses past forecast errors in a regression-like model. Sometimes, as a preliminary step to the regressive analysis, time series need a differencing transformation to stabilize their mean, thus eliminating (or reducing) trend and seasonality. A combination of differencing, auto-regression, and moving average methods is known as the AutoRegressive Integrated Moving Average model (more frequently referred to by its acronym, ARIMA) [33], formally defined in the following.

Let us consider the time series \(\{y_t: t=1\ldots n\}\), where \(y_t\) is the value of the time series at timestamp t. Then, an ARIMA(p, d, q) model is written in the form

$$y^{(d)}_t = c + \phi _1 y^{(d)}_{t-1} + \ldots + \phi _p y^{(d)}_{t-p} + \theta _1 e_{t-1} + \ldots + \theta _q e_{t-q} + e_t$$

where:

  • \(y^{(d)}_t\) is the \(d^{th}\)-differenced series of \(y_t\), that is: \(y^{(d)}_t=y^{(d-1)}_t-y^{(d-1)}_{t-1},~...~,y^{(d)}_{t-p}=y^{(d-1)}_{t-p}-y^{(d-1)}_{t-p-1}\);

  • \(\phi _1,\ldots ,\phi _p\) are the regression coefficients of the auto-regressive part;

  • \(\theta _1,\ldots ,\theta _q\) are the regression coefficients of the moving average part;

  • \(e_{t-1},\ldots ,e_{t-q}\) are lagged errors;

  • \(e_t\) is white noise and takes into account the forecast error;

  • c is a constant term.

The regression model described above is referred to as ARIMA(p, d, q), where the order of the model is stated by three parameters: p (order of the auto-regressive part), d (degree of differencing involved), and q (order of the moving average part). A useful notation commonly adopted for this kind of model is the 'backshift' notation [34,35,36], which is based on the B operator. Applying B (\(B^d\)) to \(y_t\) has the effect of shifting the data back one period (d periods), i.e., \(B y_t = y_{t-1}\). This is very useful when combining differences, as the operator can be treated using ordinary algebraic rules. By using the backshift operator, the full model can be written as:

$$(1-\phi _1B - \ldots - \phi _pB^p)(1-B)^dy_t = (1 - \theta _1B - \ldots - \theta _qB^q)e_t$$

whose derivation is out of the scope of this work; a formal treatment can be found in [33,34,35].

In order to deal with seasonality, classical ARIMA processes have been generalized and extended to SARIMA (i.e., Seasonal ARIMA) models. A SARIMA model is formed by including additional seasonal terms (modeling a seasonal component that repeats with a given periodicity) in the classic ARIMA models previously introduced. The seasonal part of the model consists of terms that are very similar to the non-seasonal components; in the final formula, the additional seasonal terms are simply multiplied with the non-seasonal terms. A seasonal ARIMA model is referred to as \(SARIMA(p,d,q)(P,D,Q)_m\), where m is the seasonal periodicity.

The SARIMA model can be written as [15]:

$$\phi _p(B)\Phi _P(B^m)\nabla ^d\nabla _m^D y_t = \theta _q(B)\Theta _Q(B^m)e_t$$

where p and q represent the non-seasonal ARIMA orders, P and Q represent the seasonal ARIMA orders, d is the number of time differences, and D is the number of seasonal differences. B is the backshift operator, defined such that \(B^s y_t=y_{t-s}\). \(\phi _p(B) = (1-\phi _1B - \ldots - \phi _pB^p)\) is the AR operator and \(\theta _q(B) = (1 - \theta _1B - \ldots - \theta _qB^q)\) is the MA operator. \(\Phi _P(B^m) = (1-\Phi _1B^m - \ldots - \Phi _PB^{Pm})\) is the seasonal AR operator and \(\Theta _Q(B^m) = (1 - \Theta _1B^m - \ldots - \Theta _QB^{Qm})\) is the seasonal MA operator. \(y_t\), which has both seasonal and non-seasonal components, is differenced d times (at lag one) and D times (at lag m): \(\nabla ^d = (1-B)^d\) is the non-seasonal differencing operator and \(\nabla _m^D = (1 - B^m)^D\) is the seasonal differencing operator. \(e_t\) represents random shocks that are not autocorrelated (white noise).
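For instance, with weekly data and yearly seasonality (\(m=52\)), a \(SARIMA(1,1,1)(0,1,1)_{52}\) model instantiates the formula above as:

$$(1-\phi _1B)(1-B)(1-B^{52})y_t = (1 - \theta _1B)(1 - \Theta _1B^{52})e_t$$

combining one non-seasonal and one seasonal difference with first-order non-seasonal AR and MA terms and a first-order seasonal MA term.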

Once the differencing orders (i.e., the d and D values) have been chosen, the estimation of the best model order and of the regression coefficients is performed by applying the Hyndman-Khandakar algorithm. Briefly, the algorithm performs a step-wise search traversing the model space to discover the optimal combination of p, q, P, and Q values, based on the minimization of the AIC (Akaike's Information Criterion) [33]. Then, the regression parameters of both the non-seasonal part (i.e., \(\phi _1,\ldots ,\phi _p\) and \(\theta _1,\ldots ,\theta _q\)) and the seasonal part (\(\Phi _1,\ldots ,\Phi _P\) and \(\Theta _1,\ldots ,\Theta _Q\)) are estimated through MLE (Maximum Likelihood Estimation) [33], i.e., by maximizing the likelihood of the observed data.
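A minimal sketch of this estimation step, assuming Python and the pmdarima package (whose auto_arima function implements the Hyndman-Khandakar stepwise search), applied to one of the weekly series produced in Step 2:

```python
import pmdarima as pm

def fit_sarima(weekly_counts, m=52):
    """Step-wise, AIC-driven search over (p, q, P, Q); coefficients fitted by MLE."""
    return pm.auto_arima(weekly_counts, seasonal=True, m=m,
                         stepwise=True, information_criterion="aic",
                         suppress_warnings=True)

# usage: four-week-ahead forecasts for one hotspot
# model = fit_sarima(weekly_counts)
# forecasts = model.predict(n_periods=4)
```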

LSTM

The LSTM model is a recurrent neural network designed to overcome the exploding/vanishing gradient problems that typically arise when learning long-term dependencies, even when the minimal time lags are very long [16]. The LSTM architecture consists of a set of recurrently connected sub-networks, known as memory blocks. The idea behind the memory block is to maintain its state over time and regulate the information flow through non-linear gating units [37]. The output of the block is recurrently connected back to the block input and to all of the gates. As shown in Fig. 2, the LSTM has an internal state variable, which is passed from one cell to the next and modified by the following operation gates [37] (transcribed in the numpy sketch after the list):

  • Forget gate: it is a sigmoid layer that takes the output at time \(t-1\) and the current input at time t, concatenates them, and applies a linear transformation followed by a sigmoid:

    $$f^{(t)} = \sigma (W_f [ h^{(t-1)},x^{(t)} ] + b_f)$$
  • Input gate: it takes the previous output and the new input and passes them through another sigmoid layer, so this gate returns a value between 0 and 1:

    $$i^{(t)} = \sigma (W_i [ h^{(t-1)},x^{(t)} ] + b_i)$$

    This value is multiplied with the output of the candidate layer:

    $$\tilde{C}^{(t)} = \tanh (W_c [ h^{(t-1)},x^{(t)} ] + b_c)$$

    The candidate layer applies a hyperbolic tangent, returning a candidate vector to be added to the internal state, which is updated as follows:

    $$C^{(t)} = f^{(t)}C^{(t-1)} + i^{(t)}\tilde{C}^{(t)}$$

    The previous state is multiplied by the forget gate and then added to the fraction of the new candidate allowed by the input gate.

  • Output gate: it controls how much of the internal state is passed to the output, and works in a similar way to the other gates:

    $$o^{(t)} = \sigma (W_o [ h^{(t-1)},x^{(t)} ] + b_o)$$
    $$h^{(t)} = o^{(t)} \tanh (C^{(t)})$$
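The gate equations above can be transcribed directly into code; the following numpy sketch (ours, purely illustrative: the weight matrix and bias are left to the caller) performs one LSTM cell step:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One cell step; W maps the concatenation [h_prev, x_t] to 4*n units."""
    z = np.concatenate([h_prev, x_t])
    n = h_prev.shape[0]
    f = sigmoid(W[:n] @ z + b[:n])                    # forget gate
    i = sigmoid(W[n:2*n] @ z + b[n:2*n])              # input gate
    C_tilde = np.tanh(W[2*n:3*n] @ z + b[2*n:3*n])    # candidate layer
    C = f * C_prev + i * C_tilde                      # internal state update
    o = sigmoid(W[3*n:] @ z + b[3*n:])                # output gate
    h = o * np.tanh(C)
    return h, C
```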
Fig. 2 LSTM architecture

Once the number of layers, the number of nodes/units, and the activation function per layer have been chosen, the estimation of the best model weights is performed by applying the backpropagation algorithm, one of the most popular neural network training algorithms, which computes the necessary corrections to the weights (initially set at random). Briefly, the algorithm can be decomposed into the following steps [38]:

  • Feed-forward computation: given an input for the network, the output is computed by evaluating the network layer by layer, from the input to the output layers.

  • Back propagation: the error (loss) of the output layer is computed by comparing the network output with the reference. Once the layer error has been identified, it is exploited to compute the error of the previous layer, thus propagating it backward. This is repeated for all the layers, back to the input one.

  • Weight updates: once the errors in all the network layers have been computed, the weights are changed in order to reduce the error, by exploiting the gradient descent algorithm.

The algorithm is stopped when the changes in the value of the chosen loss function become lower than a given threshold value.

Experimental evaluation and results

To assess the performance and usefulness of the algorithm described above, we conducted an extensive experimental analysis by running several experiments on a real-world case study represented by a large area of Chicago. Our analysis aims to identify the most significant multi-density crime hotspots and build efficient prediction models that can forecast the number of future crimes likely to occur in each hotspot. We also present a comparative analysis between SARIMA and LSTM forecasting models. The rest of this section is organized as follows. Section "Data description" describes the area selected for the analysis and the gathered data, Sect. "Crime hotspots: results and discussion" reports the results in terms of multi-density crime hotspots, and Sect. "Crime forecasting models: results and discussion" describes the evaluation of the regressive models, i.e., SARIMA and LSTM, comparing the accuracy achieved in predicting crimes in the detected hotspots. Sect. "Comparative analysis with ST-OPTICS on hotspots detection and crime forecasting" furnishes a comparative evaluation of CHD and ST-OPTICS, contrasting the crime prediction accuracy achieved on hotspots based on CHD and on those based on ST-OPTICS. Finally, Sect. "Comparison with other crime forecasting approaches on the Chicago Crimes dataset" reports a comparison of the performance of MD-CrimePredictor with other crime forecasting approaches proposed in the literature [21,22,23].

Crime hotspots: results and discussion

The quality of the detected hotspots has been evaluated through internal indexes, which measure the goodness of a clustering structure without reference to external labels. To do so, the following set of internal indexes is here adopted: Silhouette [28], DBCV [29], CDBW [30], Calinski-Harabasz [31], and Davies-Bouldin [32], which are used in the literature to evaluate clustering quality in terms of compactness, separation, number of clusters, and density when no external information is available [39].

The first set of experimental results is reported in Fig. 5, which shows the performance achieved by the CHD algorithm with \(\omega\) varying from \(-0.3\) to \(-0.25\). In particular, Figure 5a shows how the aforementioned internal indexes, evaluating the clustering quality, vary with respect to the \(\omega\) values. We can observe that the quality of the detected hotspots is very sensitive to \(\omega\), whose best value can in this case be clearly estimated as \(\omega ^* = -0.27\). On the other hand, Figure 5b shows how the number of noise points (blue curve) and the number of detected hotspots (red curve) vary with respect to the \(\omega\) values. Noise points are data instances that do not meet the criteria for falling into any of the detected clusters (and are considered outliers by the algorithm), while the number of detected hotspots depends on the algorithm's ability to find a balanced trade-off between separability and compactness. We can observe that for \(\omega ^*=-0.27\) the number of detected noise points is 18,929, while the number of detected clusters is 200.

Fig. 5 CHD clustering quality, number of hotspots and number of noise points vs \(\omega\), with \(k=64\) and \(s=5000\)

As reported above, we have run several experimental tests to find the parameter setting capable of detecting the highest-quality city hotspots. For such a reason, in the following we present the results achieved by fixing \(\omega = -0.27\), \(k=64\), and \(s=5000\), which the previous analysis assessed as best suiting our application scenario and the considered dataset.

Now, let us analyze in more detail the crime hotspots detected in the considered scenario. As reported in Sect. "The multi-crime-predictor approach", the clustering algorithm exploited in this work first partitions the original data into several density level sets (each one characterized by a homogeneous density distribution, on the basis of density variations), then analyzes each density level set through a specific density-based clustering instance to detect proper clusters in each partition. The final hotspots discovered by the algorithm (200 in total) are shown in Fig. 6, where each region is represented by a different color. Interestingly, this image shows how crime events are clustered on the basis of a density criterion; for example, the algorithm detects several significant crime regions clearly recognizable through different colors: a large crime region (in red) in the central part of the area, along with seven smaller areas (in green, blue, and light-blue) on the left and right sides, corresponding to the zones with the highest concentration of crimes. The five most relevant crime hotspots (\(CH\#197\), \(CH\#198\), \(CH\#8\), \(CH\#21\), and \(CH\#15\)) are zoomed in on the left and right sides of Fig. 6. Many other detected hotspots represent areas with minor crime densities w.r.t. the highlighted ones, or local high-density crime zones surrounded by low-density ones. Table 4 shows several statistics about the whole area and the five most numerous crime hotspots. Overall, these regions cover about 22% of the whole area extension and include about 55% of the crime events recorded in the whole area between 2001 and 2019.

Fig. 6 Detected crime hotspots in the selected area of Chicago, of which the top-5 largest are zoomed in on the left and right

Table 4 Descriptive statistics—whole area and crime hotspots

Finally, in order to make a comparative analysis among classic density-based algorithms and multi-density approaches for hotspot detection, we report a comparative table (Table 5) showing the results of four algorithms (two classic approaches, DBSCAN and OPTICS-\(\xi\), and two multi-density approaches, CHD and HDBSCAN). Table 5 shows, for each algorithm, the selected input parameters and some statistics related to the achieved results (i.e., number of detected hotspots, percentage of noise points, and Silhouette evaluation measure) on the Chicago crime dataset exploited in this paper and described in Sect. "Data description". By observing the results in Table 5, we can see that HDBSCAN and CHD achieve higher clustering quality than DBSCAN and OPTICS-\(\xi\); in fact, HDBSCAN and CHD (multi-density algorithms) attain Silhouette values equal to \(-0.19\) and \(-0.23\), respectively, which are better than the results of DBSCAN and OPTICS-\(\xi\), whose clustering qualities settle at \(-0.28\) and \(-0.46\). Such results show that multi-density clustering (i.e., HDBSCAN and CHD) is able to distinguish and identify proper hotspots in urban environments better than classic density-based techniques. Moreover, focusing on the results of the two multi-density algorithms, we can observe that CHD achieves a slightly lower Silhouette than HDBSCAN, but it labels a much lower percentage of noise points (5.7%) with respect to HDBSCAN (34.6%). For such a reason, CHD proved to be the best algorithm to exploit in our crime data analysis case study. A more detailed comparison among these algorithms is reported in [25].

Table 5 Comparative results achieved by DBSCAN, OPTICS-\(\xi\), CHD and HDBSCAN to detect crime hotspots, on the Chicago crime dataset [25]

Crime forecasting models: results and discussion

As described in Sect. "The multi-crime-predictor approach", the next steps of the algorithm consist of (i) transforming the original crime dataset into several time series, and (ii) training local crime predictors for each crime hotspot. In particular, as described in Sect. "Extraction of crime predictors", the extraction of crime regressors has been performed by applying SARIMA and LSTM models to each hotspot. Specifically, we present here the details of the regressive models obtained by both algorithms for the whole area and the three largest crime hotspots, i.e., CH#197, CH#198, and CH#8. Then, we show the predictive performance of the models on the test set for all hotspots.

The regressive models extracted by SARIMA are reported in Table 6. For each area, the table shows the order of the models, the final autoregressive formulas (in back-shift notation), and the final coefficient values. It is worth noting that the predictive crime models differ among the hotspots, showing that each area presents specific crime trends and patterns, thus making the discovery of different predictive models reasonable.

The models extracted by LSTM are reported in Table 7. For each area, the neural networks are trained with 4 layers, the ReLU [40] activation function, 50 epochs, and a customized batch size and number of units/nodes per layer. In each of the presented models, the mean absolute error (mae) loss function is used. One of the most important factors in neural network training is the learning rate, i.e., the step size at which weights are changed during training, a hyperparameter typically set to a small positive value between 0.0 and 1.0 [41]. A learning rate of 0.01 produced better results in the models reported here than other values. Even in the case of the LSTM models, each hotspot exhibits specific crime trends and patterns.
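For reference, the following Keras sketch (ours) is consistent with the configuration described above; the number of units and the batch size are placeholders, since the paper customizes them per hotspot:

```python
import tensorflow as tf

def build_lstm(n_lags, units=64):
    # 4 layers: two stacked LSTMs, a ReLU dense layer, and a linear output
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(units, return_sequences=True, input_shape=(n_lags, 1)),
        tf.keras.layers.LSTM(units),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(1),                  # next-week crime count
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="mae")
    return model

# usage: model = build_lstm(n_lags=12); model.fit(X, y, epochs=50, batch_size=32)
```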

Table 6 Details of the SARIMA models trained for the whole area and the top 3 largest crime hotspots in Chicago
Table 7 Details of the LSTM models trained for the whole area and the top 3 largest crime-dense regions in Chicago

In order to assess the effectiveness and accuracy of the regressive functions, we performed an evaluation analysis on the test set consisting of the last three years of data (i.e., years 2017–2019). In particular, for each crime hotspot and for the whole area, the associated SARIMA and LSTM models have been exploited to predict the number of crimes likely to happen in that hotspot, week by week. Figures 7 and 8 show the observed, SARIMA-forecasted, and LSTM-forecasted data (plotted in blue, orange, and green, respectively) for the whole area (Fig. 7) and for the largest crime hotspot, CH#197 (Fig. 8). We consider four prediction horizons on the test set, from one- to four-week-ahead. We note that forecasts generally adhere very well to the observed data over the whole test period. However, the forecasting accuracy clearly decreases (particularly for LSTM) as the prediction horizon increases.

Fig. 7 Observed vs forecasted crimes, on the whole area. Number of crimes observed, SARIMA-forecasted and LSTM-forecasted (blue, orange and green lines) on the Chicago test set, for the whole area and several prediction horizons

Fig. 8 Observed vs forecasted crimes, on the largest hotspot. Number of crimes observed, SARIMA-forecasted and LSTM-forecasted (blue, orange and green lines) on the Chicago test set, for hotspot 197 and several prediction horizons

Now, let us give a quantitative evaluation of the performance of the regressive models and their effectiveness in making predictions on the corresponding test sets. To this end, we computed six error measures (MAE, MAPE, MSE, RMSE, Max Error, Mean Error), which are commonly used in the regression analysis literature to quantify forecast performance [12].
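For clarity, the six measures can be computed as follows (our transcription of the standard definitions; y holds the observed weekly counts and f the forecasts, as numpy arrays):

```python
import numpy as np

def forecast_errors(y, f):
    e = y - f
    return {
        "MAE":       float(np.mean(np.abs(e))),
        "MAPE":      float(np.mean(np.abs(e / y))) * 100.0,  # assumes y != 0
        "MSE":       float(np.mean(e ** 2)),
        "RMSE":      float(np.sqrt(np.mean(e ** 2))),
        "MaxError":  float(np.max(np.abs(e))),
        "MeanError": float(np.mean(e)),                      # signed bias
    }
```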

Table 8 reports the values of the error measures described above achieved by SARIMA and LSTM models for the whole area and the three largest detected crime hotspots. Looking at the values reported in the table, we can make the following observations.

Table 8 MAE, MAPE, MSE, RMSE, Max Error and Mean Error vs several weekly prediction horizons, for the whole area and the top three largest crime hotspots in Chicago City
Fig. 9 MAE for each hotspot. Mean Absolute Error (MAE) for the whole area and the top 5 largest crime hotspots, achieved by SARIMA and LSTM

The smaller the hotspot, the lower the MAE. Looking at the values in the table, we can observe that MAE values decrease as the hotspot areas get smaller. In fact, considering one-week-ahead forecasting, the MAE achieved by the SARIMA models monotonically decreases from 77.44 (whole area) to 24.42, 21.09, and 12.59 (three largest crime hotspots, ordered by decreasing size), and similarly for all other forecasting horizons. LSTM forecasts show decreasing MAE values as well. The trend is clearly recognizable in Fig. 9, which plots the MAE achieved by both SARIMA and LSTM for the whole area and the top five largest crime hotspots. The chart clearly shows that the smaller the hotspot, the lower the error. This is a reasonable outcome: predictions are more precise over smaller hotspot areas, thus providing city administrators and police officers with more detailed information for deciding how to distribute resources and efforts among the various parts of the city.

Higher forecasting accuracy when the forecasting horizon is shorter. For example, the MAE of the LSTM forecasts for the whole area monotonically increases from 91.06 (one-week-ahead forecasts) to 97.86, 113.70, and 140.41 (two-, three-, and four-week-ahead forecasts), and similarly for all other indices and areas. This is a reasonable result, considering that forecasts are based on previous historical trends: the farther the forecasting timestamp is from the most recent historical data, the less accurate the forecast. The increasing trend can also be seen in Fig. 10, which shows the MAE versus several weekly forecasting horizons. The increasing trend is more evident for the whole area and the largest cluster, and it is particularly marked for the LSTM-based forecasts.

Fig. 10 MAE vs n. of weeks. Mean Absolute Error (MAE) versus the number of weeks in the test set, achieved by SARIMA and LSTM, for the whole area and the top 3 largest crime hotspots

Fig. 11 MAPE vs n. of weeks. Mean Absolute Percentage Error (MAPE) versus the number of weeks in the test set, achieved by SARIMA and LSTM, for the whole area and the top 3 largest crime hotspots

Fig. 12 Distribution of the residuals (with the overlaid normal curve) on the test set, for the top 2 largest crime hotspots, for one-week-ahead forecasting

Fig. 13 QQ-plot for the top 2 largest crime hotspots

SARIMA models outperform LSTM models (for large hotspots). Percentage errors (MAPE column) show that the adopted SARIMA models (Table 6) forecast the number of crimes with an average error ranging from 5.09% (whole area, one-week ahead) to 13.37% (crime hotspot #8, four-week ahead), which appears to be a very interesting result. On the other hand, LSTM models achieve MAPE values ranging from 5.93% to 12.81%, respectively. For a more complete view of these results, Fig. 11 shows the MAPE versus several weekly forecasting horizons. From the plot, we can observe that the percentage errors of both SARIMA and LSTM models increase as the prediction horizon grows, and that SARIMA models generally outperform LSTM regressors (except for the smallest hotspot). Also, by observing the values in Table 8 and Fig. 11, we can observe that the smaller the hotspot area, the higher the percentage error. However, the MAPE index, as defined above, does not take into account the coverage level of each hotspot: the growth in forecasting errors is compensated by a more precise identification of the specific area where crime events will occur, thus giving more exhaustive information to city administrators and police officers for planning how to distribute resources and efforts in the different regions of the city.

Finally, to understand whether the forecast errors can be approximated by a normal distribution with mean zero and variance \(\sigma ^2\), we show in Fig. 12 the distribution of the residuals of the SARIMA models (with the normal curve having the same mean and standard deviation overlaid) for the two largest detected crime hotspots. In particular, the figure presents the histograms of the forecast errors over one-week-ahead forecasts, which show that the distributions of forecast errors are only slightly shifted towards positive or negative values with respect to a normal curve centered on 0 (the ideal case). This is also confirmed by the Normal QQ-plot (quantile-quantile plot) shown in Fig. 13, which can be exploited as a graphical tool to assess whether residuals plausibly follow a normal distribution. Both plots graphically confirm that the residuals approximately follow a normal distribution, as expected.
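The diagnostics shown in Figs. 12 and 13 can be reproduced with a few lines; a sketch of ours, based on scipy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_residual_diagnostics(residuals):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    # histogram of the forecast errors with the matching normal curve overlaid
    ax1.hist(residuals, bins=30, density=True, alpha=0.6)
    x = np.linspace(residuals.min(), residuals.max(), 200)
    ax1.plot(x, stats.norm.pdf(x, residuals.mean(), residuals.std()))
    ax1.set_title("Residuals vs normal curve")
    # normal QQ-plot of the residuals
    stats.probplot(residuals, dist="norm", plot=ax2)
    plt.show()
```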

Comparative analysis with ST-OPTICS on hotspots detection and crime forecasting

To make our evaluation more accurate and complete, we performed a comparative analysis of the proposed approach, based on CHD for hotspot detection, with a similar approach based on ST-OPTICS [27], a density-based clustering algorithm specifically designed to analyze spatio-temporal data. ST-OPTICS was selected among others since it was purposely designed for clustering datasets characterized by time-based features, and thus is not directly comparable with the other (purely spatial) clustering algorithms previously mentioned (see Table 5). In a nutshell, ST-OPTICS is a modified version of the OPTICS algorithm, obtained by extending the notion of density-reachability. It exploits two radii, \(\epsilon _1\) and \(\epsilon _2\), where \(\epsilon _1\) defines the reachability with respect to the spatial attributes and \(\epsilon _2\) defines the reachability w.r.t. the non-spatial (temporal) attributes; on the basis of such definitions, a point \(p_i\) is considered in the neighborhood of \(p_j\) if the distance between \(p_i\) and \(p_j\) is less than \(\epsilon _1\) w.r.t. the spatial attributes and less than \(\epsilon _2\) w.r.t. the non-spatial attributes. The ST-OPTICS implementation we exploited is publicly available and takes as input parameters \(\langle \epsilon _2, min\_pts,\xi \rangle\), where \(\epsilon _2\) is a threshold on the maximum radius w.r.t. the non-spatial attributes, \(min\_pts\) is the minimum number of neighbors required to define a core point, and \(\xi\) determines the minimum steepness on the reachability plot that constitutes a cluster boundary. The reachability plot takes into account both the spatial and non-spatial radii. It is also worth noting that \(min\_pts\) and \(\xi\) are exploited as in the well-known OPTICS-\(\xi\) algorithm.
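As an illustration, the two-radius neighborhood test can be transcribed as follows (our sketch of the definition above, not the cited implementation):

```python
import numpy as np

def st_neighbors(points, times, i, eps1, eps2):
    """Indices j such that p_j is in the spatio-temporal neighborhood of p_i."""
    spatial_dist = np.linalg.norm(points - points[i], axis=1)   # spatial attributes
    temporal_dist = np.abs(times - times[i])                    # non-spatial attribute
    mask = (spatial_dist < eps1) & (temporal_dist < eps2)
    mask[i] = False                                             # exclude p_i itself
    return np.where(mask)[0]
```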

To perform the comparative analysis between the results achieved by ST-OPTICS and CHD, we first evaluated the characteristics of the five most relevant hotspots detected by the two algorithms, and then the forecasting performance achieved for crime prediction in each hotspot. The dataset exploited for the comparative analysis is the one described in Sect. "Data description", and predictions have been compared across different forecasting horizons.

As a first result, ST-OPTICS has been applied to discover spatial hotspots from the geo-referenced crime data. In order to detect high-quality crime-dense regions, input parameter tuning has been performed to achieve the best results of the algorithm. In particular, the clustering quality has been evaluated by computing the internal indexes (Silhouette, DBCV, CDBW, Calinski-Harabasz, Davies-Bouldin) adopted in Sect. "Crime hotspots: results and discussion", by varying \(\xi\) from 0.05 to 0.1 and \(\epsilon _2\) from 4 to 24 (with a step size of 4). The results are reported in Figure 14a, which shows the performance achieved by varying \(\xi\), with \(\epsilon _2=24\) and \(k=64\) fixed (the combination corresponding to the best performance in the considered scenario). In particular, the figure shows that the best quality of detected hotspots is achieved for \(\xi ^* = 0.07\). Comparing these results with those reported in Sect. "Crime hotspots: results and discussion", we notice that CHD performs better than ST-OPTICS on the Silhouette, Calinski-Harabasz, and Davies-Bouldin indexes, while ST-OPTICS is better on the DBCV index. On the other hand, Figure 14b shows how the number of noise points (blue curve) and the number of detected hotspots (red curve) vary with respect to the \(\xi\) values. We can observe that for \(\xi ^*=0.07\) the number of detected noise points is 23,947, while the number of detected clusters is 49. With respect to CHD, ST-OPTICS detects a higher number of noise points (23,947 versus 18,929) and a lower number of hotspots (49 versus 200). The results shown below only refer to the run with the best combination of parameters (i.e., \(\xi =0.07\), \(\epsilon _2=24\), \(k=64\)).

Fig. 14 Hotspots detection: ST-OPTICS clustering quality, number of hotspots and number of noise points vs \(\xi\), with \(k=64\) and \(\epsilon _2 = 24\)

Table 9 Crime forecasting: MAE, MAPE, MSE and RMSE for the top five most numerous crime hotspots in Chicago, detected by ST-OPTICS and CHD

The comparative forecasting performance analysis on the hotspots detected by ST-OPTICS and CHD has been carried out by focusing on the five most numerous clusters returned by the two algorithms. In particular, as SARIMA models have shown higher predictive accuracy in Sect. "Experimental evaluation and results", we exploit these regressive models here to compare the achieved results. Table 9 reports the values of four error measures (MAPE, MAE, MSE, RMSE) achieved by SARIMA models on the five largest hotspots detected by ST-OPTICS and CHD (sorted by decreasing size), versus one-, two-, three-, and four-week-ahead forecasting horizons. Looking at the values reported in the table, we can observe that the two largest clusters detected by ST-OPTICS (clusters #0 and #4) and CHD (clusters #197 and #198) are very different in terms of number of points, while the other ones have comparable sizes. Also, by comparing MAPE, MAE, MSE, and RMSE, we can observe that forecasts generally achieve lower errors on the hotspots detected by CHD than on those detected by ST-OPTICS. This result, partly due to the smaller size of the clusters, indicates higher forecasting accuracy on the hotspots detected by CHD. For a more complete view of the MAPE results, Fig. 15 shows the MAPE versus several weekly forecasting horizons. From the plot, we can observe that percentage errors are lower on CHD-detected hotspots than on ST-OPTICS-detected ones (except for the largest cluster).

Fig. 15 MAPE achieved by the SARIMA models, for the top 5 most numerous clusters detected by ST-OPTICS and CHD

Comparison with other crime forecasting approaches on the Chicago Crimes dataset

With the aim of making the comparative analysis for crime forecasting more accurate and complete, we report here some comparative results between MD-CrimePredictor and other approaches selected from the crime forecasting literature (i.e., [21,22,23]). Specifically, to ensure a fair and consistent comparison, we selected four algorithms that have been specifically applied to the Chicago crime data, i.e., the same dataset we exploited to evaluate MD-CrimePredictor. The approaches have been compared in terms of MAPE, a scale-independent metric (making it suitable for comparisons between different datasets or models) largely used in crime forecasting performance evaluation [1]. Table 10 summarizes the results of the comparison, showing for each approach (i) the exploited models, (ii) the period of the Chicago crimes dataset exploited as training set, (iii) the period exploited as test set, (iv) the total number of forecasted days, and (v) the related MAPE index for one-day-ahead forecasts, as reported in the corresponding references [21,22,23] (reviewed in Sect. "Related work"). By observing the table, it is worth noting that MD-CrimePredictor has been tested on the longest time horizon (365 days), while the other approaches have been tested on time horizons no longer than six months (184 days for the approaches proposed in [23]). Moreover, MD-CrimePredictor outperforms the other methodologies w.r.t. the MAPE index (0.12), resulting slightly more effective than the second best result reported in the table (0.14). The comparison confirms the effectiveness of the presented approach, even when considering short (one-day-ahead) time windows.

Table 10 Comparative results on crime forecasting with other approaches proposed in literature on the Chicago crimes dataset, for one day-ahead forecasts

Conclusion

This paper presented the design and implementation of MD-CrimePredictor (Multi-Density Crime Predictor), an approach based on multi-density clustering and regressive models to automatically detect high-risk crime areas in urban environments and to reliably forecast crime trends in each area. First, the algorithm detects multi-density crime hotspots by applying a multi-density clustering algorithm, where the densities, shapes, and number of the detected regions are automatically computed without any pre-fixed division into areas. Then, a specific regressive model is discovered from each detected hotspot, analyzing the partitions discovered during the previous step. The final result is a spatio-temporal crime forecasting model composed of a set of crime hotspots, their densities, and a set of associated crime predictors. Forecasting models are extracted by exploiting both SARIMA and LSTM models, and a comparative experimental analysis is presented in terms of error measures. The experimental evaluation of the proposed approach, performed on a large area of Chicago (involving more than two million crime events), has shown the higher accuracy of the former with respect to the latter. We have also provided a comparative evaluation of CHD against ST-OPTICS, comparing the crime prediction accuracy achieved on the hotspots identified by the two algorithms, as well as a comparative analysis with other crime forecasting methods proposed in the literature and specifically tested on Chicago crime data. Overall, the results show the effectiveness of the approach proposed in the paper, achieving good accuracy in spatial and temporal crime forecasting over rolling time horizons.

In future work, other research issues may be investigated. First, we will further explore the application of other multi-density approaches for the detection of crime hotspots, with the aim of performing a comparative evaluation between different clustering algorithms (multi-density vs classic density-based approaches) in crime spatial analysis. Second, we will study how other urban events can affect crime trends, and how such data can be correlated to criminal activities.