1 Introduction

Road crashes are a significant cause of death, grief and loss worldwide and require thorough investigation and concerted efforts. According to data from the World Health Organization (WHO), road crashes cause over one million deaths annually, placing them among the top ten causes of death in all age groups. Greater compliance with traffic signals and the respect of driving rules is likely to reduce the number of accidents but exploring ways to improve road safety could also be beneficial.

Understanding the relationship between accident risk and road infrastructure has benefits for a range of stakeholders, including urban planners, policymakers, and transportation authorities. However, it also directly affects citizens’ experiences when using public spaces. A deeper understanding of how road infrastructure affects accident probabilities enables individuals to make informed decisions about routes, modes of transportation, and overall risk exposure. This knowledge allows individuals to proactively seek safer routes and exercise greater caution in high-risk areas, thereby enhancing personal safety and well-being.

To help policymakers and administration to address their decision to increase road safety, a wide literature has been published to highlight the weakest features of roads or network of roads. From an engineering point of view, pioneering researches by Davies (1944) and Smeed (1949) delved into optimal road composition to mitigate accidents, while recent studies like Christensen et al. (2022) propose the use of Energy Absorbing Structures for crash mitigation. In Greibe (2003) and Mackay (1994) the challenges in preventing road accidents, as road and vehicle improvements, are discussed. However, road network structures often carry implicit constraints within their spatial domains. Tang et al. (2023) focus on urban traffic accidents in China, scrutinising both moving-vehicle and fixed-object crashes using multi-scale geographically weighted regression. The research unveils similar clustering patterns for both crash types, uncovering overlap** accident-prone areas. They focus not only on technical aspects but also on selecting road characteristics and their contextual factors influencing accident occurrences.

In this framework, spatial road safety analysis has been developed in order to examine the geographical distribution and patterns of road accidents, enabling targeted interventions and precise adjustments to road infrastructure aimed at reducing risks and enhancing overall safety within specific areas (see Ziakopoulos and Yannis (2020) for a review of spatial approaches). Specifically, models operating on road networks have been extensively used in recent years for analysing streets at a very detailed level. In particular, point processes have been employed to assess the occurrence of events (such as accidents), by considering spatial and topological road features (see, e.g., Baddeley et al. (2021) for a review on these topics). Yet, due to constraints in available datasets, count models often stand as a pragmatic adaptation of point processes within this context (see, e.g., Tang et al., 2023; McSwiggan, 2019; Baddeley et al., 2020).

Within the domain of count models, various approaches exist in the literature, each offering unique insights. In our proposal, we present a merge of two components: a modification of the conditional autoregressive modelling (see Boulieri et al., 2016; Gilardi et al., 2022) incorporating spatial lagged effects, to estimate efficiently the risk of accidents at the road level, and a two-stage mixed geographically weighted Poisson regression (see Murakami et al., 2023; Briz-Redón et al., 2019; Gomes et al., 2017) to unveil local heterogeneity. Two basic versions (semi-parametric geographically weighted Poisson regression and conditional autoregressive prior) of these models have been indeed compared in Xu and Huang (2015), highlighting advantages and drawbacks. We show in this paper how a fusion between them helps to capture both network-wide trends and local nuances, providing a comprehensive understanding of accident risk factors. This combination is relevant because it combines the strengths of both approaches: the modified conditional autoregressive modelling captures trends in accident risk across the whole area, while the mixed geographically weighted Poisson regression reveals specific behaviour at the road level. By integrating these two methodologies, our approach offers a more comprehensive understanding of accident risk factors, ensuring that both network-wide patterns and local aspects are adequately addressed.

To demonstrate the effectiveness of our approach, we applied it to the province of Milan in Italy. We obtained road features from the OpenStreetMap contributors (2017) database and accident data from the ISTAT - Italian National Institute of Statistics (2021) for the period between 2016 and 2020. By using both models, we identified the key street characteristics that influence accident risk and revealed the distribution of relevant covariates at a road level. This offers valuable insights for informed decision-making in road safety measures.

The paper is structured as follows: Sect. 2 focuses on models specifically designed for spatial data. In particular, Sect. 2.1 provides a rationale for employing count models in our data analysis. Section 2.2 underscores the significance of incorporating spatial dependence in accident risk assessment models and outlines the general framework. Our approach is summarised in Sect. 3. Additionally, Sect. 3.1 details the application of proper conditional autoregressive models with spatial lags, while Sect. 3.2 introduces the two-stage mixed Geographically Weighted Poisson regression method. The description of the dataset and its structure are highlighted in Sect. 4. The subsequent Sect. 5, presents the numerical analysis and it is dedicated to summarise our findings. The discussion focuses on the results derived from the conditional autoregressive models (Sect. 5.1) and the geographically weighted Poisson regression (Sect. 5.2). Conclusions are drawn in the final section.

2 Models for spatial data for road safety: why do we use count model

Due to the absence of modern technology and appropriate algorithms for analysing street-level crash events, the initial papers on spatial road safety analysis were developed using the areal approach (Miaou et al. 2003; Aguero-Valverde and Jovanis 2006; Boulieri et al. 2016). As deeply reported in McSwiggan (2019) problems with aggregation of data are related to the so called “ ecological fallacy”. Freedman (1999) states: “ the ecological fallacy consists in thinking that relationships observed for groups necessarily hold for individuals”(this issue is also known as the modifiable areal unit problem (MAUP), see Gilardi et al., 2022). In our context, groups are the areal clusters used for the analysis and individuals are the accidents in the road segment. For example, we may state that speed is related to the crash intensity in a region but roads have different speed limits and some of them (or many of them) are pedestrian zones. Also thanks to the increasing solutions of computational burdens in the analysis of network of lines, the number of papers estimating the risk at the road level is increasing. Some recent applications are in Borgoni et al. (2021) and Gilardi et al. (2022). In this section, we motivate the use of count models, then we focus on count models with spatial dependence and finally we introduce proper conditional autoregressive models. We highlight some limitations of the existing models in our context in order to justify the approach proposed in Sect. 3.

2.1 Spatial point patterns vs count models

Models based on a network of lines have recently gained popularity due to the availability of open-source spatial databases (e.g. OpenStreetMap contributors, 2017) for analysing street networks at a very detailed level (see Barua et al., 2014; Barrington-Leigh & Millard-Ball, 2017; Mooney & Minghini, 2017; Briz-Redon et al., 2019, Gilardi et al., 2022). In this context, many approaches have employed spatial smoothing to simplify the estimation process. However, in this paper, we adopt count models on a network of lines because they offer high flexibility, allowing us to investigate crash intensity at the street segment level. This approach can provide more informative insights for social and policy monitoring. Alternative strategies and interpretations of road networks are extensively described in Lord and Mannering (2010), Savolainen and Mannering (2011), and Ziakopoulos and Yannis (2020).

To explain the reasons for using count models, we revisit the fundamental principles of a linear network (see Baddeley & Nair, 2012). It is defined as the union \(L = \bigcup _{i=1}^m s_i\) of a finite number m of line segments \(s_1, \dots , s_m\) in the plane, where \(s_i = [\zeta _i, \nu _i] = \{w: w = t \cdot \zeta _i + (1 - t) \cdot \nu _i, 0 \le t \le 1\}\) is the line segment with endpoints \(\zeta _i, \nu _i\), belonging to the two-dimensional space. Crashes located on a network correspond to a point pattern \(\textbf{x}\) on L. A point pattern is a finite set \(\textbf{x} = \{x_1, \dots , x_n\}\) of distinct points \(x_i \in L\), where \(n \ge 0\). For any set \(B \subset L\), let \(N_x(B) = N(\textbf{x} \cap B)\) be the number of points of \(\textbf{x}\) lying in B (Rakshit et al., 2021).

To assess the impact of covariates on occurrences, many models employed in spatial point processes use a density of the form \(\frac{g(u, \varvec{\theta })}{G(\varvec{\theta })}\) where g represents an explicitly defined function expressed in terms of interaction functions with the data, \(\varvec{\theta }\) is a vector of parameters, and \(G(\varvec{\theta })\) is a normalizing constant that cannot be explicitly evaluated. Since \(G(\varvec{\theta })\) is unknown, conducting standard likelihood estimation becomes challenging (Jensen & Møller, 1991).

In modelling car crashes, let us assume that point locations are distinct and possess an intensity function \(\lambda (u), u \in L\), enabling the computation of the average accidents in L, defined as it follows:

$$\begin{aligned} \mathbb {E}[N_X(B)] = \Lambda (B) = \int _B \lambda (u) \, \textrm{d}_1 u, \end{aligned}$$
(1)

where \(\textrm{d}_1 u\) denotes integration with respect to arc length (Ang et al., 2012; Baddeley et al., 2017; Rakshit et al., 2021). In the case of non-homogeneous Poisson point process, which is the most commonly used process for modelling crashes on a linear network, a significant implication is that the points in \(\textbf{x} \cap B\) are assumed to be independent and identically distributed (i.i.d.) with a probability density function \(f(u) = \frac{\lambda (u)}{\Lambda (B)}\). Consequently, the log-likelihood can be expressed as:

$$\begin{aligned} l = \sum _{i=1}^n \log \lambda (x_i) - \int _L \lambda (u) \, \textrm{d}_1 u. \end{aligned}$$
(2)

This expression is numerically intractable and necessitates approximation, which can be achieved using the Berman–Turner device specifically designed for networks (Berman & Turner, 1993; McSwiggan, 2019).

A crucial technical consideration in this context is that a spatial covariate V on L is assumed to be a real- or vector-valued function \(V(u), u \in L\). It is furthermore assumed that the values V(u) are fixed and known (in principle) for all locations \(u \in L\), in order to model, within a Generalised Linear Model (GLM) framework, links of the form \(\lambda (u) = \exp (\varvec{\beta }^T V(u))\), where \(\varvec{\beta }\) represents the standard vector of parameters. In practical scenarios, these values may only be available at specified sample locations. and it is relevant to possess information about V(u) at locations u beyond those confined to the point pattern. An alternative to GLM is the estimation of spatially-varying event densities using kernel density estimators. The goal of density estimation is to statistically infer the spatially-varying density from an observed point pattern \(\textbf{x}\) with only minimal assumptions about the underlying point process. However, it is important to note that this topic falls outside the scope of this paper. We simply mention that kernel estimation on a linear network is not a standard application of kernels and can be computationally complex; various different techniques have been proposed (see Baddeley et al., 2021 for a review on this topic).

In general, the framework depicted above suggests that, in principle, any point process model applied to a linear network should not assume that a covariate remains constant along any edge of the network. This is achievable, for instance, when information about the distance of a crash from an intersection can be accurately computed or when the variational risk (such as the length of road visible to a driver at their location) along a curve can be determined. However, it is worth noting that in cases where extremely high map resolution and precise location data are not always available, some authors Briz-Redon et al. (2019) propose a workaround by capturing the differential risk, for example, between road locations around intersections and road segments. This is done by dividing the original network structure into shorter road segments in the proximity of each road intersection for a more refined analysis.

Similar to many other applications described in the literature, the dataset available to us does not allow the fit of a point process. This is primarily due to the fact that the location of accidents often lacks the corresponding street number or the exact kilometer of the event occurrence, especially in cases of accidents along highways. Consequently, we will investigate count models with spatial components.

If we assume that the linear network is partitioned into J disjoint subsets \(l_1, \dots , l_J\) (for instance, the edges of the network) and if the spatial covariate functions are assumed to be invariant on each subset, then

$$\begin{aligned} \textbf{V}(u) = \textbf{v}_j \text { for } u \in l_j \end{aligned}$$
(3)

where \(\textbf{v}_j = (v_{j1}, \dots , v_{jp})\) is the vector of p features of \(l_j\). This allows us to aggregate the point processes over these subsets, resulting in observable random variables \(N_j = N(\textbf{x} \cap l_j)\), representing the counts of points falling into each subset, with \(\mu _j = \mathbb {E}[N_j]\) that denotes the expected number of counts in \(l_j\). These assumptions hold significance in our context, aligning with the customary practice in most of road accident research where the subsets \(l_j\) correspond to the original segments which defined the network, and the available covariates remain constant along each segment.

In this scenario, the non-homogeneous Poisson process with a log-linear intensity exhibits a constant intensity within each subset. This implies that \(N_j\) can be used to fit a Poisson log-linear regression with

$$\begin{aligned} N_j \sim \text {Pois}(\mu _j) \text { } \forall j \text { i.i.d. r.v. with } \mu _j = \int _{l_j} \lambda (u) \, \textrm{d}_1 u. \end{aligned}$$
(4)

Consequently, the log-likelihood of a point process corresponds to a Poisson count model

$$\begin{aligned} l = \sum _{j} \left( n_j \log \mu _j - \mu _j \right) \end{aligned}$$
(5)

2.2 Spatial dependence

In order to model spatial data (see, for example, Glaser, 2017; Gschlossl & Czado, 2007), we must introduce the concept of spatial dependence, which necessitates the definition of a proximity matrix. Let \(\textbf{Y} = (Y_1, Y_2, \dots , Y_n)^T\) be the vector of random variables observed in n different regions (in our case, edges). Consider the \(n \times n\) proximity matrix or neighbourhood matrix, \(\textbf{W}\). In an area partitioned into n subareas, the element \(w_{ij}\) may represent the weight, the connectivity degree, or the spatial proximity intensity between \(l_{i}\) and \(l_{j}\), with \(w_{i,i} = 0 \text { }\forall i\in \left\{ 1,...,n\right\} \). The other elements can be selected rather arbitrarily. One of the most common choices to define the non-null weights \(w_{ij}\) is based upon neighbourhood considerations:

$$\begin{aligned} w_{ij} = {\left\{ \begin{array}{ll} 1, &{} \text {if } j \in Q(i) \\ 0, &{} \text {if } j \notin Q(i) \end{array}\right. } \end{aligned}$$
(6)

where the set Q(i) denotes the neighbours of variable \(Y_i\). Specifically, \(w_{ij}\) equals 1 when areas i and j share boundaries, and 0 otherwise.

The next ingredient is the model. The most common model is the conditionally autoregressive (CAR) model by Besag (1974). Its general form is:

$$\begin{aligned} Y_i = \mu _i + \phi _i + e_i \end{aligned}$$
(7)

where, \(\phi _i\) and \(e_i\) are the spatially structured random effect and the unstructured effect, respectively, i.e., a location-specific component. To incorporate spatial covariates, a mean specification dependent on covariates can be introduced by setting:

$$\begin{aligned} \mu _i = \textbf{x}_i^T \varvec{\beta } \end{aligned}$$
(8)

where \(\textbf{x}_i^T = (1, x_{i1}, \dots , x_{ik})^T\) is the transposed vector of k covariates observed in the region/link i, and \(\varvec{\beta } = (\beta _0, \beta _1, \dots , \beta _k)^T\) is a vector of parameters. For identification purposes, this model is typically addressed within a Bayesian framework to capture local and spatial uncertainty due to factors not measurable only through the datasets, also because of the dynamics of crash occurrences. A prior distribution for each \(\phi _i\) is assumed to be:

$$\begin{aligned} \phi _i \mid \phi _j, j \ne i \sim N\left( \sum _j w_{ij} \phi _j, \sigma ^2\right) \end{aligned}$$
(9)

i.e., it is assumed to be distributed according to a normal random variable with an average equal to the sum of the weighted values of its neighbours and an unknown variance. The joint distribution of \(\varvec{\phi }\) is a multivariate normal random variable \(\varvec{\phi } \sim N(\textbf{0}, \textbf{Q}^{-1})\) with \(\textbf{Q} = [\tau \textbf{D} (I - \alpha \textbf{B})]\) where \(\textbf{D}\) is a diagonal matrix whose entries represent the number of neighbours, \(\tau \) is the precision parameter, \(\alpha \) controls the degree of spatial correlation (\(\alpha =0\) implies spatial independence and \(\alpha =1\) implies complete spatial correlation). Finally, \(\textbf{B} = \textbf{D}^{-1}\textbf{W}\) is the scaled adjacency matrix.

Due to computational complexity (see Cantaluppi et al., 2023), the Intrinsic Conditional Auto-Regressive (ICAR) models are often employed, setting \(\alpha =1\). In this case, \(\textbf{Q}\) simplifies to \(\tau (\textbf{D}-\textbf{W})\). It is possible to prove that the ICAR prior may face rank-deficiency issues. In general, a set of sum-to-zero constraints on the vector \(\varvec{\phi }\) is necessary for each group of connected segments (Hodges et al., 2003). In our application, we deal with a fully connected road network, so we only need to impose one set of constraints (Gilardi et al., 2022).

Applying this method to data of a network of lines has several limitations, as it does not consider the true distance between two locations in a directed graph. Additionally, it must be applied to a whole area, making it challenging to interpret the impact on the estimates of different domain characteristics.

Therefore, in the following section, we provide a different proposal that allows us to solve the issues listed above.

3 What model for crash intensity?

We propose an extension of existing models to fit car crashes. Specifically, our proposal consists of two main steps. The first one enhances CAR by incorporating a spatial lag of \(\textbf{X}\) (SLX), a concept borrowed from the econometric literature. This step is relatively fast, reliable and flexible. It allows to model spatial heterogeneity and to estimate in a compact way the risk of accidents of the edges of the area under consideration. The second step introduces a novel two-stage Graphically Weighted Poisson Regression (GWPR). This step proves invaluable for a more in-depth exploration of the characteristics that influence a specific segment or edge of the road network, revealing the key components that are likely to contribute significantly to the expected local accident rate. The two solutions will be detailed in Sects. 3.1 and 3.2, respectively.

3.1 CAR with SLX

With the purpose of extending CAR, we need to consider the distance along a directed network, taking into account the real constraints of street navigation from one point to another. To include the intuitive notion that links too far apart are conceptually not dependent, we employ the bi-square kernel function (see Nakaya et al., 2005), where the weights \(w_{ij}\) with \(i \ne j\) of the proximity matrix are defined as follows:

$$\begin{aligned} w_{ij} = {\left\{ \begin{array}{ll} \left[ 1 - \left( \dfrac{d_{ij}}{h}\right) ^2\right] ^2, &{} \text {if } d_{ij} < h \\ 0,&{} \text {if } d_{ij} \ge h \end{array}\right. } \end{aligned}$$
(10)

and \(d_{ij}\) is the distance between i and j, h is the bandwidth and it will be selected using a cross-validation procedure. It is worth pointing out that the weight decreases as the distance increases, and the selection of h allows to add sparsity to the matrix, resulting in a significant reduction of the computational time.

Additionally, to account for the constraints imposed by the road infrastructure, we do not define \(d_{ij}\) as the Euclidean distance but as the weighted shortest path distance. Specifically, a directed path in a graph denotes the shortest route from one node (or vertex) to another, where links have the same specific direction. The length of this path, in the case of a weighted graph, is obtained by summing the weights of each link involved in the path.

Given the distances outlined above, we propose a hybrid of two spatial autoregressive models (SLM and SLX, for detailed information and further references see Vega & Elhorst, 2015). In general, the average count of a spatially dependent variable in the i-th road is modelled according to:

$$\begin{aligned} Y_{i} | \{y_{j}, j \in N(i)\} \sim \text {Pois}(E_i \mu _i) \end{aligned}$$
(11)

where the dependent variable, conditioned on its neighbours, follows a Poisson distribution. For practical applications in the insurance context, see, for instance, (Gschlossl & Czado, 2007; Tufvesson et al., 2019).

Notice that \(E_i\) represents the exposure parameter, also known as offset. This component is added to the model to consider the occurrences of the response variable in relation to a unit measure. In this context, it is common to use metrics such as the length of the road, traffic volume, or the vehicle miles travelled (VMT), which is the product of the road length and the number of cars over a specified time unit. Specifically, VMT serves as an indicator of the total distance travelled (in miles or kilometers) by all vehicles within a specific area/road and time period. It aids in identifying regions that experience higher travel frequency. For our purposes, we will use the VMT.

In general, traffic information is available at some cost. Most common providers (as Google, HERE, Bing, TomTom, etc.) partially allow the download of traffic flows for free, but only for limited areas or periods. Alternative indicators can be obviously used. Boulieri et al. (2016) use annual average daily traffic, which is the number of vehicles travelling along a given point on a highway on an average day in the year. Authors in Gilardi et al. (2023) propose measurement error models to filter the uncertainty in measuring traffic. For our purposes, we saved traffic data for the Milan province.

Considering \(\mu _i\), typically it is assumed that

$$\begin{aligned} \log (\mu _i) = \sum _{j=1}^{k} x_{ij} \cdot \beta _j + \phi _i + e_i \end{aligned}$$
(12)

The fixed and the spatial components on the right, which encompass the link characteristics, are not indexed with respect to time because they refer to the structure of the network, which is substantially assumed to be constant over the reference time horizon. It models the average occurrences on a link in the time unit. Due to the lack of consistent data at sufficiently short intervals, for the response variable we consider the time component only at the yearly level using the approach proposed in Korn (2021).

To address the aforementioned details, we propose combining the CAR model and the SLX model to limit unobservable components in formula (7) to measure clearly the impact of features on risk. We will also explicitly incorporate the spatial dependence of features and accident occurrences. This approach will ensure a logical flow of information to consider the influence of neighbouring network structure.Footnote 1 We set:

$$\begin{aligned} \varvec{\mu } = \rho \textbf{W}\textbf{y} + \textbf{X} \varvec{\beta } + \textbf{W}\textbf{X}_{(-1)} \varvec{\eta } \end{aligned}$$
(13)

with \(w_{ij} = 0\) for \(i = j\) and \(\left\| w_{i.}\right\| = 1\) ensuring row-wise normalization of weights to 1. The parameter \(\rho \) modulates the spatial dependence among occurrences, making it a valuable component for identifying areas more susceptible to accidents. \(\varvec{\eta }\) is a vector of parameters that measures the marginal impact of the features of explanatory variables associated with neighbouring links. The notation \(\textbf{X}_{(-1)}\) indicates the \(\textbf{X}\) matrix without the column associated to the intercept.

The average of the spatial Poisson model can be expressed in a reduced form as:

$$\begin{aligned} \log (\lambda ) = (\textbf{I}_{n} - \rho \textbf{W})^{-1} \left( \textbf{X}\varvec{\beta } + \textbf{W}\textbf{X}_{(-1)}\varvec{\eta }\right) + \text {offset} \end{aligned}$$
(14)

To facilitate rapid estimation of \(\rho \), \(\varvec{\beta }\), \(\varvec{\eta }\), and their associated uncertainty, we employ the software INLAFootnote 2 (Gómez-Rubio et al., 2020, 2021; Lindgren & Rue, 2015), utilising the homonymous R package (R Core Team, 2022; Bivand et al., 2015). A hierarchical structure is assumed. We have set \(0< \rho < 1\) with \(\theta \sim \text {logitbeta}(1,1)\)Footnote 3 where \(\theta = \log \left( \frac{\rho }{1 - \rho }\right) \). For each parameter in \(\varvec{\beta }\), \(\varvec{\eta }\), Gaussian vague priors, i.e. N(0, 1000), are employed.

To reduce the overall complexity of formula (14) we cannot perform standard variable selection methods. There are no p-values in INLA. Importance or significance of variables requires to examine if their (e.g.) 2.5% and 97.5% posterior estimates overlap zero. This involves removing covariates and seeing how this might change model fit according to the model’s Deviance Information Criterion (DIC). We executed this step iteratively, reducing progressively the number of covariates until when the removal of any additional covariate resulted in a DIC increase exceeding 10%. In addition a hierarchical selection has been applied: if a variable in \(\textbf{X}\) is removed then also the corresponding variable in \(\textbf{W}\textbf{X}_{(-1)}\) is removed. The vice versa was not allowed. To consistently account for the year-to-year effect, the year component has never been removed. The step is somehow computational demanding but it helped to improve significantly the interpretation of the final models.

3.2 GWPR: a hierearchical approach

To integrate the detailed analysis of a Poisson point process with the approximations of a Poisson count model and to leverage the flexibility offered by variable selection techniques in identifying subsets of covariates that can capture local heterogeneity, we propose a two-stage mixed geographically weighted Poisson regression (GWPR).

Geographically weighted regression (GWR) (see Brunsdon et al., 1998) performs regression analyses using local samples within a specified bandwidth distance to explore spatially varying relationships between explanatory and dependent variables. Specifically, we focus on the GWPR version, which assumes at the generic position u:

$$\begin{aligned} Y_j \sim \text {Pois}(\mu _j E_j) \end{aligned}$$
(15)

\( Y_j\) are i.i.d. and \(\mu _j = \exp (\textbf{x}_j \varvec{\beta } (u))\) and \(E_j\) is the offset, while \(\varvec{\beta } (u)\) is the vector of varying coefficients. These coefficients are calibrated using a kernel regression methodology, in which we estimate smoothed geographical variations of parameters with a spatial weighting kernel. The estimates of the parameters are calibrated in a point-wise way (see Brunsdon et al., 2005). The local log-likelihood at position u is:

$$\begin{aligned} l(u) = \sum _{j=1}^{n} (\mu _j E_j + y_j \log (\mu _j E_j)) \cdot w_{ij}(\Vert u - u_j\Vert ) \end{aligned}$$
(16)

where \(w_{ij}\) is the geographical weight of the j-th observation at the i-th regression point. The weighting kernel used is the bi-square kernel function provided in formula (10), with \(d_{ij}\) set to the shortest path distance.

It is noteworthy that the parameters depend on u. This implies an interaction between geographical location and the functional relationships within the linear predictor. The model potentially encompasses different coefficients for each u. What is particularly intriguing is that while conventional kernel regression modelling aims at estimating a regression function f by approximating it with polynomials centred on specific values of \({\textbf {x}}_i\), in GWPR, the likelihood is geographically weighted with the weights being determined by a kernel function centred on u. The key distinction lies in the fact that in GWR, the kernel is defined in a geographical space while the regression model pertains to predictor-variable space.

Therefore, this approach enables us to map the variation in the regression coefficients, providing insights into the spatial patterns between the predictor and response variables. An example of the application of standard GWR for car crashes is Pirdavani et al. (2014). What the literature on crash occurrences has not yet thoroughly examined is the potential geographical dependency of certain covariates, while others may not exhibit this characteristic. For instance, traffic lights in urban areas may demonstrate spatial dependence, but in rural areas, that may not hold true. Hence, for \(\mu _j\) in (15) we propose the following structure:

$$\begin{aligned} \mu _j = \exp \left( \sum _{j=1}^{k} x_{ij} \beta _j(u_i) + \sum _{m=1}^{q} x_{im} \gamma _m\right) \end{aligned}$$
(17)

where \(\gamma _m\) are the coefficients of the variables that do not show geographical dependence. The estimation process involves an iterative method (see (Nakaya et al., 2005)). However, it appears crucial how to effectively partition the explanatory variables into the two subsets.

To this purpose, we propose a hierarchical GWPR algorithm:

  1. 1.

    Setting the bandwidth

    • Start with an initial value for the bandwidth, denoted as h.

    • Define \({\textbf {z}}_{(-i)}\) as the vector \({\textbf {z}}\) without the i-th element or the statistics obtained by removing the i-th row of the data matrix.

    • Solve the following optimization problem:

      $$\begin{aligned} \hat{h} : \min \sum _{i=1}^{n} \left[ y_i - \hat{y}_{(-i)}(h)\right] ^2 \end{aligned}$$
      (18)

      where

      $$\begin{aligned} \hat{y}_{(-i)}(h) = E_j \exp ({\textbf {x}}_j \hat{\varvec{\beta }}(u)) \end{aligned}$$
      (19)

      and

      $$\begin{aligned} \hat{\varvec{\beta }}(u): \max _{\varvec{\beta }(u)} \left\{ l_{(-i)}(u)\right\} \end{aligned}$$
      (20)

      Note: The mean squared prediction error is used, but it can be replaced with other appropriate cost functions like deviance, Akaike Information Criteria, etc.

  2. 2.

    Penalised GWPR

    • Formulate the penalised GWPR problem as follows:

      $$\begin{aligned} \max _{\varvec{\beta }(u)} \left\{ l(u) - \gamma \left( \sum _{l=1}^{p} \left[ (1 - \alpha ) \beta _l^2(u) + \alpha |\beta _l(u)|\right] \right) \Bigg | \, \, \gamma , \alpha , \hat{h} \right\} \end{aligned}$$
      (21)

      where \(\gamma > 0\) is the regularisation parameter and \(0 \le \alpha \le 1\) moderates the elastic net penalty (Zou & Hastie, 2005).

  3. 3.

    Splitting the explanatory variables

    • Divide the explanatory variables into \(\textbf{X}_{(-\textbf{z})}\) and \(\textbf{Z}\), where \(\textbf{Z}\) contains the not significant variables. In case of \(\alpha = 1\) (i.e., the LASSO framework), \(\textbf{Z}\) will include variables with coefficients equal to 0. \(\textbf{X}_{(-z)}\) contains all the variables but those in \(\textbf{Z}\).

    • \(\textbf{Z}\) might contain variables that are either truly irrelevant to the problem or are local, i.e., with negligible to null geographical effect.

  4. 4.

    Orthogonal subsets

    • Using the vector of residuals at step 3, apply a penalised (not geographical) elastic net method using only the variables in \(\textbf{Z}\). Let \(\textbf{Q}\) be the variables selected by the method. \(\textbf{Q}\) should contain the subset of not geographical dependent variables.

  5. 5.

    Mixed GWPR

    • In case the set \(\textbf{Q}\) is not empty, consider potential covariate dependence not addressed in step 3. Fit a mixed-GWPR (see Nakaya et al., 2005) with:

      $$\begin{aligned} \log (\mu (u)) = \textbf{X}_{(-\textbf{z})} \varvec{\beta }_{(-\textbf{z})}(u) + \textbf{Q}\varvec{\delta } \end{aligned}$$
      (22)

      with geographical weights applied only to \(\textbf{X}_{(-\textbf{z})}\) and penalization given by:

      $$\begin{aligned} \max _{\varvec{\theta }} \left\{ l(u) - \gamma \sum _{l=1}^{p} \left[ (1 - \alpha ) \theta _l + \alpha |\theta _l|\right] \, \Bigg | \, \gamma , \alpha , \hat{h} \right\} \end{aligned}$$
      (23)

      where \(\varvec{\theta } = \begin{bmatrix} \varvec{\beta }_{(-\textbf{z})}(u)\vspace{2mm}\\ \varvec{\delta } \end{bmatrix}\).

To our knowledge, the attempt to blend elastic net methodology and GWPR is novel in the literature. Although the computational time may be non-negligible, advancements in technology have somewhat mitigated this concern. The main aim of our proposal is to gain evidence regarding the local impact of variables. While car crashes are undoubtedly affected by general factors (e.g. traffic, density of population, speed,...), addressing specific/local components (e.g. lack of traffic lights, pedestrian crossings,...) is essential. Therefore, this approach allows for more effective implementation of political and administrative decisions without affecting overly large domains. An alternative approach would involve fitting a penalised CAR-SLX Poisson model to identify which components in the lagged \(\textbf{W}\textbf{X}\) variables are not selected. The main limitation of CAR-SLX lies in interpreting the estimates (see (Golgher & Voss, 2016) for a comprehensive overview of parameters interpretation in spatial econometric models) while GWPR offers interpretation consistent with generalised linear modelling. In our opinion both methods provide important information to investigate the risk of accidents. In Table 1 we list some pros and cons of both.

Table 1 Comparison of CAR-SLX and GWPR approaches

4 Datasets for car crash modelling: description and issues

In this section, we provide details about the datasets used for assessing the risk associated with car crashes. Our focus is on presenting the key characteristics and challenges we have dealt with.

Table 2 Characteristics retrieved for road networks
Fig. 1
figure 1

Map of Milan Province displaying accident locations marked with black dots. The inner red line delineates the boundary of the city of Milan. (Color figure online)

  • Road Network and its Characteristics In our applications, we extensively used the Open Street Map (OSM). This data can be obtained through various methods, such as using the R packages osmdata Padgham and Rudis (2017) or via the Overpass Application Programming Interface (API). This API provides custom-selected portions of the OSM map data for download. Numerous details are available for each segment of the network. A comprehensive list of covariates used to model spatial point objects is reported in Table 2.

  • Specific Considerations for VariablesNumber of CrossingsandCurvature Degree” The native OSM database does not naturally provide information on the number of road crossings for each link. To include this key risk factor, we estimated the number of crossings by computing the number of links that intersect with other links and, at the same time, culminate in a stop sign. This joint condition is necessary because typically a road can be split into subsequent segments, where the vertices do not always represent road crossings. Similarly, we have also estimated the curvature degree of each road. This metric indicates the minimum number of shape points required to maintain a curve, within 2 ms of a road’s centreline. Both of these features have frequently shown significance in predicting accident occurrences.

  • Socio-demographic Features Many studies (see, e.g., ISTAT - Italian National Institute of Statistics (2021); Choudhari and Maji (2019)), have reported the correlation between local socio-demographic factors (such as population density, family concentrations, housing etc.) and road casualties. This information is typically not available at a street level. On Italian official websites, this classification is reported at a census level, which represents the highest degree of territorial subdivision in Italy. To integrate socio-demographic data with specific road segments, a specific procedure has been implemented to match for each road the corresponding socio-demographic value. For applications these features have been grouped into 7 classes with the same frequency (we have used the empirical quantiles to fix the extremes of the classes).

  • Types of Accidents and Accident Locations The list of accidents we have used refers to crashes provided by the Italian National Office of Statistics (ISTAT) that includes only accidents reported to the police. Thus it mostly lists crashes related to fatalities or injuries. In Appendix A, we report references for obtaining this kind of data in Italy and in other countries. For the Italian dataset, most of accident locations lack details such as house numbers or precise positions along primary roads or highways. In such cases, the accidents have been geographically centred at the middle of the street. As detailed in the previous section, in the count model framework this approximation has a negligible impact. The list of accidents’ locations lacked georeferencing, necessitating reverse geocoding to map them onto the road network. It is important to notice that coordinates returned by the reverse geocoding procedure (in our case based on the R package hereR Unterfinger and Possenriede (2023)) may not perfectly match the road network provided by a different source, such as OSM. Some degrees of approximation is possible (Kılıç et al. (2023)). Consequently, it is possible that the coordinates are just in close proximity to road segments. To overcome this issue, we employed the strategy to project accidents orthogonally onto the nearest roads selecting the link with the shortest distance from them. To account for the potential misassignment of an accident to a wrong road through this strategy, we implemented a code to randomly assign those accidents to one of the roads within a convex hull built near the coordinates that do not align with the road network. The impact on the estimates is negligible and we do not report these results in this paper. The map illustrating accident locations is depicted in Fig. 1. In Fig. 2, we present an example from a specific area in Milan, where certain misalignments were observed and subsequently addressed by employing the random assignment method described above. From Fig. 1 it is clear the high concentration of events in Milan city.

  • Short Segments and Structural Zero Accidents Short segments and structural zero accidents pose a challenge in our network, composed of thousands of short segments. Indeed, some segments may be associated with zero accidents not only because no events have occurred in the past, but also due to the extremely short length of the segments, making it improbable for an event to be observed there. This is not a problem that can be resolved with a zero inflated model alone. Such an issue, also noted in Gilardi et al. (2022), may artificially inflate the number of segments with zero crashes, potentially biasing the results of models. To mitigate this, each model was fitted by systematically filtering out edges with lengths ranging from 0 to 500 ms in intervals of 25 ms. The best filter has been set at the level that allowed the best prediction in terms of minimizing the deviance of residuals on the set of segments removed by the process. The procedure significantly enhanced the quality of the fit while minimizing the impact of roads with potential structural zero accidents.

Fig. 2
figure 2

An excerpt from the map of Milan Province showing accident locations and highlighting specific reverse geocoding misalignments, which were subsequently resolved using the random assignment method described in Sect. 4

5 Case study

The proposed approach has been subjected to testing employing both CAR-SLX and GWPR methods across the extensive areas of Milan city and its province. This assessment aims to evaluate the method’s robustness when applied to different network structures. The dataset covers the period from 2016 to 2020. It comprises 168,965 links totalling 49,377 claims related to personal injury occurred along a total of 11,672.88 km network length. By splitting the information between the city of Milan and the province, we display details in Table 3, where we report some statistics for most relevant road features.

Table 3 Some descriptive statistics of selected road features of the province of Milan and of Milan city

With the only exception of Milan the area has been split into the 132 municipalities. For Milan we have used the corresponding 38 postal code areas (ZIP code). Different tassellation methods can be adopted. Our choice is addressed by the possibility of easy interpretation of results for decision makers. The weighing scheme in model (13) and, natively, in the GWPR model mitigate the subjective choice of any tassellation scheme. To consider the local network complexity the overall length of the network of each area has been used to weight the averages of the road feature of each sub-area. The weighted average of accidents (per year) equals 25.44 for the province versus an average of 171.51 accidents for Milan, thus showing the much higher risk of accidents in the Milan urban area. The empirical quantiles of order 5 and 95 compared with the average show a high asymmetry for most of the variables. As expected pedestrian crossing and traffic lights are much more concentrated in Milan city than in the province.

Although we have not reported the details in Table 3, depending on the areas under investigation, we observed sometimes a strong unbalance in some dichotomous explanatory variables. It may cause quasi-collinearity problems. To address this issue, our program systematically checks the numerical stability of results. In the rare instances where the problem was observed, we have implemented a function to investigate the causes of the failure. Categories found to be collinear with others are removed, and the code restarts from the beginning to ensure robustness of results. To limit the impact of short segments, 5431 segments have been filtered out as per Sect. 4. For each sub-areas on average the percentage of edges of the original network we have used to fit our models is 96.18% and 99.13% for the province and the city of Milan, respectively. The high percentages of valid edges we have used allowed to map almost completely the whole network.

5.1 CAR-SLX model fit

We start focusing on the fit of the model in formula (14). Statistics for each of the 170 subareas used to divide the province of Milan cannot be reported. Therefore, we present some statistics of risk coefficient estimates for the main covariates used in formula (13) in aggregated mode (see Table 4). It is worth noting that, for the province of Milan, the quantile of order 95% of \(\hat{\rho }\) equals 0.438 and the median is 0. Considering the city of Milan, the same statistics are 0.734 and 0.661, respectively. As almost expected, crash occurrences show a stronger and more persistent spatial dependence in areas with high road density compared to less concentrated road networks.

The column titled “# estimates >0” indicates the number of times the average estimate of an area is greater than zero, thus increasing the risk of accidents. The number of road crossings, pedestrian crossings, and traffic lights is generally positive both inside and outside Milan city centre, and this is persistent in most areas. Priority roads appear to have a significant impact on the risk of accidents in the province of Milan. In the city, roads with a fixed speed limit of 50 km/h are more exposed to the risk of accidents. Additionally, the higher the building density, the higher the risk coefficient. The situation is different for the population data. One possible explanation is the lower availability of this data in the province of Milan compared to the city. The column labelled “# feature N/A” indicates the number of times that a particular feature was not available for the specific subset. In some areas, at least one population or building category could not be used. Conversely, the number of times that the posterior interval of estimates overlaps zero for the population and building variables lagged in formula (13) is moderately limited and often positive. This means that the contribution to the risk of accidents at a specific location also depends on how many people live in the neighbourhood of that location. The year effect shows a decrease in risk relative to 2016, primarily due to a slight reduction of accidents, except notably in 2018, which experienced a moderate peak in accidents. This is reflected in the decrease in values within the column labelled “# estimates >0”. It is interesting to notice that the posterior interval of the year’s estimates never overlaps zero, highlighting the significant contribution of this component. Further details can be found in (ISTAT - Italian National Institute of Statistics, 2021).

The last rows in Table 4 report some descriptive statistics of the crash risk estimate, i.e. \(\hat{\lambda }\) from formula (14), at a street/edge level for year. Just for interpretation, the column labelled “crash risk estimate (quantile 1%)” contains the vector of the 1st quantiles of the risk estimates of each areas. The column “average” reports the corresponding weighted average. Moving from the province of Milan to the city of Milan it is clear the estimates and, correspondingly the risk, increase. The annual average probability to record one accident outside Milan equals 0.6% versus 4.59% in Milan city.

Figure 3 shows into details the distribution of the risk on the province of Milan. To facilitate the view of the map we have grouped the vector of \(\hat{\lambda }\) of each areas into classes of deciles and used colours to plot each of them. Areas where it is expected a high concentration of traffic or of population seem to be coherently more exposed to the risk of accidents.

Table 4 CAR-SLX estimates
Fig. 3
figure 3

(top) Map of the risk of accidents for the whole province of Milan; (bottom) Zoom on Milan city centre. In both figures, roads are coloured from the min (green) to the max (red) local risk, clustering roads in deciles. (Color figure online)

From now on we present the results at the city level. Specifically, we compare the averages of the fitted values with the observed values for each city. For Milan, we further break down the results for each ZIP code. Figure 4 shows that, on average, the model provides reliable estimates of the risk with respect to the observed claims.

Additionally, it can be observed how Milan is characterised by a higher risk expressed in terms of accidents with respect to the province. Cities typically experience higher rates of car accidents due to increased traffic density, diverse road users, varying driving speeds, and intersections, all of which contribute to a higher likelihood of collisions compared to less populated neighbourhoods.

Fig. 4
figure 4

Comparison between average observed and fitted values for cities. For Milan, we further break down the results for each ZIP code

Fig. 5
figure 5

Mean, quantiles at 10% and 90% and incidence in terms of number of links

The emphasis on these trends becomes more pronounced when examining the fitted values, as displayed in Fig. 5. Notably, all ZIP codes within Milan show a consistently higher level of risk. Moreover, cities situated in close proximity (such as Cinisello Balsamo, Cormano, Sesto San Giovanni) or located along significant thoroughfares linking to neighbouring provinces (like on the connection between Milan and Monza) exhibit relevant levels of risk. This accentuates the influence of immediate proximity to larger cities or key transportation routes on the heightened risk profiles observed in these areas.

Figure 6 analyses the yearly behaviour of the estimates. Specifically, the data for the year 2020 is dissected into quarters, providing a detailed view of the trends. It is clear that there is a significant decrease in car accidents during the first quarter, which coincided with the nationwide lockdown imposed by the Italian government on March 9, 2020. This stringent measure restricted population movement, permitting only essential outings for specific work, health, and urgent needs as a response to the escalating COVID-19 pandemic in the country. While the second quarter of 2020 showcased a marginal uptick in accidents, it remained notably lower. This period witnessed the relaxation of certain restrictions, yet factors like shifts in commuting patterns due to widespread remote work practices and a heightened emphasis on health and safety measures continued to influence the accident rates, contributing to this observed decrease despite the partial easing of lockdown measures.

Fig. 6
figure 6

Boxplot of fitted values according to year and quarter

Moving our attention to the intricacies and significance of street infrastructure, our analysis in Fig. 7 displays road behaviour based on their allowance for traffic movement. Meanwhile, Fig. 8 focuses on the prevalence of crossroads. These visualizations show a well defined trend: a discernible increase in accident occurrences aligning with the escalation of street complexity. Specifically, as depicted in the data, there exists a direct relationship between the complexity of roads and accident frequency. As the number of crossroads surges, so does the incidence of accidents. This phenomenon is rooted in the increased potential for collisions at these junctures, as intersections inherently pose complexity and potential hazards within traffic flow dynamics. The increased number of intersecting points elevates the complexity of navigation, thereby contributing significantly to the risk of accidents within such environments.

Fig. 7
figure 7

Boxplot of fitted values according to the complexity/importance of the road. In red the incidence for each class (i.e. number of links over the total). Class 1 includes roads that allow for high volume, maximum speed traffic movement between and through major metropolitan areas. Class 5 is applied to roads whose volume and traffic movement are below the level of any other class. In addition, walkways, truck only roads, bus only roads, and emergency vehicle only roads receive. (Color figure online)

Fig. 8
figure 8

Boxplot of fitted values according to number of crossroads. In red the incidence for each class (i.e. number of links over the total). (Color figure online)

We focus in Fig. 9 on the effect of several dichotomous variables that have been considered in the model. We observe that the presence of specific traffic control mechanisms significantly influences the risk profile associated with different street segments. Our analysis reveals distinct risk differentials linked to the presence or absence of roundabouts and traffic lights. Roundabouts emerge as a notable factor in mitigating accident risks. Streets equipped with roundabouts exhibit an average estimate risk around 1% lower than those streets lacking this traffic control feature. The design and functionality of roundabouts, promoting continuous traffic flow and reduced collision points, contribute to this lower risk quotient. Their ability to enforce reduced speeds and encourage cautious manoeuvring diminishes the probability of severe accidents, thereby positively impacting overall safety on such road segments. Conversely, the presence of traffic lights significantly elevates the risk profile. Our estimates shows a noteworthy average risk higher than 7% in streets where traffic lights are present. Traffic lights, while crucial for regulating traffic flow and pedestrian safety, can inadvertently heighten risks due to factors such as sudden changes in signal phases, potential red-light violations, and the potential for high-speed collisions at intersections. This elevation in risk underscores the complexities and challenges associated with managing safety at signal-controlled intersections, necessitating a closer examination of strategies to mitigate these heightened risks within such environments.

Fig. 9
figure 9

Incidence and risk for dichotomous features used in the model

Particularly noteworthy is also the impact of the “ No overtaking” signal compared to the “ Stop” signal on accident risk. Streets marked with the “ No overtaking” signal show a risk reduction trend in comparison to those featuring the “ Stop” signal. The imposition of restrictions on passing opportunities fosters safer driving conditions by limiting overtaking actions, thereby diminishing the probability of accidents along these segments. Conversely, streets regulated by “ Stop” signals portray a comparatively higher risk profile. The imposition of these signals, while essential for controlling intersections and managing right-of-way, appears to contribute to greater risk levels. Factors such as sudden stops, intersection complexities, and potential non-compliance with stop indications may collectively elevate the risk quotient for streets governed by these signals.

The presence of pedestrian crossings seems also correlated with increased risk on roadways. This is probably due to several contributing factors, as increased interaction points, reduced visibility and concentration of vulnerable users. Indeed, pedestrian crossings introduce interaction points between vehicles and pedestrians. These locations become critical areas where differing speeds and modes of transportation intersect, heightening the risk of accidents. Drivers must constantly be vigilant for pedestrians crossing, potentially leading to sudden stops. In some cases, pedestrian crossings, especially those without proper visibility aids or in poorly lit areas, can decrease the visibility of pedestrians for drivers, increasing the likelihood of accidents, particularly during low-light conditions or adverse weather. Additionally, they often concentrate vulnerable road users in specific areas, such as children, elderly individuals, or individuals with disabilities. This concentration increases the potential severity of accidents if a collision occurs.

Also bridges, integral parts of roadways, often pose heightened risk factors contributing to potential accidents. Challenges emerge from restricted visibility around bends or inclines, hindering anticipatory actions. These narrower lanes necessitate cautious navigation, potentially leading to abrupt lane changes or limiting flexibility in maneuvering. Transitioning onto bridges can induce speed variations, impacting traffic flow and increasing collision risks. Additionally, bridges are weather-sensitive, prone to hazardous conditions like icy surfaces or high winds, further elevating accident probabilities. Complex traffic dynamics often accompany bridges, with merges, exits, or intersections nearby, fostering congestion and abrupt traffic changes. Collectively, these aspects accentuate the complexity of bridge-related driving, warranting heightened driver vigilance and careful navigation to navigate these inherent risks effectively.

We also notice in Fig. 10 how higher densities in buildings and population tend to amplify the risk of accidents due to several interrelated factors. In particular, higher densities in buildings and population contribute to a more complex and congested road environment, characterised by increased traffic volume, pedestrian activity, limited space, and intricate intersection dynamics. These factors collectively elevate the risk of accidents, necessitating heightened awareness, patience, and caution from drivers navigating such areas.

Fig. 10
figure 10

Boxplot of fitted values according to building and population densities. Building and population classes are based on deciles. In red the incidence for each class (i.e. number of links over the total). (Color figure online)

Lastly, we conclude by examining the relation between risk and limit speed categories. We observe that urban roads (i.e. class (30,50]) often exhibit higher accident risks due to increased traffic density, diverse road users (pedestrians, cyclists, motorists), frequent intersections, and varying driving speeds. The close proximity of vehicles and the complexity of navigating through city streets increase the chances of collisions, particularly at intersections or during congested periods. On the other hand, we observe that roads with very low speed limits, often at or below 30 km/h, reduce accident risks by minimizing speed differences between vehicles, providing more reaction time for drivers to respond to unexpected events, enhancing visibility and control, ensuring safety for pedestrians and cyclists, calming traffic flow, and aligning with community safety objectives in residential areas. These factors collectively create a safer driving environment and lower the probability of accidents on these roads (Fig. 11).

Highways and belt ways typically have higher speeds and fewer intersections, leading to a different set of risks. Accidents on highways often result from high-speed collisions, lane changes without proper signalling, driver fatigue, and sudden braking due to congestion or road hazards. However, highways generally have fewer points of conflict compared to urban roads, which can reduce certain types of accidents, especially those related to intersections.

Fig. 11
figure 11

Boxplot of fitted values according to speed limit categories. Speed categories are based on the maximum speed allowed in the street. In red the incidence for each class (i.e. number of links over the total). (Color figure online)

5.2 GWPR results

In the previous section, we highlighted the CAR-SLX results, which provide insights into relevant features for evaluating accident risk at the area level. In this section, we concentrate on GWPR results, which allow us to separate the effects at the link level, taking into account the heterogeneity across different streets. The majority of the estimates’ distributions of formula (13) at the road level are symmetric with respect to zero. To reduce the number of plots, Fig. 12 displays the distributions of features that have a non-zero modal value or significant skewness. At the street level, it is confirmed that the number of intersections, pedestrian crossings, and traffic lights have a positive mode. Conversely, roundabouts, stops, roads with no overtaking signals, sharp curves, and roads with lower traffic volume and movement compared to other classes are mostly in the negative domain. To explore the dependence between them, the corresponding Pearson correlation matrix is plotted in Fig. 13. Hierarchical clustering was used to group variables that naturally show clustering, and combinations with a significance value greater than 10% are hidden by a black square. Research has shown that traffic lights and pedestrian crossings can increase the risk of accidents when used together. However, in areas where speed limits are lower, such as city centres, the complexity of signals can help to reduce the risk. On the other hand, the number of crossings and roads with speeds exceeding 90 km/h can increase the risk of accidents.

Fig. 12
figure 12

Distribution of estimates at the road level for various features. Included are distributions solely for features demonstrating a non-zero modal value or displaying considerable skewness, highlighting their discernible patterns within the dataset

Fig. 13
figure 13

Pearson correlation matrix between the vector of estimates obtained at road level for features. We have masked combinations with p-values \(\ge \) 10% using a black square to focus on the most relevant correlations

The availability of estimates at a street level enables the creation of a choropleth map, providing a visual representation of the locations or areas where a feature significantly contributes to the risk. An example of this can be seen in Fig. 14. To facilitate comparison, each vector of estimates has been classified using the same classes as those reported in the legend of the figure. In the right-hand column, we provide a comparison of the spatial distribution of pedestrian crossings and traffic lights at the area level from top to bottom. From the colour intensity, it is clear that areas with traffic lights are darker than the corresponding areas with pedestrian crossings. By following the same procedure as Fig. 3, it is possible to determine where a specific feature has the most impact on an edge. As shown in Fig. 14 on the right, we can delve deeper. To account for local spatial dependence we have plot estimates at the street level with a focus on the centre of Milan city. The figures reveal that certain areas on the outskirts of Milan are vulnerable to both traffic lights and pedestrian crossings.

Fig. 14
figure 14

Choropleth maps depicting the varying contributions of two distinct features to the risk assessment across different areas. The top panel shows the influence of pedestrian crossings, with the left side illustrating data for the province of Milan and the right side for the city of Milan. Meanwhile, the bottom panel exhibits the impact of traffic lights, again divided between the province of Milan on the left and the city of Milan on the right

Another interesting result concerns the analysis of the variables selected to compose the matrix \(\textbf{Q}\) of formula (22). Table 5 shows the number of times each variable has been used as a local feature. The list is sorted according to the overall total number of occurrences. The majority of links does not depend on specific local variables (referred to as “no local variables”). However, it is worth noting that the variables at the top of the list represent specific characteristics of the road, while those at the bottom of the list mostly depend on the local road network architecture. In those cases it is suggested that the risk of accidents cannot be mitigated by acting on just one component. Instead, the overall local infrastructure must be carefully considered.

Table 5 Number of times each variable has been used as of a local feature

6 Conclusion

This paper aims to provide a comprehensive understanding of the accident risk associated with road infrastructure, beyond statistical analysis. The objective is to furnish policy makers with necessary information to make informed decisions to reduce the societal impact of crashes. The analysis of road accident data using spatial models emphasises the need to address road safety concerns.

Drawing upon a wide literature, Conditional Autoregressive modelling and geographically weighted Poisson regression have been merged to create an innovative approach. This fusion was designed not only to capture overarching trends within road networks but also to unveil local nuances, allowing for a detailed understanding of accident risk factors.

The application of these models to real-world data from the city of Milan (Italy) and its province between 2016 and 2020 has yielded insightful results. Valuable insights for informed decision-making in road safety measures have been provided through the identification of key street characteristics that influence accident risk and the spatial distribution of covariates at a road level. This approach not only offers a practical solution but also sets a precedent for leveraging open data for crucial societal issues. The case study focuses on a specific area, but this does not limit the proposal’s applicability. Scalability is feasible for any region with a relevant accident location database, as computational burdens are no longer a challenge. The CAR-SLX model was successfully fitted for the city of Milan, covering over 2000 km, in approximately 31 min using a standard laptop.

Therefore, the proposed approach, combining spatial modelling techniques, offers a promising way to understand the root causes of road crashes and how to mitigate their occurrences. Its effectiveness in identifying critical risk factors at a detailed level underlines its potential to guide targeted interventions and policy decisions aimed at improving road safety. The comprehensive nature of this approach, encompassing both network-wide trends and local intricacies, underlines its importance in advancing the discussion on road safety analysis and policy formulation. However, although we acknowledge the significance of the dataset used in this paper, an extensive, open-source, geocoded repository of car crashes in Italy, it is crucial to recognise also its inherent limitations. The reliance on police-reported accidents implies a potential under-representation of the complete spectrum of accidents occurring in the region, as it excludes those unreported to authorities. This exclusion is believed to predominantly encompass accidents of lesser severity, not resulting in injuries or fatalities. Socially and statistically, these unreported incidents are expected to have a comparatively lower impact on individuals’ lives and the healthcare system.

Indeed, it is reasonable to assume that events resulting in serious adverse outcomes, such as injuries or fatalities, are less likely to go unreported and therefore have a higher representation in the data set. Therefore, while recognising the inherent limitations, in particular the extent of unreported events, the focus remains primarily on accidents with significant societal and individual consequences, with the aim of providing meaningful evidence to improve road safety policies.