1 Introduction

Traffic signals are often chosen for operational reasons and may involve trade-offs between safety and mobility. According to the National Highway Traffic Safety Administration (NHTSA) Fatality Analysis Reporting System (FARS), signalized intersections account for about one-third of all intersection fatalities, including a large proportion that involves red-light running. In 2020 alone, NHTSA reported over 38,000 traffic fatalities in the United States, and nearly 30 percent of these crashes involved human-driven vehicles navigating intersections. To reduce the number of fatal accidents attributed to human error, there has been growing interest in introducing autonomous vehicles into traffic systems. While autonomous vehicles offer numerous benefits, coordinating their actions is challenging, especially at signalized intersections. Addressing these challenges requires effective action control policies for reliable decision-making by autonomous vehicles at signalized intersections.

Rule-based decision-making methods are frequently employed in autonomous driving, valued for their reliability and interpretability [1, 17, 23, 25, 45, 46]. In straightforward driving situations, these approaches can deliver strong safety performance through manually crafted rules. However, their effectiveness diminishes in intricate urban settings, especially at road intersections, where a multitude of uncertainties ranging from vehicle dynamics to sensor inaccuracies are at play. In such complex scenarios, the limitations of rule-based strategies become evident, as they cannot consistently ensure both safety and efficiency. They often lack the adaptability required for dynamic traffic scenarios, struggling with complexity and unforeseen nuances that deviate from predefined rules. Scalability becomes an issue as intersections grow in complexity, leading to incomplete rule coverage. Rule conflicts, human-designed biases, and dependence on flawless data further impede their effectiveness. Consequently, alternative approaches that exhibit greater adaptability and intelligence, such as optimization techniques and learning-based methodologies, have garnered significant attention in recent years.

A popular approach for intersection planning is the optimization-based approach, which casts the planning problem as a real-time optimization problem and sets up the cost function, boundaries, and constraints to facilitate heuristic planning and decision-making. Optimization-based strategies such as model predictive control (MPC) [18, 20, 26, 27, 31, 35, 37, 40] and the Bezier curve optimization method [2, 29, 32, 33, 34] have been widely used for solving planning and decision-making problems for autonomous vehicles at intersections. Despite their widespread use, they come with certain limitations that can impact their effectiveness in handling complex scenarios. These methods often suffer from computational complexity, especially when dealing with the high-dimensional state spaces inherent to urban environments, leading to challenges in real-time decision-making. Handling uncertainties in vehicle dynamics and sensor noise is intricate, potentially compromising solution accuracy. Moreover, optimization-based strategies struggle to swiftly adapt to changing road conditions and unforeseen events, and they exhibit limitations in scaling to large traffic networks. Their reactive nature and reliance on model assumptions hinder dynamic decision-making, and achieving real-time solutions within tight time constraints remains a challenge.

Several machine learning-based approaches have been developed to address intersection challenges, such as imitation learning [5, 8, 9, 10, 13, 14], online planning [6, 15, 16, 42], and offline learning [7, 19, 21, 22, 39, 44, 47]. In imitation learning, policies are learned from human drivers, yet their efficacy falls short in scenarios where the agent encounters states beyond its training data. Online planners, alternatively, determine optimal actions by simulating future states from the current time step. Online planners based on partially observable Monte Carlo planning (POMCP) [4, 6, 24, 38] have been shown to handle intersections, but they rely on the existence of an accurate generative model. Offline learning, frequently employing Markov decision processes (MDPs), addresses intersection complexities from a strategic standpoint [28, 36]. Reinforcement learning (RL) is a common offline learning approach for decision-making and planning for autonomous vehicles at intersections. In the context of RL algorithms, observations and a profound grasp of the environment are crucial for effective decision-making. Autonomous vehicles rely on sensors, such as cameras, lidar, radar, and other perception systems, to gather data about the surrounding environment. However, due to factors like uncertainties in measurement and system dynamics, accurately estimating the state of the environment becomes essential [3, 12, 30, 43].

Sensor data available to autonomous vehicles often present challenges due to their partial, noisy, or missing nature in accurately representing the vehicle's state and the surrounding environment. However, these data sources hold the potential to significantly enhance prediction accuracy when harmonized with the underlying physics. In this paper, we propose a novel approach that integrates system dynamics with imperfect sensory information to control autonomous vehicles within signalized intersections. The proposed methodology introduces a physics-based representation of the system dynamics, which inherently encompasses stochastic elements arising from the disparities between real-world scenarios and model-based representations. These discrepancies, stemming from intricacies not captured by the model, are represented through a Markov decision process (MDP) model that accounts for the system's inherent physics while accommodating uncertainties. Subsequently, we take advantage of the partial information provided by the sensor data to reduce the uncertainty associated with the parts of the system model that are reflected in the data and to better estimate the other physical variables that are not measurable through sensor data. The proposed framework develops particle-based reinforcement learning action policies that incorporate both data and physics to control an autonomous vehicle approaching a signalized intersection. The proposed “Point-Based” and “Distribution-Based” action policies account for the uncertainties in the system by making decisions according to either a single estimate or the posterior distribution of the system state.

The rest of the paper is organized as follows. In Section 2, a detailed description of the proposed framework, including the problem statement, MDP modeling of the problem, and the proposed particle-based action control policies, is provided. This is followed by Section 3, which presents the numerical experiments demonstrating the performance of the proposed framework in different scenarios of missing state variables and varying levels of data sparsity. Finally, Section 4 contains the concluding remarks.

Fig. 1 A schematic representation of the problem, where an autonomous vehicle at distance d and with velocity v approaches a green traffic light with an elapsed time of \(15\,s\)

2 Proposed Framework

2.1 Problem Statement

In this work, we consider an autonomous vehicle approaching a signalized intersection, and our goal is to control the sequence of actions taken by the vehicle in order to traverse the intersection safely despite the various sources of uncertainty. Consider an autonomous vehicle positioned at a distance d from the traffic light, approaching the intersection with velocity v and acceleration a. The dynamical model of the vehicle over a time interval of \(\varDelta t\) can be formulated as follows [11]:

$$\begin{aligned} \begin{aligned} d'&= d-\frac{1}{2} a \, \varDelta t^2-v \, \varDelta t + n_d,\\ v'&= v + a \, \varDelta t + n_v, \end{aligned} \end{aligned}$$
(1)

where \(n_d \sim p_d\) and \(n_v \sim p_v\) characterize the discrepancies between the modeled distance and velocity of the vehicle after executing acceleration a and their true values. A number of factors contribute to the discrepancy between the model and the true dynamics of the system, including the approximations made in the model, the differences between vehicles, road-tire friction interactions, the number of passengers in the vehicle, variations in the road curvature, etc.
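For concreteness, the following minimal sketch samples one step of the vehicle dynamics in Eq. 1. It assumes zero-mean Gaussian noises for \(p_d\) and \(p_v\), with scales matching the values used later in the experiments (interpreted here as standard deviations); the function name and defaults are illustrative, not part of the paper.

```python
import numpy as np

def step_vehicle(d, v, a, dt=1.0, sigma_d=0.4, sigma_v=0.1, rng=None):
    """One stochastic step of the vehicle dynamics (Eq. 1).

    d, v : current distance to the traffic light and current velocity
    a    : executed acceleration
    The Gaussian noises n_d and n_v model the mismatch between the
    kinematic model and the true motion of the vehicle.
    """
    rng = rng or np.random.default_rng()
    d_next = d - 0.5 * a * dt**2 - v * dt + rng.normal(0.0, sigma_d)
    v_next = v + a * dt + rng.normal(0.0, sigma_v)
    return d_next, v_next

# example: a vehicle 100 m away, driving at 10 m/s, braking at -1 m/s^2
print(step_vehicle(100.0, 10.0, -1.0))
```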

Let \(\phi \in \{{\textit{G, Y, R}}\}\) represent the current phase of the traffic light, indicating the green, yellow, and red colors. The timings of the traffic phases are often stochastic, depending on the time of day, weather, and traffic conditions. We represent the expected lengths of the green, yellow, and red traffic signals by \(T_{\textit{G}}\), \(T_{\textit{Y}}\), and \(T_{\textit{R}}\), each with an additive noise \(n_{\phi } \sim p_{\phi }\) that models the uncertainty in the phase durations. Let \(t_{\phi }\) denote the time that has elapsed since the last transition of the traffic light. The possible values for \(t_\phi \) are determined as follows:

$$\begin{aligned} \begin{aligned}&\phi = {\textit{G}} \, : \, t_{\phi } \in [0 \quad T_{\textit{G}}+n_{\phi }]\\&\phi = {\textit{Y}} \, : \, t_{\phi } \in [0 \quad T_{\textit{Y}}+n_{\phi }]\\&\phi = {\textit{R}} \, : \, t_{\phi } \in [0 \quad T_{\textit{R}}+n_{\phi }] \end{aligned} \end{aligned}$$
(2)

where, within a time interval of \(\varDelta t\), if \(t_{\phi } + \varDelta t\) exceeds the duration of the current phase, a transition occurs based on the cyclic sequence \({\text{ G }}\rightarrow {\text{ Y }}\rightarrow {\text{ R }}\rightarrow {\text{ G }}\); otherwise, the phase of the traffic light remains unchanged after the time interval \(\varDelta t\). Figure 1 depicts the system at one instant in which an autonomous vehicle at distance d moves toward the intersection with velocity v while the traffic light is green and the elapsed time is \(15\,s\).
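A minimal sketch of this phase-transition logic is given below, assuming Gaussian noise on the nominal durations; the nominal durations mirror those reported in the experiments, while the noise scale and helper name are illustrative assumptions.

```python
import numpy as np

PHASES = ["G", "Y", "R"]                      # cyclic order G -> Y -> R -> G
DURATIONS = {"G": 10.0, "Y": 5.0, "R": 10.0}  # nominal T_G, T_Y, T_R in seconds

def step_light(phase, t_phase, dt=1.0, sigma_phi=0.1, rng=None):
    """Advance the traffic light by dt seconds (cf. Eq. 2)."""
    rng = rng or np.random.default_rng()
    duration = DURATIONS[phase] + rng.normal(0.0, sigma_phi)  # T_phi + n_phi
    if t_phase + dt <= duration:
        return phase, t_phase + dt            # stay in the current phase
    next_phase = PHASES[(PHASES.index(phase) + 1) % 3]
    # elapsed time in the new phase is uniform on (0, dt), as in Eq. 6
    return next_phase, rng.uniform(0.0, dt)
```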

2.2 Markov Decision Process (MDP) Modeling

Here, we model the problem as a Markov Decision Process (MDP). An MDP is a mathematical tool used to aid decision-making in complex systems. An MDP is defined by the tuple \((\mathcal {X}, \mathcal {A}, \mathcal {T}, R, \gamma )\), with the possible states \(\textbf{x}\in \mathcal {X}\) and possible actions \(\textbf{a}\in \mathcal {A}\) to be executed by an agent. The transition probability \(\mathcal {T}(\textbf{x}', \textbf{x}, \textbf{a}) = p(\textbf{x}' \mid \textbf{x}, \textbf{a})\) describes the probability of ending in state \(\textbf{x}'\) when executing action \(\textbf{a}\) in state \(\textbf{x}\), representing the dynamics. \(R(\textbf{x},\textbf{a})\) is the reward for taking action \(\textbf{a}\) in state \(\textbf{x}\). Finally, \(\gamma \in (0,1)\) is the discounting factor which prioritizes immediate rewards over rewards in the future. The goal of the agent is to find an optimal control policy \(\pi ^*: \mathcal {X}\rightarrow \mathcal {A}\) that maximizes the expected discounted accumulated reward, given by:

$$\begin{aligned} \pi ^*\,=\,\text {argmax}_{\pi } {\mathbb {E}}\left[ \sum _{k=0}^{\infty } \gamma ^{k} \,R\left( \textbf{x}_{k},\pi (\textbf{x}_k)\right) \right] \,. \end{aligned}$$
(3)

In the following parts, we formulate and construct each element of the MDP tuple for the control of an autonomous vehicle approaching a signalized intersection.

State Space

Let \(\textbf{x}_{k} \in \mathcal {X}\) denote the state vector at time step k (\(k = 1, 2, \dots \)), where consecutive time steps are separated by the time interval \(\varDelta t\). We include the status of the autonomous vehicle and the traffic light in the state vector, as:

$$\begin{aligned} \begin{aligned} \textbf{x}_{k} = [d_k \,\,\, v_k \,\,\, \phi _k \,\,\, t_{{\phi }_k}]^T, \end{aligned} \end{aligned}$$
(4)

where \(d_k\) is the distance of the vehicle from the traffic light, \(v_k\) is the speed of the vehicle, \(\phi _k\) is the phase of the traffic light, and \(t_{{\phi }_k}\) is the time elapsed from the last transition of the traffic light, all at time step k.

Action Space

The action space consists of the acceleration \(a_k\) executed by the autonomous vehicle at time step k. The goal here is to sequentially determine the desired acceleration taken by the autonomous vehicle for traversing a signalized intersection safely.

Transition Model

Given the state \(\textbf{x}_k\) and the vehicle’s acceleration at time k, i.e., \(a_k\), the state at time step \(k+1\) can be obtained as:

$$\begin{aligned} \begin{aligned} \textbf{x}_{k+1}&\sim p(\textbf{x}_{k+1}\mid \textbf{x}_k,a_k). \end{aligned} \end{aligned}$$
(5)

This is the general representation of the state transition, where the Markovian assumption is reflected in the fact that the state at time step \(k+1\) depends only on the state and acceleration at time step k. According to Eqs. 1-2, the state at time step \(k+1\) can be obtained as:

$$\begin{aligned} \begin{aligned}&\begin{bmatrix} d_{k+1}\\ v_{k+1}\\ \phi _{k+1}\\ t_{{\phi }_{k+1}} \end{bmatrix} \\&\quad = \begin{bmatrix} d_{k}- \frac{1}{2}a_{k}\,\varDelta t^2 -v_{k}\, \varDelta t +n_{d,k}\\ v_{k}+a_{k} \, \varDelta t + n_{v,k}\\ \phi _{k} \, {\textbf{1}}_{[t_{{\phi }_k}+\varDelta t \le T_{\phi _{k}}+n_{\phi ,k}]} + \phi _{k}^+ \, {\textbf{1}}_{[t_{{\phi }_k}+\varDelta t> T_{\phi _{k}}+n_{\phi ,k}]} \\ (t_{{\phi }_k}+\varDelta t) \, {\textbf{1}}_{[t_{{\phi }_k}+\varDelta t \le T_{\phi _{k}}+n_{\phi ,k}]} + t \, {\textbf{1}}_{[t_{{\phi }_k}+\varDelta t > T_{\phi _{k}}+n_{\phi ,k}]} \end{bmatrix}, \end{aligned} \end{aligned}$$
(6)

where \({\textbf{1}}\) is an indicator function, \(n_{d,k} \sim p_d\), \(n_{v,k} \sim p_v\), \(n_{\phi ,k} \sim p_{\phi }\), \(\phi _{k}^+\) represents the phase following \(\phi _{k}\) according to the phase transition \({\text{ G }}\rightarrow {\text{ Y }}\rightarrow {\text{ R }}\rightarrow {\text{ G }}\), and \(t\sim \mathcal {U}(0,\varDelta t)\) accounts for the time elapsed in the new phase within the interval \(\varDelta t\), where \(\mathcal {U}\) indicates a uniform distribution. Thus, the system dynamics can be represented as:

$$\begin{aligned} \textbf{x}_{k+1}\, =\,\textbf{f}(\textbf{x}_{k},a_{k})+\textbf{n}_{k}, \end{aligned}$$
(7)

where \(\textbf{f}(.,.)\) is the function reflecting the system dynamics and \(\textbf{n}_{k}\) contains the noises associated with all state variables.
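Combining the two components, the following sketch samples the full stochastic transition of Eq. 6, reusing the step_vehicle and step_light helpers sketched above (both hypothetical names introduced here for illustration).

```python
def step_state(x, a, dt=1.0, rng=None):
    """Sample x_{k+1} ~ p(. | x_k, a_k) as in Eq. 6.

    x is the state tuple (d, v, phase, t_phase) of Eq. 4.
    """
    d, v, phase, t_phase = x
    d_next, v_next = step_vehicle(d, v, a, dt=dt, rng=rng)
    phase_next, t_phase_next = step_light(phase, t_phase, dt=dt, rng=rng)
    return (d_next, v_next, phase_next, t_phase_next)
```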

Reward Model

In this problem, our aim is to prevent the autonomous vehicle from running a red light (RRL) and encourage it to pass the intersection as soon as possible while the phase of the traffic light is green or yellow. We consider a high positive reward, \(R_{\text {term}}\), when the vehicle passes the intersection, which is the terminal state, once the traffic light is green or yellow. Additionally, the vehicle is punished with a high negative reward, \(R_{\text {RRL}}\), if it runs the red light. A negative reward, \(R_{\text {vel}}\), is also considered for the case when the vehicle’s velocity is higher than the allowed maximum speed or when the velocity is lower than the allowed minimum speed beyond a particular distance, \(d_\text {critical}\), from the intersection. Moreover, to have smooth changes in accelerations for the autonomous vehicle, we incorporate punishment \(R_{\text {smooth}}\) for substantial differences in consecutive actions taken by the agent. Furthermore, the vehicle is expected to take the least possible number of actions to cross the intersection. Thus, we consider a negative reward, \(R_{\text {act}}\), for each action it takes. Therefore, putting all these together, the reward function is given by:

$$\begin{aligned} \begin{aligned} R(\textbf{x}_{k},a_k) \,=&\, R_{\text {term}} \, {{{\textbf{1}}}}_{d_{k+1}\le 0 \,\wedge \, \phi _{k+1}\in \{{\text{ G }},\,{\text{ Y }}\}} \,\\&+\,R_{\text {RRL}}\,{{{\textbf{1}}}}_{d_{k+1}\le 0 \,\wedge \, \phi _{k+1}={\text{ R }}}\\&+\,R_{\text {vel}}\,{{{\textbf{1}}}}_{\{v_{k+1}>v_{\text {max}}\}\, \vee \, \{v_{k+1}<v_{\text {min}} \, \wedge \, d_{k+1}>d_{\text {critical}}\}} \,\\&+\,R_{\text {smooth}}\,{{{\textbf{1}}}}_{|a_k-a_{k-1}|>{\text {thr}_{\text {smooth}}}}\,+\,R_{\text {act}}, \end{aligned} \end{aligned}$$
(8)

where \(\wedge \) and \(\vee \) represent logical AND and OR, \(v_{\text {max}}\) and \(v_{\text {min}}\) are the maximum and minimum allowed speeds, and \({\text {thr}}_{\text {smooth}}\) is the smoothness threshold for the difference between consecutive actions.
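A minimal sketch of the reward in Eq. 8 follows, using the reward constants and thresholds reported in the experiments section; the function name and argument layout are illustrative.

```python
# Reward constants and thresholds (values taken from the experiments section)
R_TERM, R_RRL, R_VEL, R_SMOOTH, R_ACT = 100.0, -200.0, -100.0, -10.0, -2.0
V_MIN, V_MAX, D_CRITICAL, THR_SMOOTH = 5.0, 15.0, 10.0, 1.0

def reward(x_next, a, a_prev):
    """Evaluate R(x_k, a_k) of Eq. 8 on the successor state x_{k+1}."""
    d, v, phase, _ = x_next
    r = R_ACT                                      # cost of every action taken
    if d <= 0 and phase in ("G", "Y"):
        r += R_TERM                                # passed on green or yellow
    if d <= 0 and phase == "R":
        r += R_RRL                                 # ran the red light
    if v > V_MAX or (v < V_MIN and d > D_CRITICAL):
        r += R_VEL                                 # speed-limit violation
    if abs(a - a_prev) > THR_SMOOTH:
        r += R_SMOOTH                              # non-smooth acceleration change
    return r
```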

Fig. 2 A depiction of the action policy for the control of an autonomous vehicle in a traffic light environment

2.3 Action Control Policy

After the MDP modeling of the problem, we use the Q-learning framework to approximate the stationary control policy in Eq. 3. Q-learning is a model-free reinforcement learning technique that works by learning an action-value function Q. The Q value is the expected discounted reward for executing action a at state \(\textbf{x}\) and following policy \(\pi \) thereafter [41]. The objective in Q-learning is to estimate the Q values for an optimal policy. For the known state process in Eq. 6, the algorithm progressively estimates the expected reward of a given control strategy by generating several sample trajectories of the system dynamics and their associated rewards. The Q values in Q-learning are learned based on the simulated data generated from the state model. Assume that in the nth episode of the Q-learning algorithm, the system is in state \(\textbf{x}^j\), and by performing a control input, i.e., acceleration a, the system moves to state \(\textbf{x}^i\) in episode \(n+1\). All Q values at time \(n+1\) remain the same as at time n except the Q value corresponding to state \(\textbf{x}^j\) and acceleration a, which is updated as:

$$\begin{aligned} \begin{aligned} Q_{n+1}(\textbf{x}^j,a)\,&=\,\left( 1-\alpha \right) \,Q_{n}(\textbf{x}^j,a)\,\\&+\,\alpha \left( R(\textbf{x}^j,a)+\gamma \,\max _{a'\in \mathcal {A}}Q_{n}(\textbf{x}^i,a')\right) \,, \end{aligned} \end{aligned}$$
(9)

where \(0<\alpha <1\) is the learning rate, and \(0< \gamma < 1\) is the discounting factor. In [41], it has been shown that Q-learning converges to the optimum action-values with probability 1 as long as all actions are repeatedly sampled in all states and the action-values are represented discretely, i.e., \(Q_n(\textbf{x}, a) \rightarrow Q^*(\textbf{x}, a)\), for all \(\textbf{x}\in \mathcal {X}\) and \(a \in \mathcal {A}\) as \(n \rightarrow \infty \). After obtaining the stationary action-values \(Q^*(\textbf{x}, a)\), the autonomous vehicle is capable of decision-making for executing the acceleration at any state \(\textbf{x}_k\) as:

$$\begin{aligned} a_k\,=\,\text {argmax}_{a \in \mathcal {A}} Q^*(\textbf{x}_k, a)\,. \end{aligned}$$
(10)

This action policy, referred to as the “Baseline Action Policy” in the numerical experiments, is shown in Fig. 2.
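The tabular Q-learning update of Eq. 9 and the greedy rule of Eq. 10 can be sketched as follows. The paper does not specify how the continuous state is discretized for a tabular representation, so the dictionary-keyed table below is an illustrative choice, with the learning parameters taken from the experiments section.

```python
from collections import defaultdict

ACTIONS = [-3, -2, -1, 0, 1, 2, 3]      # acceleration set A (m/s^2)
ALPHA, GAMMA = 0.001, 0.95              # learning rate and discount factor

Q = defaultdict(float)                  # Q[(state, action)] -> estimated value

def q_update(x, a, r, x_next):
    """One Q-learning update (Eq. 9)."""
    best_next = max(Q[(x_next, ap)] for ap in ACTIONS)
    Q[(x, a)] += ALPHA * (r + GAMMA * best_next - Q[(x, a)])

def greedy_action(x):
    """Baseline action policy (Eq. 10): argmax_a Q*(x, a)."""
    return max(ACTIONS, key=lambda a: Q[(x, a)])
```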

2.4 Particle-Based Action Selection under Temporally Sparse Partial Observations

The Q-learning algorithm described in the previous subsection provides an action control policy for the autonomous vehicle, given full knowledge about the system state variables at all time steps. However, in reality, an autonomous vehicle can sense only part of the state variables, depending on the technology and the sensors, such as cameras, radars, and lidars used in the vehicle. Furthermore, the observations are often temporally sparse, meaning that the observed state variables may be separated by a few time steps. There are a variety of reasons for the sparsity of data, including sensor limitations and technological limitations in maintaining dense data, failures in communication, high costs of acquiring and processing data, and various obstacles at different times. In order to achieve reliable decision-making in the presence of partial and/or temporally sparse data, we propose a particle-based framework that incorporates the physics of the system as well as the data received from the system to select a control action for the autonomous vehicle in real-time. The proposed framework accounts for the uncertainty in the system dynamics due to the unmodeled or stochastic parts of the system and takes advantage of the data once received to modify the knowledge about the directly observed and even unobserved parts of the system state.

Fig. 3 Point-Based Action Policy

We start the process by drawing N particles from the initial state distribution as:

$$\begin{aligned} \begin{aligned} \{{\textbf{x}}_0^{i}\}_{i=1}^{{N}} \sim p(\textbf{x}_0)&= p(d_0 , v_0 , \phi _0 , t_{{\phi }_0}), \end{aligned} \end{aligned}$$
(11)

where \(p(\textbf{x}_0)\) represents the belief about the distribution of initial state values. We approximate this joint distribution with three independent distributions:

$$\begin{aligned} \begin{aligned} p(\textbf{x}_0) = p(d_0)\, p(v_0)\, p(\phi _0 , t_{{\phi }_0}). \end{aligned} \end{aligned}$$
(12)

The initial distance \(d_0\) is the distance of the vehicle from the intersection at the first instant an observation is received by the vehicle, which depends on the technology used to obtain data.

We denote the minimum and maximum distances from which the vehicle can receive an observation by \(d_{\text {min}}\) and \(d_{\text {max}}\), and assume \(d_0\) is uniformly distributed as:

$$\begin{aligned} \begin{aligned} p(d_0) = \mathcal {U}(d_{\text {min}},d_{\text {max}}). \end{aligned} \end{aligned}$$
(13)

Suppose the initial velocity of the vehicle, \(v_0\), is drawn from a uniform distribution between the minimum and maximum speeds \(v_{\text {min}}\) and \(v_{\text {max}}\), as:

$$\begin{aligned} \begin{aligned} p(v_0) = \mathcal {U}(v_{\text {min}},v_{\text {max}}). \end{aligned} \end{aligned}$$
(14)

The initial phase of the traffic light once an observation is received and, consequently, the initial elapsed time can be drawn with probabilities:

$$\begin{aligned} p({\phi _0},t_{\phi _0})=\left\{ \begin{array}{lll} \frac{T_{{\text{ G }}}}{T_{{\text{ G }}}+T_{{\text{ Y }}}+T_{{\text{ R }}}} &{} \phi _0 = {\text{ G }}\,\, , \,\, t_{\phi _0} \sim \mathcal {U}(0,T_{{\text{ G }}}) \\ \frac{T_{{\text{ Y }}}}{T_{{\text{ G }}}+T_{{\text{ Y }}}+T_{{\text{ R }}}} &{} \phi _0 = {\text{ Y }}\,\, , \,\, t_{\phi _0} \sim \mathcal {U}(0,T_{{\text{ Y }}}) \\ \frac{T_{{\text{ R }}}}{T_{{\text{ G }}}+T_{{\text{ Y }}}+T_{{\text{ R }}}} &{} \phi _0 = {\text{ R }}\,\, , \,\, t_{\phi _0} \sim \mathcal {U}(0,T_{{\text{ R }}}) \end{array}\right. \end{aligned}$$
(15)

Having particles \(\{{\textbf{x}}_0^{i}\}_{i=1}^{{N}}\), we assign equal initial weight, i.e., \(\{{\mathbf {\omega }}_{0}^i\}_{i=1}^{{N}} = \frac{1}{N}\), to each particle to obtain the set \(\{\textbf{x}_{0}^i,{\mathbf {\omega }}_{0}^i\}_{i=1}^{{N}}\). In the following parts, the framework is described with the time notation of k to provide a general representation of the proposed approach.
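A minimal sketch of the particle initialization in Eqs. 11-15 is given below, reusing the hypothetical PHASES and DURATIONS constants from the earlier sketches and the initial-distribution parameters reported in the experiments; all names are illustrative.

```python
import numpy as np

D_MIN, D_MAX = 90.0, 120.0   # range of the initial distance observation (m)
V_MIN, V_MAX = 5.0, 15.0     # range of the initial speed (m/s)

def init_particles(n, rng=None):
    """Draw n equally weighted particles from p(x_0) = p(d_0) p(v_0) p(phi_0, t_phi_0)."""
    rng = rng or np.random.default_rng()
    total = sum(DURATIONS.values())
    particles = []
    for _ in range(n):
        d0 = rng.uniform(D_MIN, D_MAX)                     # Eq. 13
        v0 = rng.uniform(V_MIN, V_MAX)                     # Eq. 14
        phase0 = rng.choice(PHASES, p=[DURATIONS[p] / total for p in PHASES])
        t_phase0 = rng.uniform(0.0, DURATIONS[str(phase0)])  # Eq. 15
        particles.append((d0, v0, str(phase0), t_phase0))
    weights = np.full(n, 1.0 / n)                          # equal initial weights
    return particles, weights
```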

Let \(\textbf{y}_{1:k}=(\textbf{y}_{1},...,\textbf{y}_{k})\) and \(\textbf{a}_{0:k-1}=(a_{0},..., a_{k-1})\) denote the set of partial and/or temporally sparse observations and the accelerations taken up to time k. The size of the observation vector at each time can be equal to (full observation) or smaller than (partial observation) the size of the system state vector, i.e., \(|\textbf{y}_i|\le |\textbf{x}_i|\) for all \(i = 1, \dots , k\), where the case \(|\textbf{y}_j| = 0\) corresponds to \(\textbf{y}_j = \emptyset \), i.e., no observation is received at time step j, as the observations received by the autonomous vehicle can be temporally sparse. Given these actions and observations, \(p(\textbf{x}_{k}\mid \textbf{y}_{1:k},\textbf{a}_{0:k-1})\) represents the probability distribution of the state of the vehicle and the traffic light at time k. Suppose we have N particles and the associated weights at time step k, i.e., \(\{\textbf{x}_{k}^i,{\mathbf {\omega }}_{k}^i\}_{i=1}^{{N}}\), where the weights are normalized and thus sum to one. This set of particles and weights represents the posterior distribution of the state at time k given \(\textbf{y}_{1:k}\) and \(\textbf{a}_{0:k-1}\), as:

$$\begin{aligned} \begin{aligned} p(\textbf{x}_{k} \mid \textbf{y}_{1:k},\textbf{a}_{0:k-1}) \approx \,\sum _{i=1}^{{N}}{\mathbf {\omega }}_{k}^i\,\delta ({\textbf{x}_{k}-\textbf{x}_{k}^i})\,, \end{aligned} \end{aligned}$$
(16)

where \(\delta (.)\) denotes the Dirac delta function. Given these particles and their corresponding weights, we propose two approaches for the selection of the next acceleration to be taken by the autonomous vehicle, as described below.

Point-Based Action Policy

In this approach, we first estimate the current state of the system according to the weighted collection of current particles. Let the estimated state be \({\hat{\textbf{x}}}_{k} = [{\hat{d}}_{k} \,\,\, {\hat{v}}_{k} \,\,\, {\hat{\phi }}_{k} \,\,\, {\hat{t}_{{\phi }_{k}}}]^T\), where the estimated state variables are:

$$\begin{aligned} \begin{aligned} {\hat{d}}_{k} = \,\sum _{i=1}^{{N}}{\mathbf {\omega }}_{k}^i\,d_{k}^i\,, \end{aligned} \end{aligned}$$
(17)
$$\begin{aligned} \begin{aligned} {\hat{v}}_{k} = \,\sum _{i=1}^{{N}}{\mathbf {\omega }}_{k}^i\,v_{k}^i\,, \end{aligned} \end{aligned}$$
(18)
$$\begin{aligned} \begin{aligned} {\hat{\phi }}_{k} = \text {argmax}_{\phi \in \{{\text{ G }}, {\text{ Y }}, {\text{ R }}\}} \sum _{i=1}^{{N}}{\mathbf {\omega }}_{k}^i\,{\textbf{1}}_{\phi _{k}^i = \phi }\,, \end{aligned} \end{aligned}$$
(19)
$$\begin{aligned} \begin{aligned} {\hat{t}_{{\phi }_{k}}} = \,\sum _{i=1}^{N}{\mathbf {\omega }}_{k}^i\,{t}_{{\phi }_{k}}^i\,{\textbf{1}}_{\phi _{k}^i = {\hat{\phi }}_{k}}\,, \end{aligned} \end{aligned}$$
(20)

where \({\textbf{1}}\) is the indicator function. Based on the estimated state, the autonomous vehicle determines its next acceleration as follows:

$$\begin{aligned} a_{k}\,=\,\text {argmax}_{a \in \mathcal {A}} Q^*({\hat{\textbf{x}}}_{k}, a)\,. \end{aligned}$$
(21)

The point-based action policy is demonstrated in Fig. 3. The decision made for action selection in this approach is based on the single estimate of the state that might not be a precise indicator of \(p(\textbf{x}_{k}\mid \textbf{y}_{1:k},\textbf{a}_{0:k-1})\). In order to account for the whole distribution, in the approach presented next, we consider the posterior distribution of \(\textbf{x}_{k}\) instead of a single estimate \({\hat{\textbf{x}}}_{k}\) in the action selection policy.
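A sketch of the point-based policy (Eqs. 17-21): the weighted particles are collapsed into a single state estimate, which is then passed to the greedy rule of Eq. 10. It reuses the hypothetical PHASES constant and greedy_action helper from the earlier sketches; in practice the continuous estimate would have to be discretized to index a tabular Q.

```python
def point_based_action(particles, weights):
    """Select a_k = argmax_a Q*(x_hat_k, a) from a weighted particle set."""
    d_hat = sum(w * p[0] for p, w in zip(particles, weights))      # Eq. 17
    v_hat = sum(w * p[1] for p, w in zip(particles, weights))      # Eq. 18
    # Eq. 19: most probable traffic-light phase under the particle weights
    phase_mass = {ph: 0.0 for ph in PHASES}
    for p, w in zip(particles, weights):
        phase_mass[p[2]] += w
    phase_hat = max(phase_mass, key=phase_mass.get)
    # Eq. 20: weighted elapsed time over the particles in the estimated phase
    t_hat = sum(w * p[3] for p, w in zip(particles, weights) if p[2] == phase_hat)
    x_hat = (d_hat, v_hat, phase_hat, t_hat)
    return greedy_action(x_hat)                                    # Eq. 21
```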

Fig. 4 Distribution-Based Action Policy

Distribution-Based Action Policy

In this approach, we propose to use the particles drawn from the system state distribution for the decision-making regarding the action selection, rather than relying on a single estimated state, which may not be a good representative of the state distribution. In fact, for sharply peaked state distributions, the particles are all close to the mean (estimated state), and decisions based on the mean or on the full distribution should be similar. On the other hand, for less peaked state distributions, the estimated state is only one possible realization of the system state, whereas the real system state might differ from it and, in general, could be any particle with a positive weight in the representation of the state distribution. To account for the entire state distribution, we propose a Bayesian approach that selects the action with the highest expected Q-value as:

$$\begin{aligned} a_{k}\,=\,\text {argmax}_{a \in \mathcal {A}} {\mathbb {E}}[Q^*(\textbf{x}_{k}, a)]\, \approx \,\text {argmax}_{a \in \mathcal {A}} \sum _{i=1}^{{N}}{\mathbf {\omega }}_{k}^i\,Q^*(\textbf{x}_{k}^i, a)\,, \end{aligned}$$
(22)

where the expectation is with respect to the posterior distribution of the state and is approximated using the current particles and their associated weights. Figure 4 represents the distribution-based action policy.
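A sketch of the distribution-based policy (Eq. 22), which averages the Q-values over the weighted particles instead of collapsing them into a point estimate; Q and ACTIONS are the hypothetical tabular objects from the earlier Q-learning sketch.

```python
def distribution_based_action(particles, weights):
    """Select a_k = argmax_a sum_i w_k^i Q*(x_k^i, a)  (Eq. 22)."""
    def expected_q(a):
        return sum(w * Q[(p, a)] for p, w in zip(particles, weights))
    return max(ACTIONS, key=expected_q)
```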

After making the decision regarding the next acceleration \(a_{k}\) according to the point-based action policy in Eq. 21 or the distribution-based action policy in Eq. 22, we use the system dynamics and the data received, if any, for the control of the autonomous vehicle at time step \(k+1\). Two scenarios can occur at time \(k+1\): either no data is received, or a partial observation is received by the vehicle. For each scenario, we describe in the following parts our proposed approach for obtaining \(\{\textbf{x}_{k+1}^i,{\mathbf {\omega }}_{k+1}^i\}_{i=1}^{{N}}\) and updating the posterior distribution of the system state at time \(k+1\), i.e., \(p(\textbf{x}_{k+1}\mid \textbf{y}_{1:k+1},\textbf{a}_{0:k})\).

No Data

In case no data is received by the vehicle at time step \(k+1\) due to the sparsity of observational data, the information obtained from observations remains the same, as \(\textbf{y}_{k+1} = \emptyset \). Therefore, the posterior distribution of the system state at time \(k+1\) can be expressed as:

$$\begin{aligned} \begin{aligned} p(\textbf{x}_{k+1}\mid \textbf{y}_{1:k+1},\textbf{a}_{0:k})\,&=\,p(\textbf{x}_{k+1}\mid \textbf{y}_{1:k},\textbf{a}_{0:k})\,\\&=\,\int p(\textbf{x}_{k+1}\mid \textbf{x}_{k},a_{k})\,p(\textbf{x}_{k}\mid \textbf{y}_{1:k},\textbf{a}_{0:k-1})\,d\textbf{x}_{k}\,. \end{aligned} \end{aligned}$$
(23)

We obtain this distribution by passing the particles from time step k, i.e., \(\{\textbf{x}_{k}^i\}_{i=1}^{{N}}\), and acceleration \(a_{k}\) through the state process presented in Eq. 6 to get the predictive particles at time \(k+1\), i.e., \(\{\textbf{x}_{k+1}^i\}_{i=1}^{{N}}\). Since there is no new data/information about the state of the system other than the system's physics that we took into account, the weights at this time remain unchanged, meaning that \(\{{\mathbf {\omega }}_{k+1}^i\}_{i=1}^{{N}} = \{{\mathbf {\omega }}_{k}^i\}_{i=1}^{{N}}\). Therefore, we obtain the particles and their associated weights at time \(k+1\), i.e., \(\{\textbf{x}_{k+1}^i,{\mathbf {\omega }}_{k+1}^i\}_{i=1}^{{N}}\), that represent the posterior distribution of the state at time \(k+1\).
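A sketch of this prediction step: every particle is propagated through the stochastic transition of Eq. 6 (the step_state helper sketched earlier) and the weights are left unchanged.

```python
def predict_particles(particles, weights, a, dt=1.0, rng=None):
    """Propagate particles through p(x_{k+1} | x_k, a_k); weights are unchanged."""
    new_particles = [step_state(p, a, dt=dt, rng=rng) for p in particles]
    return new_particles, weights
```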

Partial Data

In case an observation is received by the vehicle at time \(k+1\), the collection of obtained data gets updated to \(\textbf{y}_{1:k+1}=(\textbf{y}_{1},...,\textbf{y}_{k+1})\). Given the next observation \(\textbf{y}_{k+1}\) and the vehicle’s acceleration \(a_{k}\) obtained from Eqs. 21 or 22, we incorporate the physics of the system with the observation received by the vehicle to estimate the state of the system at time \(k+1\). Having the particles at time k and passing these particles through the system dynamics to obtain the state particles at time \(k+1\), as described in the previous part, provides an uncertain estimate of the system state due to unmodeled/stochastic parts of the system dynamics. On the other hand, the new observation provides information about only a part of the system state. Therefore, the incorporation of system dynamics and the data not only enhances the estimation accuracy that can be achieved only from the physics but also enables estimating the other parts of the system state that are not included in the data. To achieve this, we use Bayes’ theorem to estimate the state at time step \(k+1\) as:

$$\begin{aligned} \begin{aligned} p(\textbf{x}_{k+1}\mid \textbf{y}_{1:k+1},\textbf{a}_{0:k})&={\frac{p(\textbf{y}_{k+1},\textbf{x}_{k+1}\mid \textbf{y}_{1:k},\textbf{a}_{0:k})}{p(\textbf{y}_{k+1}\mid \textbf{y}_{1:k},\textbf{a}_{0:k})}}\\&=\frac{p( \textbf{y}_{k+1}\mid \textbf{x}_{k+1})\,p(\textbf{x}_{k+1}\mid \textbf{y} _{1:k},\textbf{a}_{0:k})}{p(\textbf{y}_{k+1}\mid \textbf{y}_{1:k},\textbf{a}_{0:k})}. \end{aligned} \end{aligned}$$
(24)

According to the Chapman-Kolmogorov equation under the Markovian property of the system model, we have:

$$\begin{aligned} \begin{aligned} p(\textbf{x}_{k+1}&\mid \textbf{y}_{1:k},\textbf{a}_{0:k}) \\&= \int p(\textbf{x}_{k+1}\mid \textbf{x}_{k},a_{k})\, p(\textbf{x}_{k}\mid \textbf{y}_{1:k},\textbf{a}_{0:k-1})d\textbf{x}_{k}. \end{aligned} \end{aligned}$$
(25)

The denominator in Eq. 24 is a normalizing constant ensuring that the posterior distribution is a valid probability distribution. Therefore, the posterior density function of the state at time step \(k+1\) given all data and vehicle accelerations, according to the recursive Bayesian state estimation in Eq. 24, can be derived as:

$$\begin{aligned} \begin{aligned}&p(\textbf{x}_{k+1} \mid \textbf{y}_{1:k+1},\textbf{a}_{0:k})\,\propto \\&\,p(\textbf{y}_{k+1} \mid \textbf{x}_{k+1})\,p(\textbf{x}_{k+1} \mid \textbf{x}_{k},a_{k})p(\textbf{x} _{k} \mid \textbf{y}_{1:k},\textbf{a}_{0:k-1}). \end{aligned} \end{aligned}$$
(26)

In order to approximate this distribution, we first predict \(p(\textbf{x}_{k+1} \mid \textbf{x}_{k}, a_{k})\) by passing the N particles at time k and acceleration \(a_{k}\) through the state process presented in Eq. 6 to obtain \(\{\textbf{x}_{k+1}^i\}_{i=1}^{{N}}\). Given \(p(\textbf{x}_{k} \mid \textbf{y}_{1:k},\textbf{a}_{0:k-1})\) in Eq. 16, we obtain:

$$\begin{aligned} \begin{aligned} p(\textbf{x}_{k+1} \mid \,&\textbf{y}_{1:k+1},\textbf{a}_{0:k})\\&\approx \,\sum _{i=1}^{{N}}p(\textbf{y}_{k+1} \mid \textbf{x}_{k+1}^i)\,{\mathbf {\omega }}_{k}^i\,\delta ({\textbf{x}_{k+1}-\textbf{x}_{k+1}^i})\,, \end{aligned} \end{aligned}$$
(27)

where the likelihood \(p(\textbf{y}_{k+1}\mid \textbf{x}_{k+1}^i)\) representing the conditional probability of the partial observational data \(\textbf{y}_{k+1}\) given each particle can be obtained using \(\textbf{f}(\textbf{x}_{k}^i,a_{k})\) and the statistics of the state noises in Eq. 7 according to the observed state variables contained in \(\textbf{y}_{k+1}\). Therefore, this likelihood can be computed as \(p(\textbf{y}_{k+1}\mid \textbf{f}(\textbf{x}_{k}^i,a_{k}))\). Consequently, the unnormalized weight associated with each particle \(\textbf{x}_{k+1}^i\) can be calculated as:

$$\begin{aligned} \begin{aligned} {\tilde{\mathbf {\omega }}}_{k+1}^i\,\propto \,p\left( \textbf{y}_{k+1}\mid \textbf{f}(\textbf{x}_{k}^i,a_{k})\right) \,{\mathbf {\omega }}_{k}^i. \end{aligned} \end{aligned}$$
(28)

The weights are then normalized as follows:

$$\begin{aligned} \begin{aligned} {\mathbf {\omega }}_{k+1}^i = \frac{{\tilde{\mathbf {\omega }}}_{k+1}^i}{\sum _{j = 1}^N {\tilde{\mathbf {\omega }}}_{k+1}^j}. \end{aligned} \end{aligned}$$
(29)

As the values of some of the state variables at time \(k+1\) are directly observed in the last data received, we replace the state variables in particles \(\{\textbf{x}_{k+1}^i\}_{i=1}^{{N}}\) with their known values in \(\textbf{y}_{k+1}\). Finally, we obtain the particles and their associated weights at time step \(k+1\), i.e., \(\{\textbf{x}_{k+1}^i,{\mathbf {\omega }}_{k+1}^i\}_{i=1}^{{N}}\).
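The following sketch illustrates the update step of Eqs. 27-29 under the assumption that the observed variables are the distance and the traffic-light phase (one of the observation sets considered in the experiments), with a Gaussian likelihood on the distance whose width matches the state-noise scale. For simplicity, the likelihood is evaluated on the propagated particles rather than on \(\textbf{f}(\textbf{x}_k^i,a_k)\) explicitly, and the observed components are then overwritten with their measured values as described above; all names and the phase-mismatch floor are illustrative.

```python
from math import exp, sqrt, pi

SIGMA_D = 0.4     # width of the distance likelihood (state-noise scale, illustrative)
PHASE_EPS = 1e-3  # small floor so a phase mismatch does not zero out a weight

def gaussian_pdf(x, sigma):
    return exp(-0.5 * (x / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def update_particles(particles, weights, y_d, y_phase):
    """Reweight propagated particles given a partial observation y = (d, phi)."""
    new_particles, unnorm = [], []
    for p, w in zip(particles, weights):
        d, v, phase, t_phase = p
        like = gaussian_pdf(y_d - d, SIGMA_D)             # distance likelihood
        like *= 1.0 if phase == y_phase else PHASE_EPS    # phase likelihood
        unnorm.append(like * w)                           # Eq. 28
        new_particles.append((y_d, v, y_phase, t_phase))  # impose observed values
    total = sum(unnorm) or 1.0
    return new_particles, [u / total for u in unnorm]     # Eq. 29
```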

The set of weighted particles, \(\{\textbf{x}_{k+1}^i,{\mathbf {\omega }}_{k+1}^i\}_{i=1}^{{N}}\), provides the necessary requirements to determine the next acceleration, \(a_{k+1}\), to be taken by the autonomous vehicle using the proposed point-based or distribution-based action policies in Eqs. 21 and 22. Upon taking the next action, the system moves to the new state, which is estimated based on only the system dynamics (if no new data arrives) or the incorporation of both the system dynamics and the new data (if new data arrives). The above procedure, demonstrated in Fig. 5, is repeated until the vehicle reaches the intersection.
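Tying the pieces together, the loop below sketches the overall procedure of Fig. 5, reusing the hypothetical helpers from the previous sketches; env stands for an assumed simulator interface exposing done(), apply(a), and observe(), and is not part of the paper.

```python
def control_loop(env, t_s=3, n_particles=10_000, policy="distribution"):
    """End-to-end control of the vehicle until it reaches the intersection."""
    particles, weights = init_particles(n_particles)
    select = distribution_based_action if policy == "distribution" else point_based_action
    k = 0
    while not env.done():
        a = select(particles, weights)                    # Eq. 21 or Eq. 22
        env.apply(a)                                      # vehicle executes a_k
        particles, weights = predict_particles(particles, weights, a)
        if (k + 1) % t_s == 0:                            # temporally sparse data
            obs = env.observe()                           # partial observation (d, phi)
            if obs is not None:
                y_d, y_phase = obs
                particles, weights = update_particles(particles, weights, y_d, y_phase)
        k += 1
```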

Fig. 5 Flowchart of the proposed framework

Fig. 6 Transition of the state variables in one single trajectory with the distribution-based action policy and \(t_s=3\,s\)

3 Numerical Experiments

In this section, we evaluate the performance of the proposed physics-informed particle-based reinforcement learning policies through several analyses. We use a simulation environment that emulates a signalized intersection and a vehicle approaching it. The simulation environment considers the distance of the vehicle from the intersection, the velocity of the vehicle, the phase of the traffic light, and the time elapsed since the last phase transition of the traffic light as the system state. In all the experiments, the values associated with the variables in the model are \(T_{\textit{G}} = 10\,s\), \(T_{\textit{Y}} = 5\,s\), \(T_{\textit{R}} = 10\,s\), \(\varDelta t = 1\,s\) and \(a \in \mathcal {A}=\{-3,-2,-1,0,1,2,3\}\). The Q-learning parameters are set to \(\alpha =0.001\), \(\gamma =0.95\), \(R_{\text {term}}=100\), \(R_\text {RRL}=-200\), \(R_\text {vel}=-100\), \(d_{\text {critical}}=10\), \(R_\text {smooth}=-10\), \(\text {thr}_\text {smooth}=1\), and \(R_{\text {act}}=-2\). We consider the number of particles to be \(N=10,000\), and the parameters of the initial distributions of particles are set to \(d_\text {min}=90\,m\), \(d_\text {max}=120\,m\), \(v_\text {min}=5\,m/s\) and \(v_\text {max}=15\,m/s\). We consider \(p_d = \mathcal {N}(0,0.4)\), \(p_v = \mathcal {N}(0,0.1)\) and \(p_\phi =\mathcal {N}(0,0.01)\) as the state transition uncertainties.

Fig. 7 Average rewards over time in 100 randomly generated trajectories, obtained by the three action policies: baseline, distribution-based, and point-based. Each row corresponds to the set of state variables the agent observes, and each column shows the level of data sparsity. In all plots, the x-axis represents time and the y-axis the average reward

To assess the performance of our proposed framework in scenarios where observations are sparse or intermittently available, we introduce a sparsity indicator denoted as \(t_s\), which indicates how frequently measurements of the state variables are obtained. Specifically, \(t_s=1\) signifies that measurements of the state variables are observed every second, i.e., a continuous stream of observations with no gaps. On the other hand, when \(t_s\) is set to a value greater than 1, such as \(t_s=3\), 5, or 7, observations are received at longer time intervals. For example, with \(t_s=3\), measurements are collected every 3 seconds, and similarly for \(t_s=5\) and 7. In these cases, there are gaps between consecutive observations, leading to sparser data compared to the \(t_s=1\) case. By considering different values of \(t_s\), we can investigate the impact of observation sparsity and loss at certain time instances on the performance and effectiveness of our framework. This allows us to explore the robustness and adaptability of our approach in scenarios where observations are not available at every time step.

Single Trajectory Analysis

The primary objective of this paper is to develop an algorithm capable of effectively managing the actions of an autonomous vehicle approaching an intersection, specifically with the goal of preventing the vehicle from running a red light signal. To demonstrate the successful achievement of this aim, we present a single trajectory in Fig. 6. In this scenario, the vehicle with the initial distance of \(106.7\,m\) and velocity of \(10.8\,m/s\) is approaching a green traffic light with \(t_\phi =4\,s\). The actions are selected according to the proposed distribution-based action policy with the assumption that the state variables of d and \(\phi \) are observed every three seconds, i.e., \(t_s=3\). The distance and velocity plots are accompanied by a color-coded background representing the traffic light phase, while the acceleration of the vehicle at each time step is annotated on the velocity plot.

We initiate the analysis at the specific time instant \(t=6\,s\): the traffic light changes to the yellow phase, and the vehicle is positioned at a distance of \(d_6\) with a velocity of \(v_6\). As time progresses to \(t=11\,s\), the traffic light changes to red and the vehicle approaches the traffic light, reducing its velocity to \(v_{11}\) while maintaining a distance of \(d_{11}\). During the red phase, the vehicle reaches the vicinity of the traffic light and comes to a halt just before the intersection, waiting for a duration of \(10\,s\) until the traffic light shifts to green. Subsequently, the vehicle increases its acceleration and speed, allowing it to pass through the intersection once the light turns green. It is important to note that the agent's behavior may vary in similar trajectories due to the stochastic nature of the system, including the stochasticity in the state transitions, observations, and state estimation.

Average Reward Analysis

We compare the proposed distribution-based and point-based action policies with a baseline scenario called the “Baseline Action Policy”. In the baseline action policy, we assume that all the state variables are fully observable at each time step and the actions are selected according to Eq. 10. In this analysis, 100 independent trajectories, each representing the vehicle's path toward the traffic light, are considered. At each time step along these trajectories, the vehicle's state variables and selected actions determine the reward received, as defined by the reward function in Eq. 8. Figure 7 illustrates the average rewards across all 100 trajectories over time. The x-axis represents the time step, extended to the maximum length of all trajectories for comprehensive comparison, and the y-axis demonstrates the average obtained rewards. Each row in this figure corresponds to a specific set of observed state variables, while each column represents a level of data sparsity.

Fig. 8 RRL (running red light) count for the distribution-based and point-based action policies for different sets of observed state variables and sparsity levels

The baseline action policy, relying solely on the true states at all time steps for action selection, remains unaffected by both the observed state variables and the sparsity level. However, in the distribution-based and point-based action policies, where the selection of actions depends on observations, the influence of observed state variables and data sparsity on average rewards becomes apparent. In each column of the figure, the average rewards decrease from top to bottom for both the distribution-based and point-based action policies, reflecting the impact of missing measurements of state variables. Additionally, within each row of the figure, an overall decreasing trend in average rewards can be observed as the sparsity of observations increases for both action policies.

Another notable observation from this figure is that, in most cases, the distribution-based action policy yields higher rewards compared to the point-based action policy. This observation demonstrates that the distribution-based action policy exhibits greater robustness against uncertainty and sparsity in the observations. Overall, as depicted in this figure, the average reward increases over time, indicating a growing number of vehicles safely passing through the intersection and earning positive rewards.

Safety Analysis

To evaluate the safety implications of our proposed framework, we employ two metrics: the frequency of the vehicle running a red light, referred to as running red light (RRL), and the velocity at which the vehicle crosses the red light. From a safety perspective, if the vehicle does run the red light, it is preferable that it does so at a lower speed, as this reduces the likelihood of severe accidents and potential harm to both the vehicle itself and other vehicles sharing the road.

To illustrate these metrics, Fig. 8 shows the RRL counts for the proposed distribution-based and point-based action policies. The results encompass 100 randomly generated and independent trajectories, considering various scenarios involving the observation of different state variables and varying levels of sparsity. With the baseline action policy in the first row, there is no RRL, meaning that the vehicle never passes a red traffic light when it observes all the state variables, even with different levels of sparsity. The analysis of the figure reveals an increasing trend in RRL counts as sparsity and loss of information in the observations intensify. The RRL counts increase in the second row, where \(t_\phi \) is not observed, emphasizing the significance of observing the traffic light elapsed time in making safer decisions at intersections, especially in the critical situation of approaching a red light. The RRL counts increase further in the third and fourth rows, where only two state variables are observed. This finding suggests that as the available information becomes scarcer, the likelihood of the vehicle running a red light rises. Moreover, the figure demonstrates that the distribution-based action policy consistently exhibits a lower RRL count compared to the point-based action policy. This observation provides additional support for the previously mentioned hypothesis that the distribution-based action policy is more robust to missing information and sparsity than the point-based action policy.

Fig. 9 The vehicle velocity in cases where the vehicle passes the red light, for different degrees of data sparsity (\(t_s\)) and different observed data, using the point-based and distribution-based action policies

Based on the findings presented in Fig. 8, it can be observed that even under the worst-case conditions, applying the point-based action policy in the proposed framework leads to a safe passage of the vehicle through the traffic light in more than \(78\%\) of cases. This rises to more than \(88\%\) when the proposed distribution-based action policy is applied, even though observations are received only every \(7\,s\) and a substantial amount of information is missing from the observed state variables. These outcomes demonstrate the effectiveness and reliability of the proposed framework in ensuring safe and acceptable performance.

Figure 9 displays the average velocity of the vehicle during the RRL cases that have been presented in Fig. 8. Consistent with the observed trend in the previous figure regarding RRL count, a similar pattern is seen here. Specifically, when employing the distribution-based action policy, the vehicle exhibits lower velocities while crossing the red light compared to when the point-based action policy is employed. This finding reinforces the notion that the distribution-based action policy offers a safer approach for action selection. By maintaining lower velocities during the red light violations, the distribution-based action policy demonstrates a greater emphasis on safety considerations. Consequently, it is recommended to prioritize the adoption of the distribution-based action policy for the purpose of enhancing overall safety.

Table 1 Action similarity (\(\text {sim}_a\)) and average estimation error (\({\bar{e}_.}\)) for different degrees of data sparsity (\(t_s\)) and different observed data using the point-based action policy

Action Similarity Analysis

In this analysis, we present a metric denoted as “action similarity”, alongside the average error in the estimation of the state variables when employing the point-based action policy, under varying degrees of sparsity and observation incompleteness in Table 1. The term “action similarity” denotes the frequency with which the action chosen using the point-based action policy matches the baseline action that would have been selected based on the true state at a given time step. In this table, “\(\text{ sim}_a\)” refers to the average number of action similarities over 100 independent trajectories. The overall state-estimation error over the 100 trajectories is denoted by \({\bar{e}}\) and reported as the set of values \(({\bar{e}_d}, {\bar{e}_v}, {\bar{e}}_\phi , {\bar{e}}_{t_\phi })\). \({\bar{e}_d}\), \({\bar{e}_v}\), and \({\bar{e}}_{t_\phi }\) indicate the discrepancies between the estimates and the actual values of the distance, velocity, and time elapsed since the last phase change, measured in terms of the mean absolute error (MAE). \({\bar{e}}_\phi \) measures how often the estimated traffic light phase differs from the actual phase, on average. Evidently, for each distinct set of observations, the action similarity decreases as \(t_s\) increases. Concurrently, the estimation errors of the state variables increase along each row. Furthermore, when more state variables are missing from the observations, the action similarity for the same \(t_s\) decreases.

4 Conclusion

This paper presents a framework for the effective control of autonomous vehicles within signalized intersections by integrating system dynamics with imperfect sensor data. By employing a physics-based Markov decision process (MDP) model and leveraging partial sensor data, the methodology reduces the uncertainty associated with the measurable parts of the system model and facilitates the estimation of unmeasurable variables. Through the proposed particle-based reinforcement learning action policies, including the “Point-Based” and “Distribution-Based” strategies, the framework not only adapts to uncertainty but also showcases the synergy between data-driven insights and fundamental physics, ensuring safe control and decision-making in various scenarios. Numerical experiments confirm the effectiveness of the proposed framework, particularly highlighting the correlation between increased missing state variables, data sparsity, and the frequency of running red lights (RRL). The distribution-based action policy exhibits greater reliability in handling missing variables and data sparsity, emphasizing the importance of accurate state estimation for promoting safety. Specifically, results from the experiments indicate that even under the worst-case conditions, applying the point-based action policy in the proposed framework leads to a safe passage of the vehicle through the traffic light in more than \(78\%\) of instances, and this rises to more than \(88\%\) when the proposed distribution-based action policy is applied, despite observations being received with substantial sparsity and a considerable amount of missing information in the observed state variables. Furthermore, the relatively low average velocity of the vehicle at the time instances when it violates the red light significantly reduces the risk of fatal accidents, further emphasizing the safety measures incorporated within the proposed framework.