1 Introduction

Hydrogen fuel cells, recognized for their high power density, rapid start-up, environmental friendliness, and exceptional energy conversion efficiency, have attracted substantial global investments in research and industrial development [1]. Their potential extends to enhancing societal well-being and fostering economic stability at a national level.

For proton exchange membrane fuel cells (PEMFCs), effective water management emerges as a critical factor influencing normal operation and overall efficiency. Maintaining an optimal water balance is essential to prevent issues such as flooding and dehydration that can compromise PEMFC performance. Excessive water accumulating in the flow channels or at the cathode and anode risks flooding, while insufficient water dries the membrane and hinders proper operation [2]. These challenges underscore the pivotal role of water management in the performance of PEMFCs [3].

The conventional model-based approach to water management primarily concentrates on formulating mathematical models that simulate the internal water migration within fuel cells. In [4], a multiple-input multiple-output fuzzy controller is developed and demonstrated for real-time water and thermal management of fuel cell systems; the results show that the proposed fuzzy controller effectively increases the output power of the PEM fuel cell. In [5], a 3D multi-phase model based on the Eulerian-Eulerian approach is presented for water management in the PEMFC system. In [6], an active disturbance rejection control strategy is used to balance the humidity of the PEMFC.

To enhance dynamic water management, [7] introduces a physics-based PEMFC model that considers the localized impact of humidity on fuel cell stack performance. In [8], proportional-integral and active disturbance rejection control strategies are devised for air supply and coolant flow in PEMFCs. In [9], a fractional-order PID dynamic control strategy is introduced to balance the membrane humidity of the PEMFC.

Constructing physical models for water management in fuel cells poses considerable challenges, prompting the exploration of alternative approaches such as neural network algorithms. In [10], an artificial neural network is leveraged to emulate the power output of PEMFCs. Reference [11] introduces a dynamic neural network-based maximum power point tracking (MPPT) control methodology tailored for fuel cell applications. Furthermore, in [12], convolutional neural networks are employed to quantify liquid water content within PEM fuel cells.

The aforementioned studies primarily focus on the modeling of water management processes, which has posed significant challenges. It is evident that an inaccurate water management strategy can result in suboptimal performance, potentially compromising the overall efficiency and reliability of PEMFCs. Indeed, improper water management can lead to issues such as flooding or drying within PEMFCs.

To adapt to the dynamic and complex environments inherent in PEMFCs, model-free methods have been introduced to deal with the water management problem. An actor-critic-based model-free reinforcement learning approach is proposed for water management of the PEM fuel cell in [13]. The simulation results showcase the effectiveness of this approach in maintaining water balance within the fuel cell stack, thereby maximizing the stack voltage.

The advancement of reinforcement learning algorithms has significantly contributed to the enhancement of control strategies for PEMFCs. In [14], a multi-objective optimal fractional-order proportional-integral-derivative controller is proposed for PEMFCs; in addition, a novel large-scale deep reinforcement learning scheme is designed to ensure optimal and comprehensive control performance. In [15], an evolutionary curriculum imitation large-scale deep reinforcement learning algorithm is proposed to improve the robustness of the control strategy of PEMFCs.

Therefore, a model-free reinforcement learning method is proposed for water management in PEMFCs. First, to address the challenge of continuous action spaces, actor-critic-based reinforcement learning is employed. Then, the integration of deep neural networks into reinforcement learning is achieved through the utilization of experience replay buffers and four neural networks, leading to the formulation of the Deep Deterministic Policy Gradient (DDPG). Finally, a prioritized experience replay mechanism is incorporated, resulting in the development of the prioritized DDPG method for water management.

The proposed model-free water management system enables adjustments to the hydrogen gas inlet pressure, thereby regulating water content within the PEMFC based on observed stack current and voltage. This adjustment ensures the efficient operation of PEMFCs. The primary contribution lies in the introduction of the prioritized DDPG method, providing an effective solution to the complex water management challenges encountered in PEMFCs.

The structure of this paper is organized as follows: Sect. 2 gives the preliminary theory for reinforcement learning. The water management problem for PEMFCs is formulated in Sect. 3. Section 4 describes the proposed prioritized DDPG water management method. The simulation results are presented and analysed in Sect. 5. Finally, Sect. 6 concludes this paper.

2 Reinforcement learning

Reinforcement learning is a goal-driven learning method that acquires knowledge through interaction with the environment [16]. The learner is the agent, and everything it interacts with is the environment. The agent continuously updates its knowledge through trial-and-error learning while interacting with the environment.

At each discrete time step denoted as t, the agent selects its action, denoted as At, based on the observed state from the environment, referred to as St. The state St exhibits the Markov properties, indicating that future states depend only on the current state of the process and are independent of the past. The chosen action At leads to a transformation of the environment state St. In response to this change, the agent receives an immediate reward rt from the environment.

The core of reinforcement learning is the policy π(At|St), which represents the action selection based on the states. The objective of the agent is to optimize the cumulative reward over an entire episode. At each time step within an episode, the agent aims to choose the optimal action from the set of possible actions based on its policy.

To determine the optimal policy for an agent, there are two fundamental challenges in reinforcement learning [17]. The first challenge is how to evaluate the long-term rewards of a sequence of actions. Long-term rewards remain unknown until the end of the episode. Furthermore, in practical situations, environmental information is often partial or incomplete. Only experience is gained after engaging in trial-and-error learning with the environment, consisting of a sequence of states, actions, and rewards. Consequently, extracting valuable information from experience is critical for reinforcement learning.

The second challenge is the dilemma between exploration and exploitation. If the agent only focuses on maximizing its current immediate reward, it may achieve only short-term gains. However, if the agent only explores for potential future reward, it could miss out on maximizing the immediate reward. Hence, the balance between exploration and exploitation is also important for reinforcement learning.

3 Water management problem formulation

Proton exchange membrane fuel cells (PEMFCs) are a promising and environmentally friendly energy device that converts hydrogen gas and oxygen into electricity. A PEMFC consists of an anode, a cathode, and a proton exchange membrane (PEM) sandwiched between them. As shown in Fig. 1, hydrogen gas is introduced to the anode and dissociated into protons H+ and electrons e−. Simultaneously, oxygen is supplied to the cathode for the reduction reaction, which involves the acceptance of electrons. Protons pass through the PEM, while electrons are directed through an external circuit, creating an electric potential difference across the cell. At the cathode, protons and electrons combine with oxygen to produce water H2O, releasing energy in the process. This reaction results in the formation of water as the only byproduct, making PEMFCs an environmentally friendly energy source.

Fig. 1 Different water transfer processes in the electrode of a proton exchange membrane fuel cell

Water management in PEMFCs involves regulating the water content to maintain their optimal operation. This regulation is necessary to ensure the right moisture levels in different fuel cell components, including PEM, electrodes, and gas diffusion layers.

Firstly, it is crucial to maintain adequate moisture levels in the PEM and catalyst layers. This moisture facilitates proton movement and the chemical reactions between hydrogen and oxygen. Maintaining the correct humidity levels of the reactant gases, such as hydrogen and oxygen, is essential for achieving the desired moisture level within the cell. Dry gases can lead to PEM dehydration, while excessively humid gases can result in flooding.

Secondly, water is produced as a byproduct of the chemical reactions. If this water accumulates and is not removed effectively, it can block gas flow through the porous components, hindering the transport of reactants to the catalysts. This can result in decreased performance and potential cell damage.

Therefore, water management is vital because both excessive and insufficient water can harm the fuel cell's performance.

In this paper, the water management problem in PEMFCs is modeled as a Markov decision process (s, a, T, r, γ). The environmental state is represented as s = (I, V), comprising the stack current I and the stack voltage V of the PEMFC. Based on the observed environmental state, a corresponding action a can be taken to adjust the hydrogen gas inlet pressure, thereby altering the water content within the PEMFC. The state transition probability is 1 in this paper. The immediate reward that the agent receives after taking an action in a given state is denoted as r. The reward r is defined in terms of the squared difference between the target and the actual power output of the PEMFC, serving as a metric for tracking stability. The discount factor for future rewards is represented by γ.
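For concreteness, the sketch below frames this MDP as a minimal Python environment. The class name, the placeholder plant response, and the sign convention (negative squared deviation, so that maximizing the return drives the power toward its target) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class PEMFCWaterEnv:
    """Minimal sketch of the water-management MDP (s, a, T, r, gamma)."""

    def __init__(self, target_power=0.75, gamma=0.99):
        self.target_power = target_power   # target stack power in W (assumed)
        self.gamma = gamma                 # discount factor
        self.state = np.array([0.0, 0.0])  # observed state s = (I, V)

    def step(self, pressure_action):
        # In the real experiment the next (I, V) comes from the PEMFC test
        # bench; here a hypothetical sensor read-out stands in for it.
        I, V = self._read_stack_sensors(pressure_action)
        self.state = np.array([I, V])
        # Reward based on the squared deviation of the stack power from the
        # target; the negative sign is an assumption so that a larger reward
        # means better tracking.
        reward = -(self.target_power - I * V) ** 2
        return self.state, reward

    def _read_stack_sensors(self, pressure_action):
        # Hypothetical stand-in for the measurement interface of the test bench.
        return 1.0 + 0.01 * float(pressure_action), 0.7
```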

The policy π is the strategy the agent follows to maximize the cumulative reward for a given state, and the return is the sum of discounted rewards from t = 1 to t = T,

$$G(\pi ) = r_{1} + \gamma r_{2} + ... + \gamma^{T - 1} r_{T}$$
(1)

If discount factor γ = 0, the agent only focuses on immediate reward while disregarding future rewards. If discount factor γ = 1, the agent accords equal significance to both immediate and future rewards. In general, 0 < γ < 1.
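As a quick illustration of Eq. (1) and the role of γ, a minimal sketch (function name illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G(pi) = r_1 + gamma*r_2 + ... + gamma^(T-1)*r_T as in Eq. (1)."""
    G = 0.0
    for k, r in enumerate(rewards):   # k = 0 corresponds to r_1
        G += (gamma ** k) * r
    return G

# gamma = 0 keeps only the immediate reward; gamma close to 1 weighs
# future rewards almost as much as the immediate one.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.0))   # 1.0
print(discounted_return([1.0, 1.0, 1.0], gamma=0.99))  # ~2.97
```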

In the next section, a model-free reinforcement learning algorithm is introduced to address the Markov decision process constructed for water management in PEMFCs.

4 Water management strategy based on prioritized DDPG

4.1 Evaluating long-term rewards for optimal water management

Traditional Q-learning is introduced to address the two fundamental challenges described in Sect. 2. To solve the first problem, evaluating the long-term rewards of different actions, a state-action value function is introduced to assess how good actions are given the observed state s.

To evaluate the long-term rewards, the action-value function Qπ(s,a) is designed to estimate how good the action a is in the given state s. Qπ(s,a) is defined as the expected long-term return for taking action a in the given state s following policy π as follows,

$${\text{Q}}_{\pi } (s,a) = {\rm E}[G_{t} \left| {S_{t} = s,A_{t} = a} \right.]$$
(2)

Based on the Bellman equation, it can be decomposed into the immediate reward and the discounted value of the next state-action pair,

$${\text{Q}}_{\pi } (s,a) = {\rm E}[R + \gamma {\text{Q}}_{\pi } (s^{\prime},a^{\prime})]$$
(3)

For water management in PEMFCs, it is hard to model the entire chemical reaction. Therefore, the expected long-term reward is estimated from experience. At each time step, the agent observes the state, namely the stack current and stack voltage of the PEMFC. Based on the action-value function Q(s,a), it chooses the action for the hydrogen gas inlet pressure. After executing the action a, the state of the PEMFC changes to s′ and the agent receives an immediate reward r. The expected value function is as follows,

$${\text{Q}}_{E} (s,a) = R + \gamma \mathop {\max }\limits_{a} {\text{Q}}(s^{\prime},a)$$
(4)

The action-value function in Eq. 3 is updated toward the estimated return in Eq. 4 by the following Eq. 5,

$${\text{Q}}(s,a) \leftarrow {\text{Q}}(s,a) + \alpha \left( {R + \gamma \mathop {\max }\limits_{a} {\text{Q}}(s^{\prime},a){\text{ - Q}}(s,a)} \right)$$
(5)

where α ∈ (0, 1) is the learning rate, which determines how quickly the agent learns from new experiences. The action-value Q is updated step by step. After a number of episodes, the Q derived from experience approximates the actual value.
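A minimal tabular sketch of the update in Eq. (5); the dictionary-based table and the hyperparameter values are illustrative:

```python
from collections import defaultdict

Q = defaultdict(float)       # Q[(state, action)] -> estimated long-term value
alpha, gamma = 0.1, 0.99     # learning rate and discount factor (assumed values)

def q_update(s, a, r, s_next, actions):
    """One step of the tabular Q-learning update in Eq. (5)."""
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```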

4.2 The trade-off between exploration and exploitation for water management strategy

To solve the problem of the trade-off between exploration and exploitation, the ε-greedy algorithm is used in this paper. With probability ε, the agent chooses an action at random. With probability 1 − ε, the agent chooses the action with the maximal action-value function. The mapping from state s to action a is as follows,

$$\pi \left( {a|s} \right) = \left\{ {\begin{array}{*{20}l} {\frac{\varepsilon }{m} + 1 - \varepsilon ,} \hfill & {if\;a = \mathop {\arg \max }\limits_{{a^{\prime} \in A}} Q(s,a^{\prime})} \hfill \\ {\frac{\varepsilon }{m},} \hfill & {otherwise} \hfill \\ \end{array} } \right.$$
(6)

where m is the number of all possible actions. The ε-greedy algorithm is a near-greedy algorithm. The greedy action is chosen most of the time, which is the process of exploitation. The exploitation process maximizes the immediate reward based on the learned experience. But every once in a while, the action is chosen randomly, which is the process of exploration. The exploration process is responsible for gathering more information about the water management problem.

The choice of the exploration rate ε significantly influences the balance between exploration and exploitation. A high ε encourages more exploration, which can help discover better actions but may result in lower short-term rewards. Conversely, a low ε prioritizes exploitation, potentially leading to faster convergence to locally optimal actions but risking the possibility of missing globally optimal solutions. Therefore, the trade-off between exploration and exploitation is addressed by the ε-greedy algorithm.
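A minimal sketch of the ε-greedy selection in Eq. (6), reusing the tabular Q from the sketch above (a discrete action set is assumed for illustration; the DDPG sections below handle the continuous case):

```python
import random

def epsilon_greedy(s, actions, Q, epsilon=0.1):
    """With probability epsilon explore; otherwise exploit the greedy action (Eq. 6)."""
    if random.random() < epsilon:
        return random.choice(actions)                 # exploration
    return max(actions, key=lambda a: Q[(s, a)])      # exploitation
```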

4.3 The deep deterministic policy gradient algorithm

Traditional Q-learning effectively handles the two fundamental challenges in reinforcement learning described earlier. However, it is difficult to apply directly to the water management problem in PEMFCs. The first difficulty is that the hydrogen gas inlet pressure action is continuous, while traditional Q-learning can only handle discrete actions. The second difficulty is how to introduce deep neural networks for better nonlinear fitting of the water management problem in PEMFCs.

The deep deterministic policy gradient (DDPG) algorithm is introduced in this paper to address the aforementioned water management problem in PEMFCs. DDPG is a reinforcement learning method based on the Actor-Critic framework [18]. The Actor is responsible for producing an action based on the state, and the Critic is used for evaluating the policy. Therefore, DDPG can handle the continuous hydrogen gas inlet pressure action in water management.

To integrate deep neural networks into reinforcement learning, it is crucial to ensure that the training data adheres to the principle of being independently and identically distributed (i.i.d.). However, the data generated in water management for PEMFCs exhibit high correlation. To mitigate this correlation within the reinforcement learning data, we introduce an experience replay buffer for training deep neural networks. The samples acquired at each time step during the interaction with the environment are stored in the replay buffer. During the training of deep neural networks for water management, small batches of samples are subsequently extracted from this set.
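A minimal uniform replay buffer illustrating this idea; the capacity and batch size mirror the values reported in Sect. 5, and the class is a sketch rather than the authors' implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer; uniform mini-batch sampling breaks temporal correlation."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)   # oldest samples are discarded first

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))  # store one interaction sample

    def sample(self, batch_size=128):
        return random.sample(self.buffer, batch_size)
```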

To ensure the stability of reinforcement learning training, four neural networks are established: the online Actor network πθ(s), the offline Actor network πθ′(s), the online Critic network qφ(s,a) and the offline Critic network qφ′(s,a). First, the parameters of the offline networks (θ′ and φ′) are fixed. Then, the online networks (θ and φ) are trained using the offline networks. After a certain period, the parameters of the online networks are transferred to the offline networks.

First, initialize the parameters of the four neural networks θ, θ′, φ, φ′, which constitute the two Actor networks and two Critic networks within the DDPG framework. At time step t, based on the observed stack current and stack voltage state st, select the corresponding hydrogen inlet pressure action at using the online Actor network πθ(s). After executing the action, obtain the next state st+1 and the immediate reward rt. Store the transition (st, at, rt, st+1) in the experience replay buffer. After a certain number of steps, randomly select M samples from the experience replay buffer to update the networks.
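Put together with the sketches above, one interaction step might look as follows; the Gaussian exploration noise and the actor module are assumptions, and `env` and `buffer` reuse the earlier illustrative classes:

```python
import numpy as np
import torch

def interaction_step(env, buffer, actor, noise_std=0.1):
    """One environment step: observe (I, V), act, store the transition."""
    s = env.state                                            # observed (I, V)
    with torch.no_grad():
        a = actor(torch.as_tensor(s, dtype=torch.float32)).numpy()
    a = a + np.random.normal(0.0, noise_std, size=a.shape)   # exploration noise (assumed)
    s_next, r = env.step(a)
    buffer.push(s, a, r, s_next)                             # store (s_t, a_t, r_t, s_t+1)
    return s_next, r
```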

For the jth sample (sj, aj, rj, sj+1) among the M samples, select an action aj+1 for the state sj+1 based on the offline Actor network πθ′(s), and calculate the target value yj for the state sj+1 and action aj+1 based on the offline Critic network qφ′(s,a).

$$y_{j} = r_{j} + \gamma q_{\varphi ^{\prime}} (s_{j + 1} ,\pi_{\theta ^{\prime}} (s_{j + 1} ))$$
(7)

For the mini-batch of water management samples, calculate the mean squared error loss function as follows,

$$Loss\left( \varphi \right) = \frac{1}{M}\sum\nolimits_{j} {\left( {y_{j} - q_{\varphi } (s_{j} ,a_{j} )} \right)}^{2}$$
(8)

Update the parameters of the online Critic network,

$$\varphi \leftarrow \varphi - \alpha_{c} \nabla_{\varphi } {\text{Loss}}\left( \varphi \right)$$
(9)

where αc is the learning rate for the online Critic network.

Compute the sampled policy gradient,

$$\nabla_{\theta } J\left( \theta \right) = \frac{1}{M}\sum\nolimits_{j} {\nabla_{a} Q_{\varphi } \left( {s,a} \right)\left| {_{{s = s_{j} ,a = \pi_{\theta } (s_{j} )}} } \right.} \nabla_{\theta } \pi_{\theta } (s)\left| {_{{s = s_{j} }} } \right.$$
(10)

Update the parameters of the online Actor network based on the policy gradient,

$$\theta \leftarrow \theta + \alpha_{a} \nabla J\left( \theta \right)$$
(11)

where αa is the learning rate for the online Actor network.

Once the online networks reach a certain update frequency, update the offline Actor and offline Critic networks,

$$\theta ^{\prime} \leftarrow \beta \theta + (1 - \beta )\theta ^{\prime}$$
(12)
$$\varphi ^{\prime} \leftarrow \beta \varphi + (1 - \beta )\varphi ^{\prime}$$
(13)

where β is the soft-update coefficient for the offline networks.
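The update equations (7)–(13) can be condensed into a short PyTorch sketch. Network architectures, tensor shapes, and the helper name are illustrative assumptions; prioritization is added in the next subsection.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, actor_off, critic, critic_off,
                actor_opt, critic_opt, gamma=0.99, beta=0.005):
    """One DDPG update over a mini-batch, following Eqs. (7)-(13)."""
    s, a, r, s_next = batch          # tensors; r is assumed to have shape (M, 1)

    # Eq. (7): target value from the offline (target) Actor and Critic.
    with torch.no_grad():
        y = r + gamma * critic_off(s_next, actor_off(s_next))

    # Eqs. (8)-(9): mean squared error loss and gradient step on the online Critic.
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Eqs. (10)-(11): deterministic policy gradient step on the online Actor.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Eqs. (12)-(13): soft update of the offline networks with coefficient beta.
    for offline, online in ((actor_off, actor), (critic_off, critic)):
        for p_off, p in zip(offline.parameters(), online.parameters()):
            p_off.data.mul_(1.0 - beta).add_(beta * p.data)
```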

4.4 The prioritized deep deterministic policy gradient algorithm

However, experience replay involves storing past experiences in a replay buffer and then sampling mini-batches of experiences from this buffer to train the deep neural networks. If the replay buffer contains many biased samples of experience (e.g., from a sub-optimal strategy or an over-representation of certain state-action pairs), it can lead to suboptimal learning.

In standard experience replay, all experiences are treated equally, and samples are drawn uniformly from the replay buffer. To increase sample efficiency and speed up learning, prioritized experience replay is introduced. In prioritized experience replay, each experience is assigned a priority value that reflects its potential significance for learning. The priority is determined by the magnitude of the temporal-difference error. Experiences with higher temporal-difference errors are given higher priority because they represent situations where the agent's predictions are far from the target values.

In traditional prioritized experience replay, priorities are set based on the loss function. Experiences with larger errors are accorded higher priority, which enhances their likelihood of being selected for replay during training. However, this prioritization method has certain limitations. It can lead to the omission of samples that already had low errors initially. Moreover, it is overly sensitive to noise, as the negative effects introduced by noisy samples gradually worsen during training. Relying only on the loss function for prioritization can result in a lack of diversity and insufficient robustness in water management strategies.

To enhance the efficacy of sampling within the replay buffer, we introduce a probabilistic sampling technique. This approach prioritizes samples according to their relative importance, so that samples with elevated priority are more likely to be chosen during the training of the water management policy, while maintaining a non-zero sampling probability for the lowest-priority samples. Concretely, the priority, denoted as Pi, for a given sample i is formally defined as follows,

$$P_{i} = (|\nabla Loss(s,a,r,s^{\prime})| + p)^{\mu }$$
(14)

where the constant p is introduced to ensure that all samples have nonzero sampling probabilities, and the exponent µ determines how strongly high-error samples are prioritized.
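A one-line sketch of the priority in Eq. (14); the values of p and µ are illustrative assumptions:

```python
def sample_priority(error, p=1e-3, mu=0.6):
    """Priority of Eq. (14): p keeps every sample selectable,
    mu controls how strongly large errors are favoured."""
    return (abs(error) + p) ** mu
```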

In addition, during the training process of actual water management strategies, sorting all samples within the experience replay pool by priority before sampling each time results in low efficiency. Therefore, in this section, we employ a binary tree structure called SumTree to store the samples, enabling fast storage and retrieval of prioritized water management samples.

As shown in Fig. 2, this binary tree stores the sampling probabilities corresponding to eight water management samples. Different sampling probability values P determine the length of the intervals occupied by the samples: the longer the interval, the higher the priority of the corresponding water management sample. Each parent node stores the sum of its two child nodes, so the value of the root node is the sum over all stored samples.

Fig. 2 SumTree for prioritized water management samples in DDPG

The steps for selecting samples from the replay buffer are as follows. First, sample a value Pk uniformly from the interval [0, P15]. Then, compare Pk with the left child P13: if Pk is smaller than P13, continue down from the left node with Pk; otherwise, continue from the right node with Pk − P13. The comparison continues with the nodes in the next layer until a leaf sample is reached. By utilizing the binary tree structure of the SumTree, sampling and priority updates during the training of the water management policy are performed with a complexity of O(log N). Furthermore, for mini-batch sampling, the range [0, P15] is uniformly divided into Nbatch sub-ranges and one value is drawn from each.
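The sketch below implements such a SumTree in plain Python to make the O(log N) traversal concrete; the array layout and method names are illustrative, and the capacity is assumed to be a power of two so the tree is complete.

```python
import numpy as np

class SumTree:
    """Binary tree whose parents store the sum of their children's priorities,
    giving priority-proportional sampling in O(log N)."""

    def __init__(self, capacity):
        self.capacity = capacity                 # number of leaves (power of two assumed)
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes followed by leaves
        self.data = [None] * capacity            # transitions stored at the leaves
        self.write = 0                           # next leaf position to overwrite

    def add(self, priority, transition):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = transition
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                         # propagate the change up to the root
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def sample(self, value):
        """Walk down from the root: go left if value fits in the left sum,
        otherwise subtract it and go right, as described for Fig. 2."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):      # stop when a leaf is reached
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return self.data[idx - self.capacity + 1], idx
```

For a mini-batch, the total priority stored at the root is divided into Nbatch equal sub-ranges and sample is called once with a value drawn uniformly from each sub-range.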

The random sampling in the traditional deep Q-learning experience replay mechanism is uniform, so some samples may never be replayed or may take a long time to be replayed. In this section, we design a prioritized sampling mechanism that assigns different sampling probabilities to different samples. If a sample has been drawn and corrected many times, its error becomes smaller and its priority decreases. Therefore, prioritized sampling is more inclined to draw new samples, which ensures the timeliness of sampling, improves sample utilization, and is better suited to solving the water management problem of PEMFCs.

In summary, the proposed water management algorithm based on prioritized DDPG for PEMFCs is summarized in Fig. 3.

Fig. 3 Water management scheme based on prioritized DDPG for PEMFCs

5 Experiment

The proposed prioritized DDPG-based PEMFC water management algorithm is verified in this section. As shown in Fig. 4, the experimental platform consists of a PEMFC system, a hydrogen supply system, an M9716 programmable DC electronic load, a self-developed hydrogen fuel cell detection system, and a computer with a USB-CAN driver; the platform operates under open-circuit voltage conditions. Figure 5 depicts the control interface of the hydrogen fuel cell system platform, which was designed using the Troowin FCCU software [19]. The software is used for collecting and displaying various data during the operation of the hydrogen fuel cell; the data acquisition interface is shown in Fig. 6.

Fig. 4 Hydrogen fuel cell water management experimental platform, including the computer, electronic load, detection system, hydrogen cylinder, and PEMFC

Fig. 5 The control and data reading interface of the hydrogen fuel cell system using the Troowin FCCU software

Fig. 6 Display of operational data for the hydrogen fuel cell using the Troowin FCCU software

In the dynamic operational context of external voltage fluctuations, the operation of PEMFCs involves systematic control of hydrogen inlet pressure and precise adjustments to the water injection level. The regulation of hydrogen inlet pressure can be achieved by modulating the magnitude of the inlet hydrogen flow, thereby enabling control over the reaction rate and adjusting the water injection level. This control is executed in response to the continuously observed state, specifically the stack current and stack voltage.

The primary objective of this control mechanism is to uphold the water balance within the hydrogen fuel cell, ensuring a consistently stable power output. For a more granular understanding, Fig. 7 provides detailed insights into the temporal changes in both the observed stack current and stack voltage. The two sub-figures serve as valuable indicators of the shifts within the operational environment and offer a comprehensive perspective on how the fuel cell system responds to external dynamics, contributing to a more robust comprehension of the system's behavior.

Fig. 7 The observed states in the hydrogen fuel cell for water management

The "Tianshou" platform is introduced for reinfocement learning. The "Tianshou" platform provides a fast-speed framework and pythonic API for building the deep reinforcement learning agent based on pure PyTorch. It is very fast thanks to numba jit function and vectorized numpy operation [20]. The proposed prioritized DDPG algorithm is implemented in Python 3.8. The experience replay buffer size D is set to 1000, the batch size M is set to 128, the reward discount factor is 0.99, the learning rate for the actor network is 0.0001, and the learning rate for the critic network is 0.001.

After numerous iterations of the learning process, the rewards exhibit a gradual convergence toward the optimal values, as illustrated in Fig. 8. This convergence serves as a testament to the successful implementation of intelligent water management. It highlights the algorithm's inherent capacity to dynamically adapt and optimize the water management strategy in response to the continuously evolving operational conditions within the hydrogen fuel cell system.

Fig. 8 Convergence process of prioritized DDPG for water management

The observed stability in reward convergence not only underscores the effectiveness of the proposed control mechanism but also attests to its resilience in addressing the inherent dynamism and uncertainties pervasive in the operational environment.

The adaptability of the algorithm, evidenced by the consistent convergence of rewards, underscores its proficiency in navigating diverse scenarios while maintaining a stable and optimal water balance. This adaptability assumes paramount importance in real-world applications, where external factors, such as voltage fluctuations and environmental changes, wield considerable influence over the performance of PEMFCs. Therefore, the demonstrated stability in reward convergence provides robust evidence that the intelligent water management strategy is well-equipped to cope with the dynamic variations in the operational landscape, ensuring steadfast and reliable performance across a spectrum of conditions.

Figure 9 shows the hydrogen inlet pressure action and the power output obtained by the proposed water management method in the PEMFC. The sustained stability of the output, averaging around 0.75 W, is a tangible manifestation of the prioritized DDPG's adeptness in swiftly learning and adapting to the dynamic external voltage conditions. This stability not only signifies the successful convergence of the reinforcement learning process but also underscores the robustness of the prioritized DDPG approach in upholding a consistent and optimal water balance.

Fig. 9 The power in the hydrogen fuel cell for water management

In the context of the prioritized DDPG water management strategy, as illustrated in Fig. 9, the dynamic adjustments of hydrogen inlet pressure values exhibit a close association with the observed state s. The incorporation of the prioritized DDPG algorithm introduces a refined prioritization mechanism within the experience replay, fostering heightened learning efficiency and accelerated convergence. This prioritization feature enables the model to selectively emphasize experiences with greater learning potential, thereby influencing the adaptation of hydrogen inlet pressure values.

The observed correlation between the prioritized DDPG's learning efficiency and the stability in the hydrogen fuel cell's power output reaffirms the method's efficacy in addressing the challenges posed by the dynamic operational environment. The prioritized DDPG not only optimizes water management but also demonstrates an enhanced capability to navigate through the intricacies of external voltage variations, thereby showcasing its potential for practical applications in real-world scenarios.

6 Conclusion

In response to the complex and dynamic external power voltage conditions in the working environment of hydrogen fuel cells, this paper presents a model-free water management method based on the prioritized DDPG reinforcement learning approach. This method operates within the Actor-Critic reinforcement learning framework, comprising two sets of neural networks within the Actor component. It incorporates a prioritized experience replay mechanism and uses observations of stack current and stack voltage to execute hydrogen inlet pressure actions. The Critic component also consists of two sets of neural networks to evaluate the quality of executed actions. Through continuous interaction with the environment and learning, the method adjusts the water content of the hydrogen fuel cell to achieve water balance.

The proposed prioritized DDPG-based hydrogen fuel cell water management methodology is implemented using a constructed experimental platform for hydrogen fuel cell water management and the “Tianshou” platform. Experimental results serve as empirical evidence, substantiating the effectiveness and validity of the proposed approach.

In addition, future research endeavors could concentrate on enhancing the prioritized experience replay mechanism to improve learning efficiency and expedite convergence rates. This may entail investigating alternative prioritization strategies or integrating adaptive mechanisms to dynamically adjust replay priorities according to the relevance of experiences.