Multi-agent reinforcement learning for fast-timescale demand response of residential loads

  • Published in: Machine Learning

A Correction to this article was published on 23 February 2024


Abstract

To integrate high amounts of renewable energy resources, electrical power grids must be able to cope with high amplitude, fast timescale variations in power generation. Frequency regulation through demand response has the potential to coordinate temporally flexible loads, such as air conditioners, to counteract these variations. Existing approaches for discrete control with dynamic constraints struggle to provide satisfactory performance for fast timescale action selection with hundreds of agents. We propose a decentralized agent trained with multi-agent proximal policy optimization with localized communication. We explore two communication frameworks: hand-engineered, or learned through targeted multi-agent communication. The resulting policies perform well and robustly for frequency regulation, and scale seamlessly to arbitrary numbers of houses for constant processing times.


Data availability

The simulator is open-sourced at https://github.com/ALLabMTL/marl-demandresponse.

Code availability

The code to reproduce the results is available at https://github.com/ALLabMTL/marl-demandresponse.

Change history

Notes

  1. The code is hosted on https://github.com/ALLabMTL/marl-demandresponse.

References

  • International Energy Agency. (2018). The Future of Cooling. https://www.iea.org/reports/the-future-of-cooling

  • International Energy Agency. (2021). Greenhouse Gas Emissions from Energy: Overview.

  • International Energy Agency. (2022). Energy Statistics Data Browser – Data Tools. https://www.iea.org/data-and-statistics/data-tools/energy-statistics-data-browser (Accessed 15 September 2022).

  • Ahmadiahangar, R., Häring, T., Rosin, A., Korõtko, T., Martins, J. (2019). Residential load forecasting for flexibility prediction using machine learning-based regression model. In 2019 IEEE International Conference on Environment and Electrical Engineering and 2019 IEEE Industrial and Commercial Power Systems Europe (EEEIC / I &CPS Europe), pp. 1–4. https://doi.org/10.1109/EEEIC.2019.8783634

  • Ahrarinouri, M., Rastegar, M., & Seifi, A. R. (2021). Multiagent reinforcement learning for energy management in residential buildings. IEEE Transactions on Industrial Informatics, 17(1), 659–666. https://doi.org/10.1109/TII.2020.2977104

  • Aladdin, S., El-Tantawy, S., Fouda, M. M., & Tag Eldien, A. S. (2020). Marla-sg: Multi-agent reinforcement learning algorithm for efficient demand response in smart grid. IEEE Access, 8, 210626–210639. https://doi.org/10.1109/ACCESS.2020.3038863

  • Amin, U., Hossain, M., & Fernandez, E. (2020). Optimal price based control of hvac systems in multizone office buildings for demand response. Journal of Cleaner Production, 270, 122059.

  • Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B., Mordatch, I. (2020). Emergent tool use from multi-agent autocurricula. arXiv:1909.07528 [cs, stat]

  • Betelle Memorial Institute (2022). Residential Module User's Guide. GridLAB-D Wiki. http://gridlab-d.shoutwiki.com/wiki/Main_Page (Accessed September 15, 2022).

  • Bevrani, H., Ghosh, A., & Ledwich, G. (2010). Renewable energy sources and frequency regulation: survey and new perspectives. IET Renewable Power Generation, 4(5), 438–457.

  • Biagioni, D., Zhang, X., Wald, D., Vaidhynathan, D., Chintala, R., King, J., Zamzam, A. S. (2021). PowerGridworld: A framework for multi-agent reinforcement learning in power systems. arXiv preprint.

  • Callaway, D. S. (2009). Tapping the energy storage potential in electric loads to deliver load following and regulation, with application to wind energy. Energy Conversion and Management, 50(5), 1389–1400.

  • Chen, B., Francis, J., Pritoni, M., Kar, S., Bergés, M. (2020). Cohort: Coordination of heterogeneous thermostatically controlled loads for demand flexibility. In Proceedings of the 7th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, pp. 31–40. https://doi.org/10.1145/3408308.3427980. arXiv:2010.03659 [cs, eess]

  • CIBSE (2015). Guide A: Environmental Design, 8th edn. Chartered Institution of Building Services Engineers.

  • Dantzig, G. B. (1957). Discrete-variable extremum problems. Operations Research, 5(2), 266–288. https://doi.org/10.1287/opre.5.2.266

  • Das, A., Gervet, T., Romoff, J., Batra, D., Parikh, D., Rabbat, M., Pineau, J. (2019). Tarmac: Targeted multi-agent communication. In Proceedings of the 36th International Conference on Machine Learning, pp. 1538–1546. PMLR, url: https://proceedings.mlr.press/v97/das19a.html

  • Diamond, S., Boyd, S. (2016). CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83), 1–5.

  • Dusparic, I., Harris, C., Marinescu, A., Cahill, V., Clarke, S. (2013). Multi-agent residential demand response based on load forecasting. In 2013 1st IEEE Conference on Technologies for Sustainability (SusTech), pp. 90–96 https://doi.org/10.1109/SusTech.2013.6617303

  • Fuchs, A., Walton, M., Chadwick, T., Lange, D. (2021). Theory of mind for deep reinforcement learning in hanabi. arXiv:2101.09328 [cs]

  • Gronauer, S., & Diepold, K. (2021). Multi-agent deep reinforcement learning: A survey. Artificial Intelligence Review. https://doi.org/10.1007/s10462-021-09996-w

  • Guan, C., Chen, F., Yuan, L., Zhang, Z., Yu, Y. (2023). Efficient communication via self-supervised information aggregation for online and offline multi-agent reinforcement learning. arXiv:2302.09605 [cs]. https://doi.org/10.48550/arXiv.2302.09605

  • Gupta, J.K., Egorov, M., Kochenderfer, M. (2017) In: Sukthankar, G., Rodriguez-Aguilar, J.A. (eds.) Cooperative Multi-agent Control Using Deep Reinforcement Learning. Lecture Notes in Computer Science, vol. 10642, pp. 66–83. Springer, Cham. https://doi.org/10.1007/978-3-319-71682-4_5

  • Gurobi Optimization, LLC (2022). Gurobi Optimizer Reference Manual. https://www.gurobi.com

  • Jiang, J., Lu, Z. (2018). Learning attentional communication for multi-agent cooperation. In Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc., url: https://proceedings.neurips.cc/paper/2018/hash/6a8018b3a00b69c008601b8becae392b-Abstract.html

  • Dong, J., Olama, M., Kuruganti, T., Nutaro, J., Winstead, C., Xue, Y., Melin, A. (2018). Model predictive control of building on/off hvac systems to compensate fluctuations in solar power generation. In 2018 9th IEEE International Symposium on Power Electronics for Distributed Generation Systems (PEDG), pp. 1–5. https://doi.org/10.1109/PEDG.2018.8447840

  • Kingma, D. P., Ba, J. (2017). Adam: A method for stochastic optimization. arXiv:1412.6980 [cs]

  • Kraemer, L., & Banerjee, B. (2016). Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190, 82–94. https://doi.org/10.1016/j.neucom.2016.01.031

  • Kundur, P. (2007). Power system stability. Power system stability and control, pp. 7–1.

  • Lacoste, A., Luccioni, A., Schmidt, V., Dandres, T. (2019). Quantifying the carbon emissions of machine learning. arXiv:1910.09700 [cs]. https://doi.org/10.48550/arXiv.1910.09700

  • Lagae, A., Lefebvre, S., Cook, R., DeRose, T., Drettakis, G., Ebert, D. S., Lewis, J. P., Perlin, K., & Zwicker, M. (2010). A survey of procedural noise functions. Computer Graphics Forum, 29(8), 2579–2600. https://doi.org/10.1111/j.1467-8659.2010.01827.x

  • Lauro, F., Moretti, F., Capozzoli, A., Panzieri, S. (2015). Model predictive control for building active demand response systems. Energy Procedia, 83, 494–503. https://doi.org/10.1016/j.egypro.2015.12.169. Sustainability in Energy and Buildings: Proceedings of the 7th International Conference SEB-15.

  • Lee, Y. M., Horesh, R., & Liberti, L. (2015). Optimal hvac control as demand response with on-site energy storage and generation system. Energy Procedia, 78, 2106–2111.

  • Lesage-Landry, A., Taylor, J. A., & Callaway, D. S. (2021). Online convex optimization with binary constraints. IEEE Transactions on Automatic Control.

  • Lesage-Landry, A., & Taylor, J. A. (2018). Setpoint tracking with partially observed loads. IEEE Transactions on Power Systems, 33(5), 5615–5627.

  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D. (2019). Continuous control with deep reinforcement learning. arXiv:1509.02971 [cs, stat]

  • Liu, M., & Shi, Y. (2015). Model predictive control of aggregated heterogeneous second-order thermostatically controlled loads for ancillary services. IEEE Transactions on Power Systems, 31(3), 1963–1971.

  • Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I. (2020). Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv:1706.02275 [cs]

  • Maasoumy, M., Sanandaji, B. M., Sangiovanni-Vincentelli, A., Poolla, K. (2014). Model predictive control of regulation services from commercial buildings to the smart grid. In 2014 American Control Conference, pp. 2226–2233 IEEE.

  • Mai, V., Maisonneuve, P., Zhang, T., Nekoei, H., Paull, L., Lesage-Landry, A. (2023). Multi-agent reinforcement learning for fast-timescale demand response of residential loads. In AAMAS’23: Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS’23).

  • Mathieu, J. L., Koch, S., & Callaway, D. S. (2012). State estimation and control of electric loads to manage real-time energy imbalance. IEEE Transactions on Power Systems, 28(1), 430–440.

  • Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. arXiv:1602.01783 [cs]

  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236

  • Olama, M. M., Kuruganti, T., Nutaro, J., & Dong, J. (2018). Coordination and control of building hvac systems to provide frequency regulation to the electric grid. Energies. https://doi.org/10.3390/en11071852

  • OpenAI, Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., Pinto, H. P. d. O., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., Zhang, S. (2019). Dota 2 with large scale deep reinforcement learning. arXiv:1912.06680

  • Pardo, F., Tavakoli, A., Levdik, V., Kormushev, P. (2018). Time limits in reinforcement learning. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 4045–4054. PMLR. url: https://proceedings.mlr.press/v80/pardo18a.html

  • Pigott, A., Crozier, C., Baker, K., Nagy, Z. (2021). GridLearn: Multiagent Reinforcement Learning for Grid-Aware Building Energy Management. https://doi.org/10.48550/ARXIV.2110.06396

  • Qin, Z., Zhu, H., Ye, J. (2022). Reinforcement learning for ridesharing: An extended survey. arXiv:2105.01099 [cs]

  • Roesch, M., Linder, C., Zimmermann, R., Rudolf, A., Hohmann, A., & Reinhart, G. (2020). Smart grid for industry using multi-agent reinforcement learning. Applied Sciences. https://doi.org/10.3390/app10196900

  • Sartoretti, G., Kerr, J., Shi, Y., Wagner, G., Kumar, T. K. S., Koenig, S., Choset, H. (2019). PRIMAL: Pathfinding via reinforcement and imitation multi-agent learning. IEEE Robotics and Automation Letters, 4(3), 2378–2385. https://doi.org/10.1109/LRA.2019.2903261. arXiv:1809.03531 [cs]

  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347 [cs]

  • Siano, P. (2014). Demand response and smart grids-a survey. Renewable and Sustainable Energy Reviews, 30, 461–478.

  • Subramanian, J., Seraj, R., Mahajan, A. (2018). Reinforcement learning for mean field teams. In Workshop on Adaptive and Learning Agents at the International Conference on Autonomous Agents and Multi-Agent Systems.

  • Taylor, J. A., Dhople, S. V., & Callaway, D. S. (2016). Power systems without fuel. Renewable and Sustainable Energy Reviews, 57, 1322–1336.

  • Vazquez-Canteli, J. R., Dey, S., Henze, G., Nagy, Z. (2020). Citylearn: Standardizing research in multi-agent reinforcement learning for demand response and urban energy management. https://doi.org/10.1145/3408308.3427604

  • Wang, J., Xu, W., Gu, Y., Song, W., Green, T. C. (2022). Multi-agent reinforcement learning for active voltage control on power distribution networks. arXiv:2110.14300 [cs]. https://doi.org/10.48550/arXiv.2110.14300

  • Wang, Z., Chen, B., Li, H., & Hong, T. (2021). Alphabuilding rescommunity: A multi-agent virtual testbed for community-level load coordination. Advances in Applied Energy, 4, 100061.

  • Wu, X., He, J., Xu, Y., Lu, J., Lu, N., & Wang, X. (2018). Hierarchical control of residential hvac units for primary frequency regulation. IEEE Transactions on Smart Grid, 9(4), 3844–3856. https://doi.org/10.1109/TSG.2017.2766880

  • Xi, L., Chen, J., Huang, Y., Xu, Y., Liu, L., Zhou, Y., & Li, Y. (2018). Smart generation control based on multi-agent reinforcement learning with the idea of the time tunnel. Energy, 153, 977–987. https://doi.org/10.1016/j.energy.2018.04.042

  • Yang, Y., Hao, J., Zheng, Y., Hao, X., Fu, B. (2019). Large-scale home energy management using entropy-based collective multiagent reinforcement learning framework. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. AAMAS ’19, pp. 2285–2287. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC.

  • Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., & Wang, J. (2018). Mean field multi-agent reinforcement learning. https://doi.org/10.48550/arXiv.1802.05438

  • Yu, C., Velu, A., Vinitsky, E., Wang, Y., Bayen, A., Wu, Y. (2021). The surprising effectiveness of PPO in cooperative, multi-agent games. arXiv:2103.01955 [cs]

  • Yuan, L., Wang, J., Zhang, F., Wang, C., Zhang, Z., Yu, Y., & Zhang, C. (2022). Multi-agent incentive communication via decentralized teammate modeling. Proceedings of the AAAI Conference on Artificial Intelligence, 36(9), 9466–9474. https://doi.org/10.1609/aaai.v36i9.21179

  • Zhang, W., Lian, J., Chang, C.-Y., & Kalsi, K. (2013). Aggregated modeling and control of air conditioning loads for demand response. IEEE Transactions on Power Systems, 28(4), 4655–4664.

  • Zhou, X., Dall’Anese, E., & Chen, L. (2019). Online stochastic optimization of networked distributed energy resources. IEEE Transactions on Automatic Control, 65(6), 2387–2401.

Funding

Natural Sciences and Engineering Research Council of Canada (NSERC): Vincent Mai (VM), Hadi Nekoei (HN), Liam Paull (LP). Institut de Valorisation des Données: Philippe Maisonneuve (PM), Antoine Lesage-Landry (ALL). Microsoft Research and Samsung: Tianyu Zhang (TZ).

Author information

Authors and Affiliations

Authors

Contributions

Problem identification and formalization: VM and ALL. Simulator implementation: VM and PM. Classical baseline implementations: PM. Machine learning algorithms: VM, TZ, and HN. Experiments: VM and PM. Analysis: everyone. Manuscript drafting: VM. Manuscript review: everyone. Supervision and guidance: ALL and LP.

Corresponding author

Correspondence to Hadi Nekoei.

Ethics declarations

Conflict of interest

None

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editors: Emma Brunskill, Minmin Chen, Omer Gottesman, Lihong Li, Yuxi Li, Yao Liu, Zonging Lu, Niranjani Prasad, Zhiwei Qin, Csaba Szepesvari, Matthew Taylor.

The original online version of this article was revised: there were many instances of incorrect in-text reference formatting in the original article, which impaired the readability of the paper.

Appendices

Appendix A: Notation

Table 7 contains the different notations we use in this paper.

Table 7 Notation table

Appendix B: Carbon emissions of the research project

As a significant amount of electricity has been used to train and run the models for this work, we publish its estimated carbon footprint.

Experiments were conducted on a private infrastructure with a carbon efficiency of 0.049 kg CO\(_2\)eq/kWh. A cumulative 10,895 days (261,480 h) of computation was performed, mainly on Intel Xeon E5-2683 v4 CPUs (TDP of 120 W). We assume an average power draw of half the TDP for CPUs.

The total emissions are estimated to be 628 kg CO\(_2\)eq, none of which was directly offset. This is equivalent to 2550 km driven by an average car, or 314 kg of coal burned.

These estimates were obtained using the Machine Learning Impact calculator (Lacoste et al., 2019).

Appendix C: Environment details

1.1 C.1: Detailed house thermal model

The air temperature in each house evolves separately, based on the house's thermal characteristics \(\theta _h\), its current state, the outdoor conditions such as outdoor temperature and solar gain, and the status of its air conditioner. The second-order model follows GridLAB-D's Residential module user's guide (Betelle Memorial Institute, 2022).

Using GridLAB-D's module, we model an 8\(\times\)12.5 m, one-level rectangular house with a ceiling height of 2.5 m, four 1.8 m\(^2\), 2-layer aluminum windows, and two 2 m\(^2\) wooden doors, leading to the values presented in Table 8.

Table 8 Default house thermal parameters \(\theta _h\)

To model the evolution of the house's air temperature \(T_{h,t}\) and its mass temperature \(T_{m,t}\), we assume that each temperature is homogeneous and do not consider heat propagation within the house. We define the following variables:

$$\begin{aligned} a =&C_m C_h / H_m \\ b =&C_m (U_h + H_m)/H_m + C_h \\ c =&U_h\\ d =&Q_{a,t} + Q_{s,t} + U_hT_{o,t}\\ dT_{h,t}/dt =&\left( H_mT_{m,t} - (U_h + H_m) T_{h,t}\right. \\&\left. + U_hT_{o,t} + Q_{h,t} + Q_{s,t}\right) /C_h. \end{aligned}$$

The following coefficients are then computed:

$$\begin{aligned} r_1 =&(-b + \sqrt{b^2 - 4ac})/2a\\ r_2 =&(-b - \sqrt{b^2 - 4ac})/2a\\ A_1 =&(r_2 T_{h,t} - dT_{h,t}/dt - r_2d/c)/(r_2-r_1)\\ A_2 =&T_{h,t} - d/c - A_1\\ A_3 =&(r_1 C_h + U_h + H_m)/H_m\\ A_4 =&(r_2 C_h + U_h + H_m)/H_m. \end{aligned}$$

These coefficients are finally applied to the following dynamic equations:

$$\begin{aligned} T_{h,t+1} =&A_1 e^{r_1 \delta t} + A_2 e^{r_2 \delta t} + d/c \\ T_{m, t+1} =&A_1A_3 e^{r_1 \delta t} + A_2A_4 e^{r_2 \delta t} + d/c. \end{aligned}$$
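For illustration, a minimal Python sketch of one update step of this model is given below. Variable names are ours, \(Q_{a,t}\) is used for the heat input written \(Q_{h,t}\) above, and this is a sketch of the equations, not the simulator code.

```python
import numpy as np

def house_temperature_step(T_h, T_m, T_o, Q_a, Q_s, theta, dt):
    """One step of the second-order (air + mass) thermal model above.

    T_h, T_m : current air and mass temperatures (degC)
    T_o      : outdoor temperature (degC)
    Q_a, Q_s : heat flux from the AC (negative when cooling) and solar gain (W)
    theta    : dict with the house thermal parameters U_h, C_h, C_m, H_m
    dt       : time step (s)
    """
    U_h, C_h, C_m, H_m = theta["U_h"], theta["C_h"], theta["C_m"], theta["H_m"]

    # Intermediate variables a, b, c, d and dT_h/dt as defined above.
    a = C_m * C_h / H_m
    b = C_m * (U_h + H_m) / H_m + C_h
    c = U_h
    d = Q_a + Q_s + U_h * T_o
    dTh_dt = (H_m * T_m - (U_h + H_m) * T_h + U_h * T_o + Q_a + Q_s) / C_h

    # Roots of the characteristic polynomial and integration constants.
    disc = np.sqrt(b * b - 4 * a * c)
    r1, r2 = (-b + disc) / (2 * a), (-b - disc) / (2 * a)
    A1 = (r2 * T_h - dTh_dt - r2 * d / c) / (r2 - r1)
    A2 = T_h - d / c - A1
    A3 = (r1 * C_h + U_h + H_m) / H_m
    A4 = (r2 * C_h + U_h + H_m) / H_m

    # Closed-form evolution of both temperatures over dt.
    T_h_next = A1 * np.exp(r1 * dt) + A2 * np.exp(r2 * dt) + d / c
    T_m_next = A1 * A3 * np.exp(r1 * dt) + A2 * A4 * np.exp(r2 * dt) + d / c
    return T_h_next, T_m_next
```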

1.1.1 C.1.1: Solar gain

The solar gain can optionally be added to the simulator. It is computed based on the CIBSE Environmental Design Guide (CIBSE, 2015).

The house’s lighting characteristics \(\theta _S\), which include the window area and the shading coefficient of 0.67 are needed to model the solar gain, \(Q_{s,t}\).

Then, the following assumptions are made:

  • The latitude is 30\(^\circ\).

  • The solar gain is negligible before 7:30 am and after 5:30 pm at such latitude.

  • The windows are distributed evenly around the building, in the 4 orientations.

  • All windows are vertical.

This allows us to compute the coefficients of a fourth-degree bivariate polynomial to model the solar gain of the house based on the time of the day and the day of the year.
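As an illustration only, the sketch below evaluates such a bivariate polynomial in the hour of the day and the day of the year. The coefficient matrix and the conversion from the polynomial value to \(Q_{s,t}\) through the window area and shading coefficient are placeholders for the fitted values, not the simulator's.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Placeholder 5x5 coefficient matrix of a 4th-degree bivariate polynomial in
# (hour, day); the actual coefficients are fitted under the assumptions above.
SOLAR_COEFFS = np.zeros((5, 5))
SOLAR_COEFFS[0, 0] = 100.0          # illustrative constant term (W/m^2)

WINDOW_AREA = 4 * 1.8               # m^2, four windows (Appendix C.1)
SHADING_COEFF = 0.67

def solar_gain(hour, day_of_year):
    """Solar gain Q_s (W) as a function of local time and day of the year."""
    if hour < 7.5 or hour > 17.5:   # negligible before 7:30 am / after 5:30 pm
        return 0.0
    irradiance = P.polyval2d(hour, day_of_year, SOLAR_COEFFS)
    return WINDOW_AREA * SHADING_COEFF * max(float(irradiance), 0.0)
```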

1.2 C.2: Detailed air conditioner model

Once again based on the Gridlab-D Residential module user’s guide (Betelle Memorial Institute, 2022), we model the air conditioner’s power consumption \(P_{a,t}\) when turned on, and the heat removed from the air \(Q_{a,t}\), based on its characteristics \(\theta _H\), such as cooling capacity \(K_a\), coefficient of performance \(COP_a\), and the latent cooling fraction \(L_a\).

\(COP_a\) and \(L_a\) are considered constant and based on default values of the guide: \(COP_a = 2.5\) and \(L_a = 0.35\). We have:

$$\begin{aligned} Q_{a,t}&= - \dfrac{K_a}{1+L_a} \\ P_{a,t}&= \dfrac{K_a}{COP_a}. \end{aligned}$$

We set \(K_a\) to 15 kW, or 50,000 BTU/h, to be able to control the air temperature even at high outdoor temperatures. This is higher than most residential ACs, but it provides sufficient flexibility even at high outdoor temperatures (a 5 kW AC would have to be always on to keep a 20\(^{\circ }\)C temperature when it is 38\(^{\circ }\)C outside). This choice does not significantly affect our results: with lower outdoor temperatures, the problem is equivalent to one with lower AC power.
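Numerically, with the default values above (a short sketch; the function name is ours):

```python
def ac_model(K_a=15_000.0, COP_a=2.5, L_a=0.35):
    """Heat removed from the air and electrical power drawn while the AC is on."""
    Q_a = -K_a / (1.0 + L_a)   # W, sensible cooling delivered to the air
    P_a = K_a / COP_a          # W, electrical consumption
    return Q_a, P_a

print(ac_model())  # approximately (-11111.1, 6000.0): 11.1 kW of cooling for 6 kW drawn
```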

1.3 C.3: Regulation signal

1.3.1 C.3.1: Interpolation for the base signal

As described in Sect. 3.1.4, we estimate \(D_{a,t}\) by interpolation. A bang-bang controller is run without lockout for 5 min, and we compute the average power consumed. This gives a proxy for the amount of power necessary in a given situation.

A database was created by estimating \(D_{a,t}\) for a single house for more than 4 million combinations of the following parameters: the house thermal characteristics \(\theta _h\), the differences between the air and mass temperatures \(T_{h,t}\) and \(T_{m,t}\) and the target temperature \(T_T\), the outdoor temperature \(T_{o,t}\), and the AC's cooling capacity \(K_a\). If the solar gain is added to the simulation, the hour of the day and the day of the year are also considered.

When the environment is simulated, \(D_{a,t}\) is computed every 5 min by summing the interpolated required consumption of every house in the cluster. The interpolation is linear for most parameters, except for the 4 elements of \(\theta _h\) and for \(K_a\), which instead use nearest neighbours to reduce the complexity of the operation.
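A minimal sketch of this lookup is given below, assuming the database has been tabulated on a regular grid; the grid values, the reduction to three continuous variables, and all names are illustrative. Nearest neighbour is used for \(K_a\), and SciPy's RegularGridInterpolator handles the linear part.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Illustrative grid axes for the continuous variables: the differences
# T_h - T_T and T_m - T_T (degC), and the outdoor temperature T_o (degC).
dTh_axis = np.linspace(-3.0, 3.0, 13)
dTm_axis = np.linspace(-3.0, 3.0, 13)
To_axis = np.linspace(20.0, 40.0, 21)
Ka_values = np.array([10e3, 15e3, 20e3])          # tabulated AC capacities (W)

# demand_table[k] holds the tabulated D_a for capacity Ka_values[k], e.g.
# obtained offline by the 5-minute bang-bang simulation; random placeholder here.
demand_table = {k: np.random.rand(13, 13, 21) * Ka_values[k]
                for k in range(len(Ka_values))}
interpolators = {k: RegularGridInterpolator((dTh_axis, dTm_axis, To_axis),
                                             demand_table[k])
                 for k in range(len(Ka_values))}

def estimate_demand(dTh, dTm, To, Ka):
    """Estimate D_a for one house: nearest neighbour on Ka, linear elsewhere."""
    k = int(np.argmin(np.abs(Ka_values - Ka)))
    return float(interpolators[k]([dTh, dTm, To]))

# Cluster-level estimate: sum of the per-house estimates, refreshed every 5 min.
houses = [(0.5, 0.2, 30.0, 15e3), (-1.0, 0.0, 32.0, 15e3)]
D_a = sum(estimate_demand(*h) for h in houses)
```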

1.3.2 C.3.2: Perlin noise

1-D Perlin noise is used to compute \(\delta _{\Pi ,t}\), the high-frequency component of the power generation. Although designed for computer image generation, this noise has several properties that are interesting for our use case.

Perlin noise is typically generated by the superposition of several sub-noises called octaves. The span of values each octave can take can be restricted, so the agents can be tested against noise containing several frequencies of irregular variation while its values remain within realistic limits. Moreover, the mean value of the noise can be set easily and does not drift, which ensures that over a sufficiently long time horizon the noise averages to 0.

Each octave is characterized by 2 parameters: an amplitude and a frequency ratio. The frequency determines how closely spaced the random deviations are, and the amplitude determines their magnitude. Normally, the frequency increases as the amplitude decreases: high-amplitude noise is spread over a wider interval, while lower-amplitude noise is more frequent and compact. This is illustrated in Fig. 7.

Fig. 7: Illustration of how several octaves add up to form Perlin noise. The frequency of the octaves increases as their amplitude decreases.

In our case, we use 5 octaves, with an amplitude ratio of 0.9 between each octave and a frequency proportional to the number of the octave.
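The construction can be summarized by the following sketch, a simplified 1-D Perlin noise of our own (not the simulator's implementation), with 5 octaves, an amplitude ratio of 0.9, and a frequency proportional to the octave number:

```python
import numpy as np

rng = np.random.default_rng(0)

def perlin_1d(x, gradients):
    """1-D Perlin noise at positions x, given one random gradient per lattice point."""
    x0 = np.floor(x).astype(int)
    t = x - x0                                    # position inside the lattice cell
    fade = 6 * t**5 - 15 * t**4 + 10 * t**3       # Perlin's smoothstep
    g0 = gradients[x0 % len(gradients)]           # gradient of the left lattice point
    g1 = gradients[(x0 + 1) % len(gradients)]     # gradient of the right lattice point
    return (1 - fade) * (g0 * t) + fade * (g1 * (t - 1.0))

def octave_perlin(x, n_octaves=5, amplitude_ratio=0.9):
    """Sum of octaves: octave i has frequency i+1 and amplitude 0.9**i."""
    gradients = rng.uniform(-1.0, 1.0, size=256)
    noise = np.zeros_like(x)
    for i in range(n_octaves):
        noise += amplitude_ratio**i * perlin_1d(x * (i + 1), gradients)
    return noise

delta_pi = octave_perlin(np.linspace(0.0, 10.0, 1000))   # zero-mean on average
```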

Appendix D: Algorithm details

1.1 D.1: Model Predictive Control

Our MPC is based on a centralized model. At each time step, information about the state of all agents is used to find the sequence of future controls that minimizes the cost function over the next H time steps, and the optimal immediate action is then communicated to the agents.

Since the cost to minimize for both the signal and the temperature is the RMSE, the problem is modeled as a quadratic mixed-integer program. The MPC is solved with the commercial solver Gurobi 9.5.1 (Gurobi Optimization, 2022), used through CVXPY 1.3 (Diamond & Boyd, 2016). As Gurobi is a licensed, closed-source solver, its exact internal behavior is unknown to us and it acts as a black box for our MPC. We know, however, that it solves convex integer problems with branch and bound, and that the resolution speed depends mainly on the quality of the solver's heuristics.

The computation time required for each step of the MPC increases drastically with the number of agents and/or with H. To test this approach with enough agents and a rolling horizon long enough for reasonable performance, we had to increase the time step at which the agents make decisions to 12 s (instead of 4 s for the other agents).

Running an experiment with the MPC agent for 48 h was not possible in a reasonable time. To compensate, we ran 200 instances in parallel, each started at a random simulated time. To quickly reach a stable environment state, the noise on the temperature was reduced to 0.05\(^{\circ }\)C. We then measured the average RMSE over the first 2 h of simulation for each instance.

Despite this, it was impossible to test the MPC with more than 10 agents while keeping the computation time low enough for real-time use, that is, shorter than the duration between two time steps.

At each time step, the MPC solves the following optimization problem:

$$\begin{aligned} \min _{a \in \{0,1\}^{N \times H}} \sum _{t \in H} \alpha _{\text {sig}} \left( \sum _{i \in N} P_{i,t} - s_0 \right) ^2 + \alpha _{\text {temp}} \sum _{i \in N} \left( T_{h,t,i} - T_{T,i} \right) ^2, \end{aligned}$$

such that it obeys the following physical constraints of the environment:

$$\begin{aligned} T_{h,t,i}, T_{m,t,i}&= F_1(a_{i,t}, T_{h,t-1,i},T_{m,t-1,i})&\forall \ t \in H,\ i \in N, \\ P_{i,t}&= a_{i,t} F_2(\theta _{a}^i)&\forall \ t \in H,\ i \in N, \end{aligned}$$

and the lockout constraint:

$$\begin{aligned} l_{max}(a_{i,t} - \omega _{i,t-1}) - \sum _{k=0}^{l_{max}} (1-\omega _{i,t-k})\le 0 \ \forall \ t \in H, i \in N, \end{aligned}$$

where \(F_1\) and \(F_2\) are convex functions that can be deduced from the physical equations given in Sect. 3.
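The sketch below gives a simplified CVXPY formulation of this problem. For readability it uses a first-order surrogate for the thermal dynamics \(F_1\) (the second-order model of Appendix C.1 is also affine in the on/off decision and can be substituted), and all parameter values are illustrative; a MIQP-capable solver such as Gurobi is required.

```python
import numpy as np
import cvxpy as cp

N, H = 3, 10            # number of houses, horizon length (time steps)
l_max = 5               # lockout duration, in time steps
alpha_sig, alpha_temp = 1.0, 1.0

# Illustrative first-order surrogate for F_1: T_{t+1} = T_t + kappa*(T_o - T_t) + gamma*a_t.
T_o, T_target = 30.0, 20.0
kappa, gamma = 0.05, -0.6          # heat exchange with outside / cooling per step
P_on = 6.0                         # kW drawn by one AC when on (plays the role of F_2)
s0 = np.full(H, N * P_on / 2)      # signal to track over the horizon
T_init = np.full(N, 21.0)          # current air temperatures
w_hist = np.zeros((N, l_max + 1))  # past on/off statuses omega (lockout history)

a = cp.Variable((N, H), boolean=True)     # on/off decisions over the horizon
T = cp.Variable((N, H + 1))               # predicted air temperatures

constraints = [T[:, 0] == T_init]
for t in range(H):
    # Thermal dynamics, affine in a, so the problem remains a MIQP.
    constraints += [T[:, t + 1] == T[:, t] + kappa * (T_o - T[:, t]) + gamma * a[:, t]]

# Lockout: an AC can only turn on if it has been off during the last l_max steps.
for i in range(N):
    w = cp.hstack([w_hist[i], a[i, :]])   # past statuses followed by the horizon
    for t in range(H):
        tau = l_max + 1 + t               # index of a[i, t] inside w
        constraints += [l_max * (a[i, t] - w[tau - 1])
                        - sum(1 - w[tau - k] for k in range(l_max + 1)) <= 0]

power = P_on * cp.sum(a, axis=0)          # total cluster consumption per step
objective = cp.Minimize(alpha_sig * cp.sum_squares(power - s0)
                        + alpha_temp * cp.sum_squares(T[:, 1:] - T_target))
prob = cp.Problem(objective, constraints)
prob.solve(solver=cp.GUROBI)              # any MIQP-capable solver can be substituted
print(np.rint(a.value[:, 0]))             # first action sent to each house
```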

1.2 D.2: Learning-based methods

1.2.1 D.2.1: TarMAC and MA-PPO

The original implementation of TarMAC (Das et al., 2019) is built over the Asynchronous Advantage Actor-Critic (A3C) algorithm (Mnih et al., 2016). The environments on which it is trained have very short episodes, making it possible for the agents to train online over the whole memory as one mini-batch.

This is not possible with our environment where training episodes last around 16000 time steps. As a result, we built TarMAC over our existing MA-PPO implementation. The same loss functions were used to train the actor and the critic.

The critic is given all agents' observations as input.

The actor’s architecture is described in Fig. 8. Agent i’s observations are passed through a first multi-layer perceptron (MLP), outputting a hidden state x. x is then used to produce a key, a value, and a query by three MLPs. The key and value are sent to the other agents, while agent i receives the other agents’ keys and values. The other agents’ keys are multiplied using a dot product with agent i’s query, and passed through a softmax to produce the attention. Here, a mask is applied to impose the localized communication constraints and ensure agent i only listen to its neighbours. The attention is then used as weights for the values, which are summed together to produce the communication vector for agent i. For multi-round communication, the communication vector and x are concatenated and passed through another MLP to produce a new x, and the communication process is repeated for the number of communication hops. Once done, the final x and communication vector are once more concatenated and passed through the last MLP, the actor, to produce the action probabilities.

We take advantage of the centralized training approach to connect the agents’ communications in the computational graph during training. Once trained, the agents can be deployed in a decentralized way.

In order to maintain the privacy constraint of local communications, meaning that information about an agent is only shared with its immediate neighbours, two design choices have been made. First, the agents do not retain any memory of past messages received, relying solely on their present observations to generate messages. Second, the communication is limited to a single hop, ensuring that messages travel only to the neighbouring agents.

Fig. 8: Architecture of the TarMAC-PPO actor.
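The sketch below renders one single-hop communication round of this architecture in PyTorch; it is our own simplified reading of Fig. 8 (the key, value, and query dimensions are placeholders), not the repository code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TarMACActor(nn.Module):
    """Single-hop targeted communication followed by the action head (sketch)."""
    def __init__(self, obs_dim, n_actions, hidden=32, key_dim=16, val_dim=16):
        super().__init__()
        relu_mlp = lambda i, o: nn.Sequential(nn.Linear(i, hidden), nn.ReLU(), nn.Linear(hidden, o))
        tanh_mlp = lambda i, o: nn.Sequential(nn.Linear(i, hidden), nn.Tanh(), nn.Linear(hidden, o))
        self.obs2hidden = relu_mlp(obs_dim, hidden)   # observation -> hidden state x
        self.hidden2key = tanh_mlp(hidden, key_dim)
        self.hidden2query = tanh_mlp(hidden, key_dim)
        self.hidden2val = tanh_mlp(hidden, val_dim)
        self.actor = relu_mlp(hidden + val_dim, n_actions)

    def forward(self, obs, neighbour_mask):
        """obs: (n_agents, obs_dim); neighbour_mask: (n_agents, n_agents) boolean."""
        x = self.obs2hidden(obs)
        keys, values, queries = self.hidden2key(x), self.hidden2val(x), self.hidden2query(x)
        scores = queries @ keys.t()                                  # (receiver, sender)
        scores = scores.masked_fill(~neighbour_mask, float("-inf"))  # localized communication
        attention = F.softmax(scores, dim=-1)                        # who each agent listens to
        comm = attention @ values                                    # aggregated messages
        logits = self.actor(torch.cat([x, comm], dim=-1))
        return torch.distributions.Categorical(logits=logits)

# Example: 4 houses on a line, each communicating only with its immediate neighbours.
n = 4
mask = torch.zeros(n, n, dtype=torch.bool)
for i in range(n):
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            mask[i, j] = True
policy = TarMACActor(obs_dim=20, n_actions=2)
actions = policy(torch.randn(n, 20), mask).sample()   # one on/off decision per house
```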

1.2.2 D.2.2: Neural networks architecture and optimization

For MA-DQN as well as for MA-PPO-HE, every neural network has the same structure, except for the number of inputs and outputs. The networks are composed of 2 hidden layers of 100 neurons, activated with ReLU, and are trained with Adam (Kingma & Ba, 2017).

For TarMAC-PPO, the actor’s obs2hidden, hidden2key, hidden2val, hidden2query and actor MLPs (as shown in Fig. 8) all have one hidden layer of size 32. obs2hidden and actor are activated by ReLU whereas the three communication MLPs are activated by hyperbolic tangent. The hidden state x also has a size of 32.

The centralized critic is an MLP with two hidden layers of size 128 activated with ReLU. The input size is the number of agents multiplied by their observation size, and the output size is the number of agents.

For all networks, the inputs are normalized by constants approximating the mean and standard deviation of the features, to facilitate the training. The networks are optimized using Adam.
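For reference, the centralized critic described above corresponds to a module of roughly the following shape (a sketch; the observation dimension is illustrative):

```python
import torch.nn as nn

def make_centralized_critic(n_agents: int, obs_dim: int) -> nn.Sequential:
    """Two hidden layers of 128 units with ReLU; one value output per agent."""
    return nn.Sequential(
        nn.Linear(n_agents * obs_dim, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, n_agents),
    )

critic = make_centralized_critic(n_agents=10, obs_dim=20)
```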

1.2.3 D.2.3: Hyperparameters

We carefully tuned the hyperparameters through grid searches. Table 9 shows the hyperparameters selected for the agents presented in the paper.

Table 9 Training hyperparameters

Appendix E: \(N_\textrm{de}\) and per-agent RMSE

In this section, we discuss how the per-agent signal RMSE of an aggregation of N homogeneous agents changes when N is multiplied by an integer \(k \in \mathbb {N}\).

We consider the aggregation of size kN as k homogeneous groups \(g_j\) of N agents, each consuming a power \(P^j_{g,t} = \sum _i^N P^i_{a,t}\). We have: \(P_t = \sum _i^{kN} P^i_{a,t} = \sum _j^k P^j_{g,t}\).

We assume that each group tracks an equal portion of the signal, \(s^j_t = s_t/k\), and that its tracking error \(P^j_{g,t} - s^j_t\) follows a zero-mean Gaussian distribution with standard deviation \(\sigma _g\), uncorrelated with the errors of the other groups.

It follows from the properties of Gaussian random variables that the aggregation signal error \(P_t - s_t\) follows a Gaussian distribution of mean \(\mu _k = 0\) and standard deviation \(\sigma _k = \sqrt{k}\sigma _g\) for all \(k \ge 1\) with \(k \in \mathbb {N}\).

Hence the signal’s RMSE of a group of kN agents, which is a measured estimation of \(\sigma _k\), is approximately \(\sqrt{k}\) times the RMSE of a group of N agents, which estimates \(\sigma _g\). Finally, the per-agent RMSE is computed as the group’s RMSE divided by the number of agents. We therefore have that the per-agent RMSE of kN agents is approximately \(\sqrt{k}/k = 1/\sqrt{k}\) times the RMSE of N agents.

This discussion provides an intuitive explanation for the decrease of the per-agent RMSE as the number of agents increases. However, it relies on the assumption that the error of each group is unbiased, which is not necessarily true for our agents. This explains why the RMSEs are not 10 times lower when passing from \(N_\textrm{de} = 10\) to \(N_\textrm{de} = 1000\).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mai, V., Maisonneuve, P., Zhang, T. et al. Multi-agent reinforcement learning for fast-timescale demand response of residential loads. Mach Learn (2023). https://doi.org/10.1007/s10994-023-06460-4
