1 Introduction

Modern cyber-physical systems (CPS), e.g., cars, aircraft, advanced robots, and drones, are characterized by an increasing complexity that calls for new technologies and architectural solutions to guarantee predictability, safety, and security requirements. In addition, the increased level of autonomy specified for such systems requires the adoption of artificial intelligence (AI) and, more specifically, machine learning algorithms, which in turn imply heavy use of hardware acceleration to satisfy the stringent real-time constraints imposed by the applications.

Unfortunately, today’s AI algorithms are not yet ready to be integrated into mission-critical CPS, since their results cannot always be trusted and well-accepted engineering methodologies to mitigate the problem are still missing. A promising solution consists in coupling AI models with a set of classical algorithms that can take over the control of the system whenever the outputs produced by the AI are not deemed safe, with the aim of bringing the system into fail-safe or fail-operational conditions.

In such complex systems, at least two groups of software components can be distinguished, characterized by different sets of requirements and criticality levels:

  • Software components that require support from a rich execution environment (e.g., based on the Linux operating system), like AI algorithms, acquisition and processing stacks for complex sensors (such as cameras and LiDARs), and high-speed network communication services.

  • Software components that require a high-integrity execution environment (e.g., powered by a real-time operating system), like low-level control functions, safety-critical monitoring activities, and procedures to ensure fail-safe/fail-operational behavior.

The components belonging to the first group can be deemed non-critical for safety and security, provided that they are properly isolated from the critical components belonging to the second group. In this context, strong isolation is required to ensure that non-critical components cannot affect the execution of critical ones, including the guarantee that cyber-attacks and faults cannot propagate from the former to the latter.

Isolation could be achieved by executing these software components on different hardware platforms, e.g., by reserving an independent platform to host the execution of critical software only. However, in several cases, such as battery-operated flying drones, these features have to be provided under stringent resource constraints, imposing additional limitations in terms of space, weight, power, and cost (SWaP-C). For this reason, a more appropriate solution is to host the execution of software components with mixed and independent safety and security levels on the same hardware platform. Such systems are also referred to as mixed-criticality software systems and can leverage hypervisor technology to enforce isolation as well as enable the execution of multiple operating systems on the same hardware.

Mixed-criticality systems powered by hypervisor technology have been investigated for many years from different perspectives, especially in domains such as avionics (Gaska et al. 2011), aerospace (Crespo et al. 2009), and control (Crespo et al. 2018). Farrukh and West (2022) proposed a hypervisor-based architecture combining a Linux domain with a real-time critical domain on the same platform for a drone application, but no machine learning and hardware acceleration were exploited. Scordino et al. (2020) presented a modular hypervisor-based platform for industrial automation that integrates both real-time control code and software design tools, but no AI algorithms and FPGA acceleration were employed.

Similarly, the challenges of achieving real-time performance in AI-powered cyber-physical systems have been discussed and reviewed by several authors (Musliner et al. 1995; Radanliev et al. 2020; Seng et al. 2021). For instance, Wang and Luo (2022) presented a review on the optimal design of neural networks on FPGA platforms. Ji et al. (2021) presented the implementation of a deep neural network for real-time object detection and tracking on an embedded system based on an FPGA Zynq platform. Sciangula et al. (2022) proposed an efficient method for accelerating deep neural networks for autonomous driving applications on an FPGA-based SoC. None of these works, however, leveraged hypervisor technology to integrate mixed-criticality components.

A conceptual hypervisor-based architecture for supporting the execution of complex functionalities that are typical of AI-enabled CPS was proposed by Biondi et al. (2020); however, to the best of our knowledge, a practical solution that integrates the acceleration of deep neural networks with real-time control on a single platform using hypervisor technology is still missing.

1.1 Contribution

This work presents a concrete software architecture for supporting AI-enabled CPS with mixed-criticality components. The proposed architecture targets heterogeneous computing platforms that couple asymmetric multicores with programmable logic (FPGA). It leverages hypervisor technology with strong isolation, hardware acceleration of AI algorithms implemented in programmable logic, and monitoring strategies to take over the control of the system whenever non-critical software components fail, are attacked, or produce results that are deemed unsafe. The architecture is then specialized for the case of autonomous flying drones, showing how it can be used to build a safe and secure tracking application.

1.2 Paper organization

The rest of the paper is organized as follows: Sect. 2 discusses some relevant related work; Sect. 3 presents the general architectural approach; Sect. 4 describes how the proposed architecture has been instantiated to a specific use case consisting of a visual tracking application performed by a drone; Sect. 5 reports some experimental results; and, finally, Sect. 6 concludes the paper and presents some future work.

2 State of the art

To the best of our knowledge, there is no established approach in the literature for developing AI-powered cyber-physical systems; rather, several architectural solutions have been proposed by researchers in different contexts.

2.1 Hypervisor-based architectures

A classical approach to handling functions with different criticality requirements consists in executing them on separate computing platforms, typically managed by different operating systems that communicate through an external link, such as a serial line, a CAN network, or an Ethernet bus. Examples of CPS that adopt this approach are the Intel Ready-to-Fly (RTF) Drone (Intel Corporation), the Cube Autopilot (CubePilot), and several solutions that leverage the Robot Operating System (ROS). Gutiérrez et al. (2016) implemented a real-time publish-subscribe communication mechanism in the XtratuM (Crespo et al. 2009) hypervisor, integrating ARINC-653 with the Data Distribution Service (DDS). Biondi et al. (2021) proposed a hypervisor-based architecture for safety-critical embedded systems providing time/memory isolation, security, real-time communication channels, and I/O virtualization that allows different virtual machines to share peripheral devices. Farrukh and West (2022) proposed a hypervisor-based solution characterized by low overheads in accessing resources. Their approach requires strict timing guarantees for both domains, forcing the execution of Linux on a single core with SCHED_DEADLINE (Lelli et al. 2016); this is a viable solution in terms of real-time constraints, but it can introduce several limitations in the implementation of complex AI-based solutions.
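For reference, a Linux task is attached to SCHED_DEADLINE through the sched_setattr() system call; the following minimal sketch (the budget and period values are illustrative) reserves 5 ms of runtime every 20 ms for the calling thread:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Userspace definition of sched_attr (not exported by all libc versions). */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* ns */
    uint64_t sched_deadline;  /* ns */
    uint64_t sched_period;    /* ns */
};

int main(void)
{
    struct sched_attr attr = {
        .size           = sizeof(attr),
        .sched_policy   = SCHED_DEADLINE,
        .sched_runtime  =  5 * 1000 * 1000,   /*  5 ms budget per period */
        .sched_deadline = 20 * 1000 * 1000,   /* 20 ms relative deadline */
        .sched_period   = 20 * 1000 * 1000,   /* 20 ms period            */
    };

    /* Attach the calling thread (pid 0) to SCHED_DEADLINE. */
    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
        perror("sched_setattr");
        return 1;
    }

    /* ... periodic real-time work ... */
    return 0;
}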

2.2 Hardware acceleration

A peculiar feature of AI-powered cyber-physical systems is the massive computational workload required to execute AI algorithms such as deep neural networks. The most demanding functions of these algorithms need to be accelerated on specific hardware, such as general-purpose graphics processing units (GPUs) or field-programmable gate arrays (FPGAs), to satisfy real-time requirements.

Modern GPU-based heterogeneous platforms benefit from powerful and mature software support to accelerate AI algorithms, which allows the user to significantly contain the effort required to achieve efficient implementations of tasks like object detection, image segmentation, and tracking. The acceleration frameworks for GPU-based platforms also allow a developer to seamlessly use, with no or only a few modifications, the AI models available in frameworks such as TensorFlow, PyTorch, and Caffe, even with the native floating-point parameters.

On the other hand, when compared to FPGA-based platforms, GPU-based platforms are very demanding in terms of power consumption and struggle to provide a high degree of time predictability for hardware acceleration. Their power consumption can be one order of magnitude larger than that required by FPGA-based platforms (Sciangula et al. 2022; Qasaimeh et al. 2019). Furthermore, as observed by Cavicchioli et al. (2017), GPU acceleration introduces highly variable delays that cannot easily be bounded a priori, also due to the contention occurring on shared memory in the case of memory-intensive GPU tasks. As such, GPU-based platforms are not the ideal solution for battery-operated CPS such as drones.

Besides consuming less energy, FPGAs provide far more predictable execution behavior than GPUs for hardware acceleration.

Two main approaches are used to accelerate deep neural networks by means of FPGA technology:

1. The synthesis of a network-specific accelerator provides the best performance but suffers from poor flexibility and scalability, especially for large networks. To name one of the most relevant issues, this approach ends up deploying replicated logic that implements the same operation (e.g., convolutions) on different data. Some tools, such as HLS4ML (Fahim et al. 2021) and FINN (AMD Xilinx: FINN), provide IPs as standalone Verilog/VHDL entities, which can later be integrated into more complex designs. Another relevant limitation of this approach is that the generated IP must be entirely rebuilt whenever the network changes.

2. A more flexible solution is to accelerate neural networks by means of a dedicated softcore. For instance, Xilinx provides a Deep Learning Processor Unit (AMD Xilinx: DPU) as a library component in the Vitis-AI environment (AMD Xilinx: Vitis AI). Besides the evident flexibility benefits of a network-agnostic accelerator such as the DPU, an advantage of this approach is that a single DPU can concurrently accelerate multiple networks, whereas in the first approach the number of networks that can be accelerated is mainly limited by the amount of FPGA resources (such as LUTs). The structure of this usage pattern is sketched right after this list.

The main disadvantage of FPGAs is that they require a larger programming effort than GPUs, especially when developing a network-specific accelerator. Another restriction is the limited amount of FPGA resources available in several embedded platforms, which calls for dynamic partial reconfiguration (Biondi et al. 2016; Seyoum et al. 2021) of the FPGA at the cost of additional delays when serving acceleration requests. Furthermore, as observed by some authors (Vaishnav et al. 2018; Happe et al. 2015; Rupnow et al. 2009), even without dynamic reconfiguration, sharing an FPGA among tasks managed by a preemptive scheduling policy is not trivial, due to the significant amount of time required to save the state of the device.

Fortunately, especially in the case of softcore accelerators such as the DPU, compilation and optimization frameworks are available that drastically simplify the deployment of accelerated neural networks. These frameworks employ pruning and quantization (Zhou et al. 2017) of the network parameters to achieve an efficient execution on the FPGA. The accuracy drop caused by these optimization processes was found not to be significant in several application scenarios (Gholami et al. 2022; Liang et al. 2021). The optimization algorithms dealing with the conversion from floating-point to integer values are indeed now efficient enough to guarantee consistency in the transformation of the models from one platform to another (GPU to FPGA) (Ding et al.).
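As a minimal illustration of the quantization step, a symmetric linear scheme maps floating-point weights to 8-bit integers through a per-tensor scale factor; the sketch below is generic and does not reproduce the specific algorithm used by the deployment tools:

#include <math.h>
#include <stdint.h>
#include <stddef.h>

/* Symmetric per-tensor int8 quantization: q = round(w / s), s = max|w| / 127. */
static float quantize_int8(const float *w, int8_t *q, size_t n)
{
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++)
        if (fabsf(w[i]) > max_abs)
            max_abs = fabsf(w[i]);

    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (size_t i = 0; i < n; i++)
        q[i] = (int8_t)lrintf(w[i] / scale);

    return scale; /* kept to dequantize: w is approximated by q * scale */
}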

  • ZCU104 board by Xilinx/AMD, equipped with an Ultrascale+ MPSoC (XCZU7EV);

  • Xilinx/AMD DPU accelerator (DPUCZDX8G), to be deployed onto the FPGA fabric of the Ultrascale+;

  • CLARE-Hypervisor by Accelerat (Accelerat: The CLARE Software Stack);

  • Linux operating system for the non-critical, high-performance domain; and

  • FreeRTOS as the real-time operating system for the critical domain.

The ZCU104, although conceived as a development board, allows matching SWaP-C constraints for several target applications, at least in their prototype stage. At the same time, the amount of FPGA resources available in the MPSoC installed on the ZCU104 allows deploying peripherals that are missing on the board, with a significant speed-up and flexibility in the hardware setup. Finally, the Deep Learning Processor Unit (DPU) by Xilinx/AMD can be deployed on the available FPGA fabric to accelerate the inference of deep neural networks.

CLARE-Hypervisor has been selected because it explicitly targets mixed-criticality CPS applications. It has been designed to support modern heterogeneous platforms, such as GPGPU- and FPGA-based MPSoCs, and to safely and securely control their computational resources. CLARE-Hypervisor also provides multi-domain virtualization of the FPGA area, enabling strong isolation also for PL components such as hardware accelerators.

Linux has been selected as the GPOS for its extensive support for peripheral drivers, communication stacks, and modern AI frameworks.

FreeRTOS has been selected as the RTOS because its execution model is suitable for timing analysis and, thanks to its wide diffusion, it includes a rich set of drivers for low-level devices.

The resulting specialized architecture is illustrated in Fig. 1.

Fig. 1 Illustration of the proposed specialized architecture

4 The case for autonomous drones

This section describes how the architecture presented in Sect. 3 can be used to implement an AI-powered visual tracking application on a quadcopter drone equipped with an inertial measurement unit (IMU), a camera for object tracking, and two directional LiDAR sensors for obstacle detection, one pointing forward and one backward.

The overall block diagram of the multi-domain application that controls the drone is illustrated in Fig. 2, which also distinguishes the functions executed in the Linux domain (blue blocks) from those running in the critical domain (orange blocks). In particular, the ARMv8 processing system is partitioned across domains, so that three of the four cores are assigned to the Linux domain, while the remaining one is assigned to the FreeRTOS domain. The figure also highlights with a double border the modules that are either entirely implemented in FPGA or leverage the FPGA to accelerate some functions.

Fig. 2 Function diagram of the multi-domain application that controls the drone

The main task of the Linux domain is the inference of a deep neural network (DNN) for real-time multiple object tracking, using a strategy derived from DeepSORT (Wojke et al. 2017) and ByteTrack (Zhang et al. 2021). The generated bounding boxes, each paired with a unique object ID, are used to compute a setpoint for the low-level drone controller running in the critical domain. In this context, support for hardware acceleration is essential to achieve acceptable performance, because all state-of-the-art neural trackers generate a significant workload that has to be executed in real time (normally at the camera frame rate). Table 1 summarizes the main functions that compose the system.

Table 1 Application functions

4.1 Devices synthesized on the FPGA

The FPGA is used to synthesize a number of devices that are assigned to the virtual machines by the hypervisor. Each device is exclusively assigned in pass-through mode to a single domain, while the hypervisor is responsible for providing strong isolation. In particular, the Xilinx DPU device is accessible by the Linux domain, while all the other devices are assigned to the critical domain. The custom devices synthesized on the programmable logic are described in the following list:

1. DPU: The Deep Learning Processor Unit (DPU) is a softcore provided by Xilinx to efficiently accelerate the inference of deep neural networks.

2. Radio decoder: It takes the pulse position modulated (PPM) signal from the radio receiver, decodes it, and stores the corresponding digital values in a set of registers. Without the help of specialized hardware, PPM signals would have to be managed in software using, for instance, GPIOs configured to raise an interrupt at each signal edge. This may easily lead to poor performance and excessive interference on the processors due to interrupt handling and the consequent context switches. The use of a dedicated FPGA component to handle the PPM signal of the radio receiver hence relieves the processors from this burden and reduces the corresponding overhead and jitter (a minimal read-side sketch is provided after this list).

3. \(I^2C\) device: The ZCU104 board exposes an \(I^2C\) peripheral working with 1.8 V logic levels, while the adopted IMU works with 3.3 V logic levels. To avoid introducing third-party electronics to adapt the logic levels (e.g., a voltage-level translator), an AXI-based 3.3 V \(I^2C\) master device to be deployed on the FPGA was developed.

4. UART device: A custom AXI-based UART peripheral to be deployed on the FPGA was developed for the same reasons mentioned above, given that the adopted LiDAR works with 3.3 V logic levels.

5. PWM driver: It is used to generate pulse width modulation (PWM) signals to drive the drone motors. Although the Ultrascale+ MPSoC allows generating PWM signals by means of its triple timer counters (TTC), a specialized FPGA module was developed for the sake of simplicity and flexibility.

Efficient implementations of the drivers for the above peripherals (except the DPU) were developed from scratch to offload the CPU as much as possible, as well as to minimize execution-time variability and the number of memory accesses.
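To give an idea of how lightweight the resulting drivers can be, the following sketch reads the decoded radio channels from the memory-mapped registers exposed by the PPM decoder; the base address and register layout are illustrative assumptions rather than the actual design:

#include <stdint.h>

/* Illustrative register map of the FPGA radio decoder (assumed layout):
 * one 32-bit register per decoded PPM channel, starting at PPM_BASE. */
#define PPM_BASE      0xA0010000UL   /* hypothetical AXI address */
#define PPM_CHANNELS  8

static volatile uint32_t *const ppm_regs = (volatile uint32_t *)PPM_BASE;

/* Reading a channel is a single AXI read: no interrupts and no edge
 * timestamping in software, hence negligible CPU overhead and jitter. */
static inline uint32_t ppm_read_channel(unsigned ch)
{
    return ppm_regs[ch];  /* decoded pulse width, e.g., in microseconds */
}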

4.2 Inter-domain communication channels

The two domains exchange data by means of two non-blocking communication channels based on shared-memory regions provided by CLARE, where the Linux domain acts as a producer and the critical one as a consumer. The channels are accessed through a middleware (available for both Linux and FreeRTOS) that does not require the intervention of the hypervisor at each access and ensures wait-free synchronization in the presence of concurrent accesses. The first channel is used to exchange setpoints for the drone controller, whereas the second one is used to transmit heartbeat packets for health monitoring.

4.3 Linux domain

The Linux domain is responsible for visual tracking and navigation. It includes four tasks, namely Camera, Detector, Tracker, and HB generator. Details on these tasks are reported in Table 2.

Table 2 Linux application-level task set. Priority values range from 1 to 99, where higher values correspond to higher priorities

The Camera task periodically captures a new frame from the camera and puts it in a queue of frames ready to be processed. The Detector task performs object detection by accelerating the inference of a YOLOv3 network (Redmon and Farhadi) on the DPU.

Fig. 6 Frame rate distributions for the Detector task on a 3-core configuration without hypervisor (red bars) and with hypervisor (blue bars)

In another test, the object detection performance of the 3-core configuration with hypervisor has been compared with the one achievable on the full ZCU104 without hypervisor, that is, assigning all four cores to Linux. The results are illustrated in Fig. 7, which shows that, by allocating one extra core to Linux, the average frame rate of the object detection task increases from 27.5 FPS to 29.6 FPS.

Note that, in the 4-core configuration, the object detection pipeline uses four parallel threads to match the number of physical cores available on the platform. As expected, this leads to a performance increase, but the observed improvement is not significant, since the processing pipeline is constrained by the acquisition rate of the camera (30 FPS), which limits the benefit of the increased hardware and software parallelism.
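As a minor implementation note, each worker thread of the pipeline can be pinned to one of the cores assigned to the Linux domain through standard pthread affinities; a minimal sketch (core indices are illustrative) follows:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one core of the Linux domain
 * (e.g., cores 0-2 in the 3-core configuration). */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}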

Fig. 7 Frame rate distributions for the Detector task on a 3-core configuration with hypervisor (blue bars) and on a 4-core configuration without hypervisor (red bars)

Viewed from another angle, the results reported in Fig. 7 show that the two-domain architecture enabled by the hypervisor does not significantly degrade performance with respect to a full-platform configuration, while providing relevant advantages in terms of time predictability and security for the critical components of the system.

5.3 Application-level end-to-end delays

This section reports on two experiments aimed at measuring the time it takes for the generated data to propagate from one domain to the other. Since this operation involves a data exchange between two very different operating systems, the communication latency depends on several factors. In particular, from the moment the data is written by Linux into the shared memory, three factors come into play: (i) the period of the consumer task running in the FreeRTOS domain, (ii) the time at which this task is scheduled by FreeRTOS, and (iii) the interference experienced by the consumer from the other tasks, which depends on the assigned priorities and the task execution times.

Figure 8 illustrates a possible interleaving of the producer and consumer tasks where the delay is significant. In the figure, the message is sent by the producer at time \(t_1\), delivered to the other domain at time \(t_a\), and finally consumed at time \(t_2\). As can be seen, the overall end-to-end delay (\(t_2 - t_1\)) is given by the sum of the channel latency (\(L_c\)), the activation interval (\(A_c\)), the interference of the high-priority tasks (\(I_{hp}\)), and the computation time of the consumer task (\(C_c\)). Since the channel latency is always below one microsecond and the execution times of the FreeRTOS tasks are in the order of a few hundred microseconds, the major contribution to the end-to-end communication delay is due to the period of the consumer task, which is in the order of milliseconds.
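Putting these observations together, and denoting by \(T_c\) the period of the consumer task (so that \(A_c \le T_c\)), the end-to-end delay can be bounded as \(t_2 - t_1 = L_c + A_c + I_{hp} + C_c \le L_c + T_c + I_{hp} + C_c\). For the setpoint channel, \(L_c < 1\,\mu\)s, \(I_{hp} + C_c\) amounts to a few hundred microseconds, and \(T_c = 10\) ms, so the bound is dominated by \(T_c\); this is consistent with the measurements reported below.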

Fig. 8 Example of task interleaving characterized by a long end-to-end delay. In this case, the delay is mainly dominated by the activation interval \(A_c\), which is of the same order of magnitude as the consumer period \(T_c\) (ms), whereas the channel latency \(L_c\), the interference \(I_{hp}\) from the high-priority tasks, and the consumer computation time \(C_c\) are at least two orders of magnitude smaller

The best-case situation for the end-to-end communication delay is illustrated in Fig. 9, where the consumer task is executed just after the message is delivered to the FreeRTOS domain. In this case, the end-to-end delay is in the order of a few hundred microseconds.

Fig. 9 Example of task interleaving characterized by a short end-to-end delay. It is mainly caused by the channel latency \(L_c\) and the execution time \(C_c\) of the consumer task

The end-to-end delay measurements performed in this experiment confirm the observations reported above. Figure 10 shows the distribution of the end-to-end delay for the setpoint communication, measured over one hour of continuous execution from the time the setpoint is generated in Linux to the time it is read in FreeRTOS by the Safety module task, which has a period of 10 ms and the lowest priority.

Fig. 10 Setpoint transmission delay

As expected, the maximum observed delay was 9.5 ms (close to the 10 ms period assigned to the Safety module task), whereas the minimum observed delay was 890 \(\mu \)s, given by the sum of the task's own execution time and the interference suffered from higher-priority tasks.

Figure 11 shows the distribution of the end-to-end delay from the time the heartbeat is generated in Linux to the time it is received in FreeRTOS.

Fig. 11 Heartbeat transmission delay

In this case, the maximum observed delay was 3.74 ms (close to the 4 ms period of the HB checker task), whereas the minimum observed delay was about 303 \(\mu \)s, shorter than in the previous case because the HB checker task has the highest priority and hence cannot suffer interference from the other tasks.

5.4 Fault reaction time

A final experiment was carried out to measure the latency of the fail-safe procedure triggered by a system fault. For this specific test, the drone was programmed to track a target by controlling only the yaw angle. Accordingly, Fig. 12 reports the variation of the yaw angle during a tracking operation, when the backup controller is invoked to keep the drone in a safe state after a system fault is injected in Linux.

Fig. 12 Reaction time of the backup controller to a system fault

In this experiment, the heartbeat validation threshold was set to 3, meaning that the HB checker task triggers the fail-safe procedure after 3 consecutive readings of the heartbeat channel in the FreeRTOS domain without detecting an update. Since the period of the HB checker is 4 ms and the threshold is 3, the expected fault detection delay is between 8 and 12 ms. In Fig. 12, the measured delay is represented by the purple arrow between the two vertical red dashed lines, denoting the transient interval between normal functioning and fail-safe mode.
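A minimal sketch of this logic, written as a FreeRTOS task, is reported below; the task and hook names are illustrative, and the sketch assumes a tick rate of at least 250 Hz so that the 4 ms period is representable:

#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"

#define HB_PERIOD_MS  4   /* period of the HB checker task */
#define HB_THRESHOLD  3   /* consecutive stale readings before fail-safe */

extern uint32_t channel_read_heartbeat(void); /* wait-free channel read (hypothetical) */
extern void     enter_fail_safe(void);        /* activates the backup controller (hypothetical) */

static void hb_checker_task(void *arg)
{
    (void)arg;
    uint32_t last_hb = channel_read_heartbeat();
    unsigned stale = 0;
    TickType_t wake = xTaskGetTickCount();

    for (;;) {
        vTaskDelayUntil(&wake, pdMS_TO_TICKS(HB_PERIOD_MS));
        uint32_t hb = channel_read_heartbeat();
        if (hb == last_hb) {
            /* No heartbeat update observed in this period. */
            if (++stale >= HB_THRESHOLD)
                enter_fail_safe();
        } else {
            stale = 0;
            last_hb = hb;
        }
    }
}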

In this test, the fault was injected in Linux 6.235 s after the beginning of the plot, and the backup controller took over at \(t = 6.245\) s, that is, after about 10 ms. The orange horizontal line highlights the yaw angle recorded when the fail-safe mode was enabled. In this implementation, the backup controller has been programmed to keep the yaw angle at that reference value.

The plot shows that, when the backup controller was activated, the yaw angle was \(36.3^{\circ }\) and the flight controller acted to keep the yaw angle at this reference value, while simultaneously keeping both pitch and roll angles at \(0^{\circ }\) (hovering). Notice that a small overshoot occurs in most cases, since the yaw PID controller is the one with the least authority over the physical system. In fact, yaw actuation is generated by differences in motor torques, while pitch and roll are actuated by changing thrusts, so the faster the control (i.e., the larger the gains), the higher the overshoot to be compensated. This can be observed in the figure after the second red line, where the angle continues increasing for a few milliseconds, after which it converges to the reference angle recorded at the fail-safe activation. This behavior is caused by the system dynamics rather than by a delay in the execution of the flight control task.
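For illustration, the yaw-hold behavior of the backup controller can be sketched as a standard PID loop around the reference angle recorded at fail-safe activation; the gains and the angle-wrapping helper below are illustrative placeholders, not the values tuned for the drone:

#include <math.h>

/* Illustrative PID for holding the yaw angle recorded at fail-safe entry. */
typedef struct {
    float kp, ki, kd;
    float integ, prev_err;
} yaw_pid_t;

/* Wrap an angle error into (-pi, pi] so the controller always takes
 * the shortest rotation toward the reference. */
static float wrap_pi(float a)
{
    while (a >  (float)M_PI) a -= 2.0f * (float)M_PI;
    while (a <= -(float)M_PI) a += 2.0f * (float)M_PI;
    return a;
}

/* One control step: returns the yaw torque command. */
static float yaw_pid_step(yaw_pid_t *p, float ref, float meas, float dt)
{
    float e = wrap_pi(ref - meas);
    p->integ += e * dt;
    float d = (e - p->prev_err) / dt;
    p->prev_err = e;
    return p->kp * e + p->ki * p->integ + p->kd * d;
}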

Note that, without the fail-safe mechanism, the drone would no longer receive a setpoint and would therefore continue to apply the last valid setpoint received, thus continuing to rotate around itself. In a more complex use case involving multiple degrees of freedom, this situation would inevitably lead to dangerous consequences.