1 Introduction

Object detection (OD) models are commonly used in computer vision across various industries, including autonomous driving (Khatab et al., 2021), security surveillance (Kalli et al., 2021), and retail (Melek et al., 2017; Fuchs et al., 2019; Cai et al., 2021). However, these models are susceptible to adversarial attacks—a significant threat that can undermine their performance and reliability (Hu et al., 2021). This vulnerability affects both existing and newly developed OD-based solutions, such as OD-based smart shopping carts. Since existing automated retail checkout solutions often require extensive changes to a store’s facilities (as in Amazon’s Just Walk Out; Green, 2021), a simpler solution based on a removable plugin placed on the shopping cart and an OD model has been proposed, both in the literature (Oh & Chun, 2020; Santra & Mukherjee, 2019) and by industry. This computer-vision-based approach is already replacing traditional bar-code systems in supermarkets worldwide, signaling a shift toward enhanced shopping experiences. Nonetheless, the vulnerability of OD models to adversarial attacks (Hu et al., 2021) poses a potential risk to the integrity of such smart shopping carts and of similar OD solutions in other domains. For example, an adversarial customer could place an adversarial patch on a high-cost product (such as an expensive bottle of wine), causing the OD model to misclassify it as a cheaper one (such as a carton of milk). Given the rise in retail theft (Federation, 2022; Forbes, 2022), such attacks, if not addressed, would result in revenue loss for a retail chain and compromise the solution’s reliability.

Fig. 1 Adversarial theft detection when using X-Detect

Detecting such physical patch attacks, which are designed to deceive OD models into hallucinating a different object, presents a significant challenge in retail and in similar OD-based settings, since: (1) the defense mechanism (adversarial detector) should raise an alert within an actionable time frame, so that the theft can be prevented; (2) the adversarial detector should provide explanations for every raised adversarial alert, to prevent a retail chain from falsely accusing a customer of being a thief (Amazon, 2022); and (3) the adversarial detector should be capable of detecting unfamiliar threats (i.e., adversarial patches of different shapes, colors, and textures; Zhang et al., 2022) without interfering with detection in benign scenes and without being pre-exposed to adversarial examples. The main contributions of this paper are as follows:

  • To the best of our knowledge, X-Detect is the first explainable adversarial detector for OD models; moreover, X-Detect can be employed in any user-oriented domain where explanations are needed.

  • X-Detect is a model-agnostic solution. By requiring only black-box access to the target model, it can be used for adversarial detection for any OD algorithm.

  • X-Detect supports the addition of new classes without any additional training, which is essential in retail where new items are added to inventory on a daily basis.

  • The resources created in this research (the code implementation, the Superstore dataset, and the corresponding adversarial videos) can be used by the research community to further investigate adversarial attacks in the retail domain; they are available at the following link: https://github.com/omerHofBGU/X-Detect.

The structure of the upcoming sections is as follows: Sect. 2 provides relevant background on adversarial patch attacks and the computer vision techniques used. Section 3 reviews related work. Section 4 outlines X-Detect’s components and pipeline. Section 5 presents the evaluation and experimental settings. Section 6 presents the evaluation results and relevant insights. Finally, Sects. 7 and 8 contain the discussion and the conclusions and future work, respectively.

2 Background

Adversarial samples are real data samples that have been perturbed by an attacker to influence an ML model’s prediction (Chen et al., 2020; Chakraborty et al., 2021). Numerous digital and physical adversarial attacks have been proposed (Carlini & Wagner, 2017; Brown et al., 2019). X-Detect also builds on several computer vision techniques, including: style transfer—an image manipulation technique that extracts the style “characteristics” from a given style image and blends them into a given input image; scale invariant feature transform (SIFT) (Lowe, 2004)—an explainable image matching technique that extracts a set of key points that represent the “essence” of an image, allowing the comparison of different images (the key points are selected by examining their surrounding pixels’ gradients after applying varied levels of blurring); and class prototypes (Roscher et al., 2020)—an explainability technique that identifies the data samples that best represent a class, i.e., the samples that are most related to a given class (Molnar, 2020).
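To make the SIFT-based comparison above concrete, the following sketch counts matching key points between two images with OpenCV. It is a minimal illustration, not X-Detect’s exact configuration: the function name, ratio-test threshold, and matcher settings are assumptions.

```python
import cv2

def sift_match_count(img_a, img_b, ratio=0.75):
    """Count SIFT key-point matches between two images using Lowe's ratio test.

    A minimal sketch of the matching idea described above; the threshold and
    matcher choice are illustrative, not the paper's exact settings.
    """
    def to_gray(img):
        return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img

    sift = cv2.SIFT_create()
    _, desc_a = sift.detectAndCompute(to_gray(img_a), None)
    _, desc_b = sift.detectAndCompute(to_gray(img_b), None)
    if desc_a is None or desc_b is None:
        return 0
    matches = cv2.BFMatcher().knnMatch(desc_a, desc_b, k=2)
    # Keep a match only if it is clearly better than the second-best candidate.
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    return len(good)
```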

3 Related work

While adversarial detection in image classification has been extensively researched (Aldahdooh et al., 2022; Chou et al., 2020; Yang et al., 2020), only a few studies have been performed in the OD field. Moreover, it has been shown that adversarial patch attacks and detection techniques suited for image classification cannot be successfully transferred to the OD domain (Liu et al.,

4 The method

X-Detect’s design is based on the assumption that the attacker crafts the adversarial patch for a specific attack environment (the target task is OD, and the input is preprocessed in a specific way), i.e., any change in the attack environment will harm the patch’s capabilities. X-Detect starts by locating and classifying the main object in a given scene (which is the object most likely to be attacked) by using two explainable-by-design base detectors that change the attack environment. If there is a disagreement between the classification of X-Detect and the target model, X-Detect will raise an adversarial alert. In this section, we introduce X-Detect’s components and structure (as illustrated in Fig. 2). X-Detect consists of two base detectors, the object extraction detector (OED) and the scene processing detector (SPD), each of which utilizes different scene manipulation techniques to neutralize the adversarial patch’s effect on an object’s classification. These components can be used separately (by comparing the selected base detector’s classification to the target model’s classification) or as an ensemble to benefit from the advantages of both base detectors (by aggregating the outputs of both base detectors and comparing the result to the target model’s classification).
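As a rough sketch of this decision logic (not the paper’s exact implementation), the snippet below raises an alert when the selected base detector(s) disagree with the target model’s classification. The callables `target_model`, `oed`, and `spd` are assumed to return the class label of the main object in the scene, and the ensemble rule shown is only illustrative.

```python
def x_detect_alert(scene, target_model, oed, spd, mode="ensemble"):
    """Raise an adversarial alert when X-Detect disagrees with the target model.

    A hedged sketch: `target_model`, `oed`, and `spd` are assumed to return a
    class label for the main object in `scene`; the ensemble rule below is
    illustrative and may differ from the paper's aggregation.
    """
    target_class = target_model(scene)        # O_c predicted by the target OD model
    if mode == "oed":
        agrees = oed(scene) == target_class
    elif mode == "spd":
        agrees = spd(scene) == target_class
    else:
        # Illustrative ensemble: both base detectors must agree with the target model.
        agrees = oed(scene) == target_class and spd(scene) == target_class
    return not agrees                          # True -> adversarial alert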

Fig. 2 X-Detect structure and components

The following notation is used: Let F be an OD model and s be an input scene. Let \(F(s)=\{O_b,O_p,O_c\}\) be the output of F for the main object in scene s, where \(O_b\) is the object’s bounding box, \(O_c\in C\) is the classification of the object originating from the class set C, and \(O_p\in [0,1]\) is the confidence of F in the classification \(O_c\).
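For readers following the code, the output \(F(s)=\{O_b,O_p,O_c\}\) can be thought of as a small record; the field names below are hypothetical and chosen only to mirror the notation.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class MainObjectDetection:
    """F(s) for the main object in scene s (field names are illustrative)."""
    box: Tuple[float, float, float, float]  # O_b: bounding box (x1, y1, x2, y2)
    label: int                              # O_c: class from the class set C
    score: float                            # O_p in [0, 1]: confidence in O_c
```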

4.1 Object extraction detector

The OED receives an input scene s and outputs its classification for the main object in s. First, the OED uses an object extraction model to eliminate the background noise surrounding the main object in s. As opposed to OD models, object extraction models use segmentation techniques that focus on the object’s shape rather than on other properties. Patch attacks on OD models change the object’s classification without changing the object’s outline (Liu et al., 2022); therefore, the patch will not affect the object extraction model’s output. Additionally, by eliminating the scene’s background, the OED changes the object’s assumed surroundings, which may affect the final classification. The output of the object extraction model is then classified by the prototype-KNN classifier—a customized KNN model. KNN is an explainable-by-design algorithm that, for a given sample, returns the k closest samples according to a predefined proximity metric and classifies the sample by majority voting. Specifically, the prototype-KNN chooses the k closest neighbors from a predefined set of prototype samples P drawn from every class. Changing the ML task from detection to classification alters the assumed attack environment, and using prototypes as the neighbors guarantees that the set of neighbors properly represents the different classes. We chose KNN over other classification algorithms since it is relatively straightforward for humans to understand why a prediction was made. The prototype-KNN proximity metric is based on the number of identical visual features shared by the two objects examined. The visual features are extracted using SIFT (Lowe, 2004), which represents an object’s unique characteristics by a set of key points; the class of the prototype with the highest number of key points matching the examined object is selected. We chose SIFT since its matching points simulate the human thinking process when comparing two images. The OED’s functionality is presented in Eq. (1):

$$\begin{aligned} OED(s) = P_{KNN}\left(\max _{p_i\in P}\Big |SIFT\left(OE(s),p_i\right)\Big |\right) \end{aligned}$$
(1)

where \(P_{KNN}\) is the prototype-KNN, \(p_i \in P\) is a prototype sample, and OE is the object extraction model. The OED is considered explainable-by-design since: (1) SIFT produces an explainable output that visually connects the matching points in the two scenes; and (2) the k neighbors used for the classification explain why the prototype-KNN outputs the predicted class.
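A minimal sketch of Eq. (1) follows, reusing the `sift_match_count` routine sketched in Sect. 2. The prototype format, the value of k, and the `object_extractor` interface are assumptions made for illustration, not the paper’s exact implementation.

```python
from collections import Counter

def oed_classify(scene, object_extractor, prototypes, k=5):
    """Prototype-KNN classification of the extracted main object (Eq. (1) sketch).

    `object_extractor` is assumed to return the segmented main object OE(s);
    `prototypes` is assumed to be a list of (prototype_image, class_label) pairs.
    """
    extracted = object_extractor(scene)                       # OE(s)
    scored = [(sift_match_count(extracted, proto), label)     # |SIFT(OE(s), p_i)|
              for proto, label in prototypes]
    # The k prototypes sharing the most SIFT key points with the object...
    neighbours = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # ...vote on the object's class (majority voting of the prototype-KNN).
    return Counter(label for _, label in neighbours).most_common(1)[0][0]
```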

4.2 Scene processing detector

Like the OED, the SPD receives an input scene s and outputs its classification for the main object in s. First, the SPD applies multiple image processing techniques to s; then it feeds the processed scenes to the target model and aggregates the results into a single classification. The image processing techniques are applied to change the assumed OD pipeline by adding a preprocessing step, which limits the patch’s effect on the target model (Thys et al., 2019; Hu et al., 2021). The effect of the applied techniques on the target model itself needs to be considered, i.e., a technique that harms the target model’s performance on benign scenes (scenes without an adversarial patch) would be ineffective. The classifications of the processed scenes are aggregated by selecting the class with the highest probability sum. The SPD’s functionality is presented in Eq. (2):

$$\begin{aligned} SPD(s) = Arg_{m \in SM} \left(F\left(m(s)\right)\right) \end{aligned}$$
(2)

where \(m\in SM\) represents an image processing technique and Arg aggregates the main object’s classification probabilities over the processed scenes. The techniques in SM and the aggregation function Arg are selected empirically based on their effect on benign scenes (Sect. 5.4). The SPD is considered explainable-by-design since it provides explanations for its alerts: every alert raised is accompanied by the processed scenes, which serve as explanations-by-example, i.e., samples that visually explain why X-Detect’s prediction changed.
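The aggregation in Eq. (2) can be sketched as follows; here `target_model` is assumed to return a (class label, probability) pair for the main object in a processed scene, and the list of manipulations stands in for SM.

```python
from collections import defaultdict

def spd_classify(scene, target_model, manipulations):
    """Scene-processing detector (Eq. (2) sketch): query F on each processed
    scene and return the class with the highest probability sum.

    `manipulations` (SM) is a list of image processing functions m(s);
    `target_model` is assumed to return (O_c, O_p) for the main object.
    """
    scores = defaultdict(float)
    for manipulate in manipulations:
        label, prob = target_model(manipulate(scene))   # F(m(s))
        scores[label] += prob
    return max(scores, key=scores.get)
```

In practice, SM might contain simple operations such as blurring or grayscale conversion (illustrative examples); as noted above, the techniques are chosen empirically so that they do not harm the target model’s performance on benign scenes (Sect. 5.4).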

5 Evaluation

5.1 Datasets

The following datasets were used in the evaluation: Common Objects in Context (COCO) 2017 (Lin et al., 2014)—an OD benchmark containing 80 object classes and over 120K labeled images.

PASCAL 2012 (Everingham et al., 2012)—an OD benchmark containing 20 object classes and approximately 10K labeled images. A subset of this dataset was used to further analyze X-Detect’s low FPR.

Superstore—a dataset that we created, which is customized for the retail domain and the smart shopping cart use case. The Superstore dataset contains 2200 images (1600 for training and 600 for testing), which are evenly distributed across 20 superstore products (classes). Each image is annotated with a bounding box, the product’s classification, and additional visual annotations (more information can be found in the supplementary material). The Superstore dataset’s main advantages are that all of the images were captured by cameras in a real smart cart setup (described in Sect. 5.2) and it is highly diverse, i.e., the dataset can serve as a high-quality training set for related tasks.

5.2 Evaluation space

X-Detect was evaluated in two attack spaces: digital and physical. In the digital space, X-Detect was evaluated under a digital attack using the COCO dataset; in this use case, we used open-source pretrained OD models from the MMDetection framework’s model zoo (Chen et al., 2019). In the physical space, X-Detect was evaluated in a real smart shopping cart setup (Fig. 4a). In this setup, the frames captured by the cameras were passed to a remote GPU server that stored the target OD model. To craft the adversarial patches, we used samples of cheap products from the Superstore test set and divided them into two equally distributed data folds (15 samples from each class). Each adversarial patch was crafted using one of the two folds with the DPatch attack (Liu et al., 2018).

5.3 Attack scenarios

Figure 4b presents the five attack scenarios evaluated and their corresponding threat models. The attack scenarios can be categorized into two groups: non-adaptive attacks and adaptive attacks. The former are attacks in which the attacker’s threat model does not include knowledge about the defense approach, i.e., the detection method used by the defender. X-Detect was evaluated on four non-adaptive attack scenarios that differ with regard to the attacker’s knowledge: white-box (complete knowledge of the target model), gray-box (no knowledge of the target model’s parameters), model-specific (knowledge of the ML algorithm used), and model-agnostic (no knowledge of the target model). The patches used in these scenarios were crafted using the white-box scenario’s target model; they were then used to simulate the attacker’s knowledge in the other scenarios (each scenario targets different models). For example, to simulate the model-specific threat model, we evaluated target models that use the same algorithm as the white-box target model but have different structures and weights.

We also carried out adaptive attacks (Carlini et al., 2019). In the context of adversarial ML, an adaptive attack is a sophisticated attack designed both to deceive the target model (in our case, mislead the object detector) and to evade a specific defense method that is assumed to be known to the attacker (in our case, X-Detect). This type of attack takes the defense’s parameters into account during the optimization process, with the aim of making the resulting patch more robust to the defense. It is important to note that an adaptive attack is considered more difficult to perform, since the attacker adds another component to the optimization process, i.e., a component that should ensure that the attack is not detected by the known defense mechanism. We designed three adaptive attacks based on the patch attack of Lee & Kolter (2019), denoted LKpatch, which is presented in Eq. (3):

$$\begin{aligned} LK_{patch} = Clip\left(P - \epsilon *sign\left(\bigtriangledown _p L(P,s,t,O_c,O_b)\right)\right) \end{aligned}$$
(3)

where P is the adversarial patch, L is the target model’s loss function, and t is the set of transformations applied during training. Each adaptive attack is designed according to one of X-Detect’s base detector settings: (1) using only the OED, (2) using only the SPD, and (3) using an ensemble of the two. To adjust the LKpatch attack to the first setting (1), we added a component to the attack’s loss function that incorporates the core component of the OED—the prototype-KNN. The new loss component incorporates the SIFT algorithm’s matching-point output for the extracted object (with the patch) and the target class prototype. The new loss component is presented in Eq. (4):

$$\begin{aligned} OE_{SIFT} = -norm\left(SIFT\left(OE(s,P),p_{target}\right)\right) \end{aligned}$$
(4)

where \(p_{target}\) is the target class prototype and norm is a normalization function. In this setup, we do not consider the object extraction model since causing it to extract a wrong part of the image would omit the adversarial patch from the SIFT input, resulting in a low number of SIFT points with the target class and an unsuccessful attack.
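A sketch of the \(OE_{SIFT}\) term is given below. Note that SIFT matching is not differentiable, so this snippet only evaluates the term’s value (how the attack’s optimizer uses it is not shown here); the normalization constant, the `object_extractor` interface, and the `sift_match_count` helper (from Sect. 2) are assumptions.

```python
def oe_sift_loss(patched_scene, object_extractor, target_prototype, norm_const=500.0):
    """OE_SIFT term of Eq. (4) -- illustrative evaluation only.

    The attacker wants the extracted, patched object OE(s, P) to share many
    SIFT key points with the target-class prototype, so the negative
    normalized match count is added to the attack loss. `norm_const` is a
    hypothetical normalization; the paper's norm function may differ.
    """
    extracted = object_extractor(patched_scene)               # OE(s, P)
    matches = sift_match_count(extracted, target_prototype)   # |SIFT(OE(s,P), p_target)|
    return -matches / norm_const
```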

Fig. 4 Experimental evaluation settings: smart shopping cart setup (a), attack scenarios (b)

To adjust the LKpatch attack to incorporate the second setting (2), we added the image processing techniques used by the SPD to the transformations in the expectation-over-transformation functionality t. The updated loss function is presented in Eq. (5):

$$\begin{aligned} P_{ASP} = Clip\left(P - \epsilon *sign\left(\bigtriangledown _p L(P,s,t,sp,O_c,O_b)\right)\right) \end{aligned}$$
(5)

where sp denotes the image processing techniques used. To adjust the LKpatch attack to incorporate the last setting (3), we combined the two adaptive attacks described above by adding \(OE_{SIFT}\) to the \(P_{ASP}\) loss.
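The following PyTorch-style sketch shows a single \(P_{ASP}\) update step in which one transformation is sampled from the combined set of t and sp per step (an EOT-style approximation). The patch placement helper, the clipping range, and the differentiability of the sampled transformation are all assumptions made for illustration.

```python
import random
import torch

def apply_patch(scene, patch, y=0, x=0):
    """Paste the patch onto the scene at (y, x) -- a toy placement helper."""
    patched = scene.clone()
    patched[..., y:y + patch.shape[-2], x:x + patch.shape[-1]] = patch
    return patched

def p_asp_step(patch, scene, attack_loss, transforms, spd_manipulations, eps=0.01):
    """One update of Eq. (5) -- a hedged sketch.

    `attack_loss` is assumed to be a differentiable function L of the patched,
    transformed scene; the SPD's image processing techniques (sp) are added to
    the attack's expectation-over-transformation set (t).
    """
    transform = random.choice(list(transforms) + list(spd_manipulations))
    patch = patch.clone().detach().requires_grad_(True)
    loss = attack_loss(transform(apply_patch(scene, patch)))
    loss.backward()
    with torch.no_grad():
        # Clip keeps the updated patch in a valid pixel range (illustrative [0, 1]).
        updated = torch.clamp(patch - eps * patch.grad.sign(), 0.0, 1.0)
    return updated.detach()
```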

5.4 Experimental settings

All of the experiments were performed on the CentOS Linux 7 (Core) operating system with an NVIDIA GeForce RTX 2080 Ti graphics card with 24GB of memory. The code used in the experiments was written using Python 3.8.2, PyTorch 1.10.1, and NumPy 1.21.4. We used different target models, depending on the attack space and scenario in question. In the digital space, only the white-box, model-specific, and model-agnostic attack scenarios were evaluated, since they are the most informative for this evaluation. In the white-box scenario, the Faster R-CNN model with a ResNet-50-FPN backbone (Ren et al., 2015; He et al., 2016) in its PyTorch implementation was used (weights were taken from the torchvision library). In the model-specific scenario, three Faster R-CNN models were used—two with a ResNet-50-FPN backbone in the Caffe implementation, each with a different regression (IoU) loss, and a third with a ResNet-101-FPN backbone (weights were taken from the torchvision library). In the model-agnostic scenario, the Cascade R-CNN (Cai & Vasconcelos, 2019) and Grid R-CNN (Lu et al., 2019) models were used, each with a ResNet-50-FPN backbone (weights were taken from the MMDetection library), along with a YOLOv8L model (https://github.com/ultralytics/ultralytics; weights were taken from the Ultralytics library). In the physical space evaluation, all five attack scenarios were evaluated. In the adaptive and white-box scenarios, a Faster R-CNN model with a ResNet-50-FPN backbone was trained for 40 epochs with seed 42; its initial weights were taken from the torchvision library. In the gray-box scenario, three Faster R-CNN models with a ResNet-50-FPN backbone were trained (with seeds 38, 40, and 44); their initial weights were taken from the torchvision library. In the model-agnostic scenario, a Cascade R-CNN model, a Cascade RPN model (Vu et al., 2019), and a YOLOv3 model (Redmon & Farhadi, 2018) were used.

When examining the expected inference time overhead relative to the target OD model’s inference time (denoted \(T_{OD}\)), we can see differences between the two base detectors. The OED’s object extraction model has the same inference time as the OD model and can be run in parallel, i.e., the object extraction model does not add time to the inference process. The OED’s subsequent classification model, the prototype-KNN classifier, cannot be run in parallel to the OD model and adds \(T_{KNN}\) time to the inference. Note that \(T_{KNN}\) is smaller than \(T_{OD}\) given the nature of its internal algorithms (the KNN classification and the SIFT feature extraction algorithms). Therefore, the OED’s added time for a single scene can be estimated as \(T_{KNN}\). The SPD, on the other hand, adds more substantial time to a single inference: it repeatedly queries the OD model with the manipulated scenes, i.e., |SM| times, where SM is the group of manipulated scenes. Since the application time of each image processing technique is negligible, the SPD’s added time for a single scene can be estimated as \(|SM| \times T_{OD}\). Note that the time needed to aggregate the predictions is negligible.
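To make the overhead estimate concrete, here is a small illustration with hypothetical timings; the values below are placeholders chosen for the example, not measurements from the paper.

```python
# Illustrative overhead estimate with hypothetical timings (not measured values).
T_OD = 0.10            # target OD model inference time per scene, in seconds
T_KNN = 0.03           # prototype-KNN + SIFT classification time, in seconds
NUM_MANIPULATIONS = 5  # |SM|: number of processed scenes fed back to the model

oed_overhead = T_KNN                        # object extraction runs in parallel with F
spd_overhead = NUM_MANIPULATIONS * T_OD     # the SPD re-queries F once per manipulation
print(f"OED adds ~{oed_overhead:.2f}s per scene, SPD adds ~{spd_overhead:.2f}s per scene")
```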

In the evaluation, four detection approaches were used: the OED only, the SPD only, a majority-voting (MV) ensemble, and a 2-tier ensemble; however, the question of which approach is the most suitable for adversarial detection remains open. Each approach showed different strengths and limitations: (1) the OED detected most of the adversarial samples (high TPR) yet raised alerts on benign samples (high FPR); (2) the SPD detected fewer adversarial samples than the OED (lower TPR) yet raised fewer alerts on benign samples (lower FPR); (3) the MV ensemble reduced the gap between the two base detectors’ performance (higher TPR and lower FPR) yet had a longer inference time; and (4) the 2-tier ensemble reduced the MV ensemble’s inference time and improved the identification of benign samples (higher TNR and lower FPR) yet detected fewer adversarial samples (lower TPR). Therefore, the selection of the best approach depends on the use case. In the retail domain, it can be assumed that: (1) most customers would not use an adversarial patch to shoplift; (2) wrongly accusing a customer of shoplifting would result in a dissatisfied customer and harm the company’s reputation; and (3) a short inference time is vital in real-time applications like the smart shopping cart. Therefore, a company that prioritizes the customer experience would prefer the 2-tier ensemble approach, while a company that prioritizes revenue above all would prefer the OED or MV ensemble approach.

8 Conclusion and future work

In this paper, we presented X-Detect, a novel adversarial detector for OD models that is suitable for real-life settings. In contrast to existing methods, X-Detect is capable of: (1) identifying adversarial samples in near real-time; (2) providing explanations for the alerts raised; and (3) handling new attacks. Our evaluation in the digital and physical spaces, which was performed using a smart shopping cart setup, demonstrated that X-Detect outperforms existing methods in distinguishing between benign and adversarial scenes in the four attack scenarios examined while maintaining a 0% FPR. Furthermore, we demonstrated X-Detect’s effectiveness under adaptive attacks. However, it is important to acknowledge X-Detect’s limitations: (1) the OED requires prototype samples to be defined in advance for every class, which might not always be possible; and (2) in cases where there is no clear prioritization between false alarms and missed alarms, all of the detection approaches should be examined before selecting the most suitable one. Future work may include applying X-Detect in other domains and expanding it to detect other attacks, such as backdoor attacks.