Introduction

Eye-tracking systems were developed to provide estimates of a subject’s gaze, by processing data collected from a variety of devices, which can be grouped into three main categories: i) electrooculography (EOG); ii) scleral search coil; and iii) camera-based systems (Duchowski, 2007). EOG was the most commonly used device in the early years of eye-tracking and relies on a set of electrodes placed around subjects’ eyes to measure changes in skin electric potential corresponding to eye movements. Although this technique offers high-frequency measurements, it usually suffers from poor spatial precision. On the other hand, scleral coils provide both high temporal (up to 1000Hz) and spatial resolution (with errors lower than 0.08°). Search coils consist of a metallic wire, either embedded in a contact lens or surgically implanted on the sclera. Eye positions are estimated from the voltage signals induced by a set of orthogonal Helmholtz coils placed around the head (Robinson, 1963). Although this method is usually considered a gold standard, it is often not suitable due to its invasive nature and the complexity of the experimental set-up. Camera-based eye-trackers, in contrast, are non-invasive. The gaze direction is obtained by detecting the pupil shape and position in video recordings of the eye. With advances in hardware and image processing techniques, today, high-end camera-based eye-trackers can reach temporal and spatial precision comparable to search coils, thus becoming the most popular technique (Schmitt, Muser, Lanz, Walz, & Schwarz, 2007).

Most eye-tracking applications require a calibration procedure to estimate the model parameters and map the eye positions into the application space (typically a screen in front of the user). This step is critical to ensure accurate estimates and requires the adjustment of individual parameters. Calibration methods usually rely on the presentation of fixation targets at different known screen positions, while the subject is required to maintain fixation on each of them (Majaranta & Bulling, 2014). A minimum of four calibration points is generally required to compute the mapping function, but due to non-uniform variations across the visual field, a higher number of positions is often preferable to improve accuracy (Hansen & Pece, 2005; Nagamatsu et al., 2009; Hassoumi, Peysakhovich, & Hurter, 2019). Given the time and attentional demands, calibration procedures generally consist of 5 to 15 positions, as a trade-off between precision and subject effort.
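To make the mapping step concrete, the sketch below (in R, under the simplifying assumption of an affine mapping, with hypothetical variables `raw` and `targets` standing for the averaged uncalibrated gaze and the known screen coordinates of each calibration point) shows how such a function could be fitted and applied:

```r
# Minimal sketch of a calibration mapping, assuming a simple affine model.
# `raw` holds the average uncalibrated gaze (gx, gy) recorded for each target,
# `targets` the known screen coordinates (sx, sy) of those targets (hypothetical values).
raw     <- data.frame(gx = c(-120, 80, 85, -118, -20), gy = c(90, 95, -75, -80, 5))
targets <- data.frame(sx = c(-100, 100, 100, -100, 0), sy = c(100, 100, -100, -100, 0))

# One linear model per screen axis; more calibration points than the strict
# minimum improve robustness to fixation errors and non-uniformities.
fit <- lm(cbind(sx, sy) ~ gx + gy, data = cbind(raw, targets))

# Map new raw gaze samples into screen coordinates
predict(fit, newdata = data.frame(gx = -30, gy = 10))
```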

For most studies with humans, calibration procedures are straightforward. Instructions can be given directly to the subjects, who will maintain fixation on the different calibration points appearing successively on the screen. Targets are generally confirmed by pressing a key, or after the first detection of a fixation with a duration exceeding a given threshold (e.g., 400ms). With poorly cooperative or non-cooperative subjects, on the other hand - such as infants, people with a mental disorder, or animals - calibration procedures can be challenging. Subjects may not understand or follow the instructions, leading to errors in the parameter adjustment and gaze estimation. To illustrate these issues, we show in Fig. 1 the gaze positions for a calibration task performed by a monkey. Figure 1A corresponds to the coordinates recorded during repeated presentations of a single calibration point. The presence of different clusters indicates that the monkey looked at several positions rather than maintaining fixation on the target (which corresponds, in this example, to the largest cluster, located on the top-left). A similar behavior was observed for the whole calibration procedure with all five targets (Fig. 1B). Notice that, due to the high number of outliers (the gaze coordinates which do not correspond to target fixations), computing an average of eye positions, as usually done with cooperative subjects, would yield poor estimates.

Fig. 1

Gaze positions recorded during a calibration task performed with a Capuchin monkey. The calibration consisted of 50 trials of 1700ms, with one of 5 targets presented in each trial, in a randomized manner. In Figure A, the gaze positions correspond to all the presentations of the top-left target. In Figure B, we show the eye-tracking recordings for all 5 targets of the calibration procedure (colors corresponding to the target displayed at the time). Gaze values are uncalibrated, in an arbitrary range of values specific to the eye-tracker (in our system, comprised within the interval [-32768, 32767]), which is why units were omitted in these plots

A simple solution when dealing with non-cooperative subjects is to rely on input from an operator during the experiment, who manually assigns gaze positions to calibration points while looking at the eye-tracker’s live feed. However, as noted in Nyström, Andersson, Holmqvist, and Van De Weijer (2013), operator-controlled calibrations can suffer from inaccuracies, due to the subject’s behaviour and/or the operator’s experience: the operator has to monitor the stimulus presentation on the main screen, the subject, and the live feed from the camera, to check that the participant’s eye is being robustly tracked and that their gaze seems to be directed toward the target. Assessing whether the subject is looking at the correct position can be particularly difficult when dealing with animals. Another downside of these procedures is their latency, as the authors pointed out in the following quote:

“A drawback of this calibration method is the response latency due to visually inspecting the eye, determining whether to accept the calibration target, and executing the final motor action to click the ’Accept’ button. We know from standard reaction time tests - for instance, lexical decision tasks - that a decision takes from around 250 ms to perform (Nebes, 1978). This time may be long enough for the participant to shift their eyes or to make anticipatory eye movements to the next probable calibration target. If the subject blinks or makes a saccade during this button press, data deriving from periods when the participant is not fixating may be recorded, and the calibration will be inaccurate.” from Nyström et al. (2013)

In principle, calibration-free procedures might overcome such issues. Different methods have been proposed, which essentially rely on the computation of a 3D model of the pupil. These techniques are generally based on multiple cameras (Kohlbecher et al., 2008; Smith et al., 2005; Nagamatsu et al., 2009; Shih, Wu, & Liu, 2000), or on a single camera and several light sources (Hammoud, 2008; Morimoto et al., 2002; Nagamatsu et al., 2010). A main downside of these approaches is their poor spatial precision and accuracy, with errors above 2 visual degrees (Model, 2011), and often as high as 5 to 10 degrees, which is not suitable for many applications. Another drawback of calibration-free methods is the inherent complexity of the experimental setup, requiring multiple cameras and/or light sources, as well as higher computational power, since these methods often use more advanced computer vision techniques, such as face detection and head pose estimation. Finally, it is worth noting that, so far, all the calibration-less systems involving multiple cameras and/or light sources encountered in the literature are experimental eye-trackers, custom-built prototypes, rather than mature technologies commercialized by professional or consumer-oriented eye-tracker manufacturers.

Here, we propose an alternative calibration procedure that does not require manual intervention from either the user or the experimenter. Our method relies on the assumption that subjects, even uncooperative ones, will fixate the displayed targets for a sufficient part of the calibration procedure, as a result of a natural tendency to look at salient stimuli (in our experiments, colored targets shown on a black background).

A similar approach, described as “user-calibration-free”, was introduced by Model and Eizenman (2012). Subjects were not required to manually confirm each target, and the calibration parameters were estimated from all the gaze positions recorded during the presentation of each calibration point, by extracting the dominant bins from the histograms along both the vertical and horizontal dimensions. The study, conducted with infants, provided accurate results despite subjects attending to the targets for only a fraction of the time (as low as 47%). The assumption of this method was that incorrect fixations would follow a uniform distribution across the visual field, while the actual fixations on the target would be grouped into a single cluster. In such conditions, the highest bin in each dimension would indeed coincide with the center of the cluster, allowing for the identification of the main fixation coordinates.

In our experiments with monkeys, however, we noticed that the recorded eye positions presented two types of incorrect coordinates. First, scattered samples, dispersed across the screen, which resulted from saccades, sampling artifacts, or transient failures in pupil tracking due to occlusions, blinks, and light reflections. Second, clusters that occurred during incorrect fixations, when the subject was not looking at the target. We noticed, for instance, that subjects tended to look systematically at positions shown in previous trials, probably reflecting an expectation created by the repeated presentations (every target was shown multiple times in a randomized manner). These incorrect gaze positions, rather than being uniformly distributed, as assumed by Model and Eizenman, were grouped in clusters, corresponding to each of the incorrect fixations.

An essential limitation of Model and Eizenman’s (2012) approach is its poor performance on multimodal distributions with high contamination, such as those observed in our monkey recordings. This problem is particularly noticeable in a scenario where all the incorrect gaze coordinates are grouped in clusters (as in the illustrative example in Fig. 2). If several clusters of outliers share an overlapping position in one of the two dimensions, extracting the coordinate-wise median, or the highest bin for each component (which can be considered a coordinate-wise mode), will often result in coordinates outside of the cluster with the highest cardinality. Both approaches fail to take into account the relation between the two dimensions, and are sensitive to affine transformations. Considering incorrect positions exclusively distributed in clusters, Model and Eizenman’s method, as well as the coordinate-wise median, would tolerate outliers only up to 50%.

Fig. 2

Coordinate-wise median and mode in multimodal distributions. Both figures present the same data points, rotated around the origin by an angle of 35°. The distribution is composed of 500 samples grouped in three clusters (40% in the main cluster, in green, and 30% in each of the two outlier clusters, in red). These measures of central tendency are not affine equivariant and often fail to locate the main cluster when several groups share overlapping positions in one of the two dimensions

To overcome these issues, we developed a new approach to estimate the main mode of multivariate distributions. Our method, named BRILFootnote 1 (Bootstrap and Refine Iterative Locator), can withstand high levels of contamination, and consists of the following steps (a schematic sketch is given after the list):

  1. BOOTSTRAP: An initial location is computed through the recursive trimming of low-depth samples.

  2. REFINE: The central location from Step 1 is refined through two successive filtering processes. First, we sequentially remove the furthest outliers, ordered by their Euclidean distance to the first estimate, until reaching a unimodal distribution. The center of this temporary subset is then re-estimated as in Step 1, and the ordering of the remaining samples is adjusted using the new location and a robust estimate of scatter. The second filtering step consists in recursively trimming this subset until it fails to reject the normality hypothesis. The central location for this group is finally computed by averaging the samples of the last subset.

  3. ITERATE: Steps 1 and 2 are re-applied after removing the samples selected in Step 2 from the global distribution. This procedure is repeated until all samples have been assigned to a group, or until the distribution considered in Step 1 already appears unimodal before any filtering; in that case, a last execution of Steps 1 and 2 is performed, without leading to new iterations even if samples remain unassigned. The final estimate of the algorithm is obtained by selecting the center associated with the iteration whose group has the highest cardinality.
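The following R sketch illustrates the overall flow of these three steps. It is a deliberately simplified, self-contained illustration: the two helpers stand in for the Bootstrap and Refine steps (here a Mahalanobis-depth recursion and a plain distance cut-off), not the full procedures detailed in the next sections.

```r
# Schematic, runnable sketch of the BRIL iteration (simplified stand-ins, not
# the actual implementation distributed in the BRIL R package).
boot_step <- function(X) {                      # Step 1 (simplified): recursive depth trimming
  S <- X
  while (nrow(S) > 10) {
    d <- 1 / (1 + mahalanobis(S, colMeans(S), cov(S)))   # Mahalanobis depth
    S <- S[order(d, decreasing = TRUE)[1:ceiling(nrow(S) / 2)], , drop = FALSE]
  }
  colMeans(S)
}
refine_step <- function(X, center) {            # Step 2 (simplified): keep the closest half
  d <- sqrt(rowSums(sweep(X, 2, center)^2))
  which(d <= quantile(d, 0.5))
}
bril_sketch <- function(X) {
  groups <- list(); remaining <- X
  repeat {
    center  <- boot_step(remaining)
    members <- refine_step(remaining, center)
    groups[[length(groups) + 1]] <- remaining[members, , drop = FALSE]
    remaining <- remaining[-members, , drop = FALSE]
    if (nrow(remaining) < 20) break             # Step 3: iterate until samples are exhausted
  }
  main <- groups[[which.max(sapply(groups, nrow))]]     # group with highest cardinality
  colMeans(main)
}
```

The sections below replace these simplified stand-ins with the actual depth-based recursion and the unimodality and normality filters.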

The paper is organized as follows. In the next section, we detail the characteristics of the artificial and experimental datasets used to evaluate our approach, along with the testing procedure and metrics considered. In Section “Robust estimates of central tendency”, we introduce depth measures and study their performance in the context of multimodal distributions. Section “Recursive depth estimator with corrected location through filtering” describes the different components of our approach relying on these depth functions. We then show, in the fifth section, how convex body minimizers, such as the Minimum Volume Ellipsoid (MVE) and the Minimum Covariance Determinant (MCD) locators, can be used within our framework in place of depth measures. Finally, in Section “Results”, we present our results on both artificial and experimental data, showing that our approach is able to reliably estimate the center of the main cluster in multimodal distributions, even when facing a high quantity of outliers, as with eye-tracking calibrations performed by uncooperative subjects.

Methods

Synthetic dataset

The tests on synthetic data were performed using a Monte-Carlo design, creating mixtures of bivariate normals and a uniform distribution, with 500 repetitions for each combination of the following parameters: i) the number of samples, set to 500 (with the exception of the simulations in Sections “Effect of sample size” and “Execution time”, for which the sample size varied from 250 to 2000); ii) the uniform noise ratio, varying from 0 to 50%; iii) the number of clusters, chosen from the interval K = [2,5]; and iv) the proportion of inliers (i.e., the percentage of samples belonging to the main cluster over all the clustered data points, disregarding uniform noise), set within δ = [100/K,100] percent.

The cluster centers were randomly drawn at every Monte-Carlo repetition in the \(\mathbb {R}^{2}\) space of coordinates ([0, 50], [0, 50]), with a spatial constraint to avoid overlapping clusters, by setting a minimum distance between any pair of cluster centers of three times the sum of their standard deviations. A bivariate normal distribution was then created for every cluster, centered on each of these coordinates, with a standard deviation σ = 1 for the main cluster, and a random value between 0.5 and 1.5 for each of the secondary clusters. The cardinality of the main cluster was defined by |kp| = N × (1 − ε) × δ, in which N is the total sample size, ε the noise ratio, and δ the inliers ratio. On the other hand, the cardinalities of the other clusters were defined by |ki| = N × (1 − ε) × (1 − δ)/(K − 1), with K corresponding to the number of clusters.
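As an illustration, a single Monte-Carlo draw following this design could be generated as below (a simplified sketch: the minimum-distance constraint between cluster centers is omitted, and `make_mixture` is a hypothetical helper name):

```r
# Sketch of one synthetic mixture: K bivariate normal clusters plus uniform noise.
make_mixture <- function(N = 500, eps = 0.25, K = 3, delta = 0.4) {
  centers <- matrix(runif(K * 2, 0, 50), ncol = 2)          # cluster centers in [0, 50]^2
  sds     <- c(1, runif(K - 1, 0.5, 1.5))                   # sd = 1 for the main cluster
  n_main  <- floor(N * (1 - eps) * delta)                   # |k_p| = N (1 - eps) delta
  n_sec   <- floor(N * (1 - eps) * (1 - delta) / (K - 1))   # |k_i| for secondary clusters
  sizes   <- c(n_main, rep(n_sec, K - 1))
  clusters <- do.call(rbind, lapply(seq_len(K), function(k)
    cbind(rnorm(sizes[k], centers[k, 1], sds[k]),
          rnorm(sizes[k], centers[k, 2], sds[k]))))
  noise <- matrix(runif(2 * (N - sum(sizes)), 0, 50), ncol = 2)
  list(samples = rbind(clusters, noise), true_center = centers[1, ])
}
set.seed(1)
mix <- make_mixture()                                       # one Monte-Carlo repetition
```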

Real-world dataset

To assess the performance of our method in a real-world scenario, we analyzed eye-tracking calibration procedures performed in a series of experiments with monkeys. Data was obtained in the context of visual studies in capuchins, carried out at the Brain Institute of the Federal University of Rio Grande do Norte, Brazil.

Eye positions were recorded using the eye-tracking setup developed by Matsuda Keiji (2013, 2017). The system relies on a PointGrey Grasshopper3 infrared camera (GS3-U3-41C6NIR-C), with image acquisition set at a frequency of 235Hz. The system computes horizontal and vertical gaze angles from the detection of the pupil in the video stream from the camera, and a geometric model of the eye using the method referred to as “Head-fixed” in Matsuda, Nagami, Sugase, Takemura, and Kawano (2017). Gaze coordinates are uncalibrated, expressed in the camera’s system of coordinates, relying only on a passive procedure, detailed in Matsuda et al. (2017), to determine the rotation center of the pupil and the other parameters of the geometrical model as the subject spontaneously moves its eyes. These signals (vertical and horizontal raw positions), along with the pupil size and eye coverage, are then sent through a high-precision AO PCI Express card (digital-to-analog converter) on four channels with a ± 10V range.Footnote 2 The signals were then recorded on an external acquisition board (National Instruments E-Series board), at a sampling frequency of 1000 samples/s (oversampling was due to requirements of the experimental setup, involving additional analog inputs, such as intracranial local field potentials). Monkeys had their heads fixed during the recordings.

We analyzed three datasets of increasing contamination,Footnote 3 shown in Fig. 3, which reflected different levels of attention and cooperation. In every trial, a target was displayed for a duration ranging from 1200 to 3000ms, at one of the 5 following possible positions {(0,0); (100,− 100); (− 100,− 100); (− 100,100); (100,100)}.Footnote 4 The calibration sessions consisted of 50 trials, with every target presented multiple times, in a randomized manner. Table 1 summarizes the characteristics of each dataset, showing, in particular, the total number of samples collected for every calibration target, and the number of valid eye positions, after removing the incorrect values due to eye blinks, artifacts, or saturated values from the analog acquisition board.

Fig. 3

These three real-world datasets, showing different amounts of noise, consist of the gaze coordinates recorded during eye-tracker calibrations performed with Capuchin monkeys. Colors indicate which of the five calibration targets was displayed at the time of the recording. The data contamination can reflect both a lack of attention from the subject, attending other parts of the screen, and technical artefacts. Gaze values are uncalibrated, in an arbitrary range of values specific to the eye-tracker (in our system, comprised within the interval [-32768, 32767]), which is why units were omitted in these plots

Table 1 Summary of the real-world datasets

Ground truth coordinates were annotated using an interactive visualization and selection procedure. Taking into account the geometrical configuration of the calibration points, i.e., the first target, at coordinates (0,0), being at the center of the square formed by the 4 others (at the positions (100,− 100), (− 100,− 100), (− 100,100), (100,100), respectively), experts visually assessed the overall calibration area and the regions corresponding to each calibration point. For every session and target, the samples located inside the annotated rectangular bounding boxes were further refined by a contour selection based on local density estimates,Footnote 5 in order to remove parts of spurious clusters which were sometimes included at the edges of the rectangular area, or scattered samples corresponding to incoming and outgoing saccades. Finally, the remaining data points were averaged and the resulting position used as ground truth. The source code of this procedure is also available on the repository https://adrienbrilhault.github.io/BRIL/, with an illustrative video at https://www.youtube.com/watch?v=10ZapuMvK1s.Footnote 6

The distance to the screen was 57cm, such that 1 cm on the screen plane corresponds to approximately 1 visual degree (\(atan(\frac {1}{57}) \times \frac {180}{\pi } = 1.005086\)). The screen being 40cm wide and 30cm high, with a resolution of 1024 × 768 pixels, 1 visual degree therefore translates to around 25px (\(\frac {1024}{40 \times 1.005086} = 25.470\)). The recorded data was raw and uncalibrated. All of our methods were run on uncalibrated data, and the errors, measured in this reference frame, were converted a posteriori to screen pixels (which, in turn, can be converted to visual degrees with the formula above). This conversion was based on the ground truth reference coordinates of each calibration target (in the same uncalibrated reference frame). Since the calibration screen positions were (0,0), (100,− 100), (− 100,− 100), (− 100,100), and (100,100), each of the four outer points having a distance of \(\sqrt {100^{2} + 100^{2}} = 141.4214\) pixels to the central one (the first calibration point, at (0,0)), the corrective factor applied to raw errors was therefore 141.4214 divided by the average of the distances from the first point to each of the 4 others.
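The geometry and the corrective factor described above reduce to a few lines of arithmetic; in the sketch below, `gt` is a hypothetical 5 × 2 matrix holding the ground-truth coordinates of the five targets in raw eye-tracker units, with the central target in its first row:

```r
# Screen geometry: 57 cm viewing distance, 40 cm / 1024 px width
deg_per_cm <- atan(1 / 57) * 180 / pi          # ~1.005 visual degrees per cm
px_per_deg <- 1024 / (40 * deg_per_cm)         # ~25.47 px per visual degree

# Corrective factor from raw eye-tracker units to screen pixels, based on the
# 141.42 px distance between the central target and each of the 4 outer ones.
raw_to_px <- function(gt) {
  d <- sqrt(rowSums(sweep(gt[-1, , drop = FALSE], 2, gt[1, ])^2))
  sqrt(100^2 + 100^2) / mean(d)
}
# error_px  <- raw_error * raw_to_px(gt)       # raw error -> pixels
# error_deg <- error_px / px_per_deg           # pixels -> visual degrees
```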

Clustering algorithms

Clustering algorithms are an intuitive choice when dealing with mixtures of ellipsoidal distributions. In Brilhault et al. (2019), we showed that clustering techniques can indeed be highly effective in solving calibration issues with uncooperative subjects. The identification of the main fixations on a calibration target was obtained by partitioning the gaze samples using a clustering algorithm, and selecting the centroid of the cluster with the highest cardinality. The performance of various clustering methods was assessed on artificial and experimental datasets similar to those in the present work. Five algorithms were evaluated, selected from the most common clustering techniques:

  a) Partitional: PAM (Partitioning Around Medoids, Rousseeuw and Kaufman 1987) and TClust (a robust adaptation of the K-means algorithm, García-Escudero et al., 2008).

  b) Hierarchical clustering: HClust (Murtagh, 1983).

  c) Density-based: DBSCAN (Density-Based Spatial Clustering of Applications with Noise, Ester et al., 1996).

  d) Model-based: MClust (Gaussian mixture model fitted via the Expectation-Maximization algorithm, Scrucca et al., 2016).

Among all these methods, model-based clustering showed the best performance on artificial data, as expected, but poor results on experimental data. PAM and DBSCAN, on the other hand, presented satisfying results in both real and synthetic environments. To compare our new approach with the best candidates from our previous study (namely PAM, DBSCAN and MClust), we applied in this work the same methodology as in Brilhault et al. (2019). The clustering algorithms were applied to our datasets (both experimental and synthetic), and the main mode estimate was computed by averaging the coordinates of the samples belonging to the group with the highest count. PAM and DBSCAN parameters were selected through a grid search on the average silhouette, while MClust used an initialization through agglomerative hierarchical clustering and an EM fitting of Gaussian mixtures based on the Bayesian Information Criterion (BIC).
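The largest-cluster strategy itself is straightforward; the sketch below (assuming the cluster and dbscan R packages, and with fixed example parameters rather than the silhouette-based grid search) extracts the main-mode estimate from a partition:

```r
# Sketch: main-mode estimate as the centroid of the largest cluster.
library(cluster)   # pam()
library(dbscan)    # dbscan()

main_mode_from_labels <- function(X, labels) {
  valid <- labels != 0                         # dbscan labels noise samples as 0
  main  <- names(which.max(table(labels[valid])))
  colMeans(X[labels == main, , drop = FALSE])
}

# Toy data: a main cluster at (0,0) and a smaller one at (8,8)
set.seed(1)
X <- rbind(matrix(rnorm(400), ncol = 2), matrix(rnorm(200, mean = 8), ncol = 2))

center_pam    <- main_mode_from_labels(X, pam(X, k = 2)$clustering)
center_dbscan <- main_mode_from_labels(X, dbscan(X, eps = 1, minPts = 10)$cluster)
```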

Evaluation

To quantify the performance of the different central tendency measures and partitioning algorithms in locating the main mode of a compound distribution, we considered the following evaluation metrics: i) MeanError: the average Euclidean distance between the true center of the inliers group (or the ground truth coordinates for real-world data) and the location estimate (in the case of partitioning techniques, the center of the largest group provided by the partitioning algorithm); ii) SD: the standard deviation of measure i) across all Monte-Carlo repetitions (or across all calibration targets, for real-world data); iii) SSE: the sum of squared errors, from the Euclidean distances used in measure i); iv) HitRate: the proportion of simulations correctly identifying the largest cluster; v) MissRate: the complement of iv), equal to (1 − HitRate). Regarding measures iv) and v), considering that the main cluster is drawn from a multivariate normal distribution with a standard deviation σ equal to 1, and therefore that 99.7% of its samples are expected to fall within 3σ, we labeled as correct identifications (hits) all the simulations with a location estimate whose error to the true center was less than 3.
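For reference, these metrics amount to a few lines of R, given a matrix of location estimates (one row per Monte-Carlo repetition, or per calibration target for real-world data) and the true center:

```r
# Evaluation metrics for a set of location estimates against the true center.
eval_metrics <- function(estimates, true_center, hit_threshold = 3) {
  err <- sqrt(rowSums(sweep(estimates, 2, true_center)^2))  # Euclidean errors
  list(MeanError = mean(err),
       SD        = sd(err),
       SSE       = sum(err^2),
       HitRate   = mean(err < hit_threshold),   # estimate within 3 sd of the true center
       MissRate  = mean(err >= hit_threshold))
}
```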

Robust estimates of central tendency

In this section, we review different measures of central tendency that could be applied to noisy multivariate distributions such as those encountered in our context of application, and present their results on similar artificial mixtures.

Measures of central tendency and robust statistics have been extensively studied for univariate distributions. The mean, median, and mode are certainly the most common central estimates, and, when facing contaminated distributions, the mode and median are usually preferred due to their tolerance to outliers. The median can in fact sustain up to 50% contamination with regard to the breakdown property defined by Donoho and Gasko (1992). Univariate definitions of central tendency, location, scale, or outlyingness are, however, not trivial to generalize to multivariate data. For instance, the coordinate-wise median and mode, unlike the mean, are not affine equivariant and fail to consider the dependency between the different dimensions (Niinimaa & Oja, 1999). The extension of these concepts to higher dimensions has been the focus of numerous studies in the statistical community (Oja, 1983). John Tukey, for instance, introduced in 1975 the notion of data depth as a new estimate of central tendency in multivariate distributions (Tukey, 1975; Zuo & Serfling, 2000a). Associated with a given distribution F on \(\mathbb {R}^{p}\), depth functions are designed to provide an F-based center-outward ordering (and thus a ranking) of points in \(\mathbb {R}^{p}\), with high and low depths corresponding to “centrality” and “outlyingness”, respectively. The Multivariate Median (MM) associated with a depth measure is defined as the location that globally maximizes depth (Serfling, 2006).

Desirable properties of depth functions include monotonicity, uniqueness, maximality at the center, affine equivariance, and robustness to outliers (generally measured by the breakdown point or influence function) (Liu, 1990; Zuo & Serfling, 2000b). Several data depth functions and their corresponding MM have been proposed over the last decades, offering diverse definitions of centrality, and presenting differences in terms of theoretical properties, robustness, or computational complexity. Among the most popular, we find the following:

  • The Halfspace depth defines the depth of a point \(\theta \in \mathbb {R}^{p}\) relative to a dataset X = x1,...,xn as the minimal number of observations in any closed halfspace (i.e. half-planes when p = 2) that contains 𝜃 (Tukey, 1975; Donoho & Gasko, 1992). The Halfspace median (also called Tukey median) is the location 𝜃 with maximal depth.

  • Oja’s depth, also known as Simplicial Volume depth: given a set X of n points in \(\mathbb {R}^{p}\), the Oja depth of a point 𝜃 is the sum of the volumes of the p −variate simplices formed by 𝜃 and all subsets of p + 1 elements from X (in bidimensional data, it corresponds to the sum of the areas of all the triangles having 𝜃 as vertex). The point in \(\mathbb {R}^{p}\) with the maximal depth is the Oja median (Oja, 1983; 2013; Fischer et al., 2016).

  • Simplicial depth corresponds to the probability of a point \(\theta \in \mathbb {R}^{p}\) to be contained inside a closed random simplex of p + 1 observations from the distribution (Liu, 1990; 1988). The simplicial median, or Liu median, is the point included in the highest number of unique simplices formed by p + 1 samples.

  • The spatial median, also known as the median center or geometric median, corresponds to the point minimizing the sum of the Euclidean distances to all observations (Möttönen, Nordhausen, & Oja, 2010; Small, 1990; Lopuhaa & Rousseeuw, 1991; Chaudhuri, 1996). Different outlyingness measures can be derived from this median, such as the L2-Depth from Mosler (2012), based on the L2 distance to the spatial median (or its affine invariant version using the covariance matrix of the samples, the one used in this work), and the spatial depth introduced in Vardi and Zhang (2000) and Serfling (2002a), which relies on spatial quantiles computed from the distances of a point to each sample from the distribution (Serfling, 2006; 2002b; Zuo and Serfling, 2000c). Note that the terminology used in the literature can often be ambiguous: the spatial multivariate median, despite relying on L2 distances, is sometimes referred to as the L1-MM (and the associated data depth functions called L1-Depth by extension), as it minimizes the L1 norm of the vector of Euclidean distances; see Zuo and Serfling (2000b) and Dodge and Rousson (1999) for discussion.

  • The Mahalanobis depth of a point 𝜃 with respect to a distribution F in \(\mathbb {R}^{p}\) is defined as \(MD(\theta |F) = [1 + (\theta - \mu )^{t} S^{-1} (\theta - \mu )]^{-1}\), where μ and S correspond to the mean and covariance matrix of F, respectively (Masse, 2004; Zuo & Serfling, 2000c). A small code sketch of this formula is given after the list.

  • Projection depth: depth is determined as the minimum outlyingness of a point in relation to the univariate median of any one-dimensional projection of the distribution (Zuo, 2003; Zuo & Serfling, 2000a).

  • Convex hull trimming/peeling methods (Donoho & Gasko, 1992; Eddy, 1982; Green, 2006), such as the zonoid depth (Dyckerhoff et al., 1996; Koshevoy & Mosler, 1997): these methods usually construct nested convex hulls recursively, starting with the smallest convex hull which encloses all samples, then stripping away outlying data (the points on the hull boundary) and reiterating the process. These successive hulls delimit depth regions, the more central ones having higher depth. The corresponding median is the final point remaining, or, if it is not unique, the centroid of the last convex hull.
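As a concrete example, the Mahalanobis depth defined above can be computed directly from its formula, and the corresponding sample median (the deepest observation) obtained in a couple of lines; the other depth functions listed here are available in dedicated R packages.

```r
# Mahalanobis depth, MD(theta | F) = [1 + (theta - mu)^t S^-1 (theta - mu)]^-1,
# computed here with the empirical mean and covariance of the sample itself.
mahalanobis_depth <- function(X) {
  1 / (1 + mahalanobis(X, colMeans(X), cov(X)))
}

# Sample median associated with a depth function: the deepest observation.
sample_median <- function(X, depth_fun = mahalanobis_depth) {
  X[which.max(depth_fun(X)), ]
}
```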

To review the formal mathematical definitions of the depth measures listed above, their statistical properties, or their computation, the reader can refer to Mosler (2012), Rousseeuw and Hubert (2017), Small (1990), Zuo and Serfling (2000a), Aloupis (2001), and Serfling (2006). As an illustration of the behaviour exhibited by different depth measures on bivariate multimodal data, we provide in Fig. 4 examples of several depth functions applied to data taken from our eye-tracking calibration recordings.

Fig. 4

Depth maps computed on experimental data collected during an eye-tracking calibration. Colors indicate the depth of each region (the warmer, the deeper)

Multivariate medians

When outliers are randomly distributed, depth-based MM typically show good tolerance to noise, as illustrated in Table 2 (errors ranging from 0.228 to 0.286σ for 75% contamination). If outliers are grouped in clusters, on the other hand, medians derived from depth measures still present better results than the mean or coordinate-wise median, but their robustness significantly drops. As the outliers ratio increases, the medians tend to shift away from the main cluster center, attracted by other high density regions. These errors grow dramatically when the contamination gets past 50%, as one can observe in Figs. 5 and 6. These medians can therefore not sustain the high quantities of outliers typically encountered in our experimental datasets.

Table 2 Depth Medians average error to the center of a multivariate normal (σ = 1) for an increasing amount of uniform noise
Fig. 5

Tolerance of depth medians to clustered outliers. The four top-most figures show the average error to the true center as a function of inliers ratio (the variance of the inliers cluster being equal to 1, a distance of 3 corresponds approximately to the cluster outer limit, which includes about 99.7% of the samples). The four lower plots provide the percentage of successful identification of the main cluster (when the estimate error is inferior to 3). Results are reported for a number of clusters from 2 to 5

Fig. 6

Comparison of the Oja classic multivariate median (Med-Oja), sample median (Max-Oja), and the two-step method consisting of computing the sample median from the 10% deepest samples (Sup-Oja). The miss rates and errors are provided for different inliers ratios, on distributions composed of 3 clusters and 0 to 25% uniform noise. The average of samples (Mean) and Coordinate-Wise Median (Med-CW) are also reported as baselines

The computational complexity of MMs is usually high, involving for instance gradient searches over vast regions to find the location with maximal depth (Aloupis, Langerman, Soss, & Toussaint, 2003). Furthermore, when facing multimodal distributions, the deepest regions are often located in the empty space in between clusters. For these reasons, an alternative to the standard MM consists in selecting the sample with the highest depth, which not only allows faster computation, but also guarantees falling within one of the clusters in the absence of uniform noise. This estimate is called the sample median (Liu, 1990). We suggest an extension of this approach, selecting a small fraction of the samples with the highest depths, rather than the deepest point alone. This subset might still contain outliers (samples from other clusters and/or uniform noise), but usually in much smaller proportions than in the original distribution. From this subset, a new measure of central tendency can then be applied. If the samples are averaged, this operator would correspond to a depth-trimmed mean with a very high α parameter (the fraction of extreme values rejected) (Massé & Plante, 2003; Donoho & Gasko, 1992; Rousseeuw & Leroy, 1987). Instead of the mean, taking the sample median showed, in our simulations, more robustness (it requires recomputing the depth values with respect to this sub-population and selecting the deepest sample). Note that the concept of trimmed median only applies to the multivariate setting. In a univariate distribution, the median will remain the same regardless of the amount of trimming, as samples are removed from both ends in equal numbers. However, unless the distribution is centrally symmetric, that property does not hold for most multivariate depth functions.

To evaluate this depth-trimmed median, we tested different depth functions with the following two-step scheme: first, selecting the 10% deepest samples, and then estimating the depth-based sample median within this group. We show in Fig. 6 the results of this method (labelled as Sup), compared to the conventional MM (the location with the highest depth, labeled Med), the sample median (the deepest sample of the global distribution, labeled Max), as well as the arithmetic mean (Mean) and Coordinate-Wise Median (CW-Med) used as baselines. It is worth noticing that with multimodal mixtures, the Sup method shows slightly better results than the sample median, which itself offers improvements over the standard MM. Yet, despite these results, none of these methods offers enough robustness for the highest outlier ratios. In the next section, we introduce our new approach to overcome the limitations of multivariate depth medians when dealing with highly contaminated multimodal distributions.
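A minimal sketch of this two-step Sup scheme, using the Mahalanobis depth as a stand-in for any of the depth functions above:

```r
mahalanobis_depth <- function(X) 1 / (1 + mahalanobis(X, colMeans(X), cov(X)))

# "Sup" estimate: keep the 10% deepest samples, recompute the depths within
# this subset, and return its deepest point (the sample median of the subset).
sup_median <- function(X, depth_fun = mahalanobis_depth, keep = 0.10) {
  d   <- depth_fun(X)
  top <- X[order(d, decreasing = TRUE)[1:ceiling(keep * nrow(X))], , drop = FALSE]
  top[which.max(depth_fun(top)), ]
}
```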

Recursive depth estimator with corrected location through filtering

We describe in this section our new approach, named Bootstrap and Refine Iterative Locator (BRIL), which relies on three specific steps to identify the main mode of multivariate distributions: i) Bootstrap:Footnote 7 a recursive estimate based on depth measures; ii) Refine: a sequential procedure of outlier filtering, through unimodality and normality tests; iii) Iterate: an iterative identification of each cluster, in order to select the main mode, i.e., the group with the highest cardinality. Our algorithm is distributed as an R package available at adrienbrilhault.github.io/BRIL/. The following sections detail every step, with a close examination of the data to demonstrate the advantage of this method over the classic depth medians, the coordinate-wise median, and other techniques such as clustering.

BOOTSTRAP: recursive estimate of location based on depth median

As shown in Section “Multivariate medians”, classical measures of central tendency based on depth can withstand a certain quantity of contamination (clustered outliers and/or uniform noise), unlike other locators such as the arithmetic mean, which are strongly and regularly affected by outliers, even in small proportions.

However, when the percentage of outliers increases above 50%, their performance drops drastically, and they fail to identify the main cluster in most distributions (hit rates can get as low as 15%, as shown in Fig. 5). We should point out that while many studies in robust statistics consider 50% as an upper bound for the quantity of outliers, from the assumption that the true distribution should represent the majority of the data, this does not always hold for multimodal distributions. For example, in mixtures of Gaussians, the main sub-population (i.e., the inliers) can be defined as the group with the highest cardinality. In these conditions, the ratio of inliers can therefore drop to just above \(\frac {1}{K}\), where K is the number of components of the compound distribution. When considering mixtures which also include uniform noise, this ratio can drop even further.

Examining depth maps from distributions with less than 50% inliers, we see that the deepest sample generally does not belong to the main cluster, but to the space in between clusters. Nonetheless, the main cluster often shows higher depth than secondary ones. Illustrative examples are provided in Fig. 7 for the Oja and Projection depths. In both cases, the sample with the highest depth is located in between clusters, and, while the depth values of the uniform samples are spread out, those from each of the clusters tend to be relatively compact, in a narrow range, with the main cluster showing values on average higher than the other groups.

Fig. 7

Depth values from the Oja (top-most plots) and Projection depths (bottom-most plots) computed on the same distribution. The left-most figures show the depth values for each of the samples sorted by index (the first 120 points correspond to the main cluster, followed by 2 secondary clusters of 90 samples each, and finally 100 points of random noise). The right-most figures show these same samples in Cartesian coordinates, with their depth values represented on a color scale

From these observations, we see that selecting a fraction of the samples with the highest depth can result in a subset with a relatively small proportion of outliers, from which the central tendency is easier to estimate, as suggested in Section “Multivariate medians” with the two-step method labelled Sup. The choice of the optimal size for this subset is, however, delicate, as it strongly depends on the respective sizes of the clusters, their spatial configuration, and the separability of their depth values. By selecting too few samples, there is a risk of leaving out all the inliers, keeping a subset which does not intersect with the main cluster. On the other hand, if the group is too large, the proportion of inliers can be too low to compute a reliable estimate of central tendency. Selecting an optimal value for this parameter would require prior knowledge of the number of clusters, their size, and the quantity of uniform noise.

Instead of this two-step approach, we therefore suggest a more flexible methodology consisting in recursively computing the depth values, discarding a fixed fraction of the samples (those with the lowest depths), and re-iterating this process until reaching a sufficiently small number of samples. Keeping a subset that is too small at each iteration presents the risk of discarding too many inliers, as previously mentioned. Keeping too many samples leads to an increase in execution time, as more recursions are required. After preliminary simulation studies, we suggest using a 50% trimming as a trade-off between robustness and computational cost. Figure 8 exemplifies the execution of this recursive approach using Oja depths, keeping either 12.5% or 50% of the samples at each iteration. Note that when discarding a fraction of the samples with the lowest depths, ties are resolved using the same simplification as in Liu, Parelius, and Singh (1999) and Massé and Plante (2003), selecting the first samples encountered. Other options would be a random selection, or discarding all the samples sharing the same rank.
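A sketch of this recursive trimming, again with the Mahalanobis depth standing in for any of the depth functions discussed above:

```r
mahalanobis_depth <- function(X) 1 / (1 + mahalanobis(X, colMeans(X), cov(X)))

# BOOTSTRAP sketch: at each pass, keep the 50% deepest samples and recompute
# the depths, until a small subset remains; return its deepest point.
recursive_depth_median <- function(X, keep = 0.5, min_size = 10,
                                   depth_fun = mahalanobis_depth) {
  S <- X
  while (nrow(S) > min_size) {
    d <- depth_fun(S)
    S <- S[order(d, decreasing = TRUE)[1:ceiling(keep * nrow(S))], , drop = FALSE]
  }
  S[which.max(depth_fun(S)), ]
}
```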

Fig. 8

Recursive estimates of the Oja depth median for two different distributions (the first in Figures A and C, the second in B and D). Both mixtures were drawn from the same cluster coordinates and parameters (30% inliers, centered at (0,0), 22.5% of clustered outliers in each of the two other groups, and 25% uniform noise). The samples selected at each iteration are depicted in different colors, keeping either the 12.5% (Figures A and B) or the 50% deepest (Figures C and D). Keeping too small a proportion of the samples at each iteration often fails to identify the largest cluster, since the first iteration might not contain any inlier (as in Fig. B)

The advantages of this recursive framework are shown in Fig. 9, for distributions with less than 50% inliers. The sample medians (the sample with the highest depth, labelled Max) are located within the main cluster in 18 to 41% of our simulations, depending on the depth function considered. The Sup method (the two-pass scheme where the top 10% deepest samples are selected and used to re-compute the depth-based sample median) offers more robustness in terms of hit rates and average error. Yet, the recursive version (prefixed Rec) significantly outperforms the other approaches for most of the depth measures evaluated. Among those showing the best results, such as the Oja, L2, Mahalanobis, and Projection depths, the recursive estimate shows a miss rate less than half that of the sample median. This recursive locator not only identifies the main cluster more frequently, but also improves the precision of the location among the correct identifications, as shown in the right part of Fig. 9, since the removal of low-depth values prevents outliers from dragging the median away from the cluster center.

Fig. 9

Comparison of the sample median (Max), Sup method, and recursive median (Rec), on highly contaminated distributions. Left figure: proportion of estimates located outside the main cluster (with an error to its center above 3 SD), across all the simulations with a ratio of inliers inferior to 50%, a percentage of uniform noise from 0 to 25%, and a number of clusters from 3 to 5. Right figure: errors to the true center, considering solely the hits (estimates falling within the main cluster)

The results presented in Fig. 9 focused on highly contaminated data (above 50% outliers), situations in which most traditional methods show poor performance. As previously discussed, depth medians generally offer robust results when the overall quantity of inliers remains above half of the global distribution (see for instance Fig. 5). In these more favorable conditions, with distributions containing 50 to 75% inliers, we show in Fig. 10 that most standard sample medians do find the main cluster in 80-85% of our simulations, and present a relatively accurate center estimate. Nonetheless, even with small quantities of outliers, our recursive method still provides better performance. Recursive depth locators not only achieve hit rates close to 100%, but also show smaller errors and variability when the main cluster was found. As an example, the mean error obtained by the Oja sample median over all simulations with 50 to 75% inliers was 2.96, compared to 1.52 for its recursive version, and the median error dropped from 1.651 to 0.26. While to different degrees, all depth measures tested showed a significant improvement with the recursive approach.

Fig. 10

Comparison of the sample median (Max), Sup method, and recursive median (Rec) on low contamination distributions. Left figure: proportion of estimates located outside the main cluster, across all the simulations with a ratio of inliers between 50% and 75%, a percentage of uniform noise from 0 to 25%, and a number of clusters from 3 to 5. Right figure: errors to the true center, considering solely the hits

In conclusion, the results discussed in this section demonstrate that the recursive trimming of low-depth samples provides great advantages over the conventional depth medians, especially when datasets are highly contaminated. If we look for instance at the Oja depth estimates in distributions with less than 50% inliers, the sample median achieved hit rates of 39%, while its recursive version reached 75%. The mean error among hits also dropped from 1.880 to 0.327, and the overall median error from 9.260 to 0.386.

REFINE: improving the location by outliers filtering

Despite the improvements presented in the previous section, the recursive approach still fails to detect the main cluster in about a third of the simulations with high contamination, and the average distance of the estimate to the true center, when the main group is found, remains around half a standard deviation. These limitations lie in the fundamental nature of depth metrics, which were not intended for a multimodal scenario (Serfling, 2006). For example, if there are several clusters of large cardinality, a centrally located one will generally show higher depths than an outer cluster with a larger count.

The other steps of the BRIL algorithm, detailed in the next two sections, offer solutions to overcome these issues. While the recursive depth locator serves as the Bootstrap of the procedure, providing a first estimate of central location, the objective of the Refine step is to identify the samples belonging to the same cluster as this estimate. This stage serves a dual purpose: firstly, to refine the location of the estimate, by considering solely the samples belonging to the same sub-population, with a minimal number of outliers; and, secondly, to provide a partitioning that can be used in the next step of the algorithm, the Iterate procedure, which recursively removes from the global distribution the samples identified as part of each cluster, in order to select the group with the highest cardinality at the end of the iterations. This will, therefore, avoid selecting a small cluster if it was picked by the first execution of the recursive depth locator due to its higher depth.

In order to refine the location computed in the Bootstrap step, we employ a common outlier filtering methodology, consisting in recursively removing the most distant samples (Dempster & Gasko-Green, 1981; Rousseeuw & Leroy, 1987; Caroni & Prescott, 1992). This type of sequential procedure is generally recommended when the number of outliers is unknown (Thode, 2002). It relies on an outlyingness measure, used to select the most discrepant sample(s) to be removed at each iteration, and on a stopping criterion, generally under a Gaussian assumption. Our outlier removal strategy consists of two steps: the first based on Euclidean distances and a unimodality test, the second on robust distances and a normality test.

In the first filtering stage, since we do not have prior knowledge of the quantity and nature of the outliers, which can be uniformly distributed, organized in clusters, or both, Euclidean distances to the first estimate were used to select the furthest sample to be discarded at each iteration. If the mixture contains clusters of outliers, the histogram of distances will typically exhibit multimodal properties, with the first peak corresponding to the cluster selected, as illustrated in Fig. 11. In these conditions, progressively removing the most distant samples will eventually lead to a unimodal subset of distances, in which the samples from the cluster represent the great majority of the remaining data.

Fig. 11

Recursive filtering of outliers. Figure A shows an illustrative multimodal distribution. Euclidean distances from each sample to the estimate computed with the recursive depth method (indicated by the black cross) are presented in the form of histogram in Figure B, density in Figure D, and by index in Figure C (the first 400 samples corresponding to the main cluster, followed by 200 in the second and third clusters, and finally 200 points of uniform noise). The distances histogram and density function exhibit clear multimodal properties, but will converge to a unimodal setting once most of the outliers have been filtered by the progressive removal of the furthest samples

Tests for unimodality and multimodality have been extensively studied for univariate distributions (Cheng & Hall, 1999; Fischer, Mammen, & Marron, 1994; Hartigan, 2000). The most common techniques include: i) the Silverman bandwidth test (Silverman, 1981), which uses a normal kernel density estimate with increasing bandwidth; ii) the excess mass test proposed in Müller and Sawitzki (1991); and iii) Hartigan’s Dip test (Hartigan & Hartigan, 1985), which has been widely adopted due to its low computational complexity, its high statistical power, and the absence of tuning parameters. The extension of these tests to multivariate distributions is not straightforward, and several definitions of unimodality have been suggested for the multidimensional setting (Paez & Walker, 2018; Kouvaras & Kokolakis, 2007). Hartigan’s SPAN and RUNT statistics (Hartigan & Mohanty, 1992) and the MAP test (Rozál & Hartigan, 1994) are often cited as multivariate alternatives, but these types of procedures are usually far more complex than univariate ones, both conceptually and computationally, relying for instance on the construction of several spanning trees (Siffer et al., 2018). They also require a delicate adjustment of tuning parameters, and, for these reasons, most studies resort instead to the projection of multivariate data into one dimension to test for unimodality (Johnson & Wichern, 2007). To transform multivariate samples into univariate values, rather than linear or principal curve projections as in Ahmed and Walther (2012), we suggest considering the Euclidean distances of the samples with respect to the current central estimate, in a way that relates to the method from Siffer et al. (2018), who used distances to fold the multivariate distributions around a pivot. Our filtering method considers the previous recursive depth estimate as the pivot to compute the distribution of Euclidean distances, to which the Dip test is applied at each iteration, removing the most distant sample and repeating the procedure while the test rejects the hypothesis of unimodality with a sufficient confidence level (we set the significance level to 5%, i.e., stopping when the test p-value exceeds 0.05). This filtering process is illustrated in Fig. 12.
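A sketch of this first filter, assuming the diptest R package for Hartigan's Dip test:

```r
library(diptest)   # dip.test()

# First Refine filter: remove the furthest sample (Euclidean distance to the
# bootstrap estimate) until the Dip test no longer rejects unimodality of the
# remaining distances (p-value above 0.05).
unimodal_subset <- function(X, center, alpha = 0.05) {
  idx <- seq_len(nrow(X))
  d   <- sqrt(rowSums(sweep(X, 2, center)^2))
  while (length(idx) > 10 && dip.test(d[idx])$p.value < alpha) {
    idx <- idx[-which.max(d[idx])]             # drop the most distant remaining sample
  }
  idx                                          # indices of the unimodal subset
}
```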

Fig. 12

Iterative outlier removal based on Euclidean distances and the Dip test of unimodality. Samples are ordered by their Euclidean distance to the central location estimate obtained with recursive depth trimming (indicated by the red cross on the right figure). At each iteration, the Dip test is applied to these distances and the furthest sample is removed, until the test fails to reject the unimodal hypothesis. The p-values at each iteration are provided in the left figure, while the right plot shows the samples in Cartesian coordinates, with the unimodal subset selected at the end of the procedure displayed in blue

Once the unimodal subset is extracted, the second filtering phase relies on a normality test. An outlyingness measure is again required to determine which sample to discard at each iteration, but the Euclidean distances used in the first filtering step are no longer advisable, since the cluster is not necessarily spherical. In such a scenario, outliers can present smaller Euclidean distances than samples from the cluster. A common alternative when dealing with ellipsoidal distributions is the Mahalanobis distance. However, this measure is known to be sensitive to the presence of outliers, which strongly affect the data covariance matrix (Devlin, Gnanadesikan, & Kettenring, 1981). This problem is generally referred to as the masking effect, and is addressed by the use of robust estimators of location and scatter. Robust distances are traditionally defined by Eq. 1 (Hadi, 1992; Huber, 2011; Rocke & Woodruff, 1996; Rousseeuw & Kaufman, 1987), where T(X) is a central location estimator and C(X) a matrix of dispersion/scatter. Note that if we choose T(X) as the arithmetic mean and C(X) as the traditional sample covariance matrix, this equation becomes the Mahalanobis distance.

$$ RD_{i} = \sqrt{(x_{i} - T(X))^{t} C(X)^{-1} (x_{i} - T(X))} $$
(1)

In place of the classic covariance matrix, the scatter parameter provides an estimate less sensitive to outliers. Suitable robust estimators of location and scatter include the Minimum Covariance Determinant (MCD) (Rousseeuw & Driessen, 1999), the Minimum Volume Ellipsoid (MVE) (Rousseeuw, 1985), the Orthogonalized Gnanadesikan-Kettenring (OGK) (Gnanadesikan & Kettenring, 1972; Maronna & Zamar, 2002), M-estimators (Maronna, 1976; Tyler, 1987), S-estimators (Lopuhaa, 1989; Rocke, 1996), or MM-estimators (Tatsuoka & Tyler, 2000). A survey is available in Rousseeuw and Hubert (2013).

Estimators such as the MCD or MVE do not only provide a scatter estimate but also a center, which can be used as the location estimate T(X) for the computation of robust distances. The recursive depth estimate presented in Section “BOOTSTRAP: recursive estimate of location based on depth median” provides a robust estimate of central location even in contaminated distributions, and is therefore another good candidate. To improve its precision, we recommend applying this recursive depth-based locator to the unimodal subset selected in the first filtering step, instead of using the previous estimate, which was computed from the global distribution.

We compared three types of robust distances in preliminary studies, based on the MCD, MVE, and OGK, respectively. These three operators require a single parameter. For the MCD and MVE, it corresponds to the size of the subset used to compute the robust location and scale. This number was set to (n + p + 1)/2, where n corresponds to the number of samples and p to the dimension of the data, which is the most commonly adopted value in the literature since it offers good properties in terms of breakdown point (Lopuhaa & Rousseeuw, 1991; Hubert, Debruyne, & Rousseeuw, 2017). OGK, on the other hand, requires the choice of a robust univariate estimate of scale, such as the median absolute deviation (MAD), the τ scale (Yohai & Zamar, 1988), or the Sn and Qn operators (Rousseeuw & Croux, 1993). Maronna and Zamar (2002) offer a comprehensive review and comparison of these alternatives. Our implementation of OGK follows their recommendations, using the τ estimate.
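In practice, the robust distances of Eq. 1 can be obtained from the MCD scatter estimate (here via the robustbase package) combined with an externally supplied robust location, such as the recursive depth estimate recomputed on the unimodal subset; a sketch:

```r
library(robustbase)   # covMcd()

# Robust distances (Eq. 1): MCD scatter (default subset size ~ (n + p + 1)/2)
# combined with a robust location supplied by the caller.
robust_distances <- function(X, location) {
  mcd <- covMcd(X)
  sqrt(mahalanobis(X, center = location, cov = mcd$cov))
}
```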

Simulations over a wide range of mixtures showed similar results for each of these robust distances. An example is provided in Fig. 13, using 200 samples in the main cluster, 120 in each of the two secondary clusters, and 200 samples of uniform noise. As one can notice, the classic Mahalanobis distances are not suited to noisy data, even after filtering out a great part of the outliers through the unimodality tests. Robust estimates of location and scatter, on the other hand, offer reliable results, as they can withstand up to 50% outliers and are applied to the unimodal subset, whose contamination is typically far below that level.

Fig. 13

Robust estimates of location and scatter (the covariance confidence ellipses indicate the regions containing 97.5% of the samples under a normal assumption). The samples rejected after the first step of outlier deletion, based on Euclidean distances and unimodality tests, are depicted as red points, while the black ones constitute the selected unimodal subset. The red ellipse corresponds to the Mahalanobis distances computed from the whole distribution, while the orange one was estimated from the unimodal subset only. The blue, green, and yellow ellipses correspond to the robust distances estimated on the unimodal subset, based on the MVE, MCD, and OGK, respectively. The left figure shows the whole distribution, the right one a close-up on the main cluster

Based on these evaluations and its shorter computing time, the MCD method was chosen as the scatter estimate; the other estimators would nonetheless be suitable. As the location parameter, we used the corrected recursive depth estimate presented in Section “BOOTSTRAP: recursive estimate of location based on depth median”, applied to the subset selected in the unimodality-based filtering step. From the ordering of the samples provided by these robust distances, the final step of the Refine procedure consists in filtering the remaining outliers in the unimodal subset. A possible approach for outlier filtering, detailed for instance in Hadi (1992), Hubert and Debruyne (2010), Rousseeuw and Kaufman (1987), and Rousseeuw and Van Zomeren (1990), consists in removing at once all the outliers with a robust distance superior to a given threshold. Different values have been suggested, such as the square root of the 97.5% quantile of the chi-square distribution with p degrees of freedom (p being the data dimension) (Hubert & Debruyne, 2010; Korkmaz, Goksuluk, & Zararsiz, 2014; Etherington, 2019). The choice of this cut-off value remains however very sensitive, as it depends on the nature of the distribution, on the quantity of noise, and on the parameters of the robust distances (Filzmoser, Garrett, & Reimann, 2005). For these reasons, generic thresholds recommended in the literature can often lead to large errors, either false positives (outliers selected as part of the cluster) when set too high for the dataset considered, or false negatives (inliers rejected) when set too low. We therefore suggest instead an approach which does not require prior knowledge of the quantity of outliers, and consists in an iterative removal of the most distant samples, as in our first filtering step, until reaching a stopping criterion. Note that the opposite procedure is possible as well, starting with a small subset and adding samples at each iteration, as in Atkinson (1994) and Hadi (1992). With regard to the stopping criterion, different methods have been suggested, relying for instance on tests of normality, on the distribution of residuals, or on the convergence of parameters such as the energy or the dispersion matrix.

In order to search for clusters or modes in multivariate distributions, one must first define what is considered a group, or sub-population. While some studies have investigated atypical shapes such as rings, stars, or non-parametric models, most usually characterize clusters as groups of samples following a normal law, or presenting an ellipsoidal shape. As noted in Burman and Polonik (2009), the local distribution around a mode can be modeled as a multivariate normal under some smoothness assumptions. Adopting this widely used definition, which is consistent with our recordings of ocular fixations and with a great number of other measurements of natural phenomena, we based this second outlier-filtering step on a Gaussian assumption.

When considering multivariate samples, a rejection of the normality hypothesis in one of the components, by a univariate procedure, is enough to rule out multivariate normality (MVN), as all marginal distributions and linear transformations of a multivariate normal are themselves normal (Mardia, 1980; Looney, 1995). However, testing each variable separately is not sufficient to conclude on normality, as non-normal multivariate distributions can also have normally distributed marginals (Thode, 2002; Mecklin & Mundfrom, 2004). Therefore, MVN tests require the use of tailored statistics, accounting for the multivariate structure of the data. This topic has been the object of numerous studies over the last decades, leading to the creation of more than 50 procedures (Mecklin & Mundfrom, 2005; Looney, 1995). Reviews of some of these tests, with discussions on their consistency, power, affine invariance, and other properties, are available in Mecklin and Mundfrom (2004), Looney (1995), Andrews et al. (1973), Gnanadesikan (1977), Cox and Small (1978), Henze (2002), and Romeu and Ozturk (1993). Among the methods most frequently cited, we find: i) tests based on the empirical characteristic function, such as the Henze-Zirkler test (Henze & Zirkler, 1990) or the BHEP family (Baringhaus & Henze, 1988; Epps & Pulley, 1983); ii) the Royston extension of the Shapiro-Wilk univariate test (Royston, 1983; 1992); iii) tests based on skewness and kurtosis, for instance those from Mardia (Mardia, 1970) and Kankainen (Kankainen, Taskinen, & Oja, 2007), or the Doornik-Hansen omnibus test (Doornik & Hansen, 1994); and iv) the energy test from Székely and Rizzo (2005).

There is no consensus on the best way to assess normality in the multivariate scenario, each method presenting advantages and drawbacks depending on the number of samples, the parameters of the tests, or the alternative distributions considered. Power studies of these tests often report inconsistent results, and usually conclude that there is no method universally superior to the others in all the situations evaluated (Ma, 2012; Bogdan, 1999; Joenssen & Vogel, 2014; Farrell, Salibian-Barrera, & Naczk, 2007; Mecklin & Mundfrom, 2005; Horswell & Looney, 1992; Cardoso De Oliveira & Ferreira, 2010; Hanusz, Enomoto, Seo, & Koizumi, 2017; Naczk, 2005). Nonetheless, methods based on multivariate skewness and kurtosis have grown very popular and are often recommended, as they can detect various types of departures from normality, and generally present good properties such as affine invariance.

In this work, we adopted the Mardia tests, widely used in the literature, which rely on the extension of the third and fourth moments to multivariate data (Mardia, 1970). There are, in fact, two Mardia tests, based on skewness and kurtosis, respectively. Each of these moments can characterize and diagnose different departures from normality, and it is therefore usually advised to use both tests, as discussed in Kankainen, Taskinen, and Oja (2007), Thode (2002), and Zhou and Shao (2014). While some authors suggest a direct combination of both statistics through a single omnibus test (Mardia & Foster, 1983; Doornik & Hansen, 1994; Hanusz et al., 2017), this approach was questioned in several studies (Horswell & Looney, 1992). A simpler approach, which we adopted, is often preferred: the agreement of both measures, that is, rejecting the null hypothesis when either of the two tests indicates non-normality, as in Cain, Zhang, and Yuan (2017), Korkmaz, Goksuluk, and Zararsiz (2014), and Looney (1995). We also included another method in our experiments, which has been suggested for MVN testing, and consists in comparing the Mahalanobis distances to the chi-square distribution with p degrees of freedom through a Kolmogorov-Smirnov test (Malkovich & Afifi, 1973; Joenssen & Vogel, 2014; Etherington, 2019; Thode, 2002; Brereton, 2015). For both of these methods, similarly to the first step of the Refine procedure, the filtering was performed through a sequential procedure. At each iteration, the most distant outlier, based on the robust distances, is removed, until the MVN test fails to reject the normality hypothesis (we again used a significance level of 0.05). Figure 14 shows the results obtained on an illustrative distribution, presenting the p-values computed at each iteration and the final set of samples considered part of the cluster under this normality assumption. The two tests considered showed satisfactory results on both our experimental and synthetic datasets, and appear suitable for the outlier filtering procedure.
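As an illustration of this second filtering stage, the R sketch below computes Mardia's skewness and kurtosis statistics directly from their standard definitions and removes the most outlying samples one at a time until both tests fail to reject normality. It is a simplified sketch rather than the exact code used in our experiments (in particular, the usual sample covariance is used), and subset_xy and rd are assumed to be the unimodal subset and its robust distances from the previous sketch.

mardia_pvalues <- function(x) {
  n <- nrow(x); p <- ncol(x)
  xc <- scale(x, center = TRUE, scale = FALSE)            # centered data
  D  <- xc %*% solve(cov(x)) %*% t(xc)                    # D[i,j] = (xi-m)' S^-1 (xj-m)
  b1 <- mean(D^3)                                         # multivariate skewness
  b2 <- mean(diag(D)^2)                                   # multivariate kurtosis
  skew <- n * b1 / 6                                      # ~ chi2 with p(p+1)(p+2)/6 df
  kurt <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)  # ~ N(0,1)
  c(skewness = pchisq(skew, df = p * (p + 1) * (p + 2) / 6, lower.tail = FALSE),
    kurtosis = 2 * pnorm(abs(kurt), lower.tail = FALSE))
}

refine_mvn <- function(x, rd, alpha = 0.05) {
  keep <- order(rd)                                       # least outlying samples first
  while (length(keep) > ncol(x) + 2 &&
         any(mardia_pvalues(x[keep, , drop = FALSE]) <= alpha)) {
    keep <- keep[-length(keep)]                           # drop the most distant sample left
  }
  x[keep, , drop = FALSE]
}

cluster_samples <- refine_mvn(subset_xy, rd)
refined_center <- colMeans(cluster_samples)               # final location for this mode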

Fig. 14

Iterative outlier filtering based on normality tests. Left-most figures: p-values of the tests after each outlier removal (ordered by their robust distance with respect to the unimodal subset). Right-most figures: the whole distribution in Cartesian coordinates, with the selected samples depicted in blue. The top-most figures show the filtering based on the Mardia tests, whereas the bottom-most ones were obtained by comparing the Mahalanobis distances to the chi-square distribution through a KS test

Once the subset of samples belonging to the cluster has been identified through the successive normality tests, its center is computed using the arithmetic mean, which constitutes our final location for this mode. The results from this estimate, which we call “refined location”, are referred to as “BRL” (Bootstrap and Refine Locator). In Fig. 15, we present its results on the synthetic dataset, with the same measures as those used in Figs. 9 and 10. These values were computed for all the simulations with a proportion of inliers below 50%.

Fig. 15

Comparison of the sample median (Max), recursive median (Rec), and refined estimate (BRL). Left figure: proportion of estimates located outside the main cluster, across all the simulations with a ratio of inliers below 50%, a percentage of uniform noise from 0 to 25%, and a number of clusters from 3 to 5. Right figure: errors to the true center, considering solely the hits

Analyzing the performance of the refined locations, we notice a substantial improvement when compared to the estimate computed by the recursive depth method (REC). From the results presented in Fig. 15 (right), one can see that the errors were considerably reduced, with an average error around 0.1 standard deviation away from the true center in the case of hits. In turn, the percentages of estimates located outside the main cluster (i.e., the miss rates) were not statistically different between the BRL and REC methods. This finding was expected, as the refined location relies on the estimate obtained with the recursive depth method. Therefore, if the first recursive estimator selected a secondary cluster instead of the main one, the following refining steps using unimodality and normality tests will still apply to the same incorrect cluster. The slight improvements observed in some of the miss rates, while marginal, can be explained by the adjustment step applied after reaching a unimodal subset. When the initial estimate is located on the edge of, or close to, the main cluster, as sometimes happens with high outlier ratios, the location is considered a miss, since its distance to the true center is greater than 3σ. The application of the first filtering process using Euclidean distances and unimodality tests is usually sufficient to clear the initial distribution of most of the outliers, as long as the first estimate was closer to the main cluster than to secondary ones. Therefore, after reaching unimodality, the corrected estimate will likely fall inside this main cluster, and be considered a hit.

ITERATE: iterative identification and removal of clusters

With the refining step presented in the previous section, the precision of the center location reached highly satisfactory levels, with mean errors below 0.1 standard deviation, almost identical to the average computed from the ground-truth labels. However, there are still a number of simulations for which it fails to identify the main cluster, selecting instead a secondary one of smaller cardinality, as a result of the central preference of depth measures.

To overcome this limitation, we designed an iterative procedure which removes the samples identified in the Refine step from the whole distribution, in order to search for new clusters with the same methodology: applying the Bootstrap and Refine steps again on the remaining samples. This design relates to search and selection strategies proposed, for instance, in Greene et al. (2018). At each iteration, a new cluster is selected, labeled, and removed, until all samples are assigned to clusters, or until the remaining distribution is already unimodal before any filtering. In the latter scenario, the Refine procedure is still applied, yet, when it ends, the global iterative search terminates, even if some samples remain unassigned, in order to avoid creating several small spurious clusters from unimodal data (generally the uniform noise remaining after all the clustered samples have been removed).

Once all iterations have been processed, the main mode location is simply selected by choosing the center associated with the group of highest count among all clusters encountered but the last. If there was only one iteration, then the center of this single group is selected. Removing the last cluster from the candidates avoids situations where the uniform samples remaining in the final iteration would fail to reject the normality hypothesis during the Refine step and would therefore be considered as a cluster. If the quantity of uniform noise exceeded the cardinality of the main cluster, this uniform data could be erroneously selected as the main mode. Note that when there is no uniform noise, the depth-based sample median in the penultimate iteration of the procedure is guaranteed to find the larger of the two remaining groups, since it represents more than 50% of the distribution. In short, in any possible scenario, the last group encountered will not be the main cluster, and can therefore be removed from the candidates.
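The overall loop can be sketched in R as follows. Here bootstrap_refine() is a hypothetical placeholder standing for the Bootstrap and Refine steps described above (assumed to return the indices of the samples assigned to the cluster found, together with its center), and assessing unimodality through the distances to the coordinate-wise median is only one possible formulation.

library(diptest)

iterate_clusters <- function(x, alpha = 0.05) {
  remaining <- seq_len(nrow(x))
  clusters <- list()
  repeat {
    xr <- x[remaining, , drop = FALSE]
    ## unimodality of the remaining data, tested on the distances to their
    ## coordinate-wise median (illustrative choice of univariate projection)
    med <- apply(xr, 2, median)
    d <- sqrt(rowSums(sweep(xr, 2, med)^2))
    already_unimodal <- length(d) < 4 || dip.test(d)$p.value > alpha

    found <- bootstrap_refine(xr)                       # hypothetical placeholder
    clusters[[length(clusters) + 1]] <-
      list(members = remaining[found$members], center = found$center)
    remaining <- setdiff(remaining, remaining[found$members])

    if (already_unimodal || length(remaining) == 0) break
  }
  ## main mode: the largest cluster among all iterations but the last
  candidates <- if (length(clusters) > 1) clusters[-length(clusters)] else clusters
  sizes <- vapply(candidates, function(cl) length(cl$members), integer(1))
  candidates[[which.max(sizes)]]$center
}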

This procedure is exemplified in Figs. 16 and 17. Figure 16 shows the selections performed in the different iterations, leading to the identification of 3 groups in a sample distribution with a high quantity of outliers, both as clusters and as uniform noise. Each row corresponds to an iteration of the algorithm, and is decomposed into the successive steps we described. First, an estimate of the mode is computed through the recursive depth method (Bootstrap). Then, outliers are filtered with a recursive removal based on the Euclidean distance to this first estimate and the DIP test of unimodality (see the sketch below). Once the unimodal subset has been identified, the center is corrected through the same recursive depth method as in Bootstrap, and robust distances to this location are computed with the MCD estimator. These distances are used to order samples in the second phase of outlier filtering, progressively removing the most distant samples until the Mardia tests point to normality. The final estimate for each group is obtained by averaging the positions of this last subset. These samples are then removed from the global distribution, and the whole procedure is repeated. In this illustrative distribution, the procedure stops at the end of the third iteration, as the remaining samples are already unimodal before any filtering. Figure 17 shows all the groups identified at the end of the procedure. The largest one, depicted in green, is selected as the main mode due to its higher cardinality.
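For completeness, the first filtering step applied within each iteration can be sketched as below; the fraction of samples trimmed at each pass and the minimal subset size are illustrative choices for this sketch, not the exact values used in our implementation.

library(diptest)

unimodal_subset <- function(x, center, alpha = 0.05, step = 0.01) {
  d <- sqrt(rowSums(sweep(x, 2, center)^2))       # Euclidean distances to the estimate
  keep <- order(d)                                # closest samples first
  while (length(keep) > 3 &&
         dip.test(d[keep])$p.value <= alpha) {    # distances still multimodal
    n_drop <- max(1, floor(step * length(keep)))
    keep <- keep[seq_len(length(keep) - n_drop)]  # trim the most distant samples
  }
  x[keep, , drop = FALSE]
}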

Fig. 16

Iterative identification of multivariate modes. Each iteration consists in finding an initial seed through a recursive depth estimator (Bootstrap), then applying two successive filtering processes (Refine). The first is based on Dip tests computed from the Euclidean distances to the Bootstrap estimate, in order to extract a unimodal subset (as shown in the first two columns). The second, displayed in the last two columns, relies on robust distances and a multivariate test of normality (Mardia tests). The samples selected at the end of the two filtering steps are removed from the global distribution and the whole procedure is re-iterated on the remaining data. Each row of figures corresponds to an iteration. The stopping criterion was met when the remaining data showed unimodal properties at the beginning of the third iteration

Fig. 17

Groups identified after 3 iterations of the BRIL-Projection method. The estimate of the main mode, represented by a black cross, corresponds to the center of the group with the highest cardinality (405 samples in this example)

Similarly to the experiments presented in the previous sections, we analyzed the gains obtained on the synthetic dataset, which are summarized in Fig. 18. This final method, including the Iterate step and referred to as BRIL, presents outstanding results when compared to the other approaches. Besides keeping a very accurate localization of the center in case of hits, as with the BRL version, this iterative mechanism significantly reduces miss rates (i.e., the number of simulations where the main cluster was not encountered). BRIL estimates indeed showed miss rates close to zero for all the depth metrics tested. While a few samples can sometimes be misclassified in the Refine step, our method shows robust results and is able to identify all the actual clusters in the great majority of our simulations, regardless of their position, proximity, variance, or the presence of uniform noise. Moreover, the center locations are particularly accurate due to the low quantity of noise remaining after the outlier filtering procedures.

Fig. 18

Comparison of our final method (BRIL) with the sample median (Max), its recursive version (Rec), and its refined version (BRL), for different depth functions. The left figure shows the proportion of estimates located outside the main cluster (i.e., misses), across all the simulations with a ratio of inliers below 50%, a percentage of uniform noise from 0 to 25%, and a number of clusters from 3 to 5. The right figure shows the errors to the true center, considering solely the hits

Recursive MVE / MCD

Minimum Volume Ellipsoid (MVE) and Minimum Covariance Determinant (MCD) are popular robust estimates of location and scatter, due to their high breakdown point and fast computation (Rousseeuw & Kaufman, 1987; Rousseeuw & Driessen, 1999; Rousseeuw, 1985). Their location and scatter parameters are defined as the mean and covariance matrix of the h observations among n presenting the smallest volume, in the case of MVE, or the minimal determinant of the sample covariance matrix, in the case of MCD, with h usually chosen as (n + d + 1)/2, where n is the number of samples and d the number of dimensions (Becker & Scholz, 2004). As such, they also belong to the class of convex body minimizers.

These methods are often encountered in the robust statistics literature alongside depth measures, as they are likewise suited to deal with contaminated multivariate data, for purposes as diverse as noise filtering, classification, clustering, regression, and fitting. In the late 1970s, Sager also suggested the use of convex body minimizers to estimate the mode of multivariate distributions, through a recursive procedure selecting at each iteration the subset of samples of a given size with the smallest volume, and discarding the others, until reaching a group containing fewer than d + 1 elements (Sager, 1978; 1979). The estimate of the main mode is then obtained by averaging the samples from the last iteration. Unfortunately, the limited computational power at the time, combined with the absence of optimized algorithms to compute minimum-volume subsets, meant this proposal remained theoretical. Recently, a new study revived Sager’s idea, and compared three different implementations of this procedure based on MCD, MVE, and the Minimum Volume Convex Hull (MVCH) (Kirschstein, Liebscher, Porzio, & Ragozini, 2016).
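As an illustration, a Sager-style recursive mode estimator based on the MCD can be sketched in a few lines of R; the trimming fraction and the minimal subset size used as a stopping guard are practical choices for this sketch and do not reproduce the exact implementation evaluated below.

library(robustbase)

rec_mcd_mode <- function(x, trim = 0.5, min_size = 10) {
  d <- ncol(x)
  ## min_size is a practical guard for this sketch; Sager's original criterion
  ## stops when fewer than d + 1 samples remain
  while (nrow(x) > max(d + 1, min_size)) {
    fit <- covMcd(x, alpha = trim)          # h-subset with minimal covariance determinant
    if (length(fit$best) >= nrow(x)) break  # no further reduction possible
    x <- x[fit$best, , drop = FALSE]        # keep the optimal subset, discard the rest
  }
  colMeans(x)                               # mode estimate: mean of the final subset
}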

The methodology we introduced in Section “BOOTSTRAP: recursive estimate of location based on depth median”, estimating the mode through recursive filtering of a fixed fraction of the samples of lowest depth, presents similarities with the procedure theorized by Sager with convex body minimizers. We compare in Fig. 19 the results of the recursive MCD and MVE methods (Rec-MCD and Rec-MVE, respectively) with those from our recursive estimates based on depth measures. Rec-MCD and Rec-MVE show overall performance relatively similar to that of the locators based on the Oja, L2, Projection, or Mahalanobis depths. While their miss rates are very close to each other (from 26 to 31%), it is interesting to notice a slight advantage for the depth-based estimates in terms of localization precision.

Fig. 19

Comparison of recursive locators based either on convex body minimizers (MCD and MVE) or on depth measures. Metrics were computed across all the simulations with a ratio of inliers below 50%, a percentage of uniform noise from 0 to 25%, and a number of clusters from 3 to 5

Limitations

The intrinsic nature of depth measures and convex body minimizers differs on several aspects. Unlike depth functions, the MCD and MVE operators do not directly provide an ordering of the samples constituting the global distribution, but instead simply select a given fraction of the samples presenting the minimum volume or covariance determinant. It is nonetheless possible to derive outlyingness values from these subsets, as detailed in Section “REFINE: improving the location by outliers filtering”, by computing robust distances based on a single iteration of MCD/MVE (using the estimates of location and scatter from the samples selected). However, convex body minimizers being combinatorial methods, changes in the size parameter can lead to entirely different subsets, which will, in turn, radically change the ordering and outlyingness values. On the other hand, subsets obtained by selecting the n samples of highest depth will always be included in any subset of deepest samples of a size greater than n. For these reasons, the choice of the number of samples kept at each iteration of the recursive procedure is particularly important with convex body minimizers, as different values can result in non-intersecting sets, and therefore greatly affect the robustness of the method (Kirschstein et al., 2016). The commonly recommended value of (n + d + 1)/2 only guarantees convergence to the true mode under the assumption that the number of outliers also remains below (n + d + 1)/2. However, in the situations considered in this work, where the global distribution contains several clusters, this assumption does not always hold. When the principal cluster accounts for less than 50% of the whole distribution, the first iteration can often select a group including several secondary clusters, without samples from the main cluster, resulting in an incorrect mode at the end of the successive recursions, as illustrated in Fig. 20a and b. Even when samples from the main cluster are selected in the first iteration, not all of them might be kept, depending on the spatial configuration of the different clusters. This can lead to two sorts of errors. First, selecting a secondary cluster in the next iteration(s), when too many samples from the principal cluster were discarded in the first step (see Fig. 20c). Second, even if the correct cluster is chosen, its final center estimate (the result of the last iteration) might be corrupted if it was selected from a non-centered subset of the cluster (see Fig. 20d).

Fig. 20

Recursive MVE applied to four illustrative datasets. The colors represent the samples selected at each iteration (with a trimming factor of 0.5). The true mode (cluster with the highest cardinality) is centered at coordinates (0,0)

Application of the BRIL algorithm to Rec-MCD and Rec-MVE

Given the promising results of Rec-MCD and Rec-MVE summarized in Fig. 19, showing performance similar to that of our recursive estimates based on depth measures, we suggest the use of these convex body minimizers as bootstrap estimates in our algorithm described in Sections “REFINE: improving the location by outliers filtering” and “ITERATE: iterative identification and removal of clusters”. These additional steps aim at overcoming the limitations described above when inliers represent less than half of the distribution: first, by refining the location estimate through recursive trimming based on unimodality and normality tests; then, by iterating the global procedure of identification and removal of each cluster, in order to select the cluster with the highest cardinality as the main mode, in the exact same way as we did with the depth-based estimates.

The results of this approach are presented in Fig. 21, using the same nomenclature as in the previous sections, with the prefixes BRL and BRIL referring to the refined location and to the complete iterative procedure, respectively. Both of the convex body minimizer methods tested showed behavior very similar to that of the depth-based measures when integrated into the BRIL algorithm. We see, indeed, that the refining step increases the precision of the central location when the main cluster was correctly identified by the first Rec-MCD or Rec-MVE estimate, whereas the iterative procedure overcomes the issues encountered in distributions where an outlier cluster was selected in the first iteration, through the search for the other clusters and the selection of the largest.

Fig. 21

Comparison of BRL and BRIL procedures applied to recursive estimators based either on depth measures or on convex body minimizers, across all the simulations with a ratio of inliers below 50%, a percentage of uniform noise from 0 to 25%, and a number of clusters from 3 to 5

Our simulations showed no significant difference between MCD and MVE performances, consistent with the findings from Kirschstein, Liebscher, Porzio, and Ragozini (2016). The reader can refer to the following studies for discussions of their differences and properties (Becker & Gather, 2001; Becker & Scholz, 2004; Kirschstein et al., 2016; Woodruff & Rocke, 1994).

Results

General Results

We summarize in Table 3 the performances of our approach measured on the synthetic datasets. For all the depth functions and convex body minimizers considered, the results presented include: the standard multivariate median (Med, only applicable to depth measures); its recursive version (Rec); the BRL estimate (the recursive locator followed by filtering and correction, as presented in Section “REFINE: improving the location by outliers filtering”); and finally the complete BRIL procedure (with the selection of the iteration with the highest count, see Section “ITERATE: iterative identification and removal of clusters”). These results are compared with those from the best clustering algorithms identified in the study from Brilhault et al. (2019), i.e., PAM (partitional clustering), MClust (model-based), and DBSCAN (density-based), as well as the average position (Mean) and coordinate-wise median (Med-CW) as baselines. The data included all simulations with k = 2 to 5 clusters, 0 to 25% uniform noise, and an inliers ratio (within the clustered samples) set between 1/k and 0.75. A total of 15,900 different distributions were tested.

Table 3 Performance on all simulations with 2 to 5 clusters, 0 to 25% uniform noise, and an inliers ratio, among the clustered samples, set between 1/k and 0.75

Consistently with the results presented in the previous sections, we see that for every depth function, the Rec estimate does increase the percentage of hits over the standard MM, and also reduces the distance to the true center when the cluster was encountered (e.g., reaching 0.329 for Rec-Oja, 0.337 for Rec-Projection, and 0.332 for Rec-L2). In turn, the refined locator, while not affecting the hit rates significantly, offers a much more accurate estimate in case of hits (with average errors as low as 0.097 for BRL-Oja, or 0.096 for BRL-Projection and BRL-L2). Finally, the BRIL version shows the most reliable estimates, with hit rates ranging from 95.56% to 98.93%, and average errors, when the main cluster was found, from 0.100 to 0.101, on par with model-based clustering (0.103), and better than PAM (0.164) and DBSCAN (0.475). Figures 22 and 23 illustrate these remarkable performances for different numbers of clusters, as a function of the inliers ratio, and show that, unlike conventional depth medians, our new method offers high robustness even when the quantity of outliers exceeds 50%.

Fig. 22

Overall results of our approach with multimodal distributions, as a function of the inliers ratio. Dotted lines correspond to the standard MM median, dashed ones to the recursive estimate with correction (BRL), and solid lines to the BRIL version (including both the Refine and Iterate steps). For clarity, the results are provided for only three depth measures (L2, Projection, and Spatial). The four top-most figures present the mean error across all Monte-Carlo repetitions for different numbers of clusters, while the bottom-most ones show the percentages of estimates successfully falling within the main cluster. The final BRIL algorithm, regardless of the depth measure considered, reaches close to 100% hit rate as soon as the main cluster contains slightly more samples than the second biggest cluster

Fig. 23

Average error and percentage of hits, similarly to Fig. 22, for the BRIL algorithm based on L2 and Projection depths, as well as on the MVE convex body minimizer, compared with the three clustering algorithms (DBSCAN, PAM, and MClust), the coordinate-wise median (Med-CW), and the mean

Noise influence

The tolerance to uniform noise of the three standard clustering methods, as well as of our algorithm, is presented in Fig. 24. The clustering algorithm results are consistent with the findings from Brilhault et al. (2019). While DBSCAN results significantly deteriorate with increasing contamination, model-based clustering and PAM successfully maintain an accurate estimate of the main mode. Other clustering methods present much higher sensitivities to uniform noise, and were therefore discarded from the present study. Interestingly, when the main cluster is encountered, the BRIL estimates show a localization error as low as model-based clustering, a technique specifically designed for the type of synthetic distributions tested in our simulations. We see from Fig. 24 that our algorithm, no matter the depth measure or the convex body minimizer used, can withstand very high quantities of noise, most variants showing a successful identification in more than 90% of the simulations even when the uniform noise reaches 40% (corresponding to an overall percentage of inliers of only 24%, since the inliers ratio among clustered samples was also 40%).

Fig. 24

Effect of the quantity of uniform noise (from 0 to 250 samples, corresponding to 0 to 50% of the total mixture) on the global hit rate (percentage of simulations where the main cluster was found) and average error (when the cluster was encountered). Distributions were drawn with 3 clusters, 500 samples, and a proportion of inliers equal to 40% (with 30% in each of the two outlier clusters). Note that the proportion of inliers being relative to the clustered samples only, 40% inliers under a 50% noise setting actually means that inliers represent only 20% of the global mixture

Effect of sample size

In Fig. 25, we report the results of the different algorithms for sample sizes ranging from 250 to 2000. The simulations were run with mixtures composed of 3 clusters, including 30% of the samples in the main cluster, 22.5% in each of the two others, and 25% uniform noise. Sample size does not significantly affect the hit rates of the BRIL algorithm for any of the convex body minimizers and depth measures tested. In terms of localization precision (when the main cluster was encountered), we can notice a slight improvement with larger sample sizes, yet of low magnitude (the maximum difference observed from 250 samples to 2000 was below 0.1σ).

Fig. 25

Effect of sample size on the average error and hit rate. Distributions were drawn with 25% uniform noise, 3 clusters, and 40% of the clustered samples belonging to the main cluster (i.e. 30% of the global mixture). Sample size ranged from 250 to 2000 samples

The PAM and MClust clustering algorithms provide high invariance to sample size as well. DBSCAN, on the other hand, suffered from a mild deterioration of its performance as the sample size decreases (with a hit rate going from 79% for 250 samples to 92% for 2000, and a mean error, in case of hits, dropping from 0.494 to 0.221). This sample size effect with DBSCAN could be explained, as noted in Brilhault et al. (2019), by: (i) the parameter related to the minimum number of points, which, in our experiment, was constant regardless of the sample size; and (ii) the decrease of the relative difference of densities between the clusters and the uniform noise regions, which were randomly drawn in a spatial area of fixed size.

Execution time

The average execution times measured on our synthetic dataset are presented in Fig. 26. Our tests were conducted on a personal workstation using the R environment. These measurements are only indicative, as they obviously depend on the computer’s hardware, as well as on each algorithm’s implementation, optimization, and parameter sets, and on the distributions considered (some of these techniques having a higher computing cost when processing a larger number of clusters, for instance, despite the same total sample size). Yet, these processing times still provide insights on the potential application of our algorithm in a real environment, and on how it positions itself with regard to other methods. One can observe that the differences between the sample median and the recursive version are relatively small. The recursive method, although significantly improving the precision and percentage of hits, has a negligible impact on computing times for the sample sizes tested. The refined locator does increase these processing times considerably, but they remain acceptable (under one second for 1000 samples with our experimental setup), and the final iterative version, identifying the other modes/clusters of the distribution, is on average two to three times more costly than the refined estimate. Results obtained with convex body minimizers present time ranges similar to those from depth measures, with the exception of Oja-based estimates, which show significantly longer processing times. Finally, looking at clustering techniques, DBSCAN is by far the fastest to compute, but shows poor performance on contaminated distributions. The PAM and MClust algorithms, showing better localization precision than DBSCAN on our synthetic dataset, have in turn much longer processing times. In the testing environment considered, and disregarding the Oja depth, our method is generally slightly faster than MClust, significantly quicker than PAM, but much slower than DBSCAN.

Fig. 26

Average computing times. The first row presents the cumulative impact on processing time of the different steps of our algorithm, namely the recursive estimate (REC), the refined version (BRL), and the full iterative procedure (BRIL). These results are provided for four illustrative methods (Tukey depth, Spatial depth, Projection depth, and MVE). The second row reports the average processing time for all BRIL methods, along with the clustering algorithms. The mixtures tested included from 3 to 5 clusters and 25% uniform noise. 500 repetitions were run for each sample size, which ranged from 250 to 2000

Real-world application

As mentioned in the introduction, eye-tracking research with primates, or with non-cooperative subjects such as babies or young children, can be quite challenging. In order to compute the user’s gaze accurately, and in a meaningful reference frame, a calibration procedure is generally required to estimate individual parameters and the mapping function to the screen coordinates. To compute this projection, a set of reference points is necessary. When working with cooperative subjects, this data is generally acquired through a calibration routine in which several calibration targets are presented successively on the screen while the user manually confirms their fixations. With non-cooperative subjects, however, the main issue resides in the fact that there is no direct way to assess the intervals during which they were actually looking at the calibration points. Therefore, among the full set of gaze coordinates collected during the presentation of each target, only part of the samples will correspond to valid positions, the rest corresponding to times when the subject was attending to other parts of the screen. These situations typically lead to multimodal distributions, composed of several clusters, corresponding to the different screen locations that the subject fixated, and of scattered samples resulting from ocular saccades (when the eye is moving from one position to another), eye blinks, or artifacts from the recording equipment.

Instead of relying on the manual confirmation of a fixation to the target by the user, some eye-trackers also provide an automatic setting, where the system terminates each trial once it detects a fixation with a duration exceeding a given threshold (e.g., 400 ms), independently of its position (since the data is still uncalibrated, the gaze location cannot be estimated in the screen reference frame at that point). To emulate this behavior, and to serve as baselines, we applied to our real-world datasets a series of additional techniques based on fixation and saccade detection. Until now, all the methods presented (our approach, the clustering algorithms, the coordinate-wise median, and the mean) ignored the temporal nature of eye-tracking signals, relying solely on the spatial coordinates of samples. Taking into account their temporal dimension, it is possible to segment each trial into a succession of fixation and saccade events. Many algorithms have been proposed to this end. The most common ones rely on velocity or acceleration thresholds, others on dispersion, or on more complex techniques such as clustering, Markov models, or minimal spanning trees (Zemblys, Niehorster, Komogortsev, & Holmqvist, 2018; Salvucci & Goldberg, 2000; Komogortsev & Karpov, 2013; König & Buffalo, 2014). Discussing the characteristics, advantages, and weaknesses of each of these methods is beyond the scope of this work; however, an important point to note is that, to this day, none of these algorithms is considered universally superior to the others (Andersson, Larsson, Holmqvist, Stridh, & Nyström, 2017). The choice of a given technique and the reliability of its results will generally depend on the specificities of each dataset, such as: 1) the level of contamination (e.g., the frequency of failures in the pupil tracking); 2) the overall accuracy, precision, and speed of the system; 3) the type of users (humans or animals, cooperative or not, ...); and 4) the experimental protocol (free-viewing, saccadic choice paradigm, reading, smooth pursuit, etc.). Moreover, most of these techniques require a set of parameters which are themselves dependent on the dataset.

For these reasons, relying on fixation detection to identify reference coordinates in eye-tracking calibrations can raise multiple issues. Choices made regarding the detection algorithm and its parametrization might work for a given dataset, yet not transfer to another. Requiring the user to inspect the results obtained in each experiment, in order to tune the parameters or even switch to a different algorithm, would defeat the purpose of having a generic and unsupervised method to process eye-tracking calibrations, which is one of the aims of our work. In summary, the fixation detection process might be unreliable, and likely to add another source of errors. Nonetheless, it seemed relevant to include these methods as baselines, to quantify how well a standard automatic calibration would fare with calibrations performed by uncooperative subjects. We therefore implemented three fixation-based procedures: 1) averaging the positions of the samples belonging to the first fixation within each trial, and averaging these positions for each target (method labeled “AvgFirstFixations”); 2) averaging the longest fixations of the trials sharing the same target (“AvgLongestFixations”); 3) taking the longest fixation across all the trials with the same target (“LongestFixation”). For more details, the source code of these procedures as well as a technical report are available in the repository https://github.com/adrienbrilhault/BRIL.
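For reference, a minimal velocity-threshold (I-VT) segmentation and the extraction of the longest fixation could be sketched as follows; the velocity threshold, the assumption of strictly increasing timestamps, and the function name are illustrative and do not correspond to the procedures released in the repository.

longest_fixation_center <- function(t, x, y, vel_threshold = 30) {
  ## t in seconds (strictly increasing), x/y in pixels; vel_threshold in
  ## pixels per second -- an arbitrary value for this sketch
  v <- c(0, sqrt(diff(x)^2 + diff(y)^2) / diff(t))  # point-to-point velocity
  is_fix <- v < vel_threshold                       # below threshold -> fixation sample
  runs <- rle(is_fix)                               # consecutive fixation/saccade runs
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  fix_runs <- which(runs$values)
  if (length(fix_runs) == 0) return(c(NA, NA))      # no fixation detected
  durations <- t[ends[fix_runs]] - t[starts[fix_runs]]
  k <- fix_runs[which.max(durations)]               # longest fixation event
  idx <- starts[k]:ends[k]
  c(mean(x[idx]), mean(y[idx]))                     # its average position
}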

We show in Fig. 27 our three experimental datasets, consisting of eye-tracking calibrations performed by Capuchin monkeys, with increasing levels of contamination. The estimates for each of the calibration targets computed with the BRIL algorithm based on spatial depth are indicated by bold circles in Fig. 27, along with the reference positions considered as ground-truth, represented by black cross-hairs, which were manually annotated.

Fig. 27

Eye-tracking recordings of three calibration procedures with Capuchin monkeys. Five targets were successively presented on the screen at coordinates (0,0), (100,-100), (-100,-100), and (100,100), for 1200 to 3000 ms, with multiple repetitions of each target in a randomized order. The gaze coordinates collected by the eye-tracker in these intervals are displayed in different colors, based on the target displayed at the time. Black crosses correspond to the ground-truth coordinates for each target, while bold circles indicate the reference positions computed by the BRIL algorithm using projection depth. Our method is able to accurately locate the main fixation in each dataset, even when the data reach very high levels of noise, as in the third example

The complete results for each of the methods, including the clustering algorithms and fixation-based methods, are provided in Table 4. Each procedure included five calibration targets, presented for 1200 to 3000 ms, with multiple repetitions of each target in a randomized order. The gaze coordinates from all the repetitions of a specific target in a given calibration were pooled and filtered (see Table 1). Out of the remaining samples, 20 random draws of 1000 data points were performed to compute the average, deviation, and sum of squared errors of each algorithm for every target, which were then grouped by session. Errors are given in screen pixels, which, based on the monitor size, resolution, and distance to the viewer, can be converted into visual field angles, with 1 degree corresponding to 25 pixels in our experimental setup.

Table 4 Errors measured on three experimental datasets of increasing contamination

While the mean position, commonly used with calibrations performed by cooperative subjects, performs poorly in all situations, we see that the coordinate-wise median, related to the procedure from Model and Eizenman (2012), is able to provide reliable results under low contamination (in the first dataset, for instance), but is not suitable as the quantity of outliers increases (showing an error of 25.422px in the third dataset). In agreement with the conclusions from Brilhault et al. (2019), we can observe that the very high performance of model-based clustering on synthetic datasets does not translate to real-world data. In the overall results across the three datasets, MClust shows the largest errors among all the methods tested, even higher than the mean. PAM clustering, which showed the best results in our previous study, appears again as the best candidate, our new method aside. With an average error of 1.488px (0.059 visual degrees), it offers a robust estimate of the calibration parameters. Nonetheless, the estimates from our BRIL algorithm achieve even better results, with average errors below 1px for seven of them (less than 0.040 visual degrees). In real-world datasets with high contamination, our method therefore appears as the best choice for estimating the reference calibration coordinates. In regard to fixation-based methods, our analysis showed that in favorable conditions - e.g., experiments with cooperative subjects, or datasets with little contamination - the extraction of the longest fixations would be a viable option to estimate calibration parameters. However, the behavior of uncooperative subjects is too unpredictable for fixation-based approaches to be sufficiently robust, even assuming a flawless fixation detection method. This was clear in the errors measured. On the least contaminated dataset, the method based on the longest fixation across all trials of a given target provided an average error of 5.258px (0.2064 visual degrees), while the average of longest fixations gave 29.181px (1.1457) and the average of first fixations 27.058px (1.0624). Across all datasets, on the other hand, including the more contaminated ones, the longest fixation across all trials of a given target showed a mean error of 25.990px (1.0204), against 33.714px (1.3237) for the average of longest fixations and 27.977px (1.0984) for the average of first fixations, that is, errors more than 25 times larger than those from the BRIL estimates.

Conclusion

In this study, we presented an analysis of the use of depth measures for identifying the main modes of multivariate distributions. We characterized their robustness and behavior with respect to the sample size, number of clusters, percentage of inliers, and uniform noise. We showed that standard depth medians can provide reliable estimates when contamination levels are relatively low. However, in multimodal distributions with proportions of outliers beyond 50% (the breakdown point of most depth functions analyzed), these locators are not suited to estimate the center of the main component. Our method, on the other hand, performs well in such critical conditions, withstanding up to \(100\times \frac {(K-1)}{K}\) percent of clustered contamination (with K being the total number of clusters). The overall proportion of outliers tolerated by our algorithm can be even higher if additional contamination in a non-clustered form, such as uniform noise, is considered. Our method consists of the following steps: i) a recursive estimate based on depth measures or convex body minimizers; ii) outlier filtering; and iii) an iterative identification of clusters. The recursive application of depth functions, or convex body minimizers, aims at converging toward a local density peak more reliably than the traditional multivariate medians. The purpose of the subsequent filtering steps is to refine this location by discarding as many outliers as possible, to guarantee high robustness and precision. We first eliminate the most distant outliers based on their Euclidean distance and a test of unimodality. The remaining samples are then filtered using robust distances and a test of multivariate normality. Besides improving the accuracy of the estimate, our filtering procedure makes it possible to identify the samples belonging to a particular component of the multimodal distribution. By removing the cluster found from the overall distribution, and applying the same process recursively, we are able to extract the different components of the mixture. We can finally select the cluster with the highest cardinality as the main mode, thus avoiding errors that commonly occur when the first depth estimate identifies a cluster of outliers.

Our method proved successful when applied to a real-world scenario, the calibration of an eye-tracking system in the context of visual experiments with monkeys. Despite the datasets being highly contaminated, we still obtained precise estimates of the reference coordinates, with errors below 0.04 visual degrees. By using this new central tendency measure, the user or experimenter is not required to confirm the correct fixations manually. As long as the cumulative duration of the fixations to the actual targets exceeds the time spent on any given irrelevant position, our estimate will reliably identify the calibration parameters. For these reasons, our method shows great benefits for eye-tracking studies with inattentive or poorly cooperative subjects, such as infants and animals, where data typically suffer from high contamination. It would also be of interest for experimental paradigms which do not include an explicit calibration routine. Beyond its use in eye-tracking calibrations, our multivariate mode estimate can find application in various other scenarios requiring a robust measure of central tendency. Moreover, because our algorithm also provides a partition of the data, it may contribute to the development of new powerful clustering techniques. Compared to current clustering methods, our approach presents several advantages. First, it does not require an explicit number of clusters, and therefore avoids the costly iteration over a range of potential k values in order to select the optimal number (maximizing a criterion such as the average silhouette or the gap statistic). Our method is also computationally fast, and offers a very high tolerance to noise. Many robust clustering techniques, such as TClust (Fritz et al., 2012), require an alpha parameter, which can strongly affect the final partitions and requires prior knowledge of the amount of noise, or a careful visual analysis of the data. Our method, on the other hand, is non-parametric, and will handle any quantity of noise seamlessly. The use and performance of the BRIL algorithm as a clustering technique, named BRIC (Bootstrap and Refine Iterative Clustering), will be investigated in subsequent studies.