Background

Mosquito-borne diseases present a significant risk to human health, with nearly 700 million cases and 750,000 deaths reported globally each year [1]. To combat these diseases, it is crucial to understand the behaviour of mosquitoes. Tracking mosquitoes produces trajectories that can yield valuable insights into their flight behaviour and has already led to significant advances in disease prevention. For instance, early studies on mosquito trajectories led to the development of an improved insecticide-treated net (ITN) design that provides better protection against disease transmission [2]. Further research on mosquito behaviour is likely to lead to other such improvements.

Previously, many tracking studies involved manual processing to capture behaviours, with a number of examples concerning mosquitoes [3,4,5]. However, advancements in high-resolution cameras, computational power, and computer vision technology have enabled automated tracking of behaviour. For example, [20] used 2D imaging, tracking, and feature extraction with supervised learning models to differentiate sandflies from other insects, achieving accuracies of circa 88% for support vector machine and artificial neural network models on an optimised feature set.

In this article, we explore the relative merits of 2D and 3D mosquito tracking when classifying and interpreting behaviours via machine learning. We present a comparative analysis of 3D trajectories, 2D telecentric data (removing one orthogonal component), and 2D single-camera data with perspective distortion, all derived from the same dataset, to assess the advantages and limitations of these tracking approaches. Analogous features are determined for each of these datasets; the accuracy of the machine learning classifier provides a useful quantitative metric to assess the outcomes, while explainable artificial intelligence (XAI) enables interpretation of behaviours. We hypothesise that 3D tracking and 2D telecentric tracking will return similar results, despite the loss of the additional information in the third dimension. We further hypothesise that a single-camera tracking system will return lower performance due to perspective effects and lens distortion. A deeper understanding of the strengths and weaknesses of 2D and 3D mosquito tracking will enable researchers to make informed decisions regarding experiment design. Overall, our research endeavours to advance the field of mosquito tracking and behaviour analysis via XAI, ultimately aiding in the development of more efficient and targeted mosquito control measures, leading to significant public health benefits.

Methods

A machine learning classifier has previously been established to classify male versus non-male mosquitoes using 3D trajectories from mating swarms [18]. From this 3D dataset, corresponding 2D telecentric and 2D angular field-of-view information is derived to simulate the data obtained from these tracking systems. The sections below detail how the 2D telecentric and single-camera 2D angular field-of-view trajectories are determined and how the corresponding features are derived for the 2D data.

Dataset description

The trajectories of the mosquitoes utilised in this investigation were produced by Butail et al. and were provided as 3D tracks following the processing steps outlined in [8]. The data were collected in Doneguebogou, Mali, for the years 2009–2011, during which wild Anopheles gambiae mosquito swarms were observed.

The dataset contained 191 male mosquito tracks over 12 experiments as well as 743 mating couple tracks (where male and female mosquitoes mate in flight and are tracked together) over 10 experiments (Table 1). The male mosquito tracks were captured in swarms where no females were present, whereas couple tracks were generated from swarms that contained mating events. Prior to analysis, tracks were filtered based on duration, excluding those < 3 s. This decreased the size of the dataset but effectively eliminated tracks with low information content.

Table 1 Numbers of experiments and tracks for each class of mosquito

The experiments used to track the mosquitoes utilised a stereo-camera setup with phase-locked Hitachi KP-F120CL cameras operating at 25 frames per second. Each camera captured 10-bit images with a resolution of 1392 × 1040 pixels. On-site calibration of the cameras was performed using a checkerboard and the MATLAB Calibration Toolbox [21]. The relative orientation and position of the cameras were established through extrinsic calibration, which involved capturing images of a stationary checkerboard in multiple orientations and positions. The cameras' height, azimuth, and inclination were recorded to establish a reference frame fixed to the ground.

Two-dimensional projection of 3D trajectory data

To conduct a comparative analysis between 3D and 2D trajectories, two methods were employed to convert the 3D dataset to a 2D one.

  1. The first involves the omission of depth information, resulting in the plane of view parallel to the camera (YZ for this dataset). This method emulates a well-calibrated 2D setup that uses telecentric imaging [19], i.e. the separation of the two lenses on the imaging side generates the telecentric condition and any lens distortion effects have been removed by appropriate calibration.

  2. The second transformation method utilises a single lens camera model placed a distance away from the swarm to project the trajectories onto a 2D plane (the camera detector plane), simulating the transformation that occurs through a single-camera setup including perspective and lens distortion.

To perform the second transformation, the camera was modelled using OpenCV [22], which requires the focal length, principal point, distortion coefficients, and the camera location and rotation. The 3D trajectories were projected onto the image plane using a perspective transformation, utilising the projectPoints function (https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html), represented by the distortion-free projection equation (Eq. 1):

$$\begin{array}{c}sp=A\left[R|t\right]{P}_{w}\end{array}$$
(1)

where \({P}_{w}\) is a four-element column vector in 3D homogeneous coordinates representing a point in the world coordinate system, \(p={\left[\begin{array}{ccc}u& v& 1\end{array}\right]}^{T}\) is a three-element column vector in 2D homogeneous coordinates defining the corresponding position \(\left(u,v\right)\) of a pixel in the image plane, \(R\) and \(t\) refer to the rotation and translation transformations between the world and camera coordinate systems, \(s\) is a scaling factor independent of the camera model, and \(A\) is the camera intrinsic matrix given by (Eq. 2).

$$\begin{array}{c}A=\left[\begin{array}{lll}{f}_{x}& 0& {c}_{x}\\ 0& {f}_{y}& {c}_{y}\\ 0& 0& 1\end{array}\right]\end{array}$$
(2)

where \({f}_{x}\) and \({f}_{y}\) are the focal lengths expressed in pixel units, and \({c}_{x}\) and \({c}_{y}\) are the coordinates of the principal point on the detector, also in pixel units. Under these definitions the coordinates of the imaged point on the camera \(\left(u,v\right)\) are in pixels. Radial, tangential, and prism distortions are included by modifying the 3D point in camera coordinates, given by \(\left[R|t\right]{P}_{w}\) [22]. The camera intrinsic matrix values and distortion coefficients were based on the specifications provided by one of the camera models employed during the dataset generation process. These comprise the focal lengths (\({f}_{x}=1993.208\) and \({f}_{y}=1986.203\)), principal point coordinates (\({c}_{x}=705.234\) and \({c}_{y}=515.751\)) and distortion coefficients (\({k}_{1}=-0.088547\), \({k}_{2}=0.292341\), and \({p}_{1}={p}_{2}=0\)) [8]. To ensure accurate representation of the swarm, the translation vector was adjusted such that the optical axis aligned with the centre of a cuboid enclosing the swarm, while the camera model was positioned at a predetermined distance from the swarm centre. As detailed by Butail et al. [8], the camera was positioned between 1.5 m and 2.5 m away from the swarm; therefore, in our simulated experiment, the camera model was positioned 2 m from the swarm centre. Simulations conducted with and without the lens distortion terms showed that the vast majority (> 98%) of the distortion observed in the image was due to perspective at this range (for the camera intrinsic matrix values given above and a cuboid object extending 1 m in each axis). For the single-lens 2D camera model, the coordinates of the image points in pixels (from Eq. 1, corresponding to the 3D trajectory coordinates) were used directly for feature calculation and classification.
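For concreteness, the following Python sketch illustrates both transformations, assuming the 3D tracks are stored as N × 3 NumPy arrays in world coordinates with depth along X. The function names, the world-to-camera orientation convention, and the swarm-centring step are illustrative assumptions; the intrinsic and distortion values are those quoted above, and projectPoints is the OpenCV call named in the text.

```python
import numpy as np
import cv2

# Intrinsic matrix and distortion coefficients quoted in the text (pixel units).
K = np.array([[1993.208, 0.0, 705.234],
              [0.0, 1986.203, 515.751],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.088547, 0.292341, 0.0, 0.0])  # k1, k2, p1, p2

def telecentric_2d(track_xyz):
    """Telecentric emulation: discard the depth (X) component, keeping the
    Y-Z plane parallel to the camera detector."""
    return np.asarray(track_xyz, dtype=float)[:, 1:3]

def single_camera_2d(track_xyz, distance_m=2.0):
    """Project 3D world points onto the image plane of a single-lens camera
    model placed distance_m from the swarm centre, looking along the X axis."""
    pts3d = np.asarray(track_xyz, dtype=float)
    centre = pts3d.mean(axis=0)
    # Rows are the camera axes in world coordinates:
    # image x = world Y, image y = world Z, optical axis (z) = world X.
    R = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
    rvec, _ = cv2.Rodrigues(R)
    # Translation placing the swarm centre on the optical axis at distance_m.
    tvec = np.array([0.0, 0.0, distance_m]) - R @ centre
    img_pts, _ = cv2.projectPoints(pts3d.reshape(-1, 1, 3), rvec, tvec, K, dist)
    return img_pts.reshape(-1, 2)  # pixel coordinates (u, v)
```

In practice, the rotation and translation would be chosen to match the simulated camera placement described above rather than the simple axis swap used in this sketch.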

To investigate the impact of different distances between the camera and the swarm on classifier performance from a single-lens 2D measurement, the focal length of the camera model was adjusted such that the swarm occupied the same extent in the image at each distance. The thin lens equation (Eq. 3) was used to approximate the distance from the lens to the image plane as the object distance is varied. This equation relates the focal length, \(f\), to the distance of the object from the camera lens, \(u\), and the distance of the camera lens to the image plane, \(v\). Subsequently, by applying the magnification equation (Eq. 4), the magnification factor, \(M\), was determined [23]. Based on the new distance between the object and the lens, the corresponding focal length was calculated and utilised in the camera intrinsic matrix (Eq. 2).

$$\begin{array}{c}\frac{1}{f}=\frac{1}{u}+\frac{1}{v}\end{array}$$
(3)
$$\begin{array}{c}M=\frac{v}{u}\end{array}$$
(4)
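As a worked illustration of Eqs. 3 and 4, the short function below computes the focal length (in pixel units) needed to keep the swarm's image extent constant as the object distance changes. The detector pixel pitch used to convert between pixels and metres is an assumed value, not a specification from the original cameras.

```python
def scaled_focal_length_px(f_ref_px, d_ref, d_new, pixel_pitch=4.65e-6):
    """Return the focal length (in pixels) that keeps the swarm's image extent
    constant when the camera-to-swarm distance changes from d_ref to d_new (m).

    Uses the thin-lens relation 1/f = 1/u + 1/v (Eq. 3) and the magnification
    M = v/u (Eq. 4). pixel_pitch is an assumed detector pixel size (m) used to
    convert between pixel units and metres.
    """
    f_ref = f_ref_px * pixel_pitch                 # reference focal length (m)
    v_ref = 1.0 / (1.0 / f_ref - 1.0 / d_ref)      # image distance at d_ref (m)
    m_ref = v_ref / d_ref                          # magnification to preserve
    # Solve M = f / (u - f) = m_ref for f at the new object distance d_new.
    f_new = m_ref * d_new / (1.0 + m_ref)
    return f_new / pixel_pitch

# Example: moving the camera model from 2 m to 15 m increases the required
# focal length roughly 7.5-fold.
f_15m = scaled_focal_length_px(1993.208, d_ref=2.0, d_new=15.0)
```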

Thereby, datasets for 3D, 2D telecentric, and 2D single camera at varying object distances were derived; an example is provided (Fig. 4).

Fig. 4
figure 4

Plots displaying the effect of the transformation methods on a single mosquito trajectory. a Original 3D track. b Two transformation methods applied to the trajectory: the 2D telecentric transformation with depth information ignored (blue) and the 2D camera model developed in OpenCV at 2 m (orange) and 15 m (green), respectively, whilst utilising the distortion coefficients from Butail et al. [8]

Machine learning framework

This study employs an anomaly detection framework, as detailed in [18], to classify male and non-male mosquito tracks. Track durations are unified by splitting them into segments of equal duration, and flight features are extracted per segment. In [18], tracks shorter than double the segment length were removed; here, that restriction is replaced by the uniform filter removing tracks < 3 s in duration, so that all datasets are treated identically and downstream comparisons are like-for-like. Features are selected using the Mann-Whitney U test, and highly correlated features are removed. Classification is performed using a one-class support vector machine (SVM) model trained on a subset of the male class. The model forms predictions on track segments, and a voting method is then employed to return the final class prediction for whole tracks (Fig. 5).

Fig. 5
figure 5

Diagram outlining the machine learning pipeline used to classify male and non-male mosquito tracks
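A minimal sketch of the pipeline described above is given below, assuming per-segment feature matrices are already available as NumPy arrays. The thresholds in the feature selection step and the tie-breaking rule in the vote are illustrative assumptions rather than the exact settings of [18].

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def segment_track(track, window, overlap):
    """Split a (frames x dims) track into equal-length segments with overlap."""
    step = max(1, window - overlap)
    return [track[s:s + window] for s in range(0, len(track) - window + 1, step)]

def select_features(X_male, X_other, alpha=0.05, corr_thresh=0.9):
    """Keep features that separate the classes (Mann-Whitney U test) and drop
    any feature highly correlated with one already kept."""
    candidates = [j for j in range(X_male.shape[1])
                  if mannwhitneyu(X_male[:, j], X_other[:, j]).pvalue < alpha]
    selected = []
    for j in candidates:
        if all(abs(np.corrcoef(X_male[:, j], X_male[:, k])[0, 1]) < corr_thresh
               for k in selected):
            selected.append(j)
    return selected

# One-class SVM trained on a subset of male segments only (nu = 0.2, see the
# hyperparameter tuning described below).
model = make_pipeline(StandardScaler(), OneClassSVM(kernel="rbf", nu=0.2, gamma="scale"))

def classify_track(model, segment_features):
    """Majority vote over per-segment predictions (+1 = male, -1 = non-male)."""
    votes = model.predict(segment_features)
    return 1 if votes.sum() >= 0 else -1
```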

The 3D trajectory feature set is detailed in [18]. For 2D trajectory data, an equivalent feature set was employed resulting in 136 features of flight, with most feature calculations remaining consistent, albeit with the exclusion of the third axis. For instance, straightness (also referred to as tortuosity) is computed as the ratio between the actual distance travelled and the shortest path between the start and end positions. For 3D trajectories, this was calculated as:

$$\begin{array}{c}S=\frac{{\sum }_{i=0}^{N}\sqrt{{\left({x}_{i+1}-{x}_{i}\right)}^{2}+{\left({y}_{i+1}-{y}_{i}\right)}^{2}+{\left({z}_{i+1}-{z}_{i}\right)}^{2}}}{\sqrt{{\left({x}_{N}-{x}_{0}\right)}^{2}+{\left({y}_{N}-{y}_{0}\right)}^{2}+{\left({z}_{N}-{z}_{0}\right)}^{2}}}\end{array}$$
(5)

However, in the 2D trajectory case, this was now calculated as:

$$\begin{array}{c}S=\frac{{\sum }_{i=0}^{N}\sqrt{{\left({x}_{i+1}-{x}_{i}\right)}^{2}+{\left({y}_{i+1}-{y}_{i}\right)}^{2}}}{\sqrt{{\left({x}_{N}-{x}_{0}\right)}^{2}+{\left({y}_{N}-{y}_{0}\right)}^{2}}}\end{array}$$
(6)

The calculations for the remaining features were originally devised for 2D trajectories. For 3D trajectories, these features were therefore computed on the projections onto the X–Y, Y–Z, and X–Z planes, yielding a value for each projection. An example of this is the calculation of curvature, which requires a single plane:

$$\begin{array}{c}{k}_{i}=\frac{\dot{{x}_{i}}\ddot{{y}_{i}}-\dot{{y}_{i}}\ddot{{x}_{i}}}{{\left({\dot{{x}_{i}}}^{2}+{\dot{{y}_{i}}}^{2}\right)}^\frac{3}{2}}\end{array}$$
(7)
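The two example features can be computed as follows. The sketch assumes tracks are NumPy arrays sampled at the 25 Hz frame rate, and the use of numerical gradients for the derivatives in Eq. 7 is an assumption about the implementation.

```python
import numpy as np

def straightness(track):
    """Straightness/tortuosity (Eqs. 5-6): total path length divided by the
    straight-line distance between the first and last positions. The same code
    handles 2D (N x 2) and 3D (N x 3) tracks."""
    steps = np.diff(track, axis=0)
    path_length = np.linalg.norm(steps, axis=1).sum()
    net_displacement = np.linalg.norm(track[-1] - track[0])
    return path_length / net_displacement

def curvature(x, y, dt=1.0 / 25.0):
    """Signed curvature (Eq. 7) of a planar trajectory sampled at 25 fps,
    using numerical gradients for the first and second derivatives."""
    xd, yd = np.gradient(x, dt), np.gradient(y, dt)
    xdd, ydd = np.gradient(xd, dt), np.gradient(yd, dt)
    return (xd * ydd - yd * xdd) / (xd ** 2 + yd ** 2) ** 1.5
```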

The study employed K-fold cross-validation. Two male trials were reserved for testing, while the remaining trials were used for training. All remaining classes (couples, females, and focal males) were used in testing. In the K-fold cross-validation process, different combinations of male trials were systematically rotated into the training set in each iteration, referred to as a ‘fold’. Performance metrics such as balanced accuracy, ROC AUC (area under the receiver operating characteristic curve), precision, recall, and F1 score were calculated, with the male and non-male classes each treated in turn as the positive class for metric computation.

The framework has various parameters that can be tuned, including the machine learning model hyperparameters and the window size used to split tracks into segments. These were tuned together in a cross-validated grid search that aimed to maximise balanced accuracy. An independent tuning set containing three male trials and two couple trials, distinct from the dataset used to report the classification performance (named the modelling set), was used to obtain the best parameters. The grid search utilised in this study encompassed a more refined range of values with a smaller step size compared to [18], as detailed in the supplementary material. The hyperparameter ν, described as “an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors”, was set to 0.2. This value was chosen to impose strong regularisation on the model, allowing large errors on the male class (the only class seen during training) and thereby reducing overfitting.
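A sketch of such a joint grid search is shown below. The grids themselves are hypothetical placeholders (the actual ranges are in the supplementary material), and `evaluate` stands in for the re-segmentation, training, and balanced-accuracy scoring steps described above.

```python
import numpy as np
from itertools import product

# Hypothetical grids: the actual value ranges and step sizes are given in the
# supplementary material.
param_grid = {
    "window":  [25, 50, 75, 100],    # segment length in frames (25 fps)
    "overlap": [0, 12, 25],          # frames shared by consecutive segments
    "nu":      [0.1, 0.2, 0.3],
    "gamma":   ["scale", 0.1, 0.01],
}

def grid_search(evaluate, param_grid):
    """Exhaustive search over the joint grid, keeping the combination with the
    highest cross-validated balanced accuracy on the tuning set. `evaluate`
    re-segments the tuning data with the given window/overlap, fits the
    one-class SVM with the given nu/gamma, and returns balanced accuracy."""
    best_score, best_params = -np.inf, None
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid, values))
        score = evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```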

Evaluation of transformed data

Various methods were used to assess the different datasets. The machine learning pipeline provides quantitative metrics for evaluating performance on the 3D and 2D trajectory feature sets. Analysing feature correlations between the 3D and 2D datasets can reveal insights into the preservation of flight features within 2D trajectories. Correlations were computed as the average absolute Pearson’s correlation coefficient across features between two datasets. Although each dataset has specific window parameters identified during hyperparameter tuning, a fixed segment size and overlap were used when determining the correlation matrix so that paired samples could be generated.
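A minimal sketch of this correlation measure, assuming two feature matrices whose rows correspond to the same track segments and whose columns correspond to the same features:

```python
import numpy as np

def mean_abs_feature_correlation(features_a, features_b):
    """Average absolute Pearson correlation across matching feature columns of
    two datasets whose rows are the same track segments (paired samples)."""
    r = [np.corrcoef(features_a[:, j], features_b[:, j])[0, 1]
         for j in range(features_a.shape[1])]
    return float(np.mean(np.abs(r)))
```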

An alternative technique for analysing and comparing features is to visualise them through an embedding. Here, an embedding is a lower dimensional space that condenses the information content from a higher dimensional space. Uniform manifold approximation and projection (UMAP) [24] creates a visualisation that shows how the 2D/3D datasets cluster within the embedded feature space. Notably, UMAP is a dimensionality reduction technique that preserves the local relationships and global structure of the data, making it particularly suitable for this purpose.
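A sketch of how such a shared embedding might be produced with the umap-learn package, fitting a single UMAP model on the stacked feature matrices so that all datasets are projected into the same space (the specific UMAP parameters are assumptions):

```python
import numpy as np
import umap  # umap-learn package

def shared_umap_embedding(feature_sets, labels, random_state=42):
    """Fit a single 2-component UMAP on the stacked feature matrices so that
    all datasets share one embedded space, then return per-dataset embeddings."""
    stacked = np.vstack(feature_sets)
    embedding = umap.UMAP(n_components=2, random_state=random_state).fit_transform(stacked)
    splits = np.cumsum([len(f) for f in feature_sets])[:-1]
    return dict(zip(labels, np.split(embedding, splits)))
```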

Most importantly, it is necessary to deduce whether the machine learning models are utilising features correctly and whether the behavioural insights gathered are consistent with those from 3D trajectories. By using SHapley Additive exPlanations (SHAP) values [25], it was possible to visualise and explain how the model made its predictions.

In [18], classification of male and non-male trajectories based on 3D trajectory features was demonstrated, alongside XAI to interpret the machine learning model. Because the underlying data were collected in the field, the SHAP plots have increased noise and may exhibit a slight skew in the colour scale. To ensure robust interpretations, SHAP scatter plots were also used to visualise the SHAP value distribution as a function of feature value.
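The scatter plots can be reproduced along the following lines; the use of SHAP's model-agnostic KernelExplainer on the one-class SVM decision function is an assumption about the implementation, and the plotting details are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
import shap

def shap_scatter(model, X_background, X_segments, feature_names, feature):
    """SHAP values for the one-class SVM via the model-agnostic KernelExplainer,
    scattered against one feature's normalised values, with the feature's
    histogram drawn as a grey shadow behind the points."""
    explainer = shap.KernelExplainer(model.decision_function, X_background)
    shap_values = np.asarray(explainer.shap_values(X_segments))
    j = feature_names.index(feature)

    fig, ax = plt.subplots()
    ax_hist = ax.twinx()
    ax_hist.hist(X_segments[:, j], bins=30, color="0.85")
    ax_hist.set_yticks([])
    ax.set_zorder(ax_hist.get_zorder() + 1)  # draw the scatter above the histogram
    ax.patch.set_visible(False)
    ax.scatter(X_segments[:, j], shap_values[:, j], s=8)
    ax.set_xlabel(f"{feature} (normalised)")
    ax.set_ylabel("SHAP value")
    return fig
```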

Results

The 3D dataset was transformed into 2D telecentric and single-camera datasets at various distances from the swarm. Evaluating the machine learning framework’s performance at these distances (Fig. 6) shows that the single-camera model increasingly matches the telecentric dataset as the camera moves farther from the object. For each distance, the tuned pipeline returns differing segment sizes and overlaps, which are also displayed.

Fig. 6
figure 6

Classification performance of the 2D single-camera model as the distance varies. Solid lines: 2D single-camera model; dashed lines: data from 2D telecentric model. a Balanced accuracy as distance varies. b ROC AUC score as distance varies. Both graphs also display the optimised segment size and overlap from hyperparameter tuning at each distance

Comprehensive results for the performance when using tuned pipeline parameters for the 3D dataset, 2D telecentric dataset, and 2D single-camera datasets with the camera placed at 2 m and 15 m are provided (Table 2). Across all datasets, the best performance obtained was from the 3D tracks, with a balanced accuracy and ROC AUC score of 0.656 and 0.701, respectively. This performance may seem low, but the classifier is attempting to distinguish small differences in features of flight over segments of a few seconds, and the data were captured in the field under various conditions; hence, such performance in this application is notable. Generally, the single-camera model performs worse than the telecentric and 3D methods in both cases. At 2 m, the single-camera model fares 6.8% and 6.9% worse in balanced accuracy and ROC AUC compared to the 3D dataset, primarily because of perspective distortion. The telecentric and 3D methods exhibit similar performance, with absolute percentage differences of 2.1% and 1.0% in balanced accuracy and ROC AUC, respectively. This indicates that performance is preserved with a 2D telecentric dataset, i.e. with two orthogonal displacement components quantified, despite the loss of depth information. Similarly, when the single-camera model is placed farther away, its performance closely mirrors that of the 2D telecentric dataset, with absolute percentage differences of 0.7% and 1.3% for balanced accuracy and ROC AUC, respectively. Note that the performance metrics for the female and focal male classes are not conclusive as these results are based on a limited number of tracks.

Table 2 Performance metrics of each 3D/2D dataset when passed into the machine learning pipeline with the 95% confidence interval provided in brackets

A closer analysis of individual fold performance across the datasets revealed additional insights. As reported in [18], poorly performing folds were those tested on abnormal trials, where the mosquito type differed (Mopti form instead of Savannah form) and the swarm location differed (over bundles of wood rather than bare ground). These conditions could alter mosquito trajectory features, potentially causing them to fall outside the decision boundary of the single-class model. Conversely, folds that included abnormal trials in training consistently performed best. This indicates potential overfitting to the variability within their features, leading to accurate classifications for male mosquitoes but reduced accuracy for non-males. This trend held for both the 2D telecentric and single-camera models at 15 m. However, the 2 m single-camera model displays the opposite behaviour, with the best performance on folds containing abnormal trials in testing. This implies that the perspective distortion introduced by the camera at this distance affects the feature values and their variability, resulting in unexpected performance variations across different trials.

The performance of these models can be visualised through confusion matrices (Fig. 7) and receiver operating characteristic (ROC) curves (Fig. 8). The confusion matrices display the predictions of all folds, with the percentage of predictions labelled in each section of the matrix. The ROC curves depict the performance of a binary classifier by plotting the trade-off between the true- and false-positive rates.
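One common way to produce the fold-averaged ROC curve and standard-deviation band shown in Fig. 8 is to interpolate each fold's curve onto a shared false-positive-rate grid, as sketched below (the interpolation scheme is an assumption):

```python
import numpy as np
from sklearn.metrics import roc_curve

def mean_roc(fold_labels, fold_scores, grid_size=100):
    """Average per-fold ROC curves on a common false-positive-rate grid,
    returning the grid plus the mean and standard deviation of the
    true-positive rate (the mean curve and shaded band of Fig. 8)."""
    fpr_grid = np.linspace(0.0, 1.0, grid_size)
    tprs = []
    for y_true, y_score in zip(fold_labels, fold_scores):
        fpr, tpr, _ = roc_curve(y_true, y_score)
        tprs.append(np.interp(fpr_grid, fpr, tpr))
    tprs = np.asarray(tprs)
    return fpr_grid, tprs.mean(axis=0), tprs.std(axis=0)
```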

Fig. 7
figure 7

Confusion matrices of each dataset: a original 3D dataset, b 2D telecentric dataset, c 2D single-camera model at 2 m, and d 2D single-camera model at 15 m

Fig. 8
figure 8

Receiver operating characteristic (ROC) curves of each dataset. The dark blue line displays the average ROC curve across all folds, the light blue lines show the ROC curve for each fold, and the grey shadow depicts the standard deviation. Within the figure, (a) displays the original 3D dataset, (b) the 2D telecentric dataset, (c) the 2D single-camera model at 2 m, and (d) the 2D single-camera model at 15 m

Analysing the correlation between features from different datasets can reveal insights into the preservation of flight features in 2D trajectories. Datasets were generated at various distances using the same segment size and overlap, such that correlation can be computed between paired samples. To compute the correlation, each of the datasets was pairwise correlated to produce the matrix (Fig. 9). Overall, a 2D telecentric setup preserves the features of the 3D dataset well, with an average correlation of 0.83. Shape descriptors show the lowest correlation because of depth loss, which is expected. Conversely, a single-camera setup compromises tracking accuracy, resulting in lower feature correlation compared to a 3D stereoscopic system. The average correlation between the 2D single camera at 2 m and the 3D dataset and the 2D telecentric system is 0.72 and 0.87, respectively. However, positioning the single camera at 9 m significantly improves correlation: average correlation values increase to 0.80 and 0.96 compared to the 3D and 2D telecentric datasets, respectively. These results are expected, as increasing the camera-swarm distance reduces the perspective distortion effect, thereby resembling telecentric data and enhancing feature preservation.

Fig. 9
figure 9

Pairwise correlation matrix between each dataset. The Pearson correlation between the same features for each pair of datasets is computed, with the average of the correlations taken to return a final value for the dataset pairs

The UMAP representation (Fig. 10) provides a clear visualisation of the disparities between the datasets. The SHAP plots of the best performing folds for each model were generated and are provided in the supplementary material. This includes SHAP summary plots for the best performing folds of the 3D, 2D telecentric, 2D single-camera at 2 m, and 2D single-camera at 15 m datasets, respectively (Additional file 1: Figs. S1–S4). The supplementary material also includes SHAP summary plots where only the common features across each model are selected and sorted alphabetically (Additional file 1: Figs. S5–S8). SHAP scatter plots for the third quartile of the angle-of-flight feature are provided for each dataset (Fig. 11). This feature was chosen as an example to illustrate the impact that each camera system has on SHAP and feature values. In this figure, each point represents a segment, with its corresponding normalised feature value on the x-axis and its SHAP value on the y-axis. A histogram of the segment feature values is provided as a grey shadow.

Fig. 10
figure 10

UMAP representation of each of the datasets

Fig. 11
figure 11

SHAP scatter plots for the third quartile of the angle-of-flight feature. Within the figure, (a) displays the original 3D dataset, (b) the 2D telecentric dataset, (c) the 2D single-camera model at 2 m, and (d) the 2D single-camera model at 15 m

The feature selection process for each dataset selects slightly different types of features. Among the datasets, the numbers of selected features are as follows: 61 for the 3D dataset, 34 for the 2D telecentric dataset, 42 for the 2D single-camera dataset at 2 m, and 35 for the 2D single-camera dataset at 15 m. Notably, the 3D dataset contains more features as it includes some feature calculations projected onto the X–Y, Y–Z, and X–Z planes, which are not present with 2D data. Despite these differences, a significant portion of features is shared between them. Specifically, 85% of the features are common between the 2D telecentric and 2D single-camera datasets at 2 m, while 97% of the features are common between the 2D telecentric and 2D single-camera datasets at 15 m. These observations further reaffirm that the 2D single-camera dataset at 15 m can effectively emulate a 2D telecentric system. It is important to note that across all datasets, only a few shape descriptors are selected, consistent with the findings from [18].

Discussion

This study compares 3D and 2D trajectory datasets simulating various imaging techniques. Performance metrics were obtained via a one-class machine learning classifier on field data of male and non-male mosquitoes in a mating swarm. Generally, the 3D and 2D telecentric datasets performed best, with the exception of some metrics from the 2D single-camera model at 15 m. Performance with a single camera at a great distance (with a suitable focal length lens) approached that of the telecentric dataset. However, at a typical distance for insect tracking of around 2 m, performance showed an average decrease of about 0.05 across all metrics on the test datasets.

Earlier, we hypothesised that 2D telecentric imaging data would perform similarly to stereoscopic 3D data despite the loss of one axis of information. We anticipated that a single-camera model would be less effective at short distances compared to larger distances, where trajectory data align more closely with telecentric imaging (with larger focal length imaging lenses). The machine learning classifier performance metrics confirm both hypotheses. The implication of the first hypothesis is that the necessary features to differentiate the behaviour of male compared to non-male mosquitoes are present in two orthogonal components of motion as well as in a complete three-dimensional measurement. This is different from what we normally consider to be the accuracy of a measurement. In terms of metrology, accuracy is the difference between a measurement and the true value. The speed of a mosquito requires all three velocity components for accurate determination. The findings demonstrate that features extracted from 2D orthogonal, i.e. independent axes, measurements can characterise behaviour comparably to 3D measurements (Table 2).

Single-camera 2D data are typically obtained without calibrating for geometric distortions introduced by the imaging lens. Distortion increases linearly with radial distance from the optical axis and follows a power law with respect to numerical aperture [26], and the angular field of view increases as the camera is moved closer to the scene of interest. Hence, close-range imaging yields higher distortion compared to distant imaging for the same field of view. Perspective effects at close range mean that the two components of position measured at the detector are also a function of object position along the optical axis, and the magnitude of this effect increases with radial distance (as for lens distortion). Hence, it appears that classifier performance is impacted by perspective and distortion aberrations, an effect that is particularly noticeable at closer distances. Conversely, positioning the camera farther away reduces perspective distortion, leading to more reliable interpretations akin to 2D telecentric data. However, by tuning the parameters of the machine learning pipeline at each distance, the pipeline partially compensates for the distortion effects introduced at smaller distances. The changing segment size for each distance, as determined by the tuning dataset, thus plays a strong role in classification performance, leading to some of the variations in balanced accuracy and ROC AUC as distance increases. The figure (Fig. 6) captures these variations and also depicts the tuned segment size and overlap at each distance. Differing segment sizes capture different scales of behaviour and would lead to variations in feature values and thus differences in classification performance. Intriguingly, this study found that comparable performance between a single-camera 2D measurement and the corresponding 2D telecentric assessment is achieved at a range of 9–10 m. The pipeline parameters beyond 9 m remain consistent and are equivalent to those of the telecentric system, demonstrating the single camera's effectiveness at emulating a telecentric camera system at this range.

The correlation analysis highlights differences between the camera systems. The single-camera model at 15 m correlates strongly with the 2D telecentric data (0.96), while the 2D telecentric data correlate well with the 3D dataset (0.83). In the UMAP representation (Fig. 10), features from the 2D single camera at 2 m cluster towards the upper left corner, suggesting less reliable and inconsistent object tracking at close range. SHAP scatter plots for the angle-of-flight third-quartile feature (Fig. 11) corresponding to the four imaging setups demonstrate similarity among the 3D, telecentric, and single-camera-at-15-m setups, whereas the single camera at 2 m shows increased noise and overlap between the classes across some feature values. This feature describes the upper quartile of the distribution of changes in flight angle within a track segment, where high values indicate a large deviation. It can be argued that this feature for the 3D dataset shows the clearest separation between the male and non-male classes, while overlap occurs in the other setups. In the SHAP scatter plot for the single camera at 2 m, the histogram displays a distorted distribution of normalised feature values compared to the other histograms, further illustrating the impact of the distortion that camera systems at close distance introduce. SHAP summary plots in the supplementary information confirm these trends, indicating subtle differences in feature contributions and a slight skew towards male predictions with close-range single-camera models. This phenomenon can be attributed to the perspective distortion introduced in trajectories constructed by a single-camera model, resulting in highly variable features across all classes. Consequently, the distinct separation between classes diminishes for 2D imaging at close range.

The study primarily focused on the Y–Z view directly imaged on the camera detector, but the other two orthogonal views were also assessed (Additional file 1: Figs. S12–S13). Notably, the machine learning model performance for these additional views was higher than that of the original view discussed above. The overhead X–Y view captures the distinctive circular motions of swarming male mosquitoes and the more erratic behaviours of mating couples, which likely contributes to its higher effectiveness. The X–Z plane, observing the swarm from the other side, may perform better because of the increased uncertainty of the X-positional data in combination with perspective, which may amplify the depth information (e.g. through increased variability in certain features). Mating couple tracks move less in the depth plane and thus lead to a bias towards one of the classes. Both of these views utilise the depth axis, which, while derived, introduces significant noise, rendering these findings less reliable. During the generation of the dataset, the camera system was placed 1.5–2.5 m away from the swarm with a baseline of 20 cm [8], meaning that the angle subtended by the cameras at the swarm in the stereoscopic setup varies between 4.6 and 7.6 degrees. According to [27], for a related stereoscopic imaging setup with an angle of 5 degrees, the uncertainty in depth displacements is > 11 times the uncertainty parallel to the detector plane; with an angle of 7.5 degrees, this uncertainty is > 7 times the uncertainty parallel to the detector plane. As a result, the accuracy of the depth component (X) is 7–11 times worse than that of the other measurement components, and thus the results from these other views are unreliable.
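The quoted subtended angles follow directly from the stereo geometry; a small sketch of the calculation, assuming the swarm lies on the perpendicular bisector of the baseline:

```python
import numpy as np

def subtended_angle_deg(baseline_m=0.20, distance_m=2.0):
    """Angle subtended at the swarm by a stereo pair with the given baseline,
    assuming the swarm sits on the perpendicular bisector of the baseline."""
    return np.degrees(2.0 * np.arctan((baseline_m / 2.0) / distance_m))

# Distances of 1.5 m and 2.5 m give roughly 7.6 and 4.6 degrees, matching the
# range quoted above.
```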

Overall, the 3D dataset demonstrates superior performance, followed by the telecentric dataset. Both setups can be configured in a small experimental footprint compatible with experimental hut trials in sub-Saharan Africa. Stereo 3D setups require alignment of the two cameras on the same field of view and in situ calibration. Two-dimensional telecentric setups require large-aperture optics, typically achieved with plastic Fresnel lenses [19] the same size as the required field of view, and careful alignment of the separation between the camera and the large-aperture lenses. Single-camera 2D imaging is experimentally simpler and can be performed at a lower camera-to-object distance within the size of a typical experimental hut, but it then generates the distortions described above and lower performance in machine learning classification, and hence difficulties in behaviour interpretation. Two-dimensional imaging at longer range becomes problematic for practical reasons: the image path would extend outside a typical dwelling, and it is difficult to prevent occlusion by people and animals during recordings that can take several hours. Also, with large focal length lenses, outdoor implementation in low-light conditions can be particularly problematic as the optical efficiency is reduced, an effect that has not been investigated here. It is also recognised that the calibration process for stereoscopic imaging naturally means that trajectories are obtained in physical distance units, e.g. mm; telecentric setups can also be relatively easily calibrated, as position data parallel to the camera detector remain the same irrespective of an object's position along the optical axis. The resulting machine learning models can therefore be applied to the results of other, equally well-calibrated experiments that attempt to elicit similar behaviours. Two-dimensional single-camera measurements are obtained in pixels from the detector; whilst known artefacts could be placed in the field of view for calibration, manual assessment of whether trajectories are in the appropriate depth plane would need to be made. Hence, the machine learning models from 2D single-camera measurements are less useful than those from the calibrated data of stereoscopic 3D or telecentric 2D setups.

There are certain limitations of this study that should be acknowledged. First, the datasets used for comparing the performance of different tracking systems were all simulated, except for the 3D dataset. The 3D data used for simulating the other tracking systems were gathered from mosquito swarms, where movement revolves around a central point, resulting in generally symmetric trajectories (especially in both horizontal axes). As a result, these findings may not be applicable to studies with asymmetric movements (e.g. mosquito flight around bednets [19]). The orientation of the 2D datasets primarily captures the vertical axis, with respect to the ground, and one horizontal axis; it is probably important for 2D datasets to include the effect of gravity and one other orthogonal axis. Were a trajectory to lie along a linear axis not captured by a 2D imaging system, then clearly it would fail to provide useful information. However, in the mating swarm data used here [8], in data from field tests tracking mosquitoes around human-baited insecticide-treated nets [28], and in odour-stimulated wind tunnel tests [29], mosquitoes do not exhibit straight-line flight behaviours. The 3D data themselves were gathered from wild mosquito swarms, and as such the trajectories may already contain noise that reduces performance across all tracking simulations. To further validate these findings, future trials of the various tracking systems should generate new experimental data from each system in diverse scenarios and then compare their trajectories to determine whether the same behaviours and trends between the 3D and 2D datasets are observed.

Conclusions

Accurately tracking mosquitoes, or insects more generally, is a difficult task that requires care to be taken at many stages. This includes considering the experimental conditions, the video recording equipment, and the software used to identify insects from videos. Nonetheless, accurate tracking of mosquitoes could lead to an improved understanding of their behaviours that may influence disease transmission intervention mechanisms. The results of this study imply that 2D telecentric and 3D stereoscopic imaging should be the preferred imaging approaches to adequately capture mosquito behaviour for machine learning analysis. Both of these approaches are compatible with laboratory and field-based studies, but it should be recognised that 2D telecentric imaging is less complex and the data more straightforward to process. Single-camera 2D imaging over a large, metre-scale field of view, although experimentally easier and needing less expensive equipment, should be avoided because of the distortion in the results and the subsequent difficulty in interpretation. Nonetheless, if a single camera is placed at a considerable distance from the object of interest, accurate interpretations of behaviour may be feasible; however, this demands expensive long-focus lenses and a strong light source to record trackable mosquitoes effectively.