1 Introduction

Plankton, including phytoplankton, mixoplankton and zooplankton, is a fundamental component of aquatic ecosystems (Flynn et al. 2019; Glibert and Mitra 2022; Mitra et al. 2023). Planktonic organisms form the basis of the food web and are essential for global biogeochemical cycles (Arrigo 2005; Hays et al. 2005). Plankton comprises a diverse array of life forms, which are associated with a variety of functions and possess strong interspecific associations (De Vargas et al. 2015). Aquatic ecosystems have been subjected to changes forced by climate and anthropogenic drivers, which have already led to species loss affecting the provision of critical ecosystem services, such as water quality in some regions and productivity (Worm et al. 2006). In order to improve management practices of aquatic ecosystems, it is essential to understand the functioning of planktonic communities, the distribution of different life forms and how these are affected by anthropogenic and climate changes (Rogers et al. 2022).

Phytoplankton blooms are observed when favorable conditions trigger algae growth and accumulation in the environment. Although blooms are part of natural productive cycles in aquatic ecosystems (e.g. the increased production during springtime in temperate systems), harmful blooms that are a nuisance to recreational use and even hazardous also occur (Anderson et al. 2019; Zohdi and Abbaspour 2019). Due to their importance and potential adverse effects, understanding blooms is essential, and efforts towards developing effective observation networks and predictive models have been made (Zhou et al. 2023). Blooms have traditionally been monitored by analyzing fixed samples under the microscope, but despite the high taxonomical information it provides, the high costs and time required by the method limit the number of analyzed samples (Zingone et al. 2015). Remote sensing has been employed for increasing the coverage of bloom observations, showing that algal blooms have been expanding and intensifying in many coastal areas due to environmental changes (Dai et al. 2023), even though the methods yield limited taxonomical information. Thus, methods that can quickly identify different species, such as imaging, can enhance our knowledge of the dynamics of bloom-forming species. Such information at a higher temporal resolution is needed for understanding plankton communities and for providing more accurate data for model validation (Kraft et al. 2021).

Studying and monitoring plankton is hindered by their microscopic size, fast turnover rates and close interaction with the multiscale hydrodynamics (Benfield et al. 2007). Recent advances in plankton imaging systems have led to their popularization and integration into monitoring programs, collectively accumulating information on plankton systems and simultaneously gathering massive amounts of image data (Benfield et al. 2007; Cowen and Guigand 2008; Lombard et al. 2019; Olson and Sosik 2007; Picheral et al. 2010). The major constraint to the use of these datasets lies in the expert annotation of plankton images, which is expensive, time-consuming, and error-prone. To fully benefit from the technological development and to properly explore the gathered information, there is a clear need for automated analysis methods. During recent years, significant research effort has been put into exploring and developing automated methods for performing plankton recognition based on computer vision techniques and machine learning methods (e.g. Lumini and Nanni 2019a; Orenstein and Beijbom 2017).

The research on automatic plankton image recognition has matured from early works based on hand-engineered image features combined with traditional classifiers such as support vector machine (SVM) (Cortes and Vapnik 1995) and random decision forest (RDF) (Ho 1995) (see e.g. Tang et al. 1998; Sosik and Olson 2007) to feature learning-based approaches utilizing deep learning and especially convolutional neural networks (CNNs) (Lee et al. 2016; Orenstein and Beijbom 2017; Lumini and Nanni 2019a; Kloster et al. 2020). Various custom methods and modifications to general-purpose techniques have been proposed to address the special characteristics of plankton image data. However, despite the high recognition accuracies reported in the literature, these methods have not been widely adopted in operational use. Many instrument users do not possess the computational skills and/or the resources required for implementing custom methods for image recognition and often rely on the default methods that come with the instruments, which typically follow rather simple approaches and do not fully exploit the latest advances in computer vision and machine learning. Deploying deep learning-based methods in new environments often requires notable amounts of labeled training data and expert knowledge, while publicly available feature engineering-based plankton recognition libraries are accessible to non-experts.

Some survey papers on more general microorganism recognition, as well as on utilizing machine learning for marine ecology, already exist. Zhang et al. (2022) presented a review of machine learning approaches for microorganism image analysis including history, trends, and applications. The paper covers the segmentation, clustering, and classification of various types of microorganism data. Rani et al. (2021) described and compared existing microorganism recognition methods. While the challenges are briefly discussed, the discussion remains on a general level and does not go deeply into the solutions. Li et al. (2019a) provided a review on microorganism recognition for various application domains with the focus on traditional feature engineering approaches. The survey by Goodwin et al. (2022) covers an even larger scope by addressing the utilization of deep learning methods in marine research. A similar survey was provided by Mittal et al. (2022), who presented existing methods on underwater image classification including fish, plankton, coral reefs, seagrass, and submarines. Bachimanchi et al. (2 [...] In Sect. 2, plankton imaging, i.e., imaging instruments and existing image datasets, is reviewed. In Sect. 3, automatic plankton recognition including feature engineering and CNNs is discussed. In Sect. 4, the most notable challenges in plankton recognition are identified. In Sect. 5, the existing solutions for each challenge are described. Finally, the paper concludes with directions for future research in Sect. 6.

2 Plankton imaging

2.1 Imaging instruments

A fundamental understanding of how plankton species composition is regulated requires frequent and sustained observations. As plankton communities are diverse and dynamic, monitoring plankton is challenging. Different types of plankton imaging and analysis systems have been developed to identify and enumerate living (plankton) and non-living particles in natural waters (Benfield et al. 2007). Instruments designed for monitoring plankton communities are briefly discussed next (see review by Lombard et al. (2019) for more detailed information). The specifications of the main imaging instruments are summarized in Table 1.

Microscopy has been widely employed for analysis of plankton, with most of the standard monitoring of plankton organisms based on brightfield microscopy (Zingone et al. 2015). With the possibility of easy magnification change, microscopy can cover the whole size range of plankton. Added to the potential to be combined with other technologies, such as fluorescence, it can provide a flexible array for visualizing planktonic organisms. When combined with a digital camera, it can generate high quality images at relatively low operational costs, although the number of images is limited in comparison to other devices. Imaging flow cytometry (IFC) combines fluidics, optical characterization and the imaging of cells/colonies. The Imaging FlowCytobot (IFCB) (Olson and Sosik 2007) and the CytoSense/Cytobuoy (Dubelaar et al. 1999), as well as simpler flow systems such as the FlowCam (Sieracki et al. 1998) and the ZooCAM (Colas et al. 2018), are among the imaging devices most frequently used within aquatic research. The IFCB is a fully automated, submersible instrument with built-in design features for routine operation during deployments, imaging each particle that triggers the camera. The CytoSense, available as either a benchtop or a submersible version, records forward scatter (FSC), side scatter (SSC) and multiple fluorescence signals of each particle; additionally, it can image a subset of the analysed particles. Unlike the IFCB and CytoSense, the FlowCam does not have sheath fluid and it is not an automated in situ instrument. Particle detection in the IFCB and CytoSense is triggered by one of the optical sensors (scatter or fluorescence), while the FlowCam captures images of a field of view at regular intervals, in which particles can then be identified (autotrigger mode). If the FlowCam is equipped with a laser, particle imaging can be triggered by fluorescence properties, such as the presence of chlorophyll-a. The imaging resolution of the IFCB and CytoSense is targeted at a size range from approximately larger nanoplankton to smaller mesoplankton. The targeted size range for the FlowCam varies according to the combination of flowcell and objective used. Instrument versions for imaging smaller and larger objects and organisms, FlowCam-Nano and FlowCam-Macro, respectively, are currently available, and their image capture is based on autotrigger. The ZooCAM uses an imaging principle similar to that of the FlowCam autotrigger.

For obtaining quantitative information from plankton larger than 100 μm, larger volumes of water need to be examined than is possible with IFC (Lombard et al. 2019). For imaging of larger particles, different types of instruments have been developed utilizing slightly distinct techniques. There are many commercially available instruments such as the In-situ Ichthyoplankton Imaging System (ISIIS) (Cowen and Guigand 2008), Continuous Plankton Imaging and Classification Sensor (CPICS) (Grossmann et al. 2015), ZooScan (Gorsky et al. 2010), Video Plankton Recorder (VPR) (Davis et al. 2005), Underwater Vision Profiler (UVP) (Picheral et al. 2010), and Lightframe On-sight Keyspecies Investigation (LOKI) (Schulz et al. 2010), which are mostly in situ imaging systems. Their operational principles as well as capabilities are reviewed by Lombard et al. (2019). Some instruments have been developed for research purposes but are not commercially available, such as the ZooCAM and Prince William Sound Plankton Camera (PWSPC) (Campbell et al. 2020).

Some of the more recent imaging instruments include the SPC (Scripps Plankton Camera) system (Orenstein et al. 2020b), a submersible Digital Holographic Camera (DHC) instrument for temporal and spatial plankton measurements (Dyomin et al. 2020, 2019), and its modification, the miniDHC (Dyomin et al. 2021, 2019). Also HOLOCAM (Nayak et al. 2018), HoloSea (Walcutt et al. 2020; MacNeil et al. 2021), and LISST-Holo are utilized for underwater microscopy using digital holographic imaging (DHI). SPC utilizes an underwater dark-field imaging microscope combined with an onboard computer that allows real-time processing of the images, while the four latter instruments produce 3-D holograms of the imaged volume. The core principle of DHI is the optical interference phenomenon. A coherent light source, typically a laser, produces an optical interference pattern between the undeviated portion of the beam and the light diffracted by the object. This pattern is recorded on the sensor, and holograms are then reconstructed with pre-/post-processed computer-based algorithms (Watson 2018). The main reasons for the emergence of DHI microscopy are a wide depth-of-field and field-of-view, i.e., a larger sampling volume, and a mechanically simpler optical configuration compared to lens-based devices (Walcutt et al. 2020; Watson 2018). As the focus of this review is on image recognition, we stress that the instrument list is not exhaustive and focuses only on the most used methods found in the publications surveyed. Instruments not detailed here include underwater microscopes, scanning electron microscopy and instruments with the capacity to image different fluorescent channels, such as the Amnis ImageStreamX Mk II Imaging Flow Cytometer (Cytek) and environmental high content fluorescence microscopy (Colin et al. 2017).

Table 1 Plankton imaging instruments

2.2 Publicly available image datasets

Publicly available image datasets are crucial for the development of automatic plankton recognition methods since the most labor-intensive part of the process is to create large labeled training and testing datasets. The available datasets are also important for the traceability and comparability of the developed methods. There are several publicly available datasets that can be utilized in research on developing machine learning methods for plankton recognition. However, it is not always clear from the reported results if there are differences in the classification performance among the classes, and if so, which classes perform better than others. This is relevant for understanding whether there are potential class-specific biases in classifiers, which could be associated with specific size classes and robustness of the organisms. The details of the publicly available and commonly used datasets are summarized in Table 2, and example images from the datasets are shown in Fig. 1. The most frequently used datasets are ZooScanNet (Elineau et al. 2018), Kaggle-Plankton (PlanktonSet-1.0) (Cowen et al. 2015), WHOI-Plankton (Orenstein et al. 2021) and their manifold task-specific subsets. They all comprise grayscale images collected with a single plankton imaging instrument. The UVP5/MC dataset (Kiko and Simon-Martin 2020) consists of data collected in the EcoTaxa application (Picheral et al. 2017). A part of the UVP5/MC dataset has been annotated by an expert and a part with an automated tool. More recently collected datasets include PMID2019 (Li et al. 2019b), miniPPlankton (Sun et al. 2020), DYB-PlanktonNet (Li et al. 2021b), Lake-Zooplankton (Kyathanahally et al. 2021a), and the one collected by Plonus et al. (2021b). They are acquired with modern imaging instruments and characterized by the presence of color and a higher resolution. The SYKE-plankton_IFCB_2022 (Kraft et al. 2022c) and SYKE-plankton_IFCB_Utö_2021 (Kraft et al. 2022a) datasets consist of IFCB images of phytoplankton collected from the Baltic Sea. There are also references to some older commonly used plankton datasets that are not available any more. One example is the Automatic Diatom Identification And Classification (ADIAC) database (Du Buf et al. 1999).

Table 2 Existing plankton image data sets
Fig. 1 Example images from the publicly available data sets: a Kaggle-Plankton (Cowen et al. 2015); b WHOI-Plankton (Orenstein et al. 2019b); d ZooScan (Elineau et al. 2018); e DYB-PlanktonNet (Li et al. 2021b); f SYKE 2022 (Kraft et al. 2022c)

3 Automatic plankton recognition

3.1 Feature engineering

A traditional solution for image classification including plankton recognition is to divide the problem into two steps: image feature extraction and classification (Blaschko et al. 2005; Bueno et al. 2017; Ellen et al. 2015; Grosjean et al. 2004; Sosik and Olson 2007; Zetsche et al. 2014; Barsanti et al. 2021). Ideally, image features form a lower-dimensional representation of the image content that contains relevant information for the classification. The main challenge is to design and select good features that are both general and provide good discrimination between the classes. After feature extraction, the obtained feature vectors are used to train a classifier that can then classify unseen images. The most commonly used classifiers for plankton recognition are support vector machine (SVM) (Bernhard et al. 1992; Cortes and Vapnik 1995) and random decision forest (RDF) (Ho 1995). SVM in its simplest form is a binary linear classifier that works by mapping the data points in the feature space in such a way that the margin between two classes is maximised. It can be extended to the multi-class case, for example, by utilizing multiple binary classifiers, and to non-linear classification by using the kernel trick. The RDF is a widely used classification method that is based on the observation that combining several classifiers to form an ensemble typically provides better classification performance than any of the individual classifiers. In a typical RDF, a large number of decision tree classifiers are constructed and the final classification is obtained by computing the mode of the individual classifications. This way, the typical problem of overfitting in the case of decision trees is mitigated.
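The ensemble decision described above, taking the mode of the individual tree outputs, can be sketched in a few lines. This is only a minimal illustration: the stumps, feature names (`length_um`, `aspect_ratio`), thresholds, and class labels are hypothetical and not taken from any of the cited works, and a real RDF would learn full trees from bootstrapped training data.

```python
from collections import Counter


def make_stump(feature_index, threshold, label_below, label_above):
    """Build a one-split decision tree (stump) on a single feature."""
    def stump(features):
        return label_below if features[feature_index] < threshold else label_above
    return stump


def forest_predict(forest, features):
    """Return the mode (majority vote) of the individual tree predictions."""
    votes = Counter(tree(features) for tree in forest)
    # most_common(1) yields [(label, count)] for the most frequent label
    return votes.most_common(1)[0][0]


# Hypothetical feature vector layout: [length_um, aspect_ratio]
forest = [
    make_stump(0, 100.0, "ciliate", "diatom_chain"),
    make_stump(1, 3.0, "ciliate", "diatom_chain"),
    make_stump(0, 60.0, "ciliate", "diatom_chain"),
]

elongated = forest_predict(forest, [80.0, 5.0])  # two of three stumps vote "diatom_chain"
compact = forest_predict(forest, [40.0, 1.2])    # all stumps vote "ciliate"
```

Here the first stump misjudges the elongated particle, but the vote of the other two outweighs it, which is the effect the ensemble relies on.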

The first work on automatic plankton image classification was presented by Tang et al. (1998). The image data were produced using a video plankton recorder (VPR) (Davis et al. 1992) and the proposed method combined texture and shape information of plankton images in a descriptor that is the combination of traditional invariant moment features and Fourier boundary descriptors with gray-scale morphological granulometries. It should be noted that some papers on automatic plankton recognition based on non-image data have been published even earlier. For example, Boddy et al. (1994) utilized light scatter and fluorescence data obtained by flow cytometry to train an artificial neural network (ANN) to classify plankton species.

Finding good image features is essential for any plankton classification system (Cheng et al. 2018; Corgnati et al. 2016). Various feature extraction technologies have been proposed and put into practice for different underwater imaging environments (Sosik and Olson 2007; Zetsche et al. 2014). Frequently used plankton features include texture features (e.g. Mosleh et al. 2012), geometric and shape features (e.g. Tan et al. 2014), color features (e.g. Ellen et al. 2015), local features (e.g. Zheng et al. 2017), and model-based features (e.g. Rivas-Villar et al. 2021). Table 4 in Appendix A categorizes and summarizes various features used for plankton recognition.

The most commonly used image feature type in plankton recognition is shape features (see e.g. Sosik and Olson 2007; Zetsche et al. 2014) that characterize either the contour or the binary mask of the object (plankton). In their simplest form, geometric features are numerical descriptors of generic geometric aspects such as major and minor axis length, perimeter, equivalent spherical diameter and area of an object computed from a binarized image. Another common approach is to utilize image moments to describe the shape. Both Hu moments (Hu 1962; Thiel et al. 1995; Liu et al. 2021a; Zhao et al. 2005, 2010) and Zernike moments (Khotanzad and Hong 1990; Blaschko et al. 2005) have been proposed for plankton recognition. Also, various advanced features quantifying the shape of the contour have been proposed for plankton data. These include boundary smoothness (e.g. Tang et al. 2006; Liu and Watson 2020), affine curvature descriptors (Liu and Watson 2020), Freeman contour code features (Rodenacker et al. 2006), and elliptical Fourier descriptors (Sánchez et al. 2019a; Beszteri et al. 2018). Further geometric features applied for plankton recognition include symmetry measures [e.g. Hausdorff distance (Guo et al. 2021c; Sosik and Olson 2007)] and granulometries (Kingman 1975) utilizing morphological operations (Luo et al. 2005; Kramer 2005; Tang et al. 2006; Wu and Sheu 1998).
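As a rough illustration of the simplest shape features mentioned above, the following sketch computes the area, a boundary pixel count as a crude perimeter proxy, an equivalent diameter, and the first Hu moment from a binary mask. The function names and the choice of 4-connectivity for the boundary are assumptions made for this example, and the equivalent diameter is taken here as the diameter of a circle with the same area as the mask.

```python
import math


def shape_features(mask):
    """Compute simple geometric descriptors from a binary mask (rows of 0/1)."""
    rows, cols = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(rows) for c in range(cols) if mask[r][c]]
    area = len(pixels)
    if area == 0:
        raise ValueError("mask contains no foreground pixels")

    # Centroid of the object
    r_bar = sum(r for r, _ in pixels) / area
    c_bar = sum(c for _, c in pixels) / area

    # Second-order central moments (translation invariant)
    mu20 = sum((c - c_bar) ** 2 for _, c in pixels)
    mu02 = sum((r - r_bar) ** 2 for r, _ in pixels)

    def is_background(r, c):
        return r < 0 or r >= rows or c < 0 or c >= cols or not mask[r][c]

    # Foreground pixels with at least one 4-connected background neighbour
    neighbours = ((1, 0), (-1, 0), (0, 1), (0, -1))
    boundary = sum(
        1 for r, c in pixels
        if any(is_background(r + dr, c + dc) for dr, dc in neighbours)
    )

    return {
        "area": area,
        "boundary_pixels": boundary,
        "equivalent_diameter": 2.0 * math.sqrt(area / math.pi),
        # First Hu moment: eta20 + eta02, with eta_pq = mu_pq / mu00^2 for p + q = 2
        "hu1": (mu20 + mu02) / area ** 2,
    }


def square_mask(size, top, left, side):
    """Hypothetical test object: a filled square inside a size x size image."""
    return [
        [1 if top <= r < top + side and left <= c < left + side else 0
         for c in range(size)]
        for r in range(size)
    ]


features = shape_features(square_mask(8, 1, 1, 4))
shifted = shape_features(square_mask(8, 3, 2, 4))
```

Moving the square within the image leaves all four descriptors unchanged, which illustrates why central moments are a common starting point for shape description.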

Another frequently used type of feature in plankton recognition systems is texture features, which quantify the spatial distribution of intensity or color values in local image regions. While shape features consider only the boundary of plankton, texture features describe the region inside the boundary. The simplest texture features commonly applied in plankton recognition are first-order statistical descriptors that compute simple statistical values directly from the intensity values (see e.g. Lisin 2006; Zetsche et al. 2014; Guo et al. 2021c). These are sometimes called color features and include, for example, mean intensity and variance of intensity, as well as skewness and kurtosis that quantify the shape of the color or intensity histogram. The first-order statistics only provide information on how the intensity or color values are distributed in the image. To obtain further spatial information on texture, various second-order statistical descriptors have been proposed. The most common second-order statistical descriptor used in plankton recognition is the co-occurrence matrices (Hu and Davis 2005; Liu et al. 2021a; Shan et al. 2020; Wei et al. 2018) used YOLO to detect and recognize plankton. Wang et al. (2022b) compared multiple CNN-based object detection methods including Faster R-CNN (Ren et al. 2017), SSD (Liu et al. 2016), YOLOv3 (Redmon and Farhadi 2018) and YOLOX (Ge et al. 2021) on imaging flow cytometer data. YOLOX achieved the best accuracy. Chen et al. (2023) explored a family of YOLOv5 (Jocher 2020) architectures in an automated video-oriented plankton detection and tracking workflow.

While detection methods are typically applied to multi-specimen images, they have also been proposed for recognition of single-specimen focused images. Li et al. (2021c, 2021d) proposed an improved YOLOv3-based model for plankton detection on IFCB images. The proposed model contains two YOLOv3 networks fused with the DenseNet architecture. Kosov et al. (2018) applied CNN-based image features and conditional random fields for plankton localization and segmentation.

Similar to modern detection methods, semantic and instance segmentation methods can also be applied to simultaneously detect and recognize plankton. Ruiz-Santaquiteria et al. (2020) compared a semantic segmentation model called SegNet (Badrinarayanan et al. 2017) and an instance segmentation model called Mask R-CNN (He et al. 2017) on algae detection and recognition.

3.2.5 Comparisons

Many papers utilize in-house datasets, and most publicly available datasets do not provide a standardized evaluation protocol, meaning that different papers utilize different train-test splits and performance metrics. This makes comparing the performance of different solutions challenging until the principles of making science findable, accessible, interoperable, and reusable (FAIR) are fully adopted (Schoening et al. 2022). Table 3 summarizes some published results obtained on publicly available datasets. However, the provided accuracies are not directly comparable due to the reasons mentioned above. One notable comparison of plankton recognition methods is The National Data Science Bowl (Aurelia et al. 2014) from 2015. The winning team used an ensemble of over 40 convolutional neural networks.

Table 3 Example accuracies on publicly available datasets from various sources

4 Challenges in plankton recognition

Based on the literature on automatic plankton recognition various challenges can be identified. The most notable challenges are as follows:

  1.

    The amount of labeled data for training is limited. This challenge can be divided into two subchallenges: (1) expert knowledge is required for data labeling, and (2) certain plankton species are notably less common, producing a small number of example images. Plankton species are inherently difficult to identify, requiring prior expertise. Labeling image data for training and evaluation purposes must be done by experts (e.g. plankton taxonomists), ruling out crowdsourcing tools such as Amazon Mechanical Turk commonly used for labeling large datasets. This makes labeling expensive, limiting the amount of labeled data. It also takes years to accumulate enough data to cover rare species. Collecting a labeled training set is essential for deep learning models. Considering that morphological plasticity can be found for all planktonic organisms, a larger amount of labeled training data increases the model's capacity to generalize to new data, while training a large model with a small number of examples increases the risk of overfitting, i.e. learning the noise in the training data, causing the model to perform poorly on unseen images.

  2.

    There is a large imbalance between classes. Image classification with datasets that suffer from a greatly imbalanced class distribution is a challenging task in the computer vision field. Data of plankton species naturally exhibit an imbalance in their class distribution, with some plankton species occurring naturally more commonly than others. This results in highly biased datasets and makes it difficult to learn to recognize rare species, having a serious impact on the performance of classifiers. Furthermore, with highly unbalanced datasets the overall classification accuracy (e.g. percentage of images that were correctly classified) provides little information about the classes with a small number of samples which may bias the evaluation of the goodness of the classification methods.

  3.

    Visual differences between certain classes are small. Certain plankton species, especially those that are taxonomically close to each other and/or have reduced size, resemble each other visually, which renders the recognition task a fine-grained classification problem. Limitations in the amount of labeled training data make it challenging to ensure that the recognition model learns the subtle differences between the classes reducing the recognition accuracy.

  4.

    Imaging instruments vary between datasets. If two datasets have been obtained with different imaging instruments producing visually different images (domain shift) the classification model trained on one dataset does not provide sufficient classification accuracy on the other dataset when applied directly. This makes it challenging to develop general-purpose classifiers that could be applied to new datasets limiting the applicability of the existing publicly available large image datasets. There is a need for approaches that allow the adaptation of the trained models to new imaging instruments.

  5.

    Labeled training sets do not contain all the classes that can be captured. When deploying a recognition model in operational use, it should be able to handle images from the classes that were not present in the training phase. Different datasets often have different sets of plankton species due to, for example, the geographical distance between the imaging locations or the particle size range of the imaging instruments. Moreover, imaging instruments capture images of unknown particles. Typical CNN-based classification models trained on one dataset tend to classify the images from a previously unseen class to one of the known classes, often with high confidence. This not only makes the models incapable of generalizing to new datasets and analyzing noisy data but also makes it difficult to recognize when the model fails. This calls for methods that can identify when the image is from a previously unseen class (species).

  6.

    There are uncertainties in expert labels. Due to limited imaging resolutions and low image quality, recognizing plankton species is often difficult even for an expert. Manually labeling large amounts of images is tedious work, increasing the risk of human errors. Moreover, due to the high costs of labeling work, it is typically not possible to obtain opinions from multiple experts for each image. These reasons cause inaccuracies (uncertainty) in the labels of the training data, decreasing the classification performance of the trained models. Furthermore, this uncertainty is often highly imbalanced since some of the classes are easier to identify than others.

  7.

    Variation in image size and aspect ratio is very large. Most CNN architectures require that the input images have fixed dimensions and a typical approach in image classification is to first scale the images into a common size. This is not ideal in plankton recognition due to a very large variation in both the size and aspect ratio of plankton. Scaling images into a common size may cause either small details to be lost in the large images (downscaling) or very large and computationally heavy models (upscaling). Furthermore, the size is an important cue for recognizing the plankton species and this information is lost in scaling.

  8.

    Image quality can be low or have extensive variation. Plankton imaging requires high magnification and the (natural) water might contain other particles, cause unwanted optical distortions, as well as limit the visibility. More importantly, due to the limited depth-of-field, automated imaging instruments often fail to capture particles in focus and the focus may drift away from optimal setting. These reduce the quality of images. The low image quality makes both manual labeling (Challenge 6) and automatic classification considerably more challenging. Therefore, there is a need for plankton recognition solutions that are robust to image distortions such as blur and noise.

  9.

    The amount of image data is massive. Modern plankton imaging instruments produce massive amounts of image data, e.g. FlowCam Macro and ISIIS have the ability to take 10,000 images per minute and 64,000 images per hour respectively. Computationally efficient solutions are needed to perform the analysis in real-time (MacLeod et al. 2010; Orenstein et al. 2015).

All nine challenges are visualized in Fig. 4.

Fig. 4 The nine main challenges that complicate the introduction of automatic plankton recognition methods to operational use
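The remark under Challenge 2, that overall accuracy says little about classes with few samples, can be made concrete with a small worked example. The class names and counts below are invented for illustration only. A classifier that always predicts the common class reaches 95% overall accuracy while never recognizing the rare class, which per-class recall and its mean (often called balanced accuracy) expose.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Fraction of correctly classified images for each true class."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Hypothetical imbalanced test set: 95 images of a common class, 5 of a rare one
y_true = ["common"] * 95 + ["rare"] * 5
# A degenerate classifier that always predicts the common class
y_pred = ["common"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
balanced_accuracy = sum(recalls.values()) / len(recalls)
```

In this sketch `accuracy` is 0.95 whereas `balanced_accuracy` is 0.5, so reporting per-class or class-averaged metrics alongside overall accuracy gives a fairer picture on imbalanced plankton data.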

5 Existing solutions

5.1 Challenge 1: Limited amount of labeled training data

The two main reasons limiting the amount of labeled training data, the requirement of expert knowledge for the very laborious labeling task and rarity of certain plankton species, require different solutions.

Active learning has been utilized to minimize the effort of expensive human experts in labeling plankton image data (Luo et al. 2005; Li et al. 2021a). The basic idea behind active learning is to select only the most informative samples for labeling. A classifier is first trained on a small initial training set and the method iteratively seeks to find the most informative samples from an unlabeled dataset. These samples are then labeled by a human expert and the model is re-trained. A simple active learning technique for plankton images called "breaking ties" was proposed by Luo et al. (2005). The method utilizes a probability approximation for an SVM-based classifier and ranks the unlabeled images based on the difference between the largest and the second largest class probabilities (the smaller the difference, the less confident the classifier is). Images with the smallest confidence were labeled by an expert. Drews et al. (2013) studied semi-automatic classification and active learning approaches for microalgae identification. A Gaussian mixture model (GMM) is estimated from the image feature data and three different sampling strategies are used for the active learning. The experimental results show the benefit of using active learning to improve the performance with few labeled samples. Bochinski et al. (2018) proposed Cost-Effective Active Learning (CEAL) (Wang et al. 2016) for plankton recognition. In contrast to traditional active learning, where only the manually annotated samples are used in the model training, CEAL also utilizes the unlabeled high-confidence samples for training with class predictions as pseudo labels. Haug et al. (2021a, 2021b) and Haug (2021) proposed the Combined Informative and Representative Active Learning technique (CIRAL) to minimize the human involvement in the plankton image labeling process. The main idea behind the method is to find the images that are misclassified under minimal perturbations and to ignore the images that are far from the decision boundary. The DeepFool algorithm is used to compute small perturbations to the images. Finding the representative images is formulated as a min-max facility location problem and solved using a greedy algorithm.
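The "breaking ties" criterion described above reduces to ranking unlabeled images by the margin between their two largest class probabilities. The sketch below assumes that class probabilities are already available from some classifier; the image identifiers and probability values are made up for illustration and the function name is not from the cited work.

```python
def breaking_ties_ranking(probabilities):
    """Rank image ids from least to most confident.

    probabilities maps an image id to its list of class probabilities.
    Confidence is the gap between the largest and second largest probability.
    """
    def margin(class_probs):
        top_two = sorted(class_probs, reverse=True)[:2]
        return top_two[0] - top_two[1]

    return sorted(probabilities, key=lambda image_id: margin(probabilities[image_id]))


# Hypothetical classifier outputs for three unlabeled images
pool = {
    "img_a": [0.90, 0.05, 0.05],  # margin 0.85, confident
    "img_b": [0.40, 0.38, 0.22],  # margin 0.02, nearly tied
    "img_c": [0.55, 0.30, 0.15],  # margin 0.25
}

ranking = breaking_ties_ranking(pool)
# Send the two least confident images to the expert for labeling
to_label = ranking[:2]
```

In a full active learning loop, the images in `to_label` would be annotated, added to the training set, and the classifier retrained before the pool is ranked again.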

While active learning helps to reduce manual work, it is often still a time-consuming process. Typically, there is a need to obtain more training data in a completely automated manner. A traditional approach to increase the amount of training data is to utilize data augmentation. By augmenting the existing labeled image data with various image manipulations, the diversity of the training data, and therefore, the generalizability and accuracy of the trained model can be improved. The most commonly used data augmentation techniques for plankton image recognition include various geometric transformations (Orenstein and Beijbom 2017; Vallez et al. 2022) including rotation (e.g. Cheng et al. 2019; Correa et al. 2017), shearing (e.g. Dai et al. 2016a; Geraldes et al. 2019), flipping (e.g. Ellen et al. 2019; Geraldes et al. 2019), and rescaling (e.g. Li and Cui 2016; Luo et al. 2018). Also, additional noise (e.g. Correa et al. 2017; Geraldes et al. 2019), blurring (Geraldes et al. 2019), contrast normalisation (Geraldes et al. 2019), as well as adjusting brightness, saturation, contrast, and hue (Dunker et al. 2018) have been utilized. Some works augment images using translation (e.g. Dai et al. 2016a; Li and Cui 2016). However, it should be noted that, unless the translation is used to cut an image, CNNs are invariant to translation by design, and therefore, this is typically unnecessary when CNNs are used for recognition. Augmentation has been shown to increase the plankton recognition accuracy even with relatively large training sets (see e.g. Song et al. 2020). Examples of augmented images are shown in Fig. 5.

Fig. 5 Examples of data augmentation methods
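A few of the geometric and noise-based augmentations listed above can be sketched in plain Python. The sketch assumes grayscale images stored as nested lists of floats in [0, 1]. The function names, the probabilities of applying each transformation, and the noise level are illustrative assumptions rather than settings from the cited works.

```python
import random


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]


def rotate_90(image):
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


def add_noise(image, sigma, rng):
    """Add Gaussian noise and clip the result back to [0, 1]."""
    return [[min(1.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
            for row in image]


def augment(image, rng, sigma=0.02):
    """Apply a random combination of flips, 90-degree rotations and noise."""
    out = image
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    if rng.random() < 0.5:
        out = flip_vertical(out)
    for _ in range(rng.randrange(4)):
        out = rotate_90(out)
    return add_noise(out, sigma, rng)


# Hypothetical 2x2 "image"; a real pipeline would loop over the training set.
augmented = augment([[0.0, 0.5], [1.0, 0.25]], random.Random(0))
```

Rotations by multiples of 90 degrees are used here only because they need no interpolation; arbitrary-angle rotation, shearing and rescaling would typically be done with an image processing library.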

Another commonly used approach to address a small amount of labeled training data is transfer learning. Transfer learning is a machine learning method that transfers knowledge gained from the source domain, where labeled training data are abundant, to the target domain, where labeled training data are scarce (Pan and Yang 2009; Shao et al. 2014; Weiss et al. 2016) (see Fig. 6). In the context of plankton recognition, this typically means that the model is first trained using either general image datasets [e.g. ImageNet (Deng et al. 2009)] or a large publicly available plankton dataset and then fine-tuned for the target plankton dataset with typically a limited number of labeled images. Using general image databases as source data is justified by the fact that the learned low level image features are often useful regardless of the classification problem. In the simplest case, transfer learning can be done by simply replacing and training the classification layer and keeping the feature extraction layers unchanged (see e.g. Mitra et al. 2019). However, it is often beneficial to use the pre-trained network only for initialization and retrain (or fine-tune) the whole network with the target dataset (Lumini et al. 2020). Existing studies on the WHOI-Plankton dataset suggest that using pre-trained models and fine-tuning them for plankton data (see e.g. Lumini and Nanni 2019a) can achieve significantly higher accuracy than training the models from scratch on plankton data (see e.g. Liu et al. 2018a).

Fig. 6 The difference between traditional machine learning and transfer learning

One way to apply transfer learning for plankton images is to use trained CNNs only for feature extraction and utilize general classification methods such as SVM or RDF for the recognition (see e.g. Rodrigues et al. 2018; Rawat et al. 2019). However, the results by Orenstein and Beijbom (2017) suggest that better accuracy is obtained by utilizing end-to-end CNN with classification layers. Lumini and Nanni (2019a); Lumini et al. (2020) evaluated various strategies for transfer learning on plankton images. The first strategy was to initialize the model with ImageNet weights and fine-tune the whole model with plankton data. In the second strategy (two rounds tuning), a second pre-training step utilizing out-of-domain plankton image data was added before the fine-tuning. In the third strategy, ensembles of multiple different models were used. Based on the experiments the two rounds tuning did not provide a notable improvement in accuracy. Similarly, Guo et al. (2021b) explored and compared multiple transfer learning schemes on several biology image datasets from various domains. Various underwater and ecological image datasets are utilized for multistage transfer learning, where ImageNet pretraining is first improved by fine-tuning on an intermediate dataset before, finally, training on the target dataset consisting of plankton images. The experimental results show the potential of cross-domain transfer learning even on the out-of-domain data when the number of samples in the target domain is insufficient.

Large models with more parameters typically require a large amount of data to be trained without overfitting the model. To avoid this and allow the training with a smaller amount of data, shallower CNN architectures have been proposed for plankton recognition. For example, the 18-layer version of ResNet architecture has been shown to achieve a high plankton recognition accuracy on IFCB data (Kraft et al. 2022b). Most custom CNN architectures developed especially for plankton recognition including ClassyFireNet (Jindal and Mundra 2015), TANet (Li et al. 2019c), and ZooplanktoNet (Dai et al. 2016a) are relatively shallow with 8, 8 and 11 layers, respectively. It has been shown that a good classification accuracy could be obtained with a shallow architecture and by using suitable data augmentation methods even with as few as 10 images per class (Kraft et al. 2022b).

In addition to data manipulation and custom recognition models, also model training approaches have been considered to address the limited data amounts. Learning techniques developed for training the classifier with a minimal amount of samples are called few-shot learning methods. Typically, the idea is to utilize some prior knowledge to allow the generalization to new tasks (in this case classification of new plankton species) containing only a few labeled training examples. Common ways to address few-shot learning are to utilize generation (Hariharan and Girshick 2017), embedding or metric learning. The basic idea is to learn such embeddings that the images from the same class are close to each other in the metric space and images from different classes are far apart. This allows performing the plankton recognition using distances to the images with known plankton species. Embedding and metric learning have been successfully applied to plankton recognition (Teigen et al. 2020; Badreldeen Bdawy Mohamed et al. 2022).

Schröder et al. (2018) employed a low-shot learning technique called weight imprinting (Qi et al. 2018) for plankton recognition with a limited amount of labeled training data. The main idea of weight imprinting is to divide the set of all classes into base classes with enough training data and smaller low-shot classes. During the representation learning phase, a CNN is trained to distinguish the base classes with a large amount of labeled training data. In the second phase (low-shot learning), the classifier is then updated with calculated weights to distinguish the smaller low-shot classes. This is done by using appropriately scaled class features of the low-shot classes as their weights, directly allowing the inclusion of classes with only one training image. Guo and Guan (2021) addressed the few-shot learning by supplementing the softmax loss with center loss term (Wen et al. 2016) that forces the samples from the same class close to each other in the deep feature space. The loss function is a weighted sum of the two loss terms and a regularization parameter is used to control the weights.
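The imprinting step can be illustrated with a minimal sketch. It assumes that embeddings from an already trained CNN are available as plain vectors and that classification is done by cosine similarity to the class weights. The function names, class names and numeric values are hypothetical, and the sketch simplifies the procedure of Qi et al. (2018) by leaving out the learned scale factor.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def imprint_weight(embeddings):
    """Average the normalized embeddings of one class and renormalize the mean."""
    normalized = [l2_normalize(e) for e in embeddings]
    dim = len(normalized[0])
    mean = [sum(e[d] for e in normalized) / len(normalized) for d in range(dim)]
    return l2_normalize(mean)


def classify(embedding, class_weights):
    """Return the class whose weight has the highest cosine similarity."""
    query = l2_normalize(embedding)
    scores = {name: sum(q * w for q, w in zip(query, weight))
              for name, weight in class_weights.items()}
    return max(scores, key=scores.get)


# Weights of two base classes (hypothetical values standing in for trained weights).
class_weights = {
    "diatom": l2_normalize([1.0, 0.1, 0.0]),
    "ciliate": l2_normalize([0.0, 1.0, 0.1]),
}
# A low-shot class is added from a single labeled embedding, without retraining.
class_weights["rare_dinoflagellate"] = imprint_weight([[0.1, 0.0, 2.0]])
prediction = classify([0.2, 0.1, 1.5], class_weights)
```

Because the new weight is computed directly from the embeddings, a class can be added even when only one training image is available, which is the property highlighted above.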

In the extreme case, the labeled training data are completely absent and unsupervised learning methods are required. Image clustering is the most commonly used unsupervised technique for plankton image analysis. Ibrahim (2020) carried out preliminary experiments on common clustering algorithms such as k-means with phytoplankton data. Image features for clustering were extracted using pretrained CNN models. Coltelli et al. (2014) used various handcrafted image features and self-organizing maps (SOM) for plankton image clustering. Schmarje et al. (2021) proposed a framework for handling semi-supervised classifications of fuzzy labels due to experts having different opinions. The approach is based on overclustering to identify substructures in the fuzzy labels and a loss function to improve the overclustering. The performance surpassed the one of a state-of-the-art semi-supervised method on plankton data. Salvesen (2021); Salvesen et al. (2022) studied deep learning for plankton classification without ground truth labels. The improved feature learning was implemented using DeepCluster, a Generative Adversarial Network (GAN) and a rotation-invariant autoencoder. Despite the potential in unsupervised methods, the gap to supervised learning is still significant.

Hierarchical clustering methods are preferred on plankton data as they have the potential to mimic the taxonomic hierarchy of plankton. In Dimitrovski et al. (2012), classification of diatom images is considered as a hierarchical multi-label classification problem and solved by constructing predictive clustering trees that can simultaneously predict all different levels in the taxonomic hierarchy. These trees are then used as an ensemble forming a random forest (RF) to improve the predictive performance. Morphocluster (Schröder et al. 2020) utilizes a semi-automated iterative approach and hierarchical density-based HDBSCAN* (Campello et al. 2015) for plankton image data analysis. To compute image features for the clustering, a CNN trained with the UVP5/EcoTaxa dataset in a supervised manner was used. The method works iteratively in a semi-automated manner so that clusters are validated by an expert. An improved version of Morphocluster was presented by Schröder and Kiko (2022). Multiple CNN-based feature extractors were trained using different labeled datasets to allow the selection of the most suitable feature extractors for the target data. In addition, an unsupervised approach to learn the plankton image features based on the momentum contrast method (He et al. 2020) was proposed. The idea is to use data augmentation to generate two different instances of the same image and use a loss function that forces the model to learn similar feature representations for both instances. Moreover, two custom clustering methods were proposed: (1) shrunken k-Means, and (2) Partially Labeled k-Means. Due to the iterative clustering process of Morphocluster, only part of the images needs to be clustered in each iteration. Shrunken k-Means utilizes distances to cluster centers provided by k-means to discard images that are far from the centers. Partially Labeled k-Means utilizes the label information from the earlier iterations to guide the clustering.

Autoencoders have also been proposed for learning plankton image features for clustering without the label information. The basic idea is to utilize encoder-decoder network architecture where the encoder generates an embedding vector from an image and the decoder tries to reconstruct the original image based on the embedding vector. Such a network can be trained without any labels. Ideally, the encoder learns to compress the essential information from the image into an embedding vector that can then be used for clustering. For example, Salvesen et al. (2020) applied an autoencoder-based approach called Deep Convolutional Embedded Clustering (DCEC) to plankton image data. The method employs the CNN-based autoencoder architecture by Guo et al. (2017) and uses k-means to cluster the obtained embeddings. Alfano et al. (2009) proposed an approach for balancing the trade-off between the classification performance and number of classes. The model automatically suggests merging of classes based on the statistics evaluated after the classification. The results from taxa recognition of macroinvertebrates by Ärje et al. (2020) showed that humans performed better when a hierarchical classification approach commonly used by human taxonomic experts was used, but when a flat classification approach was used, the CNN was close to human accuracy. To improve the automatic approaches, a few methods focusing especially on the attention mechanism to address the fine-grained nature of the recognition task have been proposed.

Sun et al. (2020) considered fine-grained classification of plankton by proposing an attention mechanism based on Gradient-weighted Class Activation Maps (Grad-CAM) (Selvaraju et al. 2017) to force the CNN to focus on the most informative regions in the image. Grad-CAM was originally developed for visualizing the CNN-based models. It highlights important image regions which correspond to the decision of interest (in this case plankton recognition). Sun et al. (2020) utilized Grad-CAM to detect the regions to focus on, and a feature fusion approach utilizing high-order integration (Cai et al. 2017) is applied to obtain stronger features for those regions. This approach shares similarities with the self-attention module used in the TANet architecture (Li et al. 2019c) for plankton recognition. However, the self-attention module puts larger weights on the important regions, i.e. those regions in the feature map with high activation values. Ito et al. (2023) proposed to use Attention Branch Network model for hierarchical classification of plankton images. This was motivated by the hierarchical structure of the plankton taxonomy. Successful classification at higher levels of the taxonomy simplifies the fine-grained recognition task at the lower levels.

Also other approaches for fine-grained plankton recognition have been proposed. Du et al. (2020) applied Matrix Power Normalized CO-Variance (MPN-COV) pooling layer for second-order feature extraction. The aim is to model the complex class boundaries more accurately than in traditional pooling (e.g. softmax). There is some evidence (Li et al. 2021) and animal re-identification (Nepovinnykh et al. 2020), as well as content-based image retrieval (Dubey 2021), but has been also successfully applied to plankton classification (Teigen et al. 2020; Badreldeen Bdawy Mohamed et al. 2022). A simple approach to implement a recognition method is to construct a gallery set of known species and use the learned similarity metric to compare query images to the gallery images. The similarity in this context corresponds to the likelihood that the images belong to the same class. This further allows defining a threshold value for similarity enabling open-set classification: if no similar images are found in the gallery set, the query image is predicted to belong to an unknown class. Furthermore, new classes can be added by simply including them in the gallery set as the model does not necessarily need to learn class-specific image features.
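The gallery-based recognition with a similarity threshold described above can be sketched as follows. The sketch assumes that image embeddings have already been computed by a trained metric learning model and uses cosine similarity as the similarity measure. The function names, the embeddings and the threshold value are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_open_set(query, gallery, threshold):
    """Return the label of the most similar gallery image, or "unknown".

    `gallery` is a list of (embedding, label) pairs for the known species.
    """
    best_label, best_similarity = None, -1.0
    for embedding, label in gallery:
        similarity = cosine_similarity(query, embedding)
        if similarity > best_similarity:
            best_label, best_similarity = label, similarity
    if best_similarity < threshold:
        return "unknown"
    return best_label


# Hypothetical two-dimensional embeddings of two known classes.
gallery = [([1.0, 0.0], "class_a"), ([0.0, 1.0], "class_b")]
known = classify_open_set([0.9, 0.1], gallery, threshold=0.8)
novel = classify_open_set([-1.0, -1.0], gallery, threshold=0.8)

# A new class is added by extending the gallery; the model itself is unchanged.
gallery.append(([-1.0, -1.0], "class_c"))
after_update = classify_open_set([-0.9, -1.1], gallery, threshold=0.8)
```

In practice the threshold would be tuned on validation data, since it controls the trade-off between rejecting known species and accepting unknown ones.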

The most common approaches for deep metric learning include triplet-based learning and classification-based metric learning. The first approach learns the metric by sampling image triplets with anchor, positive, and negative examples (Hoffer and Ailon 2015). The loss function is defined in such a way that the distances from the embeddings of the anchors to the positive samples are minimized, and the distances from the anchors to the negative samples are maximized. The second approach approximates the classes using learned proxies (Movshovitz-Attias et al. 2017) or class centers (Deng et al. 2019) that provide the global information needed to learn the metric. This makes it possible to formulate the loss function based on the softmax loss and avoids the challenging triplet mining step.
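For illustration, a triplet loss for a single triplet can be written as below. The sketch uses the common hinge formulation with a margin and Euclidean distances, which is an assumption and may differ from the exact loss used in the cited works. The margin value and the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`.
    """
    positive_distance = euclidean_distance(anchor, positive)
    negative_distance = euclidean_distance(anchor, negative)
    return max(0.0, positive_distance - negative_distance + margin)


# Well separated triplet: the positive is close, the negative is far.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Violating triplet: the "positive" is farther away than the negative.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Triplets with zero loss contribute no gradient, which is why the selection (mining) of informative triplets is a costly part of this approach.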

Teigen et al. (2020) studied the viability of few-shot learners in correctly classifying plankton images. A Siamese network was trained using the triplet loss and used to determine the class of a query image. Two scenarios were tested: the multi-class classification and the novel class detection. A model trained to distinguish between five classes of plankton using five reference images from each class was able to achieve reasonable accuracy. In the novel class detection, however, the model was able to filter out only 57 images out of 500 unknowns.

Badreldeen Bdawy Mohamed et al. (2022) utilized the angular margin loss (ArcFace) (Deng et al. 2019) instead of triplet loss to address the high cost of the triplet mining step. Furthermore, Generalised Mean pooling (GeM) (Radenović et al. 2018) was applied to aggregate the deep activations to rotation and translation invariant representations. ArcFace uses a similarity learning mechanism that allows distance metric learning to be solved in the classification task by introducing the Angular Margin Loss. This allows straightforward training of the model and only adds negligible computational complexity. The metric learning-based method was shown to outperform the model utilizing OpenMax (Bendale and Boult 2016) layer in open-set classification of plankton. One of the main benefits of the method is that it generalizes well to new classes added to the gallery set without retraining. This makes it straightforward to apply the model to new datasets with only partly overlapping plankton species composition. A similar approach was proposed by Yang et al. (2022), who used the supervised contrastive (SupCon) loss instead of the ArcFace loss.

Plankton species vary in different locations and seasons, thus, it is common that a recognition model should be adapted to or retrained for the new situation at some point. Retraining a separate model for each situation is infeasible, and continual or online training of the model would be challenging for online monitoring applications. Therefore, an effective remedy would be to treat it as an open-set recognition problem, solve it with modern methods such as anomaly detection or metric learning, and take care of the model’s capability to generalize to new data without the need to retrain the whole model.

5.6 Challenge 6: Label uncertainty

The plankton image label uncertainty is caused by the difficulty of manually recognizing the species from low-quality images with limited resolution, human error, and high costs preventing the repetition of the manual annotation by multiple experts. Culverhouse et al. (2003) identified four main reasons for the incorrect labeling of plankton images: (1) the limited short-term memory of humans, (2) fatigue, (3) recency effects, i.e., labeling is biased towards the most recently seen labels, and (4) positivity bias, i.e., labeling is biased by the expert’s expectations of the content of the sample. Labels provided by sixteen human experts (marine ecologists and harmful algal bloom monitoring specialists) on microscopy images of dinoflagellates (6 classes) were analyzed. The results showed that only 67–83% self-consistency and 43% consensus between experts was obtained. Experts who were routinely labeling the selected classes were able to achieve 84–95% labeling accuracy. Culverhouse (2007) brought up several important points related to labeling algae. The presented performance figures do not represent the state-of-the-art of automatic approaches, but improvements would be beneficial for both alternatives. Human expert judgements would benefit from peer review and inter-expert calibration to remove human bias. To improve the automatic solutions, the errors of both man and machine would require further attention. Global reference databases with validated samples and representative coverage of the morphological and physiological characteristics in nature would be beneficial for training and evaluation purposes. In addition, Solow et al. (2001) noted that the taxonomic counts of classified individuals are biased when there are errors in classification. A straightforward method for correcting for the bias was proposed based on the classification probabilities of the classifier.
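One way to illustrate a bias correction based on classification probabilities is to invert the relation between true and observed counts. The sketch below assumes a known matrix P, where P[i][j] is the probability that an individual of true class i is classified as class j, so that the expected observed counts m satisfy m_j = sum_i n_i P[i][j]. This generic formulation, the function names and the numbers are assumptions for illustration and are not claimed to reproduce the estimator of Solow et al. (2001) in detail.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense linear system by Gauss-Jordan elimination."""
    size = len(rhs)
    augmented = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(size):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, size), key=lambda r: abs(augmented[r][col]))
        if abs(augmented[pivot][col]) < 1e-12:
            raise ValueError("classification probability matrix is singular")
        augmented[col], augmented[pivot] = augmented[pivot], augmented[col]
        for row in range(size):
            if row != col:
                factor = augmented[row][col] / augmented[col][col]
                for k in range(col, size + 1):
                    augmented[row][k] -= factor * augmented[col][k]
    return [augmented[i][size] / augmented[i][i] for i in range(size)]


def correct_counts(observed_counts, classification_probs):
    """Estimate true class counts n from observed counts m by solving P^T n = m."""
    size = len(observed_counts)
    transposed = [[classification_probs[i][j] for i in range(size)]
                  for j in range(size)]
    return solve_linear_system(transposed, observed_counts)


# Hypothetical two-class example: 200 and 50 true individuals would be
# observed as 190 and 60 under the error rates below.
probs = [[0.9, 0.1],
         [0.2, 0.8]]
estimated = correct_counts([190, 60], probs)
```

With noisy counts the estimate can become negative for rare classes, so in practice the result would need to be constrained or accompanied by an uncertainty estimate.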

Image filtering has been proposed to address label uncertainty in plankton image data. The idea is to discard images for which the recognition model is uncertain, and therefore, more likely to produce erroneous labels. For example, Faillettaz et al. (2016) utilized a probabilistic RF for classification, and obtained class probabilities were used to detect and ignore images for which the classifier is uncertain. Luo et al. (2018), Plonus et al. (2021a), and Kraft et al. (2022b) utilized a similar approach for CNN-based recognition models. Luo et al. (2018) used a separate fully annotated validation set to set class-specific probability thresholds for filtering. Plonus et al. (2021a) proposed a pipeline for tailoring filtering thresholds to the research question of interest by allowing to select between high precision and high recall. Kraft et al. (2022b) evaluated a CNN-based model with class-specific probability thresholds in operational use.
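The filtering with class-specific probability thresholds can be sketched as below. The threshold selection rule (the smallest threshold that reaches a target precision on validation data) is an illustrative assumption and not necessarily the procedure used in the cited works. All names and numbers are hypothetical.

```python
def choose_threshold(validation_predictions, target_precision):
    """Pick the smallest probability threshold reaching the target precision.

    `validation_predictions` holds (probability, is_correct) pairs for the
    validation images that were predicted as one particular class.
    """
    for threshold in sorted({p for p, _ in validation_predictions}):
        kept = [correct for p, correct in validation_predictions if p >= threshold]
        if sum(kept) / len(kept) >= target_precision:
            return threshold
    return None  # the target precision cannot be reached for this class


def filter_predictions(probabilities, class_names, thresholds):
    """Split predictions into accepted (index, label) pairs and rejected indices."""
    accepted, rejected = [], []
    for index, probs in enumerate(probabilities):
        best = max(range(len(probs)), key=probs.__getitem__)
        label = class_names[best]
        if probs[best] >= thresholds[label]:
            accepted.append((index, label))
        else:
            rejected.append(index)
    return accepted, rejected


validation = [(0.95, True), (0.9, True), (0.7, False), (0.6, True), (0.5, False)]
diatom_threshold = choose_threshold(validation, target_precision=0.9)

accepted, rejected = filter_predictions(
    [[0.95, 0.05], [0.6, 0.4], [0.3, 0.7]],
    ["diatom", "ciliate"],
    {"diatom": diatom_threshold, "ciliate": 0.65},
)
```

Raising the target precision discards more images, which corresponds to the precision-recall trade-off that can be tailored to the research question.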

Schanz et al. (2023) proposed a novel loss function that measures the Kullback-Leibler divergence between the model’s output distribution over classes and the distribution of expert labels. This allows for training on multiple expert labels that can be conflicting, leading to a model that can estimate the label uncertainty.
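The idea of training against a distribution of expert labels can be illustrated with a minimal sketch. It assumes that the target is the empirical distribution of expert votes and that the divergence is computed as KL(expert || model). The direction of the divergence, the function names and the example values are assumptions, and the loss of Schanz et al. (2023) may differ in its details.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expert_distribution(votes, num_classes):
    """Turn a list of expert class votes into an empirical label distribution."""
    counts = [0] * num_classes
    for vote in votes:
        counts[vote] += 1
    return [c / len(votes) for c in counts]


def kl_loss(expert_probs, logits, eps=1e-12):
    """KL divergence from the expert label distribution to the model output."""
    predicted = softmax(logits)
    return sum(t * math.log(t / max(p, eps))
               for t, p in zip(expert_probs, predicted) if t > 0.0)


# Three experts: two vote for class 0 and one votes for class 1.
target = expert_distribution([0, 0, 1], num_classes=3)
loss_agree = kl_loss(target, [math.log(2.0), 0.0, -20.0])  # output close to (2/3, 1/3, 0)
loss_overconfident = kl_loss(target, [10.0, 0.0, 0.0])     # nearly all mass on class 0
```

A model that reproduces the disagreement between experts obtains a lower loss than one that is overconfident in the majority label, so the predicted distribution can serve as an estimate of label uncertainty.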

Related to the label uncertainty, quantification methods have been proposed for plankton image data analysis. The basic idea is to estimate the class distribution directly. While mislabeled samples cause noise to the training data for classification methods, the class distributions are often close to correct. Sosik and Olson (2007) used a quantification method to estimate the abundance of different taxonomic groups of phytoplankton. A combination of image feature types, including size, shape, symmetry, and texture characteristics, plus orientation invariant moments, diffraction pattern sampling, and co-occurrence matrix statistics, was utilized. Statistical analysis was used to estimate category-specific misclassification probabilities for accurate abundance estimates and for quantification of uncertainties in abundance estimates. Beijbom et al. (2019) partially solved this problem by providing the size information as metadata (additional features) for the classifier while still using resized versions of images as the main input for the CNN. Metadata is used as an input for the network besides image data, and they are processed independently by separate parts of the network. The outputs of both subnetworks are concatenated together and processed by fully connected layers. Results showed that metadata was useful for classification accuracy.

To truly solve the problem with the varying image size and aspect ratio, the CNN architecture needs to be modified so that it can process images with multiple sizes. This can be achieved, e.g. by combining scale-invariant and scale-variant features to devise a multi-scale CNN architecture (Van Noord and Postma 2017). Py et al. (2016) proposed an inception module that allows using multiple scaled versions of the original image with different sizes as the input for the CNN. By selecting different strides for each scale, the computed feature maps have the same size for all scales and can be concatenated to a single set of multi-scale features. The proposed method was shown to outperform the method with a single fixed-size input.

Bureš et al. (2021) compared various modifications of the baseline CNN on plankton recognition with high variation in image size. These include Spatial Pyramid Pooling (SPP) (He et al. 2015), using image size as metadata, patch cropping and multi-stream CNNs. SPP allows the training of a single CNN with multiple image sizes in order to obtain higher scale invariance by pooling the features produced by the convolutional layer to a fixed-length vector required by the fully connected layers. The metadata was used as described by Ellen et al. (2019). The patch cropping technique divides images into fixed-size patches that are classified separately. The final recognition is done by averaging the resulting score vectors. Multi-stream CNN utilizes a similar approach but uses multiple different networks trained for different image sizes and aspect ratios. The best plankton recognition accuracy was obtained using a multi-stream network combining two models with different input aspect ratios and patch cropping.
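The way SPP produces a fixed-length vector from feature maps of different sizes can be shown with a small sketch. It assumes a single-channel feature map stored as a nested list, max pooling within each bin, and pyramid levels of 1x1 and 2x2 bins. The function name and these settings are illustrative choices, not the configuration used by Bureš et al. (2021).

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2)):
    """Max-pool a 2D feature map into a fixed-length vector.

    For each level with `bins` x `bins` cells, the map is divided into cells
    whose borders are rounded outwards, so every cell contains at least one
    value and the output length depends only on `levels`.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for bins in levels:
        for by in range(bins):
            y0 = (by * height) // bins
            y1 = -(-((by + 1) * height) // bins)  # ceiling division
            for bx in range(bins):
                x0 = (bx * width) // bins
                x1 = -(-((bx + 1) * width) // bins)
                pooled.append(max(feature_map[y][x]
                                  for y in range(y0, y1)
                                  for x in range(x0, x1)))
    return pooled


# Feature maps of different sizes are mapped to vectors of the same length (1 + 4).
small = spatial_pyramid_pool([[1, 2], [3, 4]])
wide = spatial_pyramid_pool([[0, 5, 1, 2, 9], [3, 4, 8, 7, 6], [2, 2, 2, 2, 2]])
```

With several channels, the same pooling would be applied per channel and the results concatenated, which gives the fixed-length input required by the fully connected layers.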

Most plankton datasets have significant variation in image sizes and aspect ratios. Common CNN-based image classifiers require that the input images have a constant size. In this case, image resizing is used and it is necessary to consider what to do with the aspect ratio and whether metadata about the image size provides an advantage when complementing the fixed-size images. However, a more general remedy would be to use a multi-scale CNN with an appropriate architecture as the recognition model.

5.8 Challenge 8: Low or varying image quality

To improve the classification accuracy on low-quality images various preprocessing steps have been proposed. These include discarding bad quality images (Raitoharju et al. 2016), image segmentation (Keçeli et al. 2017), and denoising (Cheng et al. 2019).

Low quality images can be discarded in different ways. Raitoharju et al. (2016) manually removed low-quality images from the dataset before training the recognition model. Moreover, the remaining images were cropped to remove artifacts mainly appearing close to image borders. Coltelli et al. (2014) filtered out out-of-focus images before the feature extraction. The out-of-focus detection was done by fitting color histograms in a GMM. If the distribution contained two components (background and plankton), the image was considered to be in-focus.

Some studies suggest segmenting the images as a preprocessing step to discard non-plankton pixels from the images. For example, Keçeli et al. (2017) used Otsu’s thresholding method (Otsu 1975) for segmentation and pixels outside the obtained segmentation map are set to zero.
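Otsu's method selects the threshold that maximizes the between-class variance of the gray-level histogram. The following minimal sketch assumes 8-bit grayscale images stored as nested lists of integers and a bright object on a dark background. Both are assumptions made for illustration, and for instruments with a bright background the comparison would be reversed.

```python
def otsu_threshold(image, levels=256):
    """Return the gray level that maximizes the between-class variance."""
    histogram = [0] * levels
    for row in image:
        for pixel in row:
            histogram[pixel] += 1
    total = sum(histogram)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(levels):
        weight_bg += histogram[level]
        sum_bg += level * histogram[level]
        weight_fg = total - weight_bg
        if weight_bg == 0:
            continue
        if weight_fg == 0:
            break
        mean_bg = sum_bg / weight_bg
        mean_fg = (total_sum - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def mask_background(image, threshold):
    """Set all pixels at or below the threshold to zero."""
    return [[pixel if pixel > threshold else 0 for pixel in row] for row in image]


# Hypothetical 3x3 image with a dark background and a bright object.
example = [[10, 12, 200],
           [11, 210, 205],
           [9, 10, 220]]
threshold = otsu_threshold(example)
segmented = mask_background(example, threshold)
```

Image processing libraries provide optimized implementations of the same idea; the explicit loop is used here only to make the criterion visible.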

Cheng et al. (2019) applied texture enhancement together with background suppression before the classification step. Enhanced images were shown to produce a slightly higher recognition accuracy than the images without enhancement. Ma et al. (2021) proposed to use modern CNN-based super-resolution techniques to improve the plankton image quality. The EDRN super-resolution architecture (Lim et al. 2017) was combined with the contextual loss (Mechrez et al. 2018), and was shown to produce high-quality images. Guo et al. (2022a) proposed a deep learning-based colorization method to address the loss of the critical color information due to imaging. However, the effect of improved image quality on plankton recognition accuracy was not assessed in either of the studies. Also contrast limited adaptive histogram equalization (CLAHE) has been proposed to improve the contrast of plankton image data (Geronimo et al. 2023). Lang et al. (2022) addressed the image quality issues on holographic imaging via image fusion.

Many real-world computer vision applications have to deal with low-quality images and plankton recognition is no exception. A wealth of image preprocessing approaches exist and in the case of plankton images, at least exclusion of bad images, denoising and image segmentation have been proposed. A more profound way would be to adopt image reconstruction methods, but from the practical perspective of plankton recognition, the simpler methods can be considered as sufficient and data augmentation is commonly used to introduce additional variation to the data.

5.9 Challenge 9: Massive amount of data

While most of the challenges are connected to training and developing plankton recognition models, the modern imaging devices with high output rates introduce a challenge also for the model deployment phase. Massive data volumes obtained by modern imaging instruments motivate the development of computationally efficient solutions that are able to analyse data in real time. However, the computation time is rarely considered in plankton recognition literature. Most works related to the challenge consider lightweight CNN architectures. For example, shallow TANet (Li et al. 2019c) was shown to outperform competing methods in computing time without sacrificing accuracy on the Kaggle dataset.

Zimmerman et al. (2020) proposed an embedded system for in situ deployment of plankton microscope with real-time recognition system. Due to the limited computation resources and computation time limitations, CNN-based recognition methods were considered unsuitable and a faster feature-engineering based approach was proposed with reduced recognition accuracy. Yuan et al. (2023) applied the edge computing with an AI chip to establish real-time on-site analysis of IFCB data.

The computation time is an especially big issue with holographic imaging that traditionally relies on computationally heavy reconstruction operations to process the raw data. To address this, end-to-end CNN methods for plankton recognition that take the raw holographic data as input have been proposed (Guo et al. 2021a; Zhang et al. 2021; Barua et al. 2023). This way the reconstruction step can be completely avoided. Guo et al. (2021a) and Zhang et al. (2021) showed that CNNs are able to learn the image features for the plankton recognition from the raw data, speeding up the processing significantly.

Online monitoring of plankton with modern imaging equipment produces huge amounts of images. The related image analysis requires either high-performance computing (HPC) resources in the cloud or local (edge) computing with shallow CNN architectures. In most cases, the recognition model training has to be performed in an HPC environment after which at least the lightweight models can be deployed for local execution.

6 Summary and future directions

In this paper, a comprehensive survey of challenges and existing solutions for automatic plankton recognition was provided. We identified nine challenges that complicate the introduction of automatic plankton recognition methods to operational use: (1) the limited amount of labeled training data for less common species, (2) large class imbalance, (3) fine-grained nature of the recognition task, (4) domain shift between imaging instruments, (5) presence of previously unseen classes and unknown particles, (6) uncertainty in expert labels, (7) large variation in image size, (8) low or varying image quality, and (9) massive data volumes. While most of the considered challenges are common in a wide variety of machine learning applications, plankton recognition has its specific characteristics including highly imbalanced image datasets, extreme variation in image size, limitations in image quality, and a shortage of qualified experts to visually annotate the images.

Figure 8 shows a flowchart summarizing the challenges and approaches to solve them. Given a new plankton image dataset, the flowchart provides a simple pipeline to identify the problems related to the dataset as a series of yes-no questions. Furthermore, references to the sections in this paper providing the detailed descriptions are provided to find the existing techniques to tackle the problems and to automate the analysis of the dataset.

Fig. 8 Summary flowchart of challenges and solutions

Some of the challenges, especially the limited amount of labeled training data, have been rather extensively studied. While this problem cannot be considered solved, relatively high classification accuracies have been obtained with limited amounts of training images for certain classes. On the other hand, some of the other challenges have not been widely considered in plankton recognition literature. These include the domain shift between different image sets, presence of previously unseen classes and unknown particles, uncertainty in expert labels, and massive data volumes. The reasons for this vary. Most of the research has focused on improving classification accuracy and computation time has not been seen as an issue. Furthermore, the majority of the method development has been done for a fixed set of species and one imaging instrument, thus, there has been no need to address the domain shift or open-set problem.

The large variation in the size and appearance of plankton means that the difficulty of the recognition task depends notably on the type of plankton considered. While the type of plankton should be taken into account when designing handcrafted image features for the recognition task, modern feature learning approaches (CNN and ViT) are more general and can often be applied without the need for customized solutions for different plankton types. A notable exception is species that are taxonomically close to each other and/or small in size, for which fine-grained recognition techniques are needed. The larger size groups are somewhat overrepresented in plankton recognition studies, but the existing literature covers a wide variety of size groups and plankton types. Table 6 in Appendix A summarizes the prevalence of different plankton types and size groups considered in the plankton recognition literature.

One notable problem in plankton recognition is the lack of publicly available general-purpose plankton image datasets with an evaluation protocol that makes it possible to compare different plankton recognition methods in a fair and reliable manner. The vast majority of the research has either focused on private in-house datasets or relied on custom evaluation protocols and dataset splits on publicly available datasets. As a result, accuracies reported in different studies cannot be compared directly, which makes it difficult to select the best practices for future research and slows down progress in plankton recognition method development. Therefore, there is a need for a publicly available plankton dataset with a predetermined evaluation protocol and preferably multiple subsets captured with different imaging instruments to allow quantitative evaluation of the advances in general (device-agnostic) plankton recognition.

Another important problem limiting the wider utilization of automatic plankton recognition is the difficulty of collecting training images that cover all the possible classes. It is not realistic to construct a labeled training set consisting of all the plankton species and non-plankton particles that the imaging instrument is capable of capturing in a certain location. Moreover, varying plankton species composition between different geographical regions and ecosystems limits the possibility of applying traditional recognition models to new locations and datasets. Even a classification model developed and trained for one imaging instrument and one geographic location struggles if new species appear, for example, due to seasonal changes. The remedy for this is open-set recognition together with new class discovery methods. Open-set models are able to identify when the images belong to previously unseen classes and either reject them or process them further by, for example, clustering. Such techniques have the potential to enable robust open-world plankton recognition systems. Open-set recognition is an active research topic in machine learning [see, for example, Geng et al. (2020)].
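The rejection step of open-set recognition can be illustrated with a minimal sketch. It assumes the simplest possible baseline, thresholding the maximum softmax probability of a closed-set classifier, which is not a method proposed in this paper. The function names, the class names, and the threshold value of 0.7 are hypothetical choices made for illustration only.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_open_set(logits, class_names, threshold=0.7):
    """Return (label, confidence); label is 'unknown' when confidence is low.

    Assumption: a low maximum softmax probability is used as a proxy for
    'this image does not belong to any known class'. The threshold is a
    hypothetical value that would need tuning on validation data.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return class_names[best], probs[best]


# Hypothetical class names and classifier outputs.
names = ["diatom", "dinoflagellate", "ciliate"]
confident = classify_open_set([5.0, 0.0, 0.0], names)
ambiguous = classify_open_set([1.0, 0.9, 1.1], names)
```

In this sketch the first image is assigned to a known class, while the second is rejected and could be passed on to a clustering step for new class discovery. Softmax confidence is known to be an imperfect indicator of novelty, so this should be read as a baseline rather than a recommended solution.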

The massive volumes of unlabeled data produced by modern imaging instruments motivate the use of semi-supervised learning techniques to tackle the challenges related to the limitations in labeled training data. One way to achieve this is to utilize unsupervised and self-supervised learning for pre-training of image features on unlabeled data. In self-supervised learning, the data itself is used to generate the supervisory information that guides the training. Typically, this is done by generating augmented versions of the images to obtain image pairs that have the same label. Image features learned this way can then be fine-tuned for the target dataset with a small amount of labeled training data using transfer learning.
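The pair-generation idea behind self-supervised pre-training can be sketched as follows. This is a minimal illustration under stated assumptions: images are represented as nested lists of pixel values, and the augmentations are limited to random 90-degree rotations and horizontal flips, which seem plausible for plankton images that have no canonical orientation. The function names are hypothetical, and real pipelines would use a richer augmentation set and a contrastive or similar training objective on top of the pairs.

```python
import random


def rotate90(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def hflip(img):
    """Mirror a 2D image horizontally."""
    return [row[::-1] for row in img]


def random_view(img, rng):
    """Create one augmented view using random rotations and a random flip."""
    view = [list(row) for row in img]
    for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
        view = rotate90(view)
    if rng.random() < 0.5:
        view = hflip(view)
    return view


def make_positive_pair(img, rng):
    """Two independent views of the same unlabeled image.

    Both views share the same (unknown) label, so they can serve as a
    positive pair for self-supervised training without any annotation.
    """
    return random_view(img, rng), random_view(img, rng)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
image = [[1, 2, 3], [4, 5, 6]]  # toy 2x3 "image"
view_a, view_b = make_positive_pair(image, rng)
```

The two views contain exactly the same pixel values as the original image in a different arrangement, which is what allows them to be treated as having the same label.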

Large variation between plankton image datasets with different species compositions and imaging instruments can be considered not only a challenge but also an opportunity. While it is very difficult to develop one general-purpose algorithm for imaging instrument-agnostic plankton recognition, modern domain adaptation methods have the potential to enable the joint utilization of different datasets. This would allow adapting the classification model to new datasets with a reasonable amount of manual work. Domain adaptation has already been successfully applied to various other machine learning applications, such as general object recognition (Wilson and Cook 2020). Domain adaptation can be considered a special case of transfer learning that mimics the human vision system and applies a model trained in one or more source domains to a different (but related) target domain. Domain adaptation can be utilized to reduce the effect of a large domain shift between different datasets and to compensate for the lack of labeled training data.
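To make the notion of reducing a domain shift concrete, the following sketch aligns the per-feature mean and standard deviation of target-domain feature vectors with those of the source domain. This is one of the simplest conceivable forms of unsupervised feature alignment and is an assumption made for illustration, not a method evaluated in this paper. The function names and the toy feature values are hypothetical.

```python
import statistics


def column_stats(features):
    """Per-dimension mean and population standard deviation of feature vectors."""
    cols = list(zip(*features))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return means, stds


def align_to_source(target, source):
    """Shift and rescale target features to match source statistics.

    Assumption: the domain shift between two instruments is approximated
    as a per-feature offset and scale, which is a strong simplification.
    No target labels are needed.
    """
    s_mean, s_std = column_stats(source)
    t_mean, t_std = column_stats(target)
    aligned = []
    for row in target:
        new_row = []
        for x, tm, ts, sm, ss in zip(row, t_mean, t_std, s_mean, s_std):
            scale = ss / ts if ts > 0 else 1.0  # guard against zero variance
            new_row.append((x - tm) * scale + sm)
        aligned.append(new_row)
    return aligned


# Hypothetical feature vectors extracted from two imaging instruments.
source_features = [[0.0, 10.0], [2.0, 14.0]]
target_features = [[10.0, 1.0], [30.0, 3.0]]
aligned_features = align_to_source(target_features, source_features)
```

After alignment, a classifier trained on the source features sees target features with matching first- and second-order statistics. More capable domain adaptation methods learn the alignment jointly with the feature extractor instead of applying a fixed transformation afterwards.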

The relatively large pool of different plankton image datasets motivates the further use of domain generalization and meta-learning to obtain an imaging instrument-agnostic recognition model. In meta-learning, multiple datasets and tasks are used to “learn how to learn” the recognition model. The idea is to automate the creation of the entire machine learning pipeline end-to-end, including the search for the model architecture and hyperparameters and the learning of the model weights. Domain generalization refers to learning domain-independent (in this case imaging instrument-independent) feature representations that can then be applied to any dataset. Domain generalization has a wide variety of applications, and it has become an increasingly studied problem in machine learning [see the recent survey in Wang et al. (2022a)]. Recent progress in such methods has opened novel possibilities for working towards a universal plankton recognition system that is able to adapt to different environments, with dramatically different plankton populations and varying imaging instruments, promoting the wider utilization of automatic plankton recognition for aquatic research.