1 Introduction

Instance segmentation, the automatic delineation of different objects appearing in an image, is a problem within computer vision that has attracted a fair amount of attention. Such interest is motivated by both its potential applicability to a whole range of scenarios, and the stimulating technical challenges it poses.

Regarding the former, segmenting at the instance level is useful for many tasks, ranging from allowing robots to segment a particular object in order to grasp it, to highlighting and enhancing the outline of objects for partially sighted people wearing “smart specs” [1]. Counting elements in an image is of interest in its own right [2], with a wide range of applications: for example, industrial processes that require knowing the number of elements produced, estimating the number of people who attended a demonstration, and counting the numbers of infected and healthy blood cells in a blood sample, as required in medical procedures such as malaria detection [3].

Instance segmentation is more challenging than other pixel-level learning problems such as semantic segmentation, which deals with classifying each pixel of an image, given a set of classes. There, each pixel can belong to a set of predefined groups (or classes), whereas in instance segmentation the number of groups (instances) is unknown a priori. This difference exacerbates the problem: where in semantic segmentation one can evaluate the prediction pixel-wise, instance segmentation requires the clustering of pixels to be evaluated with a loss function invariant to the permutation of this assignment (i.e. it does not matter if a group of “person” pixels is assigned to be person “1” or person “2”). That leads to further complexities in the learning of these models. Hence, instance segmentation has remained a more difficult problem to solve.

Most approaches proposed for instance-level segmentation are based on a pipeline of modules whose learning processes are carried out independently of each other. Some of them, such as [4, 5], rely on a module for object proposal, followed by another implementing object recognition and segmentation on the detected patches. A common problem with such piecewise learning methods is that each module does not learn to accommodate itself to the outputs of the other modules. Another drawback is that it is often necessary to define an independent loss function for each module, which places a burden on the practitioner to decide on convenient representations of the data at intermediate stages of the pipeline.

These issues led us to take a different route and develop a fresh, end-to-end model able to learn the whole instance-segmentation process. Due to the magnitude of the task, in this study we focus on the problem of class-specific instance segmentation. That is, we assume that the model segments instances belonging to the same class, excluding classification stages. The solution to this problem is useful in its own right, for example to segment and count people in images [6], but we also consider it the first step towards general semantic learning systems that can segment and classify different kinds of instances.

The approach we propose here is partially inspired by how humans count elements in a scene. It is known [7, 8] that humans count sequentially, using accurate spatial memory to keep track of the locations already counted. Driven by this insight, our purpose is to build a learning model capable of segmenting the instances of an object in an image sequentially, keeping its current state in an internal memory. To achieve this, we rely on recurrent neural networks (RNNs), which exhibit the two properties discussed: the ability to produce sequential output, and the ability to keep a state or memory along the sequence.

Fig. 1. Diagram of Recurrent Instance Segmentation (RIS).

Our primary contributions, described in this paper, are: 1. the development of an end-to-end approach for class-specific instance segmentation, schematized in Fig. 1, based on RNNs containing convolutional layers; and 2. the derivation of a principled loss function for this problem. We assess the capabilities of our model by conducting two experiments: one on the segmentation of multiple people, and the other on plant-leaf segmentation and counting.

2 Background

The work presented here combines several research areas. In this section we summarize the main developments in these areas.

2.1 Instance Segmentation Models

Instance segmentation can be formulated as the conjunction of semantic segmentation and object detection; for example, in [9] the authors proposed a model that integrates information obtained at the pixel, segment, and object levels. This is because instance segmentation requires both the capacity of object detection approaches to separate instances, and the ability of semantic segmentation methods to produce pixel-wise predictions, and hence a delineation of the shape of the objects. The progress of instance segmentation methods is thus limited by the advances made in both object detection and semantic segmentation.

A recent breakthrough in object detection is the Region-based CNN (R-CNN) [10]. This approach consists of using a region proposal method to produce a large set of variously sized object proposals from an image, then extracting features for each of them by means of a CNN, and finally classifying the resultant feature vectors. Several approaches for instance segmentation build on this method. Two of them are [4, 5], which both use multiscale combinatorial grouping [11] as a region proposal method to extract candidates, followed by a region refinement process. In the former work [4], the authors perform non-maximum suppression on the candidates in order to remove duplicates, and then combine the coarse information obtained by the CNN with superpixels extracted from the image in order to segment the instances. In the latter work [5], the output is refined by means of exemplar-based shape prediction and graph-cut. These approaches have produced state-of-the-art results in instance segmentation. However, they suffer from a common drawback that we want to avoid: they consist of an ensemble of modules that are trained independently of each other.

Semantic segmentation methods have seen significant improvement recently, based first on the work in [12], which proposed a fully convolutional network that produces pixel-wise predictions, followed by the works in [13, 14], which improve the delineation of the predicted objects by using Conditional Random Fields (CRFs).

3 Recurrent Instance Segmentation

Our approach proceeds as sketched in Fig. 1. A fully convolutional network first transforms the input image into a feature map, \(\mathbf {B}\). In the first iteration, the recurrent stage takes \(\mathbf {B}\) as input, outputs a segmented instance together with a confidence score, and updates its inner state to \(\mathbf {h_{1}}\), to account for the recently segmented instance. Then, having again as input \(\mathbf {B}\), and as inner state \(\mathbf {h_{1}}\), the model outputs another segmented instance and its confidence score. This process keeps iterating until the confidence score drops below a certain level, at which point the model stops, ideally having segmented all instances in the image.

The sequential nature of our model allows it to deal with common instance-segmentation problems. In particular, it can implicitly model occlusion: it can segment non-occluded instances first, keeping in its state the regions of the image that have already been segmented, in order to detect occluded objects later. Another purpose of the state is to allow the model to consider potential relationships between different instances in the image. For example, if in the first iteration an instance of a person embracing something is segmented, and this information is kept in the state, then in subsequent iterations it might increase the plausibility of another person being embraced by the first one.

The structure of our approach is composed of a fully convolutional network, followed by an RNN, a function to transform the state of the RNN into the segmentation of an instance and its confidence score, and finally the loss function that evaluates the quality of the predictions and that we aim to optimize. The first of these components, the fully convolutional network, is used in a similar manner as explained in [12]. In the following we describe in detail the remaining three components.

3.1 Convolutional LSTM

Long short-term memory (LSTM) networks [34] have stood out among recurrent structures because they mitigate the vanishing-gradient problem. Indeed, they are the model of choice in most of the works reviewed in Sect. 2.2, in which they have achieved outstanding results. In this section we build on top of the LSTM unit, making some changes to its structure to adapt it to the characteristics of our problem.

In the problem we have described, we observe that the input to the model is a map from a lattice (in particular, an image), and the output is also a map from a lattice. Problems with these characteristics, such as semantic segmentation [12, 14] and optical flow [35], are often tackled using structures based on convolutions, in which the intermediate representations of the images preserve the spatial information. In our problem, we can see the inner state of recurrent units as a map that preserves spatial information as well. That led us to convolutional versions of RNNs, and in particular to convolutional long short-term memory (ConvLSTM) units.

A ConvLSTM unit is similar to an LSTM one, the only difference being that the fully connected layers in each gate are replaced by convolutions, as specified by the following update equations:

$$\begin{aligned} \begin{array}{c} \mathbf {i_{t}}=\mathrm{Sigmoid}~\left( \mathrm{Conv}~\left( \mathbf {x_{t}};\mathbf {w_{xi}}\right) +\mathrm{Conv}~\left( \mathbf {h_{t-1}};\mathbf {w_{hi}}\right) +\mathbf {b_{i}}\right) \\ \mathbf {f_{t}}=\mathrm{Sigmoid}~\left( \mathrm{Conv}~\left( \mathbf {x_{t}};\mathbf {w_{xf}}\right) +\mathrm{Conv}~\left( \mathbf {h_{t-1}};\mathbf {w_{hf}}\right) +\mathbf {b_{f}}\right) \\ \mathbf {o_{t}}=\mathrm{Sigmoid}~\left( \mathrm{Conv}~\left( \mathbf {x_{t}};\mathbf {w_{xo}}\right) +\mathrm{Conv}~\left( \mathbf {h_{t-1}};\mathbf {w_{ho}}\right) +\mathbf {b_{o}}\right) \\ \mathbf {g_{t}}=\mathrm{Tanh\,\,\,\,}\left( \mathrm{Conv}~\left( \mathbf {x_{t}};\mathbf {w_{xg}}\right) +\mathrm{Conv}~\left( \mathbf {h_{t-1}};\mathbf {w_{hg}}\right) +\mathbf {b_{g}}\right) \\ \mathbf {c_{t}}=\mathbf {f_{t}}\odot \mathbf {c_{t-1}}+\mathbf {i_{t}}\odot \mathbf {g_{t}}\\ \mathbf {h_{t}}=\mathbf {o_{t}}\odot \mathrm{Tanh}(\mathbf {c_{t}}) \end{array}, \end{aligned}$$
(1)

where \(\odot \) represents the element-wise product operator, \(\mathbf {i_{t}}, \mathbf {f_{t}}, \mathbf {o_{t}}, \mathbf {g_{t}} \in \mathbb {R}^{h'\times w'\times d}\) are the gates, and \(\mathbf {h_{t}},\mathbf {c_{t}}\in \mathbb {R}^{h'\times w'\times d}\) represent the memory of the recurrent unit, where d is the amount of memory used for each pixel (in this paper we assume that the number of channels in the recurrent unit is the same as the number of channels produced by the previous FCN). Each of the filter weights (\(\mathbf {w}\) terms) has dimensionality \(d \times d \times f \times f\), where f is the size of the filter, and each of the bias terms (\(\mathbf {b}\) terms) is a d-dimensional vector repeated across height and width. We refer to the diagrams and definitions presented in [34] for a detailed explanation of the LSTM update equations, from which Eq. (1) follows. We also provide a diagram illustrating Eq. (1) in Sect. B of the Appendix. Note that a primary purpose of keeping a memory or state is to allow our model to keep track of the pixels of the image that have already been segmented in previous iterations of the process. This can be done naturally by applying convolutions to the state. We can also stack two or more ConvLSTM units with the aim of learning more complex relationships.
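As an illustration, Eq. (1) can be implemented as follows. This is a minimal PyTorch sketch (our implementation uses Lua/Torch, see Sect. 4.1); the fused gate convolutions and the assumption of an odd filter size f, so that padding preserves the spatial dimensions, are choices of this sketch, not of the paper.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One step of the ConvLSTM update in Eq. (1)."""
    def __init__(self, d: int, f: int):
        super().__init__()
        pad = f // 2  # "same" padding (assumes odd f)
        # The four gates (i, f, o, g) are computed with one fused convolution
        # over the input and one over the previous hidden state.
        self.conv_x = nn.Conv2d(d, 4 * d, kernel_size=f, padding=pad, bias=True)
        self.conv_h = nn.Conv2d(d, 4 * d, kernel_size=f, padding=pad, bias=False)

    def forward(self, x, state):
        h_prev, c_prev = state
        gates = self.conv_x(x) + self.conv_h(h_prev)
        i, f, o, g = gates.chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c_prev + i * g          # element-wise, as in Eq. (1)
        h = o * torch.tanh(c)
        return h, c
```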

The advantages of ConvLSTM with respect to regular LSTM go hand in hand with the advantages of convolutional layers with respect to linear layers: they are suitable for learning filters, useful for spatially invariant inputs such as images, and they require less memory for the parameters; in fact, the number of parameters is independent of the size of the input. A similar recurrent unit has recently been proposed in the literature.

3.2 Spatial Inhibition Module

At each step, the hidden state \(\mathbf {h_{t}}\) is mapped to a segmented instance and its confidence score by a function, \(r(\cdot )\), whose diagram is given in Fig. 2.

The function that produces the first output can be described as a sequence of layers whose aim is to discriminate one, and only one, instance, filtering out everything else. First, we use a convolutional layer that maps the d channels of the hidden state to 1 output channel, using \(1\times 1\) filters. This is followed by a log-softmax layer, \(f_{\mathrm{LSM}}(\mathbf {x})_{i}=\mathrm{log}\left( \frac{\mathrm{exp}(x_{i})}{\sum _{j}\mathrm{exp}(x_{j})}\right) \), which normalizes the input across all pixels and then applies a logarithm. As a result, each pixel value lies in the interval \((-\infty ,0]\), and the sum of the exponentiated values is 1. This leads to a competing mechanism that can inhibit pixels that do not belong to the instance currently being segmented. Following that, we use a layer that adds a learned bias term to the input data. The purpose of this layer is to learn a threshold, b, which filters the pixels that will be selected for the present instance. Then, a sigmoid transformation is applied pixel-wise, so the resultant pixel values all lie in the interval [0, 1], as required. Finally, we upsample the resultant \(h' \times w'\) map back to the original size of the input image, \(h \times w\). To help understand the effect of these layers, we visualize in Sect. A of the Appendix the inner representations captured by the model at different stages of the described pipeline.

Fig. 2. Diagram of the spatial inhibition module.

The function that encodes the relationship between the current state \(\mathbf {h_{t}}\) and the confidence of the predicted candidate consists simply of a max-pooling layer and a linear layer, followed by a sigmoid function.
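For concreteness, a minimal PyTorch sketch of both branches follows. The upsampling mode, the pooling size, and the fixed input resolution of the confidence branch are assumptions of this sketch, not taken from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialInhibition(nn.Module):
    """Mask branch of Fig. 2: 1x1 conv -> log-softmax over all pixels ->
    learned bias (threshold b) -> sigmoid -> upsample to the image size."""
    def __init__(self, d: int, out_size):
        super().__init__()
        self.reduce = nn.Conv2d(d, 1, kernel_size=1)  # d channels -> 1
        self.bias = nn.Parameter(torch.zeros(1))      # learned threshold b
        self.out_size = out_size                      # (h, w) of the input image

    def forward(self, h_t):
        z = self.reduce(h_t)                          # (N, 1, h', w')
        n, _, hp, wp = z.shape
        # log-softmax across *all* pixels: values in (-inf, 0], exp sums to 1
        z = F.log_softmax(z.view(n, -1), dim=1).view(n, 1, hp, wp)
        mask = torch.sigmoid(z + self.bias)           # pixel values in [0, 1]
        return F.interpolate(mask, size=self.out_size, mode='bilinear',
                             align_corners=False)

class Confidence(nn.Module):
    """Score branch: max-pooling + linear layer + sigmoid, giving s_t in [0, 1]."""
    def __init__(self, d: int, hp: int, wp: int, pool: int = 2):
        super().__init__()
        self.pool = nn.MaxPool2d(pool)
        self.linear = nn.Linear(d * (hp // pool) * (wp // pool), 1)

    def forward(self, h_t):
        z = self.pool(h_t).flatten(start_dim=1)
        return torch.sigmoid(self.linear(z))
```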

3.3 Loss Function

Choosing a loss function that accurately reflects the objective we want to achieve is key for any model to be able to learn a given task. In order to present the loss function that we use, let us first introduce some notation.

At the training stage we are provided with a training set composed of labeled images. We denote image i as \(\mathbf {I^{(i)}}\in \mathbb {R}^{h\times w\times c}\), where for simplicity we consider the same size (\(h\times w\times c\)) for all images. Its annotation, \(\mathbf {Y^{(i)}}=\left\{ \mathbf {Y_{1}^{(i)}},\mathbf {Y_{2}^{(i)}},\ldots ,\mathbf {Y_{n_{i}}^{(i)}}\right\} \), is a set of \(n_{i}\) masks, \(\mathbf {Y_{t}^{(i)}}\in \{0,1\}^{h\times w}\), for \(t\in \{1,\ldots ,n_{i}\}\), containing the segmentation of each instance in the image. One point to note about the labels is that the number of masks, \(n_{i}\), depends on the image i, because each image may have a different number of instances.

Our model predicts, for image i, both a sequence of masks, \(\mathbf {\hat{Y}^{(i)}}=\left\{ \mathbf {\hat{Y}_{1}^{(i)}},\mathbf {\hat{Y}_{2}^{(i)}},\ldots ,\mathbf {\hat{Y}_{\hat{n}_{i}}^{(i)}}\right\} \), where \(\mathbf {\hat{Y}_{\hat{t}}^{(i)}}\in [0,1]^{h\times w}\) and \(\hat{t}\in \{1,\ldots ,\hat{n}_{i}\}\), and a confidence score associated with each mask, \(\mathbf {s^{(i)}}=\left\{ s_{1}^{(i)},s_{2}^{(i)},\ldots ,s_{\hat{n}_{i}}^{(i)}\right\} \). At inference time, the number of elements predicted, \(\hat{n}_{i}\), depends on the confidence values, \(\mathbf {s^{(i)}}\): the network stops producing outputs after time t once \(s_{t}^{(i)}<0.5\). At training time we can predefine the length of the predicted sequence: given that we know the length, \(n_{i}\), of the i-th ground-truth annotation, we set the length of the predicted sequence to \(\hat{n}_{i}=n_{i}+2\), so that the network can learn when to stop. In any case, the number of elements in the predicted sequence, \(\hat{n}_{i}\), is not necessarily equal to the number of elements in the corresponding ground-truth set, \(n_{i}\), given that our model may underestimate or overestimate the number of objects.

One way to represent the scenario is to arrange the elements of \(\mathbf {Y}\) and \(\mathbf {\hat{Y}}\) (where we hereafter omit the image index, i, for the sake of clarity) in a bipartite graph, in which the edge between \(\mathbf {Y_{t}}\), \(t\in \{1,\ldots n\}\), and \(\mathbf {\hat{Y}_{\hat{t}}}\), \(\hat{t}\in \{1,\ldots \hat{n}\}\), has a weight given by the intersection over union between \(\mathbf {Y_{t}}\) and \(\mathbf {\hat{Y}_{\hat{t}}}\). A similarity measure between \(\mathbf {Y}\) and \(\mathbf {\hat{Y}}\) can then be defined as the maximum sum of intersection-over-union values over correspondences between the elements of \(\mathbf {Y}\) and those of \(\mathbf {\hat{Y}}\):

$$\begin{aligned} \underset{\delta \in \mathcal {S}}{\mathrm{max}}\,f_\mathrm{Match}(\mathbf {\hat{Y}},\mathbf {Y},\mathbf {\delta }), \end{aligned}$$
(2)

where

$$\begin{aligned} f_\mathrm{Match}(\mathbf {\hat{Y}},\mathbf {Y},\delta )=\overset{\hat{n}}{\underset{\hat{t}=1}{\sum }}\left( \overset{n}{\underset{t=1}{\sum }}f_\mathrm{IoU}\left( \mathbf {\hat{Y}_{\hat{t}}},\mathbf {Y_{t}}\right) \delta _{\hat{t},t}\right) , \end{aligned}$$
(3)
$$\begin{aligned} \mathcal {S}=\left\{ \begin{array}{cc} \delta \in \{0,1\}^{\hat{n}\times n}: &{} \overset{\hat{n}}{\underset{\hat{t}=1}{\sum }}\delta _{\hat{t},t}\le 1,\,\forall t\in \{1\ldots n\}\\ &{} \overset{n}{\underset{t=1}{\sum }}\delta _{\hat{t},t}\le 1,\,\forall \hat{t}\in \{1\ldots \hat{n}\} \end{array}\right\} , \end{aligned}$$
(4)

and \(f_\mathrm{IoU}(\mathbf {\hat{y}},\mathbf {y})=\frac{\left\langle \mathbf {\hat{y}},\mathbf {y}\right\rangle }{\left\| \mathbf {\hat{y}}\right\| _{1}+\left\| \mathbf {y}\right\| _{1}-\left\langle \mathbf {\hat{y}},\mathbf {y}\right\rangle }\), used in [37], is a relaxed version of the intersection over union (IoU) that allows the input to take values in the continuous interval [0, 1].

The elements of \(\delta \) determine the optimal matching between the elements of \(\mathbf {Y}\) and \(\mathbf {\hat{Y}}\), so that \(\mathbf {\hat{Y}_{\hat{t}}}\) is assigned to \(\mathbf {Y_{t}}\) if and only if \(\delta _{\hat{t},t}=1\). The constraint set \(\mathcal {S}\), defined in Eq. (4), prevents one ground-truth instance from being assigned to more than one predicted instance, and vice versa. A prediction \(\hat{Y}_{\hat{t}}\) remains unassigned if and only if \(\overset{n}{\underset{t=1}{\sum \nolimits }}\delta _{\hat{t},t}=0\), and a ground truth \(Y_{t}\) is not covered by any prediction if and only if \(\overset{\hat{n}}{\underset{\hat{t}=1}{\sum \nolimits }}\delta _{\hat{t},t}=0\). The optimal matching, \(\delta \), can be found efficiently by means of the Hungarian algorithm, in a similar vein to [27]. The coverage loss described in [38] has a similar form, although there the predictions were discrete and the problem was posed as an integer program.
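A minimal sketch of this matching step, using SciPy's implementation of the Hungarian algorithm (the function and variable names here are ours; SciPy minimizes cost, so we negate the weights to maximize the total soft IoU):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def soft_iou(y_hat, y):
    """Relaxed f_IoU for masks with values in [0, 1]."""
    inter = np.sum(y_hat * y)
    return inter / (np.sum(y_hat) + np.sum(y) - inter)

def match(pred_masks, gt_masks):
    """Optimal assignment delta: list of (t_hat, t) pairs maximizing
    the summed soft IoU; works for rectangular (n_hat != n) cases."""
    w = np.array([[soft_iou(p, g) for g in gt_masks] for p in pred_masks])
    rows, cols = linear_sum_assignment(-w)   # maximize total IoU
    return list(zip(rows, cols)), w
```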

End-to-end learning is possible with this loss function because it is the point-wise minimum of a set of continuous functions, one for each possible matching in \(\mathcal {S}\). Thus, a direction of decrease of the loss function at a point can be computed in two steps: 1. finding the matching in \(\mathcal {S}\) whose function achieves the minimum at that point, which is done with the Hungarian algorithm; 2. computing the gradient of that function. The details of this process are shown in Sect. D of the Appendix.

We now need to account for the confidence scores \(\mathbf {s}\) predicted by the model. To do so, we consider that the ideal output is to predict \(s_t=1\) if the number of instances t segmented so far is less than or equal to the total number of instances n; otherwise \(s_t\) should be 0. Taking this into account, we propose the following loss function:

$$\begin{aligned} \ell (\mathbf {\hat{Y}},\mathbf {s},\mathbf {Y})= \underset{\delta \in \mathcal {S}}{\mathrm{min}}-\overset{\hat{n}}{\underset{\hat{t}=1}{\sum }}\overset{n}{\underset{t=1}{\sum }}f_\mathrm{IoU}\left( \mathbf {\hat{Y}_{\hat{t}}},\mathbf {Y_{t}}\right) \delta _{\hat{t},t} + \lambda \overset{\hat{n}}{\underset{t=1}{\sum }} f_\mathrm{BCE}\left( [t\le n],s_t\right) , \end{aligned}$$
(5)

where \(f_\mathrm{BCE}(a,b)=-\left( a\,\mathrm{log}(b)+(1-a)\mathrm{log}(1-b)\right) \) is the binary cross entropy, and the Iverson bracket \([\cdot ]\) is 1 if the condition within the brackets is true, and 0 otherwise. Finally, \(\lambda \) is a hyperparameter that weights the importance of the second term with respect to the first one.
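A sketch of Eq. (5) in PyTorch, reusing the matching computed above; note that gradients flow through the soft-IoU terms selected by the fixed optimal matching, as described in the previous paragraphs:

```python
import torch

def ris_loss(pred_masks, scores, gt_masks, matching, lam=1.0):
    """Eq. (5): negative matched soft IoU plus BCE on the confidence scores.
    pred_masks: (n_hat, h, w) in [0,1]; gt_masks: (n, h, w) in {0,1};
    scores: (n_hat,) in (0,1); matching: (t_hat, t) pairs from the
    Hungarian step above."""
    loss = 0.0
    for t_hat, t in matching:
        p, g = pred_masks[t_hat], gt_masks[t]
        inter = (p * g).sum()
        loss = loss - inter / (p.sum() + g.sum() - inter)   # -f_IoU
    n, n_hat = gt_masks.shape[0], pred_masks.shape[0]
    target = (torch.arange(1, n_hat + 1) <= n).float()      # Iverson bracket
    bce = torch.nn.functional.binary_cross_entropy(
        scores, target, reduction='sum')                    # sum over t
    return loss + lam * bce
```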

4 Experiments

We perform two kinds of experiments, in which we study the capabilities of our approach to both segment and count instances. In the first experiment we focus on multi-instance person segmentation, and in the second we focus on segmenting and counting the leaves of plants. Before presenting those results, we first describe the implementation details that are common to both experiments.

4.1 Implementation Details of Our Method

We have implemented our approach using the Lua/Torch deep learning framework [39]. The code and models are publicly available.

The recurrent stage is composed of two ConvLSTM layers, so that the output of the first ConvLSTM acts as the input to the second one. This stage is followed by the spatial inhibition module, which produces an instance segmentation mask together with a confidence score. The resultant prediction is evaluated according to the loss function defined in Eq. (5), where we set \(\lambda =1\).
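Putting these pieces together, the inference loop sketched in Fig. 1 can be written as follows. This is illustrative only: the zero initialization of the states (whose channel count matches the FCN output, as assumed in Sect. 3.1) and the cap on the number of steps are assumptions of this sketch; the module interfaces are those of the earlier sketches.

```python
import torch

def segment_instances(fcn, lstm1, lstm2, inhibition, confidence, image,
                      max_steps=20):
    B = fcn(image)                            # (1, d, h', w') features
    h1 = c1 = h2 = c2 = torch.zeros_like(B)   # initial states
    masks = []
    for _ in range(max_steps):
        h1, c1 = lstm1(B, (h1, c1))           # B is fed again at every step
        h2, c2 = lstm2(h1, (h2, c2))
        score = confidence(h2)
        if score.item() < 0.5:                # stopping criterion
            break
        masks.append(inhibition(h2))
    return masks
```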

At the training stage, the parameters of the recurrent structure are learned by backpropagation through time. In order to prevent the exploding-gradient effect, we clip the gradients so that each of their elements has a maximum absolute value of 5. We use the Adam optimization algorithm.

4.2 Multiple Person Segmentation

We evaluate our approach on the task of segmenting multiple people, using two measures based on the \(AP^r\) metric: computing \(AP^r\) at an IoU threshold of 0.5, denoted as \(AP^r (0.5)\); and averaging the \(AP^r\) for different degrees of IoU overlapping, from 0.1 to 0.9, denoted as \(AP^r Ave\). We show the results in Table 1. We observe that RIS achieves comparable results to state-of-the-art approaches. When using a CRF as a post-processing method the results improve, outperforming the competing methods. We also provide some qualitative results in Fig. 3, and more extensively in Sect. C of the Appendix.

Table 1. Multiple person segmentation comparison with state of the art approaches on the PASCAL VOC 2012 validation set. First row: using \(AP^r\) metric at 0.5 IoU. Second row: averaging \(AP^r\) metric from 0.1 to 0.9 IoU (gaps indicate unreported results).
Fig. 3. Instance segmentation for detecting people using RIS+CRF. Input images taken from the PASCAL VOC 2012 dataset. Best viewed in colour.

4.3 Plant Leaf Segmentation and Counting

Automatic leaf segmentation and counting are useful tasks for plant phenotyping applications, and can lead to improvements in seed production and plant breeding processes. In this section we use the Computer Vision Problems in Plant Phenotyping (CVPPP) dataset [43, 44]. In particular, we utilize the A1 subset of plants, which is the biggest subset available and contains 161 top-down-view images of size \(500\times 530\) each, like the one shown in Fig. 1 (left). The training set is composed of 128 of those images, which have annotations available. The remaining 33 images are left out for testing purposes. These images are challenging because they present a wide range of variations, with occasional leaf occlusions, varied backgrounds, several slightly blurred images due to lack of focus, and complex leaf shapes.

The limited number of training images has driven us to augment the data by applying some valid transformations to the images. We apply two transformations: rotating the image by a random angle, and flipping the resultant image with a probability of 0.5.
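A minimal sketch of this augmentation using torchvision (the rotation range is our assumption; in practice the same rotation and flip must of course be applied to the image and to its instance masks):

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomRotation(degrees=180),    # rotate by a random angle
    T.RandomHorizontalFlip(p=0.5),    # flip the resultant image
])
```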

We learn the fully convolutional network from scratch. Its structure is composed of a sequence of 5 convolutional layers, each followed by a rectified linear unit. The first convolution learns 30 \(9\times 9\) filters, and the following four learn 30 \(3\times 3\) filters each. This sequence of convolutions produces a \(100\times 106\times 30\) representation of the image, which is the input to the recurrent stage of the model. In the stack of two ConvLSTMs, we set all gates to have \(3\times 3\) convolutional layers.
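In code, this FCN can be sketched as follows. How the \(500\times 530\) input is reduced by a factor of 5 to \(100\times 106\) is not detailed above; this sketch assumes a stride-5 first convolution and an RGB input, which reproduces the stated output shape:

```python
import torch.nn as nn

# 500x530x3 input -> 100x106x30 feature map
fcn = nn.Sequential(
    nn.Conv2d(3, 30, kernel_size=9, stride=5, padding=4), nn.ReLU(),  # 9x9, 30 filters
    nn.Conv2d(30, 30, kernel_size=3, padding=1), nn.ReLU(),           # 3x3, 30 filters
    nn.Conv2d(30, 30, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(30, 30, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(30, 30, kernel_size=3, padding=1), nn.ReLU(),
)
```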

We compare our method with several approaches submitted to the CVPPP challenges on both leaf segmentation and counting. These are:

  • IPK Gatersleben [45]: First, it segments foreground from background by using 3D histograms and a supervised learning model. Second, it identifies leaf centre points and leaf split points by applying unsupervised learning methods, and then segments individual leaves by applying graph-based noise removal and region-growing techniques.

  • Nottingham [46]: First, SLIC is applied to the image in order to obtain superpixels. Then the plant is extracted from the background. The superpixels at the centroids of each leaf are identified by finding local maxima on the distance map of the foreground. Finally, leaves are segmented by applying the watershed transform.

  • MSU [46]: It is based on aligning instances (leaves) with a given set of templates by using Chamfer Matching [47]. In this case the templates are obtained from the training set ground truth.

  • Wageningen [48]: It performs foreground segmentation using a neural network, followed by a series of image-processing transformations, including an inverse distance transform of the detected foreground, and the watershed transform to segment leaves individually.

  • PRIAn [49]: First, a set of features is learned in an unsupervised way on a log-polar representation of the image. Then, a support vector regression model is applied to the resultant features in order to predict the number of leaves.

These competing methods are explicitly designed to perform well on these particular plant-leaf segmentation and counting problems, and contain heuristics that are only valid in this domain. In contrast, the applicability of our model is broad.

Table 2. Results obtained on the CVPPP dataset according to the measures: Difference in Count (DiC), absolute Difference in Count (\(\left| DiC\right| \)), and Symmetric Best Dice (SBD). We report the mean and standard deviation (in parentheses).

The results obtained are shown in Table 2, using the measures reported by the CVPPP organizers to compare the submitted solutions. These are: Difference in Count (DiC), the difference between the predicted number of leaves and the ground truth; \(\left| DiC\right| \), the absolute value of DiC averaged across all images; and Symmetric Best Dice (SBD), defined in [46], which measures the accuracy of the segmentation of the instances. Regarding leaf counting, we observe that our approach significantly outperforms the competing ad hoc methods that were designed for this particular problem. With regard to leaf segmentation, our approach obtains results comparable to those of the competitors, yet does not outperform any of them. We hypothesize that, despite the data augmentation process we follow, the number of original images available for training is too small to learn to segment a wide variety of leaf shapes from scratch. This scarcity of training data has a smaller impact on the competing methods, given that they contain heuristics and prior information about the problem.
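For reference, the two counting measures can be computed as follows (SBD, whose definition is given in [46], is omitted here for brevity):

```python
import numpy as np

def counting_metrics(pred_counts, gt_counts):
    """Returns (DiC, |DiC|): mean signed and mean absolute difference
    between predicted and ground-truth leaf counts across images."""
    diffs = np.asarray(pred_counts) - np.asarray(gt_counts)
    return diffs.mean(), np.abs(diffs).mean()
```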

We have also visualized, in Fig. 4, a representation of what the network keeps in memory as the sequence is produced. The column denoted \(\kappa (\mathbf {h_t})\) shows a summary function of the hidden state \(\mathbf {h}\) of the second ConvLSTM layer; the summary function is the sum of the absolute values across channels for each pixel. The column denoted \(\mathbf {\hat{Y}_t}\) corresponds to the output produced by the network. We observe that, as time advances, the state is modified to take into account the parts of the image that have already been visited. We also inspected the value of the cell, \(\mathbf {c_t}\) in Eq. (1), but did not observe any clear pattern.
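The summary function \(\kappa \) is simply:

```python
def kappa(h):
    """h: (d, h', w') hidden-state tensor; returns a (h', w') map with the
    per-pixel sum of absolute values across channels, as shown in Fig. 4."""
    return h.abs().sum(dim=0)
```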

Fig. 4. Representation of the state, \(\kappa (\mathbf {h_t})\) (the sum of the absolute values across channels), and the output, \(\mathbf {\hat{Y}_t}\), of a sequence produced by our model.

5 Discussion

In this paper we have proposed a new instance segmentation paradigm characterized by its sequential nature. Similarly to what human beings do when counting objects in a scene, our model proceeds sequentially, segmenting one instance of the scene at a time. The resulting model integrates into a single pipeline all the functions required to segment instances. These functions are defined by a set of parameters that are jointly learned end-to-end. A key aspect of our model is the use of a recurrent structure that is able to track visited areas of the image, as well as to handle occlusion among instances. Another key aspect is the definition of a loss function that accurately represents the instance segmentation objective we aim to achieve. The experiments carried out on multiple person segmentation and leaf counting show that our approach outperforms state-of-the-art methods. Qualitative results show that the state of the recurrent stage contains information regarding the instances visited along the sequence.

The primary objective of this paper is to show that learning end-to-end instance segmentation is possible by means of a recurrent neural network. Nevertheless, variations on some architectural choices could lead to even better results. We tried a variety of alternatives, such as adding the sum of the prediction masks as an extra input to the recurrent unit. We also tried alternatives to \(f_\mathrm{IoU}\) in Eq. (3), such as the log-likelihood. The results obtained in either case were no better than those achieved by the model described in Sect. 3. The analysis of these and other alternative architectures is left for future work.

There are several extensions that could be carried out in our approach. One is allowing the model to classify the segmented instance at each step, which can be done by generalizing the loss function in Eq. (5). Another consists of integrating the CRF module as a layer in the end-to-end model, as in [14]. Another interesting line of research is investigating other recurrent structures that could be as good as, or better than, ConvLSTMs for instance segmentation. Finally, extending this model to exploit the co-occurrence of objects, parts of objects, and attributes could be a promising research direction.