1 Introduction

Convolutional networks have become a dominant approach for visual object recognition [16, 30, 27].

A typical convolutional feed-forward network connects the output of the \(l\)th layer to the input of the \((l+1)\)th layer, giving rise to the layer transition \(x_l = H_l(x_{l-1})\). ResNet [16] adds an identity-mapping connection, also referred to as a skip connection, that bypasses the transformation in between:

$$\begin{aligned} x_l = H_l(x_{l-1}) + x_{l-1} \end{aligned}.$$
(1)

This mechanism lets gradients flow directly through the identity function, which speeds up training and improves error propagation. However, [18] argued that, despite these benefits, a skip connection may disrupt the information flow through the layer it bypasses and thereby degrade the performance of the network.
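The residual transition of Eq. (1) can be sketched in a few lines; here `H` is a toy stand-in (an assumption for illustration, not the paper's actual composite function):

```python
import numpy as np

def residual_step(x, H):
    """One residual transition: x_l = H_l(x_{l-1}) + x_{l-1} (Eq. 1)."""
    return H(x) + x

# Toy transformation standing in for a convolution + nonlinearity.
H = lambda x: np.maximum(0.0, 0.5 * x)

x = np.array([-2.0, 4.0])
out = residual_step(x, H)  # identity path carries x through even where H is zero
```

Even where `H` outputs zero, the identity path passes the input through unchanged, which is what keeps gradients flowing.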

In [56], a wider version of ResNet was proposed; the authors showed that, given sufficient depth, increasing the number of filters in each layer improves overall performance. FractalNet [29] shows comparable improvements on similar benchmark datasets [27].

DenseNet As an alternative to ResNet, DenseNet proposed a different connectivity scheme: each layer is connected to all of its subsequent layers. Thus the \(l\)th layer receives the feature maps of all preceding layers. Considering \(x_0, x_1, \ldots , x_{l-1}\) as input:

$$\begin{aligned} x_l = H_l([x_0,x_1 \ldots , x_{l-1}]) \end{aligned}$$
(2)

where \([x_0, x_1, \ldots , x_{l-1}]\) denotes the concatenation of the feature maps produced by the previous layers.

The network maximises information flow by connecting the convolutional layers channel-wise instead of through skip connections. In this model, layer \(l\) has \(l\) inputs, consisting of the original input and the feature maps of all \(l-1\) preceding layers. An \(L\)-layer dense block therefore has \(L(L+1)/2\) connections in total. DenseNet requires fewer parameters because redundant feature maps need not be relearned, which allows it to compete with ResNet at a fraction of the parameter count.
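A minimal sketch of the dense connectivity of Eq. (2), with a toy `H` (assumed for illustration) and the connection count from the text:

```python
import numpy as np

def dense_step(features, H):
    """x_l = H_l([x_0, ..., x_{l-1}]) (Eq. 2): concatenate along channels, then transform."""
    return H(np.concatenate(features, axis=0))

H = lambda z: 0.5 * z          # toy stand-in for a composite function
x0 = np.ones((2, 4, 4))        # 2-channel input
x1 = np.ones((3, 4, 4))        # 3-channel feature map from the first layer
x2 = dense_step([x0, x1], H)   # sees all 5 preceding channels

L = 4
n_connections = L * (L + 1) // 2  # connections in a 4-layer dense block
```

Note how the channel count grows with depth: each new layer consumes the concatenation of everything before it.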

We propose ChoiceNet, an alternative connectivity that retains the advantages of the above architectures while reducing some of their limitations; the design is detailed in Sect. 3.

3 ChoiceNet

Consider a single image \(x_0\) passing through a CNN. The network has \(L\) layers, each implementing a nonlinear transformation \(H_l(\cdot)\), where \(l\) indexes the layer. \(H_l(\cdot)\) is a composite of operations such as batch normalisation [20], pooling [31], rectified linear units [39] or convolution. The output of the \(l\)th layer is denoted \(x_l\).

ChoiceNet We propose an alternative connectivity that retains the advantages of the above architectures while reducing some of their limitations. Figure 1 illustrates the connectivity layout between the layers of a single module. Each block of ChoiceNet contains three modules, and the full network comprises three blocks with pooling operations in between (see Fig. 3).

Figure 2 shows a breakdown of each module. Letters A to G denote unique information generated in one forward pass through the model. B is generated by three consecutive \(3 \times 3\) convolutional operations, whereas A is the result of the same three convolutions with an additional skip connection. Following this pattern, we generate the information represented by letters C, D, F and G. Letter E denotes the special case in which no convolution follows the \(1 \times 1\) convolutional operation, so it carries all of the original information. This information is then concatenated with the others (i.e. A, B, etc.) at the final output.

Therefore, the final output contains information with and without skip connections from filters of size 3, 5 and 7, as well as the original input without any modification. Note that the \(1 \times 1\) convolution at the start acts as a bottleneck to limit computational cost, and all convolutional operations are padded appropriately for the concatenation at the final stage. Kernel sizes of 3, 5 and 7 were chosen because these three sizes together give the best performance [54, 55]. Adding more kernel sizes, such as a combination of 3, 5, 9 and 11 or of 3, 7 and 11, increases the network's parameter count without much improvement in performance (Fig. 4).
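The "padded appropriately" remark amounts to 'same' padding: with stride 1 and an odd kernel size k, a padding of k//2 preserves spatial size, so the 3, 5 and 7 branches remain concatenable. A quick check with the standard output-size formula (the 32-pixel resolution is a hypothetical example):

```python
def conv_output_size(n, k, pad, stride=1):
    """Spatial size after convolution: floor((n + 2*pad - k) / stride) + 1."""
    return (n + 2 * pad - k) // stride + 1

n = 32  # hypothetical input resolution
# 'same' padding of k // 2 for each odd kernel keeps every branch at 32 x 32,
# so the channel-wise concatenation at the final stage lines up.
sizes = [conv_output_size(n, k, k // 2) for k in (3, 5, 7)]
```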

Considering \(x_0, x_1 \ldots , x_{l-1}\) as input, our proposed connectivity is given by:

$$\begin{aligned}&x_l = H_l(x_{l-1}) + x_{l-1} \end{aligned}$$
(3)
$$\begin{aligned}&x_{l+1} = H_{l+1}([x_{l},x_{l-1}] ) + x_{l} \end{aligned}$$
(4)

where \([x_l, x_{l-1}]\) denotes the concatenation of feature maps. The feature maps are first summed and then concatenated, resembling the characteristics of ResNet and DenseNet, respectively.
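Equations (3) and (4) can be sketched as a step that concatenates, transforms and then adds; `H` below is a toy channel-averaging stand-in (an assumption for illustration, not the paper's composite function):

```python
import numpy as np

def choicenet_step(x_prev, x_curr, H):
    """x_{l+1} = H([x_l, x_{l-1}]) + x_l (Eq. 4): concatenate, transform, add."""
    return H(np.concatenate([x_curr, x_prev], axis=0)) + x_curr

def H(z):
    # Toy transformation: average the two concatenated halves channel-wise.
    c = z.shape[0] // 2
    return 0.5 * (z[:c] + z[c:])

x_prev = np.full((2, 4, 4), 1.0)  # x_{l-1}
x_curr = np.full((2, 4, 4), 3.0)  # x_l
x_next = choicenet_step(x_prev, x_curr, H)
```

The summation keeps the ResNet-style identity path, while the concatenation exposes the previous layer's features DenseNet-style.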

Composite function Each composite function consists of a convolution operation, followed by batch normalisation, and ends with a rectified linear unit (ReLU).
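A minimal numerical sketch of the composite function (convolution → batch normalisation → ReLU), with a \(1 \times 1\) convolution expressed as a channel-mixing matrix multiply; the shapes and parameter values are illustrative assumptions:

```python
import numpy as np

def composite(x, weight, gamma=1.0, beta=0.0, eps=1e-5):
    """Conv (1x1, as a matmul) -> batch normalisation -> ReLU on a (C, H, W) map."""
    c, h, w = x.shape
    y = weight @ x.reshape(c, h * w)          # 1x1 conv: mix channels per pixel
    mu = y.mean(axis=1, keepdims=True)        # normalise each output channel
    var = y.var(axis=1, keepdims=True)
    y = gamma * (y - mu) / np.sqrt(var + eps) + beta
    return np.maximum(0.0, y).reshape(-1, h, w)  # ReLU clamps negatives to zero

x = np.random.randn(3, 8, 8)   # 3-channel toy input
w = np.random.randn(4, 3)      # 1x1 kernel mapping 3 -> 4 channels
out = composite(x, w)          # (4, 8, 8), all entries non-negative
```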

Pooling Pooling is an essential part of convolutional networks, since Eqs. (1) and (2) are not viable when the feature maps are of different sizes. We divide the network into multiple blocks, each containing feature maps of the same size. Instead of using either max pooling or average pooling alone, we apply both and concatenate their outputs before feeding them to the next layer (see Fig. 5).
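The dual pooling can be sketched as follows (the \(2 \times 2\), stride-2 window is an assumption for illustration):

```python
import numpy as np

def pool2x2(x, mode):
    """2x2, stride-2 pooling over a (C, H, W) feature map."""
    c, h, w = x.shape
    blocks = x.reshape(c, h // 2, 2, w // 2, 2)
    return blocks.max(axis=(2, 4)) if mode == "max" else blocks.mean(axis=(2, 4))

def dual_pool(x):
    """Concatenate max- and average-pooled maps channel-wise (as in Fig. 5)."""
    return np.concatenate([pool2x2(x, "max"), pool2x2(x, "avg")], axis=0)

x = np.arange(16, dtype=float).reshape(1, 4, 4)
y = dual_pool(x)  # (2, 2, 2): the max channel followed by the average channel
```

Note that the channel count doubles at each pooling stage, which the following layer's convolutions must account for.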

Bottleneck layers The use of \(1\times 1\) convolutional operations (known as bottleneck layers) can reduce computational complexity without hurting the overall performance of a network [16, 19, 38].

The dataset contains a training set of 367 images, a validation set of 100 images and a test set of 233 images. The challenge is pixelwise classification of the input image, correctly identifying the objects in the scene. The metric called IoU, or 'intersection over union', is commonly used for this particular task [2, 7,
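The IoU metric mentioned above can be computed per class on integer label maps and averaged; this is a generic sketch of the standard definition, not the exact evaluation code used in the paper:

```python
import numpy as np

def iou(pred, target, cls):
    """Intersection over union for one class on integer label maps."""
    p, t = (pred == cls), (target == cls)
    union = np.logical_or(p, t).sum()
    return np.logical_and(p, t).sum() / union if union else float("nan")

def mean_iou(pred, target, n_classes):
    """Average the per-class IoU, skipping classes absent from both maps."""
    return np.nanmean([iou(pred, target, c) for c in range(n_classes)])

pred   = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
# class 0: intersection 1, union 2 -> 0.5; class 1: intersection 2, union 3 -> 2/3
```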

Fig. 8

An inside look at ChoiceNet. The top figure shows the skeleton of ChoiceNet before training; the middle and bottom figures show the pathways the model has chosen for best classification accuracy after training on the C10* (middle) and C10 (bottom) datasets. The coloured boxes and lines indicate the components that contribute most

Ablation Study 2 Similarly, Table 3 shows the effect of the two pooling methods in our architectural design. For all three models, max pooling gave an advantage over average pooling. Among the individual pooling techniques, ChoiceNet-40 achieved the lowest error rate, yet the same model performed even better when both poolings were used together. This shows that even though average pooling may in some cases be less effective than max pooling, using the two together leads to improved performance.

Table 7 An error rate ablation study on ChoiceNet-30 on C10 and C10+

In Table 4, we report the mean intersection over union (mIoU) on the CamVid dataset for some current state-of-the-art models. We used the U-Net training scheme and replaced the basic convolutional operations with ResBlocks, DenseBlocks and the ChoiceNet module (see Fig. 1). While our network has fewer parameters than ResBlocks and DenseBlocks, it achieved a higher score. Note that although our model achieved a good mIoU score, it does not match some network architectures designed specifically for segmentation tasks [21, 24, 57,58,59]. Nevertheless, it performed well compared to both ResBlocks and DenseBlocks, as well as some other general-purpose convolutional neural networks [36]. Some outputs are displayed in section 'S1' of the supplementary materials.

In Tables 5 and 6, we show the performance of different state-of-the-art neural networks on the 300W dataset using L1 and Wing-Loss, respectively. We also include methods such as CNN 6/7, which was designed specifically for this task together with its robust loss function (Wing-Loss). The tables show that, with both loss functions, our model performs best on the 'Full' dataset. ChoiceNet also achieves the lowest error on the 'Challenging' test set, which further demonstrates the strength of the proposed architecture. A detailed table and graph are displayed in sections 'S2' and 'S3' of the supplementary materials. In Fig. 7, we also show that in some cases the network predicts more precisely than the ground truth, which increases the reported error because the prediction does not match the less precise ground truth (Table 7).
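For reference, the Wing-Loss used in Table 6 behaves logarithmically for small residuals and linearly for large ones; a minimal sketch with illustrative default values for w and ε (assumptions for illustration, not necessarily the settings used in these experiments):

```python
import numpy as np

def wing_loss(x, w=10.0, eps=2.0):
    """Wing loss on residuals x: w*ln(1 + |x|/eps) for |x| < w, else |x| - c."""
    c = w - w * np.log(1.0 + w / eps)  # chosen so the two pieces meet at |x| = w
    a = np.abs(x)
    return np.where(a < w, w * np.log(1.0 + a / eps), a - c)

errors = np.array([0.0, 1.0, 50.0])
loss = wing_loss(errors)  # small errors are amplified, large ones grow linearly
```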

Our intuition is that the extra connections and paths in our method enable the network to learn from a large variety of feature maps. They also enable the network to backpropagate errors more efficiently (see also [16, 18]). We found that, owing to all the connections, the network can be prone to exploding gradients and therefore needs a small initial learning rate. We also found by grid search that the network peaks in performance at a depth of 30–40 layers, and increasing the depth further appears to have little effect. We suspect that ChoiceNet plateaus at a depth of 30–40 layers, although this could be a local minimum, as resource limitations prevented us from training models deeper than 60 layers.

The performance on the ImageNet dataset is displayed in Table 2. All three variations of our model achieve a lower top-1 error than other state-of-the-art architectures such as ResNet, DenseNet and Inception (v3/v4), and ChoiceNet-40 scores the lowest top-5 and top-1 error. This is a result of the unique connectivity design (see Fig. 2). By using convolutional outputs with and without skip connections, employing different kernel sizes, concatenating the original input in each module via connection 'E' of Fig. 2 and combining two pooling techniques, the architecture achieves this superior performance. Moreover, because of its many connections, it can work with fewer channel outputs per convolution, which makes it parameter efficient: for a given parameter budget, it achieves better performance than other methods.