1 Introduction

CNNs are currently among the most widely used machine learning models for object recognition and computer vision [1,2,3]. Although CNNs with several layers have been in use for a long time, they gained widespread interest in the scientific community starting in 2006, following the work of several researchers such as Bengio et al. [4] and LeCun et al. [5]. Indeed, CNNs have become increasingly popular for extremely complex classification problems that demand a high level of precision. A CNN architecture is defined by a large number of hyperparameters, which should be fine-tuned to optimize the architecture. Previous works in the literature have proposed optimized architectures such as ResNet [6] and VGGNet [7]. Unfortunately, the majority of these architectures are either defined manually by experts or created automatically using greedy induction techniques. Despite the impressive performance of such CNN designs, experts in optimization and machine learning have suggested that better structures may be discovered using automated approaches. Evolutionary computation researchers proposed modeling this task as an optimization problem and then solving it with an appropriate search algorithm [8]. Indeed, selecting the number of blocks, the number of nodes per block and the graph topology within each CNN block amounts to solving an optimization problem within a large search space. Because evolutionary algorithms (EAs) are capable of approximating the global optimum and thus avoiding locally optimal solutions (architectures), the authors of [9] recently proposed the use of such metaheuristic techniques to handle the challenge of optimizing CNN architectures.

Fig. 1 Bi-CNN-D-C scenario

Efficient model designs [10, 11] focus on acceleration rather than compression through the use of optimized convolutional operations or network architectures. Recently, deepening CNN models has become a popular trend for improving accuracy, as demonstrated by ResNet [6], VGGNet [7] and Xception [12]. However, it is challenging to deploy these deep models on low-resource devices such as smartphones and mobile robots, since billions of network parameters represent a significant storage overhead for embedded devices. For instance, the VGG16 model has over 138 million parameters and requires over 500 MB of memory to classify a single 224 × 224 image. Obviously, such a large model cannot be deployed directly on on-board devices. Deep compression is a critical technique for resizing a deep learning model by consolidating and removing inadequate components. However, compressing deep models without significant loss of precision is a critical issue. Several techniques for CNN pruning have been presented, including neuron, filter and channel pruning approaches [13], which reduce model weight by removing unimportant connections.

Because EAs are capable of approximating the global optimum and thus avoiding locally optimal solutions, recent works [9, 14, 15] recommend employing such metaheuristic algorithms to address the CNN architecture optimization challenge in the field of network compression. To do so successfully, the solution encoding, the fitness function and the variation operators must all be defined. In fact, all previous work focuses on compressing manually designed architectures, while the compression of automatically generated CNN architectures remains unexplored. We note that our own previous works did not tackle the compression problem.

Motivated by recent survey papers [9, 14, 16] on deep neural network pruning and the interesting results reported therein, we decided to tackle the problem of filter pruning. As any CNN architecture can be pruned in different ways, we framed the problem of “joint design and pruning” as a bi-level optimization problem. The upper-level goal is to search for good architectures, while the lower-level goal is to apply filter pruning to the considered architecture. Indeed, evaluating an upper-level architecture requires sending it to the lower level, where filter pruning is executed by deactivating some of its filters. The filters to deactivate cannot be known beforehand, as the number of possibilities is huge and corresponds to a whole search space. For this reason, the filter pruning task is executed at the lower level as an evolutionary optimization (search) process. In this way, the fitness evaluation of each upper-level solution (architecture) requires the (near-)optimal filter pruning decision (encoded as a binary vector where 0 means that the corresponding filter is deactivated) found at the lower level. By following such a bi-level optimization process, the final output of our approach is a CNN architecture with a minimum number of filters and an optimized topology. Figure 1 illustrates an example of a Bi-CNN-D-C (bi-level convolutional neural network design and compression) scenario. To our knowledge, this is the first study to model and solve the joint CNN architecture design and compression problem as a bi-level optimization problem. Since each upper-level solution necessitates solving a separate lower-level optimization problem, the computational cost tends to be prohibitively expensive. We address this issue by solving the resulting combinatorial BLOPs using CEMBA [17]. Indeed, each upper-level population collaborates with its corresponding lower-level population.
This fact enables a significant reduction in the number of evaluations performed during the lower-level search process. The main contributions of our paper could be summarized as follows:

  • For the first time, an evolutionary method is developed that combines CNN architecture generation with filter pruning within the optimization process. This is motivated by the fact that any generated architecture should first be pruned before its classification performance is evaluated.

  • The joint design and filter pruning task is modeled as a bi-level optimization problem in which architectures are generated through crossover and mutation at the upper level with a minimum number of blocks (NB) and nodes per block (NNB), while filter pruning of each architecture is applied at the lower level.

  • The bi-level optimization model is solved using a bi-level co-evolutionary algorithm to ensure effective collaboration between the architecture generation (at the upper level) and the filter pruning (at the lower level).

  • Detailed experiments on the CIFAR and ImageNet data sets, in addition to a COVID-19 case study, are conducted in comparison with several recent and prominent peer works. The merits of our proposed algorithm, Bi-CNN-D-C, are demonstrated on several metrics, including the classification error, the number of GPU days and the number of parameters.
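To make the nested design-and-prune interaction concrete, the following minimal sketch implements the bi-level loop in miniature. It is a didactic toy, not our CEMBA-based implementation: the architecture is reduced to a vector of hypothetical per-filter utility scores, classification accuracy is replaced by a utility-sum proxy, and all function names are illustrative.

```python
import random

random.seed(0)

def lower_level_prune(architecture, generations=20, pop_size=8):
    """Evolve a binary pruning mask for a fixed architecture.

    A 0 at position i deactivates filter i, matching the paper's encoding.
    The toy fitness rewards the summed utility of kept filters minus a
    penalty on the number of active filters (an accuracy-vs-size proxy).
    """
    n_filters = len(architecture)

    def fitness(mask):
        kept_utility = sum(u for u, m in zip(architecture, mask) if m)
        return kept_utility - 0.3 * sum(mask)

    pop = [[random.randint(0, 1) for _ in range(n_filters)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        for p in parents:
            child = p[:]
            i = random.randrange(n_filters)  # bit-flip mutation
            child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)

def upper_level_search(generations=10, pop_size=6, max_filters=8):
    """Evolve architectures (here: per-filter utility vectors).

    Evaluating each upper-level solution triggers a full lower-level
    pruning search, which is the defining feature of the bi-level model.
    """
    pop = [[random.random() for _ in range(max_filters)]
           for _ in range(pop_size)]
    best_arch, best_mask, best_fit = None, None, float("-inf")
    for _ in range(generations):
        for arch in pop:
            mask, fit = lower_level_prune(arch)
            if fit > best_fit:
                best_arch, best_mask, best_fit = arch, mask, fit
        # Toy variation operator: small random perturbation of utilities.
        pop = [[min(1.0, max(0.0, u + random.uniform(-0.1, 0.1))) for u in a]
               for a in pop]
    return best_arch, best_mask, best_fit
```

In the actual approach, the lower-level fitness is the classification performance of the pruned network, and the co-evolutionary scheme of CEMBA reduces the number of lower-level evaluations through collaboration between the two populations.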

The rest of this paper is structured as follows. Section 2 summarizes the review of the literature on CNN pruning. Section 3 details our proposed approach. Section 4 details the experimental design and performance analysis results. Finally, in Sect. 5, the paper is concluded and some future research directions are suggested.

2 Related work

2.1 CNN design based on evolutionary optimization

Recently, some researchers have taken an interest in EAs as a means of evolving deep neural network architectures. A survey on applications of swarm intelligence and evolutionary computing to the optimization of deep learning models has been published by Darwish et al. [9]. Based on this survey, we selected the most representative works:

  • Cheung and Sable [18] optimized architecture hyperparameters using a hybrid EA based on the diagonal Levenberg–Marquardt technique, with rapid convergence and a low number of fitness evaluations. They established the critical role of architectural choices in convolutional layer networks. Their findings demonstrate that even the simplest evolution strategies can yield significant gains. In the presence of variation effects, the use of evolved parameters combined with local contrast normalization preprocessing and absolute value activations across layers achieved competitive performance on the MNIST data sets [19].

  • Fujino et al. [20] presented evolutionary deep learning, called evoDL, as a technique for discovering unique architectural designs. This technique is intended to investigate the evolution of hyperparameters in deep convolutional neural networks (DCNNs). Additionally, the authors used AlexNet as the base CNN framework and optimized both the parameter tuning and the activation functions using evoDL.

  • Real et al. [21] used the CIFAR-10 and CIFAR-100 data sets to evolve CNN structures for classification. They presented a mutation operator that may be used to escape locally optimal models and demonstrated that neuro-evolution is capable of constructing highly accurate networks.

  • Xie et al. [22] maximized recognition accuracy by representing the network topology as a binary string. The primary constraint was the high computational cost, which compelled the authors to conduct their experiments on small-scale data sets.

  • Mirjalili et al. [23] developed an adaptation for solving bi-objective models, called NSGA-Net. The image classification and object alignment results obtained demonstrate that NSGA-Net is capable of providing the user with designs that are both accurate and of low complexity.

  • Alejandro et al. [24] developed EvoDEEP to optimize network characteristics by calculating the probability of layer transitions based on the finite-state-machine concept. The goal was to reduce classification error rates while preserving the layer sequence.

  • Real et al. [25] provided a GA with an updated tournament selection operator that takes the age of the chromosomes into account by favoring younger ones. The architectures are described as small directed graphs whose edges and vertices represent common network operations and hidden states. They developed novel mutation operators that reconnect an edge's origin to other vertices and relabel edges arbitrarily in order to cover the entire search space.

  • Sun et al. [26] developed an evolutionary technique for improving convolutional neural network designs and initializing their weights for image classification problems. This aim was realized by developing a unique weight-initialization approach, a novel variable-length chromosome encoding strategy, a slack binary tournament selection methodology and an efficient fitness evaluation technique. Experiments indicated that the EvoCNN methodology clearly surpasses a wide range of existing approaches in terms of classification performance on practically all the data sets investigated.

  • Lu et al. [26] established the first multi-objective modeling of the architecture search problem by minimizing two potentially conflicting objectives: the classification error rate and the computational complexity, measured by the number of floating-point operations (FLOPs). To execute a multi-objective EA, they adapted the non-dominated sorting genetic algorithm II (NSGA-II).

  • Jiang et al. [27] developed a multi-objective model aiming to maximize classification accuracy while keeping the number of tuning parameters to a minimum. The model was solved using a hybrid binary encoding representing component layers and network connections, together with multi-objective particle swarm optimization based on decomposition (MOPSO/D). The discovered architectures are highly competitive with models created both manually and automatically.

2.2 CNN compression

Deep network compression is one of the most significant strategies for resizing a deep learning model by consolidating and removing ineffective components [14]. However, compressing deep models without considerable loss of precision is a key challenge. Recently, many studies have focused on discovering new EA-based techniques to minimize the computational complexity of CNNs while retaining their performance [14]. Based on the existing work, we divide network compression techniques into three categories: filter pruning [29,30,31,32], quantization [33,34,35,36,37] and Huffman encoding [38,39,40].

Fig. 2 An illustration of how filter-level pruning works [28]
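The filter-level pruning operation illustrated in Fig. 2 can be sketched as follows: filters are ranked by an importance criterion and the weakest fraction is removed. The L1-norm magnitude criterion used here is one common choice in the pruning literature; the function and its interface are illustrative, not taken from any specific cited work.

```python
def prune_filters_by_l1(filters, prune_ratio=0.5):
    """Rank each filter by the L1 norm of its weights and drop the
    lowest-ranked fraction (filter-level pruning, magnitude criterion).

    `filters` is a list of flat per-filter weight lists.
    Returns (kept_filters, binary_mask) where mask[i] == 0 means
    filter i is deactivated, as in the encoding used in this paper.
    """
    norms = [sum(abs(w) for w in f) for f in filters]
    n_keep = max(1, round(len(filters) * (1 - prune_ratio)))
    # Indices of the n_keep filters with the largest L1 norms.
    keep = set(sorted(range(len(filters)),
                      key=lambda i: norms[i], reverse=True)[:n_keep])
    mask = [1 if i in keep else 0 for i in range(len(filters))]
    kept = [f for i, f in enumerate(filters) if mask[i]]
    return kept, mask
```

For example, with four toy filters `[[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]` and `prune_ratio=0.5`, the two low-magnitude filters are removed and the mask is `[1, 0, 1, 0]`.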

The convolutional operation in a CNN model integrates a large number of filters to improve performance in various classification and prediction tasks [41]. Recently, various filter pruning techniques [29,30,31,32] have been suggested. Adding filters enhances the defining features of the spatial characteristics generated by the CNN model [9, 42]. However, this increment results in a significant increase in the model's FLOPs. As a result, removing superfluous filters is critical for reducing the computational requirements of a DCNN model. Figure 2 illustrates a filter-level pruning scenario. We summarize the most important available works on filter pruning:

  • Luo et al. [31] introduced an efficient framework named ThiNet for accelerating the CNN model through compression during both the training and testing phases. They implemented filter-level pruning, in which a filter that is no longer necessary is deleted based on statistical information computed from the following layer. The authors framed the decision of which filters to prune as an optimization problem, which they solve with a greedy method and define as follows:

    $$\begin{aligned} \arg \min _{E} \sum _{i=1}^{N}\left( y_{i}-\sum _{j\in E}X_{ij}\right) ^{2} \nonumber \\ \text {subject to } \left| E \right| = k \times c_{rate} \nonumber \\ E \subset \left\{ 1,2,\ldots ,k \right\} , \end{aligned}$$
    (1)

    where N represents the number of training examples (\(X_{i}\), \(y_{i}\)), \(\left| E \right|\) represents the number of elements in the subset E, k represents the number of channels in the CNN model and \(c_{rate}\) represents the rate of channels retained after compression.

  • Bhattacharya and Lane [43] developed a CNN compression technique that sparsifies the convolutional filters and the fully connected layers. The primary goal was to minimize the amount of storage required by devices throughout the training and inference processes. By utilizing layer separation and sparse convolutional filters, the computational and spatial complexity of the DCNN model can be significantly reduced.

  • Zhou et al. [44] formulated filter pruning as a multi-objective optimization problem and solved it with a knee-guided approach, proposing a trade-off between performance degradation and parameter count. The fundamental concept is to remove the parameters whose removal causes the least performance degradation, using a performance-loss criterion to determine the significance of each parameter. To produce a small compressed model, the number of filters should be kept to a minimum while still achieving a high degree of precision. The challenge can be handled by identifying a compact binary representation capable of pruning the maximum number of filters while maintaining a reasonable level of performance. This work has the advantage of lowering both the number of parameters and the processing overhead.

  • Huynh et al. [45] presented the DeepMon approach for running deep learning inference on mobile devices. They assert that inference can be performed quickly and with minimal power consumption by using the mobile device's graphics processing unit, and they presented a method for optimizing convolutional operations on mobile GPUs. The technique reuses intermediate results by exploiting the CNN's internal processing structure, including its filters and network of connections. Thus, deleting filters and superfluous connections yields faster inference.

  • Denton et al. [32] significantly reduced the time required to evaluate a large CNN model developed for object recognition. The authors exploited the redundancy within convolutional filters to derive approximations that significantly reduce the required computation. They began by compressing each convolutional layer using an appropriate low-rank approximation and then fine-tuned until prediction performance was recovered.
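The greedy strategy behind Eq. (1) can be sketched as follows: channels are added to the retained set E one at a time, each time picking the channel that most reduces the squared reconstruction error of y. This is an illustrative re-implementation of the idea under simplified notation, not the original ThiNet code, and the interface is hypothetical.

```python
def greedy_channel_select(X, y, n_select):
    """Greedy approximation of Eq. (1).

    X[i][j] is the output of channel j on training example i; the goal is
    to choose a subset E of n_select channels whose summed outputs best
    reconstruct y in the least-squares sense.
    """
    n_examples, n_channels = len(X), len(X[0])
    E = []
    residual = list(y)  # y_i minus the sum of already-selected channels
    for _ in range(n_select):
        best_j, best_err = None, float("inf")
        for j in range(n_channels):
            if j in E:
                continue
            err = sum((residual[i] - X[i][j]) ** 2
                      for i in range(n_examples))
            if err < best_err:
                best_j, best_err = j, err
        E.append(best_j)
        residual = [residual[i] - X[i][best_j] for i in range(n_examples)]
    return sorted(E)
```

Channels outside the returned set E correspond to prunable filters, since their contribution to the next layer's output can be reconstructed (approximately) without them.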

Weight quantization decreases both the storage and computing requirements of a CNN model:

$$\begin{aligned} R=W(Q-T) \end{aligned}$$
(2)

where Eq. 2 denotes the quantization method with parameters W and T. For instance, Q is set to 8 for 8-bit quantization. W is an arbitrary positive real number, and T has the same type as the variable Q. The quantization strategy for compressing DNN models is explored in the current literature [33,34,35,36,37]. These strategies cover model reduction by arranging weight matrices optimally. However, previous work does not address the negative repercussions of weight quantization or its estimation complexity.
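The storage saving behind weight quantization can be illustrated with a standard uniform (linear) quantizer, which maps each 32-bit float weight to an n-bit integer code plus a shared scale and offset. This is a generic textbook sketch, not the specific scheme of Eq. (2) or of any cited work.

```python
def quantize_weights(weights, n_bits=8):
    """Uniform quantization: map floats to n-bit integer codes.

    Returns (codes, scale, offset); each code fits in n_bits, so an
    8-bit quantization stores 4x less than 32-bit floats (plus two
    shared float parameters per tensor).
    """
    lo, hi = min(weights), max(weights)
    levels = (1 << n_bits) - 1          # 255 codes for 8-bit
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate float weights from integer codes."""
    return [c * scale + lo for c in codes]
```

The reconstruction error per weight is bounded by half the quantization step `scale`, which is why moderate bit widths often preserve accuracy while shrinking the model considerably.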

Huffman coding is a frequently used lossless data compression algorithm [46]. Schmidhuber et al. [39] utilized Huffman coding to compress text files generated by a neural prediction network. Han et al. [40] used a three-stage compression strategy, comprising pruning, quantization and finally Huffman coding, to encode the quantized weights [9]. Ge et al. [47] proposed a hybrid model compression technique based on Huffman coding to capture the sparse nature of pruned weights. Huffman codes are superior to all other variable-length prefix codings. However, Elias and Golomb encodings [48] can exploit various intriguing characteristics, such as the recurrence of specific sequences, to achieve shorter average code lengths.
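A minimal Huffman coder for quantized weight values can be written with the standard textbook construction: the two least frequent symbols are repeatedly merged, so frequent values (such as the zeros produced by pruning) receive short bit strings. This sketch is generic and not taken from any of the cited works.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code for a sequence of (e.g. quantized-weight)
    symbols. Returns a dict {symbol: bitstring}."""
    freq = Counter(symbols)
    if len(freq) == 1:                  # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        counter += 1
        heapq.heappush(heap, (f1 + f2, counter, merged))
    return heap[0][2]
```

For a symbol stream such as `[0, 0, 0, 0, 1, 1, 2]`, the dominant value 0 receives a 1-bit code and the rarer values 2-bit codes, so the 7 symbols compress to 10 bits instead of 7 fixed-width codes.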

Despite the interesting findings of design and compression work on optimizing deep learning architectures, all previous researchers treated architecture optimization as a single-level problem. We show that CNN design can be improved when two optimization levels are considered, with a search space assigned to each level.