
1 Introduction

Deep neural networks, especially deep convolutional neural networks, have achieved tremendous success in computer vision and the broader artificial intelligence field. However, their large model size and high computation cost remain great hurdles for many applications, especially on resource-constrained devices with limited memory and computational power.

To address this issue, there has recently been a surge of interest in reducing the model complexity of DNNs. Representative techniques include quantization [3, 6, 9, 18, 21, 22, 29, 34, 39, 52,53,54,55], pruning [12, 13, 17, 36], low-rank decomposition [7, 8, 24, 27, 38, 49, 51], hashing [4], and deliberate architecture design [19, 23, 50]. Among these approaches, quantization-based methods represent the network weights with very low precision, thus yielding highly compact DNN models compared to their floating-point counterparts. Moreover, it has been shown that if both the network weights and activations are properly quantized, the convolution operations can be efficiently computed via bitwise operations [21, 39], enabling fast inference without a GPU.

Notwithstanding the promising results achieved by the existing quantization-based methods [3, 6, 9, 18, 21, 22, 29, 34, 39, 52,53,54,55], there is still a sizeable accuracy gap between the quantized DNNs and their full-precision counterparts, especially when quantized with extremely low bit-widths such as 1 bit or 2 bits. For example, using the state-of-the-art method of [3], a 50-layer ResNet model [15] with 1-bit weights and 2-bit activations can achieve 64.6% top-1 image classification accuracy on ImageNet validation set [40]. However, the full-precision reference is 75.3% [15], i.e., the absolute accuracy drop induced by quantization is as large as 10.7%.

This work is devoted to pushing the limit of network quantization algorithms to achieve better accuracy with low-precision weights and activations. We observe that existing methods often use simple, hand-crafted quantizers (e.g., uniform or logarithmic quantization) [11, 22, 31, 37, 52, 53] or quantizers that are pre-computed and then fixed during network training [3]. However, there is no guarantee that such simple quantizers are the best choices for network quantization. Moreover, the distributions of weights and activations can differ substantially across networks and even across layers of the same network. We believe a better quantizer should adapt to the weights and activations to gain more flexibility.

To this end, we propose to jointly train a quantized DNN and its associated quantizers. The proposed method not only makes the quantizers learnable, but also renders them compatible with bitwise operations so as to keep the fast inference merit of properly-quantized neural networks. Our quantizer can be optimized via backpropagation in a standard network training pipeline, and we further propose an algorithm based on quantization error minimization which yields better performance. The proposed quantization can be applied to both network weights and activations, and arbitrary bit-width can be achieved. Moreover, layer-wise quantizers with unshared parameters can be applied to gain further flexibility. We call the networks quantized by our method the “LQ-Nets”.

We evaluate our LQ-Nets with image classification tasks on the CIFAR-10 [25] and ImageNet [40] datasets. The experimental results show that they perform remarkably well across various network structures such as AlexNet [26], VGG-Net [41], GoogLeNet [42], ResNet [15] and DenseNet [20], surpassing previous quantization methods by a wide margin.

2 Related Work

A large number of works have been devoted to reducing DNN model size and improving inference efficiency for practical applications. We briefly review the existing approaches as follows.

Compact Network Design: To achieve fast inference, one strategy is to carefully design a compact network architecture [19, 23, 32, 42, 50]. For example, Network in Network [32] enhanced local modeling via micro networks and replaced the costly fully-connected layer with global average pooling. GoogLeNet [42] and SqueezeNet [23] utilized \(1\!\times \!1\) convolution layers to compute reductions before the expensive \(3\!\times \!3\) or \(5\!\times \!5\) convolutions. Similarly, ResNet [15] applied “bottleneck” structures with \(1\!\times \!1\) convolutions to keep the parameter count manageable when training deeper networks. The recently proposed computation-efficient network structures MobileNet [19] and ShuffleNet [50] employed the depth-wise convolution or group convolution advocated in [5, 48] to reduce the computation cost.

Network Parameter Reduction: Considerable efforts have been devoted to reducing the number of parameters in an existing network [4, 7, 8, 10, 12, 13, 17, 24, 27, 28, 35, 36, 38, 45, 46, 49, 51]. For example, by exploiting the redundancy of the filter weights, some methods replace the pre-trained weights with low-rank approximations [7, 8, 24, 27, 38, 49, 51]. Connection pruning was investigated in [12, 13] to reduce the parameters of AlexNet and VGG-Net, where significant reduction was achieved on the fully-connected layers. Promising results on modern network architectures such as ResNet were achieved recently by [17, 36]. A related technique is to regularize the network with structured sparsity to obtain a hardware-friendly DNN model [28, 35, 45]. Other approaches such as hashing and vector quantization [44] have also been explored to reduce DNN model complexity [4, 10, 46].

Network Quantization: Another category of existing methods, to which our method belongs, trains low-precision DNNs via quantization. These methods can be further divided into two subcategories: those quantizing weights only and those quantizing both weights and activations.

For weight-only quantization, Courbariaux et al. [6] constrained the weights to only two possible values, \(-1\) and 1 (i.e., binarization or 1-bit quantization). They obtained promising results on small datasets using stochastic binarization. Rastegari et al. [39] later demonstrated that deterministic binarization with optimized scaling factors to approximate the full-precision weights works better on deeper network structures and larger datasets. To obtain better accuracy, ternary and other multi-bit quantization schemes were explored in [9, 18, 29, 34, 52, 54]. It was shown in [52] that quantizing a network with five bits can achieve accuracy similar to its 32-bit floating-point counterpart via incremental group-wise quantization and re-training.

In the latter subcategory, Hubara et al. [21] and Rastegari et al. [39] proposed to binarize both weights and activations to \(-1\) and \(+1\). This way, the convolution operations can be implemented with efficient bitwise operations for substantial speed-up. To address the significant accuracy drop, multi-bit quantization was further studied in [22, 30, 33, 37, 43, 53]. A popular choice of the quantization function is uniform quantization [22, 53]. Miyashita et al. [37] used logarithmic quantization and improved inference efficiency via bit-shift operations. Cai et al. [3] proposed to binarize the network weights while quantizing the activations with multiple bits; a single activation quantizer, computed by fitting the probability density function of a half-wave Gaussian distribution, is applied to all network layers and kept fixed during training. In the multi-bit quantization methods of Tang et al. [43] and Li et al. [30], each bit binarizes the residual approximation error left by the previous bits.

Our proposed method can quantize both the weights and the activations with arbitrary bit-widths. Different from most of the previous methods, our quantizer is adaptively learned during network training.

3 LQ-Nets: Networks with Learned Quantization

In this section, we first briefly introduce the goal of neural network quantization. Then we present the details of our quantization method and how to train a quantized DNN model with it in a standard network training pipeline.

3.1 Preliminaries: Network Quantization

The main operations in deep neural networks are interleaved linear and non-linear transformations, expressed as

$$\begin{aligned} z = \sigma (\mathbf{w}^{\mathrm {T}}\mathbf{a}), \end{aligned}$$
(1)

where \(\mathbf{w}\in \mathbb {R}^{N}\) is the weight vector, \(\mathbf{a}\in \mathbb {R}^{N}\) is the input activation vector computed by the previous network layer, \(\sigma (\,\cdot \,)\) is a non-linear function, and z is the output activation. Convolutional layers are composed of multiple convolution filters \(\mathbf{w}_i\in \mathbb {R}^{C\cdot H\cdot W}\), where C, H and W are the number of filter channels, the kernel height, and the kernel width, respectively. Fully-connected layers can be viewed as a special type of convolutional layer. Modern deep neural networks often have millions of weight parameters, which incur a large memory footprint. Meanwhile, the large number of inner-product operations between the weights and feature vectors leads to high computation cost. These memory and computation costs are great hurdles for many applications on resource-constrained devices such as mobile phones.

The goal of network quantization is to represent the floating-point weights \(\mathbf{w}\) and/or activations \(\mathbf{a}\) with few bits. In general, a quantization function is a piecewise-constant function which can be written as

$$\begin{aligned} Q(x)=q_{l},\ \text {if }x\in (t_{l},t_{l+1}], \end{aligned}$$
(2)

where \(q_{l}\), \(l=1,...,L\), are the quantization levels and \((t_{l},t_{l+1}]\) are the quantization intervals. The quantization function maps all input values within a quantization interval to the corresponding quantization level, so a quantized value can be encoded with only \(\log _{2}L\) bits. Perhaps the simplest quantizer is the sign function used for binary quantization [21, 39]: \(Q(x)=+1\) if \(x\ge 0\) and \(-1\) otherwise. For quantization with 2 or more bits, the most commonly used quantizer is the uniform quantization function, where all quantization steps \(q_{l+1}-q_l\) are equal [22, 53]. Some methods instead use logarithmic quantization, which quantizes the data uniformly in the log domain [37].
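
For concreteness, here is a minimal NumPy sketch of the two quantizers just mentioned (our own illustration, not code from any cited method); the clipping range [x_min, x_max] of the uniform quantizer is an assumed free parameter.

```python
import numpy as np

def uniform_quantize(x, num_bits=2, x_min=0.0, x_max=1.0):
    """Map x onto 2**num_bits equally spaced levels in [x_min, x_max]."""
    num_levels = 2 ** num_bits                   # L quantization levels
    step = (x_max - x_min) / (num_levels - 1)    # equal steps q_{l+1} - q_l
    x_clipped = np.clip(x, x_min, x_max)
    idx = np.round((x_clipped - x_min) / step)   # nearest level index
    return x_min + idx * step

def sign_quantize(x):
    """1-bit (sign) quantizer used for binarization [21, 39]."""
    return np.where(np.asarray(x) >= 0, 1.0, -1.0)
```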

Quantizing the network weights yields highly compact and memory-efficient DNN models: with n-bit encoding, the compression rate is \(\frac{32}{n}\) or \(\frac{64}{n}\) compared to the 32-bit or 64-bit floating-point representation. Moreover, if both weights and activations are quantized properly, the inner product in Eq. (1) can be computed with bitwise operations such as xnor and popcnt, where xnor is the exclusive-nor logical operation and popcnt counts the number of 1’s in a bit string. Both operations can process at least 64 bits in one or a few clock cycles on most general computing platforms such as CPUs and GPUs, which potentially leads to a 64\(\times \) speedup.
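
To make the bitwise computation concrete, the sketch below (our own illustration, not an optimized kernel) evaluates the inner product of two vectors binarized to ±1 by packing their bits and applying xnor and a population count.

```python
import numpy as np

def pack_signs(signs):
    """Pack a ±1 vector into bytes: bit 1 encodes +1, bit 0 encodes -1."""
    return np.packbits((np.asarray(signs) > 0).astype(np.uint8))

def binary_dot(signs_w, signs_a):
    """Inner product of two ±1 vectors via xnor + popcount."""
    n = len(signs_w)
    xnor = np.bitwise_not(np.bitwise_xor(pack_signs(signs_w), pack_signs(signs_a)))
    matches = int(np.unpackbits(xnor)[:n].sum())   # popcount over the valid bits
    return 2 * matches - n                         # matches minus mismatches

# sanity check against the ordinary floating-point inner product
w = np.where(np.random.randn(256) >= 0, 1.0, -1.0)
a = np.where(np.random.randn(256) >= 0, 1.0, -1.0)
assert binary_dot(w, a) == int(w @ a)
```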

Fig. 1. Distributions of weights (left two columns) and activations (right two columns) at different layers of the ResNet-20 network trained on CIFAR-10. All the test-set images are used to collect the activation statistics.

3.2 Learnable Quantizers

An optimal quantizer should yield minimal quantization error for the input data distribution:

$$\begin{aligned} Q^*(x) = {\mathop {\hbox {arg min}}\limits _Q} \int p(x)(Q(x)-x)^2 dx, \end{aligned}$$
(3)

where p(x) is the probability density function of x. We can never be sure that popular quantizers such as the uniform quantizer are optimal choices for the network weights and activations. In Fig. 1 we present the statistical distributions of the weights and activations (after batch normalization (BN) and Rectified Linear Unit (ReLU) layers) in a trained floating-point network. The distributions are complex, differ across layers, and are not well served by a uniform quantizer. Of course, if we train a quantized network, the weight and activation distributions may change. But again, we can never be sure that any pre-defined quantizer is optimal for the task, and an improper quantizer can easily jeopardize the final accuracy.
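
The expected error in Eq. (3) can be estimated empirically by sampling x from the observed distribution, quantizing, and averaging the squared error. The sketch below is our own illustration; the bell-shaped samples merely stand in for real weight values.

```python
import numpy as np

def empirical_quant_error(x, levels):
    """Mean squared error of mapping each x to its nearest quantization level."""
    levels = np.sort(np.asarray(levels, dtype=np.float64))
    idx = np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)
    return float(np.mean((levels[idx] - x) ** 2))

# compare two hand-crafted 2-bit uniform quantizers on the same samples
x = 0.1 * np.random.randn(100000)                  # stand-in for weight samples
print(empirical_quant_error(x, np.linspace(-0.3, 0.3, 4)))
print(empirical_quant_error(x, np.linspace(-0.15, 0.15, 4)))
```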

To obtain better network quantizers and improve the accuracy of a quantized network, we propose to jointly train the network and its quantizers. The insight is that if the quantizers are learnable and optimized during network training, they can not only minimize the quantization error but also adapt to the training objective, thus improving the final accuracy. A naive way to train the quantizers would be to directly optimize the quantization levels \(\{q_l\}\) during network training. However, such a strategy would render the quantization functions incompatible with bitwise operations, which is undesirable as we want to keep the fast-inference merit of quantized neural networks.

Fig. 2. Illustration of our learnable quantizer for the 2-bit (left) and 3-bit (right) cases. For each case, the left figure shows how the quantization levels are generated by the basis vector, and the right figure illustrates the corresponding quantization function.

To resolve this issue, we need to confine our quantization functions to a subspace that is compatible with bitwise operations. But how can the quantizers be confined to such a space during training? Our solution is inspired by uniform quantization, which is bit-op compatible (see [53]). Uniform quantization essentially maps floating-point numbers to their nearest fixed-point integers up to a normalization factor, and the key property that makes it bit-op compatible is that the quantized values can be decomposed into a linear combination of the bits. Specifically, an integer q represented by a K-bit binary encoding is the inner product between a basis vector and the binary coding vector \(\mathbf {b}=[b_1,b_2,...,b_K]^{{\mathrm {T}}}\) with \(b_i\in \{0,1\}\), i.e.,

$$\begin{aligned} q=\left\langle \left[ \begin{array}{c} 1\\ 2\\ ...\\ 2^{K-1} \end{array}\right] ,\left[ \begin{array}{c} b_{1}\\ b_{2}\\ ...\\ b_{K} \end{array}\right] \right\rangle . \end{aligned}$$
(4)

In order to learn the quantizers while keeping them compatible with bitwise operations, we can simply learn the basis vector, which consists of K scalars.

Concretely, our learnable quantization function is simply in the form of

$$\begin{aligned} Q_{\text {ours}}(x,\mathbf{v})=\mathbf {v}^{{\mathrm {T}}}\mathbf {e}_{l},\qquad \text {if }x\in (t_{l},t_{l+1}], \end{aligned}$$
(5)

where \(\mathbf {v}\in \mathbb {R}^{K}\) is the learnable floating-point basis and \(\mathbf {e}_{l}\in \{-1,1\}^{K}\), \(l=1,\ldots ,2^{K}\), enumerates all the K-bit binary encodings from \([-1,\ldots ,-1]\) to \([1,\ldots ,1]\). For K-bit quantization, the \(2^{K}\) quantization levels are generated as \(q_{l}=\mathbf {v}^{{\mathrm {T}}}\mathbf {e}_{l}\) for \(l=1,\ldots ,2^K\). Given \(\{q_l\}\) and assuming \(q_{1}<q_{2}<...<q_{2^K}\), it is easy to show that the optimal \(\{t_l\}\) minimizing the error in Eq. (3) are the midpoints \(t_{l}=(q_{l-1}+q_{l})/2\) for \(l=2,...,2^K\) (with \(t_1=-\infty \) and \(t_{2^K+1}=+\infty \)). Figure 2 illustrates our quantizer for the 2-bit and 3-bit cases.
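
A minimal NumPy sketch of the quantization function in Eq. (5) is given below; the function name and interface are ours, not those of the released implementation.

```python
import numpy as np
from itertools import product

def lq_quantize(x, v):
    """Quantize x with a learned basis v (Eq. (5)).

    v : (K,) basis vector; the 2**K levels are q_l = v^T e_l with e_l in {-1,1}^K.
    Returns the quantized values and their {-1,1} encodings of shape (..., K).
    """
    K = v.shape[0]
    codes = np.array(list(product([-1.0, 1.0], repeat=K)))  # all 2^K encodings e_l
    levels = codes @ v                                       # q_l = v^T e_l
    order = np.argsort(levels)
    levels, codes = levels[order], codes[order]
    thresholds = (levels[:-1] + levels[1:]) / 2.0            # t_l = (q_{l-1} + q_l) / 2
    idx = np.searchsorted(thresholds, x)                     # interval containing each x
    return levels[idx], codes[idx]

# 2-bit example: v = [0.5, 1.0] yields the levels {-1.5, -0.5, 0.5, 1.5}
q, b = lq_quantize(np.array([-2.0, 0.1, 0.6, 3.0]), np.array([0.5, 1.0]))
```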

We now show how the inner products between our quantized weights and activations can be computed by bitwise operations. Let a weight vector \(\mathbf{w}\in \mathbb {R}^N\) be encoded by the vectors \(\mathbf{b}^w_{i}\in \{-1,1\}^N\), \({i}=1,\ldots ,K_w\) where \(K_w\) is the bit-width for weights and \(\mathbf{b}^w_{i}\) consists of the encoding of the i-th bit for all the values in \(\mathbf{w}\). Similarly, for activation vector \(\mathbf{a}\in \mathbb {R}^N\) we have \(\mathbf{b}^a_{j}\in \{-1,1\}^N\), \({j}=1,\ldots ,K_a\). It can be readily derived that

$$\begin{aligned} Q_{\text {ours}}(\mathbf {w},\mathbf {v}^{w})^{{\mathrm {T}}}Q_{\text {ours}}(\mathbf {a},\mathbf {v}^{a})= \sum _{i=1}^{K_{w}}\sum _{j=1}^{K_{a}}{v}^w_i{v}^a_j (\mathbf b _{i}^{w}\odot \mathbf b _{j}^{a}) \end{aligned}$$
(6)

where \(\mathbf{v}^w\in \mathbb {R}^{K_w}\) and \(\mathbf{v}^a\in \mathbb {R}^{K_a}\) are the learned basis vectors for the weight and activation quantizers respectively, and \(\odot \) denotes the inner product with bitwise operations xnor and popcnt.
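
The sketch below spells out Eq. (6) with ordinary arithmetic in place of the packed xnor/popcnt kernel (the \(\odot \) product of two ±1 vectors equals their usual inner product); the names and array shapes are our own convention. Combined with lq_quantize above, one can check that it matches the direct inner product of the quantized vectors.

```python
import numpy as np

def lq_inner_product(Bw, vw, Ba, va):
    """Eq. (6): inner product of LQ-quantized weights and activations.

    Bw : (Kw, N) weight encodings in {-1, 1};  vw : (Kw,) weight basis.
    Ba : (Ka, N) activation encodings in {-1, 1};  va : (Ka,) activation basis.
    In deployment, each product Bw[i] . Ba[j] would be computed with xnor + popcnt.
    """
    result = 0.0
    for i in range(vw.shape[0]):
        for j in range(va.shape[0]):
            result += vw[i] * va[j] * float(Bw[i] @ Ba[j])
    return result
```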

In practice, we apply layer-wise quantizers for activations (i.e., one quantizer per layer) and channel-wise quantizers for weights (one quantizer for each conv filter). The number of extra parameters introduced by the quantizers is negligible compared to the large volume of network weights.

3.3 Training Algorithm

To train the LQ-Nets, we keep floating-point network weights, which are quantized before each convolution and optimized with error back-propagation (BP) and gradient descent. After training, the floating-point weights can be discarded; only their binary codes and the quantizer bases are kept. We now present how we optimize the quantizers.

Quantizer Optimization: A simple and straightforward way to optimize our quantizers is through BP, similar to weight optimization. Here we instead present an algorithm based on quantization error minimization, which optimizes our quantizers in the forward passes during training. This algorithm leads to much better performance, as we show later in the experiments.

Let \(\mathbf {x}=[x_{1},...,x_{N}]^{{\mathrm {T}}}\in \mathbb {R}^{N}\) be the full-precision data (weights or activations) and K be the specified bit number for quantization. Our goal is to find an optimal quantizer basis \(\mathbf{v}\in \mathbb {R}^K\) as well as an encoding \(B=[\mathbf {b}_{1},...,\mathbf {b}_{N}]\in \{-1,1\}^{K\times N}\) that minimize the quantization error:

$$\begin{aligned} \mathbf{v}^*,B^*={\mathop {\hbox {arg min}}\limits _{\mathbf{v},B}}\left\| B^{{\mathrm {T}}}\mathbf {v}-\mathbf {x}\right\| _2^2,\ \ \ s.t.\ B\in \{-1,1\}^{K\times N}. \end{aligned}$$
(7)

Equation (7) is hard to solve exactly: provably finding the optimal solution by brute-force search takes time exponential in the size of B. For efficiency, we alternately solve for \(\mathbf{v}\) and B in a block coordinate descent fashion:

  • Fix \(\mathbf {v}\) and update B. Given \(\mathbf {v}\), the optimal encoding \(B^*\) can be simply found by looking up the quantization intervals \(t_{1},...,t_{2^K+1}\).

  • Fix B and update \(\mathbf {v}\). Given B, Eq. (7) reduces to a linear regression problem with a closed form solution as

    $$\begin{aligned} \mathbf {v}^*=(BB^{{\mathrm {T}}})^{-1}B\mathbf {x\,}. \end{aligned}$$
    (8)

We iterate this alternation T times. For brevity, we refer to the above procedure as the QEM (Quantization Error Minimization) algorithm.
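
A minimal sketch of the QEM procedure under these definitions is shown below (function name ours; a real implementation would vectorize this over filters and layers, and Eq. (8) assumes \(BB^{{\mathrm {T}}}\) is invertible).

```python
import numpy as np
from itertools import product

def qem_update(x, v, num_iters=1):
    """Alternating minimization of Eq. (7) for data x of shape (N,) and basis v of shape (K,)."""
    K = v.shape[0]
    codes = np.array(list(product([-1.0, 1.0], repeat=K)))   # all 2^K encodings
    B = None
    for _ in range(num_iters):
        # Fix v, update B: assign each x_n to its nearest level q_l = v^T e_l
        levels = codes @ v
        idx = np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)
        B = codes[idx].T                                      # (K, N) encoding matrix
        # Fix B, update v: closed-form least squares, Eq. (8)
        v = np.linalg.solve(B @ B.T, B @ x)
    return v, B
```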

Network Training: We use a standard mini-batch based approach to train the LQ-Nets, and quantizer learning is conducted in the forward passes with the QEM algorithm. Since, for activation quantization, only part of the input data is visible in each iteration due to mini-batch sampling, we apply a moving average to the optimized quantizer parameters (i.e., the basis vectors). We also apply the moving-average strategy to the weight quantizers for more stability. The operations in our quantizers are summarized in Algorithm 1.
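
The moving-average update for the quantizer parameters can be as simple as the sketch below; the exponential form is our assumption, with the factor 0.9 taken from Sect. 4.1.

```python
def update_running_basis(v_running, v_batch, momentum=0.9):
    """Exponential moving average of the quantizer basis across mini-batches."""
    return momentum * v_running + (1.0 - momentum) * v_batch
```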

In the backward pass, direct error back-propagation would be problematic because the gradient of the quantization function is zero almost everywhere. To tackle this issue, we use the Straight-Through Estimator (STE) proposed in [2] to compute the gradients. Specifically, for activations we set the gradient of the quantization function to 1 for values between \(q_1\) and \(q_{2^K}\) defined in Eq. (5) and to 0 elsewhere; for weights, the gradient is set to 1 everywhere [3]. The QEM algorithm operates only in the forward pass, so the backward pass leaves the quantizers unchanged (unless BP is used to train them instead).
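
A sketch of these STE rules is given below (our own illustration; grad_out denotes the gradient arriving from the layer above).

```python
import numpy as np

def ste_grad_activations(x, grad_out, q_min, q_max):
    """Pass the gradient through only where q_1 <= x <= q_{2^K}, zero elsewhere."""
    return grad_out * ((x >= q_min) & (x <= q_max))

def ste_grad_weights(x, grad_out):
    """For weights, the quantizer gradient is taken to be 1 everywhere [3]."""
    return grad_out
```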

Algorithm 1. The operations of our quantizers.

4 Experiments

In this section, we evaluate the proposed method on two image classification datasets: CIFAR-10 [25] and ImageNet (ILSVRC12) [40]. The CIFAR-10 dataset consists of 60,000 color images of size \(32\times 32\) belonging to 10 classes (6,000 images per class). There are 50,000 training and 10,000 test images. ImageNet ILSVRC12 contains about 1.2 million training and 50K validation images of 1,000 object categories.

Although our method is designed to quantize both weights and activations to facilitate fast inference through bitwise operations, we also conduct experiments of weight-only quantization and compare with the prior art.

4.1 Implementation Details

Our LQ-Nets are implemented in TensorFlow [1] and trained with the aid of the Tensorpack library [47]. We present our implementation details as follows.

Quantizer Implementation: We apply layer re-ordering to the networks similar to [3, 39]: the typical Conv\(\rightarrow \)BN\(\rightarrow \)ReLU sequence is re-organized as BN\(\rightarrow \)ReLU (\(\rightarrow \)Quant.)\(\rightarrow \)Conv, following previous methods [3, 16, 20, 21, 39]. For fair comparison with the method of [3], we use the hyper-parameters described in [3] to train all the networks with 1-bit weights and 2-bit activations. The iteration number T in our QEM algorithm is fixed to 1 (no significant benefit was observed with larger values; see Sect. 4.2). The moving-average factor for quantizer learning is fixed to 0.9. Details of all our hyper-parameter settings can be found in the supplementary material as well as in our released source code.
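
The re-ordered block can be sketched framework-agnostically as follows; the callables and their names are placeholders of our own, not the released code.

```python
def quantized_conv_block(x, bn, relu, quantize_act, conv):
    """Re-ordered block: BN -> ReLU (-> Quant.) -> Conv (cf. [3, 39]).

    bn, relu, quantize_act and conv are callables supplied by the framework;
    the activation quantizer runs right before the convolution so that the
    convolution consumes low-precision inputs (its weights are quantized with
    their own per-filter bases).
    """
    x = bn(x)
    x = relu(x)
    x = quantize_act(x)   # e.g. lq_quantize with this layer's learned basis
    return conv(x)
```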

In the remaining text, we use “W/A” to denote the number of bits used for weights/activations. A bit-width of 32 indicates 32-bit floating-point values without quantization (thus “w/32” with \(w<32\) indicates weight-only quantization, and “32/32” denotes the “full-precision” (FP) models). For the experiments on CIFAR-10, we run our method 5 times and report the mean accuracy.

Fig. 3. Error curves with the two optimization methods.

4.2 Performance Analysis

Effectiveness of the QEM Algorithm: Our quantizers can be trained either by the proposed QEM algorithm or by a naive BP procedure. In this experiment, we evaluate the effectiveness of the QEM algorithm against BP. Table 1 shows the performance of the quantized ResNet-20 models on the CIFAR-10 test set, and Fig. 3 presents the corresponding training and testing curves. The quantized networks trained with QEM are clearly better than those trained with BP, for both weight-only and weight-and-activation quantization. In all the following experiments, we use the QEM algorithm to optimize our quantizers.

Table 1. Optimization method comparison on the ResNet-20 model.
Table 2. Accuracy w.r.t. QEM iteration number T

Table 2 shows the accuracy of quantized ResNet-20 models with different QEM iteration numbers T. As can be seen, using \(T=2,3\) or 4 showed no significant benefit over \(T=1\). Note that each time the solver starts from the result of the last training iteration (see Line 6 in Algorithm 1), which is a good starting point, especially once the gradients become small after a few epochs. The good performance with \(T=1\) suggests that the inner iterations of the alternating optimization can be effectively substituted by the training iterations. We therefore use \(T=1\) in all experiments in this paper.

Effectiveness of the Learnable Quantizers: The key idea of our method is to apply flexible quantizers and optimize them jointly with the network. Table 3 compares our method with two previous methods, DoReFa-Net [53] and HWGQ [3]: the former is based on fixed uniform quantizers, while the latter pre-computes the quantizer by fitting a half-wave Gaussian distribution. With 1-bit weights and 2-bit activations, the ResNet-18 model with our learnable quantizers outperforms HWGQ under the same setting and also outperforms DoReFa-Net with 4-bit activations on ImageNet. More comparisons on various network structures can be found in Sect. 4.3.

Table 3. Comparison of different quantization methods (ResNet-18 on ImageNet)
Fig. 4. Statistics of the weights (top row) and activations (bottom row) before (i.e., the floating-point values) and after quantization. A ResNet-20 model trained on CIFAR-10 with “2/2” quantization is used. The orange diamonds indicate the four quantization levels of our learned quantizers. Note that in the left figures for the floating-point values, the histogram bins have equal step size, whereas in the right figures each of the four bins contains all the values quantized to its corresponding quantization level.

Figure 4 presents the weight and activation statistics in two layers of a trained ResNet-20 model before (i.e., the floating-point values) and after quantization with our method. The network is quantized with “2/2” bits, and the floating-point weights are taken from the last training iteration (these values are discarded after training; only the quantized values are used at inference time). The floating-point activations are obtained using all the test images of CIFAR-10. As can be seen, our learned quantizers are not uniform and they differ across layers. Statistical results with more bits can be found in the supplementary material.

Performance w.r.t. Bit-width: We now study the impact of bit-width on the performance of our LQ-Nets. Table 4 shows the results of three network structures: ResNet-20, VGG-Small and ResNet-18.

Table 4. Impact of bit-width on our LQ-Nets

On the CIFAR-10 dataset, our low-precision networks achieve high accuracy. The accuracy with “3/32” quantization roughly matches our full-precision results for both ResNet-20 and VGG-Small. The accuracy decreases gracefully with fewer weight bits, and the absolute drops are small even with 1-bit weights: 2.0% for ResNet-20 and 0.3% for VGG-Small. The drops are more appreciable when quantizing both weights and activations, though the largest absolute drop is only 3.7%, for the “1/2” ResNet-20 model. Only very minor drops (at most 0.4%) are observed for VGG-Small, which has many more parameters than ResNet-20.

On the more challenging ImageNet dataset, the accuracy drops of the ResNet-18 model after quantization are relatively larger, especially at very low precision: the largest absolute drop is 7.7% (70.3%\(\rightarrow \)62.6%) with “1/2” bit-widths. Nevertheless, our learnable quantizer is particularly beneficial when using 2 or more bits thanks to its flexibility: the accuracy of the quantized ResNet-18 increases quickly with 2 or more bits, as shown in Table 4. The gap is almost closed with “4/32” bits (only 0.3% absolute difference), and the drop in the “4/4” case is as low as 1%.

4.3 Comparison with Previous Methods

In this section, we compare our quantization method with existing methods including TWN [29], TTQ [54], BNN [21], BWN [39], XNOR-Net [39], DoReFa-Net [53], HWGQ [3] and ABC-Net [33], on various network architectures for the CIFAR-10 and ImageNet classification tasks.

Table 5. Comparison of quantized VGG-Small networks on CIFAR-10

Comparison on CIFAR-10: Table 5 presents the results of the VGG-Small model quantized using different methods. All these methods quantize (or binarize) both weights and activations to achieve extremely low precision. With 1-bit weights and 2-bit activations, the accuracy using our method is significantly better than the state-of-the-art method HWGQ (93.4% vs. 92.5%).

Table 6. Comparison with state-of-the-art quantization methods on ImageNet. “FP” denotes “Full Precision”; the “W/A” values are the bit-widths of weights/activations.

Comparison on ImageNet: The results on ImageNet validation set are presented in Table 6. For weight-only quantization, our LQ-Nets outperformed BWN, TWN, TTQ and DoReFa-Net by large margins.

As for quantizing both weights and activations, our results are significantly better than those of DoReFa-Net and HWGQ at very low bit-widths (1 bit for weights and a few bits for activations). Our method is even more advantageous at larger bit-widths: Table 6 shows that with more bits (2, 3, or 4), the accuracy improves dramatically. For example, with “4/4” bits, the top-1 accuracy of ResNet-50 is boosted from 68.7% (with “1/2” bits) to 75.1%, an absolute increase of 6.4%, and the gap to its FP counterpart shrinks to 1.3%. According to Table 6, the accuracy of our LQ-Nets comprehensively surpasses that of the competing methods under the same bit-width settings.

Table 7. Training time comparison

4.4 Training Time

Compared to training floating-point networks, our extra cost lies in quantizer optimization. In the QEM algorithm, the cost of solving B is negligible. For N input scalars, the time complexity of solving \(\mathbf {v}\) of length K is \(O(K^2N)\), which is relatively small compared to the convolution operations in theory. Table 7 shows the total training time comparison based on our current, unoptimized implementation. The network is ResNet-18, and no bitwise operations are used in any case. Our training time increases gracefully with larger bit-widths.

5 Conclusions

We have presented a novel DNN quantization method that achieves state-of-the-art accuracy for various network structures. The key idea is to apply learnable quantizers that are jointly trained with the network parameters to gain more flexibility. Our quantizers can be applied to both weights and activations, and they are kept compatible with bitwise operations to facilitate fast inference. In the future, we plan to deploy our LQ-Nets on resource-constrained devices such as mobile phones and test their performance.