Abstract
Convolutional neural networks have recently shown excellent results in general object detection and many other tasks. Although very effective, they involve many user-defined design choices. In this paper we seek to better understand these choices by inspecting two key aspects: “what did the network learn?” and “what can the network learn?”. We exploit new annotations (Pascal3D+) to enable a new empirical analysis of the R-CNN detector. Contrary to common belief, our results indicate that existing state-of-the-art convnets are not invariant to various appearance factors. In fact, all considered networks have similar weak points, which cannot be mitigated by simply increasing the training data (architectural changes are needed). We show that overall performance can improve when image renderings are used as data augmentation. We report the best known results on the Pascal3D+ detection and viewpoint estimation tasks.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Pepik, B., Benenson, R., Ritschel, T., Schiele, B. (2015). What Is Holding Back Convnets for Detection? In: Gall, J., Gehler, P., Leibe, B. (eds) Pattern Recognition. DAGM 2015. Lecture Notes in Computer Science, vol 9358. Springer, Cham. https://doi.org/10.1007/978-3-319-24947-6_43
DOI: https://doi.org/10.1007/978-3-319-24947-6_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24946-9
Online ISBN: 978-3-319-24947-6
eBook Packages: Computer Science, Computer Science (R0)