Abstract
We study the problem of translating an image-based, step-by-step assembly manual created by human designers into machine-interpretable instructions. We formulate this problem as a sequential prediction task: at each step, our model reads the manual, locates the components to be added to the current shape, and infers their 3D poses. This task is challenging for two reasons: it requires establishing a 2D-3D correspondence between the manual image and the real 3D object, and it requires estimating 3D poses for unseen objects, since a component added in one step can itself be an object built in previous steps. To address these two challenges, we present a novel learning-based framework, the Manual-to-Executable-Plan Network (MEPNet), which reconstructs the assembly steps from a sequence of manual images. The key idea is to integrate neural 2D keypoint detection modules with 2D-3D projection algorithms, yielding high-precision predictions and strong generalization to unseen components. MEPNet outperforms existing methods on three newly collected LEGO manual datasets and a Minecraft house dataset.
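The per-step inference described above (detect 2D keypoints of a component in the manual image, then recover its 3D pose by matching against the component's projection) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes an orthographic camera, grid-aligned LEGO-style poses restricted to 90-degree rotations about the vertical axis, and already-detected keypoints; all function names (`project`, `rot_z`, `infer_pose`) are illustrative.

```python
import numpy as np

def rot_z(k):
    """Rotation by k * 90 degrees about the z-axis (LEGO-style poses)."""
    c = int(round(np.cos(k * np.pi / 2)))
    s = int(round(np.sin(k * np.pi / 2)))
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

def project(points_3d, R):
    """Orthographic projection: rotate the 3D points, then drop z."""
    return (points_3d @ R.T)[:, :2]

def infer_pose(component_pts, detected_kps):
    """Brute-force the discrete rotation and 2D translation that best
    align the component's projected keypoints with the detected ones.

    component_pts: (N, 3) keypoints of the component in canonical pose.
    detected_kps:  (N, 2) corresponding 2D keypoints from the manual image.
    Returns (k, t): rotation index and 2D translation.
    """
    best = None
    for k in range(4):  # only 0/90/180/270-degree rotations are possible
        proj = project(component_pts, rot_z(k))
        # Align centroids to get the translation for this rotation.
        t = detected_kps.mean(axis=0) - proj.mean(axis=0)
        err = np.abs(proj + t - detected_kps).sum()
        if best is None or err < best[0]:
            best = (err, k, t)
    return best[1], best[2]

# Usage: an L-shaped component, rendered at a 90-degree rotation and offset.
component = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [0, 1, 0]], dtype=float)
keypoints = project(component, rot_z(1)) + np.array([5.0, 3.0])
k, t = infer_pose(component, keypoints)
print(k, t)  # recovers the rotation index and translation
```

Because the pose search is over a small discrete set, the 2D-3D matching step reduces to exhaustive comparison; the actual MEPNet pipeline replaces the hand-given keypoints here with learned neural keypoint detections.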
C.-Y. Cheng: Work done while at the Autodesk AI Lab.
Notes
- 1.
In the implementation we used a slightly more complex scheme to handle symmetries. Details are included in the supplementary material.
Acknowledgements
We thank Joy Hsu, Chengshu Li, and Samuel Clarke for detailed feedback on the paper. This work is partly supported by Autodesk, the Stanford Institute for Human-Centered AI (HAI), the Stanford Center for Integrated Facility Engineering (CIFE), ARMY MURI grant W911NF-15-1-0479, NSF CCRI #2120095, the Samsung Global Research Outreach (GRO) Program, and Amazon, Analog, Bosch, IBM, Meta, and Salesforce.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, R., Zhang, Y., Mao, J., Cheng, CY., Wu, J. (2022). Translating a Visual LEGO Manual to a Machine-Executable Plan. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19835-9
Online ISBN: 978-3-031-19836-6
eBook Packages: Computer Science (R0)