Abstract
We study the problem of translating an image-based, step-by-step assembly manual created by human designers into machine-interpretable instructions. We formulate this problem as a sequential prediction task: at each step, our model reads the manual, locates the components to be added to the current shape, and infers their 3D poses. This task is challenging for two reasons: it requires establishing a 2D-3D correspondence between the manual image and the real 3D object, and it requires estimating 3D poses for unseen objects, since a component added in one step can itself be an object built in previous steps. To address these two challenges, we present a novel learning-based framework, the Manual-to-Executable-Plan Network (MEPNet), which reconstructs the assembly steps from a sequence of manual images. The key idea is to integrate neural 2D keypoint detection modules with 2D-3D projection algorithms, yielding high-precision predictions and strong generalization to unseen components. MEPNet outperforms existing methods on three newly collected LEGO manual datasets and a Minecraft house dataset.
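The per-step inference described above (detect 2D keypoints of a component in the manual image, then recover its 3D pose by matching against the component's projection) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes an orthographic camera, grid-aligned LEGO-style poses restricted to 90-degree rotations about the vertical axis, and already-detected keypoints; all function names (`project`, `rot_z`, `infer_pose`) are illustrative.

```python
import numpy as np

def rot_z(k):
    """Rotation by k * 90 degrees about the z-axis (LEGO-style poses)."""
    c = int(round(np.cos(k * np.pi / 2)))
    s = int(round(np.sin(k * np.pi / 2)))
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

def project(points_3d, R):
    """Orthographic projection: rotate the 3D points, then drop z."""
    return (points_3d @ R.T)[:, :2]

def infer_pose(component_pts, detected_kps):
    """Brute-force the discrete rotation and 2D translation that best
    align the component's projected keypoints with the detected ones.

    component_pts: (N, 3) keypoints of the component in canonical pose.
    detected_kps:  (N, 2) corresponding 2D keypoints from the manual image.
    Returns (k, t): rotation index and 2D translation.
    """
    best = None
    for k in range(4):  # only 0/90/180/270-degree rotations are possible
        proj = project(component_pts, rot_z(k))
        # Align centroids to get the translation for this rotation.
        t = detected_kps.mean(axis=0) - proj.mean(axis=0)
        err = np.abs(proj + t - detected_kps).sum()
        if best is None or err < best[0]:
            best = (err, k, t)
    return best[1], best[2]

# Usage: an L-shaped component, rendered at a 90-degree rotation and offset.
component = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [0, 1, 0]], dtype=float)
keypoints = project(component, rot_z(1)) + np.array([5.0, 3.0])
k, t = infer_pose(component, keypoints)
print(k, t)  # recovers the rotation index and translation
```

Because the pose search is over a small discrete set, the 2D-3D matching step reduces to exhaustive comparison; the actual MEPNet pipeline replaces the hand-given keypoints here with learned neural keypoint detections.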
C.-Y. Cheng: Work done while at the Autodesk AI Lab.
Notes
- 1.
In the implementation we used a slightly more complex scheme to handle symmetries. Details are included in the supplementary material.
Acknowledgements
We thank Joy Hsu, Chengshu Li, and Samuel Clarke for detailed feedback on the paper. This work is partly supported by Autodesk, the Stanford Institute for Human-Centered AI (HAI), the Stanford Center for Integrated Facility Engineering (CIFE), ARMY MURI grant W911NF-15-1-0479, NSF CCRI #2120095, the Samsung Global Research Outreach (GRO) Program, and Amazon, Analog, Bosch, IBM, Meta, and Salesforce.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, R., Zhang, Y., Mao, J., Cheng, CY., Wu, J. (2022). Translating a Visual LEGO Manual to a Machine-Executable Plan. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19835-9
Online ISBN: 978-3-031-19836-6
eBook Packages: Computer Science (R0)