Translating a Visual LEGO Manual to a Machine-Executable Plan

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13697)

Abstract

We study the problem of translating an image-based, step-by-step assembly manual created by human designers into machine-interpretable instructions. We formulate this problem as a sequential prediction task: at each step, our model reads the manual, locates the components to be added to the current shape, and infers their 3D poses. This task poses two challenges: establishing a 2D-3D correspondence between the manual image and the real 3D object, and estimating the 3D poses of unseen objects, since a component added in one step may itself be an assembly built in previous steps. To address these challenges, we present a novel learning-based framework, the Manual-to-Executable-Plan Network (MEPNet), which reconstructs the assembly steps from a sequence of manual images. The key idea is to integrate neural 2D keypoint detection modules with 2D-3D projection algorithms, yielding high-precision predictions and strong generalization to unseen components. MEPNet outperforms existing methods on three newly collected LEGO manual datasets and a Minecraft house dataset.
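
To make the sequential formulation concrete, here is a minimal, hypothetical sketch of a single assembly step under simplifying assumptions: a known pinhole camera for the rendered manual page, a stubbed-out keypoint detector, and OpenCV's `solvePnP` standing in for the paper's 2D-3D projection algorithms. All names (`Component`, `detect_keypoints`, `infer_pose`) and the synthetic data are illustrative, not the authors' code.

```python
# Hypothetical sketch of one MEPNet-style assembly step: detect 2D keypoints
# of a component in the manual image, then lift them to a 6D pose via PnP.
from dataclasses import dataclass

import cv2
import numpy as np


@dataclass
class Component:
    name: str
    keypoints_3d: np.ndarray  # (N, 3) canonical keypoints, e.g. stud centers


# Assumed known pinhole camera for a rendered manual page (not from the paper).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
DIST = np.zeros(5)  # rendered manual images have no lens distortion


def detect_keypoints(manual_image: np.ndarray, component: Component) -> np.ndarray:
    """Placeholder for the learned 2D keypoint module; returns (N, 2) pixels."""
    raise NotImplementedError  # a heatmap-based detector would run here


def infer_pose(kp_2d: np.ndarray, component: Component):
    """Recover a 6D pose from 2D-3D keypoint correspondences via PnP."""
    ok, rvec, tvec = cv2.solvePnP(component.keypoints_3d, kp_2d, K, DIST,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("pose estimation failed")
    return rvec, tvec  # axis-angle rotation and translation


# Demo on synthetic data: project a known pose, then recover it.
brick = Component("2x4 brick", np.array(
    [[x, y, 0.0] for x in range(4) for y in range(2)], dtype=np.float64))
rvec_gt, tvec_gt = np.array([0.1, 0.4, 0.2]), np.array([0.5, -0.3, 6.0])
kp_2d, _ = cv2.projectPoints(brick.keypoints_3d, rvec_gt, tvec_gt, K, DIST)
kp_2d = kp_2d.reshape(-1, 2)  # stands in for the detector's output

rvec, tvec = infer_pose(kp_2d, brick)
print(rvec.ravel(), tvec.ravel())  # ~= [0.1 0.4 0.2] [0.5 -0.3 6.0]
```

In the full pipeline, the inferred component and its pose would be appended to the executable plan and composited into the current partial shape before the next manual image is read; this is what lets a sub-assembly built in earlier steps reappear later as a single unseen "component."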

C.-Y. Cheng—Work done while at Autodesk AI Lab.


Notes

  1. In the implementation, we used a slightly more complex scheme to handle symmetries. Details are included in the supplementary material.

  2. https://trevorsandy.github.io/lpub3d/.


Acknowledgements

We thank Joy Hsu, Chengshu Li, and Samuel Clarke for detailed feedback on the paper. This work is partly supported by Autodesk, the Stanford Institute for Human-Centered AI (HAI), the Stanford Center for Integrated Facility Engineering (CIFE), ARMY MURI grant W911NF-15-1-0479, NSF CCRI #2120095, the Samsung Global Research Outreach (GRO) Program, and Amazon, Analog, Bosch, IBM, Meta, and Salesforce.

Author information


Corresponding author

Correspondence to Ruocheng Wang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 603 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, R., Zhang, Y., Mao, J., Cheng, CY., Wu, J. (2022). Translating a Visual LEGO Manual to a Machine-Executable Plan. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_38


  • DOI: https://doi.org/10.1007/978-3-031-19836-6_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19835-9

  • Online ISBN: 978-3-031-19836-6

  • eBook Packages: Computer Science, Computer Science (R0)
