Learning Neuro-symbolic Programs for Language Guided Robot Manipulation

ICRA 2023

Namasivayam K*1, Himanshu Singh*1, Vishal Bindal*1, Arnav Tuli1, Vishwajeet Agrawal#1, Rahul Jain#1, Parag Singla1, Rohan Paul1
1All authors are with IIT Delhi; * and # denote equal contribution

Put the white dice above the yellow lego object and move the yellow cube on top of the white dice

NSRM is a novel neuro-symbolic model that learns the semantics of visual, action, and spatial concepts in an end-to-end manner with no sub-goal supervision.

Abstract

Given a natural language instruction along with an input and an output scene, our goal is to train a neuro-symbolic model that outputs a manipulation program which, when executed by the robot on the input scene, results in the desired output scene.

Prior approaches for this task possess one of the following limitations: (i) they rely on hand-coded symbols for concepts, limiting generalization beyond those seen during training [1]; (ii) they infer action sequences from instructions but require dense sub-goal supervision [2]; or (iii) they lack the semantics required for the deeper object-centric reasoning inherent in interpreting complex instructions [3]. In contrast, our approach is neuro-symbolic and can handle linguistic as well as perceptual variations, is end-to-end differentiable requiring no intermediate supervision, and makes use of symbolic reasoning constructs which operate on a latent neural object-centric representation, allowing for deeper reasoning over the input scene.

Central to our approach is a modular structure consisting of a hierarchical instruction parser and a manipulation module that learns disentangled action representations, both trained via reinforcement learning. Our experiments in a simulated environment with a 7-DOF manipulator, covering instructions with a varying number of steps, scenes with different numbers of objects, and objects with unseen attribute combinations, demonstrate that our model is robust to such variations and significantly outperforms existing baselines, particularly in generalization settings.
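To make the idea of an induced manipulation program concrete, below is a minimal illustrative sketch (in Python) of how the teaser instruction at the top of this page could decompose into object-grounding and action steps. All names here (Obj, filter_objects, move_above) are hypothetical and not the paper's API; in NSRM the grounding and actions are realized by learned neural modules operating on latent object representations rather than by exact symbolic matching.

    # Hypothetical sketch of a neuro-symbolic manipulation program.
    # The function and class names are illustrative, not NSRM's actual interface.

    from dataclasses import dataclass
    from typing import List, Tuple


    @dataclass
    class Obj:
        """Object-centric representation; here plain symbolic attributes."""
        color: str
        shape: str
        position: Tuple[float, float, float]  # (x, y, z) in the workspace


    def filter_objects(scene: List[Obj], color: str, shape: str) -> List[Obj]:
        """Visual grounding: select objects matching the queried concepts.
        NSRM would score objects with learned concept embeddings instead."""
        return [o for o in scene if o.color == color and o.shape == shape]


    def move_above(src: Obj, dst: Obj) -> None:
        """Action/spatial grounding: place `src` directly above `dst`.
        A learned action module would predict the target pose instead."""
        x, y, z = dst.position
        src.position = (x, y, z + 1.0)


    # Toy input scene
    scene = [
        Obj("white", "dice", (0.0, 0.0, 0.0)),
        Obj("yellow", "lego", (0.3, 0.1, 0.0)),
        Obj("yellow", "cube", (0.6, -0.2, 0.0)),
    ]

    # Program for: "Put the white dice above the yellow lego object
    #               and move the yellow cube on top of the white dice"
    white_dice = filter_objects(scene, "white", "dice")[0]
    yellow_lego = filter_objects(scene, "yellow", "lego")[0]
    move_above(white_dice, yellow_lego)   # step 1

    yellow_cube = filter_objects(scene, "yellow", "cube")[0]
    move_above(yellow_cube, white_dice)   # step 2

    for o in scene:
        print(o)

In the actual model, such a program is produced by the hierarchical instruction parser and executed step by step without any intermediate (sub-goal) supervision; the sketch above only illustrates the program structure.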

Video

Generalization

Richer scenes

The model is trained on simple scenes with 3-5 objects. It generalizes well to richer scenes with more objects, such as this scene with 15 objects.

Move the cyan cube on top of the red lego object

Multi-step plans

The model is trained on instructions with up to 2 steps. It generalizes well to longer multi-step plans, such as this example with a 5-step instruction.

Put the blue lego thing to the left of the red lego thing and place the red cube on the left side of the white lego object and move the magenta dice to the right of the green box and move the red box on the left side of the blue lego thing and put the blue lego object to the left of the white lego thing

References

[1] R. Paul, J. Arkin, N. Roy, and T. M. Howard, "Efficient grounding of abstract spatial concepts for natural language interaction with robot manipulators," in Robotics: Science and Systems (RSS), 2016.

[2] C. Paxton, Y. Bisk, J. Thomason, A. Byravan, and D. Fox, "Prospection: Interpretable plans from language by predicting the future," in 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 6942-6948.

[3] M. Shridhar, L. Manuelli, and D. Fox, "CLIPort: What and where pathways for robotic manipulation," in Conference on Robot Learning (CoRL), PMLR, 2022, pp. 894-906.

Citation

@inproceedings{Kalithasan2023NSRM,
  title     = {Learning Neuro-symbolic Programs for Language Guided Robot Manipulation},
  author    = {Kalithasan, Namasivayam and Singh, Himanshu and Bindal, Vishal and Tuli, Arnav and Agrawal, Vishwajeet and Jain, Rahul and Singla, Parag and Paul, Rohan},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2023}
}