Given a natural language instruction and a pair of input and output scenes, our goal is to train a neuro-symbolic model that outputs a manipulation program which, when executed by the robot on the input scene, results in the desired output scene.
Prior approaches to this task suffer from one of the following limitations: (i) they rely on hand-coded symbols for concepts, limiting generalization beyond those seen during training [1]; (ii) they infer action sequences from instructions but require dense sub-goal supervision [2]; or (iii) they lack the semantics required for the deeper object-centric reasoning inherent in interpreting complex instructions [3]. In contrast, our approach is neuro-symbolic: it handles both linguistic and perceptual variations, is end-to-end differentiable without requiring intermediate supervision, and uses symbolic reasoning constructs that operate on a latent neural object-centric representation, allowing deeper reasoning over the input scene.
Central to our approach is a modular structure consisting of a hierarchical instruction parser and a manipulation module that learns disentangled action representations, both trained via reinforcement learning (RL). Our experiments in a simulated environment with a 7-DOF manipulator, covering instructions with varying numbers of steps, scenes with varying numbers of objects, and objects with unseen attribute combinations, demonstrate that our model is robust to such variations and significantly outperforms existing baselines, particularly in generalization settings.
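To make the pipeline concrete, here is a minimal, runnable toy sketch of the parse-ground-execute loop described above. Everything in it is an illustrative assumption rather than the paper's implementation: the names (Obj, parse, ground, execute), the single-step template grammar, and the hand-coded relation offsets all stand in for learned neural modules in the actual model.

```python
# A toy sketch of the modular pipeline: an instruction parser producing a
# symbolic program, an attribute-based object grounder, and an executor
# that transforms the scene. Hypothetical names; not the paper's code.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Obj:
    color: str
    shape: str
    x: float
    y: float

# Offsets (dx, dy) for a few spatial relations, in scene units (assumed).
RELATIONS = {"left": (-1.0, 0.0), "right": (1.0, 0.0),
             "behind": (0.0, 1.0), "front": (0.0, -1.0)}

def parse(instruction: str):
    """Toy parser for instructions of the form
    'move the <color> <shape> to the <relation> of the <color> <shape>'.
    Returns a one-step symbolic program; the paper's hierarchical parser
    is learned and handles multi-step instructions."""
    w = instruction.lower().split()
    return [{"action": "move",
             "subject": (w[2], w[3]),
             "relation": w[6],
             "reference": (w[9], w[10])}]

def ground(scene, color, shape):
    """Pick the object matching the given attributes. The paper instead
    scores latent object embeddings against neural concept embeddings."""
    return next(o for o in scene if o.color == color and o.shape == shape)

def execute(scene, program):
    """Apply each program step in turn, yielding the predicted output scene."""
    for step in program:
        subj = ground(scene, *step["subject"])
        ref = ground(scene, *step["reference"])
        dx, dy = RELATIONS[step["relation"]]
        moved = replace(subj, x=ref.x + dx, y=ref.y + dy)
        scene = [moved if o is subj else o for o in scene]
    return scene

scene = [Obj("red", "cube", 0.0, 0.0), Obj("blue", "ball", 2.0, 1.0)]
out = execute(scene, parse("move the red cube to the left of the blue ball"))
print(out)  # red cube now at (1.0, 1.0), to the left of the blue ball
```

In the actual model, grounding operates on latent object representations and learned concept embeddings rather than string attributes, which is what enables generalization to objects with unseen attribute combinations.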
Richer scenes: The model is trained on simple scenes with 3-5 objects. It generalizes well to richer scenes with more objects, such as this scene with 15 objects.
Longer instructions: The model is trained on instructions with up to 2 steps. It generalizes well to multi-step plans, such as this example with a 5-step instruction.
[1] R. Paul, J. Arkin, N. Roy, and T. M. Howard, "Efficient grounding of abstract spatial concepts for natural language interaction with robot manipulators," in Robotics: Science and Systems (RSS), 2016.
[2] C. Paxton, Y. Bisk, J. Thomason, A. Byravan, and D. Fox, "Prospection: Interpretable plans from language by predicting the future," in 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 6942-6948.
[3] M. Shridhar, L. Manuelli, and D. Fox, "CLIPort: What and where pathways for robotic manipulation," in Conference on Robot Learning, PMLR, 2022, pp. 894-906.
@inproceedings{Kalithasan2023NSRM,
title = {Learning Neuro-symbolic Programs for Language Guided Robot Manipulation},
author = {Kalithasan, Namasivayam and Singh, Himanshu and Bindal, Vishal and Tuli, Arnav and Agrawal, Vishwajeet and Jain, Rahul and Singla, Parag and Paul, Rohan},
booktitle = {IEEE International Conference on Robotics and Automation},
year = {2023}
}