As robots grow increasingly capable of understanding and interacting with objects, it is important for non-expert users to be able to communicate with robots. One of the most sought after communication modalities is natural language, allowing users to verbally issue directives. We focus our work on natural language object retrieval—the task of finding and recovering an object specified by a human user using natural language.
Natural language object retrieval becomes far more difficult when the robot is forced to disambiguate between objects of the same class (such as between different teapots, rather than between a teapot and a bowl). The solution to this problem is to include descriptive language to specify not only object classes, but also object attributes such as shape, size, or more abstract features. We must also handle partially observed objects; it is unreasonable for a robot to observe objects in its environment from many angles with no self-occlusion. Furthermore, as human-annotated language data is expensive to gather, it is important to not require language annotated depth images from all possible viewpoints. A robust system must be able to generalize from a small human-labeled set of (depth) images to novel and unique object views.
We develop a system that addresses these challenges by grounding descriptions to objects via explicit reasoning over 3D structure. We combine Bayesian Eigenobjects (BEOs), a framework which enables estimation of a novel object’s full 3D geometry from a single partial view, with a language grounding model. This combined model uses BEOs to predict a low-dimensional object embedding into a pose-invariant learned object space and predicts language grounding from this low-dimensional space. This structure enables training of the model on a small amount of expensive human-generated language data. In addition, because BEOs generalize well to novel objects, our system scales well to new objects and partial views, an important capability for a robot operating in an unstructured human-centric environment.
We evaluate our system on a dataset of several thousand ShapeNet object instances across three object classes, paired with human-generated object descriptions obtained from Amazon Mechanical Turk. We show that not only is our system able to distinguish between objects of the same class, it can do so even when objects are only observed from partial views. In a second experiment, our language model is trained on depth images taken only from the front of objects and successfully predict attributes given test depth images taken from rear views. This view-invariance is a key property afforded by our use of an explicitly learned 3D representation—traditional fully supervised approaches are not capable of handling this scenario. We demonstrate our system on a Baxter robot, enabling the robot to successfully determine which object to pick up based on a Microsoft Kinect depth image of several candidate objects and a simple language description of the desired object. Our system is fast, with inference running under 100ms for a single language+depth-image query.