Natural language allows us to create abstractions that are as specific about a task as needed, without providing trivial details. For example, when we give directions we mention salient places that carry information, like traffic signals, street corners, or house colors, but we do not specify what gait a person must take or how a car must be driven to get to the destination. Humans contribute to a conversation by being as informative as needed, staying relevant, and providing ordered, concise, and understandable input. Humans satisfy these properties of the Gricean Cooperative Principle by creating abstractions that allow them to be as specific about a task as needed. These abstractions can take the form of a house color or a tree, without getting bogged down in details such as the number of leaves on the tree or the number of bricks used to build the house. These salient places are sub-goals that the person receiving the instructions needs to achieve to reach their goal. The person solving the task needs to understand these sub-goal conditions and plan to reach the sub-goals in the order in which they are specified.
In this work, instructions are provided in natural language, and the robot observes the world in continuous LIDAR data and performs continuous actions. This problem is difficult as the instructions have discrete abstractions like corridors and intersections, but the agent is observing and interacting with the world with continuous data. Learning these abstractions can require a lot of data if done using state-of-the-art deep learning methods. Moreover, it would not be possible to learn sub-goals that can be used for planning in other tasks or with other agents, as most deep methods would learn a direct mapping from language and current state to actions.
To solve this problem we connect the ideas of abstraction in language with the ideas of hierarchical abstraction in planning. We treat natural language as a sequence of abstract symbols. Further, we learn sub-goals that are observed consistently in solving tasks. First, we collected data by driving the robot around and asking users on Amazon Mechanical Turk (AMT) to provide instructions for the task the robot had just performed. We then identify skills within the robot trajectories using change point detection. These skills are similar to options, in that they have a policy and a termination set. We learn the termination set by training a classifier around the states observed by the robot at the end of a skill. These termination sets provide us with sub-goal termination conditions that a planner can achieve in sequence in order to satisfy an instruction. Last, we learn a model that translates natural language to the sequence of these termination conditions.
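The segmentation step can be illustrated with a minimal sketch. The post does not say which change point detection algorithm was used, so the code below swaps in plain binary segmentation on a one-dimensional signal (splitting wherever the mean shifts most) as an assumption. The signal values, the `min_gain` threshold, and all function names are hypothetical.

```python
from statistics import fmean


def sse(values):
    """Sum of squared deviations of a segment from its own mean."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(values, min_len=2):
    """Return (index, gain) for the split that most reduces squared error."""
    total = sse(values)
    best_index, best_gain = None, 0.0
    for i in range(min_len, len(values) - min_len + 1):
        gain = total - sse(values[:i]) - sse(values[i:])
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain


def change_points(values, min_gain=0.2, min_len=2, offset=0):
    """Binary segmentation: keep splitting while the error reduction is large."""
    index, gain = best_split(values, min_len)
    if index is None or gain < min_gain:
        return []
    left = change_points(values[:index], min_gain, min_len, offset)
    right = change_points(values[index:], min_gain, min_len, offset + index)
    return left + [offset + index] + right


# Hypothetical 1D control signal (for example, a steering command over time).
signal = [0.0, 0.1, 0.0, 0.1, 1.0, 1.1, 1.0, 0.9, 0.0, 0.1, 0.0, 0.1]
boundaries = change_points(signal)
print(boundaries)
```

On this toy signal the sketch reports boundaries at indices 4 and 8, which splits the trajectory into three candidate skills. The states just before each boundary are the kind of examples a termination classifier could be trained on.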
We demonstrate our method in two different real robot domains. One is a previously collected self-driving car dataset, and the other is a robot that was driven around the campus at Brown. We convert the long continuous state-action demonstrations of the robot into a handful of symbols that are important in completing the instructions. This learned abstraction is transferable and interpretable, and it allows learning of language and skills with a realistic number of demonstrations that a human can provide. For more details refer to our paper! Our code is here.
As robots grow increasingly capable of understanding and interacting with objects, it is important for non-expert users to be able to communicate with robots. One of the most sought after communication modalities is natural language, allowing users to verbally issue directives. We focus our work on natural language object retrieval—the task of finding and recovering an object specified by a human user using natural language.
Natural language object retrieval becomes far more difficult when the robot is forced to disambiguate between objects of the same class (such as between different teapots, rather than between a teapot and a bowl). The solution to this problem is to include descriptive language to specify not only object classes, but also object attributes such as shape, size, or more abstract features. We must also handle partially observed objects; it is unreasonable for a robot to observe objects in its environment from many angles with no self-occlusion. Furthermore, as human-annotated language data is expensive to gather, it is important to not require language annotated depth images from all possible viewpoints. A robust system must be able to generalize from a small human-labeled set of (depth) images to novel and unique object views.
We develop a system that addresses these challenges by grounding descriptions to objects via explicit reasoning over 3D structure. We combine Bayesian Eigenobjects (BEOs), a framework which enables estimation of a novel object’s full 3D geometry from a single partial view, with a language grounding model. This combined model uses BEOs to predict a low-dimensional object embedding into a pose-invariant learned object space and predicts language grounding from this low-dimensional space. This structure enables training of the model on a small amount of expensive human-generated language data. In addition, because BEOs generalize well to novel objects, our system scales well to new objects and partial views, an important capability for a robot operating in an unstructured human-centric environment.
We evaluate our system on a dataset of several thousand ShapeNet object instances across three object classes, paired with human-generated object descriptions obtained from Amazon Mechanical Turk. We show that not only is our system able to distinguish between objects of the same class, it can do so even when objects are only observed from partial views. In a second experiment, our language model is trained on depth images taken only from the front of objects and successfully predicts attributes given test depth images taken from rear views. This view-invariance is a key property afforded by our use of an explicitly learned 3D representation; traditional fully supervised approaches are not capable of handling this scenario. We demonstrate our system on a Baxter robot, enabling the robot to successfully determine which object to pick up based on a Microsoft Kinect depth image of several candidate objects and a simple language description of the desired object. Our system is fast, with inference running in under 100 ms for a single language+depth-image query.
Communicating human knowledge and intent to robots is essential for successful human-robot interaction (HRI). Failures in communication, and thus collaboration, occur when there is a mismatch between two agents’ mental states. Question-asking allows a robot to acquire information that targets its uncertainty, facilitating recovery from failure states. However, all question-asking modalities (natural language, eye gaze, gestures, visual interfaces) have tradeoffs, making the choice of which to use an important and context-dependent decision. For example, for robots with “real” eyes or pan/tilt screens, looking requires fewer joints to move less distance compared to pointing, decreasing the time the referential action takes. However, eye gaze, or “pointing with the eyes,” is inherently more difficult to interpret, since pointing gestures can reduce the distance between the referrer and the referenced item. Another approach to question-asking is to use visualizations to make the communication modality independent of the robot. While the performance of physical actions like eye gaze and pointing gestures relies on the physical robot, visualization methods like mixed reality (MR) enable the robot to communicate information by visually depicting its mental state in the real environment. Related work has investigated the effects of using MR visualizations for reducing mental workload in HRI, but there remains a gap in research on how it compares to physical actions like pointing and eye gaze for reducing robot uncertainty.
This work investigates how physical and visualization-based question-asking can be used for reducing robot uncertainty under varying levels of ambiguity. To do this, we first model our problem as a POMDP, termed the Physio-Virtual Deixis POMDP (PVD-POMDP), that observes a human’s speech, gestures, and eye gaze, and decides when to ask questions (to increase accuracy) and when to choose the item (to decrease interaction time). The PVD-POMDP enables the robot to ask questions either using physical modalities (like eye gaze and gesture) or using virtual modalities (like mixed reality). To evaluate our model, we conducted a between-subjects user study in which 80 participants interacted with a robot in an item-fetching task. Participants experienced one of three different conditions of our PVD-POMDP: a no feedback control condition, a physical feedback condition, or a mixed reality feedback condition. Our results show that our mixed reality model is able to successfully choose the correct item 93% of the time, with each interaction lasting about 5 seconds, outperforming the physical and no feedback models in both quantitative metrics and qualitative metrics (highest usability, task load, and trust scores).
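The tradeoff between asking and choosing can be sketched with a simple belief update. This is not the PVD-POMDP itself and does not solve a POMDP: it is a greedy threshold rule over a Bayesian belief, used here only to illustrate the idea. The item names, likelihood values, and the `confidence` threshold are all hypothetical.

```python
def update_belief(belief, likelihoods):
    """Bayes rule: posterior is proportional to prior times likelihood."""
    unnormalized = {item: belief[item] * likelihoods[item] for item in belief}
    total = sum(unnormalized.values())
    return {item: p / total for item, p in unnormalized.items()}


def choose_action(belief, confidence=0.8):
    """Pick the most likely item if confident enough, otherwise ask about it."""
    best_item = max(belief, key=belief.get)
    if belief[best_item] >= confidence:
        return ("pick", best_item)
    return ("ask", best_item)


# Uniform prior over three hypothetical items on a table.
belief = {"red cup": 1 / 3, "blue cup": 1 / 3, "bowl": 1 / 3}

# Assumed likelihood of hearing the word "cup" given each intended item.
belief = update_belief(belief, {"red cup": 0.45, "blue cup": 0.45, "bowl": 0.1})
first_action = choose_action(belief)

# Assumed likelihood of a "yes" answer after the robot asks about the red cup.
belief = update_belief(belief, {"red cup": 0.9, "blue cup": 0.05, "bowl": 0.05})
second_action = choose_action(belief)

print(first_action, second_action)
```

In this toy run the speech observation leaves the two cups equally likely, so the rule asks a question. The answer then concentrates the belief enough to pick an item. A full POMDP policy would additionally weigh the time cost of each question against the expected gain in accuracy.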
A long-term goal of robotics is designing robots intelligent enough to enter a person’s home and perform daily chores for them. This requires the robot to learn specific behaviors and semantic information that can only be acquired after entering the home and interacting with the humans living there. For example, there may be a trinket that the robot has never encountered before, and the owner might want to instruct the robot on how to handle the item (i.e., object manipulation information), as well as directly specify where the item should be kept (i.e., navigation information). To approach this problem, one must consider two sub-problems: a) the agent’s representation of object manipulation actions and semantic information about the environment, and b) the method with which an agent can learn this knowledge from a teacher.
Semantic maps provide a representation sufficient for navigating an environment, but map information alone is insufficient for enabling object manipulation. Conversely, there are knowledge bases that store requisite object manipulation information, but do not help with navigation or grasping in novel orientations. Previous studies have shown that Mixed Reality (MR) interfaces are effective for specifying navigation commands and programming egocentric robot behaviors. However, none of these works have demonstrated the use of MR interfaces for teaching high-level object manipulation actions, and semantic information of the environment.
Our contribution is a system that enables humans to teach robots both a) object manipulation actions, in a local object frame of reference, and b) semantic information about objects in a global map. We use a Mixed Reality Head Mounted Display (MR-HMD) to enable humans to teach a robot a plannable representation of their environment. By plannable, we mean structured representations that are searchable with AI planning tools. By teach, we mean having the human explicitly provide information necessary for instantiating our representation. Our representation, the Action-Oriented Semantic Map (AOSM), enables robots to perform complex object manipulation tasks that require navigation around an environment.
To test our system for building AOSMs, we created a test environment where a mobile manipulator was tasked with learning how to perform household chores with the help of a human trainer. Three novice humans used our MR interface to teach a robot an AOSM, allowing the robot to autonomously plan to navigate to a bottle, pick it up, and throw it out. In addition, two expert users taught a robot to autonomously plan to flip a light switch off and manipulate a sink faucet to the closed position, both in under 2 minutes.
As drones become increasingly autonomous, it is imperative that designers create intuitive and flexible ways for untrained users to interact with these systems. A natural language interface is immediately accessible to non-technical users and does not require the user to use a touchscreen or radio control (RC). Current natural language interfaces require a predefined model of the environment including landmarks, which is difficult for a drone to obtain. For example, given the instruction “Fly around the wall to the chair and take a picture,” the drone must already have a model of the wall and the chair to infer a policy.
We address these interaction problems by using natural language within a mixed reality headset (MR) to provide an intuitive high-level interface for controlling a drone using goal-based planning. By using MR, a user can annotate landmarks with natural language in the drone’s frame of reference. The process starts with the MR interface displaying the virtual environment of a room that can be adjusted to overlay on physical reality with gestures. Afterward, the user can annotate landmarks with colored boxes using the MR interface. This process allows the user to specify landmarks that the drone does not have the capability to detect on its own. Then, when the user sends a command, we translate this natural language text to a reward function.
Further, we conducted an exploratory user study to compare the low-level 2D interface with two MR interfaces. Overall, we found that our MR system offers a user-friendly approach to controlling a drone with both low-level and high-level instructions. The language interface does not require direct, continuous commands from the user, and the MR system does not require the user to spend as much time controlling the drone as the 2D interface does.
A key bottleneck in widespread deployment of robots in human-centric environments is the ability for non-expert users to communicate with robots. Natural language is one of the most popular communication modalities due to the familiarity and comfort it affords a majority of users. However, training a robot to understand open-ended natural language commands is challenging since humans will inevitably produce words that were never seen in the robot’s training data. These unknown words can come from paraphrasing such as using “saucer” instead of “plate,” or from novel object classes in the robot’s environments, for example a kitchen with a “rolling pin” when the robot has never seen a rolling pin before. We aim to develop a model that can handle open-ended commands with unknown words and object classes. As a first step in solving this challenging problem, we focus on the natural language object retrieval task — selecting the correct object based on an indirect natural language command with constraints on the functionality of the object.
More specifically, our work focuses on fulfilling commands requesting an object for a task specified by a verb such as “Hand me a box cutter to cut.” Being able to handle these types of commands is highly useful for a robot agent in human-centric environments, as people usually ask for an object with a specific usage in mind. The robot would be able to correctly fetch the desired object for the given task, such as cut, without needing to have seen the object, a box cutter for example, or the word representing the object, such as the noun “box cutter.” In addition, the robot has the freedom to substitute objects as long as the selected object satisfies the specified task. This is particularly useful in cases where the robot cannot locate the specific object the human asked for but found another object that can satisfy the given task, such as a knife instead of a box cutter to cut.
Our work demonstrates that an object’s appearance provides sufficient signals to predict whether the object is suitable for a specific task, without needing to explicitly classify the object class and visual attributes. Our model takes in RGB images of objects and a natural language command containing a verb, generates embeddings of the input language command and images, and selects the image most similar to the given command in embedding space. The selected image should represent the object that best satisfies the task specified by the verb in the command. We train our model on natural language command-RGB image pairs. The evaluation task for the model is to retrieve the correct object from a set of five images, given a natural language command. We use ILSVRC2012 images and language commands generated from verb-object pairs extracted from Wikipedia for training and evaluation of our model. Our model achieves an average retrieval accuracy of 62.3% on a held-out test set of unseen ILSVRC2012 object classes and 53.0% on unseen object classes and unknown nouns. Our model also achieves an average accuracy of 54.7% on unseen YCB object classes. We also demonstrate our model on a KUKA LBR iiwa robot arm, enabling the robot to retrieve objects based on natural language commands, and present a new dataset of 655 verb-object pairs denoting object usage over 50 verbs and 216 object classes.
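The retrieval step, selecting the image closest to the command in embedding space, can be sketched as follows. The post does not state which similarity measure the model uses, so cosine similarity is an assumption here. The embeddings are made-up three-dimensional vectors standing in for the learned ones, and the object names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(command_embedding, image_embeddings):
    """Return the name of the image whose embedding best matches the command."""
    return max(
        image_embeddings,
        key=lambda name: cosine_similarity(command_embedding, image_embeddings[name]),
    )


# Hypothetical embedding of a command such as "Hand me something to cut."
command = [0.9, 0.1, 0.0]

# Hypothetical embeddings for a candidate set of five object images.
candidates = {
    "knife": [0.8, 0.2, 0.1],
    "mug": [0.1, 0.9, 0.2],
    "pillow": [0.0, 0.2, 0.9],
    "spoon": [0.4, 0.6, 0.1],
    "ball": [0.2, 0.1, 0.8],
}

best = retrieve(command, candidates)
print(best)
```

Because the selection depends only on distances in the shared space, an object class or noun never seen in training can still be retrieved as long as its image embeds near the command.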
With my wonderful collaborators Cynthia Matuszek, Hadas Kress-Gazit and Nakul Gopalan, we published a review paper about language and robots. This article surveys the use of natural language in robotics from a robotics point of view. To use human language, robots must map words to aspects of the physical world, mediated by the robot’s sensors and actuators. This problem differs from other natural language processing domains due to the need to ground the language to noisy percepts and physical actions. We describe central aspects of language use by robots, including understanding natural language requests, using language to drive learning about the physical world, and engaging in collaborative dialogue with a human partner. We describe common approaches, roughly divided into learning methods, logic-based methods, and methods that focus on questions of human–robot interaction. Finally, we describe several application domains for language-using robots.
It was a wonderful opportunity to reflect on the past 10 years since my Ph.D. thesis in 2010 and survey the amazing progress we have seen in robots and language.
When a person visits a strange city, they can follow directions by finding landmarks in a map. For example, they can follow instructions like “Go to the Wegmans supermarket” by finding the supermarket in their map of the city.
Autonomous systems on robots need the ability to understand natural language instructions referring to landmarks as well. Crucially, the robot’s model for understanding the user’s instructions should be flexible to new environments. If the robot has never seen “the Wegmans supermarket” before but it is provided with a map of the environment that marks these locations, it should be able to follow these sorts of instructions. Unfortunately, existing approaches need not just a map of the environment, but also a corpus of natural language commands collected in that environment.
Our recent paper accepted to ICRA 2020, Grounding Language to Landmarks in Arbitrary Outdoor Environments, addresses this problem by enabling robots to use maps to interpret natural language commands. Our approach uses a novel domain-flexible language and planning model that enables untrained users to give landmark-based natural language instructions to a robot. We use OpenStreetMap (OSM) to identify landmarks in the robot’s environment, and leverage semantic information encoded in OSM’s representation of landmarks (such as type of business and/or street address) to assess similarity between the user’s natural language expression (for example, “medicine store”) and landmarks in the robot’s environment (CVS, medical research lab). You can download the collected natural language data on the h2r GitHub repository: https://github.com/h2r/Language-to-Landmarks-Data
We tested our model with a Skydio R1 drone in simulation, then outdoors. Here’s the video!
Writing is a form of human linguistic expression, and Atsu Kotani has investigated this by creating a system that enables a robot to write. Atsu’s work introduces an approach which enables manipulator robots to write handwritten characters or line drawings. Given an image of just-drawn handwritten characters, the robot infers a plan to replicate the image with a writing utensil, and then reproduces the image. Our approach draws each target stroke in one continuous drawing motion and does not rely on handcrafted rules or on predefined paths of characters. Instead, it learns to write from a dataset of demonstrations. We evaluate our approach in both simulation and on two real robots. Our model can draw handwritten characters in a variety of languages which are disjoint from the training set, such as Greek, Tamil, or Hindi, and also reproduce any stroke-based drawing from an image of the drawing. This work was featured in Wired, and you can learn more from the paper, which was published at ICRA last summer.
We created a hardware and software framework for an autonomous aerial robot, in which all software for autonomy can run onboard the drone, implemented in Python. We present an Unscented Kalman Filter (UKF) for accurate state estimation. Next, we present an implementation of Monte Carlo (MC) Localization and FastSLAM for Simultaneous Localization and Mapping (SLAM). The performance of UKF, localization, and SLAM is tested and compared to ground truth, provided by a motion-capture system. Our autonomous educational framework runs quickly and accurately on a Raspberry Pi in Python. The base station machine only needs to have a web browser and does not need to have ROS installed, making it easy to use in school settings.
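To give a feel for Monte Carlo localization, here is a toy particle filter for a robot moving along a line. It is a simplified sketch under stated assumptions and not the framework's implementation: the world is one-dimensional, the sensor directly measures position with Gaussian noise, and the function name, noise parameters, and measurement values are all hypothetical.

```python
import math
import random


def monte_carlo_localization(measurements, step, num_particles=500,
                             motion_sigma=0.1, sensor_sigma=0.5, seed=0):
    """Estimate a 1D position from noisy position measurements."""
    rng = random.Random(seed)
    # Start with no knowledge: particles spread uniformly over the corridor.
    particles = [rng.uniform(0.0, 10.0) for _ in range(num_particles)]
    for z in measurements:
        # Motion update: move every particle by the commanded step plus noise.
        particles = [p + step + rng.gauss(0.0, motion_sigma) for p in particles]
        # Measurement update: weight particles by the Gaussian likelihood of z.
        weights = [math.exp(-0.5 * ((z - p) / sensor_sigma) ** 2)
                   for p in particles]
        # Resampling: draw a new particle set in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=num_particles)
    # The mean of the particle set is the position estimate.
    return sum(particles) / num_particles


# Hypothetical noisy readings for a robot that starts at 2.0 and moves 1.0
# per step, so the true position after five steps is 7.0.
readings = [3.0, 4.1, 4.9, 6.0, 7.1]
estimate = monte_carlo_localization(readings, step=1.0)
print(round(estimate, 2))
```

The estimate should land close to 7. The same predict, weight, and resample loop carries over to higher-dimensional poses, where the weighting step would compare camera or range observations against a map instead of a direct position reading.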
In partnership with the Duckietown Foundation, we are releasing our course materials for anyone to use. The Operations Manual contains a description of the drone and how to use it. The Learning Materials book contains all our projects and assignments. This work is being presented at IROS this week and has been nominated for a best paper award! Read the full paper here.