A key bottleneck in widespread deployment of robots in human-centric environments is the ability for non-expert users to communicate with robots. Natural language is one of the most popular communication modalities due to the familiarity and comfort it affords a majority of users. However, training a robot to understand open-ended natural language commands is challenging since humans will inevitably produce words that were never seen in the robot’s training data. These unknown words can come from paraphrasing such as using “saucer” instead of “plate,” or from novel object classes in the robot’s environments, for example a kitchen with a “rolling pin” when the robot has never seen a rolling pin before. We aim to develop a model that can handle open-ended commands with unknown words and object classes. As a first step in solving this challenging problem, we focus on the natural language object retrieval task — selecting the correct object based on an indirect natural language command with constraints on the functionality of the object.
More specifically, our work focuses on fulfilling commands requesting an object for a task specified by a verb such as “Hand me a box cutter to cut.” Being able to handle these types of commands is highly useful for a robot agent in human-centric environments, as people usually ask for an object with a specific usage in mind. The robot would be able to correctly fetch the desired object for the given task, such as cut, without needing to have seen the object, a box cutter for example, or the word representing the object, such as the noun “box cutter.” In addition, the robot has the freedom to substitute objects as long as the selected object satisfies the specified task. This is particularly useful in cases where the robot cannot locate the specific object the human asked for but finds another object that can satisfy the given task, such as a knife instead of a box cutter to cut.
Our work demonstrates that an object’s appearance provides sufficient signals to predict whether the object is suitable for a specific task, without needing to explicitly classify the object class and visual attributes. Our model takes in RGB images of objects and a natural language command containing a verb, generates embeddings of the input language command and images, and selects the image most similar to the given command in embedding space. The selected image should represent the object that best satisfies the task specified by the verb in the command. We train our model on natural language command-RGB image pairs. The evaluation task for the model is to retrieve the correct object from a set of five images, given a natural language command. We use ILSVRC2012 images and language commands generated from verb-object pairs extracted from Wikipedia for training and evaluation of our model. Our model achieves an average retrieval accuracy of 62.3% on a held-out test set of unseen ILSVRC2012 object classes and 53.0% on unseen object classes and unknown nouns. Our model also achieves an average accuracy of 54.7% on unseen YCB object classes. We also demonstrate our model on a KUKA LBR iiwa robot arm, enabling the robot to retrieve objects based on natural language commands, and present a new dataset of 655 verb-object pairs denoting object usage over 50 verbs and 216 object classes.
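The retrieval step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the embeddings are hand-written toy vectors standing in for the outputs of the learned language and image encoders, cosine similarity is an assumed choice of similarity measure, and the function names `cosine_similarity` and `retrieve` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(command_embedding: Sequence[float],
             image_embeddings: List[Sequence[float]]) -> int:
    """Return the index of the image whose embedding best matches the command."""
    scores = [cosine_similarity(command_embedding, img) for img in image_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy stand-ins for learned embeddings: one command and five candidate images.
command = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],
    [0.2, 0.9, 0.1],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]
best = retrieve(command, candidates)  # index of the selected image
```

Because the selection depends only on distances in the shared embedding space, nothing in this step requires an explicit object label, which is what lets the approach extend to unseen object classes.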
With my wonderful collaborators Cynthia Matuszek, Hadas Kress-Gazit and Nakul Gopalan, we published a review paper about language and robots. This article surveys the use of natural language in robotics from a robotics point of view. To use human language, robots must map words to aspects of the physical world, mediated by the robot’s sensors and actuators. This problem differs from other natural language processing domains due to the need to ground the language to noisy percepts and physical actions. We describe central aspects of language use by robots, including understanding natural language requests, using language to drive learning about the physical world, and engaging in collaborative dialogue with a human partner. We describe common approaches, roughly divided into learning methods, logic-based methods, and methods that focus on questions of human–robot interaction. Finally, we describe several application domains for language-using robots.
It was a wonderful opportunity to reflect on the past 10 years since my Ph.D. thesis in 2010 and survey the amazing progress we have seen in robots and language.
When a person visits a strange city, they can follow directions by finding landmarks in a map. For example, they can follow instructions like “Go to the Wegmans supermarket” by finding the supermarket in their map of the city.
Autonomous systems on robots need the ability to understand natural language instructions referring to landmarks as well. Crucially, the robot’s model for understanding the user’s instructions should be flexible to new environments. If the robot has never seen “the Wegmans supermarket” before but it is provided with a map of the environment that marks these locations, it should be able to follow these sorts of instructions. Unfortunately, existing approaches need not just a map of the environment, but also a corpus of natural language commands collected in that environment.
Our recent paper accepted to ICRA 2020, Grounding Language to Landmarks in Arbitrary Outdoor Environments, addresses this problem by enabling robots to use maps to interpret natural language commands. Our approach uses a novel domain-flexible language and planning model that enables untrained users to give landmark-based natural language instructions to a robot. We use OpenStreetMap (OSM) to identify landmarks in the robot’s environment, and leverage semantic information encoded in OSM’s representation of landmarks (such as type of business and/or street address) to assess similarity between the user’s natural language expression (for example, “medicine store”) and landmarks in the robot’s environment (CVS, medical research lab). You can download the collected natural language data on the h2r GitHub repository: https://github.com/h2r/Language-to-Landmarks-Data
We tested our model with a Skydio R1 drone in simulation, then outdoors. Here’s the video!
Writing is a form of human linguistic expression, and Atsu Kotani has investigated this by creating a system to enable a robot to write. Atsu’s work introduces an approach which enables manipulator robots to write handwritten characters or line drawings. Given an image of just-drawn handwritten characters, the robot infers a plan to replicate the image with a writing utensil, and then reproduces the image. Our approach draws each target stroke in one continuous drawing motion and does not rely on handcrafted rules or on predefined paths of characters. Instead, it learns to write from a dataset of demonstrations. We evaluate our approach in both simulation and on two real robots. Our model can draw handwritten characters in a variety of languages which are disjoint from the training set, such as Greek, Tamil, or Hindi, and also reproduce any stroke-based drawing from an image of the drawing. This work was featured in Wired, and you can learn more from the paper, which was published at ICRA last summer.
We created a hardware and software framework for an autonomous aerial robot, in which all software for autonomy can run onboard the drone, implemented in Python. We present an Unscented Kalman Filter (UKF) for accurate state estimation. Next, we present an implementation of Monte Carlo (MC) Localization and FastSLAM for Simultaneous Localization and Mapping (SLAM). The performance of UKF, localization, and SLAM is tested and compared to ground truth, provided by a motion-capture system. Our autonomous educational framework runs quickly and accurately on a Raspberry Pi in Python. The base station machine only needs to have a web browser and does not need to have ROS installed, making it easy to use in school settings.
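To give a flavor of what Monte Carlo Localization does, here is a minimal one-dimensional particle filter in plain Python. It is a toy sketch, not the code that runs on the drone: the robot moves along a line, a hypothetical sensor returns a noisy reading of its position, and the function name `mcl_step` and all noise values are assumptions chosen for illustration.

```python
import math
import random
from typing import List


def mcl_step(particles: List[float], control: float, measurement: float,
             motion_noise: float, sensor_noise: float,
             rng: random.Random) -> List[float]:
    """One predict, weight, and resample cycle of a 1D particle filter."""
    # Predict: move every particle by the control plus sampled motion noise.
    moved = [p + control + rng.gauss(0.0, motion_noise) for p in particles]
    # Weight: Gaussian likelihood of the position measurement for each particle.
    weights = [math.exp(-0.5 * ((measurement - p) / sensor_noise) ** 2)
               for p in moved]
    # Resample: draw particles with probability proportional to their weights.
    return rng.choices(moved, weights=weights, k=len(moved))


rng = random.Random(0)  # fixed seed so the run is repeatable
true_position = 2.0
# Start with no idea where the robot is: particles spread over the whole line.
particles = [rng.uniform(0.0, 10.0) for _ in range(500)]

for _ in range(10):
    true_position += 0.5                              # robot moves forward
    measurement = true_position + rng.gauss(0.0, 0.2)  # noisy position reading
    particles = mcl_step(particles, 0.5, measurement, 0.1, 0.2, rng)

estimate = sum(particles) / len(particles)  # mean of the particle cloud
```

The real system works in a higher-dimensional state space with camera-based measurements, but the same predict, weight, and resample loop is the core of the method.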
In partnership with the Duckietown Foundation, we are releasing our course materials for anyone to use. The Operations Manual contains a description of the drone and how to use it. The Learning Materials book contains all our projects and assignments. This work is being presented at IROS this week and has been nominated for a best paper award! Read the full paper here.
Natural language instructions issued to robots convey intent with multiple layers of spatial abstraction. These layers include high-level goals (“fly to the end of the street”), low-level specifications (“go ten meters south then ten meters east”), or a hybrid (“fly to the lamppost on Hex Street”). In addition, language can express constraints on reaching a goal, such as “go to the lamppost while avoiding the fire hydrant.” Current robot autonomy systems struggle to understand instructions above the lowest layer of spatial abstraction.
Our recent paper published at RSS 2019, titled Planning with State Abstractions for Non-Markovian Task Specifications, approaches this problem with Abstract Product Markov Decision Process (AP-MDP), a novel framework which enables the robot to understand higher-level spatial abstractions in natural language. This framework leverages a deep-learned model to translate natural language into Linear Temporal Logic (LTL), a symbolic form the robot can understand. From LTL, we create motion plans for the robot aligning with goals and constraints spanning different layers of abstraction.
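One way to see why a product construction helps is to pair each environment state with the state of a small monitor that tracks progress on the instruction. The sketch below is a generic illustration and not the AP-MDP implementation: it leaves out state abstraction and the learned translation model entirely, it reads “go to the lamppost while avoiding the fire hydrant” as “avoid the hydrant until the lamppost is reached,” and every name in it (`monitor_step`, `product_step`, the toy map) is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def monitor_step(monitor_state: str, labels: Set[str]) -> str:
    """Advance a three-state monitor for 'avoid fire_hydrant until lamppost'."""
    if monitor_state in ("accepted", "violated"):
        return monitor_state  # both outcomes are absorbing
    if "fire_hydrant" in labels:
        return "violated"
    if "lamppost" in labels:
        return "accepted"
    return "searching"


def product_step(state: Tuple[str, str], action: str,
                 transition: Dict[Tuple[str, str], str],
                 labeling: Dict[str, Set[str]]) -> Tuple[str, str]:
    """Step the environment and the monitor together."""
    env_state, monitor_state = state
    next_env = transition[(env_state, action)]
    return next_env, monitor_step(monitor_state, labeling[next_env])


# Toy deterministic map: two routes from the start to the lamppost.
transition = {
    ("start", "east"): "hydrant_cell",
    ("start", "north"): "sidewalk",
    ("hydrant_cell", "east"): "lamp_cell",
    ("sidewalk", "east"): "lamp_cell",
}
labeling = {
    "start": set(),
    "hydrant_cell": {"fire_hydrant"},
    "sidewalk": set(),
    "lamp_cell": {"lamppost"},
}


def run(actions: List[str]) -> Tuple[str, str]:
    """Execute a sequence of actions from the start state."""
    state = ("start", "searching")
    for action in actions:
        state = product_step(state, action, transition, labeling)
    return state


direct = run(["east", "east"])   # passes the hydrant on the way
detour = run(["north", "east"])  # goes around it
```

Both routes end in the same map cell, but the monitor component differs, so a planner working over the paired state can tell a path that respected the constraint from one that did not.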
While AP-MDPs were initially tested with a small indoor robot, we realized this framework generalizes to more complex robots and environments. We connected AP-MDPs to a Skydio R1 drone and gave natural language instructions to fly the drone around Brown’s campus! The success of this flight has led to exciting new verticals of research for the lab.
The human mind does not see a visual scene as pixels; rather, it is able to pick out objects of interest and selectively reason about them. Although object-based reasoning represents a core feature of human reasoning, only recently has it been possible to study it in robotics, thanks to advances in computer vision for segmentation and classification of objects from camera images. Our recent paper published at ICRA 2019, titled Multi-Object Search in Object-Oriented POMDPs, takes a stab at the problem by solving a novel multi-object search (MOS) task: without knowing in advance where the objects are located, a robot must find them in an indoor, roomed environment. We formulate the MOS task within a new framework called an object-oriented partially observable Markov decision process (OO-POMDP). An OO-POMDP represents the state and observation spaces in terms of classes and objects. The structure afforded by OO-POMDPs supports reasoning about each object independently while also providing a means for grounding language commands from a human at task onset. A human, for example, may issue an initial command such as “Find the mugs in the kitchen and books in the library,” from which a robot can associate a location with each object class so as to improve its search. We show that OO-POMCP, our inference algorithm for OO-POMDPs, combined with grounded language commands is sufficient for solving challenging MOS tasks both in simulation and on a physical mobile robot, with applications to rescue and home robots.
Interestingly, the OO-POMDP allows the robot to recover from lies or mistakes in what a human tells it. If the person tells the robot “Find the mug in the kitchen,” the robot will first look in the kitchen for the mug. But after failing to find it there, it will then systematically search the rest of the environment. Our OO-POMCP inference algorithm allows the robot to quickly and efficiently use all the information it has to find the object.
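The recovery behavior follows from treating the human's statement as a prior rather than a fact. Here is a small sketch of that idea for a single object, using a plain discrete Bayes update over rooms. It is not OO-POMCP itself, which is a planning algorithm; the function names, the hint strength of 0.8, and the detection rate of 0.9 are all assumptions made for illustration.

```python
from typing import Dict, List


def language_prior(rooms: List[str], hinted_room: str,
                   hint_strength: float = 0.8) -> Dict[str, float]:
    """Put most of the belief on the room the human named, the rest elsewhere."""
    others = [room for room in rooms if room != hinted_room]
    prior = {room: (1.0 - hint_strength) / len(others) for room in others}
    prior[hinted_room] = hint_strength
    return prior


def update_after_miss(belief: Dict[str, float], searched_room: str,
                      detection_rate: float = 0.9) -> Dict[str, float]:
    """Bayes update after searching a room and not detecting the object."""
    # A miss is unlikely if the object is in the searched room,
    # and certain if the object is somewhere else.
    unnormalized = {
        room: p * ((1.0 - detection_rate) if room == searched_room else 1.0)
        for room, p in belief.items()
    }
    total = sum(unnormalized.values())
    return {room: p / total for room, p in unnormalized.items()}


rooms = ["kitchen", "library", "hallway"]
belief = language_prior(rooms, "kitchen")      # "Find the mug in the kitchen"
belief = update_after_miss(belief, "kitchen")  # searched the kitchen, no mug
next_room = max(belief, key=belief.get)        # most likely room to search next
```

After one failed search the kitchen is no longer the most probable location, so the search naturally moves on. In the object-oriented setting, one such belief is kept per object, which is what allows each object to be reasoned about independently.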
Kevin Stacey and his team released a great article about progress in our drone project. I especially like the video! Thank you for a fantastic summer that included improved PID control, a working 3D Unscented Kalman Filter, and on-board, off-line SLAM, all on our little Raspberry Pi drone!
We also received the fantastic news that our IROS paper was accepted! Stay tuned for the presentation at IROS 2018. This paper describes the initial platform that we used for the course last fall. We are now on Version 3 of the hardware and software stack, and we can’t wait to see what else is in store!
Many times, natural language commands issued to robots not only specify a particular target configuration or goal state but also outline constraints on how the robot goes about its execution. That is, the path taken to achieving some goal state is given equal importance to the goal state itself. One example of this could be instructing a wheeled robot to “go to the living room but avoid the kitchen,” in order to avoid scuffing the floor. This class of behaviors poses a serious obstacle to existing language understanding for robotics approaches that map to either action sequences or goal state representations. Due to the non-Markovian nature of the objective, approaches in the former category must map to potentially unbounded action sequences whereas approaches in the latter category would require folding the entirety of a robot’s trajectory into a (traditionally Markovian) state representation, resulting in an intractable decision-making problem. To resolve this challenge, we use a recently introduced probabilistic variant of Linear Temporal Logic (LTL) as a goal specification language for a Markov Decision Process (MDP). While demonstrating that standard neural sequence-to-sequence learning models can successfully ground language to this semantic representation, we also provide analysis that highlights generalization to novel, unseen logical forms as an open problem for this class of model. We evaluate our system within two simulated robot domains as well as on a physical robot, demonstrating accurate language grounding alongside a significant expansion in the space of interpretable robot behaviors.
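To make the non-Markovian point concrete, the sketch below checks the living room example against two finite paths. It uses standard LTL operators ("eventually" and "always") evaluated over a finite trace, which is a simplification: the paper uses a probabilistic variant of LTL that is not reproduced here, and the function names and room labels are hypothetical.

```python
from typing import Callable, List, Set

Trace = List[Set[str]]  # one set of true propositions per time step


def eventually(trace: Trace, prop: str) -> bool:
    """True if the proposition holds at some step of the trace."""
    return any(prop in labels for labels in trace)


def always(trace: Trace, condition: Callable[[Set[str]], bool]) -> bool:
    """True if the condition holds at every step of the trace."""
    return all(condition(labels) for labels in trace)


def reach_while_avoiding(trace: Trace, goal: str, avoid: str) -> bool:
    """Check 'eventually goal, and always not avoid' on a finite trace."""
    return eventually(trace, goal) and always(
        trace, lambda labels: avoid not in labels)


# Two paths that end in the same place.
through_kitchen = [{"hallway"}, {"kitchen"}, {"living_room"}]
around_kitchen = [{"hallway"}, {"dining_room"}, {"living_room"}]

ok_through = reach_while_avoiding(through_kitchen, "living_room", "kitchen")
ok_around = reach_while_avoiding(around_kitchen, "living_room", "kitchen")
```

Both paths reach the living room, yet only the second satisfies the command, so the final state alone cannot determine success.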
Our paper was recently published at RSS 2018 and you can read it here!
Using light field methods, we can use Baxter’s monocular camera to localize the metal 0.24′ nut and corresponding bolt. The localization is precise enough to allow the robot to use the estimated poses to then perform an open-loop pick, place, and screw to put the nut on the bolt. The precise pose estimation enables complex routines to be quickly encoded, because once the robot knows where the parts are, it can perform accurate grasps and placement actions. You can read more about how it works in our RSS 2017 paper.