
Learning to Follow Directions With Less Supervision

A person should be able to give a complex natural language command to a robot drone or self-driving car in a city-scale environment and have it be understood, such as “walk along Third Street until the intersection with Main Street, then walk until you reach the Charles River.” Existing approaches represent commands like these as expressions in Linear Temporal Logic (LTL), which can represent constraints such as “eventually X” and “avoid X”. This representation is powerful, but using it requires a large dataset of language paired with LTL expressions for training the model. This paper presents the first framework for learning to map from English to LTL without requiring any LTL annotations at training time. We learn a semantic parsing model that does not require paired data of language and LTL logical forms, but instead learns from trajectories as a proxy. To collect trajectories on a large scale over a range

We release this data as well as the data collection procedure to simulate paths in large environments. We see the benefits of using a more expressive language such as LTL in instructions that require temporal ordering, and also see that the path taken with our approach more closely follows constraints specified in natural language. This dataset consists of 10 different environments, with up to 2,458 samples in each environment, giving us a total of 18,060 samples. To the best of our knowledge this is the largest dataset of temporal commands in existence.
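To make the LTL constraints mentioned above concrete, here is a minimal sketch of how “eventually X”, “avoid X”, and “X until Y” can be checked against a finite trajectory. It is an illustration only, not the model or code from the paper: the trajectory encoding (one set of true propositions per step), the function names, and the proposition names are all assumptions made for this example.

```python
from typing import List, Set

# Assumed encoding: a trajectory is a list of steps, and each step is the
# set of propositions that hold at that step (e.g. which street we are on).
Trajectory = List[Set[str]]


def eventually(prop: str, traj: Trajectory) -> bool:
    """'Eventually prop': prop holds at some step of the trajectory."""
    return any(prop in step for step in traj)


def avoid(prop: str, traj: Trajectory) -> bool:
    """'Avoid prop': prop holds at no step of the trajectory."""
    return all(prop not in step for step in traj)


def until(left: str, right: str, traj: Trajectory) -> bool:
    """'left until right': right holds at some step, and left holds at
    every step before that one."""
    for step in traj:
        if right in step:
            return True
        if left not in step:
            return False
    return False


def eventually_then(first: str, second: str, traj: Trajectory) -> bool:
    """Temporal ordering: first is reached, and second is reached at that
    step or a later one."""
    for i, step in enumerate(traj):
        if first in step:
            return eventually(second, traj[i:])
    return False


# Hypothetical path for the command quoted at the top of this post.
path = [
    {"third_street"},
    {"third_street"},
    {"third_street", "main_street"},
    {"charles_river"},
]

print(until("third_street", "main_street", path))            # True
print(eventually_then("main_street", "charles_river", path))  # True
print(avoid("highway", path))                                 # True
```

A trajectory that reached the river before the intersection would fail the ordering check, which is the kind of temporal constraint a plain goal location cannot express.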

You can read the paper here!

Learning Collaborative Pushing and Grasping Policies

Imagine a robot trying to clean up a messy dinner table. Two main manipulation skills are required: grasping that enables the robot to pick up objects, and planar pushing that allows the robot to isolate objects in the dense clutter to find a good grasp pose. It is necessary to identify grasps within the full 6D space because top-down grasping is insufficient for objects with diverse shapes, e.g. a plate or a filled cup. Pushing operations are also essential because in real-world scenarios, the robot’s workspace can contain many objects and a collision-free direct grasp may not exist. Pushing operations can singulate objects in clutter, enabling future grasping of these isolated objects. We explore learning joint planar pushing and 6-degree-of-freedom (6-DoF) grasping policies in a cluttered environment.

In a Q-learning framework, we jointly train two separate neural networks with reinforcement learning to maximize a reward function. The reward function rewards only successful grasps; we do not directly reward pushing actions, because such intermediate rewards often lead to undesired behavior. We tackle the problem of a limited top-down grasping action space by integrating a 6-DoF grasping pose sampler, rather than using dense pixel-wise sampling from visual inputs and considering only hard-coded top-down grasping candidates.
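The sketch below illustrates why a grasp-only reward can still teach useful pushes in a Q-learning setup. It is not the training code from the paper: the function names, the discount factor, the greedy choice between the two networks' best candidates, and the toy Q-values are all assumptions for illustration.

```python
from typing import Dict, Tuple


def grasp_only_reward(action_type: str, grasp_succeeded: bool) -> float:
    """Sparse reward: 1 for a successful grasp, 0 for everything else,
    including every push."""
    return 1.0 if action_type == "grasp" and grasp_succeeded else 0.0


def select_action(q_push: Dict[str, float],
                  q_grasp: Dict[str, float]) -> Tuple[str, str]:
    """Assumed selection rule: compare the best push candidate with the
    best grasp candidate and execute whichever has the higher Q-value."""
    best_push = max(q_push, key=q_push.get)
    best_grasp = max(q_grasp, key=q_grasp.get)
    if q_grasp[best_grasp] >= q_push[best_push]:
        return "grasp", best_grasp
    return "push", best_push


def td_target(reward: float, next_max_q: float,
              gamma: float = 0.5, terminal: bool = False) -> float:
    """Standard one-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    return reward if terminal else reward + gamma * next_max_q


# Toy numbers: in clutter, no grasp candidate looks promising yet.
q_push = {"push_left": 0.4, "push_right": 0.6}
q_grasp = {"grasp_a": 0.3}
print(select_action(q_push, q_grasp))  # ('push', 'push_right')

# The push itself earns no reward, but if it leads to a state where a
# grasp has a high Q-value, its target is still positive.
reward = grasp_only_reward("push", grasp_succeeded=False)
print(td_target(reward, next_max_q=0.8))  # 0.4
```

Under these assumptions, the value of a push comes entirely from the grasps it makes possible, which matches the motivation for not rewarding pushes directly.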

We evaluate our approach by task completion rate, action efficiency, and grasp accuracy in simulation and demonstrate performance on a real robot implementation. Our system shows 10% higher action efficiency and 20% higher grasp success rate than VPG, the current state-of-the-art, indicating significantly better performance in terms of both prediction accuracy and quality of grasp pose selection.

This work was published at ICRA 2021. The code is here, and you can see a video here!

Understanding Spatial Language to Find Objects Faster

Humans use spatial language to describe object locations and their relations. Consider the following scenario. A tourist is looking for an ice cream truck in an amusement park. She asks a passer-by and gets the reply “the ice cream truck is behind the ticket booth.” The tourist looks at the amusement park map and locates the ticket booth. Then, she is able to infer a region corresponding to that statement and go there to search for the ice cream truck, even though the spatial preposition “behind” is inherently ambiguous.

If robots can understand spatial language, they can leverage prior knowledge possessed by humans to search for objects more efficiently, and interface with humans more naturally. This can be useful for applications such as autonomous delivery and search-and-rescue, where the customer or people at the scene communicate with the robot via natural language.

Unfortunately, humans produce diverse spatial language phrases based on their observation of the environment and knowledge of target locations, yet none of these factors are available to the robot. In addition, the robot may operate in a different area than where it was trained. The robot must generalize its ability to understand spatial language across environments.

Prior works on spatial language understanding assume referenced objects already exist in the robot’s world model or within the robot’s field of view. Works that consider partial observability either do not handle ambiguous spatial prepositions or assume a robot-centric frame of reference, which limits the ability to understand diverse spatial relations that provide critical disambiguating information, such as “behind the ticket booth.”

We present Spatial Language Object-Oriented POMDP (SLOOP), which extends OO-POMDP (a recent work in our lab) by considering spatial language as an additional perceptual modality. To interpret ambiguous, context-dependent prepositions (e.g. behind), we design a simple convolutional neural network that predicts the language provider’s latent frame of reference (FoR) given the environment context.
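To illustrate how a frame of reference turns a phrase like “behind the ticket booth” into information about where the target is, here is a small sketch of a belief update on a 2D grid. It is not the SLOOP model: the frame of reference is passed in as a known angle (in the paper it is predicted by a neural network), and the likelihood function, its constants, and all names are assumptions made for this example.

```python
import math
from typing import Dict, Tuple

Cell = Tuple[int, int]

BASELINE = 0.05  # assumed small likelihood for cells that do not match


def behind_likelihood(cell: Cell, landmark: Cell, front_angle: float,
                      max_dist: float = 4.0) -> float:
    """Hand-written likelihood that `cell` is 'behind' `landmark`, given
    the direction (radians) that the landmark's front faces."""
    dx, dy = cell[0] - landmark[0], cell[1] - landmark[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist > max_dist:
        return BASELINE
    behind_angle = front_angle + math.pi
    # Cosine between the landmark-to-cell direction and the 'behind' direction.
    cos_sim = (dx * math.cos(behind_angle) + dy * math.sin(behind_angle)) / dist
    return BASELINE + max(0.0, cos_sim)


def update_belief(belief: Dict[Cell, float], landmark: Cell,
                  front_angle: float) -> Dict[Cell, float]:
    """Bayes update: multiply the prior by the likelihood and renormalize."""
    posterior = {c: p * behind_likelihood(c, landmark, front_angle)
                 for c, p in belief.items()}
    total = sum(posterior.values())
    return {c: p / total for c, p in posterior.items()}


# Uniform prior over a 7x7 map with the landmark in the middle.
cells = [(x, y) for x in range(7) for y in range(7)]
prior = {c: 1.0 / len(cells) for c in cells}
booth = (3, 3)

# If the booth's front faces +x, 'behind' puts mass on the -x side.
faces_east = update_belief(prior, booth, front_angle=0.0)
print(faces_east[(1, 3)] > faces_east[(5, 3)])  # True

# The same phrase under a different frame of reference favors the -y side.
faces_north = update_belief(prior, booth, front_angle=math.pi / 2)
print(faces_north[(3, 1)] > faces_north[(3, 5)])  # True
```

The same sentence yields two different search regions depending on the assumed frame of reference, which is why inferring it from context matters.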

We apply SLOOP to object search in city-scale environments given a spatial language description of target locations. Search strategies are computed via an online POMDP planner based on Monte Carlo Tree Search.

Evaluation based on crowdsourced language data, collected over areas of five cities in OpenStreetMap, shows that our approach achieves faster search and higher success rate compared to baselines, with a wider margin as the spatial language becomes more complex.

Finally, we demonstrate the proposed method in AirSim, a realistic simulator where a drone is tasked to find cars in a neighborhood environment.

For future work, we plan to investigate compositionality in spatial language for partially observable domains.

See our video!

And download the paper!

Object Search in 3D

Robots operating in human spaces must find objects such as glasses, books, or cleaning supplies that could be on the floor, shelves, or tables. This search space is naturally 3D.

When multiple objects must be searched for, such as a cup and a mobile phone, an intuitive strategy is to first hypothesize likely search regions for each target object based on semantic knowledge or past experience, then search carefully within those regions by moving the robot’s camera around the 3D environment. To be successful, it is essential for the robot to produce an efficient search policy within a designated search region under limited field of view (FOV), where target objects could be partially or completely blocked by other objects. In this work, we consider the problem setting where a robot must search for multiple objects in a search region by actively moving its camera, with as few steps as possible.

Searching for objects in a large search region requires acting over long horizons under various sources of uncertainty in a partially observable environment. For this reason, previous works have used the Partially Observable Markov Decision Process (POMDP) as a principled decision-theoretic framework for object search. However, to ensure the POMDP is manageable to solve, previous works reduce the search space or robot mobility to 2D, although objects exist in rich 3D environments. The key challenges lie in the intractability of maintaining an exact belief due to the large state space, and the high branching factor for planning due to the large observation space.

In this paper, we present a POMDP formulation for multi-object search in a 3D region with a frustum-shaped field-of-view. To efficiently solve this POMDP, we propose a multi-resolution planning algorithm based on online Monte-Carlo tree search. In this approach, we design a novel octree-based belief representation to capture uncertainty of the target objects at different resolution levels, then derive abstract POMDPs at lower resolutions with dramatically smaller state and observation spaces.
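The sketch below shows the basic idea behind reasoning about a target's location at several resolutions: the probability of a coarse cube is the sum of the probabilities of the fine voxels it contains. It is a simplified stand-in, not the octree belief from the paper; it stores the belief in a flat dictionary rather than a tree, and the function name and grid size are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple

Voxel = Tuple[int, int, int]


def coarsen(belief: Dict[Voxel, float], level: int) -> Dict[Voxel, float]:
    """Aggregate a finest-resolution belief into cubes of side 2**level.
    Each coarse cube's probability is the sum over the voxels inside it."""
    size = 2 ** level
    coarse: Dict[Voxel, float] = defaultdict(float)
    for (x, y, z), p in belief.items():
        coarse[(x // size, y // size, z // size)] += p
    return dict(coarse)


# Uniform belief over a 4x4x4 search region (64 voxels).
fine = {(x, y, z): 1.0 / 64
        for x in range(4) for y in range(4) for z in range(4)}

level1 = coarsen(fine, level=1)  # cubes of side 2
level2 = coarsen(fine, level=2)  # one cube of side 4

print(len(fine), len(level1), len(level2))  # 64 8 1
print(level1[(0, 0, 0)])                    # 0.125
```

Each step up in level shrinks the number of candidate locations by a factor of eight, which is the source of the much smaller state spaces at lower resolutions.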

Evaluation in a simulated 3D domain shows that, in larger instances and under the same computational requirement, our approach finds objects more efficiently and more successfully than a set of baselines without a resolution hierarchy.

Finally, we demonstrate our approach on a torso-actuated mobile robot in a lab environment. The robot finds 3 out of 6 objects placed at different heights in two 10 m² × 2 m regions in around 15 minutes. This demonstrates that such challenging POMDPs can be solved online efficiently, scalably, and in a way that is practical for a real robot, by extending existing general POMDP solvers with domain-specific structure and belief representation.

You can download the paper here! The code is available at https://zkytony.github.io/3D-MOS/. This paper won the RoboCup Best Paper Award!

Scanning the Internet for ROS

Security is particularly important in robotics. A robot can sense the physical world using sensors, or directly change it with its actuators. Thus, it can leak sensitive information about its environment, or even cause physical harm if accessed by an unauthorized party. Existing work has assessed the state of industrial robot security and found a number of vulnerabilities. However, we are unaware of any studies that gauge the state of security in robotics research.

To address this problem, we conducted several scans of the whole IPv4 address space in order to identify unprotected hosts using the Robot Operating System (ROS), which is widely used in robotics research. As with many research platforms, the designers of ROS made a conscious decision to exclude security mechanisms because they did not have a clear model of security threats and were not security experts themselves. The ROS master node trusts all nodes that connect to it, and thus should not be exposed to the public Internet. Nonetheless, our scans identified over 100 publicly-accessible hosts running a ROS master. Of those we found, a number are connected to simulators, such as Gazebo, while others appear to be real robots capable of being remotely moved in ways dangerous both to the robot and those around it.

As a qualitative case study, we also present a proof-of-concept “takeover” of one of the robots (with the consent of its owner), to demonstrate that an open ROS master indicates a robot whose sensors can be remotely accessed, and whose actuators can be remotely controlled.

This scan was eye-opening for us, too: we found two of our own robots as part of the scan, one Baxter [5] robot and one drone. Neither was intentionally made available on the public Internet, and both have the potential to cause physical harm if used inappropriately. Read more in our 2019 ICRA paper!

Learning Abstractions to Follow Commands

Natural language allows us to create abstractions to be as specific about a task as needed, without providing trivial details. For example, when we give directions we mention salient places that contain information like traffic signals, street corners, or house colors, but we do not specify what gait a person must take, or how a car must be driven to get to the destination. Humans contribute to a conversation by being as informative as needed, attempting to be relevant while providing ordered, concise, and understandable input. Humans achieve the properties of the Gricean Cooperative Principle by creating abstractions that allow them to be as specific about a task as needed. These abstractions can take the form of a house color or a tree, without getting bogged down by details such as the number of leaves on the tree or the number of bricks used to make the house. These salient places are sub-goals that the person receiving the instructions needs to achieve to reach their goal. The person solving the task needs to understand these sub-goal conditions and plan to reach the sub-goals in the order in which they are specified.

In this work, instructions are provided in natural language, and the robot observes the world through continuous LIDAR data and performs continuous actions. This problem is difficult because the instructions contain discrete abstractions like corridors and intersections, but the agent observes and interacts with the world through continuous data. Learning these abstractions can require a lot of data if done using state-of-the-art deep learning methods. Moreover, it would not be possible to learn sub-goals that can be used for planning in other tasks or with other agents, as most deep methods would learn a direct mapping from language and current state to actions.

To solve this problem, we connect the ideas of abstraction in language with the ideas of hierarchical abstraction in planning. We treat natural language as a sequence of abstract symbols. Further, we learn sub-goals that are observed consistently in solving tasks. First, we collected data by driving the robot around and asking users on Amazon Mechanical Turk (AMT) to provide instructions for the task the robot had just performed. We then identify skills within the robot trajectories using change point detection. These skills are similar to options, in that they have a policy and a termination set. We learn the termination set by training a classifier on the termination sets observed by the robot at the end of a skill. These termination sets provide us with sub-goal termination conditions that a planner can achieve in sequence in order to satisfy an instruction. Last, we learn a model that translates natural language to the sequence of these termination conditions.
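As a rough picture of the segmentation step, the sketch below splits a one-dimensional signal into segments wherever its local mean shifts sharply. The post does not say which change point detection algorithm is used, so this sliding-window test is only a stand-in; the signal (a hypothetical turn-rate reading), the window size, and the threshold are all assumptions for illustration.

```python
from typing import List


def detect_change_points(signal: List[float], window: int = 3,
                         threshold: float = 4.0) -> List[int]:
    """Return indices i where the mean of the `window` samples starting at i
    differs from the mean of the `window` samples just before i by more
    than `threshold`."""
    change_points = []
    i = window
    while i + window <= len(signal):
        before = sum(signal[i - window:i]) / window
        after = sum(signal[i:i + window]) / window
        if abs(after - before) > threshold:
            change_points.append(i)
            i += window  # skip ahead so the same shift is not reported twice
        else:
            i += 1
    return change_points


def split_into_segments(signal: List[float],
                        change_points: List[int]) -> List[List[float]]:
    """Cut the signal at each change point; each piece is a candidate skill."""
    bounds = [0] + change_points + [len(signal)]
    return [signal[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical turn-rate readings: straight corridor, a turn, straight again.
turn_rate = [0.0] * 5 + [5.0] * 5 + [0.0] * 5

points = detect_change_points(turn_rate)
print(points)  # [5, 10]

segments = split_into_segments(turn_rate, points)
print([len(s) for s in segments])  # [5, 5, 5]
```

The last state of each segment is the kind of data a termination classifier could be trained on, since it marks where one skill ended and the next began.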

We demonstrate our method in two different real robot domains. One is a previously collected self-driving car dataset, and the other is a robot that was driven around the campus at Brown. We convert the long continuous state-action demonstrations of the robot into a handful of symbols that are important in completing the instructions. This learned abstraction is transferable and interpretable, and it allows learning of language and skills with a realistic number of demonstrations that a human can provide. For more details refer to our paper! Our code is here.

Understanding Adjectives

As robots grow increasingly capable of understanding and interacting with objects, it is important for non-expert users to be able to communicate with robots. One of the most sought after communication modalities is natural language, allowing users to verbally issue directives. We focus our work on natural language object retrieval—the task of finding and recovering an object specified by a human user using natural language.

Natural language object retrieval becomes far more difficult when the robot is forced to disambiguate between objects of the same class (such as between different teapots, rather than between a teapot and a bowl). The solution to this problem is to include descriptive language to specify not only object classes, but also object attributes such as shape, size, or more abstract features. We must also handle partially observed objects; it is unreasonable for a robot to observe objects in its environment from many angles with no self-occlusion. Furthermore, as human-annotated language data is expensive to gather, it is important to not require language annotated depth images from all possible viewpoints. A robust system must be able to generalize from a small human-labeled set of (depth) images to novel and unique object views.

We develop a system that addresses these challenges by grounding descriptions to objects via explicit reasoning over 3D structure. We combine Bayesian Eigenobjects (BEOs), a framework which enables estimation of a novel object’s full 3D geometry from a single partial view, with a language grounding model. This combined model uses BEOs to predict a low-dimensional object embedding into a pose-invariant learned object space and predicts language grounding from this low-dimensional space. This structure enables training of the model on a small amount of expensive human-generated language data. In addition, because BEOs generalize well to novel objects, our system scales well to new objects and partial views, an important capability for a robot operating in an unstructured human-centric environment.

We evaluate our system on a dataset of several thousand ShapeNet object instances across three object classes, paired with human-generated object descriptions obtained from Amazon Mechanical Turk. We show that not only is our system able to distinguish between objects of the same class, it can do so even when objects are only observed from partial views. In a second experiment, our language model is trained on depth images taken only from the front of objects and successfully predicts attributes given test depth images taken from rear views. This view-invariance is a key property afforded by our use of an explicitly learned 3D representation; traditional fully supervised approaches are not capable of handling this scenario. We demonstrate our system on a Baxter robot, enabling the robot to successfully determine which object to pick up based on a Microsoft Kinect depth image of several candidate objects and a simple language description of the desired object. Our system is fast, with inference running under 100ms for a single language+depth-image query.

Learn more in our IROS 2019 paper! Find the code and data in our GitHub repository.

POMDPs and Mixed Reality for Resolving References to Objects

Communicating human knowledge and intent to robots is essential for successful human-robot interaction (HRI). Failures in communication, and thus collaboration, occur when there is a mismatch between two agents’ mental states. Question-asking allows a robot to acquire information that targets its uncertainty, facilitating recovery from failure states. However, all question-asking modalities (natural language, eye gaze, gestures, visual interfaces) have tradeoffs, making the choice of which to use an important and context-dependent decision. For example, for robots with “real” eyes or pan/tilt screens, looking requires fewer joints to move less distance compared to pointing, reducing the time the referential action takes. However, eye gaze, or “pointing with the eyes,” is inherently more difficult to interpret, since pointing gestures can reduce the distance between the referrer and the referenced item. Another approach to question-asking is to use visualizations to make the communication modality independent of the robot. While the performance of physical actions like eye gaze and pointing gestures relies on the physical robot, visualization methods like mixed reality (MR) enable the robot to communicate information by visually depicting its mental state in the real environment. Related work has investigated the effects of using MR visualizations for reducing mental workload in HRI, but there remains a gap in research on how it compares to physical actions like pointing and eye gaze for reducing robot uncertainty.

This work investigates how physical and visualization-based question-asking can be used for reducing robot uncertainty under varying levels of ambiguity. To do this, we first model our problem as a POMDP, termed the Physio-Virtual Deixis POMDP (PVD-POMDP), that observes a human’s speech, gestures, and eye gaze, and decides when to ask questions (to increase accuracy) and when to choose the item (to decrease interaction time). The PVD-POMDP enables the robot to ask questions either using physical modalities (like eye gaze and gesture) or using virtual modalities (like mixed reality). To evaluate our model, we conducted a between-subjects user study in which 80 participants interacted with a robot in an item-fetching task. Participants experienced one of three different conditions of our PVD-POMDP: a no feedback control condition, a physical feedback condition, or a mixed reality feedback condition. Our results show that our mixed reality model is able to successfully choose the correct item 93% of the time, with each interaction lasting about 5 seconds, outperforming the physical and no feedback models in both quantitative metrics and qualitative metrics (highest usability, task load, and trust scores).
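The tradeoff between asking (more accuracy) and choosing (less time) can be pictured with a belief over candidate items that is updated after each observation. The sketch below is not the PVD-POMDP or its solver: it replaces planning with a simple confidence threshold, and the item names, likelihood values, and threshold are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def update_belief(belief: Dict[str, float],
                  likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes update over candidate items: prior times the likelihood of the
    observation under each item, then renormalize."""
    posterior = {item: belief[item] * likelihood[item] for item in belief}
    total = sum(posterior.values())
    return {item: p / total for item, p in posterior.items()}


def next_action(belief: Dict[str, float],
                confidence: float = 0.9) -> Tuple[str, str]:
    """Assumed greedy rule: pick the most likely item once the belief in it
    reaches `confidence`; otherwise ask a question about it."""
    best = max(belief, key=belief.get)
    if belief[best] >= confidence:
        return ("pick", best)
    return ("ask", best)


items = ["red_cup", "blue_cup", "marker"]
belief = {item: 1.0 / len(items) for item in items}

# The person says "the cup": both cups are likely, the marker is not.
belief = update_belief(belief, {"red_cup": 0.45, "blue_cup": 0.45, "marker": 0.10})
print(next_action(belief))  # ('ask', 'red_cup')

# The robot indicates the red cup and the person answers "yes".
belief = update_belief(belief, {"red_cup": 0.95, "blue_cup": 0.05, "marker": 0.05})
print(next_action(belief))  # ('pick', 'red_cup')
```

A full POMDP planner would weigh the cost of each question against the expected gain in accuracy rather than use a fixed threshold, but the belief update has the same shape.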

You can read more in our paper!

Action Oriented Semantic Maps

A long-term goal of robotics is designing robots intelligent enough to enter a person’s home and perform daily chores for them. This requires the robot to learn specific behaviors and semantic information that can only be acquired after entering the home and interacting with the humans living there. For example, there may be a trinket that the robot has never encountered before, and the owner might want to instruct the robot on how to handle the item (i.e., object manipulation information), as well as directly specify where the item should be kept (i.e., navigation information). To approach this problem, one must consider two sub-problems: a) the agent’s representation of object manipulation actions and semantic information about the environment, and b) the method with which an agent can learn this knowledge from a teacher.

Semantic maps provide a representation sufficient for navigating an environment, but map information alone is insufficient for enabling object manipulation. Conversely, there are knowledge bases that store requisite object manipulation information, but do not help with navigation or grasping in novel orientations. Previous studies have shown that Mixed Reality (MR) interfaces are effective for specifying navigation commands and programming egocentric robot behaviors. However, none of these works have demonstrated the use of MR interfaces for teaching high-level object manipulation actions, and semantic information of the environment.

Our contribution is a system that enables humans to teach robots both a) object manipulation actions, in a local object frame of reference, and b) semantic information about objects in a global map. We use a Mixed Reality Head Mounted Display (MR-HMD) to enable humans to teach a robot a plannable representation of their environment. By plannable, we mean structured representations that are searchable with AI planning tools. By teach, we mean having the human explicitly provide information necessary for instantiating our representation. Our representation, the Action-Oriented Semantic Map (AOSM), enables robots to perform complex object manipulation tasks that require navigation around an environment.

To test our system for building AOSMs, we created a test environment where a mobile manipulator was tasked with learning how to perform household chores with the help of a human trainer. Three novice humans used our MR interface to teach a robot an AOSM, allowing the robot to autonomously plan to navigate to a bottle, pick it up, and throw it out. In addition, two expert users taught a robot to autonomously plan to flip a light switch off and manipulate a sink faucet to the closed position, both in under 2 minutes.

You can read more in our paper!

Natural Language, Mixed Reality, and Drones

As drones become increasingly autonomous, it is imperative that designers create intuitive and flexible ways for untrained users to interact with these systems. A natural language interface is immediately accessible to non-technical users and does not require the user to use a touchscreen or radio control (RC). Current natural language interfaces require a predefined model of the environment including landmarks, which is difficult for a drone to obtain. For example, given the instruction “Fly around the wall to the chair and take a picture,” the drone must already have a model of the wall and the chair to infer a policy.

We address these interaction problems by using natural language within a mixed reality headset (MR) to provide an intuitive high-level interface for controlling a drone using goal-based planning. By using MR, a user can annotate landmarks with natural language in the drone’s frame of reference. The process starts with the MR interface displaying the virtual environment of a room that can be adjusted to overlay on physical reality with gestures. Afterward, the user can annotate landmarks with colored boxes using the MR interface. This process allows the user to specify landmarks that the drone does not have the capability to detect on its own. Then, when the user sends a command, we translate this natural language text to a reward function.
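To give a sense of what translating a command into a reward function can look like once landmarks have been annotated, here is a minimal sketch. It is not the system described in this post: keyword matching stands in for the language model, treating the last landmark mentioned as the goal is a heuristic chosen for this example, and the class, function, and landmark names are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned box standing in for a landmark annotated by the user."""
    lo: Point
    hi: Point

    def contains(self, p: Point) -> bool:
        return all(l <= v <= h for l, v, h in zip(self.lo, p, self.hi))


def command_to_reward(command: str,
                      landmarks: Dict[str, Box]) -> Callable[[Point], float]:
    """Build a goal-based reward: 1 inside the goal landmark's box, else 0.
    Assumed heuristic: the goal is the last landmark named in the command."""
    words = re.findall(r"[a-z]+", command.lower())
    mentioned = [w for w in words if w in landmarks]
    if not mentioned:
        raise ValueError("command does not mention a known landmark")
    goal_box = landmarks[mentioned[-1]]
    return lambda position: 1.0 if goal_box.contains(position) else 0.0


# Hypothetical annotations made through the headset.
landmarks = {
    "wall": Box((1.0, 0.0, 0.0), (1.2, 3.0, 2.0)),
    "chair": Box((2.0, 2.0, 0.0), (3.0, 3.0, 1.0)),
}

reward = command_to_reward(
    "Fly around the wall to the chair and take a picture.", landmarks)
print(reward((2.5, 2.5, 0.5)))  # 1.0
print(reward((0.0, 0.0, 0.0)))  # 0.0
```

A planner can then search for a path that maximizes this reward, which is what makes the interface goal-based rather than a stream of low-level commands.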

Further, we conducted an exploratory user study to compare a low-level 2D interface with two MR interfaces. Overall, we found that our MR system offers a user-friendly approach to controlling a drone with both low-level and high-level instructions. Because the language interface does not require direct, continuous commands from the user, the MR system does not require the user to spend as much time controlling the drone as the 2D interface does.

You can read the paper here!