Checklist for Paper Submissions

We all want our papers to be awesome. Here is the checklist I use before paper submission.

  • Make sure there are no latex errors. Especially if you are using Overleaf, it will make a pdf even if your latex has errors, but it may not render the way you expect. If you are building on a PC, sometimes latex errors show up far up in the lag, so check for undefined references or citations or anything like that.
  • Spellcheck!
  • Read the bibliography and make sure there are no typos/errors/missing fields there.
  • Make sure the file size is 1MB or less. If it is larger, it is probably because of included images. Make sure 1) the resolution of the images is not too large (e.g., max 800 on a side), and 2) that if it is a photograph you are using jpg (which is compressed) and not png (which is losslessly compressed). However if it’s a diagram with lots of white space, use png, or even pdf or svg which will be imported into the generated pdf as a vector graphics image rather than pixles, and be small but look good at any resolution. Taking these steps will not change how it looks, but will dramatically reduce file size, making it take less time for you to upload, for reviewers to download, and for everyone else to download the paper for the rest of its lifetime.
  • Make sure all authors know they are authors and are okay with submitting.
  • Read the conference guidelines for authors and make sure you follow them all, especially as related to space. baseline stretch is now getting explicitly disallowed so be careful!
  • If double blind, make sure 1) you are citing all your own previous work, but you are citing it anonymously (as if someone else wrote the paper) 2) there are no other bits that deanonymize you.

Scanning the Internet for ROS

Security is particularly important in robotics. A robot can sense the physical world using sensors, or directly change it with its actuators. Thus, it can leak sensitive information about its environment, or even cause physical harm if accessed by an unauthorized party. Existing work has assessed the state of industrial robot security and found a number of vulnerabilities. However, we are unaware of any studies that gauge the state of security in robotics research.

To address this problem we conducted several scans of the whole IPv4 address space, in order to identify unprotected hosts using the Robot Operating System (ROS), which is widely used in robotics research. Like many research platforms, the ROS designers made a conscious decision to exclude security mechanisms because they did not have a clear model of security threats and were not security experts themselves. The ROS master node trusts all nodes that connect to it, and thus should not be exposed to the public Internet. Nonetheless, our scans identified over 100 publicly-accessible hosts running a ROS master. Of those we found, a number of them are connected to simulators, such as Gazebo, while others appear to be real robots capable of being remotely moved in ways dangerous both to the robot and those around it.

As a qualitative case study, we also present a proof-of-concept “takeover” of one of the robots (with the consent of its owner), to demonstrate that an open ROS master indicates a robot whose sensors can be remotely accessed, and whose actuators can be remotely controlled.
This scan was eye-opening for us, too—we found two of our own robots as part of the scan, one Baxter [5] robot and one drone. Neither was intentionally made available on the public Internet, and both have the potential to cause physical harm if used inappropriately. Read more in our 2019 ICRA paper!

How to Fix a Broken Robot

This post describes steps for fixing a robot that is broken. Since robots are computers, many of the steps also apply to fixing a broken phone, tablet or PC, although the details differ.

Hardware

The first step is to determine if the hardware itself is working. The hardware subsystems that most commonly fail are 1) the power subsystem 2) the compute subsystem (RAM/CPU/motherboard 3) the long-term storage subsystem (hard drive or SD card) . You want to therefore systematically check each subsystem (in that order) to verify that it is working correctly. Sometimes it is necessary to take the robot apart to check these subsystems. While it is rarely the case, it is possible that dust build-up inside your robot is causing ventilation issues. If you are pulling your robot apart and see a lot of dust, take a second to clear it out! Also, always make sure that when fixing your robot and the power is on, that you are near the e-stop. Be careful when taking the robot apart. Pieces can be very fragile and delicate and aren’t made for outside conditions. static charge can build on you, which is dangerous when getting near power systems to fix them. You should always wear an antistatic wrist strap when fixing power systems to ensure you don’t discharge any static into the robot. When doing this, be sure to keep track of all the pieces you unscrew, and it’s always good to have backups in case. Be careful with unscrewing, you don’t want to strip the screws.

Power Subsystem. The most common symptom of a power problem is that the robot or device just won’t turn on. This could be because the power supply failed, because the battery died, or maybe your wall outlet doesn’t have power. (Batteries have a limited number of charging cycles; they wear out and need to be replaced!) Most robots or devices have some kind of LED that indicates that power is being transmitted to the device, and the location, color, and meanings of these LEDs is often documented in the device spec sheet or on the board itself. For example, the Raspberry Pi Model B+ has two LEDs, labeled ACT (activity) and PWD (Power), which you can see in this picture. If you can’t verify power is coming to the device, get a multimeter and check that the voltage coming from the battery or power supply is what is specified in the datasheet. You might have to take the device apart to access the power supply; you might have to do some digging to find the data sheet. For example, here is the information on the voltages required to power the Raspberry Pi. Don’t forget to check the wall! Maybe the wall outlet isn’t receiving power, and your robot is fine. It’s also possible a short-circuit has occurred, and something is fried. In this case, you’ll most likely need to replace a fuse, wire, etc. One useful way to quickly check for this is the “smell test”: if the inside of your robot smells slightly smokey or burned, you most likely had a short and should look for something that looks burned.

Compute Subystem. The compute subsystem fails less frequently than the other two. However it is next in the debugging process because it is necessary for checking if your long-term storage system is working. Once a computer is powered on, it conducts a power-on self-test (POST). This POST occurs immediately after powering on, before you boot into your operating system (which requires access to long-term storage). It verifies that each of the hardware components of the machine are working. A PC that fails its post will make strange BIOS beep codes, and you need to figure out what the beep codes mean. This requires figuring out exactly which BIOS you have by looking up the specifications for that PC, or opening it up to look at the motherboard. Then the documentation for the BIOS will indicate what the beep codes mean. The Raspberry PI doesn’t have beep codes; instead it has LED flash codes to indicate different errors.

Of course, the best way to get more information about what’s going on with the POST is to plug in a monitor and keyboard into the computer. This may already be true if you are working with a PC, but if you are working with a robot, it may not have a monitor plugged in by default. The BIOS beep codes and LED flashes indicate problems without needing a monitor, but plugging in a monitor will show exactly what’s going on. Most robots support this somehow; for example our MOVO robot has an HDMI port on the bag to plug into. The Raspberry PI Model B+ also has an HDMI port, and many times my students are surprised to realize that you can plug in a monitor and keyboard and suddenly their quadrotor drone is a PC! Once you’ve verified the POST ran, you can be fairly confident the CPU, motherboard, and RAM are working. Or if not, it will indicate what’s wrong and you can try replacing those components.

Long-term Storage Subystem. If the POST tests succeed, the next step is for the computer to boot into its operating system, which is held on the long-term storage device. Hard drives and SD cards are another common failure point. Both have a limited number of read/writes before they will fail. Hard drives can have bad sectors, parts of the disk that are permanently bad, and can fail entirely. You want to verify that your robot is able to read the long-term storage and boot into its operating system. One of the simplest ways to do this is to plug it into a keyboard and mouse and see if it boots up! This allows you to watch the whole boot process and see if it completes successfully and enables checking other problems too, like networking and software issues. Fortunately the hard drive or SD card is relatively easy to replace, if you backed up your data. In many cases there may exist a standard disk image for the system you are working with; for example the SD card in our Kuri robot failed, but we were able to restore it using an image we could download from their website.

Networking

The next major goal is to connect to your robot over the network. You don’t want to start working on this until you are reasonably sure the robot is passing its POST test and booting into its operating system. The robot’s wifi can be configured in either Master mode or Managed mode. In Master mode, it will act as its own wifi hot spot. You will see an SSID on your base station laptop or desktop that corresponds to the robot’s network. In this mode if you connect to the robot (or PC), it will give your base station an IP address. You can find the robot’s IP address by using a tool like nmap to scan, or route to look at the gateway machine, or look at your base station’s logs to find the IP address of the DHCP server. Second, it can be configured in Managed mode, where the robot is looking for a wireless network. Typically in this case it is configured to look for a network with a particular SSID, and this configuration can be stored on the hard drive. Sometimes for commercial robots, there is a process where this is configured via an app. For example, I recently installed the Mysa smart thermostat. First my phone asked which network I wanted the Mysa to connect to. Then, it connected to the Mysa wireless network, where the Mysa itself was the AP master, and neither my phone nor Mysa had internet access. Then the phone tells Mysa what SSID and password to use, and then both devices connect to that SSID (the “house” internet), and can talk to each other. The Kuri robot works the same way. Third, the robot can be configured to use a fixed, static IP address. Many home routers work this way; in this case the static IP should hopefully be documented in the robot’s or device’s documentation. This is the cause of many issues: setting a static IP but the IP of the robot gets dynamically changed via DHCP server. A simple test is to make sure you can ping your robot, and that your robot can ping you. Sometimes the networking problem comes from firewalls. Make sure the firewalls aren’t stopping networking traffic your robot needs, but don’t just take them completely down; a full discussion of this is beyond the scope of this article, but note that we scanned the internet for robots and found a lot! You can configure your base station to use a different static IP address in the same subnet, and they should be able to connect. You can also snoop by using the ARP cache or network snooping tools, but that is beyond the scope of this article. If your network setup uses ethernet cables, make sure there are no issue with the wires. They can break quite easily without it being obvious. You should also be aware of your networking bandwidth/latency. If your robot is jittering around or sending incomplete sensor data, you may need to throttle things or swap from wifi to a wired connection.

Software

Next you want to make sure the software on your robot is working. The details of this are also beyond the scope of this article, but at a high level, you want to make sure that 1) the software to make the robot move started up without errors and 2) that it can connect to whatever it needs to connect to to do its job. Typically this means using tools like ps to show running processes, checking log files to see if there are errors on startup, and writing and running client programs to see if each of the programs is running. The key is to be systematic; check each sensor or actuator subsystem on the robot to verify it started and is working, because you may find other hardware issues in this process. Typically each sensor comes with its own drivers and minipackages for checking if they work, and these drivers are used within a larger system like ROS to connect things together. Always make sure that the drivers for your sensors work on your robot platform first. For example, a common problem on our drones is being unable to connect to the flight controller so the drone can’t send a command to arm or spin its motors, because the USB socket has failed. ROS does not make this debugging process easy, because people often configure their robot to use a single roslaunch file to start the entire robot stack, making it easy to miss errors in one subsystem or another, and hard to audit a substack. For example, to debug our MOVO robot, we manually started the MoveIt stack from the command line to see the error logs and try different configurations to fix problems.

Software Dependencies.

Calibration. A very common source of errors is miscalibration of some kind. For example, our MOVO robot has a Kinect 2 sensor mounted on its head. Even though we carefully calibrated the sensor about one year ago, it got knocked or moved in that time, and when we checked it yesterday it was off by quite a bit. If your robot does not know where its sensors are relative to its base frame, then it cannot effectively use the sensor data. To check this, use a tool such as RVis to visualize the sensor data overlayed on the robot’s model, or use some kind of fiducial such as AprilTags to check the calibration or recalibrate.

Learning Abstractions to Follow Commands

Natural language allows us to create abstractions to be as specific about a task as needed, without providing trivial details. For example, when we give directions we mention salient places that contain information like traffic signals, or street corners or house colors, but we do not specify what gait a person must take, or how a car must be driven to get to the destination. Humans contribute to a conversation by being as informative as needed; attempting to be relevant and while providing an ordered, concise, and understandable input. Humans achieve properties of the Grician Cooperative principle by creating abstractions that allow them to be as specific about a task as needed. These abstractions can be in the form of a house color, or tree without getting bogged down by details such as the number of leaves on the tree, or the number of bricks used to make the house. These salient places are sub-goals that the person receiving the instructions needs to achieve to reach its goal. The person solving the task needs to understand these sub-goal conditions and plan to reach these sub-goals in the order of their specifications.

In this work, instructions are provided in natural language, and the robot observes the world in continuous LIDAR data and performs continuous actions. This problem is difficult as the instructions have discrete abstractions like corridors and intersections, but the agent is observing and interacting with the world with continuous data. Learning these abstractions can require a lot of data if done using state-of-the-art deep learning methods. Moreover, it would not be possible to learn sub-goals that can be used for planning in other tasks or with other agents, as most deep methods would learn a direct mapping from language and current state to actions.

To solve this problem we connect the ideas of abstraction in language with the ideas of hierarchical abstraction in planning. We treat natural language as a sequence of abstract symbols. Further we learn sub-goals that are observed consistently in solving tasks. First we collected data by driving the robot around and asking users to AMT to provide instructions for the task the robot just performed. We then identify skills within the robot trajectories using change point detection. These skills are similar to options, in that they have a policy and a termination set. We learn the termination set by learning a classifier around termination sets observed by the robot at the end of a skill. These termination sets provide us with sub-goal termination conditions that a planner can achieve in sequence in order to satisfy an instruction. Last, we learn a model that translates natural language to the sequence of these termination conditions.

We demonstrate our method in two different real robot domains. One is on a previously collected self driving car dataset and another on a robot that was driven around the campus at Brown. We convert the long continuous state-action demonstrations of the robot into a handful symbols that are important in completing the instructions. This learned abstraction is transferable, interpretable and allows learning of language and skills with a realistic number of demonstrations that a human can provide. For more details refer to our paper! Our code is here.

Understanding Adjectives

As robots grow increasingly capable of understanding and interacting with objects, it is important for non-expert users to be able to communicate with robots. One of the most sought after communication modalities is natural language, allowing users to verbally issue directives. We focus our work on natural language object retrieval—the task of finding and recovering an object specified by a human user using natural language.

Natural language object retrieval becomes far more difficult when the robot is forced to disambiguate between objects of the same class (such as between different teapots, rather than between a teapot and a bowl). The solution to this problem is to include descriptive language to specify not only object classes, but also object attributes such as shape, size, or more abstract features. We must also handle partially observed objects; it is unreasonable for a robot to observe objects in its environment from many angles with no self-occlusion. Furthermore, as human-annotated language data is expensive to gather, it is important to not require language annotated depth images from all possible viewpoints. A robust system must be able to generalize from a small human-labeled set of (depth) images to novel and unique object views.

We develop a system that addresses these challenges by grounding descriptions to objects via explicit reasoning over 3D structure. We combine Bayesian Eigenobjects (BEOs), a framework which enables estimation of a novel object’s full 3D geometry from a single partial view, with a language grounding model. This combined model uses BEOs to predict a low-dimensional object embedding into a pose-invariant learned object space and predicts language grounding from this low-dimensional space. This structure enables training of the model on a small amount of expensive human-generated language data. In addition, because BEOs generalize well to novel objects, our system scales well to new objects and partial views, an important capability for a robot operating in an unstructured human-centric environment.

We evaluate our system on a dataset of several thousand ShapeNet object instances across three object classes, paired with human-generated object descriptions obtained from Amazon Mechanical Turk. We show that not only is our system able to distinguish between objects of the same class, it can do so even when objects are only observed from partial views. In a second experiment, our language model is trained on depth images taken only from the front of objects and successfully predict attributes given test depth images taken from rear views. This view-invariance is a key property afforded by our use of an explicitly learned 3D representation—traditional fully supervised approaches are not capable of handling this scenario. We demonstrate our system on a Baxter robot, enabling the robot to successfully determine which object to pick up based on a Microsoft Kinect depth image of several candidate objects and a simple language description of the desired object. Our system is fast, with inference running under 100ms for a single language+depth-image query.

Learn more in our IROS 2019 paper! Find the code and data in our githup repository.

Writing a Research Statement for Graduate School and Fellowships

Writing a research statement happens many times throughout a research career. Often for the first time it happens when applying to Ph.D. programs or applying to fellowships. Later, you will be writing postdoc and faculty applications. These documents are challenging to write because they seek to capture your entire research career in one document that may be read in 90 seconds or less.

Think of the research statement as a proposal. Whether you are applying for a Ph.D. program or for a faculty position, you are trying to convince the reader to invest in you. To decide whether to make this investment they need to know three things: 1) what will you do with the investment, 2) why is that an important problem? and 3) what evidence is there that you will be successful in achieving the goals that you have set out.

The first paragraph, therefore, should describe what problem you are aiming to solve, and why it is an important problem. One common failure mode is to be too general and vague in this paragraph. Yes we all want to solve AI! But you want to write about your specific take, angle, or approach. This will set up the rest of the statement, about why you are the one person uniquely qualified to solve the problem you set up here. Sometimes people shy away from being too specific, because they worry that it will put them in a box. Don’t worry! Research interests always evolve, and you will not be signing in blood to do this exact research plan. It is better to ere on the side of being too specific because it shows you can scope out an exciting project and that you have good ideas, even if you are not sure that this specific idea is the one you will eventually pursue.

The next paragraphs should describe your past work as it fits into the research vision you have outlined in the first paragraph. You can start with a paragraph for each project or paper you have worked on. The paragraphs can be more or less the abstract for the paper . However you should be clear exactly what your role in the project was, give credit to collaborators, and spend more time on the parts of the project you contributed to directly. You also need to tie it to the research vision in paragraph 1. The strongest statement presents your life, as an arrow that points unambiguously towards solving the research question you have outlined in paragraph 1. Of course, no one’s life is actually an unambiguous arrow! However I think it helps to think that way because you are trying to tie the projects together to show how they have prepared you and furthered you along the research trajectory. Even if this project wasn’t directly connected in terms of its research questions, you can write about how it taught you technical tools that you can apply to your research objective, or how it taught you something that led to your current research objective.

The last paragraphs should describe concretely what you plan to do next. If you are applying to a Ph.D. program, you should name the groups you wish to work with and explain why they are a good fit for you. For a fellowship, you should describe why this work is a good fit for the work done by the organization you are applying to.

POMDPS and Mixed Reality for Resolving References to Objects

Communicating human knowledge and intent to robots is essential for successful human-robot interaction (HRI). Failures in communication, and thus collaboration, occur when there is mismatch between two agents’ mental states. Question-asking allows a robot to acquire information that targets its uncertainty, facilitating recovery from failure states. However, all question-asking modalities (natural language, eye-gaze, gestures, visual interfaces) have tradeoffs, making choosing which to use an important and context-dependent decision.  For example, for robots with “real” eyes or pan/tilt screens, looking requires fewer joints to move less distance compared to pointing, decreasing the speed of the referential action. However, eye gaze, or “pointing with the eyes,” is inherently more difficult to interpret since pointing gestures can reduce the distance between the referrer and the referenced item. Another approach to question-asking is to use visualizations to make the communication modality independent of the robot. While the performance of physical actions like eye gaze and pointing gestures rely on the physical robot, visualization methods like mixed reality (MR) enable the robot to communicate information by visually depicting its mental state in the real environment. Related work has investigated the effects of using MR visualizations for reducing mental workload in HRI, but there remains a gap of research on how it compares to physical actions like pointing and eye gaze for reducing robot uncertainty.

This work investigates how physical and visualization-based question-asking can be used for reducing robot uncertainty under varying levels of ambiguity. To do this, we first model our problem as a POMDP, termed the Physio-Virtual Deixis POMDP (PVD-POMDP), that observes a human’s speech, gestures, and eye gaze, and decides when to ask questions (to increase accuracy) and when to decide to choose the item (to decrease interaction time). The PVD-POMDP enables the robot to ask questions either using physical modalities (like eye-gaze and gesture) or using virtual modalities (like mixed reality). To evaluate our model, we conducted a between-subjects user study, where 80 participants interact with a robot in an item-fetching task. Participants experience one of three different conditions of our PVD-POMDP: a no feedback control condition, a physical feedback condition, or a mixed reality feedback condition. Our results show that our mixed reality model is able to successfully choose the correct item 93% of the time, with each interaction lasting about 5 seconds,  outperforming the physical and no feedback models in both quantitative metrics and qualitative metrics (highest usability, task load, and trust scores).

You can read more in our paper!

Action Oriented Semantic Maps

A long-term goal of robotics is designing robots intelligent enough to enter a person’s home and perform daily chores for them. This requires the robot to learn specific behaviors and semantic information that can only be acquired after entering the home and interacting with the humans living there. For example, there may be a trinket that the robot has never encountered before, and the owner might want to instruct the robot on how to handle the item (i.e., object manipulation information), as well as directly specify where the item should be kept (i.e.,  navigation information). To approach this problem, one must consider two sub-problems: a) the agent’s representation of object manipulation actions and semantic information about the environment, and b) the method with which an agent can learn this knowledge from a teacher.

Semantic maps provide a representation sufficient for navigating an environment, but map information alone is insufficient for enabling object manipulation. Conversely, there are knowledge bases that store requisite object manipulation information, but do not help with navigation or grasping in novel orientations. Previous studies have shown that Mixed Reality (MR) interfaces are effective for specifying navigation commands and programming egocentric robot behaviors. However, none of these works have demonstrated the use of MR interfaces for teaching high-level object manipulation actions, and semantic information of the environment.

Our contribution is a system that enables humans to teach robots both object manipulation actions—in a local object frame of reference—and b) semantic information about objects in a global map. We use a Mixed Reality Head Mounted Display (MR-HMD) to enable humans to teach a robot a plannable representation of their environment. By plannable, we mean structured representations that are searchable with AI planning tools. By teach, we mean having the human explicitly provide information necessary for instantiating our representation. Our representation, the Action-Oriented Semantic Map (AOSM), enables robots to perform complex object manipulation tasks that require navigation around an environment.

To test our system for building AOSMs, we created a test environment where a mobile manipulator was tasked with learning how to perform household chores with the help of a human trainer. Three novice humans used our MR interface to teach a robot an AOSM, allowing the robot to autonomously plan to navigate to a bottle, pick it up, and throw it out. In addition, two expert users taught a robot to autonomously plan to flip a light switch off and manipulate a sink faucet to the closed position, both in under 2 minutes.

You can read more in our paper!

Natural Language, Mixed Reality, and Drones

As drones become increasingly autonomous, it is imperative that designers create intuitive and flexible ways for untrained users to interact with these systems. A natural language interface is immediately accessible to non-technical users and does not require the user to use a touchscreen or radio control (RC). Current natural language interfaces require a predefined
model of the environment including landmarks, which is difficult for a drone to obtain. For example, given the instruction “Fly around the wall to the chair and take a picture,” the drone must already have a model of the wall and the chair to infer a policy.


We address these interaction problems by using natural language within a mixed reality headset (MR) to provide an intuitive high-level interface for controlling a drone using goal-based planning. By using MR, a user can annotate landmarks with natural language in the drone’s frame of reference. The process starts with the MR interface displaying the virtual environment of a room that can be adjusted to overlay on physical reality with gestures. Afterward, the user can annotate landmarks with colored boxes using the MR interface. This process allows the user to specify landmarks that the drone does not have the capability to detect on its own. Then, when the user sends a command, we translate this natural language text to a reward function.

Further, we conducted an exploratory user study to compare the low-level 2D interface with two MR interfaces. Overall, we found that our MR system offers a user-friendly approach to control a drone with both low-level and high-level instructions. The language interface does not require direct, continuous commands from the user. Overall, the MR system does not require the user to spend as much time controlling the drone as with the 2D interface.

You can read the paper here!

Finding Objects with Verbs Instead of Nouns

A key bottleneck in widespread deployment of robots in human-centric environments is the ability for non-expert users to communicate with robots. Natural language is one of the most popular communication modalities due to the familiarity and comfort it affords a majority of users. However, training a robot to understand open-ended natural language commands is challenging since humans will inevitably produce words that were never seen in the robot’s training data. These unknown words can come from paraphrasing such as using “saucer” instead of “plate,” or from novel object classes in the robot’s environments, for example a kitchen with a “rolling pin” when the robot has never seen a rolling pin before. We aim to develop a model that can handle open-ended commands with unknown words and object classes. As a first step in solving this challenging problem, we focus on the natural language object retrieval task — selecting the correct object based on an indirect natural language command with constraints on the functionality of the object.

More specifically, our work focuses on fulfilling commands requesting an object for a task specified by a verb such as “Hand me a box cutter to cut.” Being able to handle these types of commands is highly useful for a robot agent in human-centric environments, as people usually ask for an object with a specific usage in mind. The robot would be able to correctly fetch the desired object for the given task, such as cut, without needing to have seen the object, a box cutter for example, or the word representing the object, such as the noun “box cutter.” In addition, the robot has the freedom to substitute objects as long as the selected object satisfies the specified task. This is particularly useful in cases where the robot cannot locate the specific object the human asked for but found another object that can satisfy the given task, such as a knife instead of a box cutter to cut.

Our work demonstrates that an object’s appearance provides sufficient signals to predict whether the object is suitable for a specific task, without needing to explicitly classify the object class and visual attributes. Our model takes in RGB images of objects and a natural language command containing a verb, generates embeddings of the input language command and images, and selects the image most similar to the given command in embedding space. The selected image should represent the object that best satisfies the task specified by the verb in the command. We train our model on natural language command-RGB image pairs. The evaluation task for the model is to retrieve the correct object from a set of five images, given a natural language command. We use ILSVRC2012 images and language commands generated from verb-object pairs extracted from Wikipedia for training and evaluation of our model. Our model achieves an average retrieval accuracy of 62.3% on a held-out test set of unseen ILSVRC2012 object classes and 53.0% on unseen object classes and unknown nouns. Our model also achieves an average accuracy of 54.7% on unseen YCB object classes. We also demonstrate our model on a KUKA LBR iiwa robot arm, enabling the robot to retrieve objects based on natural language commands, and present a new dataset of 655 verb-object pairs denoting object usage over 50 verbs and 216 object classes.

You can read more in our paper and our code and data!