Humans use spatial language to describe object locations and their relations. Consider the following scenario. A tourist is looking for an ice cream truck in an amusement park. She asks a passer-by and gets the reply “the ice cream truck is behind the ticket booth.” The tourist looks at the amusement park map and locates the ticket booth. Then, she is able to infer a region corresponding to that statement and go there to search for the ice cream truck, even though the spatial preposition “behind” is inherently ambiguous.
If robots can understand spatial language, they can leverage prior knowledge possessed by humans to search for objects more efficiently, and interface with humans more naturally. This can be useful for applications such as autonomous delivery and search-and-rescue, where the customer or people at the scene communicate with the robot via natural language.
Unfortunately, humans produce diverse spatial language phrases based on their observation of the environment and their knowledge of target locations, and neither of these factors is available to the robot. In addition, the robot may operate in a different area from the one where it was trained, so it must generalize its ability to understand spatial language across environments.
Prior work on spatial language understanding assumes that referenced objects already exist in the robot’s world model or lie within the robot’s field of view. Work that considers partial observability either does not handle ambiguous spatial prepositions or assumes a robot-centric frame of reference, which limits the ability to understand diverse spatial relations that provide critical disambiguating information, such as “behind the ticket booth.”
We present Spatial Language Object-Oriented POMDP (SLOOP), which extends OO-POMDP (recent work from our lab) by treating spatial language as an additional perceptual modality. To interpret ambiguous, context-dependent prepositions (e.g., “behind”), we design a simple convolutional neural network that predicts the language provider’s latent frame of reference (FoR) given the environment context.
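To make the idea of language as an observation concrete, here is a minimal sketch of a Bayesian belief update over a small grid map. It is not the SLOOP implementation: the function names `behind_likelihood` and `update_belief`, the cosine-based likelihood, the `sharpness` parameter, and the 5x5 grid are all assumptions chosen for illustration. The `front_angle` argument stands in for the frame of reference that, in the approach described above, would be predicted by the neural network.

```python
import math

def behind_likelihood(cell, landmark, front_angle, sharpness=2.0):
    """Hypothetical likelihood that `cell` is "behind" `landmark`.

    `front_angle` (radians) is the assumed frame of reference: the direction
    in which the landmark's "front" faces. "Behind" points the opposite way.
    """
    dx, dy = cell[0] - landmark[0], cell[1] - landmark[1]
    if dx == 0 and dy == 0:
        return 0.0  # assume the target is not on the landmark cell itself
    behind_angle = front_angle + math.pi
    # Cosine of the angle between the cell's direction and the "behind" direction.
    cos_diff = math.cos(math.atan2(dy, dx) - behind_angle)
    # Peaks when the cell lies directly behind the landmark; the small floor
    # keeps other cells possible in case the description is inaccurate.
    return 0.01 + math.exp(sharpness * (cos_diff - 1.0))

def update_belief(belief, landmark, front_angle):
    """Bayes update over grid cells: posterior is proportional to likelihood times prior."""
    unnormalized = {
        cell: prob * behind_likelihood(cell, landmark, front_angle)
        for cell, prob in belief.items()
    }
    total = sum(unnormalized.values())
    return {cell: value / total for cell, value in unnormalized.items()}

# Uniform prior over a 5x5 grid of candidate target locations.
cells = [(x, y) for x in range(5) for y in range(5)]
prior = {cell: 1.0 / len(cells) for cell in cells}

# Assumed frame of reference: the landmark's front faces +x, so "behind" is -x.
landmark = (2, 2)
posterior = update_belief(prior, landmark, front_angle=0.0)
best = max(posterior, key=posterior.get)
```

With this assumed frame of reference, probability mass shifts toward cells on the negative-x side of the landmark. Changing `front_angle` moves that region, which is why the frame of reference has to be estimated before an ambiguous preposition can be interpreted.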
We apply SLOOP to object search in city-scale environments given a spatial language description of target locations. Search strategies are computed via an online POMDP planner based on Monte Carlo Tree Search.
Evaluation based on crowdsourced language data, collected over areas of five cities in OpenStreetMap, shows that our approach achieves faster search and a higher success rate than the baselines, with a wider margin as the spatial language becomes more complex.
Finally, we demonstrate the proposed method in AirSim, a realistic simulator where a drone is tasked to find cars in a neighborhood environment.
For future work, we plan to investigate compositionality in spatial language for partially observable domains.