Natural language allows us to create abstractions and be exactly as specific about a task as needed, without providing trivial details. For example, when we give directions we mention salient landmarks such as traffic signals, street corners, or house colors, but we do not specify what gait a person must take or how a car must be driven to reach the destination. Humans contribute to a conversation by being as informative as needed, attempting to be relevant while providing ordered, concise, and understandable input. In other words, humans follow the Gricean Cooperative Principle by creating abstractions that let them be as specific about a task as necessary: an abstraction can be a house color or a tree, without getting bogged down in details such as the number of leaves on the tree or the number of bricks used to build the house. These salient landmarks are sub-goals that the person receiving the instructions needs to achieve to reach their goal. The person solving the task needs to understand these sub-goal conditions and plan to reach them in the order they were specified.
In this work, instructions are provided in natural language, while the robot observes the world through continuous LIDAR data and acts with continuous controls. This problem is difficult because the instructions refer to discrete abstractions like corridors and intersections, while the agent observes and interacts with the world through continuous data. Learning these abstractions with state-of-the-art deep learning methods can require a large amount of data. Moreover, most deep methods learn a direct mapping from language and the current state to actions, so the sub-goals they encode cannot be reused for planning in other tasks or with other agents.
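To make the setup concrete, here is a minimal sketch of the kind of data involved. The types, field names, and dimensions are our own simplifications for illustration, not the exact representation used in the paper.

```python
# Hypothetical data types for one instruction-following demonstration.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Step:
    lidar: np.ndarray    # continuous LIDAR scan, e.g. shape (num_beams,)
    action: np.ndarray   # continuous control, e.g. (linear_vel, angular_vel)

@dataclass
class Demonstration:
    instruction: str     # e.g. "go down the corridor and turn right"
    trajectory: List[Step]

# What we ultimately want to recover from such data: a small set of
# discrete sub-goal symbols (termination conditions) that the instruction
# refers to, e.g. ["end_of_corridor", "turned_right_at_intersection"]
# (hypothetical labels).
```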
To solve this problem, we connect the idea of abstraction in language with the idea of hierarchical abstraction in planning. We treat natural language as a sequence of abstract symbols, and we learn sub-goals that are observed consistently across task demonstrations. First, we collected data by driving the robot around and asking users on Amazon Mechanical Turk (AMT) to provide instructions for the task the robot had just performed. We then identify skills within the robot trajectories using change point detection. These skills are similar to options, in that each has a policy and a termination set. We learn each termination set by training a classifier on the states the robot observes at the end of a skill. These termination sets give us sub-goal conditions that a planner can achieve in sequence to satisfy an instruction. Finally, we learn a model that translates a natural language instruction into a sequence of these termination conditions.
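The sketch below illustrates the first two steps of this pipeline with off-the-shelf stand-ins: the `ruptures` library for change point detection and a scikit-learn logistic regression for the termination classifiers. These choices, along with the penalty hyperparameter and function names, are assumptions made for illustration rather than the exact algorithms used in the paper.

```python
import numpy as np
import ruptures as rpt
from sklearn.linear_model import LogisticRegression

def segment_into_skills(trajectory: np.ndarray, penalty: float = 10.0):
    """Split a (T, d) state-action trajectory at detected change points."""
    algo = rpt.Pelt(model="rbf").fit(trajectory)
    breakpoints = algo.predict(pen=penalty)      # e.g. [t1, t2, ..., T]
    starts = [0] + breakpoints[:-1]
    return [trajectory[s:e] for s, e in zip(starts, breakpoints)]

def train_termination_classifier(terminal_states: np.ndarray,
                                 other_states: np.ndarray):
    """Fit a classifier that recognizes the termination set of one skill.

    terminal_states: (n_pos, d) states observed at the end of a skill.
    other_states:    (n_neg, d) states sampled elsewhere as negatives.
    """
    X = np.vstack([terminal_states, other_states])
    y = np.concatenate([np.ones(len(terminal_states)),
                        np.zeros(len(other_states))])
    return LogisticRegression(max_iter=1000).fit(X, y)
```

In use, each demonstration is segmented into skills, the end-states of matching segments are pooled across demonstrations, and one termination classifier is fit per sub-goal. A separate sequence model (not shown) then maps an instruction to an ordered list of these termination conditions for the planner to achieve.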
We demonstrate our method in two real robot domains: a previously collected self-driving car dataset, and a mobile robot driven around the Brown campus. We convert the robot's long, continuous state-action demonstrations into a handful of symbols that matter for completing the instructions. This learned abstraction is transferable and interpretable, and it allows language and skills to be learned from a realistic number of human-provided demonstrations. For more details refer to our paper! Our code is here.