William Chen

Vision-Language Models Provide Promptable Representations for Reinforcement Learning

Paper · Website

Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied RL. We initialize policies with VLMs by using them as promptable representations: embeddings that are grounded in visual observations and encode semantic features based on the VLM's internal knowledge, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually-complex, long horizon RL tasks in Minecraft and robot navigation in Habitat. We find that our policies trained on embeddings extracted from general-purpose VLMs outperform equivalent policies trained on generic, non-promptable image embeddings. We also find our approach outperforms instruction-following methods and performs comparably to domain-specific embeddings.

Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies

Paper

This paper proposes an approach to build 3D scene graphs in arbitrary (indoor and outdoor) environments. Such extension is challenging; the hierarchy of concepts that describe an outdoor environment is more complex than for indoors, and manually defining such hierarchy is time-consuming and does not scale. Furthermore, the lack of training data prevents the straightforward application of learning-based tools used in indoor settings. To address these challenges, we propose two novel extensions. First, we develop methods to build a spatial ontology defining concepts and relations relevant for indoor and outdoor robot operation. In particular, we use a Large Language Model (LLM) to build such an ontology, thus largely reducing the amount of manual effort required. Second, we leverage the spatial ontology for 3D scene graph construction using Logic Tensor Networks (LTN) to add logical rules, or axioms (e.g., "a beach contains sand"), which provide additional supervisory signals at training time thus reducing the need for labelled data, providing better predictions, and even allowing predicting concepts unseen at training time. We test our approach in a variety of datasets, including indoor, rural, and coastal environments, and show that it leads to a significant increase in the quality of the 3D scene graph generation with sparsely annotated data.

LaMPP: Language Models as Probabilistic Priors for Perception and Action

Paper

Language models trained on large text corpora encode rich distributional information about real-world environments and action sequences. This information plays a crucial role in current approaches to language processing tasks like question answering and instruction generation. We describe how to leverage language models for *non-linguistic* perception and control tasks. Our approach casts labeling and decision-making as inference in probabilistic graphical models in which language models parameterize prior distributions over labels, decisions and parameters, making it possible to integrate uncertain observations and incomplete background knowledge in a principled way. Applied to semantic segmentation, household navigation, and activity recognition tasks, this approach improves predictions on rare, out-of-distribution, and structurally novel inputs.

Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding

Paper · GitHub Repo

If you were to ask a robot to fetch you a spoon, it should know that spoons generally belong in kitchens, which also contain stoves and fridges. It should then be able to use this information to navigate to locations it's seen before that are likely to be kitchens. This process requires a lot of semantic common sense. We thus investigated the ability for language models to act as common sense mechanisms with no task-specific fine-tuning. Specifically, we try to classify rooms in the Matterport3D dataset given the objects contained within each. We do this by summarizing the contents of the room in natural langugage, then using language models to identify which room category best fits the summary.

This work resulted in a first-author paper submission to the Scaling Robot Learning Workshop at the Robotics: Science and Systems 2022 Conference. I also delivered a spotlight talk and poster presentation on this work there.

Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding

Paper · GitHub Repo

If you were to ask a robot to fetch you a spoon, it should know that spoons generally belong in kitchens, which also contain stoves and fridges. It should then be able to use this information to navigate to locations it's seen before that are likely to be kitchens. This process requires a lot of semantic common sense. We thus investigated the ability for language models to act as common sense mechanisms with no task-specific fine-tuning. Specifically, we try to classify rooms in the Matterport3D dataset given the objects contained within each. We do this by summarizing the contents of the room in natural langugage, then using language models to identify which room category best fits the summary.

This work resulted in a first-author paper submission to the Scaling Robot Learning Workshop at the Robotics: Science and Systems 2022 Conference. I also delivered a spotlight talk and poster presentation on this work there.

Motion Planning Methods for Legged Robots

Paper · GitHub Repo

In this project, we explored optimization-based methods for motion planning with legged robots. Specifically, we implemented a non-linear program that allows a virtual simplified planar version of Boston Dynamics' Little Dog to walk and run. The optimization considers robot dynamics and kinematics, time constraints, world physics, and task specifications. By altering when each of the robot's feet are in contact with the ground, we succeeded in getting the robot to walk and run. We also explored how mixed-integer quadratic programs could be used to tackle similar problems.

This was the final project for MIT's Underactuated Robotics, a graduate-level class taught by Dr. Russ Tedrake.

Visual Pick-and-place for Robot Arms with Deep Learning

Paper · GitHub Repo

This was my final group project for MIT's Intelligent Robot Manipulation , a graduate-level class taught by Dr. Russ Tedrake. The goal of this project was to use visual means to grasp an object of interest from a bin and place it upright on a nearby table using a 7-axis robot arm in the PyDrake simulation environment. Our full end-to-end system features a deep neural network for visual object pose estimation, an interpolator for trajectory planning, and an nonlinear optimization-based inverse kinematics controller to solve for desired control inputs.

Neural Transformer Language Models as Natural Language Knowledge Bases

Paper

Traditional knowledge bases contain hand-engineered compilations of facts and relationships. Because of this, they have limited scalability and require significant resources to maintain, but nevertheless are easy to understand, query, and extend with new information. Neural language models, on the other hand, encode statistical information about grammar. They can be used as question-answering systems akin to knowledge bases -- for masked language models, this can be achieved with cloze-style questions. Still, they are hard to update and interpret. We thus investigated the BERT transformer language model's ability to answer said questions with Meta Research's LAnguage Model Analysis (LAMA) probe, as well as said model's ability to learn or memorize new information (including nonsense).

This was the final project for MIT's Advanced Natural Language Processing , a graduate-level class taught by Dr. Jacob Andreas and Dr. Yoon Kim. The following fall semester, I accepted a teaching assistantship for said class.

William Chen

Education

University of California, Berkeley

The Massachusetts Institute of Technology

The Bronx High School of Science

About

Projects

Vision-Language Models Provide Promptable Representations for Reinforcement Learning

Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies

LaMPP: Language Models as Probabilistic Priors for Perception and Action

Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding

Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding

Motion Planning Methods for Legged Robots

Visual Pick-and-place for Robot Arms with Deep Learning

Neural Transformer Language Models as Natural Language Knowledge Bases