William Chen

· UC Berkeley graduate student
  · Pursuing PhD in Computer Science, advised by Dr. Sergey Levine
  · Researching robotics, decision-making, and language
· Former MIT student
  · Master of Engineering (2023) and Bachelor of Science (2022) in Electrical Engineering and Computer Science
  · Teaching Assistant for Robotics: Science and Systems and Natural Language Processing
· Former robotics intern at NASA's Jet Propulsion Laboratory


Education

University of California, Berkeley

Doctor of Philosophy (Ph.D.) - Computer Science

Advised by Dr. Sergey Levine

June 2023 - Present

Massachusetts Institute of Technology

Master of Engineering (M.Eng.) - Electrical Engineering and Computer Science

Advised by Dr. Luca Carlone

Concentration: Artificial Intelligence | GPA: 5.0 / 5.0

  • Doing Things with Words (Language & Decision-Making AI Seminar)
  • Deep Learning
  • Advanced Sensorimotor Learning (Reinforcement Learning Seminar)
  • Large Language Models & Beyond (Language AI Seminar)

Bachelor of Science (S.B.) - Electrical Engineering and Computer Science

GPA: 5.0 / 5.0

  • Underactuated Robotics
  • Computational Sensorimotor Learning
  • Intelligent Robot Manipulation
  • Robotics: Science and Systems (TA)
  • Intro to Deep Learning (TA)
  • Advanced Natural Language Processing (TA)
  • Computational Cognitive Science
  • Feedback Control Design
  • Design and Analysis of Algorithms
Feb 2022 - May 2023 (M.Eng.)
Aug 2019 - Feb 2022 (S.B.)

The Bronx High School of Science

New York City Specialized High School
Sep 2015 - June 2019

About

I am a graduate student at UC Berkeley pursuing a PhD in Computer Science under Dr. Sergey Levine. My focus is on the intersection of robotics, decision-making, and natural language processing.

Previously, I received a Master of Engineering in Electrical Engineering and Computer Science from MIT, after having finished my Bachelor's degree in the same department. I conducted my thesis work in the SPARK Lab, where, advised by Dr. Luca Carlone and Dr. Jacob Andreas, I investigated the use of natural language processing tools to improve robots' abilities to understand their environments.

As an undergraduate, I was involved in robotics research at MIT's Computer Science and Artificial Intelligence Laboratory and Laboratory for Information and Decision Systems. Moreover, I was a teaching assistant for the classes Introduction to Deep Learning (winter 2021) and Robotics: Science and Systems (spring 2021). I reprised the latter role as a graduate student in spring 2022, subsequently accepting a teaching assistantship for Natural Language Processing in fall 2022 as well.

I can be reliably reached on LinkedIn.


Projects

Here are some of the projects I have undertaken, whether for research or for classes. More information can be found in each project's corresponding GitHub repository.

ECoT: Robotic Control via Embodied Chain-of-Thought Reasoning

We introduce Embodied Chain-of-Thought Reasoning (ECoT) for vision-language-action models (VLAs), in which we train them to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features like object bounding boxes and end effector positions before predicting robot actions. We design a scalable pipeline for generating synthetic training data for ECoT on large robot datasets. We demonstrate that ECoT increases the absolute success rate of OpenVLA, the current strongest open-source VLA policy, by 28% across challenging generalization tasks, without any additional robot training data. Additionally, ECoT makes it easier for humans to interpret a policy's failures and correct its behavior interactively using natural language. Finally, we show that our model learns to transfer ECoT reasoning to unseen embodiments and tasks.

This work was accepted at the 2024 Conference on Robot Learning. It was also featured in the 2024 State of AI report (slide 77). This work was co-first authored by me and Michał Zawalski.
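
To make the reasoning-before-acting idea concrete, here is a rough Python sketch of how an ECoT-style training target could be assembled from annotated robot data. The field names, coordinate format, and action encoding are illustrative stand-ins, not the exact schema used in the paper.

```python
# Rough sketch of how an ECoT-style training target could be laid out: the
# policy is supervised to emit intermediate reasoning text before the action
# tokens. Field names and formatting are illustrative, not the paper's schema.
example = {
    "task": "put the carrot in the bowl",
    "plan": "locate the carrot; pick it up; move over the bowl; release",
    "subtask": "pick up the carrot",
    "move": "move gripper left and down, then close gripper",
    "gripper_position": [112, 87],          # pixel coordinates (illustrative)
    "visible_objects": {"carrot": [98, 80, 140, 101], "bowl": [200, 60, 260, 120]},
    "action": [0.02, -0.01, -0.03, 0.0, 0.0, 0.1, 1.0],   # 7-DoF delta action
}

# Concatenate the reasoning fields first, then the action tokens last.
reasoning = "\n".join(
    f"{key.upper()}: {value}" for key, value in example.items() if key != "action"
)
target = reasoning + "\nACTION: " + " ".join(str(a) for a in example["action"])
print(target)
```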


PR2L: Vision-Language Models Provide Promptable Representations for Reinforcement Learning

Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied RL. We initialize policies with VLMs by using them as promptable representations: embeddings that are grounded in visual observations and encode semantic features based on the VLM's internal knowledge, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually complex, long-horizon RL tasks in Minecraft and on robot navigation in Habitat. We find that policies trained on embeddings extracted from general-purpose VLMs outperform equivalent policies trained on generic, non-promptable image embeddings. We also find that our approach outperforms instruction-following methods and performs comparably to domain-specific embeddings.
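
As a rough illustration of the promptable-representation idea, the sketch below embeds an observation together with a task-context prompt using a pretrained VLM and feeds the resulting embedding to a small policy head. The VLM interface (`vlm.encode`) and the example prompt are assumptions made for illustration, not the paper's actual API.

```python
import torch
import torch.nn as nn

# Sketch: embed the current observation plus a task prompt with a pretrained
# VLM, then train a small policy head on that embedding with standard RL.
# `vlm.encode` is a hypothetical stand-in for whatever VLM is used.

class PolicyHead(nn.Module):
    def __init__(self, embed_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, vlm_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(vlm_embedding)          # action logits


def promptable_representation(vlm, image, prompt: str) -> torch.Tensor:
    """Hypothetical helper: query the VLM about the observation and return the
    hidden state it produces, rather than its generated text."""
    return vlm.encode(image=image, text=prompt)  # assumed VLM interface


# Schematic usage: the prompt injects task context so the embedding carries
# task-relevant semantics, e.g. "Is there a tree the agent could chop in front
# of it?" for a Minecraft wood-gathering task (illustrative prompt).
# embedding = promptable_representation(vlm, observation, task_prompt)
# logits = PolicyHead(embedding.shape[-1], num_actions=17)(embedding)
```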


Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies

This paper proposes an approach to building 3D scene graphs in arbitrary (indoor and outdoor) environments. Such an extension is challenging: the hierarchy of concepts that describe an outdoor environment is more complex than for indoor ones, and manually defining such a hierarchy is time-consuming and does not scale. Furthermore, the lack of training data prevents the straightforward application of learning-based tools used in indoor settings. To address these challenges, we propose two novel extensions. First, we develop methods to build a spatial ontology defining concepts and relations relevant for indoor and outdoor robot operation. In particular, we use a Large Language Model (LLM) to build such an ontology, largely reducing the amount of manual effort required. Second, we leverage the spatial ontology for 3D scene graph construction using Logic Tensor Networks (LTNs) to add logical rules, or axioms (e.g., "a beach contains sand"), which provide additional supervisory signals at training time, reducing the need for labeled data, improving predictions, and even allowing the prediction of concepts unseen at training time. We test our approach on a variety of datasets, including indoor, rural, and coastal environments, and show that it leads to a significant increase in the quality of 3D scene graph generation with sparsely annotated data.
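
The sketch below shows, in a loose and simplified form, how a single axiom such as "a beach contains sand" can become a differentiable penalty in the spirit of Logic Tensor Networks. The specific fuzzy-logic operator (a Reichenbach-style implication) and the loss shape are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import torch

# Loose sketch: turn the axiom "beach -> contains sand" into a differentiable
# training signal. p_beach and p_sand are the model's predicted probabilities
# that a region is a beach and that it contains sand.

def implication_loss(p_beach: torch.Tensor, p_sand: torch.Tensor) -> torch.Tensor:
    # fuzzy truth value of the implication (Reichenbach form, an assumption here)
    truth = 1.0 - p_beach * (1.0 - p_sand)
    return (1.0 - truth).mean()              # penalize violated axioms

# Example: predictions for a batch of three regions (made-up numbers).
p_beach = torch.tensor([0.9, 0.1, 0.8], requires_grad=True)
p_sand = torch.tensor([0.2, 0.9, 0.95], requires_grad=True)

loss = implication_loss(p_beach, p_sand)
loss.backward()   # gradients push "beach" regions toward also predicting sand
print(float(loss))
```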


LaMPP: Language Models as Probabilistic Priors for Perception and Action

Language models trained on large text corpora encode rich distributional information about real-world environments and action sequences. This information plays a crucial role in current approaches to language processing tasks like question answering and instruction generation. We describe how to leverage language models for non-linguistic perception and control tasks. Our approach casts labeling and decision-making as inference in probabilistic graphical models in which language models parameterize prior distributions over labels, decisions and parameters, making it possible to integrate uncertain observations and incomplete background knowledge in a principled way. Applied to semantic segmentation, household navigation, and activity recognition tasks, this approach improves predictions on rare, out-of-distribution, and structurally novel inputs.
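
A toy numerical example of the underlying idea: an LM-derived prior over labels is combined with a perception model's likelihoods via Bayes' rule. All numbers below are made up purely for illustration.

```python
import numpy as np

# Toy LaMPP-style combination: LM prior x perception likelihood, normalized.
labels = ["kitchen", "bathroom", "bedroom"]

# prior over room labels, e.g. elicited from a language model (made-up values)
lm_prior = np.array([0.80, 0.05, 0.15])

# likelihood of the visual observation under each label, from a noisy
# perception model (made-up values)
perception_likelihood = np.array([0.30, 0.45, 0.25])

posterior = lm_prior * perception_likelihood
posterior /= posterior.sum()

for label, p in zip(labels, posterior):
    print(f"P({label} | observation) = {p:.3f}")
```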


Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding

If you were to ask a robot to fetch you a spoon, it should know that spoons generally belong in kitchens, which also contain stoves and fridges. It should then be able to use this information to navigate to locations it has seen before that are likely to be kitchens. This process requires a lot of semantic common sense. We thus investigated the ability of language models to act as common-sense mechanisms with no task-specific fine-tuning. Specifically, we try to classify rooms in the Matterport3D dataset given the objects contained within each. We do this by summarizing the contents of the room in natural language, then using language models to identify which room category best fits the summary.

This work resulted in a first-author paper submission to the Scaling Robot Learning Workshop at the Robotics: Science and Systems 2022 Conference. I also delivered a spotlight talk and poster presentation on this work there.
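
A minimal sketch of this zero-shot recipe, assuming a masked language model from Hugging Face Transformers and single-token room labels; the template and model choice are illustrative rather than necessarily what was used in the paper.

```python
from transformers import pipeline

# Sketch: summarize a room's contents in natural language, then let a masked
# LM fill in the room type, restricted to a set of candidate labels.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

objects = ["spoon", "stove", "fridge", "sink"]
summary = "This room contains a " + ", a ".join(objects) + "."
query = summary + " This room is a [MASK]."

room_types = ["kitchen", "bathroom", "bedroom", "office", "hallway"]
for candidate in unmasker(query, targets=room_types):
    print(f"{candidate['token_str']:>10}  p={candidate['score']:.3f}")
```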


Motion Planning Methods for Legged Robots

In this project, we explored optimization-based methods for motion planning with legged robots. Specifically, we implemented a nonlinear program that allows a simplified planar simulation of Boston Dynamics' LittleDog to walk and run. The optimization considers robot dynamics and kinematics, time constraints, world physics, and task specifications. By altering when each of the robot's feet is in contact with the ground, we succeeded in getting the robot to both walk and run. We also explored how mixed-integer quadratic programs could be used to tackle similar problems.

This was the final project for MIT's Underactuated Robotics, a graduate-level class taught by Dr. Russ Tedrake.
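
As a much-simplified illustration of how a fixed contact schedule enters such a nonlinear program, the toy sketch below optimizes the vertical motion of a 1D point mass that may only apply ground forces during prescribed contact phases. The dynamics, schedule, and cost are illustrative stand-ins for the far richer formulation used in the actual project.

```python
import numpy as np
from scipy.optimize import minimize

# Toy contact-scheduled trajectory optimization: a 1D point mass that can only
# push off the ground while "in contact". Everything here is illustrative.
N, dt, m, g = 20, 0.05, 1.0, 9.81
in_contact = np.array([1] * 8 + [0] * 6 + [1] * 6)   # prescribed contact schedule

def unpack(z):
    # decision variables: height y[k], velocity v[k], contact force f[k]
    return z[:N], z[N:2 * N], z[2 * N:]

def dynamics_residual(z):
    y, v, f = unpack(z)
    res = []
    for k in range(N - 1):
        res.append(y[k + 1] - (y[k] + dt * v[k]))            # position update
        res.append(v[k + 1] - (v[k] + dt * (f[k] / m - g)))  # velocity update
    return np.array(res)

def cost(z):
    _, _, f = unpack(z)
    return np.sum(f ** 2)                    # prefer small contact forces

# force may only act while the foot is scheduled to be in contact
bounds = [(0.0, None)] * N + [(None, None)] * N + \
         [(0.0, 200.0) if c else (0.0, 0.0) for c in in_contact]

cons = [{"type": "eq", "fun": dynamics_residual},
        {"type": "eq", "fun": lambda z: unpack(z)[0][0] - 1.0},    # start at y = 1 m
        {"type": "eq", "fun": lambda z: unpack(z)[0][-1] - 1.0}]   # end at y = 1 m

z0 = np.concatenate([np.ones(N), np.zeros(N), m * g * np.ones(N)])
sol = minimize(cost, z0, bounds=bounds, constraints=cons, method="SLSQP")
print("solved:", sol.success)
```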


Visual Pick-and-place for Robot Arms with Deep Learning

This was my final group project for MIT's Intelligent Robot Manipulation, a graduate-level class taught by Dr. Russ Tedrake. The goal of this project was to use visual perception to grasp an object of interest from a bin and place it upright on a nearby table using a 7-axis robot arm in the PyDrake simulation environment. Our full end-to-end system features a deep neural network for visual object pose estimation, an interpolator for trajectory planning, and a nonlinear optimization-based inverse kinematics controller that solves for the desired control inputs.
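
To give a flavor of the trajectory-planning component, here is a small sketch of interpolating end-effector pose keyframes (linear interpolation for position, spherical linear interpolation for orientation) with SciPy. The keyframe poses and timings are invented for illustration; each sampled pose would then be handed to the optimization-based IK controller.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

# Keyframe end-effector poses for pick -> lift -> place (made-up numbers).
times = np.array([0.0, 2.0, 4.0])
positions = np.array([[0.5, 0.0, 0.10],    # grasp in the bin
                      [0.5, 0.0, 0.40],    # lift
                      [0.2, 0.4, 0.25]])   # place on the table
rotations = Rotation.from_euler(
    "xyz", [[180, 0, 0], [180, 0, 0], [180, 0, 90]], degrees=True)

slerp = Slerp(times, rotations)            # orientation interpolation

def ee_pose(t):
    """Linearly interpolate position and slerp orientation at time t."""
    p = np.array([np.interp(t, times, positions[:, i]) for i in range(3)])
    return p, slerp([t]).as_quat()[0]

# Sample the trajectory; each pose would be fed to the IK solver.
for t in np.linspace(0.0, 4.0, 9):
    pos, quat = ee_pose(t)
    print(f"t={t:.1f}s  pos={np.round(pos, 3)}  quat={np.round(quat, 3)}")
```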


Neural Transformer Language Models as Natural Language Knowledge Bases

Traditional knowledge bases contain hand-engineered compilations of facts and relationships. Because of this, they have limited scalability and require significant resources to maintain, but nevertheless are easy to understand, query, and extend with new information. Neural language models, on the other hand, encode statistical information about grammar. They can be used as question-answering systems akin to knowledge bases -- for masked language models, this can be achieved with cloze-style questions. Still, they are hard to update and interpret. We thus investigated the BERT transformer language model's ability to answer said questions with Meta Research's LAnguage Model Analysis (LAMA) probe, as well as said model's ability to learn or memorize new information (including nonsense).

This was the final project for MIT's Advanced Natural Language Processing, a graduate-level class taught by Dr. Jacob Andreas and Dr. Yoon Kim. The following fall semester, I accepted a teaching assistantship for the class.
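
For concreteness, here is a minimal sketch of a LAMA-style cloze query against a masked language model using Hugging Face Transformers; the fact and its phrasing are illustrative rather than drawn from the actual LAMA relation templates.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Cloze-style probing: ask BERT to fill in the masked token of a factual
# statement and inspect its top predictions.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

query = "The capital of France is [MASK]."
inputs = tokenizer(query, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

top = torch.topk(logits.softmax(dim=-1), k=5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx)):>10}  p={prob.item():.3f}")
```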