I am a graduate student at UC Berkeley's RAIL Lab pursuing a PhD
in Computer Science, advised by Dr. Sergey Levine.
My research focuses on Robots That Reason.
Previously, I received a Master of Engineering in Electrical Engineering and Computer Science from MIT, after completing my Bachelor's degree in the same department. I conducted my thesis work in the SPARK Lab under Dr. Luca Carlone and Dr. Jacob Andreas, where I investigated the use of language models to improve robots' abilities to understand their environments.
I was also a robotics intern at NASA's Jet Propulsion Laboratory.
William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, Sergey
Levine
Paper · Website
Robot chain-of-thought (CoT) reasoning, wherein a model predicts helpful intermediate representations before choosing actions, is an effective method for improving the generalization and performance of robot policies, especially vision-language-action models (VLAs). However, such approaches suffer from core limitations, like needing specialized robot reasoning data and slow inference speeds. To design new robot reasoning approaches that address these issues, a more complete characterization of why reasoning helps policy performance is critical. We hypothesize three mechanisms by which robot reasoning improves policies: (1) better representation learning, (2) improved learning curricularization, and (3) increased expressivity. We then devise simple variants of robot CoT reasoning to isolate and test each one. We find that learning to generate reasonings does lead to better VLA representations, while attending to the reasonings aids in actually leveraging these features for improved action prediction. Our results provide a better understanding of why CoT reasoning helps VLAs, which we use to introduce two simple and lightweight alternative recipes for robot reasoning. Our proposed approaches achieve significant performance gains over non-reasoning policies, state-of-the-art results on the LIBERO-90 benchmark, and a 3x inference speedup compared to standard robot reasoning.
This work was accepted at the 2025 Conference on Robot Learning, where it was selected as an Oral Presentation.
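The decoding order at the heart of robot CoT can be illustrated with a toy sketch. Here `generate_step` and its scripted strings are invented stand-ins for a real autoregressive VLA; the point is only the control flow: reasoning tokens are decoded first, and the action is decoded conditioned on them.

```python
# Toy sketch of chain-of-thought action decoding. `generate_step` is an
# invented stand-in for an autoregressive VLA's next-token function.
REASONING_DONE = "<end_of_reasoning>"

def generate_step(prefix):
    """Scripted stand-in for the model's next "token" (whole lines here)."""
    script = [
        "TASK: pick up the cup",
        "PLAN: reach, grasp, lift",
        REASONING_DONE,
        "ACTION: [0.10, -0.20, 0.30, 1.00]",
    ]
    return script[len(prefix)]

def cot_policy():
    tokens = []
    # 1) Decode intermediate reasoning until the model emits a stop marker.
    while not tokens or tokens[-1] != REASONING_DONE:
        tokens.append(generate_step(tokens))
    # 2) Only then decode the action, conditioned on all reasoning above.
    action = generate_step(tokens)
    return tokens[:-1], action

reasoning, action = cot_policy()
```

The extra decoding pass in step 1 is also where the inference-speed cost mentioned above comes from: every reasoning token must be generated before any action is available.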
Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, Sergey Levine
Paper · Website ·
Code · Models · Colab
We introduce Embodied Chain-of-Thought Reasoning (ECoT) for vision-language-action models (VLAs), in which we train them to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features like object bounding boxes and end effector positions, before predicting robot actions. We design a scalable pipeline for generating synthetic training data for ECoT on large robot datasets. We demonstrate that ECoT increases the absolute success rate of OpenVLA, the current strongest open-source VLA policy, by 28% across challenging generalization tasks without any additional robot training data. Additionally, ECoT makes it easier for humans to interpret a policy's failures and correct its behavior interactively using natural language. Finally, we show that our model learns to transfer ECoT reasonings to unseen embodiments and tasks.
This work was accepted at the 2024 Conference on Robot Learning. It was also featured in the 2024 State of AI report (slide 77). This work was co-first authored by me and Michał Zawalski.
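A rough sketch of how one embodied chain-of-thought training target might be serialized into an autoregressive string; the field names and contents below are illustrative placeholders, not the paper's exact schema.

```python
# Illustrative sketch of serializing one embodied chain-of-thought (ECoT)
# training example. Field names and values are invented placeholders.
def serialize_ecot(example):
    parts = [
        f"TASK: {example['task']}",
        f"PLAN: {example['plan']}",
        f"SUBTASK: {example['subtask']}",
        f"MOVE: {example['move']}",
        f"GRIPPER: {example['gripper_xy']}",
        f"OBJECTS: {example['bboxes']}",
        f"ACTION: {example['action']}",
    ]
    return "\n".join(parts)

target = serialize_ecot({
    "task": "put the carrot on the plate",
    "plan": "locate carrot; grasp it; move to plate; release",
    "subtask": "grasp the carrot",
    "move": "move gripper down and forward",
    "gripper_xy": (112, 87),
    "bboxes": {"carrot": (90, 70, 130, 100), "plate": (200, 60, 280, 140)},
    "action": [0.02, 0.05, -0.03, 0.0, 0.0, 0.0, 1.0],
})
```

Ordering the fields from abstract (plan) to grounded (bounding boxes, gripper position) before the action is what lets a single next-token objective supervise the whole chain.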
William Chen, Oier Mees, Aviral Kumar, Sergey Levine
Paper · Website
Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied RL. We initialize policies with VLMs by using them as promptable representations: embeddings that encode semantic features of visual observations based on the VLM's internal knowledge and reasoning capabilities, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually complex, long-horizon RL tasks in Minecraft and robot navigation in Habitat. We find that our policies trained on embeddings from off-the-shelf, general-purpose VLMs outperform equivalent policies trained on generic, non-promptable image embeddings. We also find our approach outperforms instruction-following methods and performs comparably to domain-specific embeddings. Finally, we show that our approach can use chain-of-thought prompting to produce representations of common-sense semantic reasoning, improving policy performance in novel scenes by 1.5 times.
This work was published in the Transactions on Machine Learning Research (March 2025).
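A minimal sketch of the promptable-representation idea: a frozen VLM encoder (mocked below with `mock_vlm_embed`) embeds an observation conditioned on a task-context prompt, and a small learned policy head consumes that embedding. The prompt text, embedding dimension, and untrained linear head are all invented for illustration.

```python
# Sketch of promptable representations for RL. `mock_vlm_embed` is a
# deterministic stand-in for a real prompt-conditioned VLM encoder.
import hashlib

def mock_vlm_embed(image, prompt, dim=8):
    """Hash the (observation, prompt) pair into a fake embedding in [0, 1]."""
    digest = hashlib.sha256((str(image) + prompt).encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def policy_head(embedding, weights):
    """Tiny linear policy over the promptable representation."""
    scores = [sum(w * e for w, e in zip(row, embedding)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)

# Prompt carries task context and auxiliary knowledge, as in the abstract.
prompt = ("Task: collect wood. Useful context: wood comes from trees. "
          "Does the image show a tree?")
emb = mock_vlm_embed("frame_0", prompt)
weights = [[1.0] * 8, [-1.0] * 8]  # two discrete actions, untrained
action = policy_head(emb, weights)
```

The design point is that only `policy_head` is trained; swapping the prompt changes the representation without touching the (frozen) VLM.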
Jared Strader, Nathan Hughes, William Chen, Alberto Speranzon, Luca Carlone
Paper
This paper proposes an approach to building 3D scene graphs in arbitrary (indoor and outdoor) environments. Such an extension is challenging: the hierarchy of concepts that describe an outdoor environment is more complex than for indoor ones, and manually defining such a hierarchy is time-consuming and does not scale. Furthermore, the lack of training data prevents the straightforward application of learning-based tools used in indoor settings. To address these challenges, we propose two novel extensions. First, we develop methods to build a spatial ontology defining concepts and relations relevant to indoor and outdoor robot operation. In particular, we use a Large Language Model (LLM) to build this ontology, largely reducing the amount of manual effort required. Second, we leverage the spatial ontology for 3D scene graph construction using Logic Tensor Networks (LTNs) to add logical rules, or axioms (e.g., "a beach contains sand"), which provide additional supervisory signals at training time, thus reducing the need for labelled data, improving predictions, and even allowing the prediction of concepts unseen at training time. We test our approach on a variety of datasets, including indoor, rural, and coastal environments, and show that it leads to a significant increase in the quality of 3D scene graph generation with sparsely annotated data.
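The axiom-as-supervision idea can be sketched with fuzzy logic. In a real Logic Tensor Network, predicate truth values come from neural networks and the penalty is backpropagated through them; the plain scalar scores and Łukasiewicz implication below are a simplified stand-in.

```python
# Hedged sketch of an LTN-style axiom penalty. Predicate truth values are
# plain scalars in [0, 1] here, standing in for neural predicate groundings.
def implies(a, b):
    """Lukasiewicz fuzzy implication: truth of (a -> b) in [0, 1]."""
    return min(1.0, 1.0 - a + b)

def axiom_loss(is_beach, contains_sand):
    # Axiom: for all regions r, beach(r) -> contains(r, sand).
    # The penalty is large when the axiom's truth value is low.
    return 1.0 - implies(is_beach, contains_sand)

# A region scored as a beach (0.9) but with little sand evidence (0.2)
# violates the axiom, producing a supervisory signal without any labels.
inconsistent = axiom_loss(0.9, 0.2)
consistent = axiom_loss(0.9, 0.95)
```

This is how axioms reduce the need for annotations: an unlabelled region that violates "a beach contains sand" still yields a gradient for the predictor.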
Belinda Z. Li, William Chen, Pratyusha Sharma, Jacob Andreas
Paper
Language models trained on large text corpora encode rich distributional information about real-world environments and action sequences. This information plays a crucial role in current approaches to language processing tasks like question answering and instruction generation. We describe how to leverage language models for *non-linguistic* perception and control tasks. Our approach casts labeling and decision-making as inference in probabilistic graphical models in which language models parameterize prior distributions over labels, decisions and parameters, making it possible to integrate uncertain observations and incomplete background knowledge in a principled way. Applied to semantic segmentation, household navigation, and activity recognition tasks, this approach improves predictions on rare, out-of-distribution, and structurally novel inputs.
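The graphical-model framing can be sketched in a few lines: the posterior over labels is proportional to an LM prior times an observation likelihood. `lm_log_prior` below is a hard-coded mock of what an LM would score, and the likelihood values are invented numbers.

```python
# Sketch of a language model as a prior in a labeling task:
# posterior(label) ∝ p_LM(label | context) * p(observation | label).
import math

def lm_log_prior(label, context):
    # Mock of distributional knowledge an LM would supply:
    # mugs are far likelier than wrenches in a kitchen.
    table = {("kitchen", "mug"): -0.5, ("kitchen", "wrench"): -3.0}
    return table.get((context, label), -2.0)

def obs_log_likelihood(obs, label):
    # Stand-in for a noisy perception model; here it is nearly
    # uninformative, so the LM prior drives the decision.
    return {"mug": -1.0, "wrench": -0.9}.get(label, -5.0)

def posterior(labels, context, obs):
    logs = [lm_log_prior(l, context) + obs_log_likelihood(obs, l)
            for l in labels]
    log_z = math.log(sum(math.exp(x) for x in logs))  # normalizer
    return {l: math.exp(x - log_z) for l, x in zip(labels, logs)}

post = posterior(["mug", "wrench"], context="kitchen", obs=None)
```

Because both terms are log-probabilities, uncertain observations and incomplete background knowledge combine by simple addition before normalization, which is the "principled integration" the abstract refers to.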
Ph.D. - Computer Science (June 2023 - Present)
Supported by the NDSEG Fellowship.
M.Eng. - Electrical Engineering and Computer Science (Feb 2022 - June 2023)
B.S. - Electrical Engineering and Computer Science (Aug 2019 - Feb 2022)
High School (Sept 2015 - June 2019)