· UC Berkeley graduate student
· Pursuing PhD in Computer Science, advised by Dr. Sergey Levine
· Researching robotics, decision-making, and language
· Former MIT student
· Master of Engineering (2023) and Bachelor of Science (2022) in Electrical Engineering
and Computer Science
· Teaching Assistant for Robotics: Science and Systems and Natural Language Processing
· Former robotics intern at NASA's Jet Propulsion Laboratory
University of California, Berkeley
Doctor of Philosophy (Ph.D.) - Computer Science
Advised by Dr. Sergey Levine
June 2023 - Present
Massachusetts Institute of Technology
Master of Engineering (M.Eng.) - Electrical Engineering and Computer Science
Large Language Models & Beyond (Language AI Seminar)
Bachelor of Science (S.B.) - Electrical Engineering and Computer Science
GPA: 5.0 / 5.0
Underactuated Robotics
Computational Sensorimotor Learning
Intelligent Robot Manipulation
Robotics: Science and Systems (TA)
Intro to Deep Learning (TA)
Advanced Natural Language Processing (TA)
Computational Cognitive Science
Feedback Control Design
Design and Analysis of Algorithms
Feb 2022 - May 2023 (M.Eng.)
Aug 2019 - Feb 2022 (S.B.)
The Bronx High School of Science
New York City Specialized High School
Sep 2015 - June 2019
I am a graduate student at UC Berkeley pursuing a PhD in Computer Science under Dr. Sergey Levine. My focus is on the
intersection of robotics, decision-making, and natural language processing.
Previously, I received a Master of Engineering in Electrical Engineering and Computer Science from
MIT, after having finished my Bachelor's degree in the same
department. I conducted my thesis work in the SPARK Lab,
where I investigated the use of
natural language processing tools to improve robots' abilities to understand their environments,
advised by Dr. Luca Carlone and Dr. Jacob Andreas.
As an undergraduate, I was involved in robotics research at MIT's Computer Science
and Artificial Intelligence
Laboratory and Laboratory for Information and Decision
Systems. Moreover, I was a teaching assistant for the classes Introduction to Deep Learning
(winter 2021) and Robotics: Science and Systems (spring 2021). I reprised the latter role as a
graduate student in spring 2022, subsequently accepting a teaching assistantship for Natural
Language Processing in fall 2022 as well.
Here are some of the projects I have undertaken, be they for research or
classes. More information can be found on a project's corresponding GitHub repository.
ECoT: Robotic Control via Embodied Chain-of-Thought Reasoning
We introduce
Embodied Chain-of-Thought Reasoning (ECoT) for vision-language-action models (VLAs), in
which we train them to perform multiple steps of reasoning about plans, sub-tasks, motions,
and visually grounded features like object bounding boxes and end effector positions,
before predicting robot actions. We design a scalable pipeline for generating
synthetic training data for ECoT on large robot datasets. We demonstrate that ECoT
increases the absolute success rate of OpenVLA, the current strongest open-source
VLA policy, by 28% across challenging generalization tasks without any additional
robot training data. Additionally, ECoT makes it easier for humans to interpret
a policy's failures and correct its behavior interactively using natural language.
Finally, we show that our model learns to transfer ECoT reasonings to unseen
embodiments and tasks.
Vision-Language Models Provide Promptable Representations for Reinforcement Learning
Humans can quickly learn new behaviors by leveraging background world knowledge. In
contrast, agents trained with reinforcement learning (RL) typically learn behaviors from
scratch. We thus propose a novel approach that uses the vast amounts of general and
indexable world knowledge encoded in vision-language models (VLMs) pre-trained on
Internet-scale data for embodied RL. We initialize policies with VLMs by using them as
promptable representations: embeddings that are grounded in visual observations and encode
semantic features based on the VLM's internal knowledge, as elicited through prompts that
provide task context and auxiliary information. We evaluate our approach on
visually complex, long-horizon RL tasks in Minecraft and robot navigation in Habitat. We
find that our policies trained on embeddings extracted from general-purpose VLMs outperform
equivalent policies trained on generic, non-promptable image embeddings. We also find our
approach outperforms instruction-following methods and performs comparably to
domain-specific embeddings.
Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies
This paper proposes an approach to build 3D scene graphs in arbitrary (indoor and outdoor)
environments. Such an extension is challenging: the hierarchy of concepts that describe an
outdoor environment is more complex than for indoors, and manually defining such a hierarchy
is time-consuming and does not scale. Furthermore, the lack of training data prevents the
straightforward application of learning-based tools used in indoor settings. To address
these challenges, we propose two novel extensions. First, we develop methods to build a
spatial ontology defining concepts and relations relevant for indoor and outdoor robot
operation. In particular, we use a Large Language Model (LLM) to build such an ontology,
thus largely reducing the amount of manual effort required. Second, we leverage the spatial
ontology for 3D scene graph construction using Logic Tensor Networks (LTN) to add logical
rules, or axioms (e.g., "a beach contains sand"), which provide additional supervisory
signals at training time, thus reducing the need for labelled data, providing better
predictions, and even allowing the prediction of concepts unseen at training time. We test our
approach in a variety of datasets, including indoor, rural, and coastal environments, and
show that it leads to a significant increase in the quality of the 3D scene graph generation
with sparsely annotated data.
LaMPP: Language Models as Probabilistic Priors for Perception and Action
Language models trained on large text corpora encode rich distributional information about
real-world environments and action sequences. This information plays a crucial role in
current approaches to language processing tasks like question answering and instruction
generation. We describe how to leverage language models for *non-linguistic* perception and
control tasks. Our approach casts labeling and decision-making as inference in probabilistic
graphical models in which language models parameterize prior distributions over labels,
decisions and parameters, making it possible to integrate uncertain observations and
incomplete background knowledge in a principled way. Applied to semantic segmentation,
household navigation, and activity recognition tasks, this approach improves predictions on
rare, out-of-distribution, and structurally novel inputs.
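As a rough illustration of the idea above (not the paper's implementation), the sketch below combines a prior over labels with an uncertain observation likelihood using Bayes' rule. The prior values stand in for probabilities a language model might assign given some context. All names and numbers here are hypothetical and chosen only for illustration.

```python
def posterior(prior, likelihood):
    """Combine a prior over labels with observation likelihoods via Bayes' rule."""
    unnormalized = {label: prior[label] * likelihood[label] for label in prior}
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


# Hypothetical prior p(label | "this is a kitchen"), standing in for language model scores.
lm_prior = {"stove": 0.6, "bed": 0.1, "sofa": 0.3}

# Hypothetical likelihood p(observation | label) from a noisy perception model.
obs_likelihood = {"stove": 0.3, "bed": 0.5, "sofa": 0.2}

post = posterior(lm_prior, obs_likelihood)
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

In this toy case the perception model alone would prefer "bed", but the prior shifts the posterior toward "stove", which is the kind of correction on ambiguous inputs that the abstract describes.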
Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding
If you were to ask a robot to fetch you a spoon, it should know that spoons generally belong
in kitchens, which also contain stoves and fridges. It should then be able to use this
information to navigate to locations it's seen before that are likely to be kitchens. This
process requires a lot of semantic common sense. We thus investigated the ability of
language models to act as common sense mechanisms with
no task-specific fine-tuning. Specifically, we classified rooms in the Matterport3D
dataset given the objects contained within each. We did this by summarizing the contents of
the room in natural language, then using language models to identify which room category
best fits the summary.
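The summarize-then-score procedure described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `summarize_room`, `classify_room`, and the keyword-based `toy_score` are hypothetical names, and `toy_score` is only a placeholder for the score a real language model would assign to a summary paired with a candidate room label.

```python
def summarize_room(objects):
    """Turn a non-empty list of object labels into a one-sentence summary."""
    if not objects:
        raise ValueError("expected at least one object")
    if len(objects) == 1:
        listing = objects[0]
    else:
        listing = ", ".join(objects[:-1]) + " and " + objects[-1]
    return f"This room contains {listing}."


def classify_room(objects, room_labels, score_fn):
    """Pick the room label whose pairing with the summary scores highest."""
    summary = summarize_room(objects)
    scores = {room: score_fn(summary, room) for room in room_labels}
    return max(scores, key=scores.get)


# Placeholder scorer (assumption): counts hand-written keyword hints per room.
# A real system would instead query a language model for this score.
HINTS = {
    "kitchen": {"stove", "fridge", "spoon"},
    "bedroom": {"bed", "pillow"},
    "bathroom": {"toilet", "sink"},
}


def toy_score(summary, room):
    words = set(summary.lower().replace(",", " ").replace(".", " ").split())
    return len(words & HINTS[room])


prediction = classify_room(["stove", "fridge", "sink"], list(HINTS), toy_score)
print(prediction)
```

Passing the scorer in as a function keeps the summarization and label-selection logic independent of whichever language model is used to produce the scores.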
This work resulted in a first-author paper submission to the Scaling Robot Learning Workshop
at the Robotics: Science and Systems 2022 Conference. I also delivered a spotlight talk and
poster presentation on this work there.
In this project, we explored optimization-based methods for motion planning with legged
robots. Specifically, we implemented a nonlinear program that allows a simulated, simplified
planar version of Boston Dynamics' Little Dog to walk and run.
The optimization considers robot dynamics and kinematics, time constraints, world physics,
and task specifications. By altering when each of the robot's feet is in contact with the
ground, we got the robot to both
walk and run. We also explored how mixed-integer quadratic programs could be used to tackle
similar problems.
This was the final project for MIT's Underactuated Robotics, a
graduate-level class taught by Dr. Russ Tedrake.
Visual Pick-and-place for Robot Arms with Deep Learning
This was my final group project for MIT's Intelligent Robot Manipulation, a
graduate-level class taught by Dr. Russ Tedrake. The goal of this project
was to use visual means to grasp an object of interest from a bin and place it upright on a
nearby table using a 7-axis robot arm in the PyDrake
simulation environment.
Our full end-to-end system features a deep neural network for visual object pose estimation,
an interpolator for trajectory planning, and a nonlinear optimization-based inverse
kinematics controller to solve for desired control inputs.
Neural Transformer Language Models as Natural Language Knowledge Bases
Traditional knowledge bases contain hand-engineered compilations of facts and relationships.
Because of this, they have limited scalability and require significant resources to
maintain, but are nevertheless easy to understand, query, and extend with new information.
Neural language models, on the other hand, encode statistical information about grammar.
They can be used as question-answering systems akin to knowledge bases; for masked language
models, this can be achieved with cloze-style questions. Still, they are hard to update and
interpret. We thus investigated the BERT transformer language model's ability to answer
such questions using Meta Research's LAnguage Model Analysis (LAMA) probe, as well as the
model's ability to learn or memorize new information (including nonsense).
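To make the cloze-style setup concrete, the sketch below builds masked queries from a relation template and measures precision at 1 against known facts. It is only an illustration under stated assumptions: the `[X]`/`[Y]` template placeholders, the example facts, and `lookup_predictor` (a fixed lookup table standing in for a masked language model's top prediction) are all hypothetical and are not taken from the project.

```python
MASK = "[MASK]"  # BERT's mask token


def make_cloze(template, subject):
    """Fill the subject slot and mask the object slot of a relation template."""
    return template.replace("[X]", subject).replace("[Y]", MASK)


def precision_at_1(facts, template, predict_fn):
    """Fraction of facts whose object matches the predictor's top answer."""
    hits = sum(predict_fn(make_cloze(template, subj)) == obj for subj, obj in facts)
    return hits / len(facts)


def lookup_predictor(query):
    """Hypothetical stand-in for a masked language model's top-1 prediction."""
    memory = {
        "Paris is the capital of [MASK].": "France",
        "Rome is the capital of [MASK].": "Spain",  # deliberately wrong answer
    }
    return memory.get(query)


template = "[X] is the capital of [Y]."
facts = [("Paris", "France"), ("Rome", "Italy")]
score = precision_at_1(facts, template, lookup_predictor)
print(score)
```

Swapping `lookup_predictor` for a function that queries an actual masked language model would leave the evaluation loop unchanged.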
This was the final project for MIT's Advanced Natural Language Processing, a
graduate-level class taught by Dr. Jacob Andreas and Dr. Yoon Kim. The
following fall semester, I accepted a teaching assistantship for said class.