Many of our research projects are in one of the following general themes. Note that this page is still being updated to include all publications.

Deformable Object Manipulation

Deformable objects are challenging from both a perceptual and a dynamics perspective: a crumpled cloth has many self-occlusions, so its configuration is hard to infer from observations; further, the dynamics of cloth are complex to model and to incorporate into planning algorithms. We develop algorithms for manipulation tasks involving cloth, liquids, dough, and articulated objects.

Relevant Publications

	Learning Generalizable Tool-use Skills through Trajectory Generation Carl Qi, Yilin Wu, Lifan Yu, Haoyue Liu, Bowen Jiang, Xingyu Lin†, David Held† @inproceedings{qitooluse2024, title={Learning Generalizable Tool-use Skills through Trajectory Generation}, author={Qi, Carl and Wu, Yilin and Yu, Lifan and Liu, Haoyue and Jiang, Bowen and Lin, Xingyu and Held, David}, booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, year={2024} } Autonomous systems that efficiently utilize tools can assist humans in completing many common tasks such as cooking and cleaning. However, current systems fall short of matching human-level intelligence in terms of adapting to novel tools. Prior works based on affordance often make strong assumptions about the environments and cannot scale to more complex, contact-rich tasks. In this work, we tackle this challenge and explore how agents can learn to use previously unseen tools to manipulate deformable objects. We propose to learn a generative model of the tool-use trajectories as a sequence of tool point clouds, which generalizes to different tool shapes. Given any novel tool, we first generate a tool-use trajectory and then optimize the sequence of tool poses to align with the generated trajectory. We train a single model on four different challenging deformable object manipulation tasks, using demonstration data from only one tool per task. The model generalizes to various novel tools, significantly outperforming baselines. We further test our trained policy in the real world with unseen tools, where it achieves performance comparable to that of humans. International Conference on Intelligent Robots and Systems (IROS), 2024 [Project Page] [Bibtex] [Abstract] [arXiv]
	Force-Constrained Visual Policy: Safe Robot-Assisted Dressing via Multi-Modal Sensing Zhanyi Sun, Yufei Wang, David Held†, Zackory Erickson† @article{sun2024force, title={Force-Constrained Visual Policy: Safe Robot-Assisted Dressing via Multi-Modal Sensing}, author={Sun, Zhanyi and Wang, Yufei and Held, David and Erickson, Zackory}, journal={IEEE Robotics and Automation Letters}, year={2024} } Robot-assisted dressing could profoundly enhance the quality of life of adults with physical disabilities. To achieve this, a robot can benefit from both visual and force sensing. The former enables the robot to ascertain human body pose and garment deformations, while the latter helps maintain safety and comfort during the dressing process. In this paper, we introduce a new technique that leverages both vision and force modalities for this assistive task. Our approach first trains a vision-based dressing policy using reinforcement learning in simulation with varying body sizes, poses, and types of garments. We then learn a force dynamics model for action planning to ensure safety. Due to limitations of simulating accurate force data when deformable garments interact with the human body, we learn a force dynamics model directly from real-world data. Our proposed method combines the vision-based policy, trained in simulation, with the force dynamics model, learned in the real world, by solving a constrained optimization problem to infer actions that facilitate the dressing process without applying excessive force on the person. We evaluate our system in simulation and in a real-world human study with 10 participants across 240 dressing trials, showing it greatly outperforms prior baselines. Video demonstrations are available on our project website. Robotics and Automation Letters (RAL), 2024 [Project Page] [Bibtex] [Abstract] [arXiv]
	One Policy to Dress Them All: Learning to Dress People with Diverse Poses and Garments Yufei Wang, Zhanyi Sun, Zackory Erickson, David Held @inproceedings{Wang2023One, title={One Policy to Dress Them All: Learning to Dress People with Diverse Poses and Garments}, author={Wang, Yufei and Sun, Zhanyi and Erickson, Zackory and Held, David}, booktitle={Robotics: Science and Systems (RSS)}, year={2023} } Robot-assisted dressing could benefit the lives of many people such as older adults and individuals with disabilities. Despite such potential, robot-assisted dressing remains a challenging task for robotics as it involves complex manipulation of deformable cloth in 3D space. Many prior works aim to solve the robot-assisted dressing task, but they make certain assumptions such as a fixed garment and a fixed arm pose that limit their ability to generalize. In this work, we develop a robot-assisted dressing system that is able to dress different garments on people with diverse poses from partial point cloud observations, based on a learned policy. We show that with proper design of the policy architecture and Q function, reinforcement learning (RL) can be used to learn effective policies with partial point cloud observations that work well for dressing diverse garments. We further leverage policy distillation to combine multiple policies trained on different ranges of human arm poses into a single policy that works over a wide range of different arm poses. We conduct comprehensive real-world evaluations of our system with 510 dressing trials in a human study with 17 participants with different arm poses and dressed garments. Our system is able to dress 86% of the length of the participants' arms on average. Videos can be found on the project webpage: https://sites.google.com/view/one-policy-dress. Robotics: Science and Systems (RSS), 2023 [Project Page] [Bibtex] [Abstract] [arXiv]
	Self-supervised Cloth Reconstruction via Action-conditioned Cloth Tracking Zixuan Huang, Xingyu Lin, David Held @inproceedings{huang2023act, title={Self-supervised Cloth Reconstruction via Action-conditioned Cloth Tracking}, author={Huang, Zixuan and Lin, Xingyu and Held, David}, booktitle={IEEE International Conference on Robotics and Automation (ICRA), 2023}, year={2023} } State estimation is one of the greatest challenges for cloth manipulation due to cloth's high dimensionality and self-occlusion. Prior works propose to identify the full state of crumpled clothes by training a mesh reconstruction model in simulation. However, such models are prone to suffer from a sim-to-real gap due to differences between cloth simulation and the real world. In this work, we propose a self-supervised method to finetune a mesh reconstruction model in the real world. Since the full mesh of crumpled cloth is difficult to obtain in the real world, we design a special data collection scheme and an action-conditioned model-based cloth tracking method to generate pseudo-labels for self-supervised learning. By finetuning the pretrained mesh reconstruction model on this pseudo-labeled dataset, we show that we can improve the quality of the reconstructed mesh without requiring human annotations, and improve the performance of downstream manipulation tasks. International Conference on Robotics and Automation (ICRA), 2023 [Project Page] [Bibtex] [Abstract] [arXiv]
	ToolFlowNet: Robotic Manipulation with Tools via Predicting Tool Flow from Point Clouds Daniel Seita, Yufei Wang†, Sarthak J Shetty†, Edward Yao Li†, Zackory Erickson, David Held @inproceedings{Seita2022toolflownet, title={{ToolFlowNet: Robotic Manipulation with Tools via Predicting Tool Flow from Point Clouds}}, author={Seita, Daniel and Wang, Yufei and Shetty, Sarthak and Li, Edward and Erickson, Zackory and Held, David}, booktitle={Conference on Robot Learning (CoRL)}, year={2022} } Point clouds are a widely available and canonical data modality which conveys the 3D geometry of a scene. Despite significant progress in classification and segmentation from point clouds, policy learning from such a modality remains challenging, and most prior works in imitation learning focus on learning policies from images or state information. In this paper, we propose a novel framework for learning policies from point clouds for robotic manipulation with tools. We use a novel neural network, ToolFlowNet, which predicts dense per-point flow on the tool that the robot controls, and then uses the flow to derive the transformation that the robot should execute. We apply this framework to imitation learning of challenging deformable object manipulation tasks with continuous movement of tools, including scooping and pouring, and demonstrate significantly improved performance over baselines which do not use flow. We perform 50 physical scooping experiments with ToolFlowNet and attain 82% scooping success. See https://tinyurl.com/toolflownet for supplementary material. Conference on Robot Learning (CoRL), 2022 [Project Page] [Bibtex] [Abstract] [arXiv] [Poster] [Talk]
	Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation Xingyu Lin, Carl Qi, Yunchu Zhang, Zhiao Huang, Katerina Fragkiadaki, Yunzhu Li, Chuang Gan, David Held @inproceedings{lin2022planning, title={Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation}, author={Xingyu Lin and Carl Qi and Yunchu Zhang and Yunzhu Li and Zhiao Huang and Katerina Fragkiadaki and Chuang Gan and David Held}, booktitle={6th Annual Conference on Robot Learning}, year={2022}, url={https://openreview.net/forum?id=tyxyBj2w4vw} } Effective planning of long-horizon deformable object manipulation requires suitable abstractions at both the spatial and temporal levels. Previous methods typically either focus on short-horizon tasks or make strong assumptions that full-state information is available, which prevents their use on deformable objects. In this paper, we propose PlAnning with Spatial-Temporal Abstraction (PASTA), which incorporates both spatial abstraction (reasoning about objects and their relations to each other) and temporal abstraction (reasoning over skills instead of low-level actions). Our framework maps high-dimensional 3D observations such as point clouds into a set of latent vectors and plans over skill sequences on top of the latent set representation. We show that our method can effectively perform challenging sequential deformable object manipulation tasks in the real world, which require combining multiple tool-use skills such as cutting with a knife, pushing with a pusher, and spreading dough with a roller. Conference on Robot Learning (CoRL), 2022 [Project Page] [Bibtex] [Abstract] [OpenReview] [PDF] [Code]
	Learning to Singulate Layers of Cloth using Tactile Feedback Sashank Tirumala, Thomas Weng, Daniel Seita*, Oliver Kroemer, Zeynep Temel, David Held @inproceedings{tirumala2022reskin, author={Tirumala, Sashank and Weng, Thomas and Seita, Daniel and Kroemer, Oliver and Temel, Zeynep and Held, David}, booktitle={2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, title={Learning to Singulate Layers of Cloth using Tactile Feedback}, year={2022}, volume={}, number={}, pages={7773-7780}, doi={10.1109/IROS47612.2022.9981341} } Robotic manipulation of cloth has applications ranging from fabrics manufacturing to handling blankets and laundry. Cloth manipulation is challenging for robots largely due to cloth's high degrees of freedom, complex dynamics, and severe self-occlusions when in folded or crumpled configurations. Prior work on robotic manipulation of cloth relies primarily on vision sensors alone, which may pose challenges for fine-grained manipulation tasks such as grasping a desired number of cloth layers from a stack of cloth. In this paper, we propose to use tactile sensing for cloth manipulation; we attach a tactile sensor (ReSkin) to one of the two fingertips of a Franka robot and train a classifier to determine whether the robot is grasping a specific number of cloth layers. During test-time experiments, the robot uses this classifier as part of its policy to grasp one or two cloth layers using tactile feedback to determine suitable grasping points. Experimental results over 180 physical trials suggest that the proposed method outperforms baselines that do not use tactile feedback and has a better generalization to unseen fabrics compared to methods that use image classifiers. International Conference on Intelligent Robots and Systems (IROS), 2022 - Best Paper at ROMADO-SI [Project Page] [Bibtex] [Abstract] [arXiv] [Code]
	Learning Closed-loop Dough Manipulation using a Differentiable Reset Module Carl Qi, Xingyu Lin, David Held @article{qi2022dough, author={Qi, Carl and Lin, Xingyu and Held, David}, journal={IEEE Robotics and Automation Letters}, title={Learning Closed-Loop Dough Manipulation Using a Differentiable Reset Module}, year={2022}, volume={7}, number={4}, pages={9857-9864}, doi={10.1109/LRA.2022.3191239} } Deformable object manipulation has many applications such as cooking and laundry folding in our daily lives. Manipulating elastoplastic objects such as dough is particularly challenging because dough lacks a compact state representation and requires contact-rich interactions. We consider the task of flattening a piece of dough into a specific shape from RGB-D images. While the task is seemingly intuitive for humans, there exist local optima for common approaches such as naive trajectory optimization. We propose a novel trajectory optimizer that optimizes through a differentiable "reset" module, transforming a single-stage, fixed-initialization trajectory into a multistage, multi-initialization trajectory where all stages are optimized jointly. We then train a closed-loop policy on the demonstrations generated by our trajectory optimizer. Our policy receives partial point clouds as input, allowing ease of transfer from simulation to the real world. We show that our policy can perform real-world dough manipulation, flattening a ball of dough into a target shape. Robotics and Automation Letters (RAL) with presentation at the International Conference on Intelligent Robots and Systems (IROS), 2022 [Project Page] [Bibtex] [Abstract] [Talk] [PDF]
	Visual Haptic Reasoning: Estimating Contact Forces by Observing Deformable Object Interactions Yufei Wang, David Held, Zackory Erickson @article{wang2022visual, title={Visual Haptic Reasoning: Estimating Contact Forces by Observing Deformable Object Interactions}, author={Wang, Yufei and Held, David and Erickson, Zackory}, journal={IEEE Robotics and Automation Letters}, volume={7}, number={4}, pages={11426--11433}, year={2022}, publisher={IEEE} } Robotic manipulation of highly deformable cloth presents a promising opportunity to assist people with several daily tasks, such as washing dishes; folding laundry; or dressing, bathing, and hygiene assistance for individuals with severe motor impairments. In this work, we introduce a formulation that enables a collaborative robot to perform visual haptic reasoning with cloth -- the act of inferring the location and magnitude of applied forces during physical interaction. We present two distinct model representations, trained in physics simulation, that enable haptic reasoning using only visual and robot kinematic observations. We conducted quantitative evaluations of these models in simulation for robot-assisted dressing, bathing, and dish washing tasks, and demonstrate that the trained models can generalize across different tasks with varying interactions, human body sizes, and object shapes. We also present results with a real-world mobile manipulator, which used our simulation-trained models to estimate applied contact forces while performing physically assistive tasks with cloth. Robotics and Automation Letters (RAL) with presentation at the International Conference on Intelligent Robots and Systems (IROS), 2022 [Project Page] [Bibtex] [Abstract] [PDF] [Talk]
	Mesh-based Dynamics with Occlusion Reasoning for Cloth Manipulation Zixuan Huang, Xingyu Lin, David Held @inproceedings{huang2022medor, title={Mesh-based Dynamics Model with Occlusion Reasoning for Cloth Manipulation}, author={Huang, Zixuan and Lin, Xingyu and Held, David}, booktitle={Robotics: Science and Systems (RSS)}, year={2022} } Self-occlusion is challenging for cloth manipulation, as it makes it difficult to estimate the full state of the cloth. Ideally, a robot trying to unfold a crumpled or folded cloth should be able to reason about the cloth's occluded regions. We leverage recent advances in pose estimation for cloth to build a system that uses explicit occlusion reasoning to unfold a crumpled cloth. Specifically, we first learn a model to reconstruct the mesh of the cloth. However, the model will likely have errors due to the complexities of the cloth configurations and due to ambiguities from occlusions. Our main insight is that we can further refine the predicted reconstruction by performing test-time finetuning with self-supervised losses. The obtained reconstructed mesh allows us to use a mesh-based dynamics model for planning while reasoning about occlusions. We evaluate our system both on cloth flattening as well as on cloth canonicalization, in which the objective is to manipulate the cloth into a canonical pose. Our experiments show that our method significantly outperforms prior methods that do not explicitly account for occlusions or perform test-time optimization. Robotics: Science and Systems (RSS), 2022 [Project Page] [Bibtex] [Abstract] [PDF] [Poster] [Video]
	DiffSkill: Skill Abstraction from Differentiable Physics for Deformable Object Manipulations with Tools Xingyu Lin, Zhiao Huang, Yunzhu Li, Joshua B. Tenenbaum, David Held, Chuang Gan @inproceedings{ lin2022diffskill, title={DiffSkill: Skill Abstraction from Differentiable Physics for Deformable Object Manipulations with Tools}, author={Xingyu Lin and Zhiao Huang and Yunzhu Li and David Held and Joshua B. Tenenbaum and Chuang Gan}, booktitle={International Conference on Learning Representations}, year={2022}, url={https://openreview.net/forum?id=Kef8cKdHWpP}} We consider the problem of sequential robotic manipulation of deformable objects using tools. Previous works have shown that differentiable physics simulators provide gradients to the environment state and help trajectory optimization to converge orders of magnitude faster than model-free reinforcement learning algorithms for deformable object manipulations. However, such gradient-based trajectory optimization typically requires access to the full simulator states and can only solve short-horizon, single-skill tasks due to local optima. In this work, we propose a novel framework, named DiffSkill, that uses a differentiable physics simulator for skill abstraction to solve long-horizon deformable object manipulation tasks from sensory observations. In particular, we first obtain short-horizon skills for using each individual tool from a gradient-based optimizer and then learn a neural skill abstractor from the demonstration videos; Finally, we plan over the skills to solve the long-horizon task. We show the advantages of our method in a new set of sequential deformable object manipulation tasks over previous reinforcement learning algorithms and the trajectory optimizer. International Conference on Learning Representations (ICLR), 2022 [Project Page] [Bibtex] [Abstract] [PDF]
	Self-supervised Transparent Liquid Segmentation for Robotic Pouring Gautham Narayan Narasimhan, Kai Zhang, Ben Eisner, Xingyu Lin, David Held @inproceedings{icra2022pouring, title={Self-supervised Transparent Liquid Segmentation for Robotic Pouring}, author={Gautham Narayan Narasimhan and Kai Zhang and Ben Eisner and Xingyu Lin and David Held}, booktitle={International Conference on Robotics and Automation (ICRA)}, year={2022}} Liquid state estimation is important for robotics tasks such as pouring; however, estimating the state of transparent liquids is a challenging problem. We propose a novel segmentation pipeline that can segment transparent liquids such as water from a static, RGB image without requiring any manual annotations or heating of the liquid for training. Instead, we use a generative model that is capable of translating images of colored liquids into synthetically generated transparent liquid images, trained only on an unpaired dataset of colored and transparent liquid images. Segmentation labels of colored liquids are obtained automatically using background subtraction. Our experiments show that we are able to accurately predict a segmentation mask for transparent liquids without requiring any manual annotations. We demonstrate the utility of transparent liquid segmentation in a robotic pouring task that controls pouring by perceiving the liquid height in a transparent cup. Accompanying video and supplementary materials can be found on our project page. International Conference on Robotics and Automation (ICRA), 2022 [Project Page] [Bibtex] [Abstract] [Code] [PDF] [Poster] [Slides] [Video]
	Learning Visible Connectivity Dynamics for Cloth Smoothing Xingyu Lin, Yufei Wang, Zixuan Huang, David Held @inproceedings{lin2021VCD, title={Learning Visible Connectivity Dynamics for Cloth Smoothing}, author={Lin, Xingyu and Wang, Yufei and Huang, Zixuan and Held, David}, booktitle={Conference on Robot Learning}, year={2021}} Robotic manipulation of cloth remains challenging for robotics due to the complex dynamics of the cloth, lack of a low-dimensional state representation, and self-occlusions. In contrast to previous model-based approaches that learn a pixel-based dynamics model or a compressed latent vector dynamics, we propose to learn a particle-based dynamics model from a partial point cloud observation. To overcome the challenges of partial observability, we infer which visible points are connected on the underlying cloth mesh. We then learn a dynamics model over this visible connectivity graph. Compared to previous learning-based approaches, our model poses strong inductive bias with its particle based representation for learning the underlying cloth physics; it is invariant to visual features; and the predictions can be more easily visualized. We show that our method greatly outperforms previous state-of-the-art model-based and model-free reinforcement learning methods in simulation. Furthermore, we demonstrate zero-shot sim-to-real transfer where we deploy the model trained in simulation on a Franka arm and show that the model can successfully smooth different types of cloth from crumpled configurations. Videos can be found on our project website. Conference on Robot Learning (CoRL), 2021 [Project Page] [Bibtex] [Abstract] [OpenReview] [PDF] [Code]
	FabricFlowNet: Bimanual Cloth Manipulation with a Flow-based Policy Thomas Weng, Sujay Bajracharya, Yufei Wang, Khush Agrawal, David Held @inproceedings{weng2021fabricflownet, title={FabricFlowNet: Bimanual Cloth Manipulation with a Flow-based Policy}, author={Weng, Thomas and Bajracharya, Sujay and Wang, Yufei and Agrawal, Khush and Held, David}, booktitle={Conference on Robot Learning}, year={2021} } We address the problem of goal-directed cloth manipulation, a challenging task due to the deformability of cloth. Our insight is that optical flow, a technique normally used for motion estimation in video, can also provide an effective representation for corresponding cloth poses across observation and goal images. We introduce FabricFlowNet (FFN), a cloth manipulation policy that leverages flow as both an input and as an action representation to improve performance. FabricFlowNet also elegantly switches between dual-arm and single-arm actions based on the desired goal. We show that FabricFlowNet significantly outperforms state-of-the-art model-free and model-based cloth manipulation policies. We also present real-world experiments on a bimanual system, demonstrating effective sim-to-real transfer. Finally, we show that our method generalizes when trained on a single square cloth to other cloth shapes, such as T-shirts and rectangular cloths. Conference on Robot Learning (CoRL), 2021 [Project Page] [Bibtex] [Abstract] [Code] [OpenReview] [PDF] [Poster]
	SoftGym: Benchmarking Deep Reinforcement Learning for Deformable Object Manipulation Xingyu Lin, Yufei Wang, Jake Olkin, David Held @inproceedings{corl2020softgym, title={SoftGym: Benchmarking Deep Reinforcement Learning for Deformable Object Manipulation}, author={Lin, Xingyu and Wang, Yufei and Olkin, Jake and Held, David}, booktitle={Conference on Robot Learning}, year={2020}} Manipulating deformable objects has long been a challenge in robotics due to its high dimensional state representation and complex dynamics. Recent success in deep reinforcement learning provides a promising direction for learning to manipulate deformable objects with data driven methods. However, existing reinforcement learning benchmarks only cover tasks with direct state observability and simple low-dimensional dynamics or with relatively simple image-based environments, such as those with rigid objects. In this paper, we present SoftGym, a set of open-source simulated benchmarks for manipulating deformable objects, with a standard OpenAI Gym API and a Python interface for creating new environments. Our benchmark will enable reproducible research in this important area. Further, we evaluate a variety of algorithms on these tasks and highlight challenges for reinforcement learning algorithms, including dealing with a state representation that has a high intrinsic dimensionality and is partially observable. The experiments and analysis indicate the strengths and limitations of existing methods in the context of deformable object manipulation that can help point the way forward for future methods development. Code and videos of the learned policies can be found on our project website. Conference on Robot Learning (CoRL), 2020 [Project Page] [Bibtex] [Abstract] [Code] [PDF]
	PLAS: Latent Action Space for Offline Reinforcement Learning Wenxuan Zhou, Sujay Bajracharya, David Held @inproceedings{PLAS_corl2020, title={PLAS: Latent Action Space for Offline Reinforcement Learning}, author={Zhou, Wenxuan and Bajracharya, Sujay and Held, David}, booktitle={Conference on Robot Learning}, year={2020} } The goal of offline reinforcement learning is to learn a policy from a fixed dataset, without further interactions with the environment. This setting will be an increasingly more important paradigm for real-world applications of reinforcement learning such as robotics, in which data collection is slow and potentially dangerous. Existing off-policy algorithms have limited performance on static datasets due to extrapolation errors from out-of-distribution actions. This leads to the challenge of constraining the policy to select actions within the support of the dataset during training. We propose to simply learn the Policy in the Latent Action Space (PLAS) such that this requirement is naturally satisfied. We evaluate our method on continuous control benchmarks in simulation and a deformable object manipulation task with a physical robot. We demonstrate that our method provides competitive performance consistently across various continuous control tasks and different types of datasets, outperforming existing offline reinforcement learning methods with explicit constraints. Conference on Robot Learning (CoRL), 2020 - Plenary talk (Selection rate 4.1%) [Project Page] [Bibtex] [Abstract] [Code] [PDF]
	Cloth Region Segmentation for Robust Grasp Selection Jianing Qian, Thomas Weng, Luxin Zhang, Brian Okorn, David Held @inproceedings{Qian_2020_IROS, author={Qian, Jianing and Weng, Thomas and Zhang, Luxin and Okorn, Brian and Held, David}, booktitle={2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, title={Cloth Region Segmentation for Robust Grasp Selection}, year={2020}, volume={}, number={}, pages={9553-9560}, doi={10.1109/IROS45743.2020.9341121}} Cloth detection and manipulation is a common task in domestic and industrial settings, yet such tasks remain a challenge for robots due to cloth deformability. Furthermore, in many cloth-related tasks like laundry folding and bed making, it is crucial to manipulate specific regions like edges and corners, as opposed to folds. In this work, we focus on the problem of segmenting and grasping these key regions. Our approach trains a network to segment the edges and corners of a cloth from a depth image, distinguishing such regions from wrinkles or folds. We also provide a novel algorithm for estimating the grasp location, direction, and directional uncertainty from the segmentation. We demonstrate our method on a real robot system and show that it outperforms baseline methods on grasping success. Video and other supplementary materials are available at: International Conference on Intelligent Robots and Systems (IROS), 2020 [Project Page] [Bibtex] [Abstract] [PDF] [Code]

3D Affordance Reasoning for Object Manipulation

In order for a robot to interact with an object, the robot must infer its “affordances”: how the object moves as the robot interacts with it and how the object can interact with other objects in the environment. We develop robot perception algorithms that learn to estimate these affordances and then use such inferences to learn to manipulate objects to achieve a task.

Relevant Publications

	ArticuBot: Learning Universal Articulated Object Manipulation Policy via Large Scale Simulation Yufei Wang, Ziyu Wang, Mino Nakura†, Pratik Bhowal†, Chia-Liang Kuo†, Yi-Ting Chen, Zackory Erickson‡, David Held‡ @INPROCEEDINGS{wang-2025-ArticuBot, title={{ArticuBot: Learning Universal Articulated Object Manipulation Policy via Large Scale Simulation}}, author={Wang, Yufei and Wang, Ziyu and Nakura, Mino and Bhowal, Pratik and Kuo, Chia-Liang and Chen, Yi-Ting and Erickson, Zackory and Held, David}, BOOKTITLE={Proceedings of Robotics: Science and Systems (RSS)}, year={2025} } This paper presents ArticuBot, in which a single learned policy enables a robotics system to open diverse categories of unseen articulated objects in the real world. This task has long been challenging for robotics due to the large variations in the geometry, size, and articulation types of such objects. Our system, ArticuBot, consists of three parts: generating a large number of demonstrations in physics-based simulation, distilling all generated demonstrations into a point cloud-based neural policy via imitation learning, and performing zero-shot sim2real transfer to real robotics systems. Utilizing sampling-based grasping and motion planning, our demonstration generalization pipeline is fast and effective, generating a total of 42.3k demonstrations over 322 training articulated objects. For policy learning, we propose a novel hierarchical policy representation, in which the high-level policy learns the sub-goal for the end-effector, and the low-level policy learns how to move the end-effector conditioned on the predicted goal. We demonstrate that this hierarchical approach achieves much better object-level generalization compared to the non-hierarchical version. We further propose a novel weighted displacement model for the high-level policy that grounds the prediction into the existing 3D structure of the scene, outperforming alternative policy representations. We show that our learned policy can zero-shot transfer to three different real robot settings: a fixed table-top Franka arm across two different labs, and an X-Arm on a mobile base, opening multiple unseen articulated objects across two labs, real lounges, and kitchens. Robotics: Science and Systems (RSS), 2025 [Project Page] [Bibtex] [Abstract] [arXiv] [Code] [Video]
	HACMan++: Spatially-Grounded Motion Primitives for Manipulation Bowen Jiang, Yilin Wu, Wenxuan Zhou, Chris Paxton, David Held @INPROCEEDINGS{Jiang-RSS-24, AUTHOR = {Bowen Jiang AND Yilin Wu AND Wenxuan Zhou AND Chris Paxton AND David Held}, TITLE = {{HACMan++: Spatially-Grounded Motion Primitives for Manipulation}}, BOOKTITLE = {Proceedings of Robotics: Science and Systems}, YEAR = {2024}, ADDRESS = {Delft, Netherlands}, MONTH = {July}, DOI = {10.15607/RSS.2024.XX.129} } We present HACMan++, a reinforcement learning framework using a novel action space of spatially-grounded parameterized motion primitives for manipulation tasks. Robotics: Science and Systems (RSS), 2024 [Project Page] [Bibtex] [Abstract] [arXiv]
	Learning Distributional Demonstration Spaces for Task-Specific Cross-Pose Estimation Jenny Wang, Octavian Donca, David Held @article{wang2024taxposed, title={Learning Distributional Demonstration Spaces for Task-Specific Cross-Pose Estimation}, author={Wang, Jenny and Donca, Octavian and Held, David}, journal={IEEE International Conference on Robotics and Automation (ICRA), 2024}, year={2024} } Relative placement tasks are an important category of tasks in which one object needs to be placed in a desired pose relative to another object. Previous work has shown success in learning relative placement tasks from just a small number of demonstrations, when using relational reasoning networks with geometric inductive biases. However, such methods fail to consider that demonstrations for the same task can be fundamentally multimodal, like a mug hanging on any of n racks. We propose a method that retains the provably translation-invariant and relational properties of prior work but incorporates additional properties that account for multimodal, distributional examples. We show that our method is able to learn precise relative placement tasks with a small number of multimodal demonstrations with no human annotations across a diverse set of objects within a category. International Conference on Robotics and Automation (ICRA), 2024 [Project Page] [Bibtex] [Abstract] [arXiv] [Poster] [Slides] [Code]
	One Policy to Dress Them All: Learning to Dress People with Diverse Poses and Garments Yufei Wang, Zhanyi Sun, Zackory Erickson, David Held @inproceedings{Wang2023One, title={One Policy to Dress Them All: Learning to Dress People with Diverse Poses and Garments}, author={Wang, Yufei and Sun, Zhanyi and Erickson, Zackory and Held, David}, booktitle={Robotics: Science and Systems (RSS)}, year={2023} } Robot-assisted dressing could benefit the lives of many people such as older adults and individuals with disabilities. Despite such potential, robot-assisted dressing remains a challenging task for robotics as it involves complex manipulation of deformable cloth in 3D space. Many prior works aim to solve the robot-assisted dressing task, but they make certain assumptions such as a fixed garment and a fixed arm pose that limit their ability to generalize. In this work, we develop a robot-assisted dressing system that is able to dress different garments on people with diverse poses from partial point cloud observations, based on a learned policy. We show that with proper design of the policy architecture and Q function, reinforcement learning (RL) can be used to learn effective policies with partial point cloud observations that work well for dressing diverse garments. We further leverage policy distillation to combine multiple policies trained on different ranges of human arm poses into a single policy that works over a wide range of different arm poses. We conduct comprehensive real-world evaluations of our system with 510 dressing trials in a human study with 17 participants with different arm poses and dressed garments. Our system is able to dress 86% of the length of the participants' arms on average. Videos can be found on the anonymized project webpage: https://sites.google.com/view/one-policy-dress. Robotics: Science and Systems (RSS), 2023 [Project Page] [Bibtex] [Abstract] [arXiv]
	Neural Grasp Distance Fields for Robot Manipulation Thomas Weng, David Held, Franziska Meier, Mustafa Mukadam @article{weng2023ngdf, title={Neural Grasp Distance Fields for Robot Manipulation}, author={Weng, Thomas and Held, David and Meier, Franziska and Mukadam, Mustafa}, booktitle={IEEE International Conference on Robotics and Automation (ICRA)}, year={2023} } We formulate grasp learning as a neural field and present Neural Grasp Distance Fields (NGDF). Here, the input is a 6D pose of a robot end effector and output is a distance to a continuous manifold of valid grasps for an object. In contrast to current approaches that predict a set of discrete candidate grasps, the distance-based NGDF representation is easily interpreted as a cost, and minimizing this cost produces a successful grasp pose. This grasp distance cost can be incorporated directly into a trajectory optimizer for joint optimization with other costs such as trajectory smoothness and collision avoidance. During optimization, as the various costs are balanced and minimized, the grasp target is allowed to smoothly vary, as the learned grasp field is continuous. In simulation benchmarks with a Franka arm, we find that joint grasping and planning with NGDF outperforms baselines by 63% execution success while generalizing to unseen query poses and unseen object shapes. International Conference on Robotics and Automation (ICRA), 2023 [Project Page] [Bibtex] [Abstract] [arXiv] [Code]
	TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation Chuer Pan, Brian Okorn, Harry Zhang, Ben Eisner, David Held @inproceedings{pan2022tax, title={TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation}, author={Pan, Chuer and Okorn, Brian and Zhang, Harry and Eisner, Ben and Held, David}, booktitle={Conference on Robot Learning (CoRL)}, year={2022} } How do we imbue robots with the ability to efficiently manipulate unseen objects and transfer relevant skills based on demonstrations? End-to-end learning methods often fail to generalize to novel objects or unseen configurations. Instead, we focus on the task-specific pose relationship between relevant parts of interacting objects. We conjecture that this relationship is a generalizable notion of a manipulation task that can transfer to new objects in the same category; examples include the relationship between the pose of a pan relative to an oven or the pose of a mug relative to a mug rack. We call this task-specific pose relationship “cross-pose” and provide a mathematical definition of this concept. We propose a vision-based system that learns to estimate the cross-pose between two objects for a given manipulation task using learned cross-object correspondences. The estimated cross-pose is then used to guide a downstream motion planner to manipulate the objects into the desired pose relationship (placing a pan into the oven or the mug onto the mug rack). We demonstrate our method’s capability to generalize to unseen objects, in some cases after training on only 10 demonstrations in the real world. Results show that our system achieves state-of-the-art performance in both simulated and real-world experiments across a number of tasks. Conference on Robot Learning (CoRL), 2022 [Project Page] [Bibtex] [Abstract] [PDF]
	Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation Xingyu Lin, Carl Qi, Yunchu Zhang, Zhiao Huang, Katerina Fragkiadaki, Yunzhu Li, Chuang Gan, David Held @inproceedings{lin2022planning, title={Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation}, author={Xingyu Lin and Carl Qi and Yunchu Zhang and Yunzhu Li and Zhiao Huang and Katerina Fragkiadaki and Chuang Gan and David Held}, booktitle={6th Annual Conference on Robot Learning}, year={2022}, url={https://openreview.net/forum?id=tyxyBj2w4vw} } Effective planning of long-horizon deformable object manipulation requires suitable abstractions at both the spatial and temporal levels. Previous methods typically either focus on short-horizon tasks or make strong assumptions that full-state information is available, which prevents their use on deformable objects. In this paper, we propose PlAnning with Spatial-Temporal Abstraction (PASTA), which incorporates both spatial abstraction (reasoning about objects and their relations to each other) and temporal abstraction (reasoning over skills instead of low-level actions). Our framework maps high-dimensional 3D observations such as point clouds into a set of latent vectors and plans over skill sequences on top of the latent set representation. We show that our method can effectively perform challenging sequential deformable object manipulation tasks in the real world, which require combining multiple tool-use skills such as cutting with a knife, pushing with a pusher, and spreading dough with a roller. Conference on Robot Learning (CoRL), 2022 [Project Page] [Bibtex] [Abstract] [OpenReview] [PDF] [Code]
	Visual Haptic Reasoning: Estimating Contact Forces by Observing Deformable Object Interactions Yufei Wang, David Held, Zackory Erickson @article{wang2022visual, title={Visual Haptic Reasoning: Estimating Contact Forces by Observing Deformable Object Interactions}, author={Wang, Yufei and Held, David and Erickson, Zackory}, journal={IEEE Robotics and Automation Letters}, volume={7}, number={4}, pages={11426--11433}, year={2022}, publisher={IEEE} } Robotic manipulation of highly deformable cloth presents a promising opportunity to assist people with several daily tasks, such as washing dishes; folding laundry; or dressing, bathing, and hygiene assistance for individuals with severe motor impairments. In this work, we introduce a formulation that enables a collaborative robot to perform visual haptic reasoning with cloth -- the act of inferring the location and magnitude of applied forces during physical interaction. We present two distinct model representations, trained in physics simulation, that enable haptic reasoning using only visual and robot kinematic observations. We conducted quantitative evaluations of these models in simulation for robot-assisted dressing, bathing, and dish washing tasks, and demonstrate that the trained models can generalize across different tasks with varying interactions, human body sizes, and object shapes. We also present results with a real-world mobile manipulator, which used our simulation-trained models to estimate applied contact forces while performing physically assistive tasks with cloth. Robotics and Automation Letters (RAL) with presentation at the International Conference on Intelligent Robots and Systems (IROS), 2022 [Project Page] [Bibtex] [Abstract] [PDF] [Talk]
	FlowBot3D: Learning 3D Articulation Flow to Manipulate Articulated Objects Ben Eisner, Harry Zhang, David Held @inproceedings{EisnerZhang2022FLOW, title={FlowBot3D: Learning 3D Articulation Flow to Manipulate Articulated Objects}, author={Eisner, Ben and Zhang, Harry and Held, David}, booktitle={Robotics: Science and Systems (RSS)}, year={2022} } We explore a novel method to perceive and manipulate 3D articulated objects that generalizes to enable a robot to articulate unseen classes of objects. We propose a vision-based system that learns to predict the potential motions of the parts of a variety of articulated objects to guide downstream motion planning of the system to articulate the objects. To predict the object motions, we train a neural network to output a dense vector field representing the point-wise motion direction of the points in the point cloud under articulation. We then deploy an analytical motion planner based on this vector field to achieve a policy that yields maximum articulation. We train the vision system entirely in simulation, and we demonstrate the capability of our system to generalize to unseen object instances and novel categories in both simulation and the real world, deploying our policy on a Sawyer robot with no finetuning. Results show that our system achieves state-of-the-art performance in both simulated and real-world experiments. Robotics: Science and Systems (RSS), 2022 - Best Paper Finalist (Selection Rate 1.5%) [Project Page] [Bibtex] [Abstract] [PDF]
	Learning Visible Connectivity Dynamics for Cloth Smoothing Xingyu Lin, Yufei Wang, Zixuan Huang, David Held @inproceedings{lin2021VCD, title={Learning Visible Connectivity Dynamics for Cloth Smoothing}, author={Lin, Xingyu and Wang, Yufei and Huang, Zixuan and Held, David}, booktitle={Conference on Robot Learning}, year={2021}} Robotic manipulation of cloth remains challenging due to the complex dynamics of the cloth, lack of a low-dimensional state representation, and self-occlusions. In contrast to previous model-based approaches that learn a pixel-based dynamics model or a compressed latent vector dynamics, we propose to learn a particle-based dynamics model from a partial point cloud observation. To overcome the challenges of partial observability, we infer which visible points are connected on the underlying cloth mesh. We then learn a dynamics model over this visible connectivity graph. Compared to previous learning-based approaches, our model imposes a strong inductive bias through its particle-based representation for learning the underlying cloth physics; it is invariant to visual features; and the predictions can be more easily visualized. We show that our method greatly outperforms previous state-of-the-art model-based and model-free reinforcement learning methods in simulation. Furthermore, we demonstrate zero-shot sim-to-real transfer where we deploy the model trained in simulation on a Franka arm and show that the model can successfully smooth different types of cloth from crumpled configurations. Videos can be found on our project website. Conference on Robot Learning (CoRL), 2021 [Project Page] [Bibtex] [Abstract] [OpenReview] [PDF] [Code]

Multimodal Learning

Robots should use all of the sensors available to them, such as depth, RGB, and tactile data. We have developed methods to intelligently integrate these sensor modalities.

Relevant Publications

	Force-Constrained Visual Policy: Safe Robot-Assisted Dressing via Multi-Modal Sensing Zhanyi Sun, Yufei Wang, David Held†, Zackory Erickson† @article{sun2024force, title={Force-Constrained Visual Policy: Safe Robot-Assisted Dressing via Multi-Modal Sensing}, author={Sun, Zhanyi and Wang, Yufei and Held, David and Erickson, Zackory}, journal={IEEE Robotics and Automation Letters}, year={2024} } Robot-assisted dressing could profoundly enhance the quality of life of adults with physical disabilities. To achieve this, a robot can benefit from both visual and force sensing. The former enables the robot to ascertain human body pose and garment deformations, while the latter helps maintain safety and comfort during the dressing process. In this paper, we introduce a new technique that leverages both vision and force modalities for this assistive task. Our approach first trains a vision-based dressing policy using reinforcement learning in simulation with varying body sizes, poses, and types of garments. We then learn a force dynamics model for action planning to ensure safety. Due to limitations of simulating accurate force data when deformable garments interact with the human body, we learn a force dynamics model directly from real-world data. Our proposed method combines the vision-based policy, trained in simulation, with the force dynamics model, learned in the real world, by solving a constrained optimization problem to infer actions that facilitate the dressing process without applying excessive force on the person. We evaluate our system in simulation and in a real-world human study with 10 participants across 240 dressing trials, showing it greatly outperforms prior baselines. Video demonstrations are available on our project website. Robotics and Automation Letters (RAL), 2024 [Project Page] [Bibtex] [Abstract] [arXiv]
	Learning to Singulate Layers of Cloth using Tactile Feedback Sashank Tirumala, Thomas Weng, Daniel Seita*, Oliver Kroemer, Zeynep Temel, David Held @inproceedings{tirumala2022reskin, author={Tirumala, Sashank and Weng, Thomas and Seita, Daniel and Kroemer, Oliver and Temel, Zeynep and Held, David}, booktitle={2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, title={Learning to Singulate Layers of Cloth using Tactile Feedback}, year={2022}, volume={}, number={}, pages={7773-7780}, doi={10.1109/IROS47612.2022.9981341} } Robotic manipulation of cloth has applications ranging from fabrics manufacturing to handling blankets and laundry. Cloth manipulation is challenging for robots largely due to the high degrees of freedom, complex dynamics, and severe self-occlusions of cloth in folded or crumpled configurations. Prior work on robotic manipulation of cloth relies primarily on vision sensors alone, which may pose challenges for fine-grained manipulation tasks such as grasping a desired number of cloth layers from a stack of cloth. In this paper, we propose to use tactile sensing for cloth manipulation; we attach a tactile sensor (ReSkin) to one of the two fingertips of a Franka robot and train a classifier to determine whether the robot is grasping a specific number of cloth layers. During test-time experiments, the robot uses this classifier as part of its policy to grasp one or two cloth layers using tactile feedback to determine suitable grasping points. Experimental results over 180 physical trials suggest that the proposed method outperforms baselines that do not use tactile feedback and generalizes better to unseen fabrics than methods that use image classifiers. International Conference on Intelligent Robots and Systems (IROS), 2022 - Best Paper at ROMADO-SI [Project Page] [Bibtex] [Abstract] [arXiv] [Code]
	PLAS: Latent Action Space for Offline Reinforcement Learning Wenxuan Zhou, Sujay Bajracharya, David Held @inproceedings{PLAS_corl2020, title={PLAS: Latent Action Space for Offline Reinforcement Learning}, author={Zhou, Wenxuan and Bajracharya, Sujay and Held, David}, booktitle={Conference on Robot Learning}, year={2020} } The goal of offline reinforcement learning is to learn a policy from a fixed dataset, without further interactions with the environment. This setting will be an increasingly more important paradigm for real-world applications of reinforcement learning such as robotics, in which data collection is slow and potentially dangerous. Existing off-policy algorithms have limited performance on static datasets due to extrapolation errors from out-of-distribution actions. This leads to the challenge of constraining the policy to select actions within the support of the dataset during training. We propose to simply learn the Policy in the Latent Action Space (PLAS) such that this requirement is naturally satisfied. We evaluate our method on continuous control benchmarks in simulation and a deformable object manipulation task with a physical robot. We demonstrate that our method provides competitive performance consistently across various continuous control tasks and different types of datasets, outperforming existing offline reinforcement learning methods with explicit constraints. Conference on Robot Learning (CoRL), 2020 - Plenary talk (Selection rate 4.1%) [Project Page] [Bibtex] [Abstract] [Code] [PDF]
	Cloth Region Segmentation for Robust Grasp Selection Jianing Qian, Thomas Weng, Luxin Zhang, Brian Okorn, David Held @inproceedings{Qian_2020_IROS, author={Qian, Jianing and Weng, Thomas and Zhang, Luxin and Okorn, Brian and Held, David}, booktitle={2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, title={Cloth Region Segmentation for Robust Grasp Selection}, year={2020}, volume={}, number={}, pages={9553-9560}, doi={10.1109/IROS45743.2020.9341121}} Cloth detection and manipulation is a common task in domestic and industrial settings, yet such tasks remain a challenge for robots due to cloth deformability. Furthermore, in many cloth-related tasks like laundry folding and bed making, it is crucial to manipulate specific regions like edges and corners, as opposed to folds. In this work, we focus on the problem of segmenting and grasping these key regions. Our approach trains a network to segment the edges and corners of a cloth from a depth image, distinguishing such regions from wrinkles or folds. We also provide a novel algorithm for estimating the grasp location, direction, and directional uncertainty from the segmentation. We demonstrate our method on a real robot system and show that it outperforms baseline methods on grasping success. Video and other supplementary materials are available at: International Conference on Intelligent Robots and Systems (IROS), 2020 [Project Page] [Bibtex] [Abstract] [PDF] [Code]
	Multi-Modal Transfer Learning for Grasping Transparent and Specular Objects Thomas Weng, Amith Pallankize, Yimin Tang, Oliver Kroemer, David Held @ARTICLE{9001238, author={Thomas Weng and Amith Pallankize and Yimin Tang and Oliver Kroemer and David Held}, journal={IEEE Robotics and Automation Letters}, title={Multi-Modal Transfer Learning for Grasping Transparent and Specular Objects}, year={2020}, volume={5}, number={3}, pages={3791-3798}, doi={10.1109/LRA.2020.2974686}} State-of-the-art object grasping methods rely on depth sensing to plan robust grasps, but commercially available depth sensors fail to detect transparent and specular objects. To improve grasping performance on such objects, we introduce a method for learning a multi-modal perception model by bootstrapping from an existing uni-modal model. This transfer learning approach requires only a pre-existing uni-modal grasping model and paired multi-modal image data for training, forgoing the need for ground-truth grasp success labels or real grasp attempts. Our experiments demonstrate that our approach is able to reliably grasp transparent and reflective objects. Video and supplementary material are available at Robotics and Automation Letters (RAL) with presentation at the International Conference on Robotics and Automation (ICRA), 2020 [Project Page] [Bibtex] [Abstract] [PDF]

Reinforcement Learning Algorithms

Robots can use data, either from the real world or from a simulator, to learn how to perform a task. This is especially important for tasks, such as deformable object manipulation, that are difficult for robots to achieve via traditional techniques like motion planning. We have developed novel reinforcement learning algorithms to more effectively learn from data.

Relevant Publications

	HACMan++: Spatially-Grounded Motion Primitives for Manipulation Bowen Jiang, Yilin Wu, Wenxuan Zhou, Chris Paxton, David Held @INPROCEEDINGS{Jiang-RSS-24, AUTHOR = {Bowen Jiang AND Yilin Wu AND Wenxuan Zhou AND Chris Paxton AND David Held}, TITLE = {{HACMan++: Spatially-Grounded Motion Primitives for Manipulation}}, BOOKTITLE = {Proceedings of Robotics: Science and Systems}, YEAR = {2024}, ADDRESS = {Delft, Netherlands}, MONTH = {July}, DOI = {10.15607/RSS.2024.XX.129} } We present HACMan++, a reinforcement learning framework using a novel action space of spatially-grounded parameterized motion primitives for manipulation tasks. Robotics: Science and Systems (RSS), 2024 [Project Page] [Bibtex] [Abstract] [arXiv]
	Learning Off-policy for Online Planning Harshit Sikchi, Wenxuan Zhou, David Held @inproceedings{sikchi2021learning, title={Learning Off-policy for Online Planning}, author={Sikchi, Harshit and Zhou, Wenxuan and Held, David}, booktitle={Conference on Robot Learning}, year={2021}} Reinforcement learning (RL) in low-data and risk-sensitive domains requires performant and flexible deployment policies that can readily incorporate constraints during deployment. One such class of policies is the semi-parametric H-step lookahead policies, which select actions using trajectory optimization over a dynamics model for a fixed horizon with a terminal value function. In this work, we investigate a novel instantiation of H-step lookahead with a learned model and a terminal value function learned by a model-free off-policy algorithm, named Learning Off-Policy with Online Planning (LOOP). We provide a theoretical analysis of this method, suggesting a tradeoff between model errors and value function errors, and empirically demonstrate this tradeoff to be beneficial in deep reinforcement learning. Furthermore, we identify the "Actor Divergence" issue in this framework and propose Actor Regularized Control (ARC), a modified trajectory optimization procedure. We evaluate our method on a set of robotic tasks for Offline and Online RL and demonstrate improved performance. We also show the flexibility of LOOP to incorporate safety constraints during deployment with a set of navigation environments. We demonstrate that LOOP is a desirable framework for robotics applications based on its strong performance in various important RL settings. Conference on Robot Learning (CoRL), 2021 - Oral presentation (Selection rate 6.5%); Best Paper Finalist [Project Page] [Bibtex] [Abstract] [Code] [OpenReview] [PDF] [Talk]
	PLAS: Latent Action Space for Offline Reinforcement Learning Wenxuan Zhou, Sujay Bajracharya, David Held @inproceedings{PLAS_corl2020, title={PLAS: Latent Action Space for Offline Reinforcement Learning}, author={Zhou, Wenxuan and Bajracharya, Sujay and Held, David}, booktitle={Conference on Robot Learning}, year={2020} } The goal of offline reinforcement learning is to learn a policy from a fixed dataset, without further interactions with the environment. This setting will be an increasingly more important paradigm for real-world applications of reinforcement learning such as robotics, in which data collection is slow and potentially dangerous. Existing off-policy algorithms have limited performance on static datasets due to extrapolation errors from out-of-distribution actions. This leads to the challenge of constraining the policy to select actions within the support of the dataset during training. We propose to simply learn the Policy in the Latent Action Space (PLAS) such that this requirement is naturally satisfied. We evaluate our method on continuous control benchmarks in simulation and a deformable object manipulation task with a physical robot. We demonstrate that our method provides competitive performance consistently across various continuous control tasks and different types of datasets, outperforming existing offline reinforcement learning methods with explicit constraints. Conference on Robot Learning (CoRL), 2020 - Plenary talk (Selection rate 4.1%) [Project Page] [Bibtex] [Abstract] [Code] [PDF]
	Adaptive Auxiliary Task Weighting for Reinforcement Learning Xingyu Lin, Harjatin Baweja, George Kantor, David Held Neural Information Processing Systems (NeurIPS), 2019
	Automatic Goal Generation for Reinforcement Learning Agents Carlos Florensa, David Held, Xinyang Geng*, Pieter Abbeel International Conference on Machine Learning (ICML), 2018
	Reverse Curriculum Generation for Reinforcement Learning Carlos Florensa, David Held, Markus Wulfmeier, Pieter Abbeel Conference on Robot Learning (CoRL), 2017
	Constrained Policy Optimization Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel International Conference on Machine Learning (ICML), 2017

Autonomous Driving

In the domain of autonomous driving, we have developed novel methods for every part of the perception pipeline: segmentation, object detection, tracking, and velocity estimation.

Relevant Publications

	Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting Tarasha Khurana, Peiyun Hu, David Held, Deva Ramanan @inproceedings{Khurana2023point, title={Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting}, author={Khurana, Tarasha and Hu, Peiyun and Held, David and Ramanan, Deva}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2023} } Predicting how the world can evolve in the future is crucial for motion planning in autonomous systems. Classical methods are limited because they rely on costly human annotations in the form of semantic class labels, bounding boxes, and tracks or HD maps of cities to plan their motion, and thus are difficult to scale to large unlabeled datasets. One promising self-supervised task is 3D point cloud forecasting from unannotated LiDAR sequences. We show that this task requires algorithms to implicitly capture (1) sensor extrinsics (i.e., the egomotion of the autonomous vehicle), (2) sensor intrinsics (i.e., the sampling pattern specific to the particular LiDAR sensor), and (3) the shape and motion of other objects in the scene. But autonomous systems should make predictions about the world and not their sensors! To this end, we factor out (1) and (2) by recasting the task as one of spacetime (4D) occupancy forecasting. But because it is expensive to obtain ground-truth 4D occupancy, we “render” point cloud data from 4D occupancy predictions given sensor extrinsics and intrinsics, allowing one to train and test occupancy algorithms with unannotated LiDAR sequences. This also allows one to evaluate and compare point cloud forecasting algorithms across diverse datasets, sensors, and vehicles. Conference on Computer Vision and Pattern Recognition (CVPR), 2023 [Bibtex] [Abstract]
	Differentiable Raycasting for Self-supervised Occupancy Forecasting Tarasha Khurana, Peiyun Hu, Achal Dave, Jason Ziglar, David Held, Deva Ramanan European Conference on Computer Vision (ECCV), 2022
	Semi-supervised 3D Object Detection via Temporal Graph Neural Networks Jianren Wang, Haiming Gang, Siddharth Ancha, Yi-ting Chen, and David Held @article{wang2021sodtgnn, title={Semi-supervised 3D Object Detection via Temporal Graph Neural Networks}, author={Wang, Jianren and Gang, Haiming and Ancha, Siddharth and Chen, Yi-ting and Held, David}, journal={International Conference on 3D Vision (3DV)}, year={2021}} 3D object detection plays an important role in autonomous driving and other robotics applications. However, these detectors usually require training on large amounts of annotated data that is expensive and time-consuming to collect. Instead, we propose leveraging large amounts of unlabeled point cloud videos by semi-supervised learning of 3D object detectors via temporal graph neural networks. Our insight is that temporal smoothing can create more accurate detection results on unlabeled data, and these smoothed detections can then be used to retrain the detector. We learn to perform this temporal reasoning with a graph neural network, where edges represent the relationship between candidate detections in different time frames. International Conference on 3D Vision (3DV), 2021 [Project Page] [Bibtex] [Abstract] [Code] [Video (Long)] [Video (Short)]
	Active Safety Envelopes using Light Curtains with Probabilistic Guarantees Siddharth Ancha, Gaurav Pathak, Srinivasa Narasimhan, David Held @inproceedings{Ancha-RSS-21, AUTHOR = {Siddharth Ancha AND Gaurav Pathak AND Srinivasa G. Narasimhan AND David Held}, TITLE = {Active Safety Envelopes using Light Curtains with Probabilistic Guarantees}, BOOKTITLE = {Proceedings of Robotics: Science and Systems}, YEAR = {2021}, MONTH = {July} } To safely navigate unknown environments, robots must accurately perceive dynamic obstacles. Instead of directly measuring the scene depth with a LiDAR sensor, we explore the use of a much cheaper and higher resolution sensor: Robotics: Science and Systems (RSS), 2021 [Project Page] [Bibtex] [Abstract] [Blog] [Code] [PDF] [Poster] [Talk]
	Safe Local Motion Planning with Self-Supervised Freespace Forecasting Peiyun Hu, Aaron Huang, John Dolan, David Held, Deva Ramanan @inproceedings{cvpr2021husafe, title={Safe Local Motion Planning with Self-Supervised Freespace Forecasting}, author={Peiyun Hu and Aaron Huang and John Dolan and David Held and Deva Ramanan}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2021}} Safe local motion planning for autonomous driving in dynamic environments requires forecasting how the scene evolves. Practical autonomy stacks adopt a semantic object-centric representation of a dynamic scene and build object detection, tracking, and prediction modules to solve forecasting. However, training these modules comes at an enormous human cost of manually annotated objects across frames. In this work, we explore future freespace as an alternative representation to support motion planning. Our key intuition is that it is important to avoid straying into occupied space regardless of what is occupying it. Importantly, computing ground-truth future freespace is annotation-free. First, we explore freespace forecasting as a self-supervised learning task. We then demonstrate how to use forecasted freespace to identify collision-prone plans from off-the-shelf motion planners. Finally, we propose future freespace as an additional source of annotation-free supervision. We demonstrate how to integrate such supervision into the learning process of learning-based planners. Experimental results on nuScenes and CARLA suggest both approaches lead to significant reduction in collision rates. Conference on Computer Vision and Pattern Recognition (CVPR), 2021 [Project Page] [Bibtex] [Abstract] [Code] [Paper] [Poster] [Talk]
	Active Perception using Light Curtains for Autonomous Driving Siddharth Ancha, Yaadhav Raaj, Peiyun Hu, Srinivasa Narasimhan, David Held @inproceedings{Ancha_2020_ECCV, author="Ancha, Siddharth and Raaj, Yaadhav and Hu, Peiyun and Narasimhan, Srinivasa G. and Held, David", editor="Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael", title="Active Perception Using Light Curtains for Autonomous Driving", booktitle="Computer Vision -- ECCV 2020", year="2020", publisher="Springer International Publishing", address="Cham", pages="751--766", isbn="978-3-030-58558-7" } Most real-world 3D sensors such as LiDARs perform fixed scans of the entire environment, while being decoupled from the recognition system that processes the sensor data. In this work, we propose a method for 3D object recognition using light curtains, a resource-efficient controllable sensor that measures depth at user-specified locations in the environment. Crucially, we propose using prediction uncertainty of a deep learning based 3D point cloud detector to guide active perception. Given a neural network’s uncertainty, we derive an optimization objective to place light curtains using the principle of maximizing information gain. Then, we develop a novel and efficient optimization algorithm to maximize this objective by encoding the physical constraints of the device into a constraint graph and optimizing with dynamic programming. We show how a 3D detector can be trained to detect objects in a scene by sequentially placing uncertainty-guided light curtains to successively improve detection accuracy. European Conference on Computer Vision (ECCV), 2020 - Spotlight presentation (Selection rate 5.3%) [Project Page] [Bibtex] [Abstract] [Code] [Long Talk] [PDF] [Short Talk]
	Uncertainty-aware Self-supervised 3D Data Association Jianren Wang, Siddharth Ancha, Yi-Ting Chen, David Held @inproceedings{jianren20s3da, author = "Wang, Jianren and Ancha, Siddharth and Chen, Yi-Ting and Held, David", title = "Uncertainty-aware Self-supervised 3D Data Association", booktitle = "IROS", year = "2020" } 3D object trackers usually require training on large amounts of annotated data that is expensive and time-consuming to collect. Instead, we propose leveraging vast unlabeled datasets by self-supervised metric learning of 3D object trackers, with a focus on data association. Large scale annotations for unlabeled data are cheaply obtained by automatic object detection and association across frames. We show how these self-supervised annotations can be used in a principled manner to learn point-cloud embeddings that are effective for 3D tracking. We estimate and incorporate uncertainty in self-supervised tracking to learn more robust embeddings, without needing any labeled data. We design embeddings to differentiate objects across frames, and learn them using uncertainty-aware self-supervised training. Finally, we demonstrate their ability to perform accurate data association across frames, towards effective and accurate 3D tracking. International Conference on Intelligent Robots and Systems (IROS), 2020 [Project Page] [Bibtex] [Abstract] [Code] [PDF]
	3D Multi-Object Tracking: A Baseline and New Evaluation Metrics Xinshuo Weng, Jianren Wang, David Held, Kris Kitani @article{Weng2020_AB3DMOT, author = {Weng, Xinshuo and Wang, Jianren and Held, David and Kitani, Kris}, journal = {IROS}, title = {3D Multi-Object Tracking: A Baseline and New Evaluation Metrics}, year = {2020} } 3D multi-object tracking (MOT) is an essential component for many applications such as autonomous driving and assistive robotics. Recent work on 3D MOT focuses on developing accurate systems giving less attention to practical considerations such as computational cost and system complexity. In contrast, this work proposes a simple real-time 3D MOT system. Our system first obtains 3D detections from a LiDAR point cloud. Then, a straightforward combination of a 3D Kalman filter and the Hungarian algorithm is used for state estimation and data association. Additionally, 3D MOT datasets such as KITTI evaluate MOT methods in the 2D space and standardized 3D MOT evaluation tools are missing for a fair comparison of 3D MOT methods. Therefore, we propose a new 3D MOT evaluation tool along with three new metrics to comprehensively evaluate 3D MOT methods. We show that, although our system employs a combination of classical MOT modules, we achieve state-of-the-art 3D MOT performance on two 3D MOT benchmarks (KITTI and nuScenes). Surprisingly, although our system does not use any 2D data as inputs, we achieve competitive performance on the KITTI 2D MOT leaderboard. Our proposed system runs at a rate of 207.4 FPS on the KITTI dataset, achieving the fastest speed among all modern MOT systems. International Conference on Intelligent Robots and Systems (IROS), 2020 [Project Page] [Bibtex] [Abstract] [Code] [PDF]
	Just Go with the Flow: Self-Supervised Scene Flow Estimation Himangi Mittal, Brian Okorn, David Held @InProceedings{Mittal_2020_CVPR, author = {Mittal, Himangi and Okorn, Brian and Held, David}, title = {Just Go With the Flow: Self-Supervised Scene Flow Estimation}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2020} } When interacting with highly dynamic environments, scene flow allows autonomous systems to reason about the non-rigid motion of multiple independent objects. This is of particular interest in the field of autonomous driving, in which many cars, people, bicycles, and other objects need to be accurately tracked. Current state-of-the-art methods require annotated scene flow data from autonomous driving scenes to train scene flow networks with supervised learning. As an alternative, we present a method of training scene flow that uses two self-supervised losses, based on nearest neighbors and cycle consistency. These self-supervised losses allow us to train our method on large unlabeled autonomous driving datasets; the resulting method matches current state-of-the-art supervised performance using no real world annotations and exceeds state-of-the-art performance when combining our self-supervised approach with supervised learning on a smaller labeled dataset. Conference on Computer Vision and Pattern Recognition (CVPR), 2020 - Oral presentation (Selection rate 5.7%) [Project Page] [Bibtex] [Abstract] [Code] [PDF]
	What You See is What You Get: Exploiting Visibility for 3D Object Detection Peiyun Hu, Jason Ziglar, David Held, Deva Ramanan Conference on Computer Vision and Pattern Recognition (CVPR), 2020 - Oral presentation (Selection rate 5.7%)
	Learning to Optimally Segment Point Clouds Peiyun Hu, David Held, Deva Ramanan Robotics and Automation Letters (RAL) with presentation at the International Conference on Robotics and Automation (ICRA), 2020
	PCN: Point Completion Network - Best Paper Honorable Mention Wentao Yuan, Tejas Khot, David Held, Christoph Mertz, Martial Hebert International Conference on 3D Vision (3DV), 2018
	A Probabilistic Framework for Real-time 3D Segmentation using Spatial, Temporal, and Semantic Cues David Held, Devin Guillory, Brice Rebsamen, Sebastian Thrun, Silvio Savarese Robotics: Science and Systems (RSS), 2016

Active Perception

Rather than statically observing a scene, robots can take actions that help them perceive it better, an approach known as “active perception.”

Relevant Publications

	Active Velocity Estimation using Light Curtains via Self-Supervised Multi-Armed Bandits Siddharth Ancha, Gaurav Pathak, Ji Zhang, Srinivasa Narasimhan, David Held @inproceedings{ancha2023rss, title = {Active Velocity Estimation using Light Curtains via Self-Supervised Multi-Armed Bandits}, author = {Siddharth Ancha AND Gaurav Pathak AND Ji Zhang AND Srinivasa Narasimhan AND David Held}, booktitle = {Proceedings of Robotics: Science and Systems}, year = {2023}, address = {Daegu, Republic of Korea}, month = {July} } To navigate in an environment safely and autonomously, robots must accurately estimate where obstacles are and how they move. Instead of using expensive traditional 3D sensors, we explore the use of a much cheaper, faster, and higher resolution alternative: programmable light curtains. Light curtains are a controllable depth sensor that sense only along a surface that the user selects. We adapt a probabilistic method based on particle filters and occupancy grids to explicitly estimate the position and velocity of 3D points in the scene using partial measurements made by light curtains. The central challenge is to decide where to place the light curtain to accurately perform this task. We propose multiple curtain placement strategies guided by maximizing information gain and verifying predicted object locations. Then, we combine these strategies using an online learning framework. We propose a novel self-supervised reward function that evaluates the accuracy of current velocity estimates using future light curtain placements. We use a multi-armed bandit framework to intelligently switch between placement policies in real time, outperforming fixed policies. We develop a full-stack navigation system that uses position and velocity estimates from light curtains for downstream tasks such as localization, mapping, path-planning, and obstacle avoidance. This work paves the way for controllable light curtains to accurately, efficiently, and purposefully perceive and navigate complex and dynamic environments. Robotics: Science and Systems (RSS), 2023 [Project Page] [Bibtex] [Abstract] [arXiv] [Talk]
	Active Safety Envelopes using Light Curtains with Probabilistic Guarantees Siddharth Ancha, Gaurav Pathak, Srinivasa Narasimhan, David Held @inproceedings{Ancha-RSS-21, AUTHOR = {Siddharth Ancha AND Gaurav Pathak AND Srinivasa G. Narasimhan AND David Held}, TITLE = {Active Safety Envelopes using Light Curtains with Probabilistic Guarantees}, BOOKTITLE = {Proceedings of Robotics: Science and Systems}, YEAR = {2021}, MONTH = {July} } To safely navigate unknown environments, robots must accurately perceive dynamic obstacles. Instead of directly measuring the scene depth with a LiDAR sensor, we explore the use of a much cheaper and higher resolution sensor: Robotics: Science and Systems (RSS), 2021 [Project Page] [Bibtex] [Abstract] [Blog] [Code] [PDF] [Poster] [Talk]
	Exploiting & Refining Depth Distributions with Triangulation Light Curtains Yaadhav Raaj, Siddharth Ancha, Robert Tamburo, David Held, Srinivasa Narasimhan @inproceedings{cvpr2021raajexploiting, title = {Exploiting & Refining Depth Distributions with Triangulation Light Curtains}, author = {Yaadhav Raaj and Siddharth Ancha and Robert Tamburo and David Held and Srinivasa Narasimhan}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2021} } Active sensing through the use of Adaptive Depth Sensors is a nascent field, with potential in areas such as Advanced driver-assistance systems (ADAS). They do however require dynamically driving a laser / light-source to a specific location to capture information, with one such class of sensor being the Triangulation Light Curtains (LC). In this work, we introduce a novel approach that exploits prior depth distributions from RGB cameras to drive a Light Curtain's laser line to regions of uncertainty to get new measurements. These measurements are utilized such that depth uncertainty is reduced and errors get corrected recursively. We show real-world experiments that validate our approach in outdoor and driving settings, and demonstrate qualitative and quantitative improvements in depth RMSE when RGB cameras are used in tandem with a Light Curtain. Conference on Computer Vision and Pattern Recognition (CVPR), 2021 [Project Page] [Bibtex] [Abstract] [Code] [PDF] [Talk]
	Active Perception using Light Curtains for Autonomous Driving Siddharth Ancha, Yaadhav Raaj, Peiyun Hu, Srinivasa Narasimhan, David Held @inproceedings{Ancha_2020_ECCV, author="Ancha, Siddharth and Raaj, Yaadhav and Hu, Peiyun and Narasimhan, Srinivasa G. and Held, David", editor="Vedaldi, Andrea and Bischof, Horst and Brox, Thomas and Frahm, Jan-Michael", title="Active Perception Using Light Curtains for Autonomous Driving", booktitle="Computer Vision -- ECCV 2020", year="2020", publisher="Springer International Publishing", address="Cham", pages="751--766", isbn="978-3-030-58558-7" } Most real-world 3D sensors such as LiDARs perform fixed scans of the entire environment, while being decoupled from the recognition system that processes the sensor data. In this work, we propose a method for 3D object recognition using light curtains, a resource-efficient controllable sensor that measures depth at user-specified locations in the environment. Crucially, we propose using prediction uncertainty of a deep learning based 3D point cloud detector to guide active perception. Given a neural network’s uncertainty, we derive an optimization objective to place light curtains using the principle of maximizing information gain. Then, we develop a novel and efficient optimization algorithm to maximize this objective by encoding the physical constraints of the device into a constraint graph and optimizing with dynamic programming. We show how a 3D detector can be trained to detect objects in a scene by sequentially placing uncertainty-guided light curtains to successively improve detection accuracy. European Conference on Computer Vision (ECCV), 2020 - Spotlight presentation (Selection rate 5.3%) [Project Page] [Bibtex] [Abstract] [Code] [Long Talk] [PDF] [Short Talk]

Self-Supervised Learning for Robotics

Rather than relying on hand-annotated data, self-supervised learning can enable robots to learn from large unlabeled datasets.

Relevant Publications

	Active Velocity Estimation using Light Curtains via Self-Supervised Multi-Armed Bandits Siddharth Ancha, Gaurav Pathak, Ji Zhang, Srinivasa Narasimhan, David Held @inproceedings{ancha2023rss, title = {Active Velocity Estimation using Light Curtains via Self-Supervised Multi-Armed Bandits}, author = {Siddharth Ancha AND Gaurav Pathak AND Ji Zhang AND Srinivasa Narasimhan AND David Held}, booktitle = {Proceedings of Robotics: Science and Systems}, year = {2023}, address = {Daegu, Republic of Korea}, month = {July} } To navigate in an environment safely and autonomously, robots must accurately estimate where obstacles are and how they move. Instead of using expensive traditional 3D sensors, we explore the use of a much cheaper, faster, and higher resolution alternative: programmable light curtains. Light curtains are a controllable depth sensor that sense only along a surface that the user selects. We adapt a probabilistic method based on particle filters and occupancy grids to explicitly estimate the position and velocity of 3D points in the scene using partial measurements made by light curtains. The central challenge is to decide where to place the light curtain to accurately perform this task. We propose multiple curtain placement strategies guided by maximizing information gain and verifying predicted object locations. Then, we combine these strategies using an online learning framework. We propose a novel self-supervised reward function that evaluates the accuracy of current velocity estimates using future light curtain placements. We use a multi-armed bandit framework to intelligently switch between placement policies in real time, outperforming fixed policies. We develop a full-stack navigation system that uses position and velocity estimates from light curtains for downstream tasks such as localization, mapping, path-planning, and obstacle avoidance. This work paves the way for controllable light curtains to accurately, efficiently, and purposefully perceive and navigate complex and dynamic environments. Robotics: Science and Systems (RSS), 2023 [Project Page] [Bibtex] [Abstract] [arXiv] [Talk]
	Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting Tarasha Khurana, Peiyun Hu, David Held, Deva Ramanan @inproceedings{Khurana2023point, title={Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting}, author={Khurana, Tarasha and Hu, Peiyun and Held, David and Ramanan, Deva}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2023} } Predicting how the world can evolve in the future is crucial for motion planning in autonomous systems. Classical methods are limited because they rely on costly human annotations in the form of semantic class labels, bounding boxes, and tracks or HD maps of cities to plan their motion, and thus are difficult to scale to large unlabeled datasets. One promising self-supervised task is 3D point cloud forecasting from unannotated LiDAR sequences. We show that this task requires algorithms to implicitly capture (1) sensor extrinsics (i.e., the egomotion of the autonomous vehicle), (2) sensor intrinsics (i.e., the sampling pattern specific to the particular LiDAR sensor), and (3) the shape and motion of other objects in the scene. But autonomous systems should make predictions about the world and not their sensors! To this end, we factor out (1) and (2) by recasting the task as one of spacetime (4D) occupancy forecasting. But because it is expensive to obtain ground-truth 4D occupancy, we “render” point cloud data from 4D occupancy predictions given sensor extrinsics and intrinsics, allowing one to train and test occupancy algorithms with unannotated LiDAR sequences. This also allows one to evaluate and compare point cloud forecasting algorithms across diverse datasets, sensors, and vehicles. Conference on Computer Vision and Pattern Recognition (CVPR), 2023 [Bibtex] [Abstract]
	Self-supervised Cloth Reconstruction via Action-conditioned Cloth Tracking Zixuan Huang, Xingyu Lin, David Held @inproceedings{huang2023act, title={Self-supervised Cloth Reconstruction via Action-conditioned Cloth Tracking}, author={Huang, Zixuan and Lin, Xingyu and Held, David}, booktitle={IEEE International Conference on Robotics and Automation (ICRA), 2023}, year={2023} } State estimation is one of the greatest challenges for cloth manipulation due to cloth's high dimensionality and self-occlusion. Prior works propose to identify the full state of crumpled clothes by training a mesh reconstruction model in simulation. However, such models are prone to suffer from a sim-to-real gap due to differences between cloth simulation and the real world. In this work, we propose a self-supervised method to finetune a mesh reconstruction model in the real world. Since the full mesh of crumpled cloth is difficult to obtain in the real world, we design a special data collection scheme and an action-conditioned model-based cloth tracking method to generate pseudo-labels for self-supervised learning. By finetuning the pretrained mesh reconstruction model on this pseudo-labeled dataset, we show that we can improve the quality of the reconstructed mesh without requiring human annotations, and improve the performance of downstream manipulation tasks. International Conference on Robotics and Automation (ICRA), 2023 [Project Page] [Bibtex] [Abstract] [arXiv]
	Differentiable Raycasting for Self-supervised Occupancy Forecasting Tarasha Khurana, Peiyun Hu, Achal Dave, Jason Ziglar, David Held, Deva Ramanan European Conference on Computer Vision (ECCV), 2022
	Self-supervised Transparent Liquid Segmentation for Robotic Pouring Gautham Narayan Narasimhan, Kai Zhang, Ben Eisner, Xingyu Lin, David Held @inproceedings{icra2022pouring, title={Self-supervised Transparent Liquid Segmentation for Robotic Pouring}, author={Gautham Narayan Narasimhan and Kai Zhang and Ben Eisner and Xingyu Lin and David Held}, booktitle={International Conference on Robotics and Automation (ICRA)}, year={2022}} Liquid state estimation is important for robotics tasks such as pouring; however, estimating the state of transparent liquids is a challenging problem. We propose a novel segmentation pipeline that can segment transparent liquids such as water from a static, RGB image without requiring any manual annotations or heating of the liquid for training. Instead, we use a generative model that is capable of translating images of colored liquids into synthetically generated transparent liquid images, trained only on an unpaired dataset of colored and transparent liquid images. Segmentation labels of colored liquids are obtained automatically using background subtraction. Our experiments show that we are able to accurately predict a segmentation mask for transparent liquids without requiring any manual annotations. We demonstrate the utility of transparent liquid segmentation in a robotic pouring task that controls pouring by perceiving the liquid height in a transparent cup. Accompanying video and supplementary materials can be found on our project page. International Conference on Robotics and Automation (ICRA), 2022 [Project Page] [Bibtex] [Abstract] [Code] [PDF] [Poster] [Slides] [Video]
	OSSID: Online Self-Supervised Instance Detection by (and for) Pose Estimation Qiao Gu, Brian Okorn, David Held @article{ral2022ossid, author={Gu, Qiao and Okorn, Brian and Held, David}, journal={IEEE Robotics and Automation Letters}, title={OSSID: Online Self-Supervised Instance Detection by (And For) Pose Estimation}, year={2022}, volume={7}, number={2}, pages={3022-3029}, doi={10.1109/LRA.2022.3145488} } Real-time object pose estimation is necessary for many robot manipulation algorithms. However, state-of-the-art methods for object pose estimation are trained for a specific set of objects; these methods thus need to be retrained to estimate the pose of each new object, often requiring tens of GPU-days of training for optimal performance. In this paper, we propose the OSSID framework, leveraging a slow zero-shot pose estimator to self-supervise the training of a fast detection algorithm. This fast detector can then be used to filter the input to the pose estimator, drastically improving its inference speed. We show that this self-supervised training exceeds the performance of existing zero-shot detection methods on two widely used object pose estimation and detection datasets, without requiring any human annotations. Further, we show that the resulting method for pose estimation has a significantly faster inference speed, due to the ability to filter out large parts of the image. Thus, our method for self-supervised online learning of a detector (trained using pseudo-labels from a slow pose estimator) leads to accurate pose estimation at real-time speeds, without requiring human annotations. Robotics and Automation Letters (RAL) with presentation at the International Conference on Robotics and Automation (ICRA), 2022 [Project Page] [Bibtex] [Abstract] [Code] [PDF] [Video]
	Self-Supervised Point Cloud Completion via Inpainting Himangi Mittal, Brian Okorn, Arpit Jangid, David Held @article{mittal2021self, title={Self-Supervised Point Cloud Completion via Inpainting}, author={Mittal, Himangi and Okorn, Brian and Jangid, Arpit and Held, David}, journal={British Machine Vision Conference (BMVC), 2021}, year={2021} } When navigating in urban environments, many of the objects that need to be tracked and avoided are heavily occluded. Planning and tracking using these partial scans can be challenging. The aim of this work is to learn to complete these partial point clouds, giving us a full understanding of the object's geometry using only partial observations. Previous methods achieve this with the help of complete, ground-truth annotations of the target objects, which are available only for simulated datasets. However, such ground truth is unavailable for real-world LiDAR data. In this work, we present a self-supervised point cloud completion algorithm, PointPnCNet, which is trained only on partial scans without assuming access to complete, ground-truth annotations. Our method achieves this via inpainting. We remove a portion of the input data and train the network to complete the missing region. As it is difficult to determine which regions were occluded in the initial cloud and which were synthetically removed, our network learns to complete the full cloud, including the missing regions in the initial partial cloud. We show that our method outperforms previous unsupervised and weakly-supervised methods on both the synthetic dataset, ShapeNet, and real-world LiDAR dataset, Semantic KITTI. British Machine Vision Conference (BMVC), 2021 - Oral presentation (Selection rate 3.3%) [Project Page] [Bibtex] [Abstract] [PDF] [Video]
	Safe Local Motion Planning with Self-Supervised Freespace Forecasting Peiyun Hu, Aaron Huang, John Dolan, David Held, Deva Ramanan @inproceedings{cvpr2021husafe, title={Safe Local Motion Planning with Self-Supervised Freespace Forecasting}, author={Peiyun Hu and Aaron Huang and John Dolan and David Held and Deva Ramanan}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2021} } Safe local motion planning for autonomous driving in dynamic environments requires forecasting how the scene evolves. Practical autonomy stacks adopt a semantic object-centric representation of a dynamic scene and build object detection, tracking, and prediction modules to solve forecasting. However, training these modules comes at an enormous human cost of manually annotated objects across frames. In this work, we explore future freespace as an alternative representation to support motion planning. Our key intuition is that it is important to avoid straying into occupied space regardless of what is occupying it. Importantly, computing ground-truth future freespace is annotation-free. First, we explore freespace forecasting as a self-supervised learning task. We then demonstrate how to use forecasted freespace to identify collision-prone plans from off-the-shelf motion planners. Finally, we propose future freespace as an additional source of annotation-free supervision. We demonstrate how to integrate such supervision into the learning process of learning-based planners. Experimental results on nuScenes and CARLA suggest both approaches lead to significant reduction in collision rates. Conference on Computer Vision and Pattern Recognition (CVPR), 2021 [Project Page] [Bibtex] [Abstract] [Code] [Paper] [Poster] [Talk]
	Visual Self-Supervised Reinforcement Learning with Object Reasoning Yufei Wang, Gautham Narayan Narasimhan, Xingyu Lin, Brian Okorn, David Held @inproceedings{corl2020roll, title={ROLL: Visual Self-Supervised Reinforcement Learning with Object Reasoning}, author={Wang, Yufei and Narasimhan, Gautham and Lin, Xingyu and Okorn, Brian and Held, David}, booktitle={Conference on Robot Learning}, year={2020} } Current image-based reinforcement learning (RL) algorithms typically operate on the whole image without performing object-level reasoning. This leads to inefficient goal sampling and ineffective reward functions. In this paper, we improve upon previous visual self-supervised RL by incorporating object-level reasoning and occlusion reasoning. Specifically, we use unknown object segmentation to ignore distractors in the scene for better reward computation and goal generation; we further enable occlusion reasoning by employing a novel auxiliary loss and training scheme. We demonstrate that our proposed algorithm, ROLL (Reinforcement learning with Object Level Learning), learns dramatically faster and achieves better final performance compared with previous methods in several simulated visual control tasks. Project video and code are available at https://sites.google.com/andrew.cmu.edu/roll. Conference on Robot Learning (CoRL), 2020 [Project Page] [Bibtex] [Abstract] [Code] [PDF]
	Uncertainty-aware Self-supervised 3D Data Association Jianren Wang, Siddharth Ancha, Yi-Ting Chen, David Held @inproceedings{jianren20s3da, author = "Wang, Jianren and Ancha, Siddharth and Chen, Yi-Ting and Held, David", title = "Uncertainty-aware Self-supervised 3D Data Association", booktitle = "IROS", year = "2020" } 3D object trackers usually require training on large amounts of annotated data that is expensive and time-consuming to collect. Instead, we propose leveraging vast unlabeled datasets by self-supervised metric learning of 3D object trackers, with a focus on data association. Large scale annotations for unlabeled data are cheaply obtained by automatic object detection and association across frames. We show how these self-supervised annotations can be used in a principled manner to learn point-cloud embeddings that are effective for 3D tracking. We estimate and incorporate uncertainty in self-supervised tracking to learn more robust embeddings, without needing any labeled data. We design embeddings to differentiate objects across frames, and learn them using uncertainty-aware self-supervised training. Finally, we demonstrate their ability to perform accurate data association across frames, towards effective and accurate 3D tracking. International Conference on Intelligent Robots and Systems (IROS), 2020 [Project Page] [Bibtex] [Abstract] [Code] [PDF]
	Just Go with the Flow: Self-Supervised Scene Flow Estimation Himangi Mittal, Brian Okorn, David Held @InProceedings{Mittal_2020_CVPR, author = {Mittal, Himangi and Okorn, Brian and Held, David}, title = {Just Go With the Flow: Self-Supervised Scene Flow Estimation}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2020} } When interacting with highly dynamic environments, scene flow allows autonomous systems to reason about the non-rigid motion of multiple independent objects. This is of particular interest in the field of autonomous driving, in which many cars, people, bicycles, and other objects need to be accurately tracked. Current state-of-the-art methods require annotated scene flow data from autonomous driving scenes to train scene flow networks with supervised learning. As an alternative, we present a method of training scene flow that uses two self-supervised losses, based on nearest neighbors and cycle consistency. These self-supervised losses allow us to train our method on large unlabeled autonomous driving datasets; the resulting method matches current state-of-the-art supervised performance using no real world annotations and exceeds state-of-the-art performance when combining our self-supervised approach with supervised learning on a smaller labeled dataset. Conference on Computer Vision and Pattern Recognition (CVPR), 2020 - Oral presentation (Selection rate 5.7%) [Project Page] [Bibtex] [Abstract] [Code] [PDF]

Previous Directions

Object Tracking

Tracking involves consistently locating an object as it moves across a scene, or consistently locating a point on an object as it moves. To interact with objects, a robot must be able to track them through changes in position, viewpoint, lighting, occlusion, and other factors. Improvements in this area should enable autonomous vehicles to operate more safely around dynamic objects (e.g. pedestrians, bicyclists, and other vehicles).

Relevant Publications

	Self-supervised Cloth Reconstruction via Action-conditioned Cloth Tracking Zixuan Huang, Xingyu Lin, David Held @inproceedings{huang2023act, title={Self-supervised Cloth Reconstruction via Action-conditioned Cloth Tracking}, author={Huang, Zixuan and Lin, Xingyu and Held, David}, booktitle={IEEE International Conference on Robotics and Automation (ICRA), 2023}, year={2023} } State estimation is one of the greatest challenges for cloth manipulation due to cloth's high dimensionality and self-occlusion. Prior works propose to identify the full state of crumpled clothes by training a mesh reconstruction model in simulation. However, such models are prone to suffer from a sim-to-real gap due to differences between cloth simulation and the real world. In this work, we propose a self-supervised method to finetune a mesh reconstruction model in the real world. Since the full mesh of crumpled cloth is difficult to obtain in the real world, we design a special data collection scheme and an action-conditioned model-based cloth tracking method to generate pseudo-labels for self-supervised learning. By finetuning the pretrained mesh reconstruction model on this pseudo-labeled dataset, we show that we can improve the quality of the reconstructed mesh without requiring human annotations, and improve the performance of downstream manipulation tasks. International Conference on Robotics and Automation (ICRA), 2023 [Project Page] [Bibtex] [Abstract] [arXiv]
	3D Multi-Object Tracking: A Baseline and New Evaluation Metrics Xinshuo Weng, Jianren Wang, David Held, Kris Kitani @article{Weng2020_AB3DMOT, author = {Weng, Xinshuo and Wang, Jianren and Held, David and Kitani, Kris}, journal = {IROS}, title = {3D Multi-Object Tracking: A Baseline and New Evaluation Metrics}, year = {2020} } 3D multi-object tracking (MOT) is an essential component for many applications such as autonomous driving and assistive robotics. Recent work on 3D MOT focuses on developing accurate systems giving less attention to practical considerations such as computational cost and system complexity. In contrast, this work proposes a simple real-time 3D MOT system. Our system first obtains 3D detections from a LiDAR point cloud. Then, a straightforward combination of a 3D Kalman filter and the Hungarian algorithm is used for state estimation and data association. Additionally, 3D MOT datasets such as KITTI evaluate MOT methods in the 2D space and standardized 3D MOT evaluation tools are missing for a fair comparison of 3D MOT methods. Therefore, we propose a new 3D MOT evaluation tool along with three new metrics to comprehensively evaluate 3D MOT methods. We show that, although our system employs a combination of classical MOT modules, we achieve state-of-the-art 3D MOT performance on two 3D MOT benchmarks (KITTI and nuScenes). Surprisingly, although our system does not use any 2D data as inputs, we achieve competitive performance on the KITTI 2D MOT leaderboard. Our proposed system runs at a rate of 207.4 FPS on the KITTI dataset, achieving the fastest speed among all modern MOT systems. International Conference on Intelligent Robots and Systems (IROS), 2020 [Project Page] [Bibtex] [Abstract] [Code] [PDF]
	Just Go with the Flow: Self-Supervised Scene Flow Estimation Himangi Mittal, Brian Okorn, David Held @InProceedings{Mittal_2020_CVPR, author = {Mittal, Himangi and Okorn, Brian and Held, David}, title = {Just Go With the Flow: Self-Supervised Scene Flow Estimation}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2020} } When interacting with highly dynamic environments, scene flow allows autonomous systems to reason about the non-rigid motion of multiple independent objects. This is of particular interest in the field of autonomous driving, in which many cars, people, bicycles, and other objects need to be accurately tracked. Current state-of-the-art methods require annotated scene flow data from autonomous driving scenes to train scene flow networks with supervised learning. As an alternative, we present a method of training scene flow that uses two self-supervised losses, based on nearest neighbors and cycle consistency. These self-supervised losses allow us to train our method on large unlabeled autonomous driving datasets; the resulting method matches current state-of-the-art supervised performance using no real world annotations and exceeds state-of-the-art performance when combining our self-supervised approach with supervised learning on a smaller labeled dataset. Conference on Computer Vision and Pattern Recognition (CVPR), 2020 - Oral presentation (Selection rate 5.7%) [Project Page] [Bibtex] [Abstract] [Code] [PDF]
	A Probabilistic Framework for Real-time 3D Segmentation using Spatial, Temporal, and Semantic Cues David Held, Devin Guillory, Brice Rebsamen, Sebastian Thrun, Silvio Savarese Robotics: Science and Systems (RSS), 2016
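The 3D multi-object tracking entry above describes a pipeline of Kalman-filter state estimation followed by assignment-based data association. The following is a minimal sketch of that general idea, not the authors' implementation. Several simplifications are assumptions made here for brevity: each track uses an independent constant-velocity filter per axis on the object centre only, the association cost is Euclidean distance rather than a 3D overlap measure, and a brute-force search over assignments stands in for the Hungarian algorithm (it returns the same optimum but only scales to a handful of objects). The names `AxisKF`, `associate`, and `gate` are hypothetical.

```python
from itertools import permutations
import math


class AxisKF:
    """Constant-velocity Kalman filter for one axis (state: position, velocity)."""

    def __init__(self, pos, q=0.01, r=0.1):
        self.p, self.v = pos, 0.0
        # Symmetric 2x2 covariance stored as [[a, b], [b, d]].
        self.a, self.b, self.d = 1.0, 0.0, 1.0
        self.q, self.r = q, r  # process and measurement noise (assumed values)

    def predict(self):
        # One time step with F = [[1, 1], [0, 1]]: P <- F P F^T + q * I.
        self.p += self.v
        self.a, self.b, self.d = (self.a + 2 * self.b + self.d + self.q,
                                  self.b + self.d,
                                  self.d + self.q)
        return self.p

    def update(self, z):
        # Position-only measurement, H = [1, 0].
        s = self.a + self.r
        k0, k1 = self.a / s, self.b / s
        y = z - self.p
        self.p += k0 * y
        self.v += k1 * y
        self.a, self.b, self.d = ((1 - k0) * self.a,
                                  (1 - k0) * self.b,
                                  self.d - k1 * self.b)


def associate(predicted, detections, gate=2.0):
    """Minimum-cost matching of predicted track positions to detections.

    Leaving a track unmatched costs `gate`, so any pair farther apart than
    `gate` is never matched. Brute force stands in for the Hungarian algorithm.
    """
    n = len(predicted)
    cost = [[math.dist(p, d) for d in detections] for p in predicted]
    columns = list(range(len(detections))) + [None] * n
    best, best_cost = {}, math.inf
    for perm in permutations(columns, n):
        total = sum(gate if j is None else cost[i][j] for i, j in enumerate(perm))
        if total < best_cost:
            best_cost = total
            best = {i: j for i, j in enumerate(perm) if j is not None}
    return best


# Two tracks, with detections arriving in swapped order.
tracks = [[AxisKF(c) for c in pos] for pos in [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]]
predicted = [tuple(kf.predict() for kf in track) for track in tracks]
detections = [(10.2, 0.0, 0.0), (0.1, 0.0, 0.0)]
matches = associate(predicted, detections)
for i, j in matches.items():
    for kf, z in zip(tracks[i], detections[j]):
        kf.update(z)
print(matches)  # {0: 1, 1: 0}
```

For realistic numbers of objects, the brute-force search would be replaced by a proper Hungarian solver; the gating idea stays the same.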

	ArticuBot: Learning Universal Articulated Object Manipulation Policy via Large Scale Simulation Yufei Wang, Ziyu Wang, Mino Nakura†, Pratik Bhowal†, Chia-Liang Kuo†, Yi-Ting Chen, Zackory Erickson‡, David Held‡ @INPROCEEDINGS{wang-2025-ArticuBot, title={{ArticuBot: Learning Universal Articulated Object Manipulation Policy via Large Scale Simulation}}, author={Wang, Yufei and Wang, Ziyu and Nakura, Mino and Bhowal, Pratik and Kuo, Chia-Liang and Chen, Yi-Ting and Erickson, Zackory and Held, David}, BOOKTITLE={Proceedings of Robotics: Science and Systems (RSS)}, year={2025} } This paper presents ArticuBot, in which a single learned policy enables a robotics system to open diverse categories of unseen articulated objects in the real world. This task has long been challenging for robotics due to the large variations in the geometry, size, and articulation types of such objects. Our system, ArticuBot, consists of three parts: generating a large number of demonstrations in physics-based simulation, distilling all generated demonstrations into a point cloud-based neural policy via imitation learning, and performing zero-shot sim2real transfer to real robotics systems. Utilizing sampling-based grasping and motion planning, our demonstration generation pipeline is fast and effective, generating a total of 42.3k demonstrations over 322 training articulated objects. For policy learning, we propose a novel hierarchical policy representation, in which the high-level policy learns the sub-goal for the end-effector, and the low-level policy learns how to move the end-effector conditioned on the predicted goal. We demonstrate that this hierarchical approach achieves much better object-level generalization compared to the non-hierarchical version. We further propose a novel weighted displacement model for the high-level policy that grounds the prediction into the existing 3D structure of the scene, outperforming alternative policy representations. We show that our learned policy can zero-shot transfer to three different real robot settings: a fixed table-top Franka arm across two different labs, and an X-Arm on a mobile base, opening multiple unseen articulated objects across two labs, real lounges, and kitchens. Robotics: Science and Systems (RSS), 2025 [Project Page] [Bibtex] [Abstract] [arXiv] [Code] [Video]
	HACMan++: Spatially-Grounded Motion Primitives for Manipulation Bowen Jiang, Yilin Wu, Wenxuan Zhou, Chris Paxton, David Held @INPROCEEDINGS{Jiang-RSS-24, AUTHOR = {Bowen Jiang AND Yilin Wu AND Wenxuan Zhou AND Chris Paxton AND David Held}, TITLE = {{HACMan++: Spatially-Grounded Motion Primitives for Manipulation}}, BOOKTITLE = {Proceedings of Robotics: Science and Systems}, YEAR = {2024}, ADDRESS = {Delft, Netherlands}, MONTH = {July}, DOI = {10.15607/RSS.2024.XX.129} } We present HACMan++, a reinforcement learning framework using a novel action space of spatially-grounded parameterized motion primitives for manipulation tasks. Robotics: Science and Systems (RSS), 2024 [Project Page] [Bibtex] [Abstract] [arXiv]
	Learning Distributional Demonstration Spaces for Task-Specific Cross-Pose Estimation Jenny Wang, Octavian Donca, David Held @article{wang2024taxposed, title={Learning Distributional Demonstration Spaces for Task-Specific Cross-Pose Estimation}, author={Wang, Jenny and Donca, Octavian and Held, David}, journal={IEEE International Conference on Robotics and Automation (ICRA), 2024}, year={2024} } Relative placement tasks are an important category of tasks in which one object needs to be placed in a desired pose relative to another object. Previous work has shown success in learning relative placement tasks from just a small number of demonstrations, when using relational reasoning networks with geometric inductive biases. However, such methods fail to consider that demonstrations for the same task can be fundamentally multimodal, like a mug hanging on any of n racks. We propose a method that retains the provably translation-invariant and relational properties of prior work but incorporates additional properties that account for multimodal, distributional examples. We show that our method is able to learn precise relative placement tasks with a small number of multimodal demonstrations with no human annotations across a diverse set of objects within a category. International Conference on Robotics and Automation (ICRA), 2024 [Project Page] [Bibtex] [Abstract] [arXiv] [Poster] [Slides] [Code]
	One Policy to Dress Them All: Learning to Dress People with Diverse Poses and Garments Yufei Wang, Zhanyi Sun, Zackory Erickson, David Held @inproceedings{Wang2023One, title={One Policy to Dress Them All: Learning to Dress People with Diverse Poses and Garments}, author={Wang, Yufei and Sun, Zhanyi and Erickson, Zackory and Held, David}, booktitle={Robotics: Science and Systems (RSS)}, year={2023} } Robot-assisted dressing could benefit the lives of many people such as older adults and individuals with disabilities. Despite such potential, robot-assisted dressing remains a challenging task for robotics as it involves complex manipulation of deformable cloth in 3D space. Many prior works aim to solve the robot-assisted dressing task, but they make certain assumptions such as a fixed garment and a fixed arm pose that limit their ability to generalize. In this work, we develop a robot-assisted dressing system that is able to dress different garments on people with diverse poses from partial point cloud observations, based on a learned policy. We show that with proper design of the policy architecture and Q function, reinforcement learning (RL) can be used to learn effective policies with partial point cloud observations that work well for dressing diverse garments. We further leverage policy distillation to combine multiple policies trained on different ranges of human arm poses into a single policy that works over a wide range of different arm poses. We conduct comprehensive real-world evaluations of our system with 510 dressing trials in a human study with 17 participants with different arm poses and dressed garments. Our system is able to dress 86% of the length of the participants' arms on average. Videos can be found on the anonymized project webpage: https://sites.google.com/view/one-policy-dress. Robotics: Science and Systems (RSS), 2023 [Project Page] [Bibtex] [Abstract] [arXiv]
	Neural Grasp Distance Fields for Robot Manipulation Thomas Weng, David Held, Franziska Meier, Mustafa Mukadam @article{weng2023ngdf, title={Neural Grasp Distance Fields for Robot Manipulation}, author={Weng, Thomas and Held, David and Meier, Franziska and Mukadam, Mustafa}, booktitle={IEEE International Conference on Robotics and Automation (ICRA)}, year={2023} } We formulate grasp learning as a neural field and present Neural Grasp Distance Fields (NGDF). Here, the input is a 6D pose of a robot end effector and output is a distance to a continuous manifold of valid grasps for an object. In contrast to current approaches that predict a set of discrete candidate grasps, the distance-based NGDF representation is easily interpreted as a cost, and minimizing this cost produces a successful grasp pose. This grasp distance cost can be incorporated directly into a trajectory optimizer for joint optimization with other costs such as trajectory smoothness and collision avoidance. During optimization, as the various costs are balanced and minimized, the grasp target is allowed to smoothly vary, as the learned grasp field is continuous. In simulation benchmarks with a Franka arm, we find that joint grasping and planning with NGDF outperforms baselines by 63% execution success while generalizing to unseen query poses and unseen object shapes. International Conference on Robotics and Automation (ICRA), 2023 [Project Page] [Bibtex] [Abstract] [arXiv] [Code]
	TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation Chuer Pan, Brian Okorn, Harry Zhang, Ben Eisner, David Held @inproceedings{pan2022tax, title={TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation}, author={Pan, Chuer and Okorn, Brian and Zhang, Harry and Eisner, Ben and Held, David}, booktitle={Conference on Robot Learning (CoRL)}, year={2022} } How do we imbue robots with the ability to efficiently manipulate unseen objects and transfer relevant skills based on demonstrations? End-to-end learning methods often fail to generalize to novel objects or unseen configurations. Instead, we focus on the task-specific pose relationship between relevant parts of interacting objects. We conjecture that this relationship is a generalizable notion of a manipulation task that can transfer to new objects in the same category; examples include the relationship between the pose of a pan relative to an oven or the pose of a mug relative to a mug rack. We call this task-specific pose relationship “cross-pose” and provide a mathematical definition of this concept. We propose a vision-based system that learns to estimate the cross-pose between two objects for a given manipulation task using learned cross-object correspondences. The estimated cross-pose is then used to guide a downstream motion planner to manipulate the objects into the desired pose relationship (placing a pan into the oven or the mug onto the mug rack). We demonstrate our method’s capability to generalize to unseen objects, in some cases after training on only 10 demonstrations in the real world. Results show that our system achieves state-of-the-art performance in both simulated and real-world experiments across a number of tasks. Conference on Robot Learning (CoRL), 2022 [Project Page] [Bibtex] [Abstract] [PDF]
	Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation Xingyu Lin, Carl Qi, Yunchu Zhang, Zhiao Huang, Katerina Fragkiadaki, Yunzhu Li, Chuang Gan, David Held @inproceedings{lin2022planning, title={Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation}, author={Xingyu Lin and Carl Qi and Yunchu Zhang and Yunzhu Li and Zhiao Huang and Katerina Fragkiadaki and Chuang Gan and David Held}, booktitle={6th Annual Conference on Robot Learning}, year={2022}, url={https://openreview.net/forum?id=tyxyBj2w4vw} } Effective planning of long-horizon deformable object manipulation requires suitable abstractions at both the spatial and temporal levels. Previous methods typically either focus on short-horizon tasks or make strong assumptions that full-state information is available, which prevents their use on deformable objects. In this paper, we propose PlAnning with Spatial-Temporal Abstraction (PASTA), which incorporates both spatial abstraction (reasoning about objects and their relations to each other) and temporal abstraction (reasoning over skills instead of low-level actions). Our framework maps high-dimension 3D observations such as point clouds into a set of latent vectors and plans over skill sequences on top of the latent set representation. We show that our method can effectively perform challenging sequential deformable object manipulation tasks in the real world, which require combining multiple tool-use skills such as cutting with a knife, pushing with a pusher, and spreading dough with a roller. Conference on Robot Learning (CoRL), 2022 [Project Page] [Bibtex] [Abstract] [OpenReview] [PDF] [Code]
	Visual Haptic Reasoning: Estimating Contact Forces by Observing Deformable Object Interactions Yufei Wang, David Held, Zackory Erickson @article{wang2022visual, title={Visual Haptic Reasoning: Estimating Contact Forces by Observing Deformable Object Interactions}, author={Wang, Yufei and Held, David and Erickson, Zackory}, journal={IEEE Robotics and Automation Letters}, volume={7}, number={4}, pages={11426--11433}, year={2022}, publisher={IEEE} } Robotic manipulation of highly deformable cloth presents a promising opportunity to assist people with several daily tasks, such as washing dishes; folding laundry; or dressing, bathing, and hygiene assistance for individuals with severe motor impairments. In this work, we introduce a formulation that enables a collaborative robot to perform visual haptic reasoning with cloth -- the act of inferring the location and magnitude of applied forces during physical interaction. We present two distinct model representations, trained in physics simulation, that enable haptic reasoning using only visual and robot kinematic observations. We conducted quantitative evaluations of these models in simulation for robot-assisted dressing, bathing, and dish washing tasks, and demonstrate that the trained models can generalize across different tasks with varying interactions, human body sizes, and object shapes. We also present results with a real-world mobile manipulator, which used our simulation-trained models to estimate applied contact forces while performing physically assistive tasks with cloth. Robotics and Automation Letters (RAL) with presentation at the International Conference on Intelligent Robots and Systems (IROS), 2022 [Project Page] [Bibtex] [Abstract] [PDF] [Talk]
	FlowBot3D: Learning 3D Articulation Flow to Manipulate Articulated Objects Ben Eisner, Harry Zhang, David Held @inproceedings{EisnerZhang2022FLOW, title={FlowBot3D: Learning 3D Articulation Flow to Manipulate Articulated Objects}, author={Eisner, Ben and Zhang, Harry and Held, David}, booktitle={Robotics: Science and Systems (RSS)}, year={2022} } We explore a novel method to perceive and manipulate 3D articulated objects that generalizes to enable a robot to articulate unseen classes of objects. We propose a vision-based system that learns to predict the potential motions of the parts of a variety of articulated objects to guide downstream motion planning of the system to articulate the objects. To predict the object motions, we train a neural network to output a dense vector field representing the point-wise motion direction of the points in the point cloud under articulation. We then deploy an analytical motion planner based on this vector field to achieve a policy that yields maximum articulation. We train the vision system entirely in simulation, and we demonstrate the capability of our system to generalize to unseen object instances and novel categories in both simulation and the real world, deploying our policy on a Sawyer robot with no finetuning. Results show that our system achieves state-of-the-art performance in both simulated and real-world experiments. Robotics: Science and Systems (RSS), 2022 - Best Paper Finalist (Selection Rate 1.5%) [Project Page] [Bibtex] [Abstract] [PDF]
	Learning Visible Connectivity Dynamics for Cloth Smoothing Xingyu Lin, Yufei Wang, Zixuan Huang, David Held @inproceedings{lin2021VCD, title={Learning Visible Connectivity Dynamics for Cloth Smoothing}, author={Lin, Xingyu and Wang, Yufei and Huang, Zixuan and Held, David}, booktitle={Conference on Robot Learning}, year={2021}} Robotic manipulation of cloth remains challenging for robotics due to the complex dynamics of the cloth, lack of a low-dimensional state representation, and self-occlusions. In contrast to previous model-based approaches that learn a pixel-based dynamics model or a compressed latent vector dynamics, we propose to learn a particle-based dynamics model from a partial point cloud observation. To overcome the challenges of partial observability, we infer which visible points are connected on the underlying cloth mesh. We then learn a dynamics model over this visible connectivity graph. Compared to previous learning-based approaches, our model poses strong inductive bias with its particle based representation for learning the underlying cloth physics; it is invariant to visual features; and the predictions can be more easily visualized. We show that our method greatly outperforms previous state-of-the-art model-based and model-free reinforcement learning methods in simulation. Furthermore, we demonstrate zero-shot sim-to-real transfer where we deploy the model trained in simulation on a Franka arm and show that the model can successfully smooth different types of cloth from crumpled configurations. Videos can be found on our project website. Conference on Robot Learning (CoRL), 2021 [Project Page] [Bibtex] [Abstract] [OpenReview] [PDF] [Code]
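The FlowBot3D entry above describes predicting a dense per-point motion field and then using an analytical planner on top of it. One simple way such a planner could turn a flow field into an action is sketched below. The selection rule used here (contact the point with the largest predicted flow and move along its direction) is an assumption made for illustration, as are the function name `select_action` and the toy data; the abstract itself only says the planner aims for maximum articulation.

```python
import math


def select_action(points, flows):
    """Pick a contact point and unit motion direction from a per-point flow field.

    points: list of (x, y, z) positions in the observed point cloud.
    flows:  list of (dx, dy, dz) predicted motion vectors, one per point.
    """
    magnitudes = [math.sqrt(sum(c * c for c in f)) for f in flows]
    best = max(range(len(points)), key=magnitudes.__getitem__)
    if magnitudes[best] == 0.0:
        raise ValueError("no point is predicted to move")
    direction = tuple(c / magnitudes[best] for c in flows[best])
    return points[best], direction


# Toy example: three points on a door; the one farthest from the hinge moves most.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
flows = [(0.0, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.5)]
contact, direction = select_action(points, flows)
print(contact, direction)
```

In a closed-loop setting, this selection would be repeated after each small motion so the direction follows the part as it articulates.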

	Force-Constrained Visual Policy: Safe Robot-Assisted Dressing via Multi-Modal Sensing Zhanyi Sun, Yufei Wang, David Held†, Zackory Erickson† @article{sun2024force, title={Force-Constrained Visual Policy: Safe Robot-Assisted Dressing via Multi-Modal Sensing}, author={Sun, Zhanyi and Wang, Yufei and Held, David and Erickson, Zackory}, journal={IEEE Robotics and Automation Letters}, year={2024} } Robot-assisted dressing could profoundly enhance the quality of life of adults with physical disabilities. To achieve this, a robot can benefit from both visual and force sensing. The former enables the robot to ascertain human body pose and garment deformations, while the latter helps maintain safety and comfort during the dressing process. In this paper, we introduce a new technique that leverages both vision and force modalities for this assistive task. Our approach first trains a vision-based dressing policy using reinforcement learning in simulation with varying body sizes, poses, and types of garments. We then learn a force dynamics model for action planning to ensure safety. Due to limitations of simulating accurate force data when deformable garments interact with the human body, we learn a force dynamics model directly from real-world data. Our proposed method combines the vision-based policy, trained in simulation, with the force dynamics model, learned in the real world, by solving a constrained optimization problem to infer actions that facilitate the dressing process without applying excessive force on the person. We evaluate our system in simulation and in a real-world human study with 10 participants across 240 dressing trials, showing it greatly outperforms prior baselines. Video demonstrations are available on our project website. Robotics and Automation Letters (RAL), 2024 [Project Page] [Bibtex] [Abstract] [arXiv]
	Learning to Singulate Layers of Cloth using Tactile Feedback Sashank Tirumala, Thomas Weng, Daniel Seita*, Oliver Kroemer, Zeynep Temel, David Held @inproceedings{tirumala2022reskin, author={Tirumala, Sashank and Weng, Thomas and Seita, Daniel and Kroemer, Oliver and Temel, Zeynep and Held, David}, booktitle={2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, title={Learning to Singulate Layers of Cloth using Tactile Feedback}, year={2022}, volume={}, number={}, pages={7773-7780}, doi={10.1109/IROS47612.2022.9981341} } Robotic manipulation of cloth has applications ranging from fabrics manufacturing to handling blankets and laundry. Cloth manipulation is challenging for robots largely due to cloth's high degrees of freedom, complex dynamics, and severe self-occlusions when in folded or crumpled configurations. Prior work on robotic manipulation of cloth relies primarily on vision sensors alone, which may pose challenges for fine-grained manipulation tasks such as grasping a desired number of cloth layers from a stack of cloth. In this paper, we propose to use tactile sensing for cloth manipulation; we attach a tactile sensor (ReSkin) to one of the two fingertips of a Franka robot and train a classifier to determine whether the robot is grasping a specific number of cloth layers. During test-time experiments, the robot uses this classifier as part of its policy to grasp one or two cloth layers using tactile feedback to determine suitable grasping points. Experimental results over 180 physical trials suggest that the proposed method outperforms baselines that do not use tactile feedback and has a better generalization to unseen fabrics compared to methods that use image classifiers. International Conference on Intelligent Robots and Systems (IROS), 2022 - Best Paper at ROMADO-SI [Project Page] [Bibtex] [Abstract] [arXiv] [Code]
	PLAS: Latent Action Space for Offline Reinforcement Learning Wenxuan Zhou, Sujay Bajracharya, David Held @inproceedings{PLAS_corl2020, title={PLAS: Latent Action Space for Offline Reinforcement Learning}, author={Zhou, Wenxuan and Bajracharya, Sujay and Held, David}, booktitle={Conference on Robot Learning}, year={2020} } The goal of offline reinforcement learning is to learn a policy from a fixed dataset, without further interactions with the environment. This setting will be an increasingly more important paradigm for real-world applications of reinforcement learning such as robotics, in which data collection is slow and potentially dangerous. Existing off-policy algorithms have limited performance on static datasets due to extrapolation errors from out-of-distribution actions. This leads to the challenge of constraining the policy to select actions within the support of the dataset during training. We propose to simply learn the Policy in the Latent Action Space (PLAS) such that this requirement is naturally satisfied. We evaluate our method on continuous control benchmarks in simulation and a deformable object manipulation task with a physical robot. We demonstrate that our method provides competitive performance consistently across various continuous control tasks and different types of datasets, outperforming existing offline reinforcement learning methods with explicit constraints. Conference on Robot Learning (CoRL), 2020 - Plenary talk (Selection rate 4.1%) [Project Page] [Bibtex] [Abstract] [Code] [PDF]
	Cloth Region Segmentation for Robust Grasp Selection Jianing Qian, Thomas Weng, Luxin Zhang, Brian Okorn, David Held @inproceedings{Qian_2020_IROS, author={Qian, Jianing and Weng, Thomas and Zhang, Luxin and Okorn, Brian and Held, David}, booktitle={2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, title={Cloth Region Segmentation for Robust Grasp Selection}, year={2020}, volume={}, number={}, pages={9553-9560}, doi={10.1109/IROS45743.2020.9341121}} Cloth detection and manipulation is a common task in domestic and industrial settings, yet such tasks remain a challenge for robots due to cloth deformability. Furthermore, in many cloth-related tasks like laundry folding and bed making, it is crucial to manipulate specific regions like edges and corners, as opposed to folds. In this work, we focus on the problem of segmenting and grasping these key regions. Our approach trains a network to segment the edges and corners of a cloth from a depth image, distinguishing such regions from wrinkles or folds. We also provide a novel algorithm for estimating the grasp location, direction, and directional uncertainty from the segmentation. We demonstrate our method on a real robot system and show that it outperforms baseline methods on grasping success. Video and other supplementary materials are available at: International Conference on Intelligent Robots and Systems (IROS), 2020 [Project Page] [Bibtex] [Abstract] [PDF] [Code]
	Multi-Modal Transfer Learning for Grasping Transparent and Specular Objects Thomas Weng, Amith Pallankize, Yimin Tang, Oliver Kroemer, David Held @ARTICLE{9001238, author={Thomas Weng and Amith Pallankize and Yimin Tang and Oliver Kroemer and David Held}, journal={IEEE Robotics and Automation Letters}, title={Multi-Modal Transfer Learning for Grasping Transparent and Specular Objects}, year={2020}, volume={5}, number={3}, pages={3791-3798}, doi={10.1109/LRA.2020.2974686}} State-of-the-art object grasping methods rely on depth sensing to plan robust grasps, but commercially available depth sensors fail to detect transparent and specular objects. To improve grasping performance on such objects, we introduce a method for learning a multi-modal perception model by bootstrapping from an existing uni-modal model. This transfer learning approach requires only a pre-existing uni-modal grasping model and paired multi-modal image data for training, foregoing the need for ground-truth grasp success labels or real grasp attempts. Our experiments demonstrate that our approach is able to reliably grasp transparent and reflective objects. Video and supplementary material are available at Robotics and Automation Letters (RAL) with presentation at the International Conference on Robotics and Automation (ICRA), 2020 [Project Page] [Bibtex] [Abstract] [PDF]