Berkeley Researchers Turn Everyday Videos Into Robot Training Data

Do As I Do

Insider Brief

  • UC Berkeley researchers developed Do as I Do, a software system that turns ordinary human hand-object videos into training data for robots performing delicate manipulation tasks.
  • The system reconstructs hand and object motion from two-dimensional video, then retargets that motion through a physics simulator so a robotic hand can carry out the task.
  • The researchers report that the full system succeeded on 71% of reconstructed internet-video tasks, compared with 25% for a baseline method, but found that only about 5% of filtered online hand-object footage may be directly useful for robot training.
  • Image: Researchers introduce DO AS I DO, an algorithm that takes in-the-wild monocular RGB videos of
    hand-object interaction (top) and generates dexterous hand manipulation data (bottom).

Researchers at the University of California, Berkeley have built software that converts ordinary videos of people using their hands into data that can train robots to perform delicate, multi-fingered tasks.

The system, described in a paper posted to the preprint server arXiv, reportedly can take a clip pulled from the web, reconstruct what the person’s hand and the object were doing, and produce movements that a robotic hand can actually carry out.

The work targets one of robotics’ most stubborn bottlenecks. Machine-learning systems improve as they consume more data, but data showing robots how to manipulate objects has been scarce and costly to collect. Most of it comes from teleoperation, in which a human operator drives the robot by hand, or from simulated trial and error. Teleoperation is limited by operator skill and the expense of running the equipment, the researchers write, while simulation requires engineers to design environments and reward rules by hand.

Human videos offer a way around both problems. People constantly record themselves cooking, cleaning, drawing and building, and that footage already shows hands doing exactly the kind of fine manipulation roboticists want to replicate. The trouble, the researchers report, is that almost all of this footage is plain two-dimensional video shot with a single camera, which makes it hard for a computer to recover the three-dimensional motion of a hand gripping an object, and harder still to map that motion onto a robot hand shaped nothing like a human one.

How the System Works

The system, called Do as I Do, tackles the problem in two stages. The first stage reconstructs the scene. The software estimates the shape of the object, tracks how the hand and object move through space over time, and recovers depth and camera information, all from flat video. The team built on recent advances in computer vision, including a hand-tracking model called HaWoR and a generative model named SAM 3D that can produce a three-dimensional shape of an object from a single image, even when part of it is hidden behind fingers.

To keep the object steady from one video frame to the next, the researchers repurposed SAM 3D, which was designed to generate objects from still images, into a video tracker. They fixed the object’s shape early and then nudged the model to update only its position and orientation in each frame, guided by where the object sat in the previous frame. The team reports that human reviewers preferred this approach over the prior leading method.

The second stage, which the team calls retargeting, transforms the reconstructed human motion into commands a robot can run. Because a robot hand has different proportions and joint limits than a human hand, and because video alone cannot reveal the forces involved in a grip, the researchers ran the motion through a physics simulator and searched for a sequence of robot actions that reproduced the task while obeying real-world physics. The method builds on a technique called SPIDER and adds several refinements meant to cope with the noisy, imperfect motion estimates that come out of casual video.

Accuracy Jump

The researchers measured the system against existing methods on two fronts. For reconstructing hand-object interaction, they tested on standard benchmark datasets and on a new collection of 150 clips drawn from internet videos, recordings made from a wearer’s point of view, and footage produced by artificial-intelligence video generators. Because the internet clips lack a verified answer to compare against, the team asked volunteers to judge which method tracked objects more faithfully. Reviewers preferred Do as I Do 67% of the time, compared with 18% for FoundationPose, the prior leading tracker, with the remaining 15% rated as ties.

The retargeting stage showed an even larger gain. On the reconstructed internet footage, the full system succeeded 71% of the time, up from 25% for the baseline approach, the researchers report. On a separate dataset of cleanly recorded human motion, success climbed to 81% from 72%. The biggest single contributor was a step the team calls warmup, which lets the robot adjust its grip before it begins tracking the human motion, rather than starting from a shaky first frame that may be impossible to recover from.

In total, the pipeline produced 500 manipulation trajectories that human reviewers verified as high quality, drawn 53% from internet video, 31% from first-person recordings and 16% from generated footage. The team then ran a representative set on a real robot, a two-armed setup with UR3e arms and 22-joint Sharpa Wave hands. The robot performed 10 tasks, including whisking, pouring, dusting, squeezing, hammering, spreading and picking up objects.

A Warning About Data Quality

The paper offers a caution to the growing number of labs racing to mine human video for robotics. To gauge how much online footage is actually usable, the researchers sampled 2,000 ten-second clips from a public dataset that had already been filtered for hand-object interaction. Only 187 clips, or 9%, contained meaningful interaction, the researchers write. After discarding clips in which the hand or object left the frame, in which the camera moved too much, or in which the reconstruction failed, just 83 clips, about 4%, survived. Even allowing for future improvements, the team estimates that roughly 5% of such footage will prove directly useful for teaching robots to manipulate objects.

That figure implies not properly preprocessing and filtering internet videos for robot learning results in a 20x penalty, the researchers write. The lesson is that raw volume of video is misleading and that careful screening matters as much as quantity.

There are some limitations with the approach and some areas where future research work should be aimed. First, the system assumes objects are rigid, so it may fail on soft or articulated items. It also reconstructs only the hand and a single object rather than the full scene, which means it cannot reason about obstacles or other constraints in the environment. And because physics simulators model the real world only approximately, that gap places a ceiling on how well the method can ultimately perform.

For a deeper, more technical dive, please review the paper on arXiv. It’s important to note that arXiv is a pre-print server, which allows researchers to receive quick feedback on their work. However, it is not — nor is this article, itself — official peer-review publications. Peer-review is an important step in the scientific process to verify results.

The paper lists Bhawna Paliwal, Haritheja Etukuru and William Liang as equal lead contributors, with co-authors Pieter Abbeel, Nur Muhammad “Mahi” Shafiullah and Jitendra Malik.

Need Deeper Intelligence on the AI Market?

AI Insider's Market Intelligence platform tracks funding rounds, competitive landscapes, and technology trends across the global AI ecosystem in real time. Get the data and insights your organization needs to make informed decisions.

Related Articles

Stacks of hundred dollar bills are shown.
Equal AI Announces $30M in Funding to scale AI-powered Voice Screening and Conversational Intelligence Platform Globally

Equal AI, a Hyderabad-based developer of AI-native phone screening and conversational intelligence software, has closed a $30 million Series B round co-led by Prosus Ventures

a neon neon sign that is on the side of a wall
Qorelo Secures €3M to Automate AI-powered SAP Migrations as 2027 Deadline Creates Global Delivery Crisis

Qorelo, a Berlin-based startup building an AI engine for enterprise resource planning delivery, has raised €3 million ($3.5 million) in seed funding just five months

fan of 100 U.S. dollar banknotes
AI Lab Radical Numerics Launches with $50M Seed Round To Build General Biological Intelligence

Insider Brief PRESS RELEASE — Radical Numerics, an AI research lab building general biological intelligence, has announced its launch with $50 million in seed funding.

Stay Updated with AI Insider

Get the latest AI funding news, market intelligence, and industry insights delivered to your inbox weekly.

$ 0 M

Seed round tracked

Gitar — Code Validation

Get the Weekly Briefing

Funding analysis, market intelligence, and industry trends delivered to your inbox every week.

Need bespoke intelligence?

Our team combines real-time data with decades of sector experience to guide your decisions.

Subscribe today for the latest news about the AI landscape