Open Phantom: Training Robots Using Only Human Videos

Overview

Open Phantom is a fully open-source implementation of the approach described in the paper "Phantom: Training Robots Without Robots Using Only Human Videos." This project focuses on the data collection component of the Phantom pipeline, enabling anyone with a standard RGB camera to generate training data for robot learning without requiring actual robot hardware.

Key Features

  • Camera-Only Data Collection: Capture hand movements using any standard RGB camera
  • 3D Hand Tracking: Convert 2D video to 3D hand poses using MediaPipe landmarks (see the sketch after this list)
  • Advanced Depth Estimation: Generate depth maps from monocular RGB input using ml-depth-pro
  • Hand Segmentation: Precisely isolate hand regions with Meta's SAM2 for better depth estimation
  • ICP Registration: Align hand mesh with depth point cloud for improved 3D accuracy
  • Anatomical Constraints: Apply natural hand constraints to ensure realistic movements
  • Robot Action Extraction: Transform hand poses into robot control parameters (position, orientation, gripper width)
  • Visualization Pipeline: Debug-friendly visualization of each processing stage
  • Commercial-Friendly: Built entirely with open-source, commercially usable components
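
The hand-tracking step can be reproduced with MediaPipe's Hands solution on its own. The snippet below is a minimal sketch under that assumption (it is not the project's exact code): it reads frames from a standard webcam and prints the 21 landmarks MediaPipe reports per detected hand, each with normalized x/y image coordinates and a relative z value.

import cv2
import mediapipe as mp

# MediaPipe Hands returns 21 landmarks per detected hand.
hands = mp.solutions.hands.Hands(
    static_image_mode=False,
    max_num_hands=1,
    min_detection_confidence=0.5,
)

cap = cv2.VideoCapture(0)  # any standard RGB camera
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB input; OpenCV captures BGR.
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        landmarks = results.multi_hand_landmarks[0].landmark
        print([(lm.x, lm.y, lm.z) for lm in landmarks])

cap.release()
hands.close()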

Project Status

⚠️ Work in Progress: Open Phantom is currently under active development. Core functionality is implemented, but the codebase is still being refined and tested. We welcome early adopters and contributors to help improve the project.

Known limitations:

  • ICP registration is still being optimized for better alignment (see the sketch below)
  • Depth estimation quality varies with lighting conditions
  • Performance optimizations needed for real-time processing
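
As a reference for the ICP step, aligning the hand mesh vertices to the depth-derived point cloud can be prototyped with Open3D. This is a hedged sketch of the general technique under that assumption, not the project's implementation; the function name and the 1 cm correspondence threshold are placeholders.

import numpy as np
import open3d as o3d

def align_hand_to_depth(hand_vertices: np.ndarray, depth_points: np.ndarray) -> np.ndarray:
    """Return the 4x4 rigid transform that aligns the hand mesh to the depth point cloud."""
    source = o3d.geometry.PointCloud()
    source.points = o3d.utility.Vector3dVector(hand_vertices)  # (N, 3) mesh vertices
    target = o3d.geometry.PointCloud()
    target.points = o3d.utility.Vector3dVector(depth_points)   # (M, 3) back-projected depth

    result = o3d.pipelines.registration.registration_icp(
        source, target,
        max_correspondence_distance=0.01,  # metres; tune per camera and scene
        init=np.eye(4),
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
    )
    return result.transformation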

Background

The original Phantom paper demonstrated that robots could learn tasks from human demonstrations without any robot-specific data collection. By capturing hand movements in diverse environments and converting them to robot action parameters, it's possible to train robot policies that perform effectively during zero-shot deployment.

Unlike the original implementation, which relies on MANO (a hand model that is not licensed for commercial use), Open Phantom is built entirely with open-source components that can be used in commercial applications.

How It Works

  1. Video Capture: Record video of your hand performing a task using a standard RGB camera
  2. Hand Tracking: Track hand landmarks in the video
  3. Depth Estimation: Estimate depth information from the monocular RGB input
  4. Segmentation: Segment the hand using SAM2 (Segment Anything Model 2)
  5. 3D Reconstruction: Create a 3D hand model from the landmarks and depth information
  6. Robot Parameters: Extract position, orientation, and gripper parameters for robot control (see the sketch below)
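
To make step 6 concrete, here is a minimal sketch of one way to derive an end-effector target from the reconstructed hand: position from the palm centre, orientation from a frame built on three palm landmarks, and gripper width from the thumb-index fingertip distance. The landmark indices follow MediaPipe's convention, but the overall mapping is an assumption for illustration, not the project's exact method.

import numpy as np

# MediaPipe hand landmark indices used in this sketch.
WRIST, THUMB_TIP, INDEX_MCP, INDEX_TIP, PINKY_MCP = 0, 4, 5, 8, 17

def hand_to_robot_action(landmarks_3d: np.ndarray):
    """landmarks_3d: (21, 3) hand landmarks in metres, expressed in the camera frame."""
    wrist = landmarks_3d[WRIST]
    index_mcp = landmarks_3d[INDEX_MCP]
    pinky_mcp = landmarks_3d[PINKY_MCP]

    # Position: centroid of the palm triangle.
    position = (wrist + index_mcp + pinky_mcp) / 3.0

    # Orientation: right-handed orthonormal frame spanned by the palm.
    x_axis = index_mcp - wrist
    x_axis /= np.linalg.norm(x_axis)
    palm_normal = np.cross(index_mcp - wrist, pinky_mcp - wrist)
    palm_normal /= np.linalg.norm(palm_normal)
    y_axis = np.cross(palm_normal, x_axis)
    rotation = np.stack([x_axis, y_axis, palm_normal], axis=1)  # 3x3 rotation matrix

    # Gripper width: distance between thumb tip and index fingertip.
    gripper_width = np.linalg.norm(landmarks_3d[THUMB_TIP] - landmarks_3d[INDEX_TIP])

    return position, rotation, gripper_width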

Installation

# Clone the repository
git clone https://github.com/yourusername/open-phantom.git
cd open-phantom

# Create and activate conda environment
conda env create -f environment.yml
conda activate open-phantom

# Initialize and update submodules
git submodule update --init --recursive

# Install dependencies for SAM2
cd external/sam2
pip install -e .
cd ../..

# Install dependencies for ML-Depth-Pro
cd external/ml-depth-pro
pip install -e .
cd ../..

Usage

# Run the main script to record and process a video
python open_phantom/main.py

Contributing

We welcome contributions from the community! This project is intended as a resource for researchers and developers interested in robot learning from human demonstrations. Whether you're improving the hand tracking, depth estimation, or adding new features, your contributions help advance the goal of more accessible robot learning.

Citation

If you use Open Phantom in your research, please cite the original Phantom paper:

@article{lepert2025phantom,
  title={Phantom: Training Robots Without Robots Using Only Human Videos},
  author={Lepert, Marion and Fang, Jiaying and Bohg, Jeannette},
  journal={arXiv preprint arXiv:2503.00779},
  year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • This project is based on the research presented in "Phantom: Training Robots Without Robots Using Only Human Videos"
  • We use Meta's SAM2 (Segment Anything Model 2) for hand segmentation
  • ML-Depth-Pro from Apple provides advanced depth estimation

Disclaimer

Open Phantom is a community implementation focused on the data collection aspects of the Phantom approach. The original paper authors are not affiliated with this specific implementation.
