OmniRobotHome: Human-robot collaboration has been studied primarily in dyadic or sequential settings. However, real homes require multiadic collaboration, where multiple humans and robots share a workspace, acting concurrently on interleaved subtasks with tight spatial and temporal coupling. This regime remains underexplored because close-proximity interaction between humans, robots, and objects creates persistent occlusion and rapid state changes, making reliable real-time 3D tracking the central bottleneck. No existing platform provides the real-time, occlusion-robust, room-scale perception needed to make this regime experimentally tractable.
We present OmniRobotHome, the first room-scale residential platform that unifies wide-area real-time 3D human and object perception with coordinated multi-robot actuation in a shared world frame. The system instruments a natural home environment with 48 hardware-synchronized RGB cameras for markerless, occlusion-robust tracking of multiple humans and objects, temporally aligned with two Franka arms that act on live scene state. Continuous capture within this consistent frame further supports long-horizon human behavior modeling from accumulated trajectories.
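Because all cameras and both arms are calibrated into one world frame, any camera-frame measurement can be mapped into shared coordinates with that camera's extrinsic transform. A minimal sketch of this convention, assuming per-camera 4x4 extrinsics `T_world_cam` from calibration (the name is ours, not the platform's):

```python
import numpy as np

def camera_to_world(p_cam: np.ndarray, T_world_cam: np.ndarray) -> np.ndarray:
    """Map a 3D point from one camera's frame into the shared world frame.

    p_cam:       (3,) point in the camera's coordinate frame (meters)
    T_world_cam: (4, 4) homogeneous extrinsic calibrated for this camera
    """
    p_h = np.append(p_cam, 1.0)      # homogeneous coordinates
    return (T_world_cam @ p_h)[:3]
```

The same transform applied to robot base frames is what lets perception outputs and arm commands be expressed in identical coordinates.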
The platform makes the multiadic collaboration regime experimentally tractable. We focus on two central problems: safety in shared human-robot environments and human-anticipatory robotic assistance. We show that real-time perception and accumulated behavior memory each yield measurable gains on both.
System Overview of OmniRobotHome. 48 hardware-synchronized cameras across 12 edge nodes provide real-time markerless 3D perception of humans, objects, and robots in a unified world frame.
48 synchronized cameras blanket the living space. Drag the slider to reduce the number of active cameras and see how spatial coverage degrades. Click any camera to see its field of view and thumbnail.
Markerless 3D skeleton estimation from the multi-camera system. Switch between single-person and multi-person capture scenes.
48 hardware-synchronized RGB cameras provide real-time markerless multi-human 3D pose tracking and 6D object pose estimation across a 23.1 m² living space. All cameras and robot arms share a single consistent world frame, enabling robot actions conditioned on live human and object state rather than replayed data.
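The platform's exact pose pipeline is not detailed here, but multi-view markerless systems of this kind typically recover each 3D keypoint by triangulating its 2D detections across cameras. A minimal sketch of the standard linear (DLT) triangulation step, assuming known 3x4 projection matrices per camera:

```python
import numpy as np

def triangulate_point(projections, points_2d):
    """Linear (DLT) triangulation of one keypoint from multiple views.

    projections: list of (3, 4) camera projection matrices P_i = K_i [R_i | t_i]
    points_2d:   list of (u, v) pixel observations, one per view
    Returns the 3D point in the shared world frame.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

With 48 overlapping views, each keypoint is typically seen by several cameras at once, which is what makes the estimate robust to any single camera being occluded.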
Two Franka arms, temporally aligned with the shared perception stream, act concurrently on task-coupled subtasks. The robots relay objects across spatially separated regions while humans move freely through the workspace.
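Temporal alignment between a perception stream and a control loop can be as simple as selecting, at each control tick, the perception sample nearest in time and rejecting stale data. A hedged sketch of that pattern, with the 50 ms skew budget assumed for illustration:

```python
import bisect

def latest_aligned_state(timestamps, states, t_control, max_skew=0.05):
    """Return the perception sample closest to a control-loop timestamp.

    timestamps: sorted list of perception timestamps (seconds)
    states:     scene states corresponding to each timestamp
    t_control:  time at which the robot controller needs scene state
    max_skew:   reject samples more than 50 ms out of alignment (assumed budget)
    """
    if not timestamps:
        return None
    i = bisect.bisect_left(timestamps, t_control)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    best = min(candidates, key=lambda j: abs(timestamps[j] - t_control))
    if abs(timestamps[best] - t_control) > max_skew:
        return None  # perception too stale or too far ahead; controller holds
    return states[best]
```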
Multi-view — synthesized from the 48-camera system covering the full workspace
Single-view — the same task captured from a single RGB camera
Accumulated behavioral trajectories from continuous markerless capture become priors that collaboration policies consume directly. A VLM infers human intent from real-time 3D pose and scene context, and the robot proactively delivers the needed object with handover targets continuously updated from live 3D hand keypoints.
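One simple way to keep a handover target tracking live hand keypoints is to re-center the target on the observed hand each frame and smooth it to suppress jitter. A sketch under those assumptions; the exponential filter and the 10 cm offset are illustrative choices, not the platform's actual update rule:

```python
import numpy as np

class HandoverTarget:
    """Continuously re-estimate a handover pose from live 3D hand keypoints."""

    def __init__(self, alpha=0.3, offset=np.array([0.0, 0.0, 0.10])):
        self.alpha = alpha    # smoothing factor (assumed)
        self.offset = offset  # place the object ~10 cm above the palm (assumed)
        self.target = None

    def update(self, hand_keypoints_3d: np.ndarray) -> np.ndarray:
        """hand_keypoints_3d: (K, 3) world-frame hand keypoints from perception."""
        raw = hand_keypoints_3d.mean(axis=0) + self.offset
        self.target = raw if self.target is None else (
            self.alpha * raw + (1 - self.alpha) * self.target)
        return self.target
```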
The arms dynamically yield to approaching humans by conditioning on real-time 3D pose, and resume once the workspace is clear. Behavior memory enables preemptive avoidance by anticipating human motion before it occurs.
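Yield-and-resume behavior of this kind is commonly implemented as a distance gate with hysteresis between the tracked human joints and points sampled along the robot. A minimal sketch with illustrative thresholds, not the platform's tuned values:

```python
import numpy as np

def should_yield(human_keypoints, robot_points, stop_dist=0.5, resume_dist=0.8,
                 currently_yielding=False):
    """Hysteresis gate: pause the arm when a human comes close, resume when clear.

    human_keypoints: (N, 3) world-frame 3D joints of all tracked humans
    robot_points:    (M, 3) points sampled along the robot's links
    Thresholds (0.5 m / 0.8 m) are assumed for illustration.
    """
    if len(human_keypoints) == 0:
        return False  # workspace clear
    d = np.linalg.norm(
        human_keypoints[:, None, :] - robot_points[None, :, :], axis=-1).min()
    if currently_yielding:
        return d < resume_dist  # stay yielded until comfortably clear
    return d < stop_dist
```

The two thresholds prevent oscillation at the boundary: the arm stops inside the inner radius but does not resume until the human has left the larger outer radius. Anticipated trajectories from behavior memory can be fed in as additional keypoints to trigger the gate preemptively.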
Coming soon.