OmniRobotHome
One RGB camera ↔ the real-time, markerless 3D perception it feeds (full-body human pose, live).
Robots in homes must continuously sense the people around them, yet most prior work relies on limited or offline perception. We argue that perception quality is the dominant factor governing what interaction is achievable at home, and build a testbed to test this claim.
OmniRobotHome instruments a furnished home with 48 hardware-synchronized cameras and three manipulators in a unified world frame, delivering real-time markerless full-body human pose, 6D object pose, anticipatory motion forecasting, and a social avatar agent that converses with residents.
Using the platform, we treat perception quality as an experimental variable across safety, human assistance, and social interaction, and find that interaction quality degrades measurably as real-timeness, granularity, coverage, accuracy, forecasting, or memory is weakened. All code and data will be released.
48 hardware-synchronized cameras provide real-time markerless 3D perception of humans, objects, and robots in a unified world frame.
— or, how it all connects —
48 synchronized cameras blanket the living space. Spatial coverage degrades as the number of active cameras is reduced; each camera contributes its own field of view.
Markerless 3D skeleton estimation from the multi-camera system, shown for both single-person and multi-person capture scenes.
Manipulation targets need full 6-DoF pose, not just location. Four calibrated stereo pairs over the workspaces — hardware-synchronized to the rest of the rig — drive a TensorRT pipeline: Fast-FoundationStereo for dense metric depth, YOLOE segmentation, and FoundationPose for marker-free 6D tracking from a template mesh. End-to-end at ~16 Hz, the live pose transforms each stored object-relative grasp into the current world frame for closed-loop manipulation.
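The last step above, re-expressing a stored object-relative grasp in the world frame, amounts to composing two rigid transforms. The following is a minimal sketch, assuming poses are stored as 4x4 homogeneous matrices; the names `world_T_object`, `object_T_grasp`, and `grasp_in_world`, and the example values, are hypothetical and not taken from the platform's code.

```python
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def compose(a: Matrix, b: Matrix) -> Matrix:
    """Return the 4x4 matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def grasp_in_world(world_T_object: Matrix, object_T_grasp: Matrix) -> Matrix:
    """Map a grasp stored in the object's frame into the world frame.

    world_T_object is the live 6D object pose; object_T_grasp is the
    stored grasp relative to the object (both hypothetical inputs).
    """
    return compose(world_T_object, object_T_grasp)


# Hypothetical live pose: object at (1, 2, 0), rotated 90 degrees about z.
world_T_object = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Hypothetical stored grasp: 0.1 m along the object's own x axis.
object_T_grasp = [
    [1.0, 0.0, 0.0, 0.1],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

world_T_grasp = grasp_in_world(world_T_object, object_T_grasp)
grasp_position = [row[3] for row in world_T_grasp[:3]]
```

Because the grasp is stored relative to the object, only `world_T_object` has to be refreshed each perception cycle; the stored grasp itself never changes when the object moves.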
48 hardware-synchronized RGB cameras provide real-time markerless multi-human 3D pose tracking and 6D object pose estimation across a 23.1 m² living space. All cameras and robot arms share a single consistent world frame, enabling robot actions conditioned on live human and object state rather than replayed data.
Manipulation runs on a library of object-relative grasps conditioned on live 6D object pose and planned with collision-aware trajectories — a catalog of household skills (turning off the stove, clearing objects, opening drawers, pouring…) shared, arm-agnostically, across all three manipulators.
The arms dynamically yield to approaching humans by conditioning on real-time 3D pose, and resume once the workspace is clear. Long-term motion forecasting enables preemptive avoidance: from the most recent 3 s of tracking history, the forecaster samples diverse 5 s futures, and the safety policy triggers on predicted intersection before contact occurs.
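One way to picture the trigger on predicted intersection is a check over the sampled futures against a keep-out region around an arm. The sketch below is an illustration under stated assumptions, not the platform's safety policy: it assumes each sampled future is a list of frames, each frame a list of 3D joint positions in the world frame, and it models the arm's workspace as a sphere. The function names, the spherical zone, and the `min_fraction` threshold are all hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Frame = Sequence[Point]        # joint positions at one time step
Future = Sequence[Frame]       # one sampled future trajectory


def first_hit_frame(future: Future, center: Point, radius: float) -> Optional[int]:
    """Index of the first frame in which any joint enters the sphere, else None."""
    for t, frame in enumerate(future):
        if any(math.dist(joint, center) <= radius for joint in frame):
            return t
    return None


def should_yield(futures: List[Future], center: Point, radius: float,
                 min_fraction: float = 0.2) -> bool:
    """Trigger when at least min_fraction of sampled futures enter the zone.

    min_fraction is an assumed tuning knob: lower values yield earlier
    (more conservative), higher values tolerate more uncertainty.
    """
    if not futures:
        return False
    hits = sum(first_hit_frame(f, center, radius) is not None for f in futures)
    return hits / len(futures) >= min_fraction


# Two hypothetical single-joint futures: one approaches the arm, one leaves.
zone_center: Point = (0.0, 0.0, 1.0)
toward: Future = [[(3.0, 0.0, 1.0)], [(2.0, 0.0, 1.0)],
                  [(1.0, 0.0, 1.0)], [(0.5, 0.0, 1.0)]]
away: Future = [[(3.0, 0.0, 1.0)], [(4.0, 0.0, 1.0)]]

yield_now = should_yield([toward, away], zone_center, radius=1.0, min_fraction=0.5)
```

Checking a set of sampled futures rather than a single prediction lets the policy react to plausible approaches that a single most-likely forecast would miss.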
The VLM-based social avatar agent reasons about the resident's activity and apparent intent and converses with them. When help is needed, the same model emits a high-level command that invokes the corresponding manipulation skill. Here the resident says the soup is burning, and the agent selects and dispatches the turn-off-stove skill — assistance requested through conversation, not a separate interface.
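The hand-off from conversation to manipulation can be pictured as a lookup from the model's high-level command to a registered skill. This is a minimal sketch, assuming the command arrives as a plain string; the registry, the `skill` decorator, the command names, and the returned status strings are hypothetical stand-ins for the real skill library.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry mapping command names to skill callables.
SKILLS: Dict[str, Callable[[], str]] = {}


def skill(name: str) -> Callable[[Callable[[], str]], Callable[[], str]]:
    """Decorator that registers a function under a command name."""
    def register(fn: Callable[[], str]) -> Callable[[], str]:
        SKILLS[name] = fn
        return fn
    return register


@skill("turn_off_stove")
def turn_off_stove() -> str:
    # Placeholder for the real manipulation routine.
    return "executing: turn_off_stove"


@skill("open_drawer")
def open_drawer() -> str:
    # Placeholder for the real manipulation routine.
    return "executing: open_drawer"


def dispatch(command: Optional[str]) -> Optional[str]:
    """Run the skill named by the model's command, if one is registered.

    Unknown or missing commands return None instead of being guessed at,
    so the agent can keep talking without moving an arm.
    """
    if command is None:
        return None
    fn = SKILLS.get(command.strip().lower())
    return fn() if fn is not None else None


# Hypothetical command emitted after the resident says the soup is burning.
result = dispatch("turn_off_stove")
```

Keeping the command vocabulary closed means the language model can only select among skills that already exist, rather than describing arbitrary motions.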
The system engages whoever engages it: the avatar's head orientation follows whichever person it is attending to, driven directly by the live 3D human pose stream. Mutual gaze makes the system's attention legible — turning passive sensing into social presence.
Third-person view, with the resident's egocentric (headset) view inset — both of the same mutual-gaze moment.
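Driving the avatar's head from the pose stream reduces to a look-at computation between the avatar's position and the attended person's head. The sketch below assumes a world frame with z pointing up and an avatar that faces +x at zero yaw; the function name and the example coordinates are hypothetical, and how the attended person is chosen is left outside the sketch.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def look_at_angles(avatar_head: Point, target_head: Point) -> Tuple[float, float]:
    """Return (yaw, pitch) in degrees that point the avatar at the target.

    Assumes z is up and yaw 0 faces +x; positive pitch looks upward.
    """
    dx = target_head[0] - avatar_head[0]
    dy = target_head[1] - avatar_head[1]
    dz = target_head[2] - avatar_head[2]
    yaw = math.degrees(math.atan2(dy, dx))
    pitch = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return yaw, pitch


# Hypothetical positions in metres: resident's head at the avatar's eye level.
yaw_deg, pitch_deg = look_at_angles((0.0, 0.0, 1.5), (1.0, 1.0, 1.5))
```

Recomputing these angles on every pose update keeps the avatar's gaze on the person as they move through the room.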