OmniRobotHome:
A Multi-Camera Home Platform for Real-Time Human-Robot Interaction

¹Seoul National University   ²RLWRLD
*Indicates Equal Contribution
Video comparison (10× speed): raw camera ↔ 3D perception

One RGB camera ↔ the real-time, markerless 3D perception it feeds: live full-body human pose. The full rig uses 48 such cameras.

48 synchronized cameras
3 manipulators (2 Franka + xArm)
30 Hz real-time 3D pose
23.1 m² living space

TL;DR: A room-scale platform with 48 synchronized cameras and three manipulators delivering the real-time 3D human perception that home robots need for safety, assistance, and social interaction.

Abstract

Robots in homes must continuously sense the people around them, yet most prior work relies on limited or offline perception. We argue that perception quality is the dominant factor governing what interaction is achievable at home, and build a testbed to test this claim.

OmniRobotHome instruments a furnished home with 48 hardware-synchronized cameras and three manipulators in a unified world frame, delivering real-time markerless full-body human pose, 6D object pose, anticipatory motion forecasting, and a social avatar agent that converses with residents.

Using the platform, we treat perception quality as an experimental variable across safety, human assistance, and social interaction, and find that interaction quality degrades measurably as real-timeness, granularity, coverage, accuracy, forecasting, or memory is weakened. All code and data will be released.

System Overview

48 hardware-synchronized cameras provide real-time markerless 3D perception of humans, objects, and robots in a unified world frame.

— or, how it all connects —

Sense: 48 RGB cameras, hardware-synchronized
Perceive: real-time 3D human pose · 6D object pose
Reason: VLM for intent & commands
Act: robot control · social (manipulation & dialogue)
Gated in real time by Human Forecasting & Safety Check.
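
The flow above can be read as one control step: perception produces a state, the reasoning stage proposes a command, and the safety gate decides whether the command may run. The following is only a minimal sketch of that reading. `Percept`, `reason`, and `step` are hypothetical names, and the keyword match stands in for the VLM; none of this is taken from the OmniRobotHome code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Percept:
    """Hypothetical per-frame summary produced by the Perceive stage."""
    human_in_predicted_path: bool  # stand-in for the forecasting + safety check
    utterance: Optional[str]       # stand-in for what the VLM reasons over


def reason(percept: Percept) -> Optional[str]:
    """Stand-in for the VLM: map an utterance to a high-level command."""
    if percept.utterance and "stove" in percept.utterance.lower():
        return "turn_off_stove"
    return None


def step(percept: Percept, execute: Callable[[str], None]) -> str:
    """Run one Sense -> Perceive -> Reason -> Act step behind the safety gate."""
    command = reason(percept)
    if command is None:
        return "idle"
    if percept.human_in_predicted_path:
        return "hold"  # gate closed: do not act this step
    execute(command)
    return "executed"
```

In this sketch the gate is evaluated on every step, so a command that is held simply gets another chance on the next frame once the gate opens.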

Room-Scale Camera Coverage

48 synchronized cameras blanket the living space. Spatial coverage degrades as the number of active cameras is reduced.
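
One simple way to quantify point coverage is to count, for each sampled 3D point, how many cameras have it inside their field of view. The sketch below assumes each camera is a position, a viewing direction, and a circular field of view, and it ignores occlusion and range limits. The function name and the two-view threshold are illustrative assumptions, not the metric used on this page.

```python
import numpy as np


def coverage_fraction(points, cam_pos, cam_dir, half_fov_deg, min_views=2):
    """Fraction of points inside the viewing cone of at least `min_views` cameras.

    points:  (P, 3) sample points in the world frame
    cam_pos: (C, 3) camera centers in the world frame
    cam_dir: (C, 3) camera viewing directions (need not be unit length)
    """
    cam_dir = cam_dir / np.linalg.norm(cam_dir, axis=1, keepdims=True)
    rays = points[None, :, :] - cam_pos[:, None, :]           # (C, P, 3)
    dist = np.linalg.norm(rays, axis=2)                        # (C, P)
    # Cosine of the angle between each camera axis and the ray to each point.
    cos_angle = np.einsum("cpk,ck->cp", rays, cam_dir) / np.maximum(dist, 1e-9)
    visible = cos_angle >= np.cos(np.radians(half_fov_deg))    # (C, P)
    return float(np.mean(visible.sum(axis=0) >= min_views))
```

Requiring at least two views reflects that a point generally needs to be seen from two or more cameras before it can be triangulated.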

3D Pose Tracking

Markerless 3D skeleton estimation from the multi-camera system, shown for single-person and multi-person capture scenes.
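
The page does not describe how 2D detections are lifted to 3D. A common building block in calibrated multi-camera rigs is linear (DLT) triangulation of each joint from its pixel locations in several views. The sketch below shows that generic technique under the assumption that each camera has a known 3×4 projection matrix; it is not claimed to be the method used here.

```python
import numpy as np


def triangulate(proj_mats, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more views.

    proj_mats: iterable of (3, 4) camera projection matrices
    pixels:    iterable of (u, v) observations, one per camera
    """
    rows = []
    for P, (u, v) in zip(proj_mats, pixels):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]
```

With noisy detections the same system is solved in a least-squares sense, and more views generally make the estimate more stable.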

6D Object Pose Tracking

Manipulation targets need full 6-DoF pose, not just location. Four calibrated stereo pairs over the workspaces — hardware-synchronized to the rest of the rig — drive a TensorRT pipeline: Fast-FoundationStereo for dense metric depth, YOLOE segmentation, and FoundationPose for marker-free 6D tracking from a template mesh. End-to-end at ~16 Hz, the live pose transforms each stored object-relative grasp into the current world frame for closed-loop manipulation.

Video comparison (10× speed): stereo depth ↔ 6D pose
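
The last step described above, mapping a stored object-relative grasp into the current world frame, is a composition of rigid transforms. A minimal sketch, assuming poses are 4×4 homogeneous matrices; the names `make_pose`, `T_world_obj`, and `T_obj_grasp` are illustrative and not taken from the project code.

```python
import numpy as np


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def grasp_in_world(T_world_obj, T_obj_grasp):
    """Compose the live object pose with a stored object-relative grasp."""
    return T_world_obj @ T_obj_grasp
```

Because the grasp is stored relative to the object, re-evaluating this product whenever a new object pose arrives keeps the grasp target attached to the object as it moves.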

Real-Time 3D Perception

48 hardware-synchronized RGB cameras provide real-time markerless multi-human 3D pose tracking and 6D object pose estimation across a 23.1 m² living space. All cameras and robot arms share a single consistent world frame, enabling robot actions conditioned on live human and object state rather than replayed data.

Robot Manipulation

Manipulation runs on a library of object-relative grasps conditioned on live 6D object pose and planned with collision-aware trajectories — a catalog of household skills (turning off the stove, clearing objects, opening drawers, pouring…) shared, arm-agnostically, across all three manipulators.

Videos (10× speed): View 1, View 2

Safety-Aware Coexistence

The arms dynamically yield to approaching humans by conditioning on real-time 3D pose, and resume once the workspace is clear. Long-term motion forecasting enables preemptive avoidance: from the recent 3s of tracking history, the forecaster samples diverse 5s futures, and the safety policy triggers on predicted intersection before contact occurs.

Coexistence demo — the arm yields to the approaching human and resumes once clear
Forecasting map — sampled 5s human future motions, driving preemptive avoidance
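
A preemptive trigger of this kind can be sketched as a test over the sampled futures. The example below assumes the arm's workspace is approximated by a single sphere and that the policy yields when a minimum fraction of sampled futures enters it. The sphere model, the `min_fraction` threshold, and the function name are assumptions made for illustration; the page does not specify the actual intersection test.

```python
import numpy as np


def should_yield(futures, center, radius, min_fraction=0.1):
    """Decide whether the arm should yield, given sampled human futures.

    futures: (N, T, J, 3) array of N sampled futures, T time steps, J joints
    center:  (3,) center of the sphere approximating the arm's workspace
    radius:  sphere radius in the same units as the joint positions
    """
    dist = np.linalg.norm(futures - np.asarray(center), axis=-1)  # (N, T, J)
    # A sample "hits" if any joint enters the sphere at any future time step.
    hits = (dist <= radius).any(axis=(1, 2))                      # (N,)
    return bool(hits.mean() >= min_fraction)
```

Sampling several futures rather than one lets the trigger respond to a plausible approach even when it is not the single most likely motion.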

VLM-Guided Assistance

The VLM-based social avatar agent reasons about the resident's activity and apparent intent and converses with them. When help is needed, the same model emits a high-level command that invokes the corresponding manipulation skill. Here the resident says the soup is boiling, and the agent selects and dispatches the turn-off-stove skill. Assistance is requested through conversation, not a separate interface.

Video (20× speed)
Resident: “Hey, the soup is boiling!”
Social Avatar Agent (VLM): high-level command → skill: turn off the stove
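
Routing a high-level command to a manipulation skill can be sketched as a lookup in a skill registry. The example assumes the model emits a small JSON object such as `{"skill": "turn_off_stove"}`. That format, the registry, and the placeholder skill body are hypothetical and only illustrate the dispatch step.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical registry mapping skill names to callables.
SKILLS: Dict[str, Callable[[], str]] = {}


def skill(name: str):
    """Decorator that registers a function under a skill name."""
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register


@skill("turn_off_stove")
def turn_off_stove() -> str:
    return "stove off"  # placeholder for the real manipulation routine


def dispatch(vlm_output: str) -> Optional[str]:
    """Parse the model's command and run the matching skill, if any."""
    try:
        command = json.loads(vlm_output)
    except json.JSONDecodeError:
        return None  # not a well-formed command
    name = command.get("skill") if isinstance(command, dict) else None
    handler = SKILLS.get(name) if isinstance(name, str) else None
    return handler() if handler else None
```

Checking the emitted name against a fixed registry means malformed or unknown commands are ignored rather than executed.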

Mutual-Gaze Engagement

The system engages whoever engages it: the avatar's head orientation follows whichever person it is attending to, driven directly by the live 3D human pose stream. Mutual gaze makes the system's attention legible — turning passive sensing into social presence.

Third-person view, with the resident's egocentric (headset) view inset — both of the same mutual-gaze moment.
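
Driving head orientation from a 3D pose stream reduces, at its simplest, to a look-at computation between two head positions. The sketch below assumes a world frame with z pointing up and yaw measured from the +x axis. The conventions and the function name are assumptions, and selecting which person to attend to is left out.

```python
import math


def look_at_angles(avatar_head, target_head):
    """Yaw and pitch (radians) that point the avatar's head at the target.

    Both arguments are (x, y, z) positions in the shared world frame, z up.
    """
    dx, dy, dz = (t - a for a, t in zip(avatar_head, target_head))
    yaw = math.atan2(dy, dx)                    # rotation about the vertical axis
    pitch = math.atan2(dz, math.hypot(dx, dy))  # elevation above the horizontal
    return yaw, pitch
```

Recomputing these angles for each new pose frame makes the head follow the attended person as they move.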

BibTeX

Coming Soon