OmniRobotHome: A Multi-Camera Home Platform for Real-Time Human-Robot Interaction


Junyoung Lee¹*, Inhee Lee¹*, Sookwan Han¹*, Jeonghwan Kim¹*, Kyungwon Cho¹, Mingi Choi¹, Lee Chae-Yeon¹, Wonjung Woo¹, Gunhee Kim¹, Jisoo Kim¹, Jeonghyeon Na¹, Hanbyul Joo¹,²

¹Seoul National University ²RLWRLD
*Indicates equal contribution



One RGB camera and the real-time, markerless 3D perception it feeds: live full-body human pose. The full platform uses 48 such cameras.

48 synchronized cameras

3 manipulators (2 Franka + xArm)

30 Hz real-time 3D pose

23.1 m² living space

TL;DR: A room-scale platform with 48 synchronized cameras and three manipulators delivering the real-time 3D human perception that home robots need for safety, assistance, and social interaction.

Abstract

Robots in homes must continuously sense the people around them, yet most prior work relies on limited or offline perception. We argue that perception quality is the dominant factor governing what interaction is achievable at home, and build a testbed to test this claim.

OmniRobotHome instruments a furnished home with 48 hardware-synchronized cameras and three manipulators in a unified world frame, delivering real-time markerless full-body human pose, 6D object pose, anticipatory motion forecasting, and a social avatar agent that converses with residents.

Using the platform, we treat perception quality as an experimental variable across safety, human assistance, and social interaction, and find that interaction quality degrades measurably as real-timeness, granularity, coverage, accuracy, forecasting, or memory is weakened. All code and data will be released.

System Overview

48 hardware-synchronized cameras provide real-time markerless 3D perception of humans, objects, and robots in a unified world frame.

How it all connects:

Sense: 48 RGB cameras, hardware-synchronized

Perceive: real-time 3D human pose and 6D object pose

Reason: VLM for intent and commands

Act: robot control and social interaction (manipulation and dialogue)

Gated in real time by Human Forecasting & Safety Check.

Room-Scale Camera Coverage

48 synchronized cameras blanket the living space. Reducing the number of active cameras degrades spatial coverage, measured as how many cameras see each point in the room.
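As a rough illustration of per-point coverage, the sketch below counts how many cameras see a given 3D point. It is not the platform's method: the `Camera` fields, the conical field-of-view model, the range limit, and the example poses are all assumptions made for illustration, and occlusion by furniture or people is ignored.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Camera:
    # All fields are illustrative assumptions, not the platform's calibration format.
    position: Vec3        # camera center in the world frame (meters)
    forward: Vec3         # unit viewing direction in the world frame
    half_fov_deg: float   # half-angle of a simplified conical field of view
    max_range: float      # distance beyond which the point is treated as unseen

def sees(cam: Camera, point: Vec3) -> bool:
    """True if the point lies inside the camera's view cone and range."""
    offset = tuple(p - c for p, c in zip(point, cam.position))
    dist = math.sqrt(sum(o * o for o in offset))
    if dist == 0.0 or dist > cam.max_range:
        return False
    cos_angle = sum(o * f for o, f in zip(offset, cam.forward)) / dist
    return cos_angle >= math.cos(math.radians(cam.half_fov_deg))

def coverage_count(cams: Sequence[Camera], point: Vec3) -> int:
    """Number of cameras that see the point (occlusion ignored)."""
    return sum(sees(c, point) for c in cams)

# Two hypothetical cameras facing each other across a 4 m span.
cams = [
    Camera((0.0, 0.0, 2.0), (1.0, 0.0, 0.0), 45.0, 5.0),
    Camera((4.0, 0.0, 2.0), (-1.0, 0.0, 0.0), 45.0, 5.0),
]
center = (2.0, 0.0, 2.0)
full = coverage_count(cams, center)         # both cameras active
reduced = coverage_count(cams[:1], center)  # one camera switched off
```

Evaluating `coverage_count` over a grid of points for different camera subsets would give a coverage map of the kind described above, under the same simplifying assumptions.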


3D Pose Tracking

Markerless 3D skeleton estimation from the multi-camera system, shown for both single-person and multi-person capture scenes.


6D Object Pose Tracking

Manipulation targets need full 6-DoF pose, not just location. Four calibrated stereo pairs over the workspaces, hardware-synchronized to the rest of the rig, drive a TensorRT pipeline: Fast-FoundationStereo for dense metric depth, YOLOE segmentation, and FoundationPose for marker-free 6D tracking from a template mesh. The pipeline runs end-to-end at ~16 Hz, and the live pose transforms each stored object-relative grasp into the current world frame for closed-loop manipulation.
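The last step, mapping a stored object-relative grasp into the world frame, amounts to composing two rigid transforms: T_world_grasp = T_world_obj · T_obj_grasp. The sketch below shows that composition with plain 4x4 homogeneous matrices. The function names and the example poses are hypothetical and chosen only for illustration; the platform's actual code is not shown in the source.

```python
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major

def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def grasp_in_world(T_world_obj: Matrix, T_obj_grasp: Matrix) -> Matrix:
    """Compose the live object pose with a stored object-relative grasp."""
    return matmul4(T_world_obj, T_obj_grasp)

# Hypothetical live pose: object at (1.0, 2.0, 0.5), rotated 90 degrees about z.
T_world_obj = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0,  0.0, 0.0, 2.0],
    [0.0,  0.0, 1.0, 0.5],
    [0.0,  0.0, 0.0, 1.0],
]
# Hypothetical stored grasp: 10 cm along the object's own x axis.
T_obj_grasp = [
    [1.0, 0.0, 0.0, 0.1],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

T_world_grasp = grasp_in_world(T_world_obj, T_obj_grasp)
grasp_position = [row[3] for row in T_world_grasp[:3]]  # world-frame x, y, z
```

Because the grasp is stored relative to the object, re-running this composition on every new pose estimate keeps the grasp target attached to the object as it moves, which is what makes the manipulation loop closed.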


Real-Time 3D Perception

48 hardware-synchronized RGB cameras provide real-time markerless multi-human 3D pose tracking and 6D object pose estimation across a 23.1 m² living space. All cameras and robot arms share a single consistent world frame, enabling robot actions conditioned on live human and object state rather than replayed data.

Robot Manipulation

Manipulation runs on a library of object-relative grasps conditioned on live 6D object pose and planned with collision-aware trajectories. The resulting catalog of household skills (turning off the stove, clearing objects, opening drawers, pouring…) is arm-agnostic and shared across all three manipulators.


Safety-Aware Coexistence

The arms dynamically yield to approaching humans by conditioning on real-time 3D pose, and resume once the workspace is clear. Long-term motion forecasting enables preemptive avoidance: from the recent 3s of tracking history, the forecaster samples diverse 5s futures, and the safety policy triggers on predicted intersection before contact occurs.

Coexistence demo — the arm yields to the approaching human and resumes once clear

Forecasting map — sampled 5s human future motions, driving preemptive avoidance
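A minimal sketch of a trigger on predicted intersection, under stated assumptions: each sampled future is reduced to a sequence of 3D body positions at a fixed time step, and the arm's workspace is approximated by a sphere. The function names, the 0.6 m radius, and the 0.5 s step are illustrative choices, not values from the platform.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
Trajectory = Sequence[Point]  # one sampled future, ordered in time

def first_intrusion_time(futures: Sequence[Trajectory], arm_center: Point,
                         safety_radius: float, dt: float) -> float:
    """Earliest predicted time (s) at which any sampled future enters the
    safety sphere around the arm, or math.inf if none does."""
    earliest = math.inf
    for traj in futures:
        for step, p in enumerate(traj):
            if math.dist(p, arm_center) < safety_radius:
                earliest = min(earliest, (step + 1) * dt)
                break  # later steps of this future cannot be earlier
    return earliest

def should_yield(futures: Sequence[Trajectory], arm_center: Point,
                 safety_radius: float = 0.6, dt: float = 0.5) -> bool:
    """Trigger avoidance if any sampled future is predicted to intersect."""
    return first_intrusion_time(futures, arm_center, safety_radius, dt) < math.inf

# Two hypothetical 5 s futures (10 steps of 0.5 s), starting 3 m from the arm.
toward = [(3.0 - 0.5 * (k + 1), 0.0, 0.0) for k in range(10)]
away = [(3.0 + 0.5 * (k + 1), 0.0, 0.0) for k in range(10)]
arm = (0.0, 0.0, 0.0)

t_hit = first_intrusion_time([toward, away], arm, 0.6, 0.5)
yield_now = should_yield([toward, away], arm)
```

Triggering on any intersecting sample is a conservative choice. A policy could instead require a minimum fraction of samples to intersect, trading fewer false stops for less margin.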

VLM-Guided Assistance

The VLM-based social avatar agent reasons about the resident's activity and apparent intent and converses with them. When help is needed, the same model emits a high-level command that invokes the corresponding manipulation skill. Here the resident says the soup is boiling, and the agent selects and dispatches the turn-off-stove skill: assistance is requested through conversation, not through a separate interface.


Resident

“Hey, the soup is boiling!”

Social Avatar Agent · VLM

high-level command → skill: turn off the stove
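One simple way to connect a model's high-level command to a skill library is a name-keyed registry, sketched below. The `skill: <name>` string format, the registry, and the logging stub standing in for a robot skill are all assumptions for illustration; the source does not specify the command interface.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping normalized skill names to callables.
SKILLS: Dict[str, Callable[[List[str]], None]] = {}

def register(name: str):
    """Decorator that adds a skill to the registry under the given name."""
    def wrap(fn):
        SKILLS[name] = fn
        return fn
    return wrap

@register("turn_off_the_stove")
def turn_off_the_stove(log: List[str]) -> None:
    # Stand-in for the real manipulation skill: only records the request.
    log.append("arm: turning off the stove")

def dispatch(command: str, log: List[str]) -> bool:
    """Run the skill named by a 'skill: <name>' command.
    Returns False for malformed commands or unknown skills."""
    prefix = "skill:"
    text = command.strip().lower()
    if not text.startswith(prefix):
        return False
    key = "_".join(text[len(prefix):].split())
    fn = SKILLS.get(key)
    if fn is None:
        return False
    fn(log)
    return True

log: List[str] = []
handled = dispatch("skill: turn off the stove", log)
```

Rejecting unknown names, rather than guessing the closest skill, keeps a misread command from moving the arm.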

Mutual-Gaze Engagement

The system engages whoever engages it: the avatar's head orientation follows whichever person it is attending to, driven directly by the live 3D human pose stream. Mutual gaze makes the system's attention legible — turning passive sensing into social presence.

Third-person view, with the resident's egocentric (headset) view inset — both of the same mutual-gaze moment.
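Driving head orientation from the pose stream can be sketched as a look-at computation from the avatar's head position to the attended person's head position. The conventions here are assumptions: a z-up world frame, yaw measured about z from the +x axis, and pitch positive upward. How the attended person is selected is not covered.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]

def look_at_angles(avatar_head: Vec3, target_head: Vec3) -> Tuple[float, float]:
    """Yaw and pitch (degrees) that point the avatar's head at the target.
    Assumes a z-up world frame with yaw = 0 along +x."""
    dx = target_head[0] - avatar_head[0]
    dy = target_head[1] - avatar_head[1]
    dz = target_head[2] - avatar_head[2]
    yaw = math.degrees(math.atan2(dy, dx))
    pitch = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return yaw, pitch

# Hypothetical positions in meters: person at the same height, diagonally ahead.
yaw_deg, pitch_deg = look_at_angles((0.0, 0.0, 1.5), (1.0, 1.0, 1.5))
```

Recomputing these angles on every pose update makes the head follow the person as they move. A real controller would likely also smooth and rate-limit the result.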

BibTeX

Coming Soon
