Humanoid self-model learned from vision and touch

A humanoid robot learns to recognize itself from vision and touch. The work shows that a robot can distinguish itself from others by pairing proprioceptive signals with visual input, without relying on identity labels or prebuilt kinematic models. That self-other distinction then feeds a predictive self-model that maps joint configurations to three-dimensional body occupancy, in effect describing the space the robot’s body occupies as it moves.
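To make the idea of proprioceptive-visual correspondence concrete, the following is a minimal sketch, not the authors' method. It assumes the robot has a time series of its own joint speed and, for each body visible in the camera, a time series of visual motion magnitude. It then picks the body whose motion correlates best with the robot's own joint activity. The function names, agent labels, and all numbers are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def identify_self(joint_speed, visual_motion_by_agent):
    """Return (agent_id, scores) for the agent whose visual motion best tracks joint speed."""
    scores = {
        agent: pearson(joint_speed, motion)
        for agent, motion in visual_motion_by_agent.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical data: the robot's own joint speed over seven time steps.
joint_speed = [0.0, 0.5, 1.0, 0.5, 0.0, 0.8, 0.2]

# Hypothetical visual motion magnitudes for two bodies seen by the camera.
observed = {
    "agent_a": [0.1, 0.6, 1.1, 0.4, 0.1, 0.9, 0.3],  # moves when the robot moves
    "agent_b": [0.9, 0.2, 0.1, 0.8, 0.7, 0.1, 0.6],  # moves independently
}

self_id, scores = identify_self(joint_speed, observed)
print(self_id, scores)
```

A plain correlation like this would be fragile under occlusion or when another agent mirrors the robot's motion; it is meant only to show why no identity label is needed when the robot's own sensations are available as a reference signal.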
In controlled, multi-agent scenes that include humans or morphologically identical robots, the system reliably identifies itself, builds a 3D self-model, and enables a set of practical capabilities. The authors report that the self-model supports target reaching, collision-aware motion planning, and human-to-robot motion retargeting, all without hand-tuned identities or explicit body models. Taken together, the results sketch a concrete path toward bodily self-representation in robots that must operate and coordinate in shared physical environments, rather than remain isolated in lab cages or purely scripted tasks.
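As an illustration of how a map from joint configuration to body occupancy can support collision-aware planning, here is a hedged sketch. The paper's self-model is learned; in this example a hand-written planar two-link arm stands in for it, purely so the code is self-contained. The link lengths, body radius, and function names are all assumptions made for illustration.

```python
import math

LINK_LENGTHS = (0.3, 0.25)  # metres, hypothetical
LINK_RADIUS = 0.05          # metres, hypothetical body thickness around each link

def link_segments(q):
    """Stand-in kinematics: 3D (start, end) segments of a planar two-link arm."""
    x = y = 0.0
    angle = 0.0
    segments = []
    for length, joint in zip(LINK_LENGTHS, q):
        angle += joint
        nx, ny = x + length * math.cos(angle), y + length * math.sin(angle)
        segments.append(((x, y, 0.0), (nx, ny, 0.0)))
        x, y = nx, ny
    return segments

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(u * v for u, v in zip(ap, ab)) / denom))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)

def occupied(q, point):
    """Occupancy query: does the body at joint configuration q cover this 3D point?"""
    return any(
        point_segment_distance(point, a, b) <= LINK_RADIUS
        for a, b in link_segments(q)
    )

def path_is_collision_free(path, obstacle_points):
    """Accept a joint-space path only if no waypoint's body covers an obstacle point."""
    return not any(occupied(q, p) for q in path for p in obstacle_points)

# Hypothetical obstacle sample and two candidate joint-space paths.
obstacles = [(0.4, 0.0, 0.0)]
safe_path = [(math.pi / 2, 0.0), (math.pi / 2, 0.3)]
unsafe_path = [(math.pi / 2, 0.0), (0.0, 0.0)]

print(path_is_collision_free(safe_path, obstacles))
print(path_is_collision_free(unsafe_path, obstacles))
```

The planner only ever calls `occupied(q, point)`, so a learned occupancy predictor with the same interface could in principle replace the hand-written kinematics without changing the planning code.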
For engineers and operators, the implications are meaningful. By tying self-recognition to proprioceptive-visual correspondence, the approach reduces the dependency on engineered kinematic models when robots need to adapt to new partners or slightly different morphologies. A robot can, in effect, learn what its own body looks like from its own sensations and its camera, rather than being fed a detailed blueprint of joints and limbs. In practice this could streamline onboarding of robots into mixed teams of humans and hardware, and it lays groundwork for safer, more fluid collaboration.
Yet the leap from lab demonstration to production reality comes with sharp constraints. The method hinges on the integrity of sensing: accurate joint readings and reliable visual streams are essential for a trustworthy self-model to emerge. Occlusions, lighting changes, or sensor drift could impair the self-identification process, creating mislabeling risks in crowded environments. Real-time performance is another consideration; while the study shows the concept, turning a self-modeling loop into a continuously tuned planner for busy workplaces will demand careful balance of computation, sensing latency, and power budgets. And while the results cover reaching and retargeting, broader deployment will need robust handling of diverse human motions, clutter, and variable robot morphologies in dynamic settings.
From a practitioner standpoint, several watchpoints emerge. First, the shift away from hand-built kinematic models is a clear incentive for faster integration with new robots and partners, but it trades some determinism for adaptivity; operators will want predictable performance as the self-models update in real time. Second, collision avoidance becomes more intrinsic when a robot knows its own body occupancy in 3D space, yet the approach must prove resilient to rapid human movements and complex scenes. Third, the success of human-to-robot motion retargeting depends on faithful mapping across partners, a practical hurdle as human and robot motions diverge in scale or timing. Finally, the path to production will hinge on extending the approach beyond controlled experiments to robust, scalable deployments across multiple robot platforms and morphologies.
If these hurdles are managed, the study positions bodily self-representation as a foundational capability for safe, cooperative humanoids in shared workspaces. The next milestones to watch include extending the self-model to more varied bodies, ensuring real-time stability under sensor noise, and integrating with higher-level planners that orchestrate long-horizon collaborative tasks.
- Proprioceptive-visual correspondence enables self-other distinction in humanoid robots / arXiv Humanoid/Bipedal Query / Primary source / Published JUN 11, 2026 / Accessed JUN 14, 2026