Open vocabulary robot perception passes real world tests

By Sophia Chen | JUN 26, 2026 | 2 min read

AnomNOVIC labels unseen objects without prompts and hits 82.6% in wild tests with 48 objects.

In a lab evaluation on the NICOL humanoid robot, the system operates as a two stage, known workspace framework. A masked autoencoder (MAE) trained for anomaly detection first proposes generic, object-agnostic bounding boxes; NOVIC, a real time open vocabulary image classifier, then assigns labels to the highlighted regions. The result is a perception stack that can handle open world scenes without a fixed candidate list, a long sought goal for service robots that must cope with unfamiliar items on real tables and shelves. The prompt free variant achieved 47.1% average precision (AP) and 57.5% AP50 for open vocabulary recognition on NICOL, and supplying class candidates raises performance to 59.0% AP and 72.5% AP50. The authors emphasize that the first stage eliminates the need to guess what might appear next, while the second stage handles naming, filtering, and disambiguation in real time.
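The two stage flow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: `propose_boxes` and `classify_region` are hypothetical stand-ins for the anomaly detector and the classifier, and the assumption that the classifier returns a score per label is made purely for the example. The optional `candidates` argument mirrors the difference between prompt free operation and operation with supplied class candidates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    box: Box
    label: str
    score: float


def detect(
    image,
    propose_boxes: Callable[[object], List[Box]],
    classify_region: Callable[[object, Box], Dict[str, float]],
    candidates: Optional[Set[str]] = None,
) -> List[Detection]:
    """Stage 1 proposes class-agnostic boxes; stage 2 names each region.

    Both callables are hypothetical placeholders, not real APIs.
    """
    detections = []
    for box in propose_boxes(image):
        scores = classify_region(image, box)
        if candidates is not None:
            # Candidate mode: keep only labels from the supplied vocabulary.
            scores = {lbl: s for lbl, s in scores.items() if lbl in candidates}
        if not scores:
            continue  # nothing in the allowed vocabulary fits this region
        best = max(scores, key=scores.get)
        detections.append(Detection(box, best, scores[best]))
    return detections


# Toy stand-ins so the sketch runs without any model or camera.
def fake_proposer(image) -> List[Box]:
    return [(0, 0, 10, 10), (20, 20, 40, 40)]


def fake_classifier(image, box: Box) -> Dict[str, float]:
    if box[0] == 0:
        return {"mug": 0.6, "cup": 0.3}
    return {"stapler": 0.7, "remote": 0.2}


prompt_free = detect(None, fake_proposer, fake_classifier)
with_candidates = detect(None, fake_proposer, fake_classifier, {"cup", "remote"})
print([d.label for d in prompt_free])
print([d.label for d in with_candidates])
```

In the toy run, the prompt free call takes the top label for each region, while the candidate call can only choose among the labels it was given.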

Beyond the NICOL tabletop test bed, the study reports strong results across additional datasets, including an "in the wild" set featuring 48 unique objects. There, AnomNOVIC reaches up to 82.6% on prompt free detection and classification, a notable leap over current open vocabulary baselines such as YOLO World v2, OWLv2, and YOLOE. The paper frames the achievement as a proof of feasibility for prompt free open vocabulary perception in practical robotics, moving the field closer to systems that can recognize what they have never been explicitly trained to identify.

For engineers and operators, the numbers point to several hard realities about bringing open vocabulary perception into everyday robot use. First, the two stage design makes a practical constraint explicit: perception quality depends on the quality of the proposed bounding boxes. The MAE driven anomaly detector must consistently outline meaningful regions; if the boxes miss relevant parts of a scene or merge distinct items, NOVIC will struggle to assign correct labels. In cluttered or occluded scenes, box proposal becomes the bottleneck, making robust box generation critical to overall accuracy. Second, there is a clear trade off between prompt free operation and labeling accuracy. The 82.6% figure in the wild tests demonstrates strong capability without predefined classes, but the higher scores when class candidates are supplied show that domain experts can tune performance by providing curated vocabularies. In fast moving environments, that trade off will shape deployment choices for applications like hospitality robots or warehouse assistants.
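A small worked example shows why merged boxes hurt. It assumes the common COCO-style convention, in which AP50 counts a detection as correct only when its box overlaps a ground truth box with an intersection over union (IoU) of at least 0.5; the article does not spell out the paper's exact protocol. The coordinates below are made up for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two hypothetical ground truth objects sitting side by side on a table.
mug = (0, 0, 10, 10)
bowl = (12, 0, 22, 10)

# One proposal that merges both objects, and one tight proposal on the mug.
merged = (0, 0, 22, 10)
tight = (1, 0, 10, 10)

IOU_THRESHOLD = 0.5  # assumed AP50 matching threshold

for name, proposal in [("merged", merged), ("tight", tight)]:
    overlap = iou(proposal, mug)
    verdict = "match" if overlap >= IOU_THRESHOLD else "miss"
    print(f"{name}: IoU with mug = {overlap:.3f} -> {verdict}")
```

The merged proposal covers both objects yet overlaps each at only about 0.45 IoU, so under this convention it matches neither, whatever label the classifier assigns. The tight box reaches 0.9.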

A third practitioner takeaway is about deployment practicality. The NICOL humanoid setting signals a controlled lab stage rather than a full production rollout. Real time open vocabulary perception remains feasible, but practitioners should expect to need scalable compute and careful monitoring of failure modes. The authors highlight potential next steps such as refining anomaly bounding boxes in highly dynamic scenes, expanding open vocabulary coverage without sacrificing speed, and stress testing on more varied hardware platforms to understand latency budgets in real world operation.

Overall, the report frames open vocabulary perception as a concrete engineering challenge with measurable gains. The AnomNOVIC approach demonstrates that a two stage, prompt free system can outperform established baselines while offering a practical path toward robots that can name what they see without a fixed dictionary.


The Robotics Briefing