Not able to get even half-decent object recognition

Pito Salas

Mar 14, 2026, 9:12:19 AM

to rssc...@googlegroups.com

Greetings,

I am finally trying to get my OAK-D Lite to do some object detection/recognition. “All” I want to do is detect a soda can, a coffee cup, or even an orange cone. So far the results are quite bad.

I am still in the early stages. I have found some pretty good documentation, but there is still plenty of black magic required. Yes, there are catalogs of different models, but using and configuring them is not obvious.

Has anyone had success or experience with this kind of thing? I mean, in “theory” it should be doable. But you know, “In theory there is no difference between theory and practice; in practice there is.”

Best,

Pito

Boston Robot Hackers &&

Comp. Sci Faculty, Brandeis University (Emeritus)

Chris Albertson

Mar 14, 2026, 12:27:21 PM

to Pito Salas, RSSC-list

My experience with YOLO is that you should not try to act on the results of one frame. A “person” will be detected in one frame and then not in the next. Some of this is just because the person is obstructed and can’t be seen, but just as likely, the object gets misidentified. So what you do is use a scheme like “an object must be detected 5 times in 8 frames, at the same location or along the same path of motion (after subtracting any motion of the robot), before it is ‘real’.” At 30 frames per second, this means it takes perhaps 1/4 second to be sure of what and where something is.
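A minimal sketch of the “5 times in 8 frames” rule, for a single stationary object. The class name, the `max_dist` gate, and the assumption that positions are already compensated for robot motion are illustrative choices, not something taken from the thread.

```python
from collections import deque
from math import hypot


class MOfNConfirmer:
    """Confirm an object only after m hits in the last n frames near one spot."""

    def __init__(self, m=5, n=8, max_dist=0.3):
        self.m = m
        self.max_dist = max_dist        # assumed gate, in world units (e.g. metres)
        self.window = deque(maxlen=n)   # one entry per frame: (x, y) or None

    def update(self, position):
        """Add one frame's result and return True if the object is confirmed.

        position is (x, y) with robot motion already subtracted,
        or None when the detector reported nothing in this frame.
        """
        self.window.append(position)
        hits = [p for p in self.window if p is not None]
        if len(hits) < self.m:
            return False
        # Use the latest hit as the reference and count hits that agree with it.
        ref = hits[-1]
        near = sum(
            1 for p in hits
            if hypot(p[0] - ref[0], p[1] - ref[1]) <= self.max_dist
        )
        return near >= self.m
```

With the defaults, the window spans 8 frames, which at 30 frames per second is about 0.27 seconds of video.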

Actually, you might use a more complex algorithm that maintains an object list and assigns “certainty” values to each object based on the number of times detected and YOLO’s score on each detection.
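One possible shape for such an object-list entry is sketched below. The noisy-OR style update and the `decay` factor are assumptions picked for illustration; the thread does not say which formula was used.

```python
from dataclasses import dataclass


@dataclass
class TrackedObject:
    label: str
    x: float
    y: float
    certainty: float = 0.0   # 0.0 = unknown, approaching 1.0 = very sure
    hits: int = 0            # number of times this object was detected

    def observe(self, score):
        """Fold in one detection with detector confidence `score` in [0, 1]."""
        self.hits += 1
        # Each detection removes a share of the remaining doubt.
        self.certainty += score * (1.0 - self.certainty)

    def miss(self, decay=0.8):
        """Call for a frame where the object should be visible but was not detected."""
        self.certainty *= decay
```

Two detections scored 0.5 raise the certainty to 0.5 and then 0.75; a following miss with the default decay lowers it to 0.6.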

It gets harder when objects move if you want to track them.

I did not invent this. I worked on several radar systems that tracked aircraft, and all of them are centered around the concept of a “track file.” That file collects statistics of each aircraft over time. We would talk about “promoting” “candidate tracks” to “tracks” and so on. Yes, people were doing this in the 1980s. YOLO data is not as noisy as radar returns, but still, the statistics of small numbers would kill you. You need to see it a statistically significant number of times, and the observations have to correlate based on what you know about the objects (refrigerators don’t move, but cats can move).
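The correlation test and the candidate-to-track promotion might look roughly like this. The speed table, the `noise` allowance, and the promotion threshold are made-up illustrative values, not figures from any radar system or from the thread.

```python
from math import hypot

# Assumed top speeds in metres per second; illustrative values only.
MAX_SPEED = {"refrigerator": 0.0, "soda can": 0.0, "cat": 5.0}


def correlates(label, last_pos, new_pos, dt, noise=0.2):
    """Return True if a new detection could plausibly belong to an existing track.

    dt is the time since the track was last updated, in seconds.
    noise is an assumed allowance (metres) for position measurement error.
    """
    allowed = MAX_SPEED.get(label, 1.0) * dt + noise
    moved = hypot(new_pos[0] - last_pos[0], new_pos[1] - last_pos[1])
    return moved <= allowed


def status(hits, promote_after=5):
    """A track stays a candidate until it has been seen often enough."""
    return "track" if hits >= promote_after else "candidate"
```

A “refrigerator” that appears a metre away from where it was a second ago fails the gate and would start a new candidate, while a “cat” that moved the same distance passes.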

Finally, people and animals do exactly this. We have to watch a scene before we start noticing all the objects in the scene. “Important” objects jump out first, then the others later.

Sorry, the algorithm of “acting on each frame” is so easy that we all want it to work. But it can’t. Many times, we detect things by how they act in context. Soda cans are on the floor; we’d be surprised to see one flying like a bird.

I wrote the above by hand in Python. A better solution, though perhaps 10,000 times more work, might be to train an AI to accept YOLO detections and populate a database of “known objects”. YOLO is decent, but it lacks knowledge of the object's behavior over time. You get really poor results if you ignore this.

This is why robots are hard.

Pito Salas

Mar 14, 2026, 3:15:27 PM

to Chris Albertson, RSSC-list, Boston Robot Hackers

Hi Chris

Those are good insights. As I think about it, my targets won’t move themselves, but indeed the identification made by the NN definitely jumps around frame to frame.

What I will look into is using ROS2 and TFs to compute a hypothetical 3D position, in a world coordinate system, for the identified soda can. ROS and localization can then update that position as the robot moves and adjust the confidence that the can was indeed found.
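The underlying geometry can be sketched without ROS. In ROS 2 the tf2 library would normally supply these transforms; the version below does the math by hand for the planar case, and the function name, the `cam_offset` mounting position, and the forward-facing camera are all assumptions made for illustration.

```python
from math import cos, sin


def camera_to_world(point_cam, robot_pose, cam_offset=(0.1, 0.0)):
    """Map a point seen by the camera into the world frame (planar case).

    point_cam:  (forward, left) in metres relative to the camera.
    robot_pose: (x, y, yaw) of the robot in the world frame, from localization.
    cam_offset: assumed camera mounting position on the robot (forward, left).
    The camera is assumed to face straight ahead.
    """
    # Camera frame -> robot base frame: only a translation in this sketch.
    bx = point_cam[0] + cam_offset[0]
    by = point_cam[1] + cam_offset[1]
    # Robot base frame -> world frame: rotate by yaw, then translate.
    x, y, yaw = robot_pose
    wx = x + bx * cos(yaw) - by * sin(yaw)
    wy = y + bx * sin(yaw) + by * cos(yaw)
    return wx, wy
```

If the can really is stationary, detections taken from different robot poses should map to roughly the same world point, and that agreement is what can feed the confidence value.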

I could make the job easier (but less interesting) by identifying a bright orange can or a fiducial but I want to try it the hard way a little longer :)

Best,

Pito

Boston Robot Hackers &&
Comp. Sci Faculty, Brandeis University (Emeritus)

Chris Albertson

Mar 14, 2026, 11:22:13 PM

to Pito Salas, RSSC-list, Boston Robot Hackers

Yes, exactly. The TF system can help, but your robot moves, so localization is also needed to “subtract out robot motion”.

YOLO gives you a bounding box in pixel coordinates, but only in two dimensions. You have to transform pixels to real-world x, y, z. For that, you need depth. There are many depth techniques, but for an object of known size such as a soda can, the distance from the camera is inversely proportional to the apparent size of the can.
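Under a simple pinhole camera model, both steps are a few lines of arithmetic. The function names are hypothetical, and the intrinsics (`fx`, `fy`, `cx`, `cy`) would have to come from the camera's calibration; lens distortion is ignored in this sketch.

```python
def distance_from_height(focal_px, real_height_m, bbox_height_px):
    """Pinhole model: Z = f * H / h.

    focal_px is the focal length in pixels, real_height_m the known object
    height, and bbox_height_px the height of the detection box.
    """
    return focal_px * real_height_m / bbox_height_px


def pixel_to_camera_xyz(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) at a known depth to camera coordinates.

    Returns (x right, y down, z forward) in metres.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m
```

For example, with an assumed focal length of 500 pixels and a can about 0.122 m tall, a 61-pixel-high box puts the can roughly 1 m away, and a box half that height puts it roughly 2 m away.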

In general, depth from a monocular camera is not hard. If the camera moves, then any two consecutive frames form a “stereo” pair. The stereo baseline vector is just (robot velocity)/(frame rate).
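As a rough numeric illustration, here is the baseline calculation together with the classic stereo depth relation. The relation Z = f * B / d applies to camera motion perpendicular to the viewing direction; a robot driving straight toward the object needs different geometry. Function names and numbers are illustrative assumptions.

```python
def baseline_m(speed_m_s, frame_rate_hz):
    """Distance the camera travels between two consecutive frames."""
    return speed_m_s / frame_rate_hz


def depth_from_disparity(focal_px, baseline, disparity_px):
    """Classic stereo relation Z = f * B / d, for sideways camera motion."""
    return focal_px * baseline / disparity_px
```

At 0.3 m/s and 30 frames per second the baseline is only 1 cm, so a point 1 m away shifts about 5 pixels with a 500-pixel focal length. With a baseline that short, pairing frames that are further apart in time may give a more usable disparity.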

I tried the algorithm where I would react to every frame, and all you get is a robot that vibrates at 30 Hz. The way to go is that YOLO informs an object database, and the robot acts on the database. Now you have decoupled the planning rate from the frame rate. Also, with this model, you can add more cameras. Tesla does this and uses about 8 or 10 cameras to build its object database.
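A toy version of that split is sketched below: the vision side writes to the database on every frame, and the planner reads it at its own slower rate. The class, the label-keyed storage (one object per label), and the simulated miss pattern are simplifying assumptions for illustration only.

```python
class ObjectDatabase:
    """Shared store: the vision side writes, the planner reads."""

    def __init__(self):
        self.objects = {}   # label -> (x, y, hits)

    def report(self, label, x, y):
        """Called once per detection, at camera frame rate."""
        _, _, hits = self.objects.get(label, (x, y, 0))
        self.objects[label] = (x, y, hits + 1)

    def confirmed(self, min_hits=5):
        """Called by the planner; returns only objects seen often enough."""
        return {k: v[:2] for k, v in self.objects.items() if v[2] >= min_hits}


def simulate(frames=30, plan_every=10):
    """Feed detections every frame but let the planner look every tenth frame."""
    db = ObjectDatabase()
    plans = []
    for frame in range(1, frames + 1):
        if frame % 3 != 0:                 # pretend the detector misses every third frame
            db.report("soda can", 1.0, 2.0)
        if frame % plan_every == 0:        # planner runs at 1/10 of the frame rate
            plans.append(sorted(db.confirmed()))
    return plans
```

Because the planner only sees confirmed entries, the dropped frames never reach it, and a second camera could call `report` on the same database without changing the planner.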

I admit, I did not build the robot. I used video from Hollywood movies as the camera input to YOLO.
