My experience with YOLO is that you should not try to act on the results of one frame. A “person” will be detected in one frame and then not in the next. Some of this is just because the person is obstructed and can’t be seen, but just as likely, the object gets misidentified. So what you do is use a scheme like “an object must be detected 5 times in 8 frames, at the same location or along the same path of motion (after subtracting any motion of the robot), before it is ‘real’.” At 30 frames per second, this means it takes perhaps 1/4 second to be sure of what and where something is.
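To make the "5 times in 8 frames" idea concrete, here is a minimal sketch for a single object of a single class. It is not my actual code, just an illustration. The class name `MOfNConfirmer`, the pixel gate `max_dist`, and the assumption that the box center has already been corrected for robot motion are all made up for the example. It only checks "same location", not "same path of motion".

```python
import math
from collections import deque


class MOfNConfirmer:
    """Report an object as "real" once it was seen m times in the last n frames."""

    def __init__(self, m=5, n=8, max_dist=40.0):
        self.m = m
        self.max_dist = max_dist        # assumed gate, in pixels, for "same location"
        self.history = deque(maxlen=n)  # one entry per frame: (x, y) or None

    def update(self, detection):
        """Feed one frame. detection is an (x, y) box center, or None for a miss."""
        self.history.append(detection)
        hits = [d for d in self.history if d is not None]
        if len(hits) < self.m:
            return False
        # Count only the hits that lie close to the most recent hit.
        ref = hits[-1]
        near = sum(1 for d in hits if math.dist(d, ref) <= self.max_dist)
        return near >= self.m


confirmer = MOfNConfirmer()
frames = [(100, 100), None, (102, 101), (101, 99), None, (100, 100), (103, 102)]
for xy in frames:
    is_real = confirmer.update(xy)
# is_real becomes True only on the last frame, the fifth hit inside the window.
```

With the default window of 8 frames at 30 frames per second, the earliest possible confirmation is after 5 frames and the window spans about 0.27 seconds.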
Actually, you might use a more complex algorithm that maintains an object list and assigns “certainty” values to each object based on the number of times detected and YOLO’s score on each detection.
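A rough sketch of such an object list is below. Everything in it is an assumption made for illustration: the names `TrackedObject` and `ObjectList`, the nearest-neighbor matching, and the `gate`, `gain`, `decay`, and `drop_below` values. Certainty moves toward 1.0 on each hit, weighted by YOLO's score, and decays on each miss.

```python
import math
from dataclasses import dataclass


@dataclass
class TrackedObject:
    label: str
    x: float
    y: float
    certainty: float = 0.0
    hits: int = 0


class ObjectList:
    def __init__(self, gate=50.0, gain=0.3, decay=0.85, drop_below=0.05):
        self.gate = gate              # max distance to match a detection to an object
        self.gain = gain              # how fast certainty rises on a hit
        self.decay = decay            # multiplier applied to certainty on a miss
        self.drop_below = drop_below  # objects below this certainty are forgotten
        self.objects = []

    def _nearest(self, label, x, y, matched):
        best, best_d = None, self.gate
        for obj in self.objects:
            if obj.label != label or id(obj) in matched:
                continue
            d = math.hypot(obj.x - x, obj.y - y)
            if d <= best_d:
                best, best_d = obj, d
        return best

    def update(self, detections):
        """detections: list of (label, x, y, score) tuples for one frame."""
        matched = set()
        for label, x, y, score in detections:
            obj = self._nearest(label, x, y, matched)
            if obj is None:
                obj = TrackedObject(label, x, y)
                self.objects.append(obj)
            obj.x, obj.y = x, y
            obj.hits += 1
            # Move certainty toward 1.0 in proportion to the detector's score.
            obj.certainty += self.gain * score * (1.0 - obj.certainty)
            matched.add(id(obj))
        for obj in self.objects:
            if id(obj) not in matched:
                obj.certainty *= self.decay
        self.objects = [o for o in self.objects if o.certainty >= self.drop_below]

    def confirmed(self, threshold=0.8):
        return [o for o in self.objects if o.certainty >= threshold]


world = ObjectList()
for _ in range(8):
    world.update([("person", 100.0, 100.0, 0.9)])
real_objects = world.confirmed()
```

With these made-up numbers, a detection scoring 0.9 needs six hits to cross the 0.8 threshold, and an object that stops being detected fades out of the list after a couple dozen frames.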
It gets harder when objects move if you want to track them.
I did not invent this. I worked on several radar systems that tracked aircraft, and all of them are centered around the concept of a “track file.” That file collects statistics of each aircraft over time. We would talk about “promoting” “candidate tracks” to “tracks” and so on. Yes, people were doing this in the 1980s. YOLO data is not as noisy as radar returns, but still, the statistics of small numbers would kill you. You need to see it a statistically significant number of times, and the observations have to correlate based on what you know about the objects (refrigerators don’t move, but cats can move).
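The "refrigerators don't move, but cats can" test can be sketched as a simple gate on how far an object of a given class could plausibly have moved between two observations. The speed limits, the noise allowance, and the function name `is_plausible` are invented for this example, and positions are assumed to be in meters in a fixed world frame.

```python
import math

# Assumed, illustrative speed limits in meters per second; not measured values.
MAX_SPEED = {"refrigerator": 0.0, "cat": 5.0, "person": 3.0}
DEFAULT_MAX_SPEED = 2.0
POSITION_NOISE = 0.15  # meters of measurement jitter we tolerate (assumption)


def is_plausible(label, prev_xy, new_xy, dt):
    """True if moving from prev_xy to new_xy in dt seconds fits this class."""
    limit = MAX_SPEED.get(label, DEFAULT_MAX_SPEED) * dt + POSITION_NOISE
    return math.dist(prev_xy, new_xy) <= limit


dt = 1 / 30
fridge_ok = is_plausible("refrigerator", (0.0, 0.0), (0.1, 0.0), dt)   # jitter only
fridge_bad = is_plausible("refrigerator", (0.0, 0.0), (1.0, 0.0), dt)  # a 1 m jump
```

An observation that fails the gate is more likely a misidentification or a different object than a real move, so it should not add to the existing object's hit count.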
Finally, people and animals do exactly this. We have to watch a scene before we start noticing all the objects in the scene. “Important” objects jump out first, then the others later.
Sorry, the algorithm of “acting on each frame” is so easy that we all want it to work. But it can’t. Many times, we detect things by how they act in context. Soda cans are on the floor; we’d be surprised to see one flying like a bird.
I wrote the above by hand in Python. Maybe a better solution, but 10,000 times more work, is to train an AI to accept YOLO detections and then populate a database of “known objects”. YOLO is decent, but it lacks knowledge of the object's behavior over time. You get really poor results if you ignore this.
This is why robots are hard.