Thanks guys,
Yes, I suppose barcodes, QR codes, and OCR could be read. My bot's hands aren't quite big enough to pick up the beer, though it could perhaps pull a handle on a keg on tap.
Here's a pic with some other objects. These are all being classified with embeddings only - no points. In this case it was 100% (9 for 9), but that is not the norm when doing so many smaller objects at once; the norm so far is more like 7-8 of 9.
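For anyone curious what "embeddings only" looks like in practice, here is a minimal sketch of nearest-neighbor matching by cosine similarity. The function names, the tiny 3-D vectors, and the single-image-per-class gallery are all illustrative assumptions, not the actual pipeline:

```python
import numpy as np

def classify(query_emb, gallery):
    """Return the gallery class whose embedding has the highest
    cosine similarity with the query ROI's embedding.
    gallery: dict of class_name -> 1-D numpy embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    best_name, best_score = None, -1.0
    for name, emb in gallery.items():
        score = float(np.dot(q, emb / np.linalg.norm(emb)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Toy gallery: one stored embedding per object class (assumed vectors).
gallery = {
    "pliers": np.array([0.9, 0.1, 0.0]),
    "circuit_board": np.array([0.1, 0.9, 0.2]),
}
name, score = classify(np.array([0.85, 0.15, 0.05]), gallery)
```

With only one stored embedding per class, anything that shifts the query embedding (lighting, rotation, distance) eats directly into that similarity margin, which is consistent with the 7-8 of 9 behavior described above.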
I could easily improve the accuracy by storing more than 1 image per object, but I think there are things that can be done first to normalize the training images better (to handle the lighting and rotation differences that seem to be an issue for some classes). Lighting can be normalized, and ROIs could be de-rotated and de-skewed when needed for ground objects. It might also help to normalize the actual size of the training images.
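As one concrete example of the lighting-normalization idea, here is a per-ROI histogram equalization sketch in plain numpy. This is just one possible normalization, assumed for illustration; a real pipeline might use CLAHE or gray-world correction instead:

```python
import numpy as np

def equalize(gray):
    """Histogram-equalize a 2-D uint8 grayscale ROI so that training
    and query crops share a comparable brightness distribution."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    # Map each intensity through the normalized cumulative distribution.
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[gray]

# An already-uniform gradient passes through unchanged.
roi = np.arange(256, dtype=np.uint8).reshape(16, 16)
normalized = equalize(roi)
```

De-rotation would be the analogous geometric step: estimate the object's in-plane angle (e.g. from its ROI's principal axis) and rotate the crop upright before embedding it.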
The smaller objects, like the circuit boards, are more challenging (less reliable). My theory is that these objects don't lend themselves as well to the "verbal" nature of the embeddings. Reliability also drops way off as objects get more distant...the "large crimping pliers" in the pic would likely get misclassified if it were much further away. Having training images at various distances might help with that, though.
Ultimately, while much can be done with better algos, the quality and type of camera being used can be a limiting and very significant factor too. In the past, there was only so much you could do with a VGA-res image when the objects were some distance away; most techniques simply stopped working beyond a certain distance as the individual ROI images got really grainy/out of focus.
Probable Next Step: Better HD Camera with Depth Stream
I'm probably going to swap out the USB cam for an Orbbec depth cam soon. This will give me 4X the resolution and also the approximate distance to each object from the depth stream. That in turn lets me estimate each object's overall x,y,z dimensions in space (inches, etc.), so I can search only the embeddings that fit the overall size estimate and reduce false positives. All the techniques (like points) work better when there is more resolution. I could also perhaps build a classifier based on 3D shape, or shape in the ground plane; lots of possibilities with depth.
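The size-gating idea above can be sketched as a pre-filter in front of the embedding search. Everything here is an assumed illustration (the tolerance, the stored dimensions, the names), just to show how a coarse real-world size estimate prunes implausible candidates before the similarity comparison:

```python
import numpy as np

def size_gated_match(query_emb, est_dims, gallery, tol=0.5):
    """Match an embedding, but only against gallery objects whose
    stored real-world dimensions are within a relative tolerance.
    est_dims: estimated (w, h, d) in inches from the depth stream.
    gallery: dict of class_name -> (embedding, (w, h, d))."""
    q = query_emb / np.linalg.norm(query_emb)
    best, best_score = None, -1.0
    for name, (emb, dims) in gallery.items():
        # Reject candidates whose size differs too much from the estimate.
        if any(abs(a - b) / max(b, 1e-6) > tol for a, b in zip(est_dims, dims)):
            continue
        score = float(np.dot(q, emb / np.linalg.norm(emb)))
        if score > best_score:
            best, best_score = name, score
    return best

# A toy chair and a real chair with similar embeddings: the size gate
# keeps a ~20x30x20 inch detection from matching the 4-inch toy.
gallery = {
    "toy_chair": (np.array([1.0, 0.0]), (4.0, 4.0, 4.0)),
    "chair": (np.array([0.9, 0.1]), (20.0, 30.0, 20.0)),
}
match = size_gated_match(np.array([1.0, 0.0]), (19.0, 28.0, 19.0), gallery)
```

Even a loose tolerance helps, since the failure mode being targeted is gross mismatches (a distant large object matching a small one's embedding), not fine size discrimination.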
Using the depth data combined with the pose of the robot, I can estimate the positions of objects in 3D space and remember them. One could even localize based on this. If a "chair" is recognized in a particular area, the system should likely classify it as a chair when it sees it again; even if the embedding winner is something else at that moment, the chair class should be given more weight.
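The "give the remembered class more weight" idea could look something like the following sketch. The radius and boost values are made-up assumptions, and the score dictionary stands in for whatever the embedding search actually returns:

```python
import numpy as np

def apply_spatial_prior(scores, det_pos, memory, radius=12.0, boost=0.1):
    """Boost the scores of classes previously remembered near this spot.
    scores: dict of class_name -> embedding similarity score.
    det_pos: (x, y, z) of the new detection, in inches.
    memory: list of (class_name, (x, y, z)) remembered objects."""
    out = dict(scores)
    for name, pos in memory:
        dist = float(np.linalg.norm(np.subtract(det_pos, pos)))
        if dist < radius and name in out:
            out[name] += boost  # extra weight for a class seen here before
    return out

# A chair was remembered near (100, 50, 0); a new detection a couple of
# inches away scores slightly higher as "stool" on embeddings alone,
# but the spatial prior tips it back to "chair".
memory = [("chair", (100.0, 50.0, 0.0))]
scores = {"chair": 0.60, "stool": 0.65}
adjusted = apply_spatial_prior(scores, (102.0, 51.0, 0.0), memory)
```

An additive boost is the simplest choice; a real system might instead blend a location-conditioned prior probability with the embedding score.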
Much like verbal, there are lots of different contexts in vision. Lots of algos to write and experimentation to be done. I feel like computer vision is not much beyond "Wright Brothers" stage / early flights.
Bringing the TMI,
Martin