The VIG: Object Classification using VLM Embeddings


Martin Triplett

unread,
Aug 28, 2023, 11:53:03 AM8/28/23
to HomeBrew Robotics Club
These are some bits of what we talked about at the new "Visual" VIG meeting the other night.

Problem Overview:
Off-the-shelf vision models like YOLO recognize only the "classes" they were trained on (typically 80 or fewer).  They do not recognize the specific objects lying around your home.  Of those 80 classes, they are good at recognizing a few (like people, cats, and dogs) and quite bad at most of the others.  Besides, many of the classes are simply not applicable in the home.  Most of the time, they recognize a region and assign a class (like "bottle" or "remote") incorrectly, or too generically to be very useful.  They also completely ignore some ROIs in the FOV that obviously contain objects.  What if the robot's task is to pick up a specific brand of beer or find a particular circuit board or tool?  One could train a new model, but that would require gathering many thousands of images and many hours of manual work.

The Question:  How can a robot see a given object once, take a picture or two of it, and recognize it reliably after that?

Example Results - specific classifications from 1-2 training pics for each class
object recog.png
Note: the robot is recognizing specific beers and objects with high confidence (low "d" values); the d values represent vector distances.  It is not recognizing "a joystick", it is recognizing "my joystick".  While this example uses YOLO initially, the initial "bottle" classification coming from YOLO is refined...using VLM embeddings to determine the specific object (brand of beer, etc.).  This works for tools, robot parts, and many other unique household objects, even unfolded towels or specific pets that change shape, rotation, etc.  Also, the technique (VLM embeddings) works regardless of how ROIs are proposed, so the use of YOLO is entirely optional.

Solution Overview:
Identify ROIs using one or more techniques, then use a VLM (Vision-Language Model) and a set of training images (possibly only 1 image for each class) so that you can recognize specific objects without having to train a new model or build large training sets.  I use CLIP (a VLM) for this, and Faiss (a vector database/search library).

Pseudo-Code
  1. Identify Regions of Interest using 1 or more techniques
    1. In my example, I used YOLO to identify ROIs.  YOLO also returns an initial classification for each region.  These classifications are mostly wrong or not useful.
    2. ROIs can also be identified using various other techniques.  I used an algo that identifies areas based on color/light differences and/or background removal.
  2. Classify Each Region using VLM Embeddings (primary technique)
    1. If it was a YOLO region and it was classified as one of the classes that YOLO is good at (person, cat, dog) with a high enough probability, then leave it alone; it is good enough (unless you want it to classify/recognize a specific cat/dog in your house).
    2. Otherwise, classify each region by getting VLM embeddings for it (from a CLIP model) and doing a nearest-neighbor search (using Faiss) against all the embeddings from a set of labelled training images.  Update the classification (the label for the ROI) using the winner of this search.  (Layer 2)  Among other things, this will turn a generic "bottle" classification from YOLO into a specific beer classification...if you have a training image for that beer.
      1. Note, I found it to be MUCH better to use PNG images (with an alpha channel...RGBA) and a background removal model to remove the backgrounds from both the ROI image and the training images.  The vector search gets MUCH more accurate/confident, as the resulting vector distances get much smaller.  Background removal is, however, the slowest part of my process, as I am using a third-party model for that...one that is very good but very slow.
  3. Optional Step:  Refine the classifications for SOME more detailed and specialized classes (edge cases), i.e. when the classification resulting from step 2 is a member of one of a set of special groups (like playing cards, circuit boards, books, or art).  For these types of objects, you can run a point and feature matching algo (or some other technique) against a given library of images you have (Layer 3).  This is old school and tricky, but it can be made to work and can be useful for flatter objects that have lots of detail.  It is not necessary or desirable for most objects, as VLM embeddings are good enough for most of the objects I am interested in.  This extra step using points can excel when finer details in the objects need to be considered.  A higher-res camera and training images could also be part of the solution, though; I am using a low-res camera right now and the objects are small in the FOV.
    1. In the example picture, this technique is only being used for the "Queen of Spades".  The VLM may recognize the queen correctly, but it can often confuse the suit of the card, the number, etc., as these cards can be pretty close in VLM vector space.  A point and descriptor algo is tricky, but it can work for discerning the detailed differences and improving the probability of a successful classification.
Final Thoughts
This may be TMI for most, but I thought it might be helpful for some.  The current state of vision models is pretty bad, especially for object recognition.  Training your own models for many labels is time-prohibitive.  I think VLMs are a significant improvement, both for recognition itself and for human time savings.  They are by no means perfect.  They have their own issues, which we will have to identify and work through until something better comes along.  I think many of those issues can be mitigated for now.  I mentioned how much "background removal" can help.  There are other issues that probably result from the underlying nature of the "verbal" component of the VLM training.

In the VIG, we also talked about building a shape classifier and a color spectrum analyzer.  The longer term goal is to build a visual "voting" mechanism that can take into account the results from the VLM combined with size, shape, color, depth, points, texture, etc.  I am still working towards that goal.

Alan Downing

unread,
Aug 29, 2023, 8:03:19 PM8/29/23
to hbrob...@googlegroups.com
Nice write-up.  Thanks Martin!

Alan


Dave Everett

unread,
Aug 29, 2023, 10:31:23 PM8/29/23
to hbrob...@googlegroups.com
There's never too much information when it comes to this stuff.

For some items, barcodes could be read.

Dave

Rafael Skodlar

unread,
Aug 30, 2023, 1:55:34 AM8/30/23
to hbrob...@googlegroups.com
Very impressive Martin. One way to teach the robot is by having a camera at the "house input", i.e. kitchen area where the objects are stored. Pictures would be stored in "house AI system" to be shared with a robot so that the visual recognition would be easier. Another possibility would be to take things out of a bag or small crate and pass them to the robot to store them in the pantry. That would help with keeping inventory also.

Questions come to mind. Is the robot old enough to handle beer? Two stores asked me for a government issued ID just to buy a beer recently. Some of the computers I worked on are now in the Computer History Museum :-)

Suppose parents go away for a weekend. Their teenager(s) invite friends over to show them how the family robot can sort beer ...
Prediction: insurance companies and lawyers will ask for their share of beer.
Strange times we live in.

Rafael


Martin Triplett

unread,
Aug 30, 2023, 11:45:51 AM8/30/23
to HomeBrew Robotics Club
Thanks guys,

Yes, I suppose barcodes, QR codes, and OCR could be read.  My bot's hands aren't quite big enough to pick up the beer; it could perhaps pull a handle on a keg on tap, though.

Here's a pic with some other objects.  These are all being classified with embeddings only - no points.  In this case it was 100% (9 for 9) but that is not the norm when doing so many smaller objects at once.  The norm might be 7-8 for 9...so far. 

object recog 2.png
I could easily improve the accuracy if I stored more than 1 image per object, but I think there are things that can be done first to normalize the training images better (to handle lighting and rotation differences, which seem to be an issue for some classes).  Lighting can be normalized, and ROIs could be de-rotated and de-skewed when needed for ground objects.  Perhaps it would help to normalize the actual size of the training images too.

The smaller objects are more challenging (less reliable), like the circuit boards.  My theory is these objects don't lend themselves as well to the "verbal" nature of the embeddings either.  Reliability also goes way down as the objects get more distant...the "large crimping pliers" in the pic would likely get misclassified if it were much further away.  Maybe having a training image at various distances would help with that, though.

Ultimately, while much can be done with better algos, the quality and type of camera being used can be a limiting and very significant factor too.  In the past, there was only so much you could do with a VGA-res image with the objects some distance away; most techniques simply would not work beyond a certain distance, as the individual ROI images get really grainy/out of focus.
 
Probable Next Step:  Better HD Camera with Depth Stream

I'm probably going to swap out the USB cam for an Orbbec depth cam soon.  This will let me get 4X the resolution and also get the approximate distance to each object from the depth stream.  That in turn will let me estimate each object's overall x,y,z dimensions in space (inches, etc.), so I can search only the embeddings that fit the overall size estimate and reduce false positives.  All the techniques (like points) work better when there is more resolution.  I could also perhaps build a classifier based on 3D shape or shape in the ground plane; lots of possibilities with depth.
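As a toy example of the size-filtering idea (all the sizes and the tolerance here are made up), candidates whose known size can't match the depth-based estimate could be ruled out before or after the vector search:

```python
# Toy sketch: use a depth-based size estimate to rule out classes that
# can't physically fit. Sizes and tolerance are made-up illustration values.
object_sizes_in = {"my_joystick": 6.0, "fat_tire_beer": 7.5, "queen_of_spades": 3.5}

def plausible(label: str, est_height_in: float, tol: float = 0.5) -> bool:
    """Keep a candidate only if its known height is within tol (50%) of the estimate."""
    known = object_sizes_in.get(label)
    return known is not None and abs(known - est_height_in) <= known * tol

candidates = ["my_joystick", "queen_of_spades"]
filtered = [c for c in candidates if plausible(c, est_height_in=6.2)]
print(filtered)
```

A 6.2-inch estimate keeps the joystick but throws out the playing card before the embedding winner is even considered.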

Using the depth data combined with the pose of the robot, I can estimate the position of the objects in 3D space and remember them.  One could even localize based on this.  If a "chair" is recognized in a particular area, the system should likely classify it as a chair when it sees it again; even if the embedding winner is something else at that moment, the chair class should be given more weight.
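A toy version of that weighting might look like this (the boost value is a made-up number):

```python
# Toy sketch: boost the score of whatever class was previously remembered
# at this location before picking a winner. The 0.3 boost is illustrative.
def vote(embedding_scores: dict, remembered_label: str, prior_weight: float = 0.3):
    """embedding_scores maps label -> similarity score; highest total wins."""
    scores = dict(embedding_scores)
    if remembered_label in scores:
        scores[remembered_label] += prior_weight   # favor what was seen here before
    return max(scores, key=scores.get)

# The chair loses on raw similarity but wins once the remembered prior is added.
print(vote({"chair": 0.55, "stool": 0.60}, remembered_label="chair"))
```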

Much like verbal, there are lots of different contexts in vision.  Lots of algos to write and experimentation to be done.  I feel like computer vision is not much beyond "Wright Brothers" stage / early flights.

Bringing the TMI,
Martin

Scott Monaghan

unread,
Aug 30, 2023, 11:57:25 AM8/30/23
to hbrob...@googlegroups.com
Hi Martin,

This is awesome. 

I'm a bit out of the loop as I've been very busy with new job and other projects.

Are you using k-nearest-neighbor for these embeddings (plus, I assume, some extra Martin cleverness: e.g. color, light/dark ratio)?


Scott Monaghan

unread,
Aug 30, 2023, 11:59:28 AM8/30/23
to hbrob...@googlegroups.com
I see you mention using a "Clip model". 

I'm not familiar with that term.

Gmail

unread,
Aug 30, 2023, 12:39:29 PM8/30/23
to hbrob...@googlegroups.com
Martin,

When you say “vector distance,” I assume you are not talking about a physics vector. What does it mean in this context? 



Thomas

Alan Downing

unread,
Aug 30, 2023, 5:00:04 PM8/30/23
to hbrob...@googlegroups.com
Scott, CLIP is an open-vocabulary model that scores how well an image matches text prompts like "a photo of a {label}".  More detail can be found here:

Martin, you could track the objects over a series of images to improve the accuracy (even if you don't move the camera.)  Can you let me know whenever you check any of this in?

Thanks,
Alan

Martin Triplett

unread,
Aug 31, 2023, 10:55:39 AM8/31/23
to HomeBrew Robotics Club
Hi Scott,

You asked about Clip.  I won't go into detail on Clip, as Alan shared the doc for it.  To use Clip and other related models that return comparable vectors, I use the sentence-transformers library as a layer on top.

You can install it with "pip install sentence-transformers"
     Here is a doc link:   https://www.sbert.net/

The various models return 512-D vectors.  The following lines create two models using sentence-transformers, one for images and one for text.  Side note: sentence vectors are VERY useful for other verbal purposes, like finding and building relevant contexts for ChatGPT (a topic for another time).

    from sentence_transformers import SentenceTransformer

    image_model = SentenceTransformer('clip-ViT-B-32')
    text_model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')

The thing to realize about this is that Clip vectors were trained to converge to sentence vectors coming from other sources.  This means you can use and compare the vectors from one model (the image model) to another model (the text model) under the right circumstances.  This means you can in theory classify things with no training images, by comparing an ROI image to a set of vector embeddings coming from "words, phrases, or text sentences".  I have tried this a bit and it is a really interesting area to study, just not my area of focus right now.
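As a tiny illustration of that zero-shot idea, here is a runnable sketch.  Random unit vectors stand in for the real model outputs so nothing needs to be downloaded; in practice the vectors would come from image_model.encode([roi_image]) and text_model.encode(phrases):

```python
# Zero-shot sketch: pick a text label for an ROI by comparing its image
# embedding against text embeddings of candidate phrases.
# Random unit vectors stand in for the two models' real outputs here.
import numpy as np

rng = np.random.default_rng(1)

def unit(v):
    return v / np.linalg.norm(v)

phrases = ["a photo of a joystick",
           "a photo of a beer bottle",
           "a photo of a playing card"]
text_embeds = np.stack([unit(rng.normal(size=512)) for _ in phrases])

# Pretend the ROI's image embedding landed near the "joystick" text vector.
roi_embed = unit(text_embeds[0] + 0.05 * rng.normal(size=512))

# With unit-length vectors, cosine similarity is just a dot product.
scores = text_embeds @ roi_embed
best = phrases[int(np.argmax(scores))]
print(best)
```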

To get embeddings for one or more things at the same time (as a batch), you pass in a list of the things you want embeddings for.  These should be "lazy loaded" and stored somewhere so they can be used later for searching/classifying.  You will still need to get embeddings at runtime for a given ROI, just not for the object/training images:

    image_embeddings = image_model.encode(images)

In the case of images, I pass in PIL images...created with the PIL library from PNG images with 4 channels (RGBA).  Having an alpha channel and using it is important IMO.  I'm not sure if using PIL is a requirement, but it was the only way I got it to work.

If you want to do this type of thing, you will probably want to load multiple image "libraries" at startup.  A "library" in this lingo is simply a dict containing a list of the classes (labels) in it and an "index" of the embeddings.  I use Faiss to build the index, and derive the labels from the names of the image files.

Here are most of the key bits for how to search a given library, given a library_name and an roi_image.  The "libraries" in this example are created and indexed elsewhere.

    library = libraries.get(library_name)
    index = library.get("index")
    class_names = library.get("class_names")

    query_image = Image.fromarray(roi_image)
    query_embeds = image_model.encode([query_image])

    k = 3                                 # number of nearest neighbors to return
    D, I = index.search(query_embeds, k)  # D: distances, I: class ids, both sorted

    class_id = int(I[0][0])               # the winner
    distance = D[0][0]

    class_name = class_names[class_id]

Basically, you get back sorted class_ids and distances, which is what you want.  You could look only at the winner, or also take into account some of the runner-ups and the relative distances.
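For example, one simple way to use the runner-ups is to only accept the winner when its distance is below some cutoff and comfortably ahead of second place.  The threshold numbers here are made up:

```python
# Sketch: accept the top faiss hit only if it is close enough in absolute
# terms AND clearly ahead of the runner-up. Thresholds are illustrative.
def pick_winner(class_names, D, I, max_d=0.8, min_margin=0.1):
    """D, I are the (1, k) distance/index arrays returned by index.search."""
    best_d = float(D[0][0])
    best = class_names[int(I[0][0])]
    runner_up_d = float(D[0][1]) if D.shape[1] > 1 else float("inf")
    if best_d > max_d or (runner_up_d - best_d) < min_margin:
        return None, best_d          # unsure: fall through to other layers
    return best, best_d
```

Returning None for unsure cases is one way to decide when to fall back to the point-matching layer or other techniques.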

Martin Triplett

unread,
Aug 31, 2023, 12:37:48 PM8/31/23
to HomeBrew Robotics Club
Hi Thomas...you asked about vector "distance".  This is all in "my terms", as a layperson.

I think of it like using the Pythagorean Theorem...a distance in "Euclidean" space but with a lot more dimensions.  512 instead of 2 or 3.

To find the length of the hypotenuse (the "diagonal") of a 2D right triangle, you calculate the square root of the sum of the squares of the sides (2 of them for a triangle).  This math works in 2D space, in 3D space, and in higher N-D spaces too.  The distance is still the square root of the sum of the squares of the differences across all the dimensions (the values in the embedding vectors).
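In numpy that is one line, and the same line works for 2, 3, or 512 dimensions:

```python
# Euclidean distance: square root of the sum of squared differences.
import numpy as np

a = np.array([3.0, 0.0])   # works the same for 512-D embedding vectors
b = np.array([0.0, 4.0])

d = np.sqrt(np.sum((a - b) ** 2))   # the classic 3-4-5 triangle: d = 5.0
print(d)
```

np.linalg.norm(a - b) computes the same thing in one call.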

This is what lets us compare embeddings, provided the embeddings are built for this.  In this context, I think of the distance as a measure of error or difference.  There are various ways to do the calc, and it depends on whether the values are normalized first.  Here is an example of comparing two vectors using a numpy dot product in a single line of code; note that for unit-length vectors this gives a similarity rather than a distance (higher means closer).  Mathematicians could talk forever on what a dot product is and how it is calculated, but that would be very long and boring.

    # assumes: import numpy as np
    def vector_similarity(self, x: list[float], y: list[float]) -> float:
        # dot product; equals cosine similarity when x and y are unit length
        return np.dot(np.array(x), np.array(y))

Regardless of the technique used, I think the underlying concept is the same: a measure of distance in a dimensional space.  I wish I had learned more about using vectors and matrices in school; I didn't need them for my degrees.

I think you can use other measures of distance too, not just Euclidean, sort of like you can use different measures of "error" in stats.

Martin Triplett

unread,
Aug 31, 2023, 1:11:38 PM8/31/23
to HomeBrew Robotics Club
Hi Alan,

I totally agree about looking at the series "over time".

I'll let you know when I check it in.