Yes, of course YOLO can't do anything other than provide class labels and bounding boxes in camera coordinates. What I did was save all of those, so I had a database of every object the robot could detect. I actually went a little further and required that an object be seen in 3 or 5 frames before I believed it was real and put it in the list of detected objects. This improves the effective accuracy of YOLO.
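The frame-persistence filter can be almost nothing. Here is a simplified sketch, assuming each frame gives a list of (label, bbox) pairs from YOLO; the names and the threshold of 3 are illustrative, not my exact code:

```python
# Sketch of the "seen in N frames before it's real" filter.
# Assumes each frame yields a list of (label, bbox) detections from YOLO.
from collections import defaultdict

CONFIRM_FRAMES = 3  # an object must appear this many frames in a row

class ObjectDatabase:
    def __init__(self):
        self.candidates = defaultdict(int)   # label -> consecutive-frame count
        self.objects = {}                    # label -> last known bbox ("real" objects)

    def update(self, detections):
        """detections: list of (label, bbox) from one YOLO frame."""
        seen = set()
        for label, bbox in detections:
            seen.add(label)
            self.candidates[label] += 1
            if self.candidates[label] >= CONFIRM_FRAMES:
                self.objects[label] = bbox   # promoted to the detected-object list
        # reset the streak for anything not seen this frame
        for label in list(self.candidates):
            if label not in seen:
                self.candidates[label] = 0
```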
At the time I did not know about LLMs and certainly had no access to them.
Now for moving the arm. I assumed I'd use some kind of formal command language, like Terry Winograd's SHRDLU did. I wanted to be able to say "pick up the cellphone and place it on the plate." Winograd did this and a lot more around 1970. SHRDLU could stack and unstack blocks and answer questions about them like "What is under the green block?" So my idea was to use not blocks but a database of known objects.
Fast forward to 2024: we have an LLM running locally on my computer, and it is very good at understanding short commands in English and several other languages.
The LLM I run here presents the OpenAI API on localhost:8080 and accepts a blank key. I can also run the Whisper model and have tested it with English and a couple of other languages. It outputs good English text even if you speak in Japanese.
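Talking to it from Python is just the normal OpenAI client pointed at localhost. A sketch, where the model name is a placeholder for whatever the local server happens to be serving:

```python
# Sketch of calling the local OpenAI-compatible server.
# Assumes the `openai` Python package; the model name is a placeholder.
from openai import OpenAI

# the server accepts a blank key, so any non-empty string works here
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key")

reply = client.chat.completions.create(
    model="local-model",  # most local servers ignore or loosely match this
    messages=[
        {"role": "user", "content": "Pick up the cellphone and place it on the plate."}
    ],
)
print(reply.choices[0].message.content)
```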
Status: I have not yet connected Whisper to the LLM; at present I manually cut and paste the text. I've tested the API, but that's it. The YOLO-to-object-database pipeline works well enough to prove the concept, but I'm testing it with video pulled from YouTube. There is no camera connected.
Plans: start gluing the parts together. The first big step is to teach the LLM to output function calls in JSON format to manipulate objects in the database. You can set a flag so that the OpenAI API outputs JSON rather than plain English, which makes parsing the LLM output trivial.
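Roughly what I have in mind is shown below. The JSON flag is the standard `response_format` field in the chat completions API; whether a given local server honors it is server-dependent, so treat this as a sketch, and the pick_up/place command shape is just an example:

```python
# Sketch of asking for JSON output instead of plain English.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key")

reply = client.chat.completions.create(
    model="local-model",
    response_format={"type": "json_object"},  # ask for JSON, not prose
    messages=[
        {"role": "system",
         "content": 'Reply only with JSON like {"verb": "pick_up", "object": "cellphone", "target": "plate"}.'},
        {"role": "user", "content": "Pick up the cellphone and place it on the plate."},
    ],
)

command = json.loads(reply.choices[0].message.content)  # trivial to parse
print(command)
```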
It is going to be a lot of work, but the pieces are mostly there.
SHRDLU shows us how this can work. First you build a dumb, non-AI robot that can execute a few verbs like grasp, move, place, and rotate, and these verbs operate on objects in the internal database. That is about all SHRDLU did: things like "lift the block that is resting on the large green block." Once you have that 1970s tech working, you train (fine-tune) the LLM to output using those verbs (and only those verbs), in JSON.
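The executor side really can be that dumb. A sketch; the verb signatures and prints are placeholders for whatever the arm controller ends up needing:

```python
# Sketch of the non-AI executor: a handful of verbs operating on objects
# that are already in the detected-object database. All names are hypothetical.

def grasp(obj): print(f"grasping {obj}")
def move(obj, x, y, z): print(f"moving {obj} to ({x}, {y}, {z})")
def place(obj, target): print(f"placing {obj} on {target}")
def rotate(obj, degrees): print(f"rotating {obj} by {degrees} degrees")

VERBS = {"grasp": grasp, "move": move, "place": place, "rotate": rotate}

def execute(command, objects):
    """command: dict parsed from the LLM's JSON, e.g.
    {"verb": "place", "args": {"obj": "cellphone", "target": "plate"}}
    objects: the detected-object database."""
    if command["verb"] not in VERBS:
        raise ValueError(f"unknown verb: {command['verb']}")
    if command["args"].get("obj") not in objects:
        raise ValueError("object not in the detected-object database")
    VERBS[command["verb"]](**command["args"])
```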
For tasks more complex than lifting and moving objects, I think this is where the LLM will really be put to use. It will have to be trained to decompose big jobs ("clean the bathroom") into smaller and smaller jobs until it gets down to verbs the non-AI robot controller can do.
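The control loop around that decomposition could be very simple; the hard part is the fine-tuning, not the loop. A sketch, assuming the LLM returns either primitive verbs or sub-tasks that still need splitting:

```python
# Sketch of recursive task decomposition. ask_llm() stands in for the
# fine-tuned model call; the step formats here are assumptions.
PRIMITIVES = {"grasp", "move", "place", "rotate"}

def plan(task, ask_llm):
    """ask_llm(task) -> list of dicts, either {"verb": ..., "args": ...}
    for primitive steps or {"task": "wipe the sink"} for sub-jobs."""
    steps = []
    for step in ask_llm(task):
        if step.get("verb") in PRIMITIVES:
            steps.append(step)                         # executor handles this directly
        else:
            steps.extend(plan(step["task"], ask_llm))  # recurse on the sub-job
    return steps
```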
Another idea I like is to replace my really dumb "database", which is just a Python dictionary, with Prolog. Prolog can do an explicit decomposition that you hand-code. You can't possibly hand-code the entire world; someone tried that in about 1980. But I can hand-code a few hundred rules and give the LLM some higher-level primitives to work with.
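For example, with SWI-Prolog plus the pyswip bridge (just one way to embed Prolog in Python, and an assumption, since I have not wired this up), the facts would come from the YOLO object database and the rules would be the hand-coded part:

```python
# Sketch of a Prolog-backed object database via pyswip (assumes SWI-Prolog
# and pyswip are installed). Facts mirror the detected objects; the rule is
# an example of the hand-coded knowledge.
from pyswip import Prolog

prolog = Prolog()

# facts asserted from the detected-object database
prolog.assertz("detected(cellphone)")
prolog.assertz("detected(plate)")
prolog.assertz("on(cellphone, table)")
prolog.assertz("surface(table)")

# a hand-coded rule: something is movable if detected and not a surface
prolog.assertz("movable(X) :- detected(X), \\+ surface(X)")

for answer in prolog.query("movable(X)"):
    print(answer["X"])   # -> cellphone, plate
```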
My basic theory is that LLMs are good with language and basic high-level knowledge but just horrible at controlling servo motors; for that, specialized models, hard-coded rules, or plain trigonometry work best. An example is humans and animals. Our legs move and our hearts beat using so-called "central pattern generators": hard-wired neural circuits that run a kind of loop. The parameters of the loop, however, are controlled by higher levels. We only have to think about "run faster" and the nerves send adjustments to the pattern. We inherited this system from our invertebrate ancestors; the idea has been around for a long time.
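A toy illustration of that split, where the low-level loop generates the rhythm and the "higher level" only touches its parameters (everything here is illustrative, not robot code):

```python
# Toy pattern-generator: a hard-coded cyclic loop whose parameters
# (frequency, amplitude) are the only thing the high level adjusts.
import math

class GaitGenerator:
    def __init__(self):
        self.frequency = 1.0   # Hz; "run faster" means raising this
        self.amplitude = 10.0  # degrees of joint swing

    def joint_angle(self, t):
        # the hard-coded cyclic pattern itself
        return self.amplitude * math.sin(2 * math.pi * self.frequency * t)

gait = GaitGenerator()
gait.frequency = 2.0  # high-level command: "faster"
for step in range(5):
    t = step * 0.1
    print(f"t={t:.1f}s angle={gait.joint_angle(t):+.1f} deg")
```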
The other thing about LLMs is that they are slow, even on fast hardware. You will never get an LLM to look at video frames at 30 frames per second, but YOLO can do 30 fps even on a 10-year-old laptop.