Guiding a robot 🦾 arm with generative AI 🧠 - ChatGPT4o


Gmail

Jun 14, 2024, 12:05:22 AM
to RSSC-List
Today we held our weekly VIG SIG meeting. One of the discussion topics was “guiding a robot arm with a large language model”. It was noted that with GPT-3.5, one could not get the spatial location of an object.

After the meeting, I attempted to have GPT-4o give me the location of a red cup. It was quite successful! Although this was a simplistic use case, it does indicate that the newest version can locate objects in 3-D space. And if it can locate an object in 3-D, a robot might be able to grasp, move, or hit that object in 3-D. One thing to note, though, is that each response took several seconds. Obviously, a delay of that length wouldn’t work for many use cases.

See the images below.

[Images: image3.jpeg, image4.jpeg, image5.jpeg]



Thomas


Alan Downing

Jun 14, 2024, 12:53:39 AM
to Gmail, RSSC-List
I tried a chess board with pieces on it, and ChatGPT-4o seemed to hallucinate the pieces and their positions.

Alan




Chris Albertson

Jun 14, 2024, 2:11:48 AM
to Alan Downing, Gmail, RSSC-list
Assuming you use my idea to keep a database of YOLO-detected objects, then you ask GPT to “Pick up the red cup”. You train it to replace object names with database queries and to replace your verb with the “official” verb. So GPT is doing a translation from “Pick up the red cup” to “Function: GRASP <query result: red cup at approximate location X>”. This causes the robot’s “grasp_object” function to be called on the result of a database query, and the database returns the real-world x,y,z location and type of the object.

GPT should be smart enough to also accept “pick up the plastic container that is to the left” or “Grab that Solo Cup” and it would output the same thing as above.

The robot needs a fixed set of basic operations: grasp, walk, lift, move, go to X, and so on. The LLM has to be trained to output using ONLY these primitives.

It does not matter if the translation takes a few seconds. It is only done once for each command and you could always run the LLM locally to reduce that latency.
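
In code form, that translate-then-dispatch flow might look something like the sketch below (Python; the ask_llm stub, the GRASP primitive, and the detections table are hypothetical placeholders for illustration, not any real API):

# Sketch: translate a spoken command into a fixed primitive plus a database lookup.
# The LLM call is a placeholder; the model would be prompted to use ONLY the
# primitives listed in PRIMITIVES.

# Table of YOLO detections, kept current by the vision loop (x, y, z in meters).
detections = {
    "red cup":     {"type": "cup",    "x": 0.42, "y": 0.10,  "z": 0.05},
    "beer bottle": {"type": "bottle", "x": 0.30, "y": -0.15, "z": 0.07},
}

def grasp_object(x, y, z):
    """Robot primitive: move the gripper to (x, y, z) and close it."""
    print(f"GRASP at x={x:.2f} y={y:.2f} z={z:.2f}")

PRIMITIVES = {"GRASP": grasp_object}   # extend with WALK, LIFT, MOVE, GOTO, ...

def ask_llm(command):
    """Placeholder for the LLM translation step.  In the real system the model,
    constrained to the primitives above, would return e.g. 'GRASP red cup'."""
    return "GRASP red cup"

def dispatch(command):
    translated = ask_llm(command)               # e.g. "GRASP red cup"
    verb, _, object_name = translated.partition(" ")
    obj = detections.get(object_name.lower())   # the database query
    if verb in PRIMITIVES and obj is not None:
        PRIMITIVES[verb](obj["x"], obj["y"], obj["z"])
    else:
        print(f"Cannot execute: {translated!r}")

dispatch("Pick up the red cup")

The same dispatch would handle “Grab that Solo Cup”, as long as the LLM maps it to a name the database knows.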

Gmail

Jun 14, 2024, 3:13:13 AM
to Chris Albertson, Alan Downing, RSSC-list
You make good points. Still, I’m going to see how far a large language model by itself can take me.




Thomas


Thomas Messerschmidt

Jun 17, 2024, 11:31:44 PM
to Alan Downing, RSSC-List
Maybe you are asking too much of ChatGPT. 🙂


Alan Downing

Jun 18, 2024, 12:25:07 AM
to Thomas Messerschmidt, RSSC-List
Yes, but how do you know if you're asking too much and GPT is hallucinating?  You need some type of validation and affordance.  A possible subject for this week's VIG SIG.

Alan

Gmail

Jun 18, 2024, 2:37:15 AM
to Alan Downing, RSSC-List
Sure, that sounds good. And playing chess with ChatGPT-4o might be an interesting experiment to perform at our meeting.



Thomas



Chris Albertson

Jun 22, 2024, 12:32:47 AM
to Gmail, Alan Downing, RSSC-list
Chess and finding an object in photos seem to be getting far ahead of the game. Has anyone gotten a demo prototype where you can tell GPT “Move the servo to 25 degrees at a speed of one degree per second” and the servo then slowly moves? The next step is to say “Move the hand up by 25 mm” and have all six servos move in a coordinated way so the hand moves up.

Just getting the first “move to 25 degrees” working means that somehow GPT is connected to a microcontroller pin that generates the correct PWM signal, one that changes over time, from an English-language command. Just doing that would be a breakthrough.
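
For the “changes over time” part, the host-side rate limiting might look like the sketch below (Python with pyserial; the port name and the one-line “S<id> <angle>” protocol are assumptions about a hypothetical Arduino sketch that does the actual PWM):

import time
import serial  # pyserial

def move_servo(port, servo_id, target_deg, speed_deg_per_s, start_deg=0.0):
    """Step a servo toward target_deg at speed_deg_per_s by sending one small
    position update per tick; the microcontroller turns each line into the
    matching PWM pulse width."""
    tick = 0.05                        # seconds between updates
    step = speed_deg_per_s * tick      # degrees per update
    angle = start_deg
    while abs(target_deg - angle) > step:
        angle += step if target_deg > angle else -step
        port.write(f"S{servo_id} {angle:.1f}\n".encode())
        time.sleep(tick)
    port.write(f"S{servo_id} {target_deg:.1f}\n".encode())

with serial.Serial("/dev/ttyUSB0", 115200, timeout=1) as port:   # port name is an assumption
    move_servo(port, servo_id=0, target_deg=25, speed_deg_per_s=1)

The interesting step is still getting GPT to emit that one structured line from the English sentence.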

I used to ask my biology students, “Raise your hand if you believe that mental power can cause physical objects to move.” Some would raise their hands; others would not. Then I’d say, “Those who raised their hands just proved it possible; now, how does this work?” We can talk about neurons and muscles and the brain’s motor cortex, but no one knows how language connects to this. We have the same problem with robots: we do not have a good way to do this that will generalize to “please clean the kitchen.”


Gmail

Jun 22, 2024, 5:57:05 AM
to Chris Albertson, Alan Downing, RSSC-list
Yes, Chris. I have successfully accomplished something similar. I used Google voice recognition for the speech-to-text, a Windows 10 PC for the main code, an Arduino (serial communication over USB) to interface with the servos, and Windows text-to-speech for feedback.

Specifically, my Python program on my laptop records a spoken command via a microphone and sends it to Google’s speech-to-text service, which sends the text string back to my program. The program then sends the text to OpenAI over Wi-Fi, and OpenAI returns a servo command. The laptop program parses the response and sends a command to the Arduino over a USB serial cable. A second program (Arduino C) running on the Arduino takes the command and moves a servo. The laptop program then runs text-to-speech to output a verbal confirmation. Pretty straightforward, really.
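
For anyone who wants to wire up something similar, a stripped-down sketch of that loop is below. The “SERVO <id> <angle>” command format, the COM port, and the prompt are assumptions for illustration; it leans on the speech_recognition, openai, pyserial, and pyttsx3 packages and an OPENAI_API_KEY environment variable. It is not my actual code, just the general shape of it.

import os, re, serial, pyttsx3
import speech_recognition as sr
from openai import OpenAI

SYSTEM = ("You control a robot arm. Answer ONLY with a command of the form "
          "'SERVO <id> <angle>' where id is 0-5 and angle is 0-180.")

recognizer = sr.Recognizer()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
arduino = serial.Serial("COM4", 9600, timeout=1)     # USB serial link to the Arduino
tts = pyttsx3.init()

def listen():
    """Record one utterance and send it to Google's speech-to-text."""
    with sr.Microphone() as mic:
        audio = recognizer.listen(mic)
    return recognizer.recognize_google(audio)

def ask_gpt(text):
    """Translate the spoken request into the one-line servo command."""
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": text}],
    )
    return reply.choices[0].message.content.strip()

spoken = listen()                                    # e.g. "raise the elbow to ninety degrees"
command = ask_gpt(spoken)                            # e.g. "SERVO 2 90"
if re.fullmatch(r"SERVO \d \d{1,3}", command):       # validate before touching hardware
    arduino.write((command + "\n").encode())         # Arduino parses the line and moves the servo
    tts.say(f"Executed {command}")
else:
    tts.say("I did not get a valid servo command")
tts.runAndWait()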

Getting the right commands sent seems reasonably reliable with GPT-4o. Based on some experiments I performed at the VIG SIG last Thursday night, photo analysis did NOT work so well on games (tic-tac-toe and checkers), but it exceeded my expectations when it came to locating objects in 3D space and guiding a “robot arm” to touch the object.

(Please pardon me if I left anything out. It’s a bit late and I’m a bit tired 😴.)

I believe Martin and Jim D have both had similar successes.



Thomas


Carl

Jun 25, 2024, 6:15:25 PM
to RSSC-List
Do you know if the spatial results you showed with GPT-4o are part of the LLM, or did they add a secondary system for that, like arithmetic calculators?

Gmail

Jun 25, 2024, 6:33:43 PM
to Carl, RSSC-List
To tell you the truth, I really don’t know. Still, the LLM’s estimation of distance was uncanny, about as good as many humans’. It was accurate to within about an inch in both the x and y directions.

Still, I am planning on repeating this many times to get a better idea of whether it was dumb luck or a real accomplishment and advancement in ChatGPT-4o.




Thomas



Chris Albertson

Jun 25, 2024, 7:07:29 PM
to Carl, RSSC-list

From what I see and read, it appears to be descriptive. It gives answers such as “to the right of” or “in front of”. I don’t see GPT-4o giving coordinates for the centroid of an object or exact depths in meters. Whatever spatial awareness it has is, I think, a result of being trained on photo captions.

The thing to remember about GPT and other LLMs is that there is no “algorithm”. They have no loops and no “if this then that” logic. All they do is multiply, add, and compare numbers.

If you look at the task of a domestic robot, getting depth should be easy. We might aim a depth camera at the scene and read out the depth directly, with no need for “AI”. Or, if we have no room on the robot for a specialized depth camera, we can do “depth from motion”: move the camera relative to the subject, take two or more images, and use them as a stereo pair. Also, I think we could get GPT-4o to tell us the location of an object in a photo. What if you cropped the image from the left until the object was no longer detected, then cropped from the right, bottom, and top? You could find a bounding box for (say) the beer bottle. There are better ways to do this, like using specialized object-detector networks, but if you were forced to do it all with GPT-4o, cropping could work, although it would be slow.
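
Here is a rough sketch of that crop-and-re-ask search. The object_visible predicate and TRUE_BOX constant exist only to simulate the model’s yes/no answer and make the sketch runnable; in practice each probe would crop the photo and send it to GPT-4o, one slow round trip per probe.

# Shrink each edge inward until the detector stops reporting the object;
# the surviving rectangle approximates the bounding box.

TRUE_BOX = (130, 60, 210, 220)   # left, top, right, bottom; stands in for the real scene

def object_visible(box):
    """Stand-in for asking GPT-4o 'Is the beer bottle in this crop? yes/no'."""
    l, t, r, b = box
    tl, tt, tr, tb = TRUE_BOX
    return tl >= l and tt >= t and tr <= r and tb <= b   # object still fully inside the crop

def bounding_box(width, height, step=10):
    left, top, right, bottom = 0, 0, width, height
    while left + step < right and object_visible((left + step, top, right, bottom)):
        left += step
    while right - step > left and object_visible((left, top, right - step, bottom)):
        right -= step
    while top + step < bottom and object_visible((left, top + step, right, bottom)):
        top += step
    while bottom - step > top and object_visible((left, top, right, bottom - step)):
        bottom -= step
    return (left, top, right, bottom)

print(bounding_box(640, 480))   # converges on TRUE_BOX to within one step

At one model query per probe this is dozens of calls per object, which is why a dedicated detector like YOLO is the better tool when you have one.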



Chris Albertson

Jun 25, 2024, 7:24:35 PM
to Gmail, Carl, RSSC-list
Years ago I heard a talk on AI. The lecturer was holding a glass of water and said, “If I dumped this water on the floor, how do you know not to run for the exits to escape a flood? I don’t think anyone stops to do the volumetric calculations to compute the water level in the room; you know from experience that it takes a lot more water than a glassful to flood a large room.”

Humans looking at a photo of a jar on a counter just know that a jar is a certain size. They recognize other objects whose size they know and use their experience to scale the scene. I doubt any of us do the math. If we were looking at an actual jar in real life (not in a photo) it would be easier, because we have stereo vision and see parallax when we move. But in photos, we have to know the size of objects. GPT is doing this the same way we do: it is using the vast experience that is embodied in the training data.

Like us, it does not have to think; it just knows, because it is a familiar situation.


