Yes, of course YOLO can't do anything other than provide class labels and bounding boxes in camera coordinates. What I did was save all of those, so I had a database of every object the robot could detect. I actually went a little further and required that an object be seen in 3 or 5 frames before I believed it was real and put it in the list of detected objects. This improves the effective accuracy of YOLO.
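The frame-persistence filter can be almost nothing. Here is a simplified sketch, assuming each frame gives a list of (label, bbox) pairs from YOLO; the names and the threshold of 3 are illustrative, not my exact code:

```python
# Sketch of the "seen in N frames before it's real" filter.
# Assumes each frame yields a list of (label, bbox) detections from YOLO.
from collections import defaultdict

CONFIRM_FRAMES = 3  # an object must appear this many frames in a row

class ObjectDatabase:
    def __init__(self):
        self.candidates = defaultdict(int)   # label -> consecutive-frame count
        self.objects = {}                    # label -> last known bbox ("real" objects)

    def update(self, detections):
        """detections: list of (label, bbox) from one YOLO frame."""
        seen = set()
        for label, bbox in detections:
            seen.add(label)
            self.candidates[label] += 1
            if self.candidates[label] >= CONFIRM_FRAMES:
                self.objects[label] = bbox   # promoted to the detected-object list
        # reset the streak for anything not seen this frame
        for label in list(self.candidates):
            if label not in seen:
                self.candidates[label] = 0
```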
At the time I did not know about LLMs and certainly had no access to them.
Now for moving the arm. I assumed I'd use some kind of formal command language, like Terry Winograd's SHRDLU did. I wanted to be able to say "pick up the cellphone and place it on the plate." Winograd did this and a lot more around 1970. SHRDLU could stack and unstack blocks and answer questions about them like "What is under the green block?" So my idea was to use not blocks but a database of known objects.
Fast forward to 2024: we have an LLM running locally on my computer, and it is very good at understanding short commands in English and several other languages.
The LLM I run here presents the OpenAI API on localhost:8080 and accepts a blank key. I can also run the Whisper model and have tested it with English and a couple of other languages. It outputs good English text even if you speak in Japanese.
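Talking to it from Python is just the normal OpenAI client pointed at localhost. A sketch, where the model name is a placeholder for whatever the local server happens to be serving:

```python
# Sketch of calling the local OpenAI-compatible server.
# Assumes the `openai` Python package; the model name is a placeholder.
from openai import OpenAI

# the server accepts a blank key, so any non-empty string works here
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key")

reply = client.chat.completions.create(
    model="local-model",  # most local servers ignore or loosely match this
    messages=[
        {"role": "user", "content": "Pick up the cellphone and place it on the plate."}
    ],
)
print(reply.choices[0].message.content)
```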
Status: I have not yet connected Whisper to the LLM; at present I manually cut and paste the text. I've tested the API, but that's it. The YOLO-to-object-database pipeline works well enough to prove the concept, but I'm testing it with video pulled from YouTube. There is no camera connected.
Plans: start gluing the parts together. The first big step is to teach the LLM to output function calls in JSON format to manipulate objects in the database. You can set a flag so that the OpenAI API outputs JSON rather than plain English, which makes parsing the LLM output trivial.
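Roughly what I have in mind is shown below. The JSON flag is the standard `response_format` field in the chat completions API; whether a given local server honors it is server-dependent, so treat this as a sketch, and the pick_up/place command shape is just an example:

```python
# Sketch of asking for JSON output instead of plain English.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key")

reply = client.chat.completions.create(
    model="local-model",
    response_format={"type": "json_object"},  # ask for JSON, not prose
    messages=[
        {"role": "system",
         "content": 'Reply only with JSON like {"verb": "pick_up", "object": "cellphone", "target": "plate"}.'},
        {"role": "user", "content": "Pick up the cellphone and place it on the plate."},
    ],
)

command = json.loads(reply.choices[0].message.content)  # trivial to parse
print(command)
```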
It is going to be a lot of work, but the pieces are mostly there.
SHRDLU shows us how this can work. First you build a dumb, non-AI robot that can execute a few verbs like grasp, move, place, and rotate, and these verbs operate on objects in the internal database. That is about all SHRDLU did: things like "lift the block that is resting on the large green block." Once you have that 1970s tech working, you train (fine-tune) the LLM to output using those verbs (and only those verbs), in JSON.
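The executor side really can be that dumb. A sketch; the verb signatures and prints are placeholders for whatever the arm controller ends up needing:

```python
# Sketch of the non-AI executor: a handful of verbs operating on objects
# that are already in the detected-object database. All names are hypothetical.

def grasp(obj): print(f"grasping {obj}")
def move(obj, x, y, z): print(f"moving {obj} to ({x}, {y}, {z})")
def place(obj, target): print(f"placing {obj} on {target}")
def rotate(obj, degrees): print(f"rotating {obj} by {degrees} degrees")

VERBS = {"grasp": grasp, "move": move, "place": place, "rotate": rotate}

def execute(command, objects):
    """command: dict parsed from the LLM's JSON, e.g.
    {"verb": "place", "args": {"obj": "cellphone", "target": "plate"}}
    objects: the detected-object database."""
    if command["verb"] not in VERBS:
        raise ValueError(f"unknown verb: {command['verb']}")
    if command["args"].get("obj") not in objects:
        raise ValueError("object not in the detected-object database")
    VERBS[command["verb"]](**command["args"])
```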
For tasks more complex than lifting and moving objects, I think this is where the LLM will really be put to use. It will have to be trained to decompose big jobs ("clean the bathroom") into smaller and smaller jobs until it gets down to verbs the non-AI robot controller can do.
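The control loop around that decomposition could be very simple; the hard part is the fine-tuning, not the loop. A sketch, assuming the LLM returns either primitive verbs or sub-tasks that still need splitting:

```python
# Sketch of recursive task decomposition. ask_llm() stands in for the
# fine-tuned model call; the step formats here are assumptions.
PRIMITIVES = {"grasp", "move", "place", "rotate"}

def plan(task, ask_llm):
    """ask_llm(task) -> list of dicts, either {"verb": ..., "args": ...}
    for primitive steps or {"task": "wipe the sink"} for sub-jobs."""
    steps = []
    for step in ask_llm(task):
        if step.get("verb") in PRIMITIVES:
            steps.append(step)                         # executor handles this directly
        else:
            steps.extend(plan(step["task"], ask_llm))  # recurse on the sub-job
    return steps
```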
Another idea I like is to replace my really dumb "database", which is just a Python dictionary, with Prolog. Prolog can do an explicit decomposition that you hand-code. You can't possibly hand-code the entire world; someone tried that in about 1980. But I can hand-code a few hundred rules and give the LLM some higher-level primitives to work with.
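For example, with SWI-Prolog plus the pyswip bridge (just one way to embed Prolog in Python, and an assumption, since I have not wired this up), the facts would come from the YOLO object database and the rules would be the hand-coded part:

```python
# Sketch of a Prolog-backed object database via pyswip (assumes SWI-Prolog
# and pyswip are installed). Facts mirror the detected objects; the rule is
# an example of the hand-coded knowledge.
from pyswip import Prolog

prolog = Prolog()

# facts asserted from the detected-object database
prolog.assertz("detected(cellphone)")
prolog.assertz("detected(plate)")
prolog.assertz("on(cellphone, table)")
prolog.assertz("surface(table)")

# a hand-coded rule: something is movable if detected and not a surface
prolog.assertz("movable(X) :- detected(X), \\+ surface(X)")

for answer in prolog.query("movable(X)"):
    print(answer["X"])   # -> cellphone, plate
```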
My basic theory is that LLMs are good with language and basic high-level knowledge but just horrible at controlling servo motors; for that, specialized models, hard-coded rules, or plain trigonometry work best. An example is humans and animals. Our legs move and our hearts beat using so-called "central pattern generators": hard-wired neural circuits that run a kind of loop. The parameters of the loop, however, are controlled by higher levels. We only have to think about "run faster" and the nerves send adjustments to the pattern. We inherited this system from our invertebrate ancestors; the idea has been around for a long time.
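A toy illustration of that split, where the low-level loop generates the rhythm and the "higher level" only touches its parameters (everything here is illustrative, not robot code):

```python
# Toy pattern-generator: a hard-coded cyclic loop whose parameters
# (frequency, amplitude) are the only thing the high level adjusts.
import math

class GaitGenerator:
    def __init__(self):
        self.frequency = 1.0   # Hz; "run faster" means raising this
        self.amplitude = 10.0  # degrees of joint swing

    def joint_angle(self, t):
        # the hard-coded cyclic pattern itself
        return self.amplitude * math.sin(2 * math.pi * self.frequency * t)

gait = GaitGenerator()
gait.frequency = 2.0  # high-level command: "faster"
for step in range(5):
    t = step * 0.1
    print(f"t={t:.1f}s angle={gait.joint_angle(t):+.1f} deg")
```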
The other thing about LLMs is that they are slow, even on fast hardware. You will never get an LLM to look at video frames at 30 frames per second, but YOLO can do 30 fps even on a 10-year-old laptop.