Not only do we see videos of smooth motion; where I live we see Waymos driving smoothly in traffic. So surely it is possible.
Read up on how the car companies do it. Tesla and Waymo use very different methods, but from a high enough level they are similar. Both break the loop from sensor to driving by putting some kind of data structure in the middle: the sensors write into the structure, and the driver reads from it. This breaks the direct coupling.
Waymo uses a linguistic structure, where the sensors say “pedestrian is standing at the curb, give the guy some room,” while Tesla’s says “given this frame of video, a human driver would move left a few inches.”
Tesla is like a human who is “zoned out,” driving on reflexes without thinking, and Waymo is like a driver who narrates what he is about to do and then does it. I am sure that in reality the difference is not so black and white.
Finally, in the end, each of them simply creates waypoints on the road and then fits a second-order function to three points: (1) the current location, (2) the closest waypoint, and (3) the next waypoint. This parametric equation gives (x, y) as a function of time, and then the car can use a PID loop to stay right on that curve. The planner just has to output waypoints. Each new point means the equation has to be re-solved, but the points can be computed at any time, asynchronously.
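That fitting step is small enough to sketch. A minimal version in Python, fitting the unique quadratic through three points in Lagrange form, one function for x and one for y. The point coordinates and the t = 0, 1, 2 parameterization are made up for illustration, not anything from a real stack:

```python
def quadratic_through(ts, vs):
    """Return the unique quadratic f(t) passing through three (t, v)
    samples, written in Lagrange form (no linear-algebra library needed)."""
    (t0, t1, t2), (v0, v1, v2) = ts, vs
    def f(t):
        return (v0 * (t - t1) * (t - t2) / ((t0 - t1) * (t0 - t2))
              + v1 * (t - t0) * (t - t2) / ((t1 - t0) * (t1 - t2))
              + v2 * (t - t0) * (t - t1) / ((t2 - t0) * (t2 - t1)))
    return f

# Three points: current location, closest waypoint, next waypoint,
# pinned at times t = 0, 1, 2 (an arbitrary parameterization).
ts = (0.0, 1.0, 2.0)
x = quadratic_through(ts, (0.0, 1.0, 2.5))   # x-coordinates of the points
y = quadratic_through(ts, (0.0, 0.2, 0.1))   # y-coordinates of the points

# The car's target position at any time t in [0, 2]:
print(x(1.5), y(1.5))
```

When a new waypoint arrives, you just call `quadratic_through` again with the shifted triple; nothing about the controller has to change.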
In Waymo’s case, the sensors could say, “A pedestrian just stepped in front of the car,” and generate a waypoint with zero velocity a foot in front of the car. The equation recomputes, and the PID tries hard to get to zero velocity.
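Here is what “PID tries hard to get to zero velocity” looks like in miniature. This is a toy sketch with a one-dimensional car model and illustrative, untuned gains of my own choosing, not anyone’s real controller:

```python
class PID:
    """Textbook PID controller. Gains are illustrative, not tuned."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

dt = 0.05
pid = PID(kp=2.0, ki=0.1, kd=0.05, dt=dt)
v = 10.0                       # current speed, m/s
for _ in range(200):           # 10 seconds of simulated time
    accel = pid.step(0.0 - v)  # new target velocity from the stop waypoint: zero
    v = max(0.0, v + accel * dt)

print(f"speed after 10 s: {v:.3f} m/s")
```

The point is the division of labor: the planner only had to emit one new waypoint, and the dumb fast loop did the braking.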
Waymo needs a ton more compute power because it is using an LLM and language, but Waymo has the budget, because the computer saves the company a driver’s salary for two shifts per day, maybe $100K per year. So Waymo can spend $100K on a computer and still make money. Tesla has to build affordable cars, so it can’t even budget a $10,000 computer.
The trouble with a hobby project is that you can research the algorithms but can’t afford the hardware to run them. But we can try to do something like what the big players do, and that is to cut the system in half and not have video data directly control the steering angle. That will ALWAYS oscillate. Sensor to data structure, then structure to driver, breaks it into two asynchronous loops.
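A toy version of that decoupling, assuming Python threads and a lock-protected “latest plan” structure (the names here are mine, not from any real stack): the perception loop publishes waypoints whenever it has them, the control loop reads at its own fixed rate, and neither blocks the other.

```python
import threading, time, random

class PlanBoard:
    """Shared structure that breaks the sensor->driver coupling.
    Perception writes the latest waypoints; the driver reads them."""
    def __init__(self):
        self._lock = threading.Lock()
        self._waypoints = [(0.0, 0.0)]

    def publish(self, waypoints):
        with self._lock:
            self._waypoints = list(waypoints)

    def latest(self):
        with self._lock:
            return list(self._waypoints)

board = PlanBoard()
done = threading.Event()

def perception_loop():
    # Pretend sensor pipeline: emits fresh waypoints at irregular times.
    while not done.is_set():
        board.publish([(random.random(), random.random()) for _ in range(3)])
        time.sleep(random.uniform(0.01, 0.05))

def control_loop(ticks):
    # Fixed-rate driver: always acts on whatever plan is newest.
    for _ in range(ticks):
        wp = board.latest()
        # ...fit the curve and steer toward wp here...
        time.sleep(0.02)

t = threading.Thread(target=perception_loop, daemon=True)
t.start()
control_loop(ticks=20)
done.set()
t.join()
```

Because the driver never waits on the camera, a slow or stalled perception frame degrades the plan’s freshness instead of freezing the wheel.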
Theory: all oscillators are in fact feedback systems with an amplifier in the feedback loop. This applies to EVERY case, from electronics to social/political systems to robots.
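You can see the claim in one line of recurrence. This toy model is my own construction: correct the error using a one-step-delayed measurement, e[n+1] = e[n] − g·e[n−1], where the delay stands in for sensor-to-actuator lag and the gain g is the amplifier. Crank g past 1 and the loop rings and grows:

```python
def run_loop(gain, steps=30):
    """Error history of a feedback loop whose correction is based on a
    one-step-delayed measurement: e[n+1] = e[n] - gain * e[n-1]."""
    e = [1.0, 1.0]
    for _ in range(steps):
        e.append(e[-1] - gain * e[-2])
    return e

print(abs(run_loop(0.5)[-1]))   # low gain: the error dies out
print(abs(run_loop(1.5)[-1]))   # high gain: oscillates and blows up
```

The roots of the characteristic equation z² − z + g have magnitude √g, so the delayed loop is stable below g = 1 and an oscillator above it, which is exactly the video-straight-to-steering-angle failure mode.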
Solution: Don’t build that kind of loop.
When I took a class on self-driving cars, we did behavior cloning, where we trained a car to lane-keep from human driving data. It works OK EXCEPT if it ever gets into a situation it was not trained on, like “car t-boned the guard rail.” This is where the linguistic system shines: it can reason its way out of situations it has not seen. That said, if you have enough data, you don’t need reasoning. But no hobbyist will ever get to own that much data.
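Stripped to its core, behavior cloning is supervised regression from sensor features to the human’s control output. A toy version, assuming a one-feature linear model and a fabricated driving log (real versions use a convnet on camera frames, but the failure mode is the same):

```python
import random

random.seed(0)

# Fake driving log: feature = lane offset (m), label = human steering angle.
# Assume the human steers back proportionally: angle ~ -0.4 * offset + noise.
data = [(x, -0.4 * x + random.gauss(0, 0.01))
        for x in [random.uniform(-2, 2) for _ in range(200)]]

# Clone the behavior: fit angle = w * offset + b by gradient descent.
w, b, lr = 0.0, 0.0, 0.05
for _ in range(500):
    gw = sum((w * x + b - y) * x for x, y in data) / len(data)
    gb = sum((w * x + b - y) for x, y in data) / len(data)
    w, b = w - lr * gw, b - lr * gb

print(f"learned w={w:.3f} (true -0.4), b={b:.3f}")
# The cloned policy only knows offsets it saw in training: feed it an
# offset far outside [-2, 2] and you are extrapolating on faith.
```

That last comment is the “t-boned the guard rail” problem in one line: the model interpolates beautifully inside its training distribution and has nothing to say outside it.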
Driving is an interesting problem, but I’ve moved on to something different: a self-driving house. Fun thing I just learned: a $10 device can track human motion by how people block and reflect 5 GHz WiFi signals. It is called “WiFi sensing” and is good enough that a robot with a WiFi receiver can detect human gestures. This is VERY SPOOKY, as you could in theory track what people in a house were doing from the sidewalk, using how their motion distorted the radio waves.
(See the GitHub project Spectre to try it.) This seems perfect for robots. They do not need to spray out energy with LIDAR or ultrasonic “pings” if the house is already flooded with WiFi radio.