Nvidia DreamZero World-Action Model (Groot N2)


Alan Downing

Mar 16, 2026, 8:30:35 PM (7 days ago) Mar 16
to hbrob...@googlegroups.com
For those following or experimenting with Nvidia Groot, Nvidia has announced that Groot N2 will be released later this year. It will be based on DreamZero's World Action Model and it looks impressive!  "Unlike VLAs, WAMs learn physical dynamics by jointly predicting future world states and actions, using video as a dense representation of how the world evolves."  Zero-shot generalization, much less robot-specific or task-specific training, handling previously unseen tasks... For more details, look at:

Groot 1.7 should also be released soon (available now with early-access commercial licenses).

Enjoy!
Alan

Alan Timm

Mar 19, 2026, 1:16:54 PM (4 days ago) Mar 19
to hbrob...@googlegroups.com
Right on, I'll add this to my watchlist.

I was able to do a lot with N1.6 on an Orin AGX; I'm approaching 6 fps inference speed with 4 camera feeds and a bunch of optimizations.

I'm still missing something though: there are more oscillations in the neural output than I can filter out for smooth motion.  It has to be possible, because I keep seeing videos of smooth motion.  Hopefully Claude and I can figure something out.
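One common trick for taming jittery policy outputs is to low-pass filter the action stream before it reaches the motors. Here's a minimal sketch using an exponential moving average; the class name, alpha value, and the sample action vectors are all hypothetical, not anything from N1.6 itself:

```python
import numpy as np

class ActionSmoother:
    """Exponential moving average over a policy's action vector."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha   # lower alpha = smoother but laggier output
        self.state = None    # last smoothed action, None until first call

    def update(self, raw_action):
        a = np.asarray(raw_action, dtype=float)
        if self.state is None:
            self.state = a                                 # first call passes through
        else:
            self.state = self.alpha * a + (1 - self.alpha) * self.state
        return self.state

smoother = ActionSmoother(alpha=0.2)
cmd = smoother.update([0.10, -0.05])   # seeds the filter
cmd = smoother.update([0.30, 0.05])    # later calls blend new and old
```

The trade-off is latency: the smoother the output, the more the commanded motion lags the policy, so alpha has to be tuned against the control rate.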

How have you been?

Alan

--
You received this message because you are subscribed to the Google Groups "HomeBrew Robotics Club" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hbrobotics+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/hbrobotics/CAAvYDnFo_TU5jAVKFL2f%2BWmCxWxOU6NG7HBWfDmUFNHgYPgrig%40mail.gmail.com.

Alan Downing

Mar 19, 2026, 2:15:08 PM (4 days ago) Mar 19
to hbrob...@googlegroups.com
You're ahead of me.  I'm working on a wheeled humanoid robot (Roswell2).  It's got a magni base, 2 dynamixel with Aloha/

Alan Downing

Mar 19, 2026, 2:31:34 PM (4 days ago) Mar 19
to hbrob...@googlegroups.com
Sorry, I accidentally pressed send.

As I was saying, you're ahead of me.  I'm working on a wheeled humanoid robot (Roswell2).  It's got a Magni base, 2 Dynamixel arms, Aloha/Velo/other grippers, two 2-D lidars, 3 RealSense cameras, and a linear actuator to vertically move the head/arms.  I also have low-end Dynamixel arms that I plan to use for teleoperation.  All of the above were repurposed from previous robots.  For the brain, I have an Nvidia Thor developer kit.  I'm currently finishing up the ROS2 setup.  Then I'll do some more testing/improvements before moving on to Groot.

Alan

Chris Albertson

Mar 19, 2026, 6:44:26 PM (4 days ago) Mar 19
to hbrob...@googlegroups.com

On Mar 19, 2026, at 10:16 AM, Alan Timm <gest...@gmail.com> wrote:

Right on, I'll add this to my watchlist.

I was able to do a lot with N1.6 on an Orin AGX; I'm approaching 6 fps inference speed with 4 camera feeds and a bunch of optimizations.

I'm still missing something though: there are more oscillations in the neural output than I can filter out for smooth motion.  It has to be possible, because I keep seeing videos of smooth motion.  Hopefully Claude and I can figure something out.


Not only do we see videos of smooth motion; where I live we see Waymos driving smoothly in traffic.  So surely it is possible.


Read up on how the car companies do it.  Tesla and Waymo use very different methods, but from a high enough level they are similar: both break the loop from sensor to driving by inserting some kind of data structure.  The sensors write into the structure, and the driver reads from it.  This breaks the direct coupling.

Waymo uses a linguistic structure, where the sensors say "pedestrian is standing at the curb, give the guy some room," while Tesla's says "given this picture in the video, a human driver would move left a few inches."

Tesla is like a human who is "zoned out," driving on reflexes without thinking, and Waymo is like a driver who tells himself what to do and then does it.  I am sure that in reality the difference is not so black and white.

Finally, in the end, each of them simply creates waypoints on the road and then fits a second-order function to three points: (1) the current location, (2) the closest waypoint, and (3) the next waypoint.  This parametric equation gives (x, y) as a function of time, and the car can then use a PID loop to stay right on that curve.  The planner just has to output waypoints.  Each new point means the equation has to be re-solved, but the points can be computed asynchronously, at any time.

In Waymo's case, the sensors could say "a pedestrian just stepped in front of the car" and generate a waypoint with zero velocity a foot in front of the car.  The equation recomputes, and the PID tries hard to reach zero velocity.

Waymo needs a ton more compute power because it is using LLMs and language, but Waymo has the budget because the computer saves the company a driver's salary for two shifts per day, maybe $100K per year.  So Waymo can spend $100K on a computer and still make money.  Tesla has to build affordable cars, so they can't even budget a $10,000 computer.

The trouble with a hobby project is that you can research the algorithms but can't afford the hardware to run them.  But we can try to do something like what the big players do: cut the system in half so that video data does not directly control the steering angle.  That will ALWAYS oscillate.  Sensor to data structure, then structure to driver, breaks it into two asynchronous loops.
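The two-asynchronous-loops idea can be sketched as a lock-protected slot that the perception loop overwrites and the control loop polls at its own rate. The class and the loop rates here are illustrative placeholders, not any particular stack's API:

```python
import threading

class WaypointSlot:
    """Latest-value mailbox decoupling perception from control."""

    def __init__(self):
        self._lock = threading.Lock()
        self._wp = None

    def write(self, wp):
        # Perception thread: overwrite at whatever rate it manages.
        with self._lock:
            self._wp = wp

    def read(self):
        # Control thread: always sees the most recent waypoint.
        with self._lock:
            return self._wp

slot = WaypointSlot()
slot.write((1.0, 2.0))   # perception loop publishes
latest = slot.read()     # control loop consumes asynchronously
```

Because each side only ever touches the slot, a slow perception update can never stall the fast motor loop; the controller just keeps tracking the last waypoint it saw.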

Theory: all oscillators are in fact feedback systems with an amplifier in the loop.  This applies to EVERY case, from electronics to social/political systems to robots.

Solution: don't build that kind of loop.


When I took a class on self-driving cars, we did behavior cloning, where we trained a car to lane-keep from human driving data.  It works OK EXCEPT when it gets into a situation it was not trained on, like "car t-boned the guard rail."  This is where the linguistic system shines: it can reason its way out of situations it has not seen.  That said, if you have enough data, you don't need reasoning.  But no hobbyist will ever own that much data.

Driving is an interesting problem, but I've moved on to something different: a self-driving house.  Fun thing I just learned: a $10 device can track human motion by how people block and reflect 5 GHz WiFi signals.  It is called "WiFi Sensing" and is good enough that a robot with a WiFi receiver can detect human gestures.  This is VERY SPOOKY, as you could in theory track what people in a house were doing from the sidewalk, using how their motion distorted the radio waves.  (See the GitHub project Spectre to try it.)  This seems perfect for robots: they don't need to spray out energy with LIDAR or ultrasonic "pings" if the house is already flooded with WiFi radio.
