Alfie build thread


Alan Timm

unread,
Apr 15, 2025, 11:38:01 PM4/15/25
to RSSC-List
Hey there!

I'm getting closer to (re)assembling Alfie.  The 12V 20A buck converter is working well, although I think it's time to shorten a whole bunch of cables so that everything fits (it doesn't quite yet).

Also I've fallen into a bit of a rabbit hole wrt on-board processing.  I rage-quit my indiedroid nova SBC and have moved on to the Radxa Rock 5C with 16gb ram.

There are some compelling options for on-device speech synthesis, speech recognition (?!), and large/small language models (?!).  It's crazy that you can run these on a Raspberry Pi-sized device.
I think the Qwen models are even capable of tool use.  You can run several combinations of these on an 8GB SBC, and the whole stack with room to spare on a 16GB device.

Here's a sample of libretts_r_medium voice 4 (there are 903 voices available in total) linked in the message.
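In case anyone wants to reproduce the sample, here's a minimal sketch of one way to drive the piper CLI from Python.  The model filename and speaker id are placeholders, and the flags may differ slightly between piper versions, so treat it as a starting point:

import subprocess

text = "Hello, I'm Alfie."
# Call the piper CLI: text goes in on stdin, a wav comes out.  --speaker picks
# one of the multi-speaker voices (voice 4 here); the model path is a placeholder.
subprocess.run(
    [
        "piper",
        "--model", "en_US-libritts_r-medium.onnx",
        "--speaker", "4",
        "--output_file", "assistant.wav",
    ],
    input=text.encode("utf-8"),
    check=True,
)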

PXL_20250416_005108390.jpg
assistant.mp3

Gmail

unread,
Apr 18, 2025, 10:18:33 PM4/18/25
to Alan Timm, RSSC-List
Alan,

ChatGPT 4o says, 

 “I can identify and classify thousands of object types in uploaded photos; common categories include:

  • People (faces, age/gender estimates, activities)
  • Animals (species, breeds)
  • Plants (trees, flowers, leaves)
  • Food (types of dishes, ingredients)
  • Text (printed/handwritten, languages)
  • Vehicles (cars, planes, bikes)
  • Buildings (types, landmarks)
  • Everyday objects (furniture, tools, electronics)
  • Clothing (styles, colors, accessories)
  • Signs and labels (road signs, logos, warnings)”

Can you recommend a similar (free) on-device image classification model? I mean something more like chatgpt and less like YOLO. I am ok if it requires a standard or even a gaming laptop with a high end GPU. 


Thomas

-  

Need something prototyped, built or coded? I’ve been building prototypes for companies for 15 years. I am now incorporating generative AI into products.

-

Need a great hardworking engineer? I am currently looking for a new job opportunity in robotics and/ or AI. 

Contact me directly or through LinkedIn:   



Chris Albertson

unread,
Apr 18, 2025, 11:21:22 PM4/18/25
to Gmail, gestalt73, RSSC-list


What you need is not so much the powerful GPU, but one with huge amounts of VRAM.  The models that can identify all those things are huge, many billions of parameters, and it really has to be VRAM that the GPU can access.  Even a “small” 20 billion parameter model will require about 20GB of VRAM (roughly one byte per parameter at 8-bit quantization), which is not to be found on a notebook PC GPU.  Possibly an Apple Mac could work because of its unified RAM model.

But if you are after efficiency, a YOLO-like model trained on the images you care about is the best choice, as it can run on a Raspberry Pi with one of those “AI chips” attached.

OK, but if you want a publicly available open-source LLM... go to Hugging Face and search for one.

Alan Timm

unread,
Apr 19, 2025, 1:18:36 PM4/19/25
to RSSC-List
Hey Thomas,

Good morning!

There are a couple of ways to answer your question, and it all depends on how much iron you're willing to throw at the problem.

My current rabbit hole involves running these models on an rk3588 sbc with 16gb of ram, so this 3.8B llava-phi3 model caught my eye:

It's generating text responses at about 6 tokens per second, but I haven't tried the image capabilities yet.  It's taking up about 9GB of RAM at the moment.

as well as this rkllama library that purports to run models on the rk3588 npu:

I'm not sure if/how much faster that will be than taking up all 8 cpu cores. I'll probably take a closer look this week.

But...  There's probably a near future when I need to add in an nvidia jetson board for some more GPU horsepower, in which case you might be looking at the 16gb orin nx and carrier board:


I'd suggest starting with that llava-phi3 model and working your way upwards from there.

Alan

Alan Timm

unread,
Apr 19, 2025, 1:28:41 PM4/19/25
to RSSC-List
Here's the result of passing in the attached image and asking "What's in the image?" on my Radxa Rock 5C (15GB RAM, 8-core SBC @ 1.8GHz).
The round trip time was almost 2 minutes.  So not fast, but maybe useful?

>>> what is in /home/alfie/Pictures/homeoffice.jpg
Added image '/home/alfie/Pictures/homeoffice.jpg'
The image shows an old school desktop computer setup with a yellow plastic chair in front of it. The laptop
screen displays "03:58" and the mouse is black. There are two mugs next to the keyboard - one is green and
the other is white. On the desk, there is also a potted plant with green leaves.

total duration:       1m57.419420595s
load duration:        4.535755612s
prompt eval count:    716 token(s)
prompt eval duration: 1m38.395394584s
prompt eval rate:     7.28 tokens/s
eval count:           73 token(s)
eval duration:        14.425655452s
eval rate:            5.06 tokens/s
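If you'd rather call it from Python than the interactive CLI, here's a rough sketch using the ollama Python client (the images field is how that client hands pictures to vision models; treat it as a starting point rather than a drop-in):

import ollama

# Same question as the CLI session above, but through the Python client.
# Model name and image path match the example; both are easy to swap out.
response = ollama.chat(
    model="llava-phi3",
    messages=[
        {
            "role": "user",
            "content": "What is in the image?",
            "images": ["/home/alfie/Pictures/homeoffice.jpg"],
        }
    ],
)
print(response["message"]["content"])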
homeoffice.jpg

Chris Albertson

unread,
Apr 19, 2025, 2:55:05 PM4/19/25
to gestalt73, RSSC-list
Here is the problem, or really the choice you have.

(1) You can use LLM-based technology and, after two minutes, get a written paragraph that nicely describes the image.  You would then have to process the paragraph to extract information.  This is good because it shows the model can accept just about anything you show it.  Or...

(2) You can run a version of YOLO and it will return a list of objects with bounding box coordinates, but it will only see objects that it is trained to see.  On the other hand, it runs on modest hardware.  I was able to get 30 frames per second on a Linux PC, which means YOLO was able to process live video in real time (my test data was a Hollywood action film).  The objects and the boxes were stored in a database-like list that could be queried.
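(If you want to try the YOLO route quickly, here's a minimal sketch with the ultralytics package.  This is not exactly what I ran; the model file and frame.jpg are just placeholders.)

from ultralytics import YOLO

# Load a small pretrained model (example choice) and run it on one frame.
model = YOLO("yolov8n.pt")
results = model("frame.jpg")

# Each result carries the detected boxes; print class names and pixel coordinates.
for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls)]
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(cls_name, round(x1), round(y1), round(x2), round(y2))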

I think what you do depends on what the task is.  A navigation task needs the coordinates in (x, y) of each object and can't wait 2 minutes.  By "navigation" I mean not just rolling on wheels but also an arm grasping an object.

But perhaps the robot's job is to answer questions like "Robbie, did UPS deliver my package? Is it on the porch?"  Then the LLM would be ideal.  But to open the door and pick up that box, you need more classic vision: photogrammetry, not AI.

It is interesting to see how Tesla handles this.  The cameras run at about 30 FPS, and the data is sent to about 5 different models, each run independently, in parallel.  Each model turns the image frames into data.  This may be the solution for robots.  Don't choose.  The correct answer is "all of the above".


Jim DiNunzio

unread,
Apr 19, 2025, 3:04:12 PM4/19/25
to Alan Timm, RSSC-List
I got a nice Easter present after a 4-month wait.  I definitely want to try out a vision model like that with this board running 67 TOPS at a max of 25 watts.  After that, I'll figure out a robot to wield it.
Jim

20250419_113901.jpg

Alan Timm

unread,
Apr 19, 2025, 9:02:40 PM4/19/25
to RSSC-List
Hey Chris & Thomas,
   Yep, it all depends on what problem(s) you're trying to solve, how fast you need the feedback, and ultimately where you want the processing to occur.  Usually you optimize for two and put up with whatever's left for the third.

For Alfie, I'll host a handful of these SLM models on the SBC for a bit to see if there's any practical use for them.  So far Piper TTS is faster than real time with < 1 second latency to first utterance.  I'll check out faster-whisper next.
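(For anyone who wants to follow along, faster-whisper is only a few lines.  A minimal sketch, with the model size and wav file as placeholders:)

from faster_whisper import WhisperModel

# "small" with int8 compute is a reasonable starting point on SBC-class hardware.
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("utterance.wav")
print("detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")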

Hey Jim,
   Ohmygosh, you got one?!?!?  Nice!  There's a software unlock to update all the Jetson Nano and Orin boards to Super status with a corresponding increase in power use and TOPS.

For Alfie, after I complete systems integration and get the ROS scaffolding up, it'll be time for "operation: hi five!" to train a neural model for him that gives high fives whenever someone holds up their hand the right way.  That'll tell me a lot more about what type of processing power I need to have on board, and I have the Orin and carrier board on a wishlist.  It'll connect to the Rock 5C over 1Gb Ethernet and will be nestled on the base under the battery.

Alan

Gmail

unread,
Apr 21, 2025, 4:07:27 PM4/21/25
to Alan Timm, RSSC-List
Hey Alan,

Thanks for this and your other replies. When I get a few minutes (Hours? Days?) I will attempt to download, install, configure, and try out that model sometime later this week. 

Did you say that you found it took about 2 minutes for analysis of a photo? 

I’m going to be running this on my gaming laptop with its 4070 gpu. 

  • Intel® Core™ i9-14900HX Processor
  • NVIDIA® GeForce RTX™ 4070 Laptop GPU,
  • 8GB GDDR6
  • 32GB DDR5 Memory
  • 1TB NVMe PCIe SSD
  • 16" 16:10 FHD+ (1920x1200), 144Hz
  • Thunderbolt™ 4



Thanks again! Wish me luck 🍀! 


Thomas





Alan Timm

unread,
Apr 21, 2025, 8:59:01 PM4/21/25
to RSSC-List
Running on an RTX 4070?  That 3.8B vision model will run a LOT faster.  It took two minutes on my Raspberry Pi-class board.  :-)

Alan

Alan Timm

unread,
Apr 21, 2025, 9:05:38 PM4/21/25
to RSSC-List
Alfie can shrug now!

The tales that I could tell (and probably will next month) about what I ran into while getting this to work.

The initialize procedure uses one of the three opto switches from the delta printer to detect max down position, then travels 390mm to the top position.

That shrug at the top?  That's a flourish.  Totally unnecessary, but that's why I go up to 390mm and not 400mm.  You have to leave a little bit of room for the occasional shrug.  :-)

screenshot_21042025_180100.jpg

Gmail

unread,
Apr 21, 2025, 9:20:19 PM4/21/25
to Alan Timm, RSSC-List
Well, I don’t know. 🤷🏻 
😆



Thomas




Gmail

unread,
Apr 21, 2025, 9:24:49 PM4/21/25
to Alan Timm, RSSC-List
I hope “a LOT faster” means under 15 seconds. 



Thomas











Alan Timm

unread,
Apr 21, 2025, 11:33:46 PM4/21/25
to RSSC-List
I don't know what the performance difference is between a laptop rtx 4070 and a desktop rtx 3090..

But on my desktop rtx 3090 it was *a bit* faster...
like ~ 0.5 seconds total.

  ❯❯ /home/alansrobotlab : ollama run llava-phi3 --verbose
>>> what is in /home/alansrobotlab/Pictures/homeoffice.jpg
Added image '/home/alansrobotlab/Pictures/homeoffice.jpg'
1. A wooden desk with a laptop on it, next to two black coffee mugs and a plant. The time displayed on the laptop
screen is 03:58. There is also a yellow plastic chair with wheels tucked underneath the desk.
2. A window in the room that shows a view of trees outside.

total duration:       463.085074ms
load duration:        14.85164ms
prompt eval count:    589 token(s)
prompt eval duration: 23.863499ms
prompt eval rate:     24682.05 tokens/s
eval count:           76 token(s)
eval duration:        417.033728ms
eval rate:            182.24 tokens/s

Gmail

unread,
Apr 22, 2025, 12:36:53 AM4/22/25
to Alan Timm, RSSC-List
Well, for basic world knowledge, < 3 seconds would be fine.  For real-time robot navigation (navigating through a home by camera alone is one of my goals/use cases), 0.5 seconds might be a bit too slow.

Assuming 1.5 MPH, a robot would go a bit more than two feet in a second.  I suppose then that the robot would have to stop every so often and check its path.  I have been doing a lot of experimentation with uploading videos to ChatGPT 4o using the API.  ChatGPT 4o has gotten a lot better over the last few months.  They are teasing us about version 5.  I can't wait!

OpenAI has also released (in beta) "live vision" video and audio analysis.  I have been using it for the last several months, and I find it to be laggy.  It falls behind as much as 30 seconds after only three or four minutes of use.  Also, OpenAI limits me to using it for about 15 minutes a day.  Still, it's truly amazing for interactive conversations.  All of a sudden, your robot is no longer blind.  You want your robot to have conversations similar to the sci-fi robots Robby, C3PO, Johnny 5, or Rosie?  This is your answer!  BUT, I have tried it for robot navigation, and unfortunately I found that this feature is not yet ready for prime time. 😆



Thomas




Alan Timm

unread,
May 15, 2025, 11:30:47 PM5/15/25
to RSSC-List
For those of you who attended last weekend's meeting, you heard Alfie's voice.  Using Piper TTS with voice 65 (out of 906) is faster than real time.  He sounds pretty good, and natural-ish for on-device generation.

More recently NVIDIA quietly released a new automatic speech recognition (ASR) model called Parakeet v2 0.6B.  It also runs much faster than real time and outperforms Whisper in both speed and accuracy.

The default 16-bit model transcribes speech at twice real time (3.6 seconds for 7 seconds of speech).

There's also an onnx-asr project and a converted ONNX model that transcribes speech at 4 times real time (1.5 seconds for 7 seconds of speech).

I'll still need a speech detector, maybe a wake word detector, and a diarization model, but I'm amazed at how well all this works on a Raspberry Pi 5-class SBC.

Alan
this_is_alfie.wav

Alan Timm

unread,
Jun 5, 2025, 5:32:40 PM6/5/25
to RSSC-List
I've made a lot of progress with on-device functionality for Alfie.  Here's a quick demo of Silero speech detection and NVIDIA Parakeet ASR on the Radxa Rock 5C.
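The speech-detection half is tiny.  Here's a rough sketch of wiring up Silero VAD via torch.hub, with the wav file as a placeholder:

import torch

# Load the Silero VAD model plus its helper utilities from torch.hub.
model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

# Silero expects 16kHz mono audio; the timestamps come back in samples.
wav = read_audio("mic_capture.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech)  # e.g. [{'start': 12034, 'end': 49120}, ...]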

We'll talk a lot more about it next weekend!

screenshot_05062025_143051.jpg

Alan Timm

unread,
Jun 17, 2025, 12:09:21 AM6/17/25
to RSSC-List
Following Brian's lead I've started syncing my work with a github repository:

Among other things that keeps a copy of the code safe in case I do something dumb, which is known to happen.  :-)

Also I think Jim makes a convincing argument for using a wakeword.

Hey Jim, what was that shiny new wakeword program you're using?  


Alan

Jim DiNunzio

unread,
Jun 17, 2025, 2:35:44 AM6/17/25
to Alan Timm, RSSC-List

Hi Alan,

 

I’m using Porcupine Wake Word by Pico Voice. It runs locally on your machine and is free for non-commercial projects. You can create one wake word per month. Sign up and click the non-commercial option, and agree not to aspire to make any money with it  (at least while using their tech!)

 

https://picovoice.ai/platform/porcupine/

https://picovoice.ai/docs/quick-start/porcupine-python/

 

You can see my example code utilizing two wake words:

This is a simple test which only requires pvporcupine, pyaudio, and the wake word .ppn file you get from Picovoice:

https://github.com/jimdinunzio/big-orange/blob/Python-3.9/python/tests/test_porcupine_wake_word.py
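The core loop is only a few lines.  Here's a stripped-down sketch (the access key and .ppn path are placeholders; the test script above is the full version with proper cleanup):

import struct
import pvporcupine
import pyaudio

# Placeholders: your Picovoice access key and the .ppn file you trained.
porcupine = pvporcupine.create(
    access_key="YOUR_ACCESS_KEY",
    keyword_paths=["your_wake_word.ppn"],
)

pa = pyaudio.PyAudio()
stream = pa.open(
    rate=porcupine.sample_rate,
    channels=1,
    format=pyaudio.paInt16,
    input=True,
    frames_per_buffer=porcupine.frame_length,
)

while True:
    # Read one frame of 16-bit samples and hand it to Porcupine.
    pcm = struct.unpack_from(
        "h" * porcupine.frame_length,
        stream.read(porcupine.frame_length),
    )
    if porcupine.process(pcm) >= 0:
        print("wake word detected!")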

 

As a career software guy, I’m a big fan of github and development records. All Big Orange code (and my other projects’ code) has been on github since 2020.

 

https://github.com/jimdinunzio/big-orange/

 

Jim


Sergei G

unread,
Jun 17, 2025, 11:27:41 AM6/17/25
to Alan Timm, RSSC-List, j...@dinunzio.com
One of the overlooked useful features of GitHub is the ability to create and edit formatted README.md files right from the browser.

You can organize your notes and share your finds with the world (well, with the Club ;-)) - for free. It is probably the most durable/reliable storage of documentation and code at the moment.

Here is my frequently updated collection: https://github.com/slgrobotics/robots_bringup/tree/main


Best Regards,
-- Sergei



Thomas Messerschmidt

unread,
Jun 17, 2025, 8:23:33 PM6/17/25
to j...@dinunzio.com, Alan Timm, RSSC-List
Thanks for sharing the links Jim. 


Thomas



Alan Timm

unread,
Jun 23, 2025, 12:08:32 AM6/23/25
to RSSC-List
Alfie's next upgrade: the Jetson Orin NX 16GB.

It's about the size of two Pis stacked on top of each other, and is capable of 100 TOPS with this carrier board.
It'll fit perfectly in the base once I move the buck converter.

Right out of the box it's generating LLM tokens at twice the speed of the Rock 5C with Ollama, which seems... a little slow.
I expected it to be a LOT faster.  18 vs 40 tokens per second isn't bad, but not really impressive for dedicated GPU hardware.

Plus there was a press release stating that these boards could generate tokens so much faster, but they don't say HOW.
I suspect they're using tensorrt-llm to run the models, so that's what I've been working on this weekend.

screenshot_22062025_205909.jpg

Nathan Lewis

unread,
Jun 23, 2025, 9:47:42 AM6/23/25
to RSSC-list
That's awesome! Does that board have a way to connect to the camera inputs on the Orin module?

- Nathan

Alan Timm

unread,
Jun 23, 2025, 10:45:10 PM6/23/25
to RSSC-List
Hey Nate! 

I took a closer look at the carrier board and the expansion board, and there are no camera inputs.  :-(

That's kind of a bummer that they didn't make the cut.

They very recently released a slightly larger version of the carrier board that supports the full Super MAXN modes.  This one includes 4x CSI camera connectors.

Alan

Alan Timm

unread,
Jun 24, 2025, 12:22:10 AM6/24/25
to RSSC-List
Oof, what an adventure.  Here's how to run accelerated LLMs on the Jetson (and also on NVIDIA GPUs).

tldr; 
mlc_llm chat HF://mlc-ai/Qwen3-1.7B-q4f16_1-MLC --device cuda
or...
mlc_llm serve HF://mlc-ai/Qwen3-1.7B-q4f16_1-MLC --device cuda

Firstly, in order to maintain sanity with the fast pace of changes in all the AI stuff, there's a meta-package called jetson-containers that dockerizes most of the things you'd want to do on the board.  Super handy if you're running Jetson hardware.

Secondly, from their press release I figured out that they're running LLMs under mlc-llm, which compiles language models into something that can run much faster than Ollama or Hugging Face Transformers.
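Once mlc_llm serve is up, it speaks an OpenAI-style chat API, so you can hit it from Python.  A rough sketch (the port is an assumption; use whatever serve prints at startup):

import requests

# mlc_llm serve exposes an OpenAI-compatible chat completions endpoint.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "HF://mlc-ai/Qwen3-1.7B-q4f16_1-MLC",
        "messages": [{"role": "user", "content": "Say hi to the RSSC list."}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])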

So here are the final stats for Qwen3-0.6B, 4-bit quantized:
  • Radxa Rock 5C (Ollama):        18 tokens per second
  • Jetson Orin NX 16GB (Ollama):  37 tokens per second
  • Jetson Orin NX 16GB (mlc-llm): 98 tokens per second
That's pretty good.

And here are a few more stats on the Orin for different sizes of the model under mlc:
  • Qwen3-0.6B: 98 tokens per second
  • Qwen3-1.7B: 50 tokens per second
  • Qwen3-4B:   22 tokens per second
  • Qwen3-8B:   13 tokens per second

Alan Timm

unread,
Jun 24, 2025, 11:39:15 PM6/24/25
to RSSC-List
Tonight I benchmarked a handful of Qwen3 models on my RTX 3090 and on my Jetson using Ollama and mlc in the background while working on other things.

I'd say that the performance improvement makes moving your SLMs to mlc_llm worthwhile.
I didn't expect there to be diminishing returns on larger models vs smaller models.  That's interesting.
The current gameplan is to host the model using mlc_llm serve, then interact with it using langgraph etc.


Proompt:   write a haiku about my third favorite mini slinkie

(Average of 3 runs)
(ollama models are unsloth Q4_0 quantized)
(mlc models are q4f16_1 quantized)

screenshot_24062025_203032.jpg

screenshot_24062025_202959.jpg

Alan Timm

unread,
Jul 5, 2025, 8:29:23 PM7/5/25
to RSSC-List
Ok, so this weekend is the weekend I integrate the new Jetson Orin NX into Alfie.  (I was getting frustrated bouncing back and forth between the two systems.)

Here's a quick shot of just how small the Jetson modules are (they're only 70mm wide), and another shot of the naked module with the carrier board.  This fits exactly where the previous stack of the Radxa Rock 5C and the 12V buck converter was.
Now I just need to design a new cubbyhole for the new 30A buck converter.

There's just enough IO for everything except for one thing -- there's no GPIO on the board.  Luckily I have a few spare GPIOs on the Waveshare driver board, so I'll move the shoulder limit switch over to that.
  • USB-A - OAK-D Lite depth camera
  • USB-A - Waveshare board 1 comms
  • USB-C - Seeed Studio mic array
  • USB 2.0 header - Waveshare board 2 comms
  • Debug serial - host communications over USB pigtail
  • Serial header - closed-loop stepper TTL serial comms at 115,000 baud

On the LLM front... after spending an inordinate amount of time optimizing the Qwen3 0.6B model for speed, I remembered one of the first things that Seth said: the 0.6B model isn't very useful.  So I've moved on to the Qwen3 1.7B model and am getting ~50 tokens per second with it.

screenshot_05072025_171323.jpg

screenshot_05072025_171357.jpg

Alan Timm

unread,
Oct 10, 2025, 3:48:05 PM10/10/25
to RSSC-List
Oof, it's been a minute, hasn't it.
With Dani's help Alfie has been reassembled and he's been online consistently for the past week or two.

Here's a look at the new electronics bay with that Jetson stuffed in.  It's a tight fit but it works.
screenshot_10102025_124202.jpg

I also reverse engineered the bottom plate and added in vents to help keep everything cool.
screenshot_10102025_124535.jpg

screenshot_10102025_124349.jpg

Alan Timm

unread,
Oct 12, 2025, 4:04:50 PM10/12/25
to RSSC-List
Hey guys, quick update since I wasn't able to stick around for show'n'tell this time.

The code for the waveshare general driver boards is about 80% complete.
VSCode + PlatformIO + FreeRTOS + MicroROS is kinda awesome once you get the hang of it.

At this point I have the boards:
  • generating full 100Hz status packets including diagnostic info
  • capturing critical serial bus servo stats for up to 10 servos
  • passing along 9-axis IMU info
  • accepting 100Hz command packets for the same
I just posted an update to the repo with all of the changes I've been working on.

And I know I'm repeating myself but "GET YOURSELF SOME GITHUB COPILOT!"
Even the free plan is incredibly useful.

I've been partnering with Copilot for everything from code refactors to helping me understand why my FreeRTOS + micro-ROS solution wasn't generating update messages at 100Hz like I thought it should.
It's like having an infinitely patient subject matter expert looking over your shoulder, ready to jump in and offer advice, explanations, and help when you need it.

Alan

screenshot_12102025_125703.jpg

Alan Timm

unread,
Oct 15, 2025, 10:54:03 PM10/15/25
to RSSC-List
Ok guys and gals I just gotta say...  Pair programming with Github Copilot is MAGICAL!

I've covered more ground in the past few days than I would have been able to over the next month, and that's IF I could have maintained focus long enough to deliver.
(That's questionable; I seem to have the attention span of a hyperactive ferret.)

The general driver board FreeRTOS + micro-ROS firmware is feature complete, and the UI I've developed to program and test the servos is now complete and working perfectly.

Now that this tool is done I can program the offsets for each of the servos for their home position and hard minimum and maximum limits.

Then to the fun stuff.  :-)


screenshot_15102025_194305.jpg

Alan Timm

unread,
Oct 24, 2025, 9:04:59 PM10/24/25
to RSSC-List
Ok, so houston?  we have a little problem...
(Thanks Dani for the video!)

So good news: he's been assembled and the framework code is all done and works 100% but...

But... we have an unforeseen consequence of some of my hardware design choices.
He got himself some crippling jigglies when executing any type of turn.  You can see it at a couple of points in the video.
The faster the turn, the harder he tries to jiggle himself over.  That's... not ideal.

I've tried everything and it appears to be a problem with the skid steer configuration while driving the outside wheels on both sides in combination with a tall bot.

So...  I'm going to try switching to mecanum this weekend and hope for the best.


Wish me luck!

Alan

Chris Albertson

unread,
Oct 25, 2025, 3:58:28 PM10/25/25
to Alan Timm, RSSC-List


Yes, steering with four fixed wheels is geometrically impossible; two of the wheels will have to slide sideways.  Mecanum wheels work, but then the robot can only run on a hard and level surface.  And all four-wheel platforms have the same problem: unless the floor is dead-perfect flat, only three wheels will be touching the floor at any time, unless there is compliance in the structure.  But compliance in such a tall structure means wobbles.  The ideal solution is what cars do, steerable wheels and suspension, but that is complex.  The cheap solution is unpowered casters for the rear wheels.

I think this is all OK if the goal is research into how to control a pair of arms and hands.

In the long run this robot will need a much more robust base with larger wheels.




Alan Timm

unread,
Oct 25, 2025, 8:35:48 PM10/25/25
to RSSC-List
Yep you're not wrong, I was actually ok with the wheel slip for now, it was just the jigglies that caught me off guard.

The mecanum wheels came in today, so I'm going to swap them in and move all 4 motors to 60rpm encoder gearmotors while I'm waiting for the 100rpm ones to arrive next month.

I'm about 50% done designing a Pi Pico mecanum driver board in KiCad with a pair of TB6612FNG 12V dual H-bridge drivers.  That'll replace one of my Waveshare general driver boards and allow the Pico to do all the calculations for twist messages for all 4 wheels at 500Hz, and maybe even take into account the IMU readings from the other board.

Believe it or not, I'm actually buddying up with Claude to get the PCB design best practices right for a 4-layer board.


screenshot_25102025_173327.jpg

Alan Timm

unread,
Oct 26, 2025, 1:11:20 PM10/26/25
to RSSC-List
Ok, I'm not sure why it worked, but it did, and now alfie can turn!

After a little bit of hardware finessing alfie has a new pair of shoes.

So now motors with encoders all around, with PID loops and a few other tricks for a pure closed-loop velocity drive system.
The base now accepts standard twist messages and translates them into the required velocities for all four wheels at 100Hz.
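For reference, the underlying math is tiny.  Here's a rough sketch of the twist-to-wheel-speed conversion for a standard mecanum X configuration (the geometry numbers are placeholders, not Alfie's actual dimensions):

def mecanum_wheel_speeds(vx, vy, wz, wheel_radius=0.04, half_length=0.10, half_width=0.10):
    # Convert a body twist (m/s, m/s, rad/s) into wheel angular velocities (rad/s).
    # Wheel order: front-left, front-right, rear-left, rear-right (X-config rollers).
    k = half_length + half_width
    return [
        (vx - vy - k * wz) / wheel_radius,  # front left
        (vx + vy + k * wz) / wheel_radius,  # front right
        (vx + vy - k * wz) / wheel_radius,  # rear left
        (vx - vy + k * wz) / wheel_radius,  # rear right
    ]

# Example: rotate in place at 0.5 rad/s.
print(mecanum_wheel_speeds(0.0, 0.0, 0.5))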

I'll post an updated video later this afternoon with all the fun things he can do now with his new wheels.



screenshot_26102025_100458.jpg

Chris Albertson

unread,
Oct 26, 2025, 6:21:47 PM10/26/25
to Alan Timm, RSSC-List

With an omnidirectional base you can do even better.  The usual "twist" messages only have X as a velocity; now you can populate the Y field as well.  Not only that, in addition to "twist" the robot should accept "cmd_pose" messages with the Z (or "yaw") position populated.

I built a four-legged robot and found I could fully populate both cmd_vel and cmd_pose, although a dog-bot only has a few cm of vertical travel (from squatting or stretching its legs).

It might seem like just a dumb trick, but with cmd_pose you can do things like spin on the z-axis while driving a straight line, and for a kitchen robot this would be useful.  Let's say the sink is across the room from the stove and the robot wants to move from the sink to the stove.  Most efficient is to drive in a straight line while doing a 180 degree spin.  I think this is what humans do.  But your older differential-drive robot would have to drive a complex path with two arcs, almost as bad as parallel parking a car.
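(In ROS 2 terms that move is just a Twist with linear.x and angular.z populated at the same time.  A minimal rclpy sketch, assuming the topic is called cmd_vel; in practice you would publish it periodically from a timer:)

import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist

# Drive forward while spinning: the "straight line plus 180" move.
rclpy.init()
node = Node("spin_while_driving")
pub = node.create_publisher(Twist, "cmd_vel", 10)

msg = Twist()
msg.linear.x = 0.3   # m/s forward
msg.angular.z = 0.6  # rad/s yaw; a holonomic base could add msg.linear.y too
pub.publish(msg)

rclpy.spin_once(node, timeout_sec=0.1)
node.destroy_node()
rclpy.shutdown()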

So I think not only did you solve the skidding problem, but you may have seriously simplified motion planning.

OK, maybe not simplified, because now there are an infinite number of ways to do every move.  I watched my real dog and she sometimes decides to place the center of rotation between her back legs and sometimes places it to the left or right of her front shoulders.  Rarely does she spin around her center of gravity unless she is moving fast.

Being tall, your 'bot might want to minimize accelerations on X or Y while minimizing drive time.  If so it will do a lot of those spin-while-driving-straight moves.

Holonomic bases are a lot like walking, so what you learn will transfer well.





Alan Timm

unread,
Oct 26, 2025, 10:20:28 PM10/26/25
to RSSC-List
Aw yeah baby!  Alfie's new pair of shoes fit him just fine!

Here's a quick video update showing the capabilities of the new mecanum drivetrain.  Hey Chris, I'll take a look at that.  More control options are always better.  I was excited to get the twist messages working last night, and twist + pose sounds even more fun.  :-)

Right now the back wheels are controlled by one board, the front wheels by the other.  Both boards accept velocity commands, which are executed closed loop by the boards using encoders.
Then a higher-level node sends real-time commands to each of the boards for now.  Eventually I'll finish that Pico driver board, and then the Pico can handle all the calcs at the same time.

Now I'm reworking the back mechanics, moving away from the Stepper motor to a gearmotor with encoder so I can have more strength and control over the shoulder position.

Thomas Messerschmidt

unread,
Oct 26, 2025, 11:26:29 PM10/26/25
to Alan Timm, RSSC-List
Looks great Alan! Very smooth. 


Thomas Messerschmidt

-  

Need something prototyped, built or coded? I’ve been building prototypes for companies for 15 years. I am now incorporating generative AI into products.

Contact me directly or through LinkedIn:   





Allan Lopez

unread,
Oct 27, 2025, 2:25:00 AM10/27/25
to RSSC-List
Hi everyone,

I am interested in joining the club, but I live far away in Los Angeles.  Are you having meetings in person on Saturdays?
I am a beginner and am hoping to learn to build robots!

Thank you,
Allan Lopez

Alan Timm

unread,
Oct 27, 2025, 7:07:49 PM10/27/25
to RSSC-List
Hey Allan,

Alan here.  Nice to meet ya!

We meet the Second Saturday of (almost) every month at Cal State Long Beach.
We also have a really nice hybrid meeting setup, so you're always welcome to join over zoom if you can't make it in person.
Keep an eye out on the forums as well as our website https://rssc.org for details on our upcoming meetups.

See you next month!

Alan

Alan Timm

unread,
Oct 29, 2025, 11:02:36 PM10/29/25
to RSSC-List
I was taking a closer look at the soft gripper variant of XLeRobot and it got me thinking about what I could use to update the hands for Alfie.
  • soft compliant gripper
  • hand camera (I thought they were on their way out, then I saw them again on the figure robot and xlerobot)
  • force sensors on gripper to estimate grip force (the servos are highly geared, so current draw can't be used)

Here's where I'm exploring the concepts in Onshape (while I'm redesigning the arms and doing a bunch of stuff other than starting on "operation high five").

The compliant grippers are printed in TPU95, so I faithfully recreated their gripper finger design to see how well it works.  They print in TPU, then use grip tape.
There are these adorable 640x480 color camera modules that I'm thinking of placing directly in the middle of the back of the hand, then using a pair of those cheap force sensors to estimate grip strength, if I can get them to work with compliant fingers.

(And I'm trying really hard not to be distracted by that adorable AmazingHand design by Pollen Robotics.)

screenshot_29102025_195654.jpg

Chris Albertson

unread,
Oct 29, 2025, 11:58:21 PM10/29/25
to Alan Timm, RSSC-List
My experiments with two-finger grippers told me they fail because they only make two points of contact, and the object being grasped will rotate.  You need at least three contact points.     In theory, if one finger is curved, it could touch the object at two points, but that only works in special cases.  You can not beat the Yale Hands.

Another experiment I did says that, yes, you CAN measure current to detect force.  For this to work, you need some compliance in the system, and it seems that with the TPU, you have this.  I tried mounting the servo in rubber grommets.  What you need is for the motor to continue to move as it presses harder and not come to a quick and hard stop.  Any rubber in the system does this.  Current works really well as a force sensor.  A typical servo moves in air with very little current, and then as it meets resistance, the current keeps climbing until the little PCB in the servo burns up.  (Yes, you will burn up several servos testing this.  Usually the MOSFETs act like a fuse and blow.)  But the rubber in the system makes the current ramp up more slowly, so software has time to work.

Failing that, resistive force sensing might work.  I bought a few and was going to use them as ground force sensors in the robot’s feet.   https://shop.ncoa.org/best-medical-alert-systems-nb-3
You do not have to place them over the fingers; they should be embedded in the rubber part.

BTW, I met a person a couple of weeks ago who is using this sensor in a running shoe insole to measure foot contact when people run.   Using this data, they make custom insoles.  It is just a voltage divider.


I decided to keep it simple: Make a hole, more like a tunnel, in the rubber part and put a light through it.   When the rubber bends, the tunnel bends and blocks the light.




Alan Timm

unread,
Nov 2, 2025, 12:10:31 PM11/2/25
to RSSC-List
Ok, let's talk about hands.

While I keep finding excuses to work on hardware instead of, you know, actually getting Alfie to do anything, I'm looking at what his hands should be.

tldr; I'm seriously considering adapting the Pollen Robotics AmazingHand design for Alfie while incorporating a wrist camera and force feedback.
In theory you'd have better grips and capability, but I don't know that the RL training and data capture for it is ready to do it right.

The current grippers would work OK, as long as I added grip tape or something.
screenshot_02112025_085633.jpg

I've started looking at doing a compliant TPU-style gripper like XLeRobot's, which would also work pretty well.
screenshot_02112025_085529.jpg

But...  Man those AmazingHands look nice.  I hacked a rough draft of a smaller version with 3 shorter fingers and a smaller palm, and it looks like it would be a good fit for Alfie.
screenshot_02112025_085819.jpg


Alfie v2 arm draft with 3:1 reduction and 3 finger amazing hand
screenshot_02112025_085856.jpg

Chris Albertson

unread,
Nov 2, 2025, 2:06:03 PM11/2/25
to Alan Timm, RSSC-List



These are the best I've found.  I actually built two others, and I can eliminate both.

1) A pincher hand.  The problem with two fingers is that the object rotates unless your software knows how to find the object’s center of mass and can place the contact points so that a line between them passes through the center of mass.   Without this, the object will rotate when it is lifted.

2) A humanoid hand.  The "Brunel hand" was open source and is pretty good, and looks good; it was designed as a prosthetic device.  But it is hard to control unless you have a human brain.  Amputees seem to be able to use it, but I've yet to see a robot do as well with it.

The above are limiting cases, the simplest and the most human-like.  Neither work well for robots.    I think what is needed is a middle ground, and I think the folks at Yale have a family of mid-ground hands.   You can pick one, and then they have options.  It is all open source and 3D printable.





With two points of contact, what you must calculate is rotational torque.  Ideally, this is zero (with zero moment arm length).  Then you estimate the force and the friction and see if it is good enough to get lift and stop rotation.  Pinchers are bad because the answer is "no" so many times.  But if the task is just software development, then you can lift Styrofoam blocks with pinchers.  The hard tasks are (1) empty soda cans versus full cans, getting the hand to do both, (2) picking up a dime off the table, which is a classically hard task, and (3) pouring milk from a jug, because you need to rotate a jug whose mass is very off-center.

Yale's hands are compliant and wrap around the object to create a geometric lock.  They don't depend on friction.  They work like human hands even if they do not look human.

Alan Timm

unread,
Nov 2, 2025, 11:08:45 PM11/2/25
to RSSC-List
Aw man, so many updates this weekend.  Thanks again Dani for your help on Friday!  Replacing the stepper drive for the back with a gearmotor solution is working out pretty well.

screenshot_02112025_200610.jpgscreenshot_02112025_200746.jpg

Among other things I worked out the remaining oopsies with servo control, and whipped up a quick left/right arm slave program to test everything out.  I need to tune the acceleration and max speed values but so far so good.

Alan Timm

unread,
Nov 9, 2025, 4:07:44 PM11/9/25
to RSSC-List
Quick update.  The 110rpm mecanum drive motors are installed, and after updating the control PIDs and doing some tuning exploration with Claude I think we're at a good starting point.

He scoots faster now, but because of the increased speed there's increased slippage on the flooring.  And for some reason he's not rotating around the center of the base anymore, which I think is new.
All 4 wheels are independently controlled with closed-loop PID controllers, with active feedback from 22 PPR Hall effect encoders.  They accept velocity values in meters per second.
They in turn receive their orders from a higher-level controller that accepts twist commands and converts them to individual wheel speeds.
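Conceptually each wheel loop is just a small PID on commanded vs. measured velocity.  A rough Python sketch of the idea (the real thing runs in firmware on the driver boards, and the gains here are placeholders):

class WheelPID:
    # Minimal velocity PID: setpoint and measurement in m/s, output is a motor command.

    def __init__(self, kp=1.0, ki=0.5, kd=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, target_mps, measured_mps, dt):
        error = target_mps - measured_mps
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt if dt > 0 else 0.0
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: one 100Hz update step for a wheel commanded to 0.4 m/s.
pid = WheelPID()
print(pid.update(target_mps=0.4, measured_mps=0.35, dt=0.01))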

Alan Timm

unread,
Nov 29, 2025, 12:25:45 AM11/29/25
to RSSC-List
So it's been a busy week.  After seeing that all the cool kids are programming their humanoids with VR headsets, a quick impulse buy later... and after a few hours of coding (thanks Dani!) I have the head and main camera hooked up.

The first step is to integrate the headset so Alfie looks where you look and you can see what Alfie sees.

The XLeRobot folks have a lot of code ready that makes this integration a lot easier than it would otherwise have been.

screenshot_28112025_212247.jpg

Alan Timm

unread,
Dec 23, 2025, 1:22:42 AM12/23/25
to RSSC-List

A quick video from the Quest 3 showing what teleoperation looks like through the headset.  Apologies for the quality: the in-headset capture is capped at 1024 pixels wide; the actual experience is at a much higher resolution on-device.  I've been able to integrate some data into the hub, and since I replaced the OAK-D Lite with a true stereo camera setup he's a lot easier to control.

Oh, did I mention that the view through the headset is also in true 3D?  :-)

I think he's about ready to capture some training datasets for the contest in February.

Alan Timm

unread,
Dec 30, 2025, 5:15:25 PM12/30/25
to RSSC-List
Ok I think I'm done futzing with the VR teleoperation setup and ready to start capturing episodes for the Can-Do challenge.
  • (for the contest) grip strength is limited to 20% so less can-crushy-crushy
  • critical stats are visible on the left panes
  • ROS logs are visible along the bottom
  • bearing is shown on top to remind me what direction I'm facing
  • ROS 2 URDF and TF view of the robot on the right pane, in two views, to understand the arm positions
  • implemented a foveated video pipeline with 4 feeds, 320x240 @ 30fps [left_wide, right_wide, left_center, right_center] (rough sketch after this list)
  • the center video feeds are narrow but at ~2.5 times the resolution to make it easier to see things farther away, like a coke can on the floor :-)
  • the center feeds are superimposed in the immersive view in a black rectangle
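Roughly, each center feed is just a crop of the middle of the wide frame scaled back up before encoding.  A sketch of the idea with OpenCV (the crop fraction, sizes, and camera index are placeholders, not my actual pipeline):

import cv2

def foveate(frame, crop_frac=0.4, out_size=(320, 240)):
    # Return (wide, center): the full frame downscaled, plus a center crop
    # resized to the same output size so distant objects get more pixels.
    h, w = frame.shape[:2]
    cw, ch = int(w * crop_frac), int(h * crop_frac)
    x0, y0 = (w - cw) // 2, (h - ch) // 2
    wide = cv2.resize(frame, out_size)
    center = cv2.resize(frame[y0:y0 + ch, x0:x0 + cw], out_size)
    return wide, center

cap = cv2.VideoCapture(0)  # placeholder camera index
ok, frame = cap.read()
if ok:
    wide, center = foveate(frame)
    cv2.imwrite("wide.jpg", wide)
    cv2.imwrite("center.jpg", center)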

For now my gr00t code is in this branch:


screenshot_30122025_140225.jpg
screenshot_30122025_135920.jpg




Alan Timm

unread,
Jan 1, 2026, 1:36:14 AMJan 1
to RSSC-List
Hey there fellow Can-Do-Ers!

I've been making a lot of progress with the gr00t n1.6 stuff.  I have automated scripts to capture examples to ROS 2 bags and a script that converts the recordings into the correct lerobot/gr00t format.

I've uploaded 31 sample episodes to Hugging Face to be able to visualize them.

31 down, several hundred to go...

Happy New Year!
screenshot_31122025_223141.jpg

Alan Timm

unread,
Jan 20, 2026, 9:56:08 PMJan 20
to RSSC-List
Oof, so after working through a bunch of challenges I'm finally running a fine-tune for gr00t n1.6!

It's an early fine-tune with only around 70 samples of 2 different behaviors, but it's a start.
{"task_index": 0, "task": "find the can and pick it up"}
{"task_index": 1, "task": "put the can down"}

Among other things, I ended up having to make the changes below (thanks Claude!):
- turn batch size down to 16
- switch the optimizer from adamw_torch to adafactor
- enable gradient_checkpointing
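Roughly what those knobs look like if you express them as Hugging Face TrainingArguments; this is illustrative only, since the gr00t fine-tune script has its own config format:

from transformers import TrainingArguments

# Illustrative only: the three changes that got the fine-tune to fit in VRAM.
args = TrainingArguments(
    output_dir="alfie_groot_finetune",
    per_device_train_batch_size=16,   # turned down from the default
    optim="adafactor",                # swapped in for adamw_torch
    gradient_checkpointing=True,      # trade compute for memory
)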

If all goes according to plan I should have a fine tuned gr00t with a NEW_EMBODIMENT to play around with tomorrow.

screenshot_20012026_183355.jpg

Alan Timm

unread,
Feb 3, 2026, 9:36:56 PMFeb 3
to RSSC-List
Oh my goodness what an exciting couple of weeks!

With Dani's help we've collected about 100 examples of the following two behaviors:
  • find the soda can and pick it up
  • put the soda can down
And...  The trained model works!  Well, not on the robot yet.  But here's an open-loop eval graphic that demonstrates that the model reproduces the desired actions against episode 0.
The evaluation script replays an episode with the recorded video and joint positions, feeds it to the model, and records the model output.

It shows that the model's commanded actions (orange) match the original episode's recorded actions (blue) for that episode.
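The plotting side of that eval is nothing fancy.  A rough sketch of the idea, with random placeholder arrays standing in for the recorded episode and the model output:

import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: in the real script these come from the recorded episode
# and from replaying it through the fine-tuned policy.
timesteps, joints = 200, 6
recorded = np.cumsum(np.random.randn(timesteps, joints) * 0.01, axis=0)
predicted = recorded + np.random.randn(timesteps, joints) * 0.02

fig, axes = plt.subplots(joints, 1, figsize=(8, 12), sharex=True)
for j, ax in enumerate(axes):
    ax.plot(recorded[:, j], color="tab:blue", label="recorded")
    ax.plot(predicted[:, j], color="tab:orange", label="model")
    ax.set_ylabel(f"joint {j}")
axes[0].legend()
axes[-1].set_xlabel("timestep")
fig.savefig("traj_eval.jpeg")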

Next steps:  get the policy plugged into alfie's ros2 framework.

Exciting times!

traj_0.jpeg



Carl

unread,
Feb 5, 2026, 5:48:14 PM (13 days ago) Feb 5
to RSSC-List
Very cool - looking forward to seeing it!  I guess the future of programming is no code - just training CNNs :-)

Alan Timm

unread,
Feb 5, 2026, 11:24:59 PM (13 days ago) Feb 5
to RSSC-List
I honestly believe that's where we're going to end up.  As the models get better we're going to be removing the previous code "scaffolding" required to do tasks.

And around that same time, AI will continue to write code on its own to accomplish tasks in real time as needed.

2026 is already shaping up to be a wild ride, and we're just getting started.  :-)

Alan

Sergei G

unread,
Feb 6, 2026, 7:33:34 PM (12 days ago) Feb 6
to Alan Timm, RSSC-List
- or,  How I Learned to Stop Worrying about LLMs and Love the ROS2 ;-)

Alan,

You brought up an interesting point in "[RSSC-List] Alfie build thread" - and I think it deserves a closer look.

If we follow this line of thought, and ask ourselves - what will my hobby robot programming look like five years from now?

As AI moves towards embodiment, those tools and frameworks will come to us - packaged, ready to use, some even open source.  We will plug them in and watch the magic happen, the same way we use LLMs today.

Remember writing your own mapping, localization, control algorithms? Robotics framework maybe? Computing transformations? Well, you might still be doing it, but those who met ROS don't - while creating quite complex robots with very little "glue" code (i.e. launch files, config, URDFs).

How do we use LLMs now?  Hook up a microphone and loudspeaker, write five lines of Python, and you've got an all-knowing conversationalist ready to move your Roomba two feet forward.  Mostly, though, blind and clueless about its surroundings, even if it can see and identify them.

What is coming with World Models?  Build a body, add five lines of Python, and you've got an all-capable beast with intuitive movement planning and orientation skills in 3D.

While LLMs aren't really built for robotics and are, honestly, a crutch, WMs will come as a perfect fit. Huge power in a nice framework bottle.

What is there for us hobbyists to do?  My answer is: prepare.  World models will need bodies, just as LLMs need audio tools.  Do you have to go all in and build a humanoid?  A room super-performer?  Well, not me.  I build a code base - and learn.

ROS2 makes a great set of shoulders to stand on.  If you are fluent in ROS2, understand its components and interfaces, are using a solid code base, and have a well-built floor bot capable of carrying some extra compute power and well connected over WiFi - you've made a leap into that new world.  You've got the BODY for embodiment.  All that's left is a year or two of watching the WMs boil and five lines of Python code.

And yes, we're just getting started...

For info on the topic click here and here (for AI generated content).

Best Regards,
-- Sergei


Alan Timm

unread,
Feb 6, 2026, 10:04:19 PM (12 days ago) Feb 6
to RSSC-List
Hey Sergei!

100%.  It's sobering how quickly seemingly intractable problems that we've spent days/weeks/years working on have been more or less solved over the past few years.  

So far the only thing my mind can come up with is a future of robotics builder whimsy:  the hardware is so cheap and available, and intelligence is so cheap and available that anyone with some time and interest can slap together an intelligent machine limited only by their imagination.

The other, more recent revelation I had is that as it becomes less necessary for people to write code to do things, AI will continue to take up the slack and will write code in support of its own goals.  It's *almost* happening now.

Alan

Alan Timm

unread,
Feb 6, 2026, 10:10:01 PM (12 days ago) Feb 6
to RSSC-List
Tonight I got impatient waiting for my next fine tune to complete, so I tried out RunPod and rented a 4x H200 SXM monster.
RunPod lets you rent GPU servers by the hour to do interesting things.
The fine tune should complete in about 45 minutes.  Well worth the $15.

I was getting some weird responses in gr00t inference so Claude Opus 4.6 suggested I change some things in the alfiebot_config and run a fresh fine tune.

Now I can try out a few things before calling it a night.

GPUs go brrrrrrrrrrrrrr!

screenshot_06022026_185344.jpg

screenshot_06022026_184850.jpg

On Thursday, February 5, 2026 at 2:48:14 PM UTC-8 Carl wrote:
Very cool - looking forward to seeing it!  I guess the future of programming is no code - just training CNNs :-)

Alan Timm

unread,
Feb 15, 2026, 1:17:32 PM (4 days ago) Feb 15
to RSSC-List
Good morning!

Yesterday I was joking about how NVIDIA's own documentation on generating a TRT engine for gr00t produces inference results that are 100x worse (actually not a joke, 100% true).

TensorRT compiles a model into optimized GPU kernels so it executes faster than running the same model in PyTorch.
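For reference, the core of that compile step is small. Here's a minimal sketch using the TensorRT Python API to parse an ONNX export and build an FP16 engine (file names are placeholders, exact flags vary by TensorRT version, and the actual workflow uses the build_tensorrt_engine.py commands further down):

import tensorrt as trt

# Minimal sketch: ONNX file -> serialized FP16 TensorRT engine (placeholder paths).
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("backbone_model.onnx", "rb") as f:      # placeholder ONNX export
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)             # FP16 kernels (see quality caveats below)

engine = builder.build_serialized_network(network, config)
with open("backbone_fp16.trt", "wb") as f:        # placeholder engine path
    f.write(engine)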

When I got home yesterday I started a Claude Code session to analyze the end-to-end TensorRT compilation workflow and fix it, generating benchmarks along the way to confirm the improvements.

This resulted in a 2 hour autonomous long-horizon working session where my only involvement was to answer a few clarifying questions and to click "ok" every once in a while.  Claude created and executed the gameplan, created and ran tests, updated code and even used the gr00t-dev docker container to run everything.

Result:  inference goes from 2.7hz to 4.3hz with no loss in quality.

Now we're working on converting the other half of the model to INT8 which should bring us to around 6hz.

The future is closer than you think.

screenshot_15022026_101150.jpg

Thomas Messerschmidt

unread,
Feb 15, 2026, 3:29:29 PM (3 days ago) Feb 15
to Alan Timm, RSSC-List
"...2 hour autonomous long-horizon working session where my only involvement was to answer a few clarifying questions and to click "ok" every once in a while.  Claude created and executed the gameplan, created and ran tests, updated code and even used the gr00t-dev docker container to run everything."

WOW! Claude is getting to be an even better assistant/agent when it comes to coding.


Alan Timm

unread,
Feb 15, 2026, 9:30:11 PM (3 days ago) Feb 15
to RSSC-List
So after a few more hours of analysis and optimizations here are the final results on my Orin hardware.

From 2.2hz to 5.4hz with no loss in quality.  Not bad for a few hours of work, and with the pipeline documented I can reproduce the improvements after every fine tune.

I needed to be a lot more involved in this second round of analysis and testing, but based on memory bandwidth this is as fast as the model is capable of (until I try another fine tune with an action horizon of 4 instead of 16)  :-)

Here's the analysis doc we were using to capture the gameplan and results.


Backbone TensorRT Optimization: Findings & Next Steps

Current Best Config: 5.4 Hz (was 2.2 Hz)

torch.compile(default) backbone + TRT FP16 DiT + 2-step denoising = 186ms avg E2E (~5.4 Hz)

Previous bests: 240ms (4.2 Hz, 4-step denoising) → 226ms (4.4 Hz, with async prefetch) → 186ms (5.4 Hz, 2-step denoising)

Experimental Results

Backbone Quality (cos_sim vs PyTorch BF16 flash reference)

Stage                     MSE        Cos Sim    Median Latency   Notes
A: PyTorch BF16 flash     baseline   baseline   157ms            Reference — uses flash_attention_2
B: PyTorch FP32 eager     5.44       0.9111     194ms            flash→eager attention swap destroys quality
C: ONNX Runtime (SDPA)    0.091      0.999      565ms            SDPA attention solves quality problem
D: TRT FP32 (SDPA)        0.093      0.999      350ms            Lossless TRT compilation
E: TRT FP16 (SDPA)        27.41      0.354      149ms            FP16 destroys quality
F: TRT INT8 (SDPA)        27.42      0.353      141ms            INT8 ≈ FP16 quality, 5% faster
torch.compile + SDPA      0.117      0.998      191ms            ~22% slower than flash, stays in PyTorch

Key Findings

  1. SDPA attention solves the ONNX export quality problem. Eager attention (cos_sim=0.914) was the original bottleneck preventing backbone TRT. SDPA (cos_sim=0.999) is near-identical to flash. (A rough export sketch follows this list.)

  2. FP32 TRT compilation is lossless — only 0.002 MSE increment from ONNX→TRT FP32. The ONNX trace is faithful.

  3. FP16 TRT destroys quality (cos_sim=0.349). The Eagle backbone's internal values overflow FP16's 5-bit exponent range (±65504). BF16 has 8 exponent bits (±3.4×10³⁸) and the model relies on this range. On Orin SM87, BF16 tensor cores are not natively supported — TRT falls back to FP16.

  4. FP16 TRT is actually 5% faster than PyTorch flash (149ms vs 157ms) — tensor cores help, but quality is unusable.

  5. FP32 TRT is 2x slower than PyTorch flash (350ms vs 157ms) — no tensor core benefit in FP32 on SM87.

  6. INT8 TRT builds successfully after reboot (31 min build, 2.96GB engine). Requires fresh memory — previously OOM'd before reboot.

  7. INT8 adds virtually zero error on top of FP16. INT8 vs FP16 incremental: MSE=0.014, cos_sim=0.999. The quality bottleneck is entirely FP16 precision, not INT8 quantization.

  8. INT8 is 5.4% faster than FP16 (141ms vs 149ms) — marginal gain since the model is memory-bandwidth bound.

  9. torch.compile + SDPA gives 191ms (22% slower than flash) with cos_sim=0.998. Best non-flash option that stays in PyTorch.
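As referenced in finding 1, the SDPA export path boils down to loading the backbone with SDPA attention and tracing it through torch.onnx.export. A rough sketch of the idea (the checkpoint path, input shape, and opset here are placeholders, not the actual export_backbone_onnx.py code):

import torch
from transformers import AutoModel

# Sketch: load with SDPA attention (traceable), export to ONNX in FP32.
model = AutoModel.from_pretrained(
    "alfie-gr00t/checkpoint-10000",     # placeholder checkpoint path
    attn_implementation="sdpa",         # instead of flash_attention_2, which can't be traced
    torch_dtype=torch.float32,          # FP32 export preserves dynamic range
).eval()

dummy = torch.randn(1, 3, 224, 224)     # hypothetical pixel_values shape
torch.onnx.export(
    model, (dummy,), "backbone_model.onnx",
    input_names=["pixel_values"], output_names=["features"],
    dynamic_axes={"pixel_values": {0: "batch"}},
    opset_version=17,
)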

Latency vs Quality Summary

Quality (cos_sim)  1.0 |  A*----D                    C
                       |       \                    /
                   0.9 |        \                  /
                       |         \                / torch.compile+SDPA
                       |          \              /
                   0.5 |           \
                       |            E---F
                   0.3 |
                       +---+---+---+---+---+---+---+
                       100 150 200 250 300 350 400   ms

A = PyTorch flash (157ms, baseline)   D = TRT FP32 (350ms, 0.999)
E = TRT FP16 (149ms, 0.354)           F = TRT INT8 (141ms, 0.353)
C = ONNX Runtime (565ms, 0.999)       torch.compile+SDPA (191ms, 0.998)

The speed-quality gap: No TRT config achieves both good quality AND speed improvement. The only fast options (FP16/INT8) have destroyed quality. The only high-quality options (FP32/ONNX) are slower than PyTorch flash.

Root Cause: BF16 vs FP16 Dynamic Range

The Eagle backbone (Eagle-Block2A-2B-v2) produces intermediate values that require BF16's wider exponent range. The SigLIP2 vision encoder and Qwen2 language model both have activations and attention scores that exceed FP16's ±65504 range. This is a fundamental model property — not fixable by output buffer dtype or accumulation fixes.
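The range problem is easy to reproduce in isolation with a couple of lines of PyTorch (no GPU needed); magnitudes that a BF16 model produces routinely simply don't exist in FP16:

import torch

# FP16: 5-bit exponent, max finite value ~65504.  BF16: shares FP32's 8-bit exponent.
x = torch.tensor([7.0e4, 1.0e6])      # magnitudes like large activations / attention logits
print(x.to(torch.float16))            # both overflow to inf
print(x.to(torch.bfloat16))           # both stay finite, just with coarser precision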

Why TRT FP16/INT8 Aren't Much Faster Than PyTorch Flash

The 149ms (FP16) and 141ms (INT8) results are only 5-10% faster than PyTorch flash's 157ms. This is far less than the 2-4x speedup typically expected from TRT quantization. Three factors explain this:

1. Memory-bandwidth bound at batch=1. On Orin AGX, the ~2B parameter Eagle backbone is bottlenecked by LPDDR5 bandwidth (~205 GB/s theoretical, ~130 GB/s real), not compute. At batch=1, the GPU spends most of its time loading weights from DRAM, not doing math. The FP32→FP16 ratio confirms this: 350ms vs 149ms = 2.35x, almost exactly the 2x expected from halving memory traffic. BF16 and FP16 are both 16-bit — they move the same bytes through memory. So FP16 TRT can't be faster than BF16 PyTorch on bandwidth alone. (See the back-of-envelope sketch below.)

2. Flash attention is an algorithmic advantage TRT can't replicate. Flash attention never materializes the N×N attention matrix (O(N) memory vs O(N²) for SDPA). The ONNX export path uses SDPA since flash can't be traced. Even with TRT's kernel fusion and tensor cores, it can't match flash attention's fundamentally fewer memory round-trips. The two roughly cancel:

TRT FP16 advantages:             Flash attention advantages:
  + Kernel fusion                  + O(N) memory (vs O(N²) SDPA)
  + FP16 tensor cores              + Fewer total memory round-trips
  + Graph-level optimization       + Hand-tuned CUDA kernel for this workload
  ≈ 149ms                          ≈ 157ms   (roughly a wash)

3. INT8 has limited memory savings in practice. INT8 should halve memory traffic vs FP16, but TRT INT8 only uses INT8 for weights — activations stay FP16. Many layers (LayerNorm, Softmax, embeddings, attention) fall back to FP16 entirely. Net memory reduction is ~30%, not 50%, yielding only 5.4% speedup (141ms vs 149ms).

Bottom line: PyTorch BF16 with flash attention is already operating near the memory-bandwidth ceiling for this model at batch=1 on Orin AGX. TRT can't meaningfully beat it because the bottleneck is DRAM bandwidth, not kernel efficiency — and flash attention's algorithmic memory savings offset TRT's fusion benefits.
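Here's the back-of-envelope version of point 1 above, using the approximate figures quoted there. It's a floor for weight traffic only and ignores activations, attention buffers, and the DiT, which is why the measured 149-157ms sits well above it:

# Rough lower bound from weight traffic alone (approximate figures from above).
params = 2.0e9            # ~2B-parameter Eagle backbone
bytes_per_param = 2       # BF16/FP16 weights
bandwidth = 130e9         # ~130 GB/s real LPDDR5 bandwidth on Orin AGX

floor_ms = params * bytes_per_param / bandwidth * 1e3
print(f"weight-traffic floor: {floor_ms:.0f} ms per forward pass")   # ~31 ms

# Halving the bytes moved (FP32 -> FP16) roughly halves latency (350ms -> 149ms),
# but FP16 and BF16 move the same bytes, so there's nothing left to win there.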

Artifacts

Path                                                  Description
groot_n1d6_onnx_sdpa_fp32/backbone_model.onnx         SDPA FP32 ONNX export (high quality)
groot_n1d6_onnx_sdpa_fp32/backbone_fp32_agx.trt       FP32 TRT engine, 5.9GB, cos_sim=0.999, 350ms
groot_n1d6_onnx_sdpa_fp32/backbone_fp16_agx.trt       FP16 TRT engine, 3.1GB, cos_sim=0.354, 149ms
groot_n1d6_onnx_sdpa_fp32/backbone_int8_agx.trt       INT8 TRT engine, 3.1GB, cos_sim=0.353, 141ms
calibration_data_backbone/calib_data.npz              100 samples backbone calibration data
calibration_data_backbone/backbone_int8_calib.cache   INT8 calibration cache (reuse for rebuilds)

PyTorch Optimization Commands (inside Docker)

# Baseline: PyTorch BF16 flash backbone + TRT FP16 DiT
python scripts/deployment/standalone_inference_script.py \
    --model-path alfie-gr00t/checkpoint-10000 \
    --dataset-path alfiebot.CanDoChallenge \
    --embodiment-tag NEW_EMBODIMENT \
    --inference-mode tensorrt \
    --trt-engine-path groot_n1d6_onnx/dit_fp16.trt \
    --traj-ids 0 1 2 --steps 200 --denoising-steps 4 --action-horizon 16 --seed 42

# Best config: torch.compile + pipeline parallelism
python scripts/deployment/standalone_inference_script.py \
    --model-path alfie-gr00t/checkpoint-10000 \
    --dataset-path alfiebot.CanDoChallenge \
    --embodiment-tag NEW_EMBODIMENT \
    --inference-mode tensorrt \
    --trt-engine-path groot_n1d6_onnx/dit_fp16.trt \
    --compile-backbone \
    --compile-backbone-mode default \
    --pipeline-backbone-dit \
    --traj-ids 0 1 2 \
    --steps 200 \
    --denoising-steps 2 \
    --action-horizon 4 \
    --seed 42

# Open loop eval with timing
python gr00t/eval/open_loop_eval.py \
    --dataset-path alfiebot.CanDoChallenge \
    --embodiment-tag NEW_EMBODIMENT \
    --model-path alfie-gr00t/checkpoint-10000 \
    --inference-mode tensorrt \
    --trt-engine-path groot_n1d6_onnx/dit_fp16.trt \
    --compile-backbone \
    --compile-backbone-mode default \
    --traj-ids 0 --action-horizon 16 --denoising-steps 4 \
    --save-plot-path ./episode000_optimized.png

TRT Engine Build Commands (inside Docker)

# SDPA ONNX export
python scripts/deployment/export_backbone_onnx.py \
    --model_path alfie-gr00t/checkpoint-10000 \
    --dataset_path alfiebot.CanDoChallenge \
    --embodiment_tag new_embodiment \
    --attn_implementation sdpa \
    --export_dtype fp32 \
    --output_dir groot_n1d6_onnx_sdpa_fp32

# FP32 TRT (high quality, slow)
python scripts/deployment/build_tensorrt_engine.py \
    --onnx groot_n1d6_onnx_sdpa_fp32/backbone_model.onnx \
    --engine groot_n1d6_onnx_sdpa_fp32/backbone_fp32_agx.trt \
    --precision fp32 \
    --calib-data calibration_data_backbone/calib_data.npz \
    --max-seq-len 512

# FP16 TRT (fast, bad quality)
python scripts/deployment/build_tensorrt_engine.py \
    --onnx groot_n1d6_onnx_sdpa_fp32/backbone_model.onnx \
    --engine groot_n1d6_onnx_sdpa_fp32/backbone_fp16_agx.trt \
    --precision fp16 \
    --calib-data calibration_data_backbone/calib_data.npz \
    --max-seq-len 512 \
    --prepare-system --tactic-memory 2048 --workspace 1024

# INT8 TRT (fast, bad quality — same as FP16, needs fresh memory after reboot)
python scripts/deployment/build_tensorrt_engine.py \
    --onnx groot_n1d6_onnx_sdpa_fp32/backbone_model.onnx \
    --engine groot_n1d6_onnx_sdpa_fp32/backbone_int8_agx.trt \
    --precision int8 \
    --calib-data calibration_data_backbone/calib_data.npz \
    --calib-cache calibration_data_backbone/backbone_int8_calib.cache \
    --max-seq-len 512 \
    --prepare-system --tactic-memory 2048 --workspace 1024

# Benchmark
python scripts/deployment/benchmark_backbone_pipeline.py \
    --model_path alfie-gr00t/checkpoint-10000 \
    --dataset_path alfiebot.CanDoChallenge \
    --embodiment_tag new_embodiment \
    --onnx_path groot_n1d6_onnx_sdpa_fp32/backbone_model.onnx \
    --trt_fp16_path groot_n1d6_onnx_sdpa_fp32/backbone_fp16_agx.trt \
    --trt_int8_path groot_n1d6_onnx_sdpa_fp32/backbone_int8_agx.trt

Remaining Approaches to Try

1. Mixed-Precision TRT (FP16 matmuls + FP32 sensitive layers) DEPRIORITIZED

Rationale: FP16 TRT is 149ms but quality is destroyed. Keeping LayerNorm, softmax, and attention in FP32 while GEMMs use FP16 could fix quality. However: even the best case (~180-200ms) would be slower than PyTorch flash (157ms). The memory-bandwidth analysis above shows TRT can't beat flash attention at batch=1 regardless of precision mixing. Not worth the medium effort.

2. INT8 Backbone TRT COMPLETED

Result: INT8 builds successfully after reboot (31 min, 2.96GB engine). Quality is identical to FP16 (cos_sim=0.353) — INT8 quantization itself is essentially lossless (INT8-vs-FP16: MSE=0.014, cos_sim=0.999). Latency: 141ms (5.4% faster than FP16's 149ms). Conclusion: INT8 doesn't help because the quality bottleneck is FP16 dynamic range, not quantization precision.

3. FP16 ONNX Export DEPRIORITIZED

Same dynamic range problem regardless of where the FP16 cast happens. Model wasn't trained in FP16 and its activations fundamentally exceed FP16 range.

4. torch.compile + flash attention COMPLETED

Result: torch.compile(mode='default') with flash_attention_2 works. mode='max-autotune' and mode='reduce-overhead' both FAIL — they use CUDA graphs internally, which conflicts with SigLIP2's lazily-cached freqs_cis tensor.

E2E benchmark (3 trajs, 30 inference steps, skip 1 warmup):

Config                                      Avg E2E    P90 E2E    MSE        MAE
Baseline (flash + TRT DiT)                  274.6ms    277.4ms    0.003230   0.023735
torch.compile(default) + flash + TRT DiT    267.4ms    247.8ms    0.003234   0.023758

P90 improved from 277ms to 248ms (10.5% faster). Average includes torch.compile's first-call warmup penalty. MSE essentially unchanged — no quality loss.
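For anyone wanting to replicate this outside the scripts, the compile step itself is tiny. A hedged sketch with a stand-in module (the real target is the ~2B-param Eagle backbone; requires a CUDA device):

import torch
import torch.nn as nn

# Stand-in for the backbone; torch.compile is applied to just this submodule.
backbone = nn.Sequential(
    nn.Linear(1536, 1536), nn.GELU(), nn.Linear(1536, 1536)
).cuda().bfloat16()

# mode="max-autotune" / "reduce-overhead" use CUDA graphs internally and fail
# on SigLIP2's lazily-cached freqs_cis (see section 5 below); "default" is safe.
backbone = torch.compile(backbone, mode="default")

x = torch.randn(1, 17, 1536, device="cuda", dtype=torch.bfloat16)
with torch.inference_mode():
    _ = backbone(x)     # first call pays the compile warmup
    out = backbone(x)   # later calls use the compiled kernels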

5. CUDA Graphs on the flash attention path BLOCKED

Result: CUDA graphs are fundamentally incompatible with the Eagle backbone. Two issues:

  1. SigLIP2's Rope2DPosEmb lazily caches freqs_cis — pre-computing it fixes this.
  2. SigLIP2's split_patch_embeddings_to_windows_with_meta uses data-dependent indexing (all_windows[sorted_idx]) — this is a cudaErrorStreamCaptureUnsupported error during graph capture. The windowed attention path dynamically sorts and indexes patches based on input-dependent window metadata. This cannot be captured in a static CUDA graph.

Conclusion: CUDA graphs are not viable for the Eagle backbone without modifying SigLIP2's windowed attention implementation.

6. Pipeline parallelism (overlap backbone and DiT) COMPLETED

Result: Pipeline parallelism (backbone on separate CUDA stream) works. Double-buffered: backbone(N+1) runs while DiT(N) processes on default stream.

E2E benchmark (3 trajs, 30 inference steps, skip 1 warmup):

Config                        Avg E2E    P90 E2E    Min E2E    MSE        MAE
Baseline (flash + TRT DiT)    274.6ms    277.4ms    260.6ms    0.003230   0.023735
Pipeline only                 242.3ms    260.8ms    85.0ms     0.003230   0.023735
torch.compile + pipeline      213.7ms    229.2ms    83.7ms     0.003234   0.023758

Pipeline alone: 11.8% faster avg (274.6→242.3ms). Min of 85ms confirms overlap is working — that's roughly just DiT time when backbone was already running from previous frame.

Combined compile + pipeline: 22.2% faster avg (274.6→213.7ms). This is the new best config. Quality is identical to baseline.

Note: Pipeline adds 1-frame latency (frame N's actions are computed using frame N-1's backbone features for the DiT). First frame still runs sequentially.
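The double-buffering pattern itself is compact. A hedged sketch of the idea behind PipelinedInference (the backbone/dit callables and frame list are hypothetical stand-ins; it only illustrates the stream handoff and the 1-frame latency noted above):

import torch

side = torch.cuda.Stream()   # backbone(N+1) runs here while DiT(N) runs on the default stream

def pipelined(frames, backbone, dit):
    feats = backbone(frames[0])                        # first frame runs sequentially
    for nxt in frames[1:]:
        side.wait_stream(torch.cuda.current_stream())  # make sure nxt's data is ready
        with torch.cuda.stream(side):
            next_feats = backbone(nxt)                 # overlaps with the DiT call below
        yield dit(feats)                               # actions from the *previous* frame's features
        torch.cuda.current_stream().wait_stream(side)
        feats = next_feats
    yield dit(feats)                                   # drain the final frame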

7. Model Distillation / Pruning

Rationale: A smaller backbone = less memory to load from DRAM = proportionally faster. A 50% smaller model could run in ~80ms.

Effort: High. Requires retraining.

8. Next Wave: Async Prefetch + Action Horizon + Denoising Steps

The 4.7 Hz ceiling can be pushed further with inference-level optimizations (no model changes):

8a. Async CPU Prefetching (always-on in open_loop_eval.py)

CPU preprocessing (image transforms, Eagle tokenization, collation) takes 15-30ms and was running synchronously before GPU inference. Now prefetched in a background thread via ThreadPoolExecutor, hiding this latency behind GPU work from the previous step.
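The pattern is roughly this (preprocess/infer are hypothetical stand-ins for the eval loop's CPU transforms and GPU step):

from concurrent.futures import ThreadPoolExecutor

# Sketch: preprocess sample N+1 on a CPU thread while the GPU runs sample N.
def run_eval(samples, preprocess, infer):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(preprocess, samples[0])
        for nxt in samples[1:]:
            batch = future.result()                  # usually ~0ms wait (already done)
            future = pool.submit(preprocess, nxt)    # CPU starts on the next sample...
            results.append(infer(batch))             # ...while the GPU runs this one
        results.append(infer(future.result()))       # drain the last prefetched batch
    return results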

8b. Pipeline Parallelism Wired Up in open_loop_eval.py

The --pipeline-backbone-dit flag was defined but never connected to the evaluation loop. Now wired up using the PipelinedInference class from standalone_inference_script.py. Overlaps backbone(N+1) with DiT(N) on separate CUDA streams.

8c. Reduced Denoising Steps (--denoising-steps 2)

Each TRT DiT step takes ~18ms. Going from 4→2 steps saves ~36ms. Quality impact needs empirical validation — flow matching may degrade at 2 steps.
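For intuition on why the cost scales linearly with step count: flow-matching sampling is just Euler integration of a learned velocity field, one DiT evaluation per step. A schematic only (velocity_fn stands in for the TRT DiT, which in reality is also conditioned on backbone features and state):

import torch

# Schematic: num_steps Euler updates, one DiT call (~18ms here) per step.
def sample_actions(velocity_fn, shape, num_steps=2):
    x = torch.randn(shape)                      # start from noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt)     # current integration time in [0, 1)
        x = x + velocity_fn(x, t) * dt          # Euler step toward the action chunk
    return x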

8d. Runtime Action Horizon Override (--model-action-horizon 4)

Model generates 16-step action chunks but at ~4 Hz only 3-4 steps are used. Overriding action_horizon at runtime shrinks sa_embs from (1,17,1536) to (1,5,1536), reducing DiT compute. The TRT engine supports dynamic shapes — no rebuild needed. Quality risk: model trained on 16-step noise distribution.

For production quality with smaller action horizon, fine-tune with the target horizon (see below).

8e. torch.compile Action Encoder/Decoder (--compile-action-head)

The action encoder (MultiEmbodimentActionEncoder) and decoder (CategorySpecificMLP) run 4x per inference in the denoising loop. torch.compile(mode='default') fuses their torch.bmm() kernels.

8f. cuDNN Benchmark (--cudnn-benchmark)

For fixed input shapes (eval always uses same image resolution), torch.backends.cudnn.benchmark = True auto-selects faster conv algorithms.

Measured Results (traj 0, skip 2 warmup steps)

Config                                                     Avg (ms)   Min (ms)   P90 (ms)   Hz    MSE        MAE
Baseline (compile backbone + TRT DiT, 4 denoise, AH=16)    226        215        227        4.4   0.000595   0.00860
+ compile action head + cuDNN                              225        215        226        4.4   0.000458   0.00807
2-step denoising                                           188        177        189        5.3   0.000399   0.00599
+ model-action-horizon=4 (runtime)                         223        211        224        4.5   0.028280   0.06980
+ model-action-horizon=8 (runtime)                         223        212        224        4.5   0.016578   0.05051
Combined best (2 denoise + compile + cuDNN)                186        177        188        5.4   0.000474   0.00614

Key findings:

  1. 2-step denoising is the big win: 226→188ms (-38ms, 17% faster) AND quality improves (fewer Euler steps = less FP16 error compounding in TRT DiT)
  2. Compile action head + cuDNN: negligible timing impact (~1ms), action encoder/decoder MLPs are too small to benefit from torch.compile
  3. Runtime action horizon override: REJECTED. No timing benefit (DiT is memory-bound, sa_embs size doesn't matter), quality destroyed (47x/28x worse MSE). Model must be retrained with smaller AH for this to work.
  4. Async CPU prefetch: fully hidden (0.1ms wait time), always-on in new code

9. Fine-Tuning with Smaller Action Horizon (RTX 5090)

The model was trained with action_horizon=16 (16 delta_indices for action). At ~4 Hz inference and 15 fps training, 16 steps = 1.07s lookahead but only 3-4 steps (~0.27s) are used before re-inferring. Training with a matched horizon eliminates wasted computation.

Config changes (experiment_cfg/conf.yaml):

model:
  action_horizon: 4  # Was 16
data:
  modality_configs:
    new_embodiment:
      action:
        delta_indices: [0, 1, 2, 3]  # Was [0..15]

Impact on training:

  • Sequence length: 17 tokens → 5 tokens (state=1 + action=4)
  • DiT attention: O(17²) → O(5²) — ~11.6x less compute per attention layer
  • Training speedup: ~3-5x faster per step

Approach: Resume from checkpoint-10000, train 2000-5000 steps (~15-30 min on RTX 5090). Sweep action_horizon ∈ {4, 8, 16} to find the quality/speed sweet spot.

Post-training: Rebuild TRT engine with --opt-sa-seq 5 for the new sa_embs shape.

Commands

# Baseline (4-step denoising, ~226ms)
python gr00t/eval/open_loop_eval.py \
    --dataset-path alfiebot.CanDoChallenge --embodiment-tag NEW_EMBODIMENT \
    --model-path alfie-gr00t/checkpoint-10000 \
    --inference-mode tensorrt --trt-engine-path groot_n1d6_onnx/dit_fp16.trt \
    --compile-backbone --compile-backbone-mode default \
    --traj-ids 0 --action-horizon 16 --denoising-steps 4 \
    --skip-timing-steps 2 --save-plot-path ./episode000_baseline.png

# Best config (~186ms, 5.4 Hz)
python gr00t/eval/open_loop_eval.py \
    --dataset-path alfiebot.CanDoChallenge --embodiment-tag NEW_EMBODIMENT \
    --model-path alfie-gr00t/checkpoint-10000 \
    --inference-mode tensorrt --trt-engine-path groot_n1d6_onnx/dit_fp16.trt \
    --compile-backbone --compile-backbone-mode default \
    --traj-ids 0 --action-horizon 16 --denoising-steps 2 \
    --skip-timing-steps 2 --save-plot-path ./episode000_optimized.png

Scripts Modified in This Investigation

Script                                               Changes
scripts/deployment/benchmark_backbone_pipeline.py    Auto-detect ONNX dtype/rank, TRT dtype casting, latency dtype fix
scripts/deployment/build_tensorrt_engine.py          4D/5D pixel_values auto-detection, BackboneInt8Calibrator 5D support
scripts/deployment/test_sdpa_backbone.py             SDPA + torch.compile backbone benchmark
scripts/deployment/export_backbone_onnx.py           SDPA attention export support (already existed)
scripts/deployment/standalone_inference_script.py    torch.compile, CUDAGraphBackboneWrapper (nn.Module), PipelinedInference, CLI flags
gr00t/eval/open_loop_eval.py                         Async CPU prefetch, pipeline wiring, --model-action-horizon, --compile-action-head, --cudnn-benchmark, timing instrumentation

On Friday, February 6, 2026 at 7:10:01 PM UTC-8 Alan Timm wrote:
For now my gr00t code is in this branch:
https://github.com/alansrobotlab2/alfiebot_ws/tree/alfie_gr00t/src/alfie_gr00t
