
On Apr 18, 2025, at 7:18 PM, Gmail <thomas...@gmail.com> wrote:
Alan,
ChatGPT 4o says, “I can identify and classify thousands of object types in uploaded photos. Common categories include:
- People (faces, age/gender estimates, activities)
- Animals (species, breeds)
- Plants (trees, flowers, leaves)
- Food (types of dishes, ingredients)
- Text (printed/handwritten, languages)
- Vehicles (cars, planes, bikes)
- Buildings (types, landmarks)
- Everyday objects (furniture, tools, electronics)
- Clothing (styles, colors, accessories)
- Signs and labels (road signs, logos, warnings)”
Can you recommend a similar (free) on-device image classification model? I mean something more like ChatGPT and less like YOLO. I'm OK if it requires a standard or even a gaming laptop with a high-end GPU.

On Apr 19, 2025, at 10:28 AM, Alan Timm <gest...@gmail.com> wrote:
Here's the result of passing in the attached image and asking "What's in the image?" on my Radxa Rock 5C, a 15GB RAM, 8-core SBC @ 1.8 GHz.
The round-trip time was almost 2 minutes. So not fast, but maybe useful?
>>> what is in /home/alfie/Pictures/homeoffice.jpg
Added image '/home/alfie/Pictures/homeoffice.jpg'
The image shows an old school desktop computer setup with a yellow plastic chair in front of it. The laptop
screen displays "03:58" and the mouse is black. There are two mugs next to the keyboard - one is green and
the other is white. On the desk, there is also a potted plant with green leaves.
total duration: 1m57.419420595s
load duration: 4.535755612s
prompt eval count: 716 token(s)
prompt eval duration: 1m38.395394584s
prompt eval rate: 7.28 tokens/s
eval count: 73 token(s)
eval duration: 14.425655452s
eval rate: 5.06 tokens/s
<homeoffice.jpg>

Hey there!
I'm getting closer to (re)assembling alfie. The 12V 20A buck converter is working well, although I think it's time to shorten a whole bunch of cables so that everything fits (it doesn't quite yet).
Also, I've fallen into a bit of a rabbit hole with respect to on-board processing. I rage-quit my Indiedroid Nova SBC and have moved on to the Radxa Rock 5C with 16GB RAM.
There are some compelling options for on-device speech synthesis, speech recognition (?!), and large/small language models (?!). It's crazy that you can run these on a Raspberry Pi-sized device:
- piper-tts streams natural-sounding speech with about a 1-second delay
- faster-whisper for faster-than-real-time speech recognition
- qwen2.5 models in 0.5B, 1.5B, and 3B variants for AI agents
- deepseek-r1:1.5b reasoning model through ollama
I think the qwen models are capable of tool use, and you can run several combinations of these on an 8GB RAM SBC, with the whole stack fitting with room to spare on a 16GB device.
Here's a sample of the libritts_r_medium voice 4 (there are 903 voices available in total), linked in the message.
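If you want to poke at the same stack, here's a minimal sketch of querying a local ollama server from Python. It assumes ollama is already serving on its default port (11434) and that a qwen2.5 model has been pulled; the model tag is just an example.

```python
# Minimal sketch: query a local ollama server over its REST API.
# Assumes `ollama serve` is running on the default port and that
# `ollama pull qwen2.5:1.5b` (or another model) has already been done.
import json
import urllib.request

def ask_ollama(prompt, model="qwen2.5:1.5b", host="http://localhost:11434"):
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask_ollama("In one sentence, what is a skid-steer robot?"))
```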

Hi Alan,
I'm using Porcupine Wake Word by Picovoice. It runs locally on your machine and is free for non-commercial projects. You can create one wake word per month. Sign up, click the non-commercial option, and agree not to aspire to make any money with it (at least while using their tech!).
https://picovoice.ai/platform/porcupine/
https://picovoice.ai/docs/quick-start/porcupine-python/
You can see my example code utilizing two wake words:
This is a simple test which only requires pvporcupine and pyaudio and your wake word ppn file you get from picovoice:
https://github.com/jimdinunzio/big-orange/blob/Python-3.9/python/tests/test_porcupine_wake_word.py
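For anyone skimming, the core of that test boils down to a few lines. This is a rough sketch, not Jim's actual code; the access key, keyword path, and audio device setup are placeholders you get from the Picovoice console.

```python
# Rough sketch of a Porcupine wake-word loop (placeholder key and .ppn path,
# not the linked test verbatim). Requires: pip install pvporcupine pyaudio
import struct
import pvporcupine
import pyaudio

porcupine = pvporcupine.create(
    access_key="YOUR_PICOVOICE_ACCESS_KEY",        # from the Picovoice console
    keyword_paths=["path/to/your_wake_word.ppn"],  # one entry per wake word
)

pa = pyaudio.PyAudio()
stream = pa.open(
    rate=porcupine.sample_rate,                    # 16 kHz
    channels=1,
    format=pyaudio.paInt16,
    input=True,
    frames_per_buffer=porcupine.frame_length,      # 512 samples per frame
)

try:
    while True:
        pcm = stream.read(porcupine.frame_length, exception_on_overflow=False)
        pcm = struct.unpack_from("h" * porcupine.frame_length, pcm)
        keyword_index = porcupine.process(pcm)     # -1 if no wake word this frame
        if keyword_index >= 0:
            print(f"Wake word {keyword_index} detected!")
finally:
    stream.close()
    pa.terminate()
    porcupine.delete()
```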
As a career software guy, I'm a big fan of GitHub and development records. All Big Orange code (and my other projects' code) has been on GitHub since 2020.
https://github.com/jimdinunzio/big-orange/
Jim










On Oct 24, 2025, at 6:04 PM, Alan Timm <gest...@gmail.com> wrote:
Ok, so Houston? We have a little problem... (Thanks Dani for the video!)
So, good news: he's been assembled and the framework code is all done and works 100%, but...
But... we have an unforeseen consequence of some of my hardware design choices. He got himself some crippling jigglies when executing any type of turn. You can see it at a couple of points in the video. The faster the turn, the faster he attempts to jiggle himself over. That's... not ideal.
I've tried everything, and it appears to be a problem with the skid-steer configuration driving the outside wheels on both sides, in combination with a tall bot.
So... I'm going to try switching to mecanum this weekend and hope for the best. Wish me luck!
Alan

On Wednesday, October 15, 2025 at 7:54:03 PM UTC-7 Alan Timm wrote:
Ok guys and gals, I just gotta say... pair programming with GitHub Copilot is MAGICAL!
I've covered more ground in the past few days than I would have been able to over the next month, and that's IF I could have maintained focus long enough to deliver. (That's questionable; I seem to have the attention span of a hyperactive ferret.)
The general driver-board FreeRTOS + micro-ROS firmware is feature complete, and the UI I've developed to program and test the servos is now complete and working perfectly. Now that this tool is done, I can program the offsets for each of the servos for their home position and hard minimum and maximum limits.
Then on to the fun stuff. :-)


On Oct 26, 2025, at 10:11 AM, Alan Timm <gest...@gmail.com> wrote:
Ok, I'm not sure why it worked, but it did, and now alfie can turn!
After a little bit of hardware finessing, alfie has a new pair of shoes. So now it's motors with encoders all around, with PID loops and a few other tricks for a pure closed-loop velocity drive system. The base now accepts standard Twist messages and translates them into the required velocities for all four wheels at 100 Hz.
I'll post an updated video later this afternoon with all the fun things he can do now with his new wheels.
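For anyone curious, the Twist-to-wheel-speed math for a mecanum base is only a few lines. This is a generic sketch of standard mecanum inverse kinematics, not alfie's actual firmware; the wheel radius and chassis dimensions are placeholder values.

```python
# Generic mecanum inverse kinematics sketch (not alfie's firmware).
# Maps a body-frame Twist (vx, vy, wz) to four wheel angular velocities.
WHEEL_RADIUS = 0.04     # m (placeholder)
HALF_TRACK = 0.10       # m, half the left/right wheel separation (placeholder)
HALF_WHEELBASE = 0.12   # m, half the front/back wheel separation (placeholder)

def twist_to_wheel_speeds(vx, vy, wz):
    """Return (front_left, front_right, rear_left, rear_right) in rad/s."""
    k = HALF_TRACK + HALF_WHEELBASE
    fl = (vx - vy - k * wz) / WHEEL_RADIUS
    fr = (vx + vy + k * wz) / WHEEL_RADIUS
    rl = (vx + vy - k * wz) / WHEEL_RADIUS
    rr = (vx - vy + k * wz) / WHEEL_RADIUS
    return fl, fr, rl, rr

# Example: spin in place at 0.5 rad/s
print(twist_to_wheel_speeds(0.0, 0.0, 0.5))
```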
<screenshot_26102025_100458.jpg>

On Oct 29, 2025, at 8:02 PM, Alan Timm <gest...@gmail.com> wrote:
I was taking a closer look at the soft gripper variant of xlerobot, and it got me thinking about what I could use to update the hands for alfie.
- soft compliant gripper
- hand camera (I thought they were on their way out, then I saw them again on the figure robot and xlerobot)
- force sensors on gripper to estimate grip force (the servos are highly geared, so current draw can't be used)
Here's where I'm exploring the concepts in Onshape (while I'm redesigning the arms and doing a bunch of stuff other than starting on "operation high five").
The compliant grippers are printed in TPU95, so I faithfully recreated their gripper finger design to see how well it works. They print in TPU and then use grip tape.
There are these adorable 640x480 color camera modules that I'm thinking of placing directly in the middle back of the hand, then using a pair of those cheap force sensors to estimate grip force, if I can get them to work with compliant fingers.
(And I'm trying really hard not to be distracted by that adorable AmazingHand design by Pollen Robotics.)
<screenshot_29102025_195654.jpg>

On Monday, October 27, 2025 at 4:07:49 PM UTC-7 Alan Timm wrote:
Hey Allan,
Alan here. Nice to meet ya!
We meet the second Saturday of (almost) every month at Cal State Long Beach. We also have a really nice hybrid meeting setup, so you're always welcome to join over Zoom if you can't make it in person. Keep an eye on the forums as well as our website https://rssc.org for details on our upcoming meetups.
See you next month!
Alan




On Nov 2, 2025, at 9:10 AM, Alan Timm <gest...@gmail.com> wrote:
Ok, let's talk about hands. While I keep finding excuses to work on hardware instead of, you know, actually getting alfie to do anything, I'm looking at what his hands should be.





A quick video from the Quest 3 showing what teleoperation looks like through the headset. Apologies for the quality: the in-headset capture is capped at 1024 pixels wide; the actual experience is at a much higher resolution on-device. I've been able to integrate some data into the hub, and since I replaced the OAK-D Lite with a true stereo camera setup, he's a lot easier to control.
Oh, did I mention that the view through the headset is also in true 3D? :-)
I think he's about ready to capture some training datasets for the contest in February.





I honestly believe that's where we're going to end up. As the models get better, we're going to be removing the previous code "scaffolding" required to do tasks. And around that same time, AI will continue to write code on its own to accomplish tasks in real time as needed.
2026 is already shaping up to be a wild ride, and we're just getting started. :-)
Alan
On Thursday, February 5, 2026 at 2:48:14 PM UTC-8 Carl wrote:
Very cool - looking forward to seeing it! I guess the future of programming is no code - just training CNNs :-)

torch.compile(default) backbone + TRT FP16 DiT + 2-step denoising = 186ms avg E2E (~5.4 Hz)
Previous bests: 240ms (4.2 Hz, 4-step denoising) → 226ms (4.4 Hz, with async prefetch) → 186ms (5.4 Hz, 2-step denoising)
| Stage | MSE | Cos Sim | Median Latency | Notes |
|---|---|---|---|---|
| A: PyTorch BF16 flash | baseline | baseline | 157ms | Reference — uses flash_attention_2 |
| B: PyTorch FP32 eager | 5.44 | 0.911 | 1194ms | flash→eager attention swap destroys quality |
| C: ONNX Runtime (SDPA) | 0.091 | 0.999 | 565ms | SDPA attention solves quality problem |
| D: TRT FP32 (SDPA) | 0.093 | 0.999 | 350ms | Lossless TRT compilation |
| E: TRT FP16 (SDPA) | 27.41 | 0.354 | 149ms | FP16 destroys quality |
| F: TRT INT8 (SDPA) | 27.42 | 0.353 | 141ms | INT8 ≈ FP16 quality, 5% faster |
| torch.compile + SDPA | 0.117 | 0.998 | 191ms | 22% slower than flash, stays in PyTorch |
SDPA attention solves the ONNX export quality problem. Eager attention (cos_sim=0.914) was the original bottleneck preventing backbone TRT. SDPA (cos_sim=0.999) is near-identical to flash.
FP32 TRT compilation is lossless — only 0.002 MSE increment from ONNX→TRT FP32. The ONNX trace is faithful.
FP16 TRT destroys quality (cos_sim=0.349). The Eagle backbone's internal values overflow FP16's 5-bit exponent range (±65504). BF16 has 8 exponent bits (±3.4×10³⁸) and the model relies on this range. On Orin SM87, BF16 tensor cores are not natively supported — TRT falls back to FP16.
FP16 TRT is actually 5% faster than PyTorch flash (149ms vs 157ms) — tensor cores help, but quality is unusable.
FP32 TRT is 2x slower than PyTorch flash (350ms vs 157ms) — no tensor core benefit in FP32 on SM87.
INT8 TRT builds successfully after reboot (31 min build, 2.96GB engine). Requires fresh memory — previously OOM'd before reboot.
INT8 adds virtually zero error on top of FP16. INT8 vs FP16 incremental: MSE=0.014, cos_sim=0.999. The quality bottleneck is entirely FP16 precision, not INT8 quantization.
INT8 is 5.4% faster than FP16 (141ms vs 149ms) — marginal gain since the model is memory-bandwidth bound.
torch.compile + SDPA gives 191ms (22% slower than flash) with cos_sim=0.998. Best non-flash option that stays in PyTorch.
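To make the comparison concrete, the attention implementation is usually just a load-time switch. The sketch below assumes a Hugging Face transformers-style backbone; the model name is a placeholder and the GR00T codebase may wire this up differently.

```python
# Rough sketch: choosing the attention implementation at model load time.
# "my/eagle-backbone" is a placeholder name; the GR00T loading path may differ.
import torch
from transformers import AutoModel

backbone = AutoModel.from_pretrained(
    "my/eagle-backbone",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",   # alternatives: "flash_attention_2", "eager"
)
```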
Quality (cos_sim) vs. median latency:

A = PyTorch flash: 157ms, reference (baseline quality)
torch.compile + SDPA: 191ms, 0.998
D = TRT FP32: 350ms, 0.999
C = ONNX Runtime: 565ms, 0.999
E = TRT FP16: 149ms, 0.354
F = TRT INT8: 141ms, 0.353
The speed-quality gap: No TRT config achieves both good quality AND speed improvement. The only fast options (FP16/INT8) have destroyed quality. The only high-quality options (FP32/ONNX) are slower than PyTorch flash.
The Eagle backbone (Eagle-Block2A-2B-v2) produces intermediate values that require BF16's wider exponent range. The SigLIP2 vision encoder and Qwen2 language model both have activations and attention scores that exceed FP16's ±65504 range. This is a fundamental model property — not fixable by output buffer dtype or accumulation fixes.
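One way to sanity-check this kind of overflow claim is to hook the model and record peak activation magnitudes against FP16's representable maximum. This is a generic diagnostic sketch, not part of the GR00T tooling.

```python
# Generic diagnostic sketch (not part of the GR00T tooling): record the peak
# activation magnitude per module and flag anything beyond FP16 range.
import torch

FP16_MAX = torch.finfo(torch.float16).max   # 65504.0

def attach_range_probes(model):
    peaks = {}
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                peaks[name] = max(peaks.get(name, 0.0), output.abs().max().item())
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
    return peaks

# Usage: peaks = attach_range_probes(backbone); run a few batches; then
# overflows = {n: p for n, p in peaks.items() if p > FP16_MAX}
```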
The 149ms (FP16) and 141ms (INT8) results are only 5-10% faster than PyTorch flash's 157ms. This is far less than the 2-4x speedup typically expected from TRT quantization. Three factors explain this:
1. Memory-bandwidth bound at batch=1. On Orin AGX, the ~2B parameter Eagle backbone is bottlenecked by LPDDR5 bandwidth (~205 GB/s theoretical, ~130 GB/s real), not compute. At batch=1, the GPU spends most of its time loading weights from DRAM, not doing math. The FP32→FP16 ratio confirms this: 350ms vs 149ms = 2.35x, almost exactly the 2x expected from halving memory traffic. BF16 and FP16 are both 16-bit — they move the same bytes through memory. So FP16 TRT can't be faster than BF16 PyTorch on bandwidth alone.
2. Flash attention is an algorithmic advantage TRT can't replicate. Flash attention never materializes the N×N attention matrix (O(N) memory vs O(N²) for SDPA). The ONNX export path uses SDPA since flash can't be traced. Even with TRT's kernel fusion and tensor cores, it can't match flash attention's fundamentally fewer memory round-trips. The two roughly cancel:
| TRT FP16 advantages | Flash attention advantages |
|---|---|
| Kernel fusion | O(N) memory (vs O(N²) for SDPA) |
| FP16 tensor cores | Fewer total memory round-trips |
| Graph-level optimization | Hand-tuned CUDA kernel for this workload |
| ≈ 149ms | ≈ 157ms (roughly a wash) |
3. INT8 has limited memory savings in practice. INT8 should halve memory traffic vs FP16, but TRT INT8 only uses INT8 for weights — activations stay FP16. Many layers (LayerNorm, Softmax, embeddings, attention) fall back to FP16 entirely. Net memory reduction is ~30%, not 50%, yielding only 5.4% speedup (141ms vs 149ms).
Bottom line: PyTorch BF16 with flash attention is already operating near the memory-bandwidth ceiling for this model at batch=1 on Orin AGX. TRT can't meaningfully beat it because the bottleneck is DRAM bandwidth, not kernel efficiency — and flash attention's algorithmic memory savings offset TRT's fusion benefits.
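For a rough sense of scale, the sketch below does the back-of-envelope arithmetic implied above (~2B parameters at 2 bytes each, ~130 GB/s effective bandwidth). It is an estimate, not a measurement.

```python
# Back-of-envelope only, using the figures quoted above; not a measurement.
params = 2e9                  # ~2B parameter Eagle backbone
bytes_per_param = 2           # BF16/FP16
eff_bandwidth = 130e9         # ~130 GB/s real-world LPDDR5 on Orin AGX

weight_bytes = params * bytes_per_param            # ~4 GB touched per pass
floor_ms = weight_bytes / eff_bandwidth * 1000     # ~31 ms lower bound
print(f"weight-traffic floor: {floor_ms:.0f} ms per forward pass")
# The measured 157 ms sits above this floor because activations, attention,
# and KV traffic add memory movement on top of the weights.
```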
| Path | Description |
|---|---|
| groot_n1d6_onnx_sdpa_fp32/backbone_model.onnx | SDPA FP32 ONNX export (high quality) |
| groot_n1d6_onnx_sdpa_fp32/backbone_fp32_agx.trt | FP32 TRT engine, 5.9GB, cos_sim=0.999, 350ms |
| groot_n1d6_onnx_sdpa_fp32/backbone_fp16_agx.trt | FP16 TRT engine, 3.1GB, cos_sim=0.354, 149ms |
| groot_n1d6_onnx_sdpa_fp32/backbone_int8_agx.trt | INT8 TRT engine, 3.1GB, cos_sim=0.353, 141ms |
| calibration_data_backbone/calib_data.npz | 100 samples backbone calibration data |
| calibration_data_backbone/backbone_int8_calib.cache | INT8 calibration cache (reuse for rebuilds) |
# Baseline: PyTorch BF16 flash backbone + TRT FP16 DiT
python scripts/deployment/standalone_inference_script.py \
--model-path alfie-gr00t/checkpoint-10000 \
--dataset-path alfiebot.CanDoChallenge \
--embodiment-tag NEW_EMBODIMENT \
--inference-mode tensorrt \
--trt-engine-path groot_n1d6_onnx/dit_fp16.trt \
--traj-ids 0 1 2 --steps 200 --denoising-steps 4 --action-horizon 16 --seed 42
# Best config: torch.compile + pipeline parallelism
python scripts/deployment/standalone_inference_script.py \
--model-path alfie-gr00t/checkpoint-10000 \
--dataset-path alfiebot.CanDoChallenge \
--embodiment-tag NEW_EMBODIMENT \
--inference-mode tensorrt \
--trt-engine-path groot_n1d6_onnx/dit_fp16.trt \
--compile-backbone \
--compile-backbone-mode default \
--pipeline-backbone-dit \
--traj-ids 0 1 2 \
--steps 200 \
--denoising-steps 2 \
--action-horizon 4 \
--seed 42
# Open loop eval with timing
python gr00t/eval/open_loop_eval.py \
--dataset-path alfiebot.CanDoChallenge \
--embodiment-tag NEW_EMBODIMENT \
--model-path alfie-gr00t/checkpoint-10000 \
--inference-mode tensorrt \
--trt-engine-path groot_n1d6_onnx/dit_fp16.trt \
--compile-backbone \
--compile-backbone-mode default \
--traj-ids 0 --action-horizon 16 --denoising-steps 4 \
--save-plot-path ./episode000_optimized.png
# SDPA ONNX export
python scripts/deployment/export_backbone_onnx.py \
--model_path alfie-gr00t/checkpoint-10000 \
--dataset_path alfiebot.CanDoChallenge \
--embodiment_tag new_embodiment \
--attn_implementation sdpa \
--export_dtype fp32 \
--output_dir groot_n1d6_onnx_sdpa_fp32
# FP32 TRT (high quality, slow)
python scripts/deployment/build_tensorrt_engine.py \
--onnx groot_n1d6_onnx_sdpa_fp32/backbone_model.onnx \
--engine groot_n1d6_onnx_sdpa_fp32/backbone_fp32_agx.trt \
--precision fp32 \
--calib-data calibration_data_backbone/calib_data.npz \
--max-seq-len 512
# FP16 TRT (fast, bad quality)
python scripts/deployment/build_tensorrt_engine.py \
--onnx groot_n1d6_onnx_sdpa_fp32/backbone_model.onnx \
--engine groot_n1d6_onnx_sdpa_fp32/backbone_fp16_agx.trt \
--precision fp16 \
--calib-data calibration_data_backbone/calib_data.npz \
--max-seq-len 512 \
--prepare-system --tactic-memory 2048 --workspace 1024
# INT8 TRT (fast, bad quality — same as FP16, needs fresh memory after reboot)
python scripts/deployment/build_tensorrt_engine.py \
--onnx groot_n1d6_onnx_sdpa_fp32/backbone_model.onnx \
--engine groot_n1d6_onnx_sdpa_fp32/backbone_int8_agx.trt \
--precision int8 \
--calib-data calibration_data_backbone/calib_data.npz \
--calib-cache calibration_data_backbone/backbone_int8_calib.cache \
--max-seq-len 512 \
--prepare-system --tactic-memory 2048 --workspace 1024
# Benchmark
python scripts/deployment/benchmark_backbone_pipeline.py \
--model_path alfie-gr00t/checkpoint-10000 \
--dataset_path alfiebot.CanDoChallenge \
--embodiment_tag new_embodiment \
--onnx_path groot_n1d6_onnx_sdpa_fp32/backbone_model.onnx \
--trt_fp16_path groot_n1d6_onnx_sdpa_fp32/backbone_fp16_agx.trt \
--trt_int8_path groot_n1d6_onnx_sdpa_fp32/backbone_int8_agx.trt
Rationale: FP16 TRT is 149ms but quality is destroyed. Keeping LayerNorm, softmax, and attention in FP32 while GEMMs use FP16 could fix quality. However: even the best case (~180-200ms) would be slower than PyTorch flash (157ms). The memory-bandwidth analysis above shows TRT can't beat flash attention at batch=1 regardless of precision mixing. Not worth the medium effort.
Result: INT8 builds successfully after reboot (31 min, 2.96GB engine). Quality is identical to FP16 (cos_sim=0.353) — INT8 quantization itself is essentially lossless (INT8-vs-FP16: MSE=0.014, cos_sim=0.999). Latency: 141ms (5.4% faster than FP16's 149ms). Conclusion: INT8 doesn't help because the quality bottleneck is FP16 dynamic range, not quantization precision.
Same dynamic range problem regardless of where the FP16 cast happens. Model wasn't trained in FP16 and its activations fundamentally exceed FP16 range.
Result: torch.compile(mode='default') with flash_attention_2 works. mode='max-autotune' and mode='reduce-overhead' both FAIL — they use CUDA graphs internally, which conflicts with SigLIP2's lazily-cached freqs_cis tensor.
E2E benchmark (3 trajs, 30 inference steps, skip 1 warmup):
| Config | Avg E2E | P90 E2E | MSE | MAE |
|---|---|---|---|---|
| Baseline (flash + TRT DiT) | 274.6ms | 277.4ms | 0.003230 | 0.023735 |
| torch.compile(default) + flash + TRT DiT | 267.4ms | 247.8ms | 0.003234 | 0.023758 |
P90 improved from 277ms to 248ms (10.5% faster). Average includes torch.compile's first-call warmup penalty. MSE essentially unchanged — no quality loss.
Result: CUDA graphs are fundamentally incompatible with the Eagle backbone. Two issues:
1. Rope2DPosEmb lazily caches freqs_cis — pre-computing it fixes this.
2. split_patch_embeddings_to_windows_with_meta uses data-dependent indexing (all_windows[sorted_idx]) — this triggers a cudaErrorStreamCaptureUnsupported error during graph capture. The windowed attention path dynamically sorts and indexes patches based on input-dependent window metadata, which cannot be captured in a static CUDA graph.

Conclusion: CUDA graphs are not viable for the Eagle backbone without modifying SigLIP2's windowed attention implementation.
Result: Pipeline parallelism (backbone on separate CUDA stream) works. Double-buffered: backbone(N+1) runs while DiT(N) processes on default stream.
E2E benchmark (3 trajs, 30 inference steps, skip 1 warmup):
| Config | Avg E2E | P90 E2E | Min E2E | MSE | MAE |
|---|---|---|---|---|---|
| Baseline (flash + TRT DiT) | 274.6ms | 277.4ms | 260.6ms | 0.003230 | 0.023735 |
| Pipeline only | 242.3ms | 260.8ms | 85.0ms | 0.003230 | 0.023735 |
| torch.compile + pipeline | 213.7ms | 229.2ms | 83.7ms | 0.003234 | 0.023758 |
Pipeline alone: 11.8% faster avg (274.6→242.3ms). Min of 85ms confirms overlap is working — that's roughly just DiT time when backbone was already running from previous frame.
Combined compile + pipeline: 22.2% faster avg (274.6→213.7ms). This is the new best config. Quality is identical to baseline.
Note: Pipeline adds 1-frame latency (frame N's actions are computed using frame N-1's backbone features for the DiT). First frame still runs sequentially.
Rationale: A smaller backbone = less memory to load from DRAM = proportionally faster. A 50% smaller model could run in ~80ms.
Effort: High. Requires retraining.
The 4.7 Hz ceiling can be pushed further with inference-level optimizations (no model changes):
CPU preprocessing (image transforms, Eagle tokenization, collation) takes 15-30ms and was running synchronously before GPU inference. Now prefetched in a background thread via ThreadPoolExecutor, hiding this latency behind GPU work from the previous step.
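The prefetch pattern itself is a stock producer-consumer idiom. Below is a minimal sketch with hypothetical preprocess() and infer() callables; the actual open_loop_eval.py wiring differs in detail.

```python
# Minimal async-prefetch sketch with hypothetical preprocess()/infer()
# callables; the real open_loop_eval.py wiring differs in detail.
from concurrent.futures import ThreadPoolExecutor

def run_episode(frames, preprocess, infer):
    actions = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(preprocess, frames[0])   # warm the pipeline
        for nxt in frames[1:]:
            batch = pending.result()                   # CPU work already done
            pending = pool.submit(preprocess, nxt)     # overlaps the GPU step below
            actions.append(infer(batch))               # GPU inference
        actions.append(infer(pending.result()))        # last frame
    return actions
```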
The --pipeline-backbone-dit flag was defined but never connected to the evaluation loop. Now wired up using the PipelinedInference class from standalone_inference_script.py. Overlaps backbone(N+1) with DiT(N) on separate CUDA streams.
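The stream-overlap idea looks roughly like the sketch below, with placeholder backbone and dit callables; the real PipelinedInference class handles buffer reuse and synchronization more carefully.

```python
# Double-buffered backbone/DiT overlap sketch (placeholder callables; the real
# PipelinedInference class is more careful about buffer reuse and sync).
import torch

backbone_stream = torch.cuda.Stream()

def pipelined_step(frame_next, feats_prev, backbone, dit):
    # Launch the backbone for frame N+1 on a side stream...
    with torch.cuda.stream(backbone_stream):
        feats_next = backbone(frame_next)
    # ...while the DiT consumes frame N's features on the default stream.
    actions = dit(feats_prev)
    # Make frame N+1's features safe to consume on the default stream next step.
    torch.cuda.current_stream().wait_stream(backbone_stream)
    return actions, feats_next
```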
Fewer denoising steps (--denoising-steps 2): Each TRT DiT step takes ~18ms, so going from 4 to 2 steps saves ~36ms. Quality impact needs empirical validation — flow matching may degrade at 2 steps.
Runtime action-horizon override (--model-action-horizon 4): The model generates 16-step action chunks, but at ~4 Hz only 3-4 steps are used before re-inferring. Overriding action_horizon at runtime shrinks sa_embs from (1,17,1536) to (1,5,1536), reducing DiT compute. The TRT engine supports dynamic shapes — no rebuild needed. Quality risk: the model was trained on a 16-step noise distribution.
For production quality with a smaller action horizon, fine-tune with the target horizon (see below).
Compiled action head (--compile-action-head): The action encoder (MultiEmbodimentActionEncoder) and decoder (CategorySpecificMLP) run 4x per inference in the denoising loop. torch.compile(mode='default') fuses their torch.bmm() kernels.
cuDNN autotuning (--cudnn-benchmark): For fixed input shapes (eval always uses the same image resolution), torch.backends.cudnn.benchmark = True auto-selects faster conv algorithms.
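Both knobs are one-liners in PyTorch. The sketch below uses placeholder stand-ins for the action encoder/decoder modules rather than the actual flag wiring.

```python
# Generic sketch of the two knobs above (placeholder modules, not the actual
# --compile-action-head / --cudnn-benchmark wiring in open_loop_eval.py).
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True   # autotune conv algorithms for fixed shapes

# Stand-ins for the real action encoder/decoder submodules.
action_encoder = nn.Linear(1536, 1024)
action_decoder = nn.Linear(1024, 32)

# torch.compile with the default mode fuses their small per-step kernels.
action_encoder = torch.compile(action_encoder, mode="default")
action_decoder = torch.compile(action_decoder, mode="default")
```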
| Config | Avg (ms) | Min (ms) | P90 (ms) | Hz | MSE | MAE |
|---|---|---|---|---|---|---|
| Baseline (compile backbone + TRT DiT, 4 denoise, AH=16) | 226 | 215 | 227 | 4.4 | 0.000595 | 0.00860 |
| + compile action head + cuDNN | 225 | 215 | 226 | 4.4 | 0.000458 | 0.00807 |
| + 2-step denoising | 188 | 177 | 189 | 5.3 | 0.000399 | 0.00599 |
| + model-action-horizon=4 (runtime) | 223 | 211 | 224 | 4.5 | 0.028280 | 0.06980 |
| + model-action-horizon=8 (runtime) | 223 | 212 | 224 | 4.5 | 0.016578 | 0.05051 |
| Combined best (2 denoise + compile + cuDNN) | 186 | 177 | 188 | 5.4 | 0.000474 | 0.00614 |
Key findings:
The model was trained with action_horizon=16 (16 delta_indices for action). At ~4 Hz inference and 15 fps training, 16 steps = 1.07s lookahead but only 3-4 steps (~0.27s) are used before re-inferring. Training with a matched horizon eliminates wasted computation.
Config changes (experiment_cfg/conf.yaml):
model:
  action_horizon: 4                     # was 16
data:
  modality_configs:
    new_embodiment:
      action:
        delta_indices: [0, 1, 2, 3]     # was [0..15]
Impact on training:
Approach: Resume from checkpoint-10000, train 2000-5000 steps (~15-30 min on RTX 5090). Sweep action_horizon ∈ {4, 8, 16} to find the quality/speed sweet spot.
Post-training: Rebuild TRT engine with --opt-sa-seq 5 for the new sa_embs shape.
# Baseline (4-step denoising, ~226ms)
python gr00t/eval/open_loop_eval.py \
--dataset-path alfiebot.CanDoChallenge --embodiment-tag NEW_EMBODIMENT \
--model-path alfie-gr00t/checkpoint-10000 \
--inference-mode tensorrt --trt-engine-path groot_n1d6_onnx/dit_fp16.trt \
--compile-backbone --compile-backbone-mode default \
--traj-ids 0 --action-horizon 16 --denoising-steps 4 \
--skip-timing-steps 2 --save-plot-path ./episode000_baseline.png
# Best config (~186ms, 5.4 Hz)
python gr00t/eval/open_loop_eval.py \
--dataset-path alfiebot.CanDoChallenge --embodiment-tag NEW_EMBODIMENT \
--model-path alfie-gr00t/checkpoint-10000 \
--inference-mode tensorrt --trt-engine-path groot_n1d6_onnx/dit_fp16.trt \
--compile-backbone --compile-backbone-mode default \
--traj-ids 0 --action-horizon 16 --denoising-steps 2 \
--skip-timing-steps 2 --save-plot-path ./episode000_optimized.png
| Script | Changes |
|---|---|
| scripts/deployment/benchmark_backbone_pipeline.py | Auto-detect ONNX dtype/rank, TRT dtype casting, latency dtype fix |
| scripts/deployment/build_tensorrt_engine.py | 4D/5D pixel_values auto-detection, BackboneInt8Calibrator 5D support |
| scripts/deployment/test_sdpa_backbone.py | SDPA + torch.compile backbone benchmark |
| scripts/deployment/export_backbone_onnx.py | SDPA attention export support (already existed) |
| scripts/deployment/standalone_inference_script.py | torch.compile, CUDAGraphBackboneWrapper (nn.Module), PipelinedInference, CLI flags |
| gr00t/eval/open_loop_eval.py | Async CPU prefetch, pipeline wiring, --model-action-horizon, --compile-action-head, --cudnn-benchmark, timing instrumentation |
"...2 hour autonomous long-horizon working session where my only involvement was to answer a few clarifying questions and to click "ok" every once in a while. Claude created and executed the gameplan, created and ran tests, updated code and even used the gr00t-dev docker container to run everything."WOW! Claud is getting to be an even better assistant/agent when it comes to coding.
On Sun, Feb 15, 2026 at 10:17 AM Alan Timm <gest...@gmail.com> wrote:
Good morning!
Yesterday I was joking about how NVIDIA's own section on generating a TRT engine for gr00t results in 100x worse inference results (actually not a joke, 100% true). TensorRT is a method of compiling a model into GPU code that executes faster than running the model in PyTorch.
When I got home yesterday, I started a Claude Code session to analyze the end-to-end TensorRT compilation workflow and fix it, while generating benchmarks to confirm improvements. This resulted in a 2-hour autonomous long-horizon working session where my only involvement was to answer a few clarifying questions and to click "ok" every once in a while. Claude created and executed the gameplan, created and ran tests, updated code, and even used the gr00t-dev docker container to run everything.
Result: inference goes from 2.7 Hz to 4.3 Hz with no loss in quality. Now we're working on converting the other half of the model to INT8, which should bring us to around 6 Hz.
The future is closer than you think.
On Friday, February 6, 2026 at 7:10:01 PM UTC-8 Alan Timm wrote:
Tonight I got impatient waiting for my next fine-tune to complete, so I tried out RunPod and rented a 4x H200 SXM monster. RunPod lets you rent servers to do interesting things. The fine-tune should complete in about 45 minutes. Well worth the $15.
I was getting some weird responses in gr00t inference, so Claude Opus 4.6 suggested I change some things in the alfiebot_config and run a fresh fine-tune. Now I can try out a few things before calling it a night.
GPUs go brrrrrrrrrrrrrr!
For now my gr00t code is in this branch:
https://github.com/alansrobotlab2/alfiebot_ws/tree/alfie_gr00t/src/alfie_gr00t

On Thursday, February 5, 2026 at 8:24:59 PM UTC-8 Alan Timm wrote:
I honestly believe that's where we're going to end up. As the models get better, we're going to be removing the previous code "scaffolding" required to do tasks. And around that same time, AI will continue to write code on its own to accomplish tasks in real time as needed. 2026 is already shaping up to be a wild ride, and we're just getting started. :-)
Alan

On Thursday, February 5, 2026 at 2:48:14 PM UTC-8 Carl wrote:
Very cool - looking forward to seeing it! I guess the future of programming is no code - just training CNNs :-)