USB level disconnect causing firmware updater failure

Sanjay Sheth

unread,

Jan 14, 2025, 11:17:51 PMJan 14

to fwupd

Firmware updater works with fwupd 1.9.25 using fwupd/fwupdmgr daemons.
It also works with fwupd 2.0.3 when using with fwupdtool.

But when using fwupd 2.0.3 or later codebase, with fwupd/fwupdmgr daemons, it fails consistently due to USB level failure. Also seeing following error in the console when device is getting added:
19:49:33.328 FuEngine failed to add device /sys/devices/pci0000:00/0000:00:14.0/usb2/2-3/2-3:1.0/host0/target0:0:0/0:0:0:0/block/sda: failed to subclass open: failed to open /dev/sda: No medium found
19:49:33.334 FuEngine failed to add device /sys/devices/pci0000:00/0000:00:14.0/usb2/2-3/2-3:1.0/host0/target0:0:0/0:0:0:0/block/sda: failed to subclass open: failed to open /dev/sda: No medium found
19:49:33.342 FuPluginScsi found ufshci controller at /sys/devices/pci0000:00/0000:00:14.0
19:49:33.344 FuEngine failed to add device /sys/devices/pci0000:00/0000:00:14.0/usb2/2-3/2-3:1.0/host0/target0:0:0/0:0:0:0/block/sda: failed to subclass open: failed to open /dev/sda: No medium found

Full verbose logs attached for both working/failure cases. Same device/os/firmware setup is used for all above cases. filesize/base64hash also same. Different device (e.g. Rally Bar Mini) also giving same results.

Is it possible we are experiencing ZLP issue:
https://libusb.sourceforge.io/api-1.0/group__libusb__asyncio.html#gga1fb47dd0f7c209b60a3609ff0c03d56da26b66334b6ec0537c49841ca623d901f see api "LIBUSB_TRANSFER_ADD_ZERO_PACKET" it perfectly describes the issue
https://android.googlesource.com/platform/external/libusb/+/e3d0a4c%5E%21/
https://github.com/libusb/libusb/issues/1254
https://stackoverflow.com/questions/70085203/bulk-transfer-sending-too-much-multiple-of-usb-packet

Please advise,
Sanjay

verbose_logs.zip

Sanjay Sheth

unread,

Jan 16, 2025, 9:46:35 PMJan 16

to Richard Hughes, fwupd

Tested with latest 1.9.27 using daemon (fwupd+fwupdmgr), and it works. But for the same setup, daemon failed with the same error for 2.0.0.
In other words, suspecting something broken between those 2 tag releases.

SyncApp on Windows, handles ZLP, investing if that is preventive code or it is required now.

I am still clueless on why fwupdtool works for all tag releases (1.9.X - 2.x.x).

On Wed, Jan 15, 2025 at 2:32 AM Richard Hughes <ric...@hughsie.com> wrote:

On Wednesday, 15 January 2025 at 04:17, 'Sanjay Sheth' via fwupd <fw...@googlegroups.com> wrote:
> It also works with fwupd 2.0.3 when using with fwupdtool.

Ohh interesting; so maybe it's a systemd sandbox issue? e.g. something like https://github.com/fwupd/fwupd/commit/5caa472b9de98b022c85ab0db6d26c3f0a767d20

> But when using fwupd 2.0.3 or later codebase, with fwupd/fwupdmgr daemons, it fails consistently due to USB level failure. Also seeing following error in the console when device is getting added:
> 19:49:33.328 FuEngine failed to add device /sys/devices/pci0000:00/0000:00:14.0/usb2/2-3/2-3:1.0/host0/target0:0:0/0:0:0:0/block/sda: failed to subclass open: failed to open /dev/sda: No medium found
> 19:49:33.334 FuEngine failed to add device /sys/devices/pci0000:00/0000:00:14.0/usb2/2-3/2-3:1.0/host0/target0:0:0/0:0:0:0/block/sda: failed to subclass open: failed to open /dev/sda: No medium found
> 19:49:33.342 FuPluginScsi found ufshci controller at /sys/devices/pci0000:00/0000:00:14.0
> 19:49:33.344 FuEngine failed to add device /sys/devices/pci0000:00/0000:00:14.0/usb2/2-3/2-3:1.0/host0/target0:0:0/0:0:0:0/block/sda: failed to subclass open: failed to open /dev/sda: No medium found

Yes, those sda warnings could be the fadvise64 regression -- but that shouldn't affect logitech_bulkcontroller. If you run the daemon with "sudo ./src/fwupd" you shouldn't see the sda errors as it's not in the sandbox there.

> Different device (e.g. Rally Bar Mini) also giving same results.

How odd -- I saw the "failed to write-firmware: failed to write end transfer packet: not CMD_ACK, got timeout" before Christmas but I didn't think of trying with fwupdtool.

> Is it possible we are experiencing ZLP issue:

I don't see why fwupdtool would work in that case -- you'd think they would both fail. It could be sandbox related, but if so, kinda weird.

> https://libusb.sourceforge.io/api-1.0/group__libusb__asyncio.html#gga1fb47dd0f7c209b60a3609ff0c03d56da26b66334b6ec0537c49841ca623d901f see api "LIBUSB_TRANSFER_ADD_ZERO_PACKET" it perfectly describes the issue

I guess you could try using that -- I don't think the libusb_bulk_transfer() sync call we do allows setting the libusb_transfer_flags -- I think we'd have to use libusb_fill_bulk_transfer() and the other bits of low level API. I'm just surprised it's needed -- I'm guessing you don't do this zero packet on your Windows updater?

Richard.

Richard Hughes

unread,

Jan 17, 2025, 2:41:28 PMJan 17

to Sanjay Sheth, fwupd

On Friday, 17 January 2025 at 02:46, 'Sanjay Sheth' via fwupd <fw...@googlegroups.com> wrote:
> SyncApp on Windows, handles ZLP, investing if that is preventive code or it is required now.

The fwupd hughsie/zlp git branch might be helpful to you. I'm away from my desk now, but I can test with the RB on Monday.

Richard

Sanjay Sheth

unread,

Jan 17, 2025, 3:01:34 PMJan 17

to Richard Hughes, fwupd

Sure thing. Also feel free to incrementally upgrade your RallyBar device. I can provide a few firmware images for testing purposes.

Sanjay Sheth

unread,

Jan 19, 2025, 4:26:27 PMJan 19

to fwupd

Unfortunately, hughsie/zlp git branch didn't helped. I am investigating why fwupdtool always works on my Ubuntu dev box, and also looking at Logitech Windows updater (SyncApp).

I am exploring following logic from Google:
"Rather than sending multiple packet of (X bytes) as the last packet, send 2 packets of length (X-1) bytes & 1 byte"

Richard Hughes

unread,

Jan 22, 2025, 9:32:24 AMJan 22

to Sanjay Sheth, fwupd

On Sunday, 19 January 2025 at 21:26, 'Sanjay Sheth' via fwupd <fw...@googlegroups.com> wrote:
> Unfortunately, hughsie/zlp git branch didn't helped.

I'm assuming you un-commented the FU_USB_DEVICE_PRIVATE_FLAG_ADD_ZERO_PACKET in plugins/logitech-bulkcontroller/fu-logitech-bulkcontroller-device.c?

> I am exploring following logic from Google:
> "Rather than sending multiple packet of (X bytes) as the last packet, send 2 packets of length (X-1) bytes & 1 byte"

Is that for FuLogitechBulkcontrollerCmd.DataTransfer? or the low level fu_logitech_bulkcontroller_device_send()?

Can you share any of the debugging on why you think ZLP is the cause of the firmware install failure? I see `failed to write-firmware: failed to write end transfer packet: not CMD_ACK, got timeout`

Richard

Sanjay Sheth

unread,

Jan 22, 2025, 10:44:09 AMJan 22

to fwupd

On Wednesday, January 22, 2025 at 6:32:24 AM UTC-8 Richard Hughes wrote:

On Sunday, 19 January 2025 at 21:26, 'Sanjay Sheth' via fwupd <fw...@googlegroups.com> wrote:
> Unfortunately, hughsie/zlp git branch didn't helped.

I'm assuming you un-commented the FU_USB_DEVICE_PRIVATE_FLAG_ADD_ZERO_PACKET in plugins/logitech-bulkcontroller/fu-logitech-bulkcontroller-device.c?

Sorry no. Let me retry.

> I am exploring following logic from Google:
> "Rather than sending multiple packet of (X bytes) as the last packet, send 2 packets of length (X-1) bytes & 1 byte"

Is that for FuLogitechBulkcontrollerCmd.DataTransfer? or the low level fu_logitech_bulkcontroller_device_send()?

That logic added by Logitech Windows firmware updater (SyncApp) to resolve CMD_ACK timeout issue for FU_LOGITECH_BULKCONTROLLER_CMD_END_TRANSFER packet.

Logic is needed only when we send last packet of firmware bytes, and only when that packet size is multiple of 512. In other words:
if ((chk_blob % 512) == 0) {

fu_logitech_bulkcontroller_device_upd_send_cmd (....chk_blob - 1...)

fu_logitech_bulkcontroller_device_upd_send_cmd (final byte of chk_blob)

}

Can you share any of the debugging on why you think ZLP is the cause of the firmware install failure? I see `failed to write-firmware: failed to write end transfer packet: not CMD_ACK, got timeout`

I have asked for this detail, waiting for the response.

Richard

Sanjay Sheth

unread,

Jan 22, 2025, 2:42:13 PMJan 22

to fwupd

Un-commenting out (FU_USB_DEVICE_PRIVATE_FLAG_ADD_ZERO_PACKET) didn't helped.
Again, latest 1.9.27 & main branch works with fwupdtool.
hughsie/zlp also works with fwupdtool. But no luck when running with daemon (fwupd+fwupdmgr), always see CMD_ACK timeout issue

Sanjay Sheth

unread,

Jan 23, 2025, 7:19:50 PMJan 23

to fwupd

Another interesting finding from both Google & Logitech dev setup:
"fwupdtool install /path/to/cab" fails,

but following always works:
"fwupdtool --verbose --plugins logitech_bulkcontroller install-blob /path/to/firmware.bin"

Richard Hughes

unread,

Jan 24, 2025, 7:35:46 AMJan 24

to Sanjay Sheth, fwupd, ntsi...@google.com

On Friday, 24 January 2025 at 00:19, 'Sanjay Sheth' via fwupd <fw...@googlegroups.com> wrote:
> Another interesting finding from both Google & Logitech dev setup:
> "fwupdtool install /path/to/cab" fails,
> but following always works:
> "fwupdtool --verbose --plugins logitech_bulkcontroller install-blob /path/to/firmware.bin"

Okay, that's certainly interesting. I've got a release RB so reinstall is an expected failure, but like you say, when installing v1.9.556 onto v1.9.556 I get:

install-blob: failed to write-firmware: firmware upgrade failed (which is expected, as the RB can only upgrade)
install: failed to write-firmware: failed to write end transfer packet: not CMD_ACK, got timeout (which is the bad thing)

I did a full -vv log of both operations, and lightly filtered them (e.g. ts in the JSON blob is different) and attached the diff.txt file

The first two differences seem to be the differing protocol return orders, as documented in https://github.com/fwupd/fwupd/tree/main/plugins/logitech-bulkcontroller#design-notes

The third is a ts difference, and a GUID that changed -- which I think we can ignore.

The fourth is where it gets interesting, and makes sense in the `failed to write end transfer packet` context:

@@ -774906,18 +482009,25 @@ FuPluginLogitechBulkController response:
FuPluginLogitechBulkController request: 03 cc 00 00 97 03 00 00 14 25 3a f7
FuPluginLogitechBulkController response: 01 ff 00 00 04 00 00 00 03 cc 00 00
FuPluginLogitechBulkController request: 04 cc 00 00 24 00 00 00 01 00 00 00
-FuPluginLogitechBulkController response: 02 ff 00 00 08 00 00 00 04 cc 00 00
+FuPluginLogitechBulkController response: 01 ff 00 00 08 00 00 00 04 cc 00 00

..so when:

* using install-blob we send EndTransfer (0xCC04) and get Ack (0xFF01)
* using install we send EndTransfer (0xCC04) and get Timeout (0xFF02)

Recapturing the logs, this time with timestamps turned on I see:

install-blob:

11:48:10.355 FuPluginLogitechBulkController request: 03 cc 00 00 97 03 00 00 14 25 3a f7
11:48:10.356 FuPluginLogitechBulkController response: 01 ff 00 00 04 00 00 00 03 cc 00 00
(1ms)
11:48:12.780 FuPluginLogitechBulkController request: 04 cc 00 00 24 00 00 00 01 00 00 00
11:48:18.578 FuPluginLogitechBulkController response: 01 ff 00 00 08 00 00 00 04 cc 00 00
(5798ms)

install:

11:51:31.356 FuPluginLogitechBulkController request: 03 cc 00 00 97 03 00 00 14 25 3a f7
11:51:31.357 FuPluginLogitechBulkController response: 01 ff 00 00 04 00 00 00 03 cc 00 00
(1ms)
11:51:54.377 FuPluginLogitechBulkController request: 04 cc 00 00 24 00 00 00 01 00 00 00
11:51:54.380 FuPluginLogitechBulkController response: 02 ff 00 00 08 00 00 00 04 cc 00 00
(3ms)

So for some reason, install-blob is asking the hardware to EndTransfer and it returns with Ack in ~6s -- but install is asking the hardware to EndTransfer and it returns in 3ms with Timeout.

Should we retry the FU_LOGITECH_BULKCONTROLLER_CMD_END_TRANSFER if the first one returns with timeout? I'm very confused why install and install-blob should differ here; it's the same device initialization. I even used `--plugins logitech_bulkcontroller` to restrict install down to just the bulkcontroller initialization.

The only difference I can see, from a timing point of view, is that:

install:

- Call FuPlugin->setup() to create the FuPluginLogitechBulkController (716ms)
- decompress cab archive (9125ms)
- write firmware...

install-blob:

- Call FuPlugin->setup() to create the FuPluginLogitechBulkController (730ms)
- write firmware...

I tried to simulate this using something like this:

@@ -1402,6 +1402,9 @@ fu_logitech_bulkcontroller_device_setup(FuDevice *device, GError **error)
+ g_usleep(9125 * 1000);
return TRUE;
}

But install-blob still "succeeded", even when I made the delay much much bigger. So, the timestamp difference between ->setup() and ->write_firmware() is maybe not the issue here.

Perhaps it's the cabinet decompressor or stream operations -- this is a 6GB payload so I guess there might be a >4GB 32 bit bug somewhere. To test this I did a complete -vv of both install and install-blob, this time unlimiting the buffer logging -- so each file is ~6GB in size. This certainly gave "diff" something to chew on.

This gave the other attached file, decompress.patch -- again lightly filtered for clarity. It shows the same differing protocol return orders we saw in the earlier log, and the Timeout (0xFF02) failure -- but doesn't show any differences in the byte data we send the hardware.

At this point I'm a bit stumped -- can you share with me what EndTransfer is actually doing in the firmware, device-side so to speak?

Richard.

diff.txt

decompress.patch

Sanjay Sheth

unread,

Jan 29, 2025, 7:14:37 PMJan 29

to fwupd

Trying to understand, why 1.9.xx always works! Daemons always works for fwupd version 1.9.25, 1.9.27 etc. We see failures for the same CAB files when used with fwupd 2.0.0 (or higher) daemons.

Is it possible to revert libusb version, temporarily, to test if it resolve the issue!

We also tried upgrade with older CollabOS firmware versions: 1.11, 1.12, 1.13. They all worked with 1.9.xx.

We can provide incremental builds for testing upgrade on production device.

For now, ruling out ZLP, since same firmware files works with 1.9.xx

Richard Hughes

unread,

Jan 31, 2025, 8:58:29 AMJan 31

to Sanjay Sheth, fwupd

On Thursday, 30 January 2025 at 00:14, 'Sanjay Sheth' via fwupd <fw...@googlegroups.com> wrote:
> Trying to understand, why 1.9.xx always works

Yes, lets dig into that a bit as that's possibly easier to explain than install/install-blob.

> Is it possible to revert libusb version, temporarily, to test if it resolve the issue!

Using the same library libusb as before, I built 1_9_X and that works with both fwupdtool and fwupdmgr as you reported.

> For now, ruling out ZLP, since same firmware files works with 1.9.xx

I think that's sensible; we can revisit later if needed.

Comparing 1_9_X to main I see four possible differences:

* "FuUsbDevice -> GUsbDevice -> libusb" v.s. "FuUsbDevice -> libusb" -- this isn't trivial to "revert", and most of the code from GUsb was incorporated without functional changes, so lets put this one on the back burner for now. There might be *minute* timing differences, but nothing that should change between "run by install, and run by install-blob".

* We now probe child peripheral devices (e.g. the Sight) -- disabling this in fu_logitech_bulkcontroller_device_json_parser() made no difference.

* We changed the error domains from GIOError to FwupdError -- but we also changed the corresponding g_error_matches() to the new value, so this seems unlikely. We also don't hit fu_logitech_bulkcontroller_device_clear_queue_cb error in the usual runtime case.

* We make the proto buffer predictable (e.g. header_msg->id=0 in fu_logitech_bulkcontroller_proto_manager_set_header) when the device is marked as emulated or is tagged to record emulation data -- but neither of those flags are set in fwupdmgr or fwupdtool mode by default -- not be to be confused with can-emulation-tag, which means "i could if I wanted, rather than am in either mode".

* We changed slightly how the chunking was being done in the conversion from fu_chunk_array_new_from_bytes() to fu_chunk_array_new_from_stream() but as we saw when comparing fwupdtool and fwupdmgr in the big 6GB logs the bytestream sent to the device is identical.

* We changed how we calculate the hash from doing it in one allocated blob to reading from the stream in chunks. This is MD5 so that's perfectly valid, and also would result in a byte-stream difference -- which we don't see.

* We made the sequence ID non-random by incrementing it and storing it per-device (so that the emulation data is predicable). Lets drill into that a bit, although that wouldn't explain install vs install-blob as it's the same codepath:

Comparing the -vv output of 1_9_X has some new differences (as the 1_9_X code is using a random sequence number) but tailing both logs to the last 5000 lines we don't see any differences as we're using the UPD bulk interface fu_logitech_bulkcontroller_device_upd_send_cmd -> fu_logitech_bulkcontroller_device_send,fu_logitech_bulkcontroller_device_recv rather than the SYNC interface as in fu_logitech_bulkcontroller_device_sync_send_cmd. The UPD interface doesn't use sequence numbers, and also finding "04 CC" (EndTransfer) we can see the MD5 data has been calculated in exactly the same way as 1_9_X.

I'm was getting desperate at this point, so I'm starting to bisect -- which for a 15 minute update with 6GB logs is no fun at all. I'll try to do big jumps to minimize the number of writes we have to do. This is going backwards in time, with later lines meaning older fwupd snapshots of main.

* 90e317ca6d2bc21dbffa8ca8e8918e143c1377e5 -> initialized, wrote .cab, but not CMD_ACK, got timeout
* 07cff334338399707fd55a7a3ddef0b81c407700 -> initialized, wrote .cab, but not CMD_ACK, got timeout
* 8eb260e0ed4865fa0b44964b67a8150d446226bd -> failed to probe
* 68ce789667461b206a4e9ff5f38f794bace3bb07 -> failed to probe

It would appear the commits we've pushed in the last year or so really improved the "just detect the RB hardware" as coldplug was getting increasingly unreliable as we got further back.

Going back to the original error, "not CMD_ACK, got timeout" I think you're going to have to tell me what *exactly* the device is doing when it gets the EndTransfer command. I can't really continue to debug this if I don't know what the device is expecting; this is the most time I've ever spent on debugging a custom protocol issue. Given https://github.com/fwupd/fwupd/blob/main/plugins/logitech-bulkcontroller/README.md#design-notes exists suggest there are bigger issues with the protocol design or implementation and I'm out of ideas on what to change client side.

Can you involve the team responsible for the bulk-controller device-side implementation please? I'm out of ideas.

Richard

Sanjay Sheth

unread,

Feb 1, 2025, 4:36:40 AMFeb 1

to fwupd

As for libusb, did you applied older libusb version (one that is bundled with 1.9.X) to 2.x codebase?

Or you applied newer libusb version to 1.9.x codebase.

Please share, how to change libusb before building fwupd.

On my setup I saw that EndTransfer never makes it to device, as USB connectivity goes away as soon as we send last firmware packet.

Also working with RB device team further troubleshooting

Richard Hughes

unread,

Feb 1, 2025, 6:26:00 AMFeb 1

to Sanjay Sheth, fwupd

On Saturday, 1 February 2025 at 09:36, 'Sanjay Sheth' via fwupd <fw...@googlegroups.com> wrote:
> As for libusb, did you applied older libusb version (one that is bundled with 1.9.X) to 2.x codebase?

There's no bundled libusb in fwupd, they're both using the shared system version of /usr/lib64/libusb-1.0.so.0.4.0

> On my setup I saw that EndTransfer never makes it to device, as USB connectivity goes away as soon as we send last firmware packet.

So you mean the device reboots after sending END_TRANSFER? i.e. You get to fu_logitech_bulkcontroller_device_upd_send_cmd(), call fu_logitech_bulkcontroller_device_send() and never get a response from fu_logitech_bulkcontroller_device_recv()?

I'm really confused why that would be difference between fwupd install and fwupd install-blob -- unless there's a race condition making install somehow microseconds slower install-blob or something like that? Although, that would be super-weird, as we're not reading from any streams or doing any file IO inbetween those two call... Is it expected that EndTransfer reboots the device? I would have expected (and the code does) UnInit on the buffer although maybe we shouldn't be doing that if the device is busy rebooting.

> Also working with RB device team further troubleshooting

Great; the two questions I have for them is:

* When the firmware updater calls EndTransfer are we expecting an INIT ACK response? If the answer is "no" then how do we know if the MD5 checksum we sent was correct?

* Should we call UnInit after EndTransfer?

Richard.

Sanjay Sheth

unread,

Feb 3, 2025, 5:40:18 AMFeb 3

to fwupd

Response from device firmware team.

This is the expected flow
INIT --> ACK --> START FILE TRANSFER--> END_PACKET--> ACK --> UNINIT--> UNINIT ACK

Richard Hughes

unread,

Feb 3, 2025, 5:51:14 AMFeb 3

to Sanjay Sheth, fwupd

On Monday, 3 February 2025 at 10:40, 'Sanjay Sheth' via fwupd <fw...@googlegroups.com> wrote:
> This is the expected flow
> INIT --> ACK --> START FILE TRANSFER--> END_PACKET--> ACK --> UNINIT--> UNINIT ACK

I guess you mean END_TRANSFER (0xCC04) not END_PACKET? If so, that's exactly what fwupd is doing. Two more questions for them:

1. In what circumstances can calling END_TRANSFER return with timeout?
2. If END_TRANSFER returns with timeout, should we retry again END_TRANSFER-->ACK? If so, with what polling delay and for how many times?

Richard

> --
> You received this message because you are subscribed to the Google Groups "fwupd" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to fwupd+un...@googlegroups.com.
> To view this discussion, visit https://groups.google.com/d/msgid/fwupd/3c542054-0a9f-42c8-9f8c-cc135a038881n%40googlegroups.com.

Sanjay Sheth

unread,

Feb 5, 2025, 6:25:15 PMFeb 5

to fwupd

Response from device team:

1. In what circumstances can calling END_TRANSFER return with timeout? -->

=====> 1. Previous write was pending and in this case retry will not be helpful

2. Write did went but device was in another state, to ack that
3. Write was successful, but device some how not recived to ack that

2. If END_TRANSFER returns with timeout, should we retry again END_TRANSFER-->ACK? If so, with what polling delay and for how many times?

==> Currently sync app uses 30 sec time out, thats alligned with device

Richard Hughes

unread,

Feb 6, 2025, 9:53:48 AMFeb 6

to Sanjay Sheth, fwupd, Kyle Williams

On Wednesday, 5 February 2025 at 23:25, 'Sanjay Sheth' via fwupd <fw...@googlegroups.com> wrote:
> =====> 1. Previous write was pending and in this case retry will not be helpful

I think this may be the case, but see below:

> 2. Write did went but device was in another state, to ack that
> 3. Write was successful, but device some how not recived to ack that

Did the device team mention anything about the oddities observed in https://github.com/fwupd/fwupd/tree/main/plugins/logitech-bulkcontroller#design-notes -- I wonder if we're not doing something the same as the windows updater. If the Windows updater also has to deal with the replies being out-of-order then any more information would be very welcome.

> 2. If END_TRANSFER returns with timeout, should we retry again END_TRANSFER-->ACK? If so, with what polling delay and for how many times?
> ==> Currently sync app uses 30 sec time out, thats alligned with device

I tried this; retrying END_TRANSFER every 1000ms for 30 tries, and just got:

failed to write-firmware: failed to write end transfer packet: failed after 30 retries: not CMD_ACK, got timeout

i.e. a whole lot of:

13:39:04.513 FuPluginLogitechBulkController request: 04 cc 00 00 24 00 00 00 01 00 00 00
13:39:04.519 FuPluginLogitechBulkController response: 02 ff 00 00 08 00 00 00 04 cc 00 00
13:39:05.520 FuPluginLogitechBulkController request: 04 cc 00 00 24 00 00 00 01 00 00 00
13:39:05.524 FuPluginLogitechBulkController response: 02 ff 00 00 08 00 00 00 04 cc 00 00
13:39:06.524 FuPluginLogitechBulkController request: 04 cc 00 00 24 00 00 00 01 00 00 00
13:39:06.529 FuPluginLogitechBulkController response: 02 ff 00 00 08 00 00 00 04 cc 00 00
13:39:07.529 FuPluginLogitechBulkController request: 04 cc 00 00 24 00 00 00 01 00 00 00
13:39:07.533 FuPluginLogitechBulkController response: 02 ff 00 00 08 00 00 00 04 cc 00 00

Interestingly, the device was doing the transition-to-device-mode "blo-lop" sound the entire time the END_TRANSFER retry loop was running -- which is a HUGE clue.

Also interestingly, while looking at the timestamps when using fwupdtool install foo.cab:

14:14:09.488 FuPluginLogitechBulkController request: 03 cc 00 00 97 03 00 00 14 25 3a f7
14:14:09.491 FuPluginLogitechBulkController response: 01 ff 00 00 04 00 00 00 03 cc 00 00
14:14:35.333 FuPluginLogitechBulkController request: 04 cc 00 00 24 00 00 00 01 00 00 00
14:14:35.337 FuPluginLogitechBulkController response: 02 ff 00 00 08 00 00 00 04 cc 00 00
14:14:36.337 FuPluginLogitechBulkController request: 04 cc 00 00 24 00 00 00 01 00 00 00

That's a *big* delay between the last DataTransfer and the first call to EndTransfer. This can only be the MD5 generation of the stream, and hey, it might explain why it worked on 1.9.x -- on that branch we just load the entire decompressed blob into memory (>4GB) and then I guess we can do the MD5 sum faster. On 1.9.x we're using 2 x blob size, which would be ~9GB for this firmware and why deploying images on budget ChromeBooks won't work with 1.9.x -- but I digress.

I suspect what's happening device side is that if we're in "update mode" there's a get-out-of-jail timer of ~20 seconds -- i.e. if we don't do a DataTransfer or EndTransfer request within 20 seconds then the device figures that the host has hung and reboots so that it still "works" from a user point of view. So lets summarize the root cause:

* In install-blob mode we've loaded the stream from disk, which gets chunked and MD5'd in ~5 seconds.

- This means we complete the DataTransfer->EndTransfer transfer within the 20 second timer -> all okay

* In install mode we load the cabinet from disk, which gets unchunked from cabinet format with each partial stream in a composite stream then chunked again and MD5'd in ~20 seconds. In 2.0.x we're doing this by streaming the data from *file* (which is how we can deploy a 4GB update using only 50MB memory) which is probably a lot slower as we're doing a lot of small seek()'s and read()'s rather than one big multi-GB read.

- This means we DO NOT complete the DataTransfer->EndTransfer transfer within the 20 second timer -> **timeout**

And, for the drum roll -- computing the MD5 from the composite stream BEFORE we start DataTransfer we avoid the device-initiated timeout -- which fixes the bug for me both using install-blob and install.

For future bugs, it would have been super to know there was a 20 second device-side timeout timer.

Richard

Sanjay Sheth

unread,

Feb 7, 2025, 6:47:46 AMFeb 7

to fwupd

Thank you very much @hughsie . Your patient and skills saved the day. I owe you big time.

I am following up with Device team for out-of-order replies handling.

Truly appreciate your help here.

SUBRAT Ojha

unread,

Feb 7, 2025, 7:11:04 AMFeb 7

to fwupd

@hughsie : Logitech Devices already uses direct 30 Sec timer...
private static final long TIMEOUT_MS = TimeUnit.SECONDS.toMillis(30);

Good to hear that issue is sorted out.

Reply all

Reply to author

Forward