Increasing unreliability of booting via AOE


glid...@googlemail.com

5 March 2023, 12:02:11
to kiwi
I realise this problem is unlikely to be due to a fault in kiwi, but this group is likely a good place to find expertise regarding AOE.

Some background on how I'm using kiwi. I use kiwi to create disc images for my MythTV-based digital TV recorder frontend (the frontend is just for playback; a server does the recording). I have kiwi create a live-cd image: the use of a tmpfs overlay above a read-only image works well because it allows me to just power off without worrying about shutting down cleanly. There's a directory where some information is cached, and to avoid it having to be constantly recreated I make it an NFS mount of a directory on the server.

Most of the time the live-cd image lives on an SSD within the frontend device and all works perfectly, but every now and then I want to test out a new updated live-cd image. To do that I make the new image available on the server via vblade and reboot the device in a different mode to use AOE. The booting mode is controlled via grub, with the grub config held on the server. Once I've tested the new image for a while, I ssh into the frontend device and use

  dd if=/dev/etherd/e0.1 of=/dev/sda bs=1M

to copy the new image to the SSD, and lastly revert the temporary grub config change to go back to booting from the SSD.

This all works very nicely, never having to take the SSD out of the device or even connect a keyboard to it.
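
A quick check after the dd step could look like the sketch below. It is only an illustration under assumptions: the helper name verify_copy is hypothetical, and it relies on the -n option of GNU cmp to compare only as many bytes as the image holds, since the SSD is normally larger than the image.

```shell
#!/bin/sh
# Sketch: check that the first <size of source> bytes of the target match the source.
# verify_copy is a hypothetical helper name, not part of kiwi or the AOE tools.
verify_copy() {
    src=$1
    dst=$2
    # Size of the source in bytes: blockdev for block devices, wc -c for plain files.
    if [ -b "$src" ]; then
        size=$(blockdev --getsize64 "$src")
    else
        size=$(wc -c < "$src" | tr -d ' ')
    fi
    # cmp -n limits the comparison to the first $size bytes,
    # so a target that is larger than the source is fine.
    if cmp -n "$size" "$src" "$dst"; then
        echo "copy verified: $size bytes match"
    else
        echo "copy differs from source" >&2
        return 1
    fi
}
```

On the frontend this would be called as root with verify_copy /dev/etherd/e0.1 /dev/sda, ideally after running sync so that the dd output has reached the SSD.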

The problem is that booting the image via AOE has become increasingly unreliable. It often stalls, and at various points in the boot process.

Sometimes it stalls at "dracut_initqueue" (actually it always stalls there for at least a minute, but can stick there for good and the log lines below it never appear).

  Feb 25 14:35:51 slurp dracut-initqueue[614]: DHCP is finished successfully
  Feb 25 14:36:46 slurp kernel: aoe: e0.1: setting 1024 byte data frames
  Feb 25 14:36:46 slurp kernel: aoe: 408d5c3fb96a e0.1 v4019 has 1662976 sectors
  Feb 25 14:36:46 slurp kernel: etherd/e0.1: p1 p2

Sometimes it stalls at "Switch Boot"

Sometimes it stalls after giving a log in prompt, but without ever starting the graphical interface.

Sometimes it boots and works, but I cannot ssh into it.

Sometimes I can ssh into it, but the "dd" command to copy the image to SSD stalls.

All this has gotten slowly worse over a period of a few years. Over that time the live-cd image has become larger, and I've moved to later versions of openSUSE Leap and later versions of kiwi. I may have made small tweaks to my network. There was a time in the past when I used AOE exclusively and the appliance was mostly reliable, but it's now getting close to unusable.

Can anyone suggest anything I might try towards making it work better?



Tony Rome

5 March 2023, 12:49:29
to kiwi
Yes, I have experienced the same issue with failures when using an SSD: the media is likely failing to read sectors.
If you can mount the device read-write and run fsck on it,
you will likely find errors.
Mounting read-only does not help.


glid...@googlemail.com

6 March 2023, 04:42:58
to kiwi
Thanks for the suggestion, but my problem is the other way around: the system works fine running from SSD, but intermittently deadlocks running from AOE.

Tony Rome

6 March 2023, 12:40:16
to kiwi
Yes, OK. AOE is the company name for Flashrom Technology - Microchip Technology Inc.
So it is likely the same idea: the device cannot be read and the filesystem is not stable, i.e. bad sectors.
I assume the image uses an ext filesystem. If not, what is it? And why are you choosing Linux for this?

I would use a blank filesystem and run a read-write test on it on the AOE chip, then post the results.

glid...@googlemail.com

22 April 2023, 09:41:23
to kiwi
Marcus made some suggestions, via email, as to how to trace the cause of these problems. One of the suggestions was to test with qemu. I've just tried, but can't get past an error. I did this:

sudo ip link add br0 type bridge
sudo ip addr add 10.0.2.1/16 dev br0
sudo ip link set br0 up
sudo ip link set br0 promisc on
<restart vblade configured to use br0>
qemu-system-x86_64 -m 1024 -boot order=n -netdev bridge,id=net0,br=br0 -device virtio-net-pci,netdev=net0 -kernel /tftpboot/boot/linux -initrd /tftpboot/boot/initrd -append "rd.kiwi.live.pxe root=live:AOEINTERFACE=e0.1" -nographic

The error I see is "-netdev bridge,id=net0,br=br0: bridge helper failed"

I tried adding "allow br0" to /etc/qemu/bridge.conf but that didn't help.

The commands I'm using could be completely wrong. I totally failed to find any instructions that I could understand. These commands were established in a long back-and-forth conversation with ChatGPT.
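
For readers hitting the same "bridge helper failed" error: it comes from qemu-bridge-helper, which needs enough privilege to create a tap device (it is commonly installed setuid root) as well as an "allow" line in the bridge.conf it reads. One way to sidestep the helper, which is not necessarily how the problem was solved later in this thread, is to create the tap device by hand and hand it to QEMU. The sketch below assumes a tap device named tap0 (a made-up name); br0, the kernel and initrd paths, and the kernel command line are copied from the commands above. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: attach QEMU to br0 through a pre-created tap device
# instead of going through qemu-bridge-helper.
# tap0 is an assumed device name; everything else mirrors the post above.

# Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a tap device owned by an unprivileged user and enslave it to the bridge.
setup_tap() {
    tap=$1
    bridge=$2
    owner=$3
    run sudo ip tuntap add dev "$tap" mode tap user "$owner"
    run sudo ip link set "$tap" master "$bridge"
    run sudo ip link set "$tap" up
}

# Boot the kiwi kernel and initrd with the tap device as the only NIC.
# script=no,downscript=no stops QEMU from running its default ifup/ifdown scripts.
boot_vm() {
    tap=$1
    run qemu-system-x86_64 -m 1024 -boot order=n \
        -netdev "tap,id=net0,ifname=$tap,script=no,downscript=no" \
        -device virtio-net-pci,netdev=net0 \
        -kernel /tftpboot/boot/linux -initrd /tftpboot/boot/initrd \
        -append "rd.kiwi.live.pxe root=live:AOEINTERFACE=e0.1" \
        -nographic
}
```

A typical sequence would be setup_tap tap0 br0 "$USER" followed by boot_vm tap0, with vblade already exporting on br0 as in the original commands.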

glid...@googlemail.com

6 May 2023, 15:35:09
to kiwi
I eventually managed to get qemu to work, and my appliance seems to boot up just fine in that environment. It doesn't actually work because it needs network access to a server, but it gets into a graphical setup screen, past all the stages where the live system tends to fail. So it looks like I have a network problem. I hadn't previously realised that AOE has no packet loss fallback, and that AOE might fail on a network that seems fine with TCP-based protocols and ones that have their own resend capability. The route between server and client in this case goes through two switches and two cat 5e cable runs of about 8 meters, one of them interrupted by a wall plate. I'm tempted to try replacing the long cables with cat 8, but given the hassle involved it will be very disappointing if it doesn't help. I don't really know how to go about fault finding this.
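
One low-effort starting point for this kind of fault finding is to look at the error and drop counters the kernel keeps for each interface under /sys/class/net/<interface>/statistics. The sketch below is only an illustration: nic_errors is a hypothetical helper name, and its optional second argument exists purely so the function can be pointed at a test directory instead of /sys/class/net.

```shell
#!/bin/sh
# Sketch: print the kernel's error and drop counters for one network interface.
# nic_errors is a hypothetical helper name, not an existing tool.
nic_errors() {
    iface=$1
    base=${2:-/sys/class/net}
    for counter in rx_errors rx_dropped rx_crc_errors tx_errors tx_dropped; do
        file="$base/$iface/statistics/$counter"
        # Skip counters that are missing or unreadable for this interface.
        if [ -r "$file" ]; then
            echo "$counter=$(cat "$file")"
        fi
    done
}
```

Running, say, nic_errors eth0 (the interface name is an assumption) on the server before and after a failed boot attempt, and on the client when a shell is reachable, shows whether any counter has grown. Growing counters would point towards loss at the link level; unchanged counters do not rule out loss elsewhere on the path.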

glid...@googlemail.com

7 May 2023, 15:08:38
to kiwi
The obvious next test occurred to me: I tried joining the server and client directly together via a short ethernet cable, with no switches involved (actually tried two different cables). It still didn't boot reliably. I still assume network packet loss is the problem, so it must be one or both of the NICs at fault. I bet it isn't uncommon to have some small amount of packet loss. I'm now thinking that AOE isn't really a practical option. Is there any way I can try NBD with kiwi?

Marcus Schäfer

21 May 2023, 12:33:03
to 'glid...@googlemail.com' via kiwi
Hi,

sorry to hear that AoE is causing you so much trouble

> Is there any way I can try NBD with kiwi?

I haven't done it for some time but it worked for me in the past
and I documented it here:

https://osinside.github.io/kiwi/working_with_images/network_overlay_boot.html

search for "Export via NBD"

Hope this helps

Regards,
Marcus

glid...@googlemail.com

22 May 2023, 04:58:24
to kiwi
On Sunday, May 21, 2023 at 5:33:03 PM UTC+1 Marcus wrote:
> Hi,
>
> sorry to hear that AoE is causing you so much trouble
>
> > Is there any way I can try NBD with kiwi?
>
> I haven't done it for some time but it worked for me in the past
> and I documented it here:
>
> https://osinside.github.io/kiwi/working_with_images/network_overlay_boot.html
>
> search for "Export via NBD"
>
> Hope this helps

Oh, I wasn't aware of that. Thanks. I will give that a try.

Regarding AOE, I have one more test I want to make, to rule out something being wrong with the network card in the device I use for the appliance, but pending that test, I can't seem to make it work on any real network. I'm wondering if something has changed somewhere (in the kernel maybe) that means AOE just doesn't work any more on real networks, and works only in the idealised case of using a bridge and VM. On the other hand, I have no concrete knowledge of the kernel, and I can't think abstractly what might lead to such behaviour.

Regards, Paul.