On Wed, Jun 24, 2020 at 2:25 PM Danielle Ratson <dani...@mellanox.com> wrote:
>
> On 6/24/2020 1:57 PM, Dmitry Vyukov wrote:
> > On Wed, Jun 24, 2020 at 11:07 AM Danielle Ratson <dani...@mellanox.com> wrote:
> >> Hi Dmitry,
> >>
> >> Recently we have been experiencing a lot of issues when running syzkaller.
> >>
> >> All of the recent runs were interrupted at some point and stopped after only a few hours.
> >>
> >> First of all, we get a lot of crashes with the description "lost connection to test machine"; an example log is attached.
> > Hi Danielle,
> >
> > The timeout for this command is 1 minute:
> >
> > https://github.com/google/syzkaller/blob/master/pkg/host/features.go#L110
> > And it does not do much, it should complete within a second or so.
> >
> > I would suggest running that command manually on the target to see how
> > long it takes. If it takes long, why?
>
>
> This crash doesn't happen all the time. Once in a while it loses the connection and then reconnects by itself.
Maybe we need to bump that timeout a bit. However, I have never seen it
firing before. So it seems to be particularly slow in your setup from
time to time.
Does it help if you bump that timeout to 3m? 5m?
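
To see how slow the command gets on the target, something like the rough sketch below could help. It only times an arbitrary command repeatedly and reports the slowest run. The helper names (time_command, slowest_run) and the defaults of 20 runs and a 300 second limit are made up for illustration and are not part of syzkaller; the command to run is whatever features.go invokes, passed on the command line.

```python
import subprocess
import sys
import time


def time_command(argv, timeout_s):
    """Run argv once and return (elapsed_seconds, timed_out)."""
    start = time.monotonic()
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        timed_out = True
    return time.monotonic() - start, timed_out


def slowest_run(argv, runs, timeout_s):
    """Run argv several times and return the (elapsed, timed_out) of the slowest run."""
    results = [time_command(argv, timeout_s) for _ in range(runs)]
    return max(results, key=lambda r: r[0])


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Hypothetical defaults: 20 runs, 300 s limit per run.
        elapsed, timed_out = slowest_run(sys.argv[1:], runs=20, timeout_s=300)
        print(f"slowest run: {elapsed:.2f}s, timed out: {timed_out}")
    else:
        print("usage: timecmd.py <command> [args...]")
```

If the slowest run is regularly far above a second, that would point at the target rather than at the 1 minute limit.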
> >> Note that we are running the syzkaller from a switch.
> > What does "from a switch" mean?
> > If it's a single physical machine, maybe it gets broken and that's why
> > the executor times out. Is it healthy when these errors happen?
>
> Yes, I mean a physical machine, and the machine is healthy when it happens.
A low rate of various failures like this seems to be inevitable for now;
it's just too expensive to debug and address each episodic failure. We
are getting this on syzbot as well, you can see them here:
https://syzkaller.appspot.com/bug?id=b97ec15bfe317ac1ddccb41f2a913d4f7a31c6d7
> >> Second, after a while we get a "failed to open /dev/vfio/27: Device or resource busy" error during the run, which causes the system to fail executing afterwards.
> > Where do you see this error? What are the surrounding log messages? What
> > exactly do you mean by "system to fail executing"? syz-manager should
> > reboot it, and a reboot should heal any such bugs. Does it?
>
>
> I see this error in the run logs (after running bin/syz-manager -config), you can see the full error below:
>
> 2020/06/24 15:00:41 loop: phase=4 shutdown=false instances=1/1 [0] repro: pending=0 reproducing=0 queued=0
> 2020/06/24 15:00:41 loop: starting instance 0
> 2020/06/24 15:00:46 loop: instance 0 finished, crash=false
> 2020/06/24 15:00:46 failed to create instance: failed to read from qemu: EOF
> qemu-system-x86_64: -device vfio-pci,host=06:00.0,addr=0x10: vfio 0000:06:00.0: failed to open /dev/vfio/27: Device or resource busy
>
>
> I mean that from that point on syzkaller keeps repeating the error and doesn't recover by itself.
>
> We need to stop the run, reboot and run syzkaller again.
>
> Bottom line: after ~3 hours of running, this issue occurs again, so we can't get a longer stable syzkaller run, which defeats the whole purpose.
This looks more like either a qemu bug or a host kernel bug related
to /dev/vfio.
syz-manager simply kills one qemu and starts another. If qemu is
broken and can't start anymore, there is little syz-manager can do.
Potentially syzkaller has discovered a bug in your host kernel via fuzzing.
I would try to diagnose the host machine: are there left-over zombie
qemu processes? Can you open and work with that /dev/vfio/27? Are
there any bug messages in host dmesg? Are there any hung
tasks/kernel threads on the host? Production kernels have very limited
debugging capabilities, so it may help to build your host kernel with
all of the debugging configs like hung task detection, etc. Then maybe you
will see something in dmesg.
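
The host checks above could be scripted roughly as follows. This is only a sketch under a few assumptions: it runs on the Linux host (ideally as root), the function names are made up for illustration, and the list of dmesg patterns is a guess at what is worth flagging, not an exhaustive set. The device path is the one from your log.

```python
import os
import re
import subprocess

# Assumed, non-exhaustive markers of kernel trouble in dmesg output.
DMESG_PATTERNS = ("BUG:", "WARNING:", "Oops", "blocked for more than", "Call Trace")


def find_qemu_processes(proc_root="/proc"):
    """Return (pid, state, name) for each process whose name starts with 'qemu'.

    State 'Z' means zombie, 'D' means uninterruptible sleep (possibly hung).
    """
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # Format is "pid (comm) state ..."; comm itself may contain spaces.
        m = re.match(r"\d+ \((.*)\) (\S)", stat, re.S)
        if m and m.group(1).startswith("qemu"):
            found.append((int(entry), m.group(2), m.group(1)))
    return found


def try_open(path):
    """Try to open path read-write; return None on success, else the error text."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as e:
        return e.strerror
    os.close(fd)
    return None


def suspicious_dmesg_lines(text):
    """Return the lines of dmesg output that contain any of DMESG_PATTERNS."""
    return [ln for ln in text.splitlines() if any(p in ln for p in DMESG_PATTERNS)]


if __name__ == "__main__":
    print("qemu processes:", find_qemu_processes())
    print("open /dev/vfio/27:", try_open("/dev/vfio/27") or "ok")
    try:
        out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    except OSError:
        out = ""  # dmesg is not available on this machine
    for line in suspicious_dmesg_lines(out):
        print(line)
```

Running it right after the "Device or resource busy" error first appears, before rebooting the host, should give the most useful picture.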