Issues with syzkaller run

649 views
Skip to first unread message

Danielle Ratson

unread,
Jun 24, 2020, 5:07:19 AM6/24/20
to Dmitry Vyukov, mlxsw, syzkaller
Hi Dmitry,

Recently we experience a lot of issues when running syzkaller.

All the last runs were interrupted in some point and were stopped after very few hours.

First of all we get a lot of crashes with the description "lost connection to test machine", example log is attached.

Second, after a while we get "failed to open /dev/vfio/27: Device or resource busy" error during the run which causes the system to fail executing afterwards.


Note that we are running the syzkaller from a switch.

Do you know those issues and can advise us how to solve it?

Thanks,
Danielle
log.txt

Dmitry Vyukov

unread,
Jun 24, 2020, 6:57:30 AM6/24/20
to Danielle Ratson, mlxsw, syzkaller
On Wed, Jun 24, 2020 at 11:07 AM Danielle Ratson <dani...@mellanox.com> wrote:
>
> Hi Dmitry,
>
> Recently we experience a lot of issues when running syzkaller.
>
> All the last runs were interrupted in some point and were stopped after very few hours.
>
> First of all we get a lot of crashes with the description "lost connection to test machine", example log is attached.

Hi Danielle,

The timeout for this command is 1 minute:
https://github.com/google/syzkaller/blob/master/pkg/host/features.go#L110
And it does not do much, it should complete within a second or so.

I would suggest to run that command manually on the target and see how
long it takes. If it takes long, why?

> Note that we are running the syzkaller from a switch.

What does it mean "from a switch"?
If it's a single physical machine, maybe it gets broken and that's why
executor times out. Is it healthy when these errors happen?


> Second, after a while we get "failed to open /dev/vfio/27: Device or resource busy" error during the run which causes the system to fail executing afterwards.

Where do you see this error? What are surrounding log messages? What
exactly do you mean by "system to fail executing"? syz-manager should
reboot it and reboot should heal any such bugs, does it?

Danielle Ratson

unread,
Jun 24, 2020, 8:25:30 AM6/24/20
to Dmitry Vyukov, mlxsw, syzkaller
On 6/24/2020 1:57 PM, Dmitry Vyukov wrote:
> On Wed, Jun 24, 2020 at 11:07 AM Danielle Ratson <dani...@mellanox.com> wrote:
>> Hi Dmitry,
>>
>> Recently we experience a lot of issues when running syzkaller.
>>
>> All the last runs were interrupted in some point and were stopped after very few hours.
>>
>> First of all we get a lot of crashes with the description "lost connection to test machine", example log is attached.
> Hi Danielle,
>
> The timeout for this command is 1 minute:
> https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fgoogle%2Fsyzkaller%2Fblob%2Fmaster%2Fpkg%2Fhost%2Ffeatures.go%23L110&amp;data=02%7C01%7Cdanieller%40mellanox.com%7Ccc3b8b485a18441bae9508d8182d6450%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C637285930524219330&amp;sdata=1XtxqvuBsZfQPNIgtsN7EBlLYv5b%2BcJbT2LL%2BGRUC4c%3D&amp;reserved=0
> And it does not do much, it should complete within a second or so.
>
> I would suggest to run that command manually on the target and see how
> long it takes. If it takes long, why?


This crash doesn't happen all the time. Once in a while it is losing connection and reconnecting afterwards itself.


>
>> Note that we are running the syzkaller from a switch.
> What does it mean "from a switch"?
> If it's a single physical machine, maybe it gets broken and that's why
> executor times out. Is it healthy when these errors happen?


I mean a physical machine indeed, but the machine is healthy when it happens.


>
>
>> Second, after a while we get "failed to open /dev/vfio/27: Device or resource busy" error during the run which causes the system to fail executing afterwards.
> Where do you see this error? What are surrounding log messages? What
> exactly do you mean by "system to fail executing"? syz-manager should
> reboot it and reboot should heal any such bugs, does it?


I see this error in the run logs (after running bin/syz-manager -config), you can see the full error below:

2020/06/24 15:00:41 loop: phase=4 shutdown=false instances=1/1 [0] repro: pending=0 reproducing=0 queued=0
2020/06/24 15:00:41 loop: starting instance 0
2020/06/24 15:00:46 loop: instance 0 finished, crash=false
2020/06/24 15:00:46 failed to create instance: failed to read from qemu: EOF
qemu-system-x86_64: -device vfio-pci,host=06:00.0,addr=0x10: vfio 0000:06:00.0: failed to open /dev/vfio/27: Device or resource busy


I mean that from that point the syzkaller keeps repeating the error and doesn't recover by itself.

We need to stop the run, reboot and run syzkaller again.

Bottom line after few ~3 hours of run, this issue occurs again and we can't have a longer stable syzkaller execution which misses the whole point.



Dmitry Vyukov

unread,
Jun 24, 2020, 11:37:03 AM6/24/20
to Danielle Ratson, mlxsw, syzkaller
On Wed, Jun 24, 2020 at 2:25 PM Danielle Ratson <dani...@mellanox.com> wrote:
>
> On 6/24/2020 1:57 PM, Dmitry Vyukov wrote:
> > On Wed, Jun 24, 2020 at 11:07 AM Danielle Ratson <dani...@mellanox.com> wrote:
> >> Hi Dmitry,
> >>
> >> Recently we experience a lot of issues when running syzkaller.
> >>
> >> All the last runs were interrupted in some point and were stopped after very few hours.
> >>
> >> First of all we get a lot of crashes with the description "lost connection to test machine", example log is attached.
> > Hi Danielle,
> >
> > The timeout for this command is 1 minute:
> > https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fgoogle%2Fsyzkaller%2Fblob%2Fmaster%2Fpkg%2Fhost%2Ffeatures.go%23L110&amp;data=02%7C01%7Cdanieller%40mellanox.com%7Ccc3b8b485a18441bae9508d8182d6450%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C637285930524219330&amp;sdata=1XtxqvuBsZfQPNIgtsN7EBlLYv5b%2BcJbT2LL%2BGRUC4c%3D&amp;reserved=0
> > And it does not do much, it should complete within a second or so.
> >
> > I would suggest to run that command manually on the target and see how
> > long it takes. If it takes long, why?
>
>
> This crash doesn't happen all the time. Once in a while it is losing connection and reconnecting afterwards itself.

Maybe we need to bump that timeout a bit. However, I never seen it
firing before. So it seems to be particularly slow in your setup from
time to time.
Does it help if you dump that timeout to 3m? 5m?



> >> Note that we are running the syzkaller from a switch.
> > What does it mean "from a switch"?
> > If it's a single physical machine, maybe it gets broken and that's why
> > executor times out. Is it healthy when these errors happen?
>
> I mean a physical machine indeed, but the machine is healthy when it happens.

Low rate of various failures like this seems to be inevitable for now,
it's just too expensive to debug and address each episodic failure. We
are getting this on syzbot as well, you may see these here:
https://syzkaller.appspot.com/bug?id=b97ec15bfe317ac1ddccb41f2a913d4f7a31c6d7


> >> Second, after a while we get "failed to open /dev/vfio/27: Device or resource busy" error during the run which causes the system to fail executing afterwards.
> > Where do you see this error? What are surrounding log messages? What
> > exactly do you mean by "system to fail executing"? syz-manager should
> > reboot it and reboot should heal any such bugs, does it?
>
>
> I see this error in the run logs (after running bin/syz-manager -config), you can see the full error below:
>
> 2020/06/24 15:00:41 loop: phase=4 shutdown=false instances=1/1 [0] repro: pending=0 reproducing=0 queued=0
> 2020/06/24 15:00:41 loop: starting instance 0
> 2020/06/24 15:00:46 loop: instance 0 finished, crash=false
> 2020/06/24 15:00:46 failed to create instance: failed to read from qemu: EOF
> qemu-system-x86_64: -device vfio-pci,host=06:00.0,addr=0x10: vfio 0000:06:00.0: failed to open /dev/vfio/27: Device or resource busy
>
>
> I mean that from that point the syzkaller keeps repeating the error and doesn't recover by itself.
>
> We need to stop the run, reboot and run syzkaller again.
>
> Bottom line after few ~3 hours of run, this issue occurs again and we can't have a longer stable syzkaller execution which misses the whole point.

This looks more like either qemu bug, or you host kernel bug related
to /dev/vfio.
syz-manager simply killls one qemu and starts another. If qemu is
broken and can't start anymore, there is little syz-manager can do.
Potentially syzkaller has discovered a bug in your host kernel via fuzzing.
I would try to diagnose the host machine: are there left-over zombie
qemu processes? Can you open and work with that /dev/vfio/27? Are
there any bug messages in host dmesg? Are there and hanged
tasks/kernel threads on host? Production kernels has very limited
debugging capabilities, so it may help to build your host kernel with
all of debugging configs like hung task detection, etc. Then maybe you
will see something in dmesg.

Danielle Ratson

unread,
Jun 25, 2020, 10:10:52 AM6/25/20
to Dmitry Vyukov, mlxsw, syzkaller
Hi Dmitry,

Thanks for all the suggestions, for now we indeed increase the timeout to 3 mins and it looks much better.
We ran it for few hours now and it only failed twice, while before it failed dozens time in the same period of time.

Moreover, we are not experiencing the "device or resource is busy" at the moment either.
If will get it again ill investigate according to your suggestions.

Thank you very much for the help,
Danielle

> -----Original Message-----
> From: Dmitry Vyukov <dvy...@google.com>
> Sent: Wednesday, June 24, 2020 6:37 PM
> To: Danielle Ratson <dani...@mellanox.com>
> Cc: mlxsw <ml...@mellanox.com>; syzkaller
> <syzk...@googlegroups.com>
> Subject: Re: Issues with syzkaller run
>
> On Wed, Jun 24, 2020 at 2:25 PM Danielle Ratson <dani...@mellanox.com>
> wrote:
> >
> > On 6/24/2020 1:57 PM, Dmitry Vyukov wrote:
> > > On Wed, Jun 24, 2020 at 11:07 AM Danielle Ratson
> <dani...@mellanox.com> wrote:
> > >> Hi Dmitry,
> > >>
> > >> Recently we experience a lot of issues when running syzkaller.
> > >>
> > >> All the last runs were interrupted in some point and were stopped after
> very few hours.
> > >>
> > >> First of all we get a lot of crashes with the description "lost connection to
> test machine", example log is attached.
> > > Hi Danielle,
> > >
> > > The timeout for this command is 1 minute:
> > > https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgi
> > >
> thub.com%2Fgoogle%2Fsyzkaller%2Fblob%2Fmaster%2Fpkg%2Fhost%2Ffea
> ture
> > >
> s.go%23L110&amp;data=02%7C01%7Cdanieller%40mellanox.com%7Cc30bd1
> 7e54
> > >
> 5b41b5750d08d818547175%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C
> 0%7C
> > >
> 637286098239975437&amp;sdata=p98MWB0N%2Fyvddljvls3AbhC71XUHqHN
> eijGac
> > > k42E2M%3D&amp;reserved=0 And it does not do much, it should
> complete
> > > within a second or so.
> > >
> > > I would suggest to run that command manually on the target and see
> > > how long it takes. If it takes long, why?
> >
> >
> > This crash doesn't happen all the time. Once in a while it is losing
> connection and reconnecting afterwards itself.
>
> Maybe we need to bump that timeout a bit. However, I never seen it firing
> before. So it seems to be particularly slow in your setup from time to time.
> Does it help if you dump that timeout to 3m? 5m?
>
>
>
> > >> Note that we are running the syzkaller from a switch.
> > > What does it mean "from a switch"?
> > > If it's a single physical machine, maybe it gets broken and that's
> > > why executor times out. Is it healthy when these errors happen?
> >
> > I mean a physical machine indeed, but the machine is healthy when it
> happens.
>
> Low rate of various failures like this seems to be inevitable for now, it's just
> too expensive to debug and address each episodic failure. We are getting
> this on syzbot as well, you may see these here:
> https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsyzka
> ller.appspot.com%2Fbug%3Fid%3Db97ec15bfe317ac1ddccb41f2a913d4f7a31c
> 6d7&amp;data=02%7C01%7Cdanieller%40mellanox.com%7Cc30bd17e545b41
> b5750d08d818547175%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7
> C637286098239985433&amp;sdata=l6WPC7JFKTJYV8jWZjunCKIbJWLi83tPRDH
> kyeEy9%2B4%3D&amp;reserved=0

Danielle Ratson

unread,
Jun 28, 2020, 4:08:30 AM6/28/20
to Dmitry Vyukov, mlxsw, syzkaller
Hi Dmitry,

The syzkaller ran all the weekend so it seems like the fact we increased the timeout helped a lot.

There is a possibility that you can increase permanently the timeout to 3 mins in the code?

Thanks,
Danielle

Dmitry Vyukov

unread,
Jun 29, 2020, 1:14:58 AM6/29/20
to Danielle Ratson, mlxsw, syzkaller

Danielle Ratson

unread,
Jun 29, 2020, 12:02:35 PM6/29/20
to Dmitry Vyukov, mlxsw, syzkaller
Thank you very much.

> -----Original Message-----
> From: Dmitry Vyukov <dvy...@google.com>
> Sent: Monday, June 29, 2020 8:15 AM
> To: Danielle Ratson <dani...@mellanox.com>
> Cc: mlxsw <ml...@mellanox.com>; syzkaller
> <syzk...@googlegroups.com>
> Subject: Re: Issues with syzkaller run
>
> Done in
> https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithu
> b.com%2Fgoogle%2Fsyzkaller%2Fcommit%2F0375051c1460f374bc036ac17e8f
> ec15dece5e8e&amp;data=02%7C01%7Cdanieller%40mellanox.com%7C3f7a0
> 640ed1d4475011a08d81beb5daa%7Ca652971c7d2e4d9ba6a4d149256f461b%7
> C0%7C1%7C637290044983797915&amp;sdata=X%2FtU3VZigiXsAAmNmwwZ
> MVJjWGEXDzqQ5WnFuZTB3og%3D&amp;reserved=0

Danielle Ratson

unread,
Jul 1, 2020, 3:37:33 AM7/1/20
to Dmitry Vyukov, mlxsw, syzkaller

Unfortunately, that problem is still happening.

First, we checked and there was indeed a qemu zombie but after killing the parent process and running syzkaller again, it failed again after a while with no left-over qemu zombie processes.

I am not sure what do you mean by open and work with /dev/vfio/27 as it is not a file but when i ran cat on it, it returned "cat: /dev/vfio/27: Device or resource busy".

In addition, dmesg had some issues as below:

[ 8047.834211] kvm [4540]: vcpu0, guest rIP: 0xffffffffaa7003be disabled perfctr wrmsr: 0xc2 data 0xffff
[11755.697984] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x19@0x100
[11763.892489] kvm [6294]: vcpu0, guest rIP: 0xffffffffbb5003be disabled perfctr wrmsr: 0xc2 data 0xffff
[15469.812566] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x19@0x100
[15476.692546] kvm [8054]: vcpu0, guest rIP: 0xffffffffb3b003be disabled perfctr wrmsr: 0xc2 data 0xffff
[19184.116522] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x19@0x100

Please see full dmesg attached.



            
see something in dmesg.


dmesg

Dmitry Vyukov

unread,
Jul 1, 2020, 3:42:53 AM7/1/20
to Danielle Ratson, mlxsw, syzkaller
I meant that you are presumably sharing /dev/vfio/27 with the guest
via qemu in some way.
Qemu works this file on the host in some way, I suggested to debug the
state of /dev/vfio/27 on the host in some way.
But I really don't know much about this (and I don't really know what
is this "this" really is). But so far it looks like qemu or host
kernel bug.
Reply all
Reply to author
Forward
0 new messages