Suggestions for improving syzkaller's reporting (was Re: possible deadlock (4))

53 views
Skip to first unread message

Theodore Ts'o

unread,
Jan 17, 2018, 11:00:23 PM1/17/18
to Dmitry Vyukov, syzk...@googlegroups.com
On Mon, Jan 15, 2018 at 12:46:52AM -0800, syzbot wrote:
>
> syzkaller hit the following crash on
> 2c1cfa49901839136e578ca516a7e230182da024
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers

(1) Saying that a .config is attached isn't really useful. What's
especially interesting is whether or not it reproducible or not, and
if so, whether it's available as a C reproducer, or just as a
syzkaller reproducer. So I'd suggest editing down the above.


(2) I would suggest moving the "See https://goo.gl/kgGztJ for
information about syzkaller reproducers" to the syzkaller reproducer
as a comment. In fact, it's not immediately obvious that the
repro.txt is in fact the syzkaller reproducer. So labelling it as
such with the pointer for more information there would be useful.


(3) I would suggest moving the .config towards the end, since it's not
terribly information-dense. Forcing the developer to scroll past five
thousand plus lines of .config can be... irritating.

Also, using "savedefconfig" to create a more minimal config file[1]
(with the instruction to use "make olddefconfig" to reconstitute the
full config) might also be helpful. For example, [2] is only 174
lines and 4k, instead of 5,128 lines and 178k.

[1] https://lwn.net/Articles/397363/
[2] https://github.com/tytso/xfstests-bld/blob/master/kernel-configs/x86_64-config-4.14


(4) The console output with the WARN_ON or BUG is going to be the most
useful. The fact that it's located after the config file, and has a
content-type of application/octet-string, has already been discussed.
Moving it so that it is one of the first things that can be read in
the mail message would be incredibly helpful.


(5) In this case, the console log was cut off too soon:

[ 25.520375] ======================================================
[ 25.520376] WARNING: possible circular locking dependency detected
[ 25.520381] 4.15.0-rc7+ #261 Not tainted
[ 25.520383] ----------------------------------------------

The full lockdep output was not included, and that's the most
interesting part! Very often from looking at the lockdep output, the
developer can often figure out what went wrong, and perhaps craft a
fix much faster. Without the full lockdep output, you are forcing the
developer to reproduce the failure, which means building a kernel with
the given config file, running that kernel in a VM, and then building
the reproducer and running it in the kernel. This can easily take 30+
minutes before the developer can even start trying to figuring out
what the problem might be.


(6) Something else to consider is whether *all* of this information
needs to be in the mail message. If the full console log is too
large, and you can only include a subset of the log, perhaps the full
log can be made available via Google Cloud Storage?

For example, in gce-xfstests the full test artifiacts are available at
a URL such as:

gs://gce-xfstests/results/results.tytso-20180107192932.4.15.0-rc4-xfstests-00005-g22446423108f.tar.xz

... although the user just has to run a command such as:

gce-xfstests get-results --unpack tytso-20180107192932

... where "tytso-20180107192932" is in the e-mail report in a place
where it can be easily cut and pasted into a command line to get the
full test results downloaded and unpacked into a directory in /tmp.

----------------------------

In summary, if you can structure the e-mail so the most interesting
bits are towards the beginning of the message, and in application/text
form so it is easily read without having to take special action ---
and if it includes all of the interesting bits of the lockdep report,
it's likely this will make syzkaller reports much more likely to be
acted upon.

Hopefully most of the suggestions in this e-mail are relatively low
effort for the Syzkaller team, and will have a large impact out of
proportion of the work needed to implement them.

Cheers,

- Ted

Dmitry Vyukov

unread,
Jan 18, 2018, 4:08:45 AM1/18/18
to Theodore Ts'o, syzkaller
On Thu, Jan 18, 2018 at 5:00 AM, Theodore Ts'o <ty...@mit.edu> wrote:
> On Mon, Jan 15, 2018 at 12:46:52AM -0800, syzbot wrote:
>>
>> syzkaller hit the following crash on
>> 2c1cfa49901839136e578ca516a7e230182da024
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached
>> Raw console output is attached.
>> C reproducer is attached
>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>> for information about syzkaller reproducers

Hi Ted,

Thanks for constructive feedback!
Let's start with easier ones first:

> (1) Saying that a .config is attached isn't really useful. What's
> especially interesting is whether or not it reproducible or not, and
> if so, whether it's available as a C reproducer, or just as a
> syzkaller reproducer. So I'd suggest editing down the above.

Done:
https://github.com/google/syzkaller/commit/2666e00902033fd61964c24f3444b94cf25cb6de

Now this block looks as:

So far this crash happened 3 times on repo1/branch1, repo2/branch2.
C reproducer is attached.
syzkaller reproducer is attached.
Raw console output is attached.
compiler: compiler2
.config is attached.

You can see full verbatim text of different types of emails in
dashboard/app/email_test.go.
Note: the "So far this crash happened..." line is new. Added due to a
popular request.


> (2) I would suggest moving the "See https://goo.gl/kgGztJ for
> information about syzkaller reproducers" to the syzkaller reproducer
> as a comment. In fact, it's not immediately obvious that the
> repro.txt is in fact the syzkaller reproducer. So labelling it as
> such with the pointer for more information there would be useful.

Done:
https://github.com/google/syzkaller/commit/b134896b20b5f52e418e0221bf356d6f7a8be47b
Now "See ... for more information" is the first line of syzkaller
repro file itself and that text is removed from email.


> (3) I would suggest moving the .config towards the end, since it's not
> terribly information-dense. Forcing the developer to scroll past five
> thousand plus lines of .config can be... irritating.

I've moved .config as the last attachment:
https://github.com/google/syzkaller/commit/0d1e95d2bafd3fb8cf589e3c6b9f5a116fdce673
Now it is: raw.log.txt, repro.syz.txt, repro.c.txt, config.txt.

Dmitry Vyukov

unread,
Jan 18, 2018, 4:47:16 AM1/18/18
to Theodore Ts'o, syzkaller
On Thu, Jan 18, 2018 at 10:08 AM, Dmitry Vyukov <dvy...@google.com> wrote:
> On Thu, Jan 18, 2018 at 5:00 AM, Theodore Ts'o <ty...@mit.edu> wrote:
>> On Mon, Jan 15, 2018 at 12:46:52AM -0800, syzbot wrote:
>>>
>>> syzkaller hit the following crash on
>>> 2c1cfa49901839136e578ca516a7e230182da024
>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>> compiler: gcc (GCC) 7.1.1 20170620
>>> .config is attached
>>> Raw console output is attached.
>>> C reproducer is attached
>>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>> for information about syzkaller reproducers
>
> Hi Ted,
>
> Thanks for constructive feedback!
> Let's start with easier ones first:

Now to the harder ones:
There are pros and cons both ways.
One one hand, the same amount of info in a more concise form is always better.
On the other hand, this makes part of info not readily available in
the report. I think one of use patterns is grepping config to
understand if this or that config is enabled. If we minimize config
this becomes impossible using the report itself. Or even worse will
send a developer a wrong path because he thinks that a particular
config is not enabled, sure not all people will read a remark about
savedefconfig. To grep config one will now need to do what you
describe below as a negative pattern: fetch kernel, checkout config,
copy that file here, run that sequence of commands, etc.
Perhaps a better solution for this is what you propose below: just
provide links to verbatim, gzipped files.


>> (4) The console output with the WARN_ON or BUG is going to be the most
>> useful. The fact that it's located after the config file, and has a
>> content-type of application/octet-string, has already been discussed.
>> Moving it so that it is one of the first things that can be read in
>> the mail message would be incredibly helpful.

This now has text/plain content type. raw.log.txt is now the first
attachment. And the symbolized kernel report itself is in the email
text inline as well.

Dmitry Vyukov

unread,
Jan 18, 2018, 4:55:46 AM1/18/18
to Theodore Ts'o, syzkaller, Greg Kroah-Hartman
I am between 2 fires (well, actually 4000 fires).
You suggest to move the info to web, but, for example, +Greg said just
yesterday "I don't visit random web sites for random bugs. Please
include all of the information in the email itself":
https://groups.google.com/d/msg/syzkaller/RyeMEEzwQF8/TsZgxepLDQAJ

On my side making these attachments links sounds good. That would
reduce email size considerably, remove all problems with Content-type,
and we could simply serve .txt files with Transfer-encoding: gzip, so
that you don't need to unpack them manually, but amount of traffic is
minimal.
But I afraid that tomorrow somebody else will say "why the hell you
are pointing me to other sites, just include all info in the email".
And then somebody else will say do some third option. If we can get
buy-in from some core group of kernel developers, then I can do this.
But otherwise it's probably that just one size does not fit all.

Dmitry Vyukov

unread,
Jan 18, 2018, 5:25:33 AM1/18/18
to Theodore Ts'o, syzkaller, Greg Kroah-Hartman
On Thu, Jan 18, 2018 at 10:55 AM, Dmitry Vyukov <dvy...@google.com> wrote:
>>>> (5) In this case, the console log was cut off too soon:
>>>>
>>>> [ 25.520375] ======================================================
>>>> [ 25.520376] WARNING: possible circular locking dependency detected
>>>> [ 25.520381] 4.15.0-rc7+ #261 Not tainted
>>>> [ 25.520383] ----------------------------------------------
>>>>
>>>> The full lockdep output was not included, and that's the most
>>>> interesting part! Very often from looking at the lockdep output, the
>>>> developer can often figure out what went wrong, and perhaps craft a
>>>> fix much faster. Without the full lockdep output, you are forcing the
>>>> developer to reproduce the failure, which means building a kernel with
>>>> the given config file, running that kernel in a VM, and then building
>>>> the reproducer and running it in the kernel. This can easily take 30+
>>>> minutes before the developer can even start trying to figuring out
>>>> what the problem might be.


It's not that it's not included, there were no subsequent output. This
is a problem. But this is a problem more with kernel rather than with
syzkaller. syzkaller always waits for 10 seconds after the appearance
of the "WARNING:" line to gather whole report.
But sometimes kernel hangs during reporting. It's hard to simply
understand that there is a bug at all, yes, as weird as it sound. You
can't say when a bug report starts and when it ends. You can't extract
some stable identity of a crash to be able to dedup them. "BUG:" means
a bug... unless it's actually part of "DEBUG:". Not all bugs are
prefixed with the common BUG/WARNING at all, sometimes people decide
"oh, here, I will just invent my own format for messages". "INFO:" is
a "nice" one, it can mean literally some info, but also deadlocks and
stalls. Worse, frequently reports are intermixed with other reports,
or with just unrelated output to the form where it's not possible to
make any sense of them. I can show you full freak show of kernel bug
reports that are very hard to make any sense of:

https://gist.githubusercontent.com/dvyukov/6e08ac521f3e19534970ed97aeee1603/raw/0f0bb361902de94e7ee331ac500a3ceebf812c22/gistfile1.txt
https://gist.githubusercontent.com/dvyukov/6e9db2313e48773ad1cd861da8020008/raw/d5b7c023fc8a38c72b1cf8bb1da85fb1c31cea5f/gistfile1.txt
https://gist.githubusercontent.com/dvyukov/3d1bda4c690414ac027de1da45759751/raw/2c68980eabf4f6be24060e807a75f2d3570b5a42/gistfile1.txt
https://gist.githubusercontent.com/dvyukov/9b8831e9ac73ffafa111a33ad40c5667/raw/f4097fbea8f89b25a282a6ef7e648145e10ae4b7/gistfile1.txt
https://gist.githubusercontent.com/dvyukov/da1e42436af9ad2afc7de49f2d503510/raw/7dd4cbcc651c5b87122f066a3c689999ae8c4121/gistfile1.txt
https://gist.githubusercontent.com/dvyukov/4571b94bd8cbd78d759412c560fa395c/raw/964c73fc993fc8a9000571e0b7618000584f3638/gistfile1.txt
https://gist.githubusercontent.com/dvyukov/b6deac5faa958ae3733413b34dd5feed/raw/c4da219e284f7fc55da8c3c3af623a87f31bf653/gistfile1.txt
https://gist.githubusercontent.com/dvyukov/2f54c6a2e45347ea76d9c5ce3c0ff091/raw/45f4873898ec8e0d9aa16b9c5c63a85410fd05e0/gistfile1.txt
https://gist.githubusercontent.com/dvyukov/96cb39e29124dbbe2a65a91ec7a5639e/raw/aa8f7b2b1dfa5b8bb8cf93d8a821ca9938e8fc54/gistfile1.txt
https://gist.githubusercontent.com/dvyukov/424da8282d5b28f8be10eab595d37444/raw/acc2fb1ececc1ea9a8215213f7e37e08b524c096/gistfile1.txt
https://gist.githubusercontent.com/dvyukov/b07f37720c632d6d56ae67d95e5599b3/raw/8624ba47d6eb4e7d4d58e3ae1242ebe6cc46d361/gistfile1.txt
https://gist.githubusercontent.com/dvyukov/bc24a7b92289ec04587fb29fc1085045/raw/3136e9262ee2233b5ab369a4a82e83953fc2d8a2/gistfile1.txt

This part causes constant pain and maintenance for us. Take a look at
our kernel output parsing:
https://github.com/google/syzkaller/blob/master/pkg/report/linux.go
It's total mess, and it's still incomplete and mishandles some cases.
The worst cases are when title is corrupted, say you get "[
110.647055] BUG: unable to handle kernel [ 110.647086] device gre0
entered promiscuous mode". So we say "aha it's a BUG:" and the bug
identify itself as "unable to handle kernel [ 110.647086] device gre0
entered promiscuous mode". We never seen this bug before, so we need
to report it a new one.

I wanted to discuss this problem with kernel community, but so far I
was busy with other stuff. For example, fuchsia kernel at least stops
other CPUs before printing a crash report (though, not perfectly too,
but better than linux).

But back to that particular report, syzkaller actually understood that
it's "corrupted", you can see test case for this here:
https://github.com/google/syzkaller/blob/master/pkg/report/testdata/linux/report/152
But the reason why it was reported, it's because it has a reproducer.
So one can run the reproducer and extract any other info about the
bug.
Foreseeing your question, yes, we actually do re-run reproducers
several times if we think that the report is corrupted. Yet still
that's what we got.

Greg Kroah-Hartman

unread,
Jan 18, 2018, 9:07:57 AM1/18/18
to Dmitry Vyukov, Theodore Ts'o, syzkaller
This is a real problem, and one that we have never had before, and need
to resolve. The movement to the standard dev_*() calls for logging
messages was great, but we need to come to some sort of understanding as
to what WARN_ON() means. Is this a kernel bug? An API usage bug?
Something else?

Otherwise the work you are doing is going to keep running into these
issues, knowing exactly what type of message is thrown when there is a
"real" problem is an important thing to get working properly.

So any suggestions you might have here would be greatly appreciated.

thanks,

greg k-h

Dmitry Vyukov

unread,
Jan 23, 2018, 3:08:46 AM1/23/18
to Greg Kroah-Hartman, Theodore Ts'o, syzkaller
Hi Greg,

I wrote "Distinguishing kernel bugs from invalid inputs" email to LKML
some time ago:
https://www.spinics.net/lists/kernel/msg2628643.html
But it did not get any responses.

Now my plan is to send a patch that clarifies comments in
"include/asm-generic/bug.h" to say that WARN* must not be used for
invalid inputs of any kind and should mean kernel bugs. I don't know
if it's a good place to outline general guidelines (e.g. what exactly
to use for invalid inputs), or it better be something in
Documentation/.
Do you have any suggestions on how to move forward with this? Send
another email to LKML?

Greg Kroah-Hartman

unread,
Jan 23, 2018, 3:20:40 AM1/23/18
to Dmitry Vyukov, Theodore Ts'o, syzkaller
On Tue, Jan 23, 2018 at 09:08:24AM +0100, Dmitry Vyukov wrote:
> > Otherwise the work you are doing is going to keep running into these
> > issues, knowing exactly what type of message is thrown when there is a
> > "real" problem is an important thing to get working properly.
> >
> > So any suggestions you might have here would be greatly appreciated.
>
>
> Hi Greg,
>
> I wrote "Distinguishing kernel bugs from invalid inputs" email to LKML
> some time ago:
> https://www.spinics.net/lists/kernel/msg2628643.html
> But it did not get any responses.

emails without patches are on the bottom of my list of things to respond
to, for obvious reasons :(

> Now my plan is to send a patch that clarifies comments in
> "include/asm-generic/bug.h" to say that WARN* must not be used for
> invalid inputs of any kind and should mean kernel bugs. I don't know
> if it's a good place to outline general guidelines (e.g. what exactly
> to use for invalid inputs), or it better be something in
> Documentation/.
> Do you have any suggestions on how to move forward with this? Send
> another email to LKML?

A patch like that would be great, along with a patch series that starts
to clean up the existing WARN* users to not use that for debugging.
Much like the sysfs patch I posted yesterday.

So send patches, that is how we respond. We tend to ignore thoughts
about how to change things for the most part.

thanks,

greg k-h
Reply all
Reply to author
Forward
0 new messages