While fuzzing with trinity inside a KVM tools guest running the latest linux-next kernel, I've stumbled on the following BUG.
I've also hit a similar trace where the 'BUG_ON(ELV_ON_HASH(rq));' above that list_del_init() gets hit, so I guess it's a race
condition of some sorts.
On Sat, Sep 22, 2012 at 4:35 PM, Sasha Levin <levinsasha...@gmail.com> wrote:
> Hi all,
> While fuzzing with trinity inside a KVM tools guest running the latest linux-next kernel, I've stumbled on the following BUG.
> I've also hit a similar trace where the 'BUG_ON(ELV_ON_HASH(rq));' above that list_del_init() gets hit, so I guess it's a race
> condition of some sorts.
I think this is floppy related (see redo_fd_request() in the stack
trace). And there were quite some changes to the area recently. Adding
maintainer to CC.
> On Sat, Sep 22, 2012 at 4:35 PM, Sasha Levin <levinsasha...@gmail.com> wrote:
> > Hi all,
> > While fuzzing with trinity inside a KVM tools guest running the latest linux-next kernel, I've stumbled on the following BUG.
> > I've also hit a similar trace where the 'BUG_ON(ELV_ON_HASH(rq));' above that list_del_init() gets hit, so I guess it's a race
> > condition of some sorts.
On Mon, 8 Oct 2012, Jan Kara wrote:
> > I'm still seeing this on linux-next.
> I think this is floppy related (see redo_fd_request() in the stack
> trace). And there were quite some changes to the area recently. Adding
> maintainer to CC.
Hmm ... I don't immediately see how this is happening.
Sasha, could you please do git bisect on drivers/block/floppy.c between f6365201d and your git HEAD for starters (assuming that f6365201d works well for you?).
> > On Sat, Sep 22, 2012 at 4:35 PM, Sasha Levin <levinsasha...@gmail.com> wrote:
> > > Hi all,
> > > While fuzzing with trinity inside a KVM tools guest running the latest linux-next kernel, I've stumbled on the following BUG.
> > > I've also hit a similar trace where the 'BUG_ON(ELV_ON_HASH(rq));' above that list_del_init() gets hit, so I guess it's a race
> > > condition of some sorts.
>>> > > I'm still seeing this on linux-next.
>> > I think this is floppy related (see redo_fd_request() in the stack
>> > trace). And there were quite some changes to the area recently. Adding
>> > maintainer to CC.
> Hmm ... I don't immediately see how this is happening.
> Sasha, could you please do git bisect on drivers/block/floppy.c between > f6365201d and your git HEAD for starters (assuming that f6365201d works > well for you?).
A bisect on floppy.c yielded the following:
b33d002f4b6bae912463e5a66387c498aa69b6fe is the first bad commit
commit b33d002f4b6bae912463e5a66387c498aa69b6fe
Author: Ben Hutchings <b...@decadent.org.uk>
Date: Mon Aug 27 20:56:53 2012 -0300
genhd: Make put_disk() safe for disks that have not been registered
> On 10/08/2012 05:45 PM, Jiri Kosina wrote:
>> On Mon, 8 Oct 2012, Jan Kara wrote:
>>>>>> I'm still seeing this on linux-next.
>>>> I think this is floppy related (see redo_fd_request() in the stack
>>>> trace). And there were quite some changes to the area recently. Adding
>>>> maintainer to CC.
>> Hmm ... I don't immediately see how this is happening.
>> Sasha, could you please do git bisect on drivers/block/floppy.c between >> f6365201d and your git HEAD for starters (assuming that f6365201d works >> well for you?).
> A bisect on floppy.c yielded the following:
> b33d002f4b6bae912463e5a66387c498aa69b6fe is the first bad commit
> commit b33d002f4b6bae912463e5a66387c498aa69b6fe
> Author: Ben Hutchings <b...@decadent.org.uk>
> Date: Mon Aug 27 20:56:53 2012 -0300
> genhd: Make put_disk() safe for disks that have not been registered
2 more things:
1. The guest vm which I'm testing on doesn't emulate anything which even looks like a floppy.
2. I'm seeing the following lines before the BUG:
[ 9.836604] floppy0: no floppy controllers found
[ 9.837246] work still pending
[ 9.837743] floppy0: floppy_shutdown: timeout handler died.
On Tue, 2012-10-09 at 09:26 -0400, Sasha Levin wrote:
> On 10/09/2012 09:21 AM, Sasha Levin wrote:
> > On 10/08/2012 05:45 PM, Jiri Kosina wrote:
> >> On Mon, 8 Oct 2012, Jan Kara wrote:
> >>>>>> I'm still seeing this on linux-next.
> >>>> I think this is floppy related (see redo_fd_request() in the stack
> >>>> trace). And there were quite some changes to the area recently. Adding
> >>>> maintainer to CC.
> >> Hmm ... I don't immediately see how this is happening.
> >> Sasha, could you please do git bisect on drivers/block/floppy.c between > >> f6365201d and your git HEAD for starters (assuming that f6365201d works > >> well for you?).
> > A bisect on floppy.c yielded the following:
> > b33d002f4b6bae912463e5a66387c498aa69b6fe is the first bad commit
> > commit b33d002f4b6bae912463e5a66387c498aa69b6fe
> > Author: Ben Hutchings <b...@decadent.org.uk>
> > Date: Mon Aug 27 20:56:53 2012 -0300
> > genhd: Make put_disk() safe for disks that have not been registered
> 2 more things:
> 1. The guest vm which I'm testing on doesn't emulate anything which even looks like a floppy.
> 2. I'm seeing the following lines before the BUG:
> [ 9.836604] floppy0: no floppy controllers found
> [ 9.837246] work still pending
> [ 9.837743] floppy0: floppy_shutdown: timeout handler died.
I see two problems:
1. redo_fd_request() races with tear-down of the disks, but because
set_next_request() checks disk->queue before doing anything this was
usually harmless. Now that do_floppy_init() doesn't clear disk->queue,
the race condition is much easier to hit. This may fix that problem in
do_floppy_init(), though there appear to be worse bugs in tear-down
order in floppy_module_exit():
2. I made a big mistake in using the existing GENHD_FL_UP flag, as it is
cleared by del_gendisk(). Incremental patch below, but it should be
squashed into the previous patch if that branch is still rebase-able.
Ben.
---
From: Ben Hutchings <b...@decadent.org.uk>
Date: Wed, 10 Oct 2012 16:17:01 +0100
Subject: [PATCH] genhd: Make put_disk() safe again for disks that *have* been
registered
Commit b33d002 ('genhd: Make put_disk() safe for disks that have not
been registered') wrongly used the GENHD_FL_UP flag to test whether a
disk held a reference to its queue. Since this is cleared by
del_gendisk(), queues will not be properly cleaned up if a disk has
been registered and then torn down in the normal way. Introduce a
new flag for this purpose.
diff --git a/block/genhd.c b/block/genhd.c
index 633751d..b5f482f 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -617,7 +617,10 @@ void add_disk(struct gendisk *disk)
* Take an extra ref on queue which will be put on disk_release()
* so that it sticks around as long as @disk is there.
*/
- WARN_ON_ONCE(!blk_get_queue(disk->queue));
+ if (blk_get_queue(disk->queue))
+ disk->flags |= GENHD_FL_GOT_QUEUE;
+ else
+ WARN_ON(1);
On Wed, 10 Oct 2012, Ben Hutchings wrote:
> > >>>>>> I'm still seeing this on linux-next.
> > >>>> I think this is floppy related (see redo_fd_request() in the stack
> > >>>> trace). And there were quite some changes to the area recently. Adding
> > >>>> maintainer to CC.
> > >> Hmm ... I don't immediately see how this is happening.
> > >> Sasha, could you please do git bisect on drivers/block/floppy.c between > > >> f6365201d and your git HEAD for starters (assuming that f6365201d works > > >> well for you?).
> > > A bisect on floppy.c yielded the following:
> > > b33d002f4b6bae912463e5a66387c498aa69b6fe is the first bad commit
> > > commit b33d002f4b6bae912463e5a66387c498aa69b6fe
> > > Author: Ben Hutchings <b...@decadent.org.uk>
> > > Date: Mon Aug 27 20:56:53 2012 -0300
> > > genhd: Make put_disk() safe for disks that have not been registered
> > 2 more things:
> > 1. The guest vm which I'm testing on doesn't emulate anything which even looks like a floppy.
> > 2. I'm seeing the following lines before the BUG:
> > [ 9.836604] floppy0: no floppy controllers found
> > [ 9.837246] work still pending
> > [ 9.837743] floppy0: floppy_shutdown: timeout handler died.
> I see two problems:
> 1. redo_fd_request() races with tear-down of the disks, but because
> set_next_request() checks disk->queue before doing anything this was
> usually harmless. Now that do_floppy_init() doesn't clear disk->queue,
> the race condition is much easier to hit. This may fix that problem in
> do_floppy_init(), though there appear to be worse bugs in tear-down
> order in floppy_module_exit():
> 2. I made a big mistake in using the existing GENHD_FL_UP flag, as it is
> cleared by del_gendisk(). Incremental patch below, but it should be
> squashed into the previous patch if that branch is still rebase-able.
Sasha,
did you manage to test this to see if it fixes the symptom you are seeing, please?
On Wed, Oct 10, 2012 at 11:52 AM, Ben Hutchings <b...@decadent.org.uk> wrote:
> On Tue, 2012-10-09 at 09:26 -0400, Sasha Levin wrote:
>> On 10/09/2012 09:21 AM, Sasha Levin wrote:
>> > On 10/08/2012 05:45 PM, Jiri Kosina wrote:
>> >> On Mon, 8 Oct 2012, Jan Kara wrote:
>> >>>>>> I'm still seeing this on linux-next.
>> >>>> I think this is floppy related (see redo_fd_request() in the stack
>> >>>> trace). And there were quite some changes to the area recently. Adding
>> >>>> maintainer to CC.
>> >> Hmm ... I don't immediately see how this is happening.
>> >> Sasha, could you please do git bisect on drivers/block/floppy.c between
>> >> f6365201d and your git HEAD for starters (assuming that f6365201d works
>> >> well for you?).
>> > A bisect on floppy.c yielded the following:
>> > b33d002f4b6bae912463e5a66387c498aa69b6fe is the first bad commit
>> > commit b33d002f4b6bae912463e5a66387c498aa69b6fe
>> > Author: Ben Hutchings <b...@decadent.org.uk>
>> > Date: Mon Aug 27 20:56:53 2012 -0300
>> > genhd: Make put_disk() safe for disks that have not been registered
>> 2 more things:
>> 1. The guest vm which I'm testing on doesn't emulate anything which even looks like a floppy.
>> 2. I'm seeing the following lines before the BUG:
>> [ 9.836604] floppy0: no floppy controllers found
>> [ 9.837246] work still pending
>> [ 9.837743] floppy0: floppy_shutdown: timeout handler died.
> I see two problems:
> 1. redo_fd_request() races with tear-down of the disks, but because
> set_next_request() checks disk->queue before doing anything this was
> usually harmless. Now that do_floppy_init() doesn't clear disk->queue,
> the race condition is much easier to hit. This may fix that problem in
> do_floppy_init(), though there appear to be worse bugs in tear-down
> order in floppy_module_exit():
> 2. I made a big mistake in using the existing GENHD_FL_UP flag, as it is
> cleared by del_gendisk(). Incremental patch below, but it should be
> squashed into the previous patch if that branch is still rebase-able.
> Ben.
> ---
> From: Ben Hutchings <b...@decadent.org.uk>
> Date: Wed, 10 Oct 2012 16:17:01 +0100
> Subject: [PATCH] genhd: Make put_disk() safe again for disks that *have* been
> registered
> Commit b33d002 ('genhd: Make put_disk() safe for disks that have not
> been registered') wrongly used the GENHD_FL_UP flag to test whether a
> disk held a reference to its queue. Since this is cleared by
> del_gendisk(), queues will not be properly cleaned up if a disk has
> been registered and then torn down in the normal way. Introduce a
> new flag for this purpose.
> diff --git a/block/genhd.c b/block/genhd.c
> index 633751d..b5f482f 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -617,7 +617,10 @@ void add_disk(struct gendisk *disk)
> * Take an extra ref on queue which will be put on disk_release()
> * so that it sticks around as long as @disk is there.
> */
> - WARN_ON_ONCE(!blk_get_queue(disk->queue));
> + if (blk_get_queue(disk->queue))
> + disk->flags |= GENHD_FL_GOT_QUEUE;
> + else
> + WARN_ON(1);
On Fri, 2012-10-12 at 13:55 -0400, Sasha Levin wrote:
> Hi Ben,
> On Wed, Oct 10, 2012 at 11:52 AM, Ben Hutchings <b...@decadent.org.uk> wrote:
> > On Tue, 2012-10-09 at 09:26 -0400, Sasha Levin wrote:
> >> On 10/09/2012 09:21 AM, Sasha Levin wrote:
> >> > On 10/08/2012 05:45 PM, Jiri Kosina wrote:
> >> >> On Mon, 8 Oct 2012, Jan Kara wrote:
> >> >>>>>> I'm still seeing this on linux-next.
> >> >>>> I think this is floppy related (see redo_fd_request() in the stack
> >> >>>> trace). And there were quite some changes to the area recently. Adding
> >> >>>> maintainer to CC.
> >> >> Hmm ... I don't immediately see how this is happening.
> >> >> Sasha, could you please do git bisect on drivers/block/floppy.c between
> >> >> f6365201d and your git HEAD for starters (assuming that f6365201d works
> >> >> well for you?).
> >> > A bisect on floppy.c yielded the following:
> >> > b33d002f4b6bae912463e5a66387c498aa69b6fe is the first bad commit
> >> > commit b33d002f4b6bae912463e5a66387c498aa69b6fe
> >> > Author: Ben Hutchings <b...@decadent.org.uk>
> >> > Date: Mon Aug 27 20:56:53 2012 -0300
> >> > genhd: Make put_disk() safe for disks that have not been registered
[...]
> > I see two problems:
> > 1. redo_fd_request() races with tear-down of the disks, but because
> > set_next_request() checks disk->queue before doing anything this was
> > usually harmless. Now that do_floppy_init() doesn't clear disk->queue,
> > the race condition is much easier to hit. This may fix that problem in
> > do_floppy_init(), though there appear to be worse bugs in tear-down
> > order in floppy_module_exit():
[...]
> > 2. I made a big mistake in using the existing GENHD_FL_UP flag, as it is
> > cleared by del_gendisk(). Incremental patch below, but it should be
> > squashed into the previous patch if that branch is still rebase-able.
[...]
> I'm now seeing these instead:
[...]
Sorry, I'm not going to spend more time in the quagmire of the floppy
driver. Whoever has this commit in their tree, please revert or drop it
as appropriate.
Ben.
-- Ben Hutchings
No political challenge can be met by shopping. - George Monbiot
On Fri, 2012-10-12 at 13:55 -0400, Sasha Levin wrote:
> Hi Ben,
> On Wed, Oct 10, 2012 at 11:52 AM, Ben Hutchings <b...@decadent.org.uk> wrote:
> > On Tue, 2012-10-09 at 09:26 -0400, Sasha Levin wrote:
> >> On 10/09/2012 09:21 AM, Sasha Levin wrote:
> >> > On 10/08/2012 05:45 PM, Jiri Kosina wrote:
> >> >> On Mon, 8 Oct 2012, Jan Kara wrote:
> >> >>>>>> I'm still seeing this on linux-next.
> >> >>>> I think this is floppy related (see redo_fd_request() in the stack
> >> >>>> trace). And there were quite some changes to the area recently. Adding
> >> >>>> maintainer to CC.
> >> >> Hmm ... I don't immediately see how this is happening.
> >> >> Sasha, could you please do git bisect on drivers/block/floppy.c between
> >> >> f6365201d and your git HEAD for starters (assuming that f6365201d works
> >> >> well for you?).
> >> > A bisect on floppy.c yielded the following:
> >> > b33d002f4b6bae912463e5a66387c498aa69b6fe is the first bad commit
> >> > commit b33d002f4b6bae912463e5a66387c498aa69b6fe
> >> > Author: Ben Hutchings <b...@decadent.org.uk>
> >> > Date: Mon Aug 27 20:56:53 2012 -0300
> >> > genhd: Make put_disk() safe for disks that have not been registered
[...]
> > I see two problems:
> > 1. redo_fd_request() races with tear-down of the disks, but because
> > set_next_request() checks disk->queue before doing anything this was
> > usually harmless. Now that do_floppy_init() doesn't clear disk->queue,
> > the race condition is much easier to hit. This may fix that problem in
> > do_floppy_init(), though there appear to be worse bugs in tear-down
> > order in floppy_module_exit():
[...]
> > 2. I made a big mistake in using the existing GENHD_FL_UP flag, as it is
> > cleared by del_gendisk(). Incremental patch below, but it should be
> > squashed into the previous patch if that branch is still rebase-able.
[...]
> I'm now seeing these instead:
[...]
Sorry, I'm not going to spend more time in the quagmire of the floppy
driver. Whoever has this commit in their tree, please revert or drop it
as appropriate.
Ben.
-- Ben Hutchings
No political challenge can be met by shopping. - George Monbiot
On Wed, 17 Oct 2012, Ben Hutchings wrote:
> > > 1. redo_fd_request() races with tear-down of the disks, but because
> > > set_next_request() checks disk->queue before doing anything this was
> > > usually harmless. Now that do_floppy_init() doesn't clear disk->queue,
> > > the race condition is much easier to hit. This may fix that problem in
> > > do_floppy_init(), though there appear to be worse bugs in tear-down
> > > order in floppy_module_exit():
> [...]
> > > 2. I made a big mistake in using the existing GENHD_FL_UP flag, as it is
> > > cleared by del_gendisk(). Incremental patch below, but it should be
> > > squashed into the previous patch if that branch is still rebase-able.
> [...]
> > I'm now seeing these instead:
> [...]
> Sorry, I'm not going to spend more time in the quagmire of the floppy
> driver. Whoever has this commit in their tree, please revert or drop it
> as appropriate.
As far as I can tell, Jens has pulled it from me, but it hasn't made it into Linus' tree as of today.
I will do it in my tree and send a new pull request to Jens.
>>>> 1. redo_fd_request() races with tear-down of the disks, but because
>>>> set_next_request() checks disk->queue before doing anything this was
>>>> usually harmless. Now that do_floppy_init() doesn't clear disk->queue,
>>>> the race condition is much easier to hit. This may fix that problem in
>>>> do_floppy_init(), though there appear to be worse bugs in tear-down
>>>> order in floppy_module_exit():
>> [...]
>>>> 2. I made a big mistake in using the existing GENHD_FL_UP flag, as it is
>>>> cleared by del_gendisk(). Incremental patch below, but it should be
>>>> squashed into the previous patch if that branch is still rebase-able.
>> [...]
>>> I'm now seeing these instead:
>> [...]
>> Sorry, I'm not going to spend more time in the quagmire of the floppy
>> driver. Whoever has this commit in their tree, please revert or drop it
>> as appropriate.
> As far as I can tell, Jens has pulled it from me, but it hasn't made it > into Linus' tree as of today.
> I will do it in my tree and send a new pull request to Jens.
I did not add the patch from Ben, as it was reported as not working. My
driver pull is late this time due to travel, but it'll go out start of
next week. So if you have pending floppy updates that are tested at that
time, then please do send them my way.
On Wed, 17 Oct 2012, Jens Axboe wrote:
> >>>> 1. redo_fd_request() races with tear-down of the disks, but because
> >>>> set_next_request() checks disk->queue before doing anything this was
> >>>> usually harmless. Now that do_floppy_init() doesn't clear disk->queue,
> >>>> the race condition is much easier to hit. This may fix that problem in
> >>>> do_floppy_init(), though there appear to be worse bugs in tear-down
> >>>> order in floppy_module_exit():
> >> [...]
> >>>> 2. I made a big mistake in using the existing GENHD_FL_UP flag, as it is
> >>>> cleared by del_gendisk(). Incremental patch below, but it should be
> >>>> squashed into the previous patch if that branch is still rebase-able.
> >> [...]
> >>> I'm now seeing these instead:
> >> [...]
> >> Sorry, I'm not going to spend more time in the quagmire of the floppy
> >> driver. Whoever has this commit in their tree, please revert or drop it
> >> as appropriate.
> > As far as I can tell, Jens has pulled it from me, but it hasn't made it > > into Linus' tree as of today.
> > I will do it in my tree and send a new pull request to Jens.
> I did not add the patch from Ben, as it was reported as not working. My
> driver pull is late this time due to travel, but it'll go out start of
> next week. So if you have pending floppy updates that are tested at that
> time, then please do send them my way.
Sorry for the delay on my side as well. I am now working on reverting Ben's original patch and doing some testing. Will be sending you a pull request still tonight.