Errors and later panics in 2.6.0-test11.

0 zobrazení
Přeskočit na první nepřečtenou zprávu

David Martínez Moreno

nepřečteno,
3. 12. 2003 8:23:0103.12.03
komu:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello again. I'm testing 2.6.0-test11 in one of my servers. In about a day or
so under a web/FTP server load, the kernel starts to spit messages:

Dec 2 22:07:25 ulises kernel: Bad page state at prep_new_page
Dec 2 22:07:25 ulises kernel: flags:0x0102002c mapping:d50c8a28 mapped:0 count:1
Dec 2 22:07:25 ulises kernel: Backtrace:
Dec 2 22:07:25 ulises kernel: Call Trace:
Dec 2 22:07:25 ulises kernel: [bad_page+93/133] bad_page+0x5d/0x85
Dec 2 22:07:25 ulises kernel: [prep_new_page+50/81] prep_new_page+0x32/0x51
Dec 2 22:07:25 ulises kernel: [buffered_rmqueue+165/264] buffered_rmqueue+0xa5/0x108
Dec 2 22:07:25 ulises kernel: [find_get_page+26/37] find_get_page+0x1a/0x25
Dec 2 22:07:25 ulises kernel: [__alloc_pages+166/788] __alloc_pages+0xa6/0x314
Dec 2 22:07:25 ulises kernel: [do_wp_page+168/674] do_wp_page+0xa8/0x2a2
Dec 2 22:07:25 ulises kernel: [handle_mm_fault+293/308] handle_mm_fault+0x125/0x134
Dec 2 22:07:25 ulises kernel: [do_page_fault+288/1264] do_page_fault+0x120/0x4f0
Dec 2 22:07:25 ulises kernel: [dentry_open+246/334] dentry_open+0xf6/0x14e
Dec 2 22:07:25 ulises kernel: [filp_open+98/100] filp_open+0x62/0x64
Dec 2 22:07:25 ulises kernel: [do_fcntl+188/394] do_fcntl+0xbc/0x18a
Dec 2 22:07:25 ulises kernel: [sys_fcntl64+87/149] sys_fcntl64+0x57/0x95
Dec 2 22:07:25 ulises kernel: [do_page_fault+0/1264] do_page_fault+0x0/0x4f0
Dec 2 22:07:25 ulises kernel: [error_code+45/56] error_code+0x2d/0x38
Dec 2 22:07:25 ulises kernel:
Dec 2 22:07:25 ulises kernel: Trying to fix it up, but a reboot is needed
Dec 2 22:07:28 ulises kernel: Bad page state at prep_new_page
Dec 2 22:07:28 ulises kernel: flags:0x0102002c mapping:d348b8a8 mapped:0 count:1
Dec 2 22:07:28 ulises kernel: Backtrace:
Dec 2 22:07:28 ulises kernel: Call Trace:
Dec 2 22:07:28 ulises kernel: [bad_page+93/133] bad_page+0x5d/0x85
Dec 2 22:07:28 ulises kernel: [prep_new_page+50/81] prep_new_page+0x32/0x51
Dec 2 22:07:28 ulises kernel: [buffered_rmqueue+165/264] buffered_rmqueue+0xa5/0x108
Dec 2 22:07:28 ulises kernel: [__alloc_pages+166/788] __alloc_pages+0xa6/0x314
Dec 2 22:07:28 ulises kernel: [__get_free_pages+31/65] __get_free_pages+0x1f/0x41
Dec 2 22:07:28 ulises kernel: [__pollwait+133/198] __pollwait+0x85/0xc6
Dec 2 22:07:28 ulises kernel: [datagram_poll+43/202] datagram_poll+0x2b/0xca
Dec 2 22:07:28 ulises kernel: [sock_poll+41/49] sock_poll+0x29/0x31
Dec 2 22:07:28 ulises kernel: [do_pollfd+140/144] do_pollfd+0x8c/0x90
Dec 2 22:07:28 ulises kernel: [do_poll+97/192] do_poll+0x61/0xc0
Dec 2 22:07:28 ulises kernel: [sys_poll+405/651] sys_poll+0x195/0x28b
Dec 2 22:07:28 ulises kernel: [__pollwait+0/198] __pollwait+0x0/0xc6
Dec 2 22:07:28 ulises kernel: [sysenter_past_esp+82/113] sysenter_past_esp+0x52/0x71
Dec 2 22:07:28 ulises kernel:
Dec 2 22:07:28 ulises kernel: Trying to fix it up, but a reboot is needed
Dec 2 22:07:45 ulises kernel: Bad page state at prep_new_page
Dec 2 22:07:45 ulises kernel: flags:0x0102002c mapping:d348b8a8 mapped:0 count:1
Dec 2 22:07:45 ulises kernel: Backtrace:
Dec 2 22:07:45 ulises kernel: Call Trace:
Dec 2 22:07:45 ulises kernel: [bad_page+93/133] bad_page+0x5d/0x85
Dec 2 22:07:45 ulises kernel: [prep_new_page+50/81] prep_new_page+0x32/0x51
Dec 2 22:07:45 ulises kernel: [buffered_rmqueue+165/264] buffered_rmqueue+0xa5/0x108
Dec 2 22:07:45 ulises kernel: [__alloc_pages+166/788] __alloc_pages+0xa6/0x314
Dec 2 22:07:45 ulises kernel: [do_wp_page+168/674] do_wp_page+0xa8/0x2a2
Dec 2 22:07:45 ulises kernel: [handle_mm_fault+293/308] handle_mm_fault+0x125/0x134
Dec 2 22:07:45 ulises kernel: [do_page_fault+288/1264] do_page_fault+0x120/0x4f0
Dec 2 22:07:45 ulises kernel: [sock_destroy_inode+27/31] sock_destroy_inode+0x1b/0x1f
Dec 2 22:07:45 ulises kernel: [destroy_inode+53/80] destroy_inode+0x35/0x50
Dec 2 22:07:45 ulises kernel: [iput+85/111] iput+0x55/0x6f
Dec 2 22:07:45 ulises kernel: [dput+210/347] dput+0xd2/0x15b
Dec 2 22:07:45 ulises kernel: [__fput+121/188] __fput+0x79/0xbc
Dec 2 22:07:45 ulises kernel: [sys_setresuid+227/369] sys_setresuid+0xe3/0x171
Dec 2 22:07:45 ulises kernel: [do_page_fault+0/1264] do_page_fault+0x0/0x4f0
Dec 2 22:07:45 ulises kernel: [error_code+45/56] error_code+0x2d/0x38
Dec 2 22:07:45 ulises kernel:
Dec 2 22:07:45 ulises kernel: Trying to fix it up, but a reboot is needed
Dec 2 22:07:56 ulises kernel: Bad page state at prep_new_page
Dec 2 22:07:56 ulises kernel: flags:0x0102002c mapping:d50c8a28 mapped:0 count:1
Dec 2 22:07:56 ulises kernel: Backtrace:
Dec 2 22:07:56 ulises kernel: Call Trace:
Dec 2 22:07:56 ulises kernel: [bad_page+93/133] bad_page+0x5d/0x85
Dec 2 22:07:56 ulises kernel: [prep_new_page+50/81] prep_new_page+0x32/0x51
Dec 2 22:07:56 ulises kernel: [buffered_rmqueue+165/264] buffered_rmqueue+0xa5/0x108
Dec 2 22:07:56 ulises kernel: [__alloc_pages+166/788] __alloc_pages+0xa6/0x314
Dec 2 22:07:56 ulises kernel: [do_wp_page+168/674] do_wp_page+0xa8/0x2a2
Dec 2 22:07:56 ulises kernel: [handle_mm_fault+293/308] handle_mm_fault+0x125/0x134
Dec 2 22:07:56 ulises kernel: [do_page_fault+288/1264] do_page_fault+0x120/0x4f0
Dec 2 22:07:56 ulises kernel: [dentry_open+246/334] dentry_open+0xf6/0x14e
Dec 2 22:07:56 ulises kernel: [filp_open+98/100] filp_open+0x62/0x64
Dec 2 22:07:56 ulises kernel: [sys_open+126/139] sys_open+0x7e/0x8b
Dec 2 22:07:56 ulises kernel: [do_page_fault+0/1264] do_page_fault+0x0/0x4f0
Dec 2 22:07:56 ulises kernel: [error_code+45/56] error_code+0x2d/0x38
Dec 2 22:07:56 ulises kernel:
Dec 2 22:07:56 ulises kernel: Trying to fix it up, but a reboot is needed
[...]

After that, probably due to these errors, the kernel starts to panic:

Dec 2 22:11:28 ulises kernel: Bad page state at prep_new_page
Dec 2 22:11:28 ulises kernel: flags:0x0102002c mapping:d50c8a28 mapped:0 count:1
Dec 2 22:11:28 ulises kernel: Backtrace:
Dec 2 22:11:28 ulises kernel: Call Trace:
Dec 2 22:11:28 ulises kernel: [bad_page+93/133] bad_page+0x5d/0x85
Dec 2 22:11:28 ulises kernel: [prep_new_page+50/81] prep_new_page+0x32/0x51
Dec 2 22:11:28 ulises kernel: [buffered_rmqueue+165/264] buffered_rmqueue+0xa5/0x108
Dec 2 22:11:28 ulises kernel: [__alloc_pages+331/788] __alloc_pages+0x14b/0x314
Dec 2 22:11:28 ulises kernel: [do_page_cache_readahead+221/265] do_page_cache_readahead+0xdd/0x109
Dec 2 22:11:28 ulises kernel: [kfree_skbmem+36/44] kfree_skbmem+0x24/0x2c
Dec 2 22:11:28 ulises kernel: [page_cache_readahead+190/335] page_cache_readahead+0xbe/0x14f
Dec 2 22:11:28 ulises kernel: [do_generic_mapping_read+186/927] do_generic_mapping_read+0xba/0x39f
Dec 2 22:11:28 ulises kernel: [file_read_actor+0/234] file_read_actor+0x0/0xea
Dec 2 22:11:28 ulises kernel: [__generic_file_aio_read+444/494] __generic_file_aio_read+0x1bc/0x1ee
Dec 2 22:11:28 ulises kernel: [file_read_actor+0/234] file_read_actor+0x0/0xea
Dec 2 22:11:28 ulises kernel: [xfs_read+346/620] xfs_read+0x15a/0x26c
Dec 2 22:11:28 ulises kernel: [linvfs_read+141/159] linvfs_read+0x8d/0x9f
Dec 2 22:11:28 ulises kernel: [do_sync_read+139/183] do_sync_read+0x8b/0xb7
Dec 2 22:11:28 ulises kernel: [inet_setsockopt+54/58] inet_setsockopt+0x36/0x3a
Dec 2 22:11:28 ulises kernel: [sys_setsockopt+120/178] sys_setsockopt+0x78/0xb2
Dec 2 22:11:28 ulises kernel: [vfs_read+176/281] vfs_read+0xb0/0x119
Dec 2 22:11:28 ulises kernel: [sys_read+66/99] sys_read+0x42/0x63
Dec 2 22:11:28 ulises kernel: [sysenter_past_esp+82/113] sysenter_past_esp+0x52/0x71
Dec 2 22:11:28 ulises kernel:
Dec 2 22:11:28 ulises kernel: Trying to fix it up, but a reboot is needed
Dec 2 22:11:29 ulises kernel: ------------[ cut here ]------------
Dec 2 22:11:29 ulises kernel: kernel BUG at mm/vmscan.c:280!
Dec 2 22:11:29 ulises kernel: invalid operand: 0000 [#2]
Dec 2 22:11:29 ulises kernel: CPU: 0
Dec 2 22:11:29 ulises kernel: EIP: 0060:[shrink_list+1061/1099] Not tainted
Dec 2 22:11:29 ulises kernel: EFLAGS: 00010202
Dec 2 22:11:29 ulises kernel: EIP is at shrink_list+0x425/0x44b
Dec 2 22:11:29 ulises kernel: eax: 01010041 ebx: d5d57b34 ecx: c13588a0 edx: c104b7c0
Dec 2 22:11:29 ulises kernel: esi: c104b7a8 edi: ce393cfc ebp: c0463b0c esp: ce393c38
Dec 2 22:11:29 ulises kernel: ds: 007b es: 007b ss: 0068
Dec 2 22:11:29 ulises kernel: Process proftpd (pid: 30881, threadinfo=ce392000 task=d3a43900)
Dec 2 22:11:29 ulises kernel: Stack: 00000000 00000000 00000000 00000000 00000000 00000000 ce393c50 ce393c50
Dec 2 22:11:29 ulises kernel: 0000f1ce 00000000 00000000 00000050 0000f1ce c012d871 dea9a5d8 0000f1ce
Dec 2 22:11:29 ulises kernel: 00000000 00000001 c012d5f9 c13877c8 c13877c8 00000000 c026ecea dea9a5d8
Dec 2 22:11:29 ulises kernel: Call Trace:
Dec 2 22:11:29 ulises kernel: [find_or_create_page+33/170] find_or_create_page+0x21/0xaa
Dec 2 22:11:29 ulises kernel: [unlock_page+21/82] unlock_page+0x15/0x52
Dec 2 22:11:29 ulises kernel: [_pagebuf_lookup_pages+725/941] _pagebuf_lookup_pages+0x2d5/0x3ad
Dec 2 22:11:29 ulises kernel: [__pagevec_lru_add_active+150/165] __pagevec_lru_add_active+0x96/0xa5
Dec 2 22:11:29 ulises kernel: [shrink_cache+353/637] shrink_cache+0x161/0x27d
Dec 2 22:11:29 ulises kernel: [xfs_da_buf_make+508/513] xfs_da_buf_make+0x1fc/0x201
Dec 2 22:11:29 ulises kernel: [xfs_da_do_buf+738/2470] xfs_da_do_buf+0x2e2/0x9a6
Dec 2 22:11:29 ulises kernel: [shrink_caches+167/189] shrink_caches+0xa7/0xbd
Dec 2 22:11:29 ulises kernel: [try_to_free_pages+159/351] try_to_free_pages+0x9f/0x15f
Dec 2 22:11:29 ulises kernel: [__alloc_pages+469/788] __alloc_pages+0x1d5/0x314
Dec 2 22:11:29 ulises kernel: [__get_free_pages+31/65] __get_free_pages+0x1f/0x41
Dec 2 22:11:29 ulises kernel: [__pollwait+133/198] __pollwait+0x85/0xc6
Dec 2 22:11:29 ulises kernel: [tcp_poll+52/346] tcp_poll+0x34/0x15a
Dec 2 22:11:29 ulises kernel: [sock_poll+41/49] sock_poll+0x29/0x31
Dec 2 22:11:29 ulises kernel: [do_select+582/675] do_select+0x246/0x2a3
Dec 2 22:11:29 ulises kernel: [__pollwait+0/198] __pollwait+0x0/0xc6
Dec 2 22:11:29 ulises kernel: [sys_select+711/1251] sys_select+0x2c7/0x4e3
Dec 2 22:11:29 ulises kernel: [sysenter_past_esp+82/113] sysenter_past_esp+0x52/0x71
Dec 2 22:11:29 ulises kernel:
Dec 2 22:11:29 ulises kernel: Code: 0f 0b 18 01 ca cd 40 c0 e9 7c fc ff ff 8b 84 24 b0 00 00 00
Dec 2 22:11:32 ulises kernel: ------------[ cut here ]------------
Dec 2 22:11:32 ulises kernel: kernel BUG at mm/vmscan.c:280!
Dec 2 22:11:32 ulises kernel: invalid operand: 0000 [#3]
Dec 2 22:11:32 ulises kernel: CPU: 0
Dec 2 22:11:32 ulises kernel: EIP: 0060:[shrink_list+1061/1099] Not tainted
Dec 2 22:11:32 ulises kernel: EFLAGS: 00010202
Dec 2 22:11:32 ulises kernel: EIP is at shrink_list+0x425/0x44b
Dec 2 22:11:32 ulises kernel: eax: 01020049 ebx: 00000000 ecx: c141ae00 edx: c10bc100
Dec 2 22:11:32 ulises kernel: esi: c10bc0e8 edi: d653dcfc ebp: c0463b0c esp: d653dc38
Dec 2 22:11:32 ulises kernel: ds: 007b es: 007b ss: 0068
Dec 2 22:11:32 ulises kernel: Process proftpd (pid: 30503, threadinfo=d653c000 task=c80ba080)
Dec 2 22:11:32 ulises kernel: Stack: c036608a dffe799c dddd5180 de49e680 00000000 00000000 d653dc50 d653dc50
Dec 2 22:11:32 ulises kernel: 00000000 c036a066 dddd5180 d653dc68 0000012c 00000001 00000001 c0545e90
Dec 2 22:11:32 ulises kernel: 00000000 00000001 00000013 00000980 c04c1380 dfd39ba0 c010b04a 00000292
Dec 2 22:11:32 ulises kernel: Call Trace:
Dec 2 22:11:32 ulises kernel: [kfree_skbmem+36/44] kfree_skbmem+0x24/0x2c
Dec 2 22:11:32 ulises kernel: [net_tx_action+58/172] net_tx_action+0x3a/0xac
Dec 2 22:11:32 ulises kernel: [do_IRQ+196/223] do_IRQ+0xc4/0xdf
Dec 2 22:11:32 ulises kernel: [start_this_handle+172/688] start_this_handle+0xac/0x2b0
Dec 2 22:11:32 ulises kernel: [__pagevec_lru_add_active+150/165] __pagevec_lru_add_active+0x96/0xa5
Dec 2 22:11:32 ulises kernel: [shrink_cache+353/637] shrink_cache+0x161/0x27d
Dec 2 22:11:32 ulises kernel: [__block_commit_write+136/138] __block_commit_write+0x88/0x8a
Dec 2 22:11:32 ulises kernel: [__ext3_journal_stop+36/80] __ext3_journal_stop+0x24/0x50
Dec 2 22:11:32 ulises kernel: [shrink_caches+167/189] shrink_caches+0xa7/0xbd
Dec 2 22:11:32 ulises kernel: [try_to_free_pages+159/351] try_to_free_pages+0x9f/0x15f
Dec 2 22:11:32 ulises kernel: [__alloc_pages+469/788] __alloc_pages+0x1d5/0x314
Dec 2 22:11:32 ulises kernel: [__get_free_pages+31/65] __get_free_pages+0x1f/0x41
Dec 2 22:11:32 ulises kernel: [__pollwait+133/198] __pollwait+0x85/0xc6
Dec 2 22:11:32 ulises kernel: [tcp_poll+52/346] tcp_poll+0x34/0x15a
Dec 2 22:11:32 ulises kernel: [sock_poll+41/49] sock_poll+0x29/0x31
Dec 2 22:11:32 ulises kernel: [do_select+582/675] do_select+0x246/0x2a3
Dec 2 22:11:32 ulises kernel: [__pollwait+0/198] __pollwait+0x0/0xc6
Dec 2 22:11:32 ulises kernel: [sys_select+711/1251] sys_select+0x2c7/0x4e3
Dec 2 22:11:32 ulises kernel: [sysenter_past_esp+82/113] sysenter_past_esp+0x52/0x71
Dec 2 22:11:32 ulises kernel:
Dec 2 22:11:32 ulises kernel: Code: 0f 0b 18 01 ca cd 40 c0 e9 7c fc ff ff 8b 84 24 b0 00 00 00
[...]

This machine is Pentium IV with 512 MB of RAM, IDE & SATA disks, RAID 0 over the
2 SATA disks, vanilla 2.6.0-test11, Debian testing, apache2 and proftpd.

And yes :-) it's compiled from a fresh cleaned tree.

Any ideas? Should I switch to 2.6.0-test10-mm1, in case that will be
more stable?

Thanks in advance,


Ender.
- --
Another C-3PO: E chu ta!
C-3PO: How rude!
-- C-3PO (The Empire Strikes Back)
- --
Servicios de red - Network services
Centro de Comunicaciones CSIC/RedIRIS
Spanish Academic Network for Research and Development
Madrid (Spain)
Tlf (+34) 91.585.49.05
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)

iD8DBQE/zeJeWs/EhA1iABsRAv1dAKCqjUn0lzLiYlpUuh6Wm9G20Qt2iwCeJRaL
rNXUG4kfKEIfm0DVyIsMez8=
=D0lV
-----END PGP SIGNATURE-----

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

William Lee Irwin III

nepřečteno,
3. 12. 2003 8:35:1003.12.03
komu:
On Wed, Dec 03, 2003 at 02:17:18PM +0100, David Mart?nez Moreno wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> Hello again. I'm testing 2.6.0-test11 in one of my servers. In about a day or
> so under a web/FTP server load, the kernel starts to spit messages:
> Dec 2 22:07:25 ulises kernel: Bad page state at prep_new_page
> Dec 2 22:07:25 ulises kernel: flags:0x0102002c mapping:d50c8a28 mapped:0 count:1
> Dec 2 22:07:25 ulises kernel: Backtrace:
> Dec 2 22:07:25 ulises kernel: Call Trace:
> Dec 2 22:07:25 ulises kernel: [bad_page+93/133] bad_page+0x5d/0x85
> Dec 2 22:07:25 ulises kernel: [prep_new_page+50/81] prep_new_page+0x32/0x51
> Dec 2 22:07:25 ulises kernel: [buffered_rmqueue+165/264] buffered_rmqueue+0xa5/0x108

Someone is freeing pages still in the pagecache.

What filesystem? What .config? Could you try CONFIG_DEBUG_PAGEALLOC?


-- wli

Linus Torvalds

nepřečteno,
3. 12. 2003 11:01:4203.12.03
komu:

On Wed, 3 Dec 2003, David [iso-8859-15] Martínez Moreno wrote:
>
> Hello again. I'm testing 2.6.0-test11 in one of my servers. In about a day or
> so under a web/FTP server load, the kernel starts to spit messages:
>
> Dec 2 22:07:25 ulises kernel: Bad page state at prep_new_page

> [ ... ]


>
> This machine is Pentium IV with 512 MB of RAM, IDE & SATA disks, RAID 0 over the
> 2 SATA disks, vanilla 2.6.0-test11, Debian testing, apache2 and proftpd.

Interesting. Another RAID 0 problem report..

Is there any way you can test the same setup _without_ using RAID? We seem
to be narrowing down the current 2.6.x problems to RAID usage, but it
would be good to verify that.

Linus

Jens Axboe

nepřečteno,
3. 12. 2003 11:24:2403.12.03
komu:
On Wed, Dec 03 2003, Linus Torvalds wrote:
>
>
> On Wed, 3 Dec 2003, David [iso-8859-15] Martínez Moreno wrote:
> >
> > Hello again. I'm testing 2.6.0-test11 in one of my servers. In about a day or
> > so under a web/FTP server load, the kernel starts to spit messages:
> >
> > Dec 2 22:07:25 ulises kernel: Bad page state at prep_new_page
> > [ ... ]
> >
> > This machine is Pentium IV with 512 MB of RAM, IDE & SATA disks, RAID 0 over the
> > 2 SATA disks, vanilla 2.6.0-test11, Debian testing, apache2 and proftpd.
>
> Interesting. Another RAID 0 problem report..

Hmm did _all_ reports include raid-0, or just "some" raid? I'm looking
at the bio_pair stuff which raid-0 is the only user of, something looks
fishy there.

--
Jens Axboe

Jens Axboe

nepřečteno,
3. 12. 2003 11:31:3103.12.03
komu:
On Wed, Dec 03 2003, Jens Axboe wrote:
> On Wed, Dec 03 2003, Linus Torvalds wrote:
> >
> >
> > On Wed, 3 Dec 2003, David [iso-8859-15] Martínez Moreno wrote:
> > >
> > > Hello again. I'm testing 2.6.0-test11 in one of my servers. In about a day or
> > > so under a web/FTP server load, the kernel starts to spit messages:
> > >
> > > Dec 2 22:07:25 ulises kernel: Bad page state at prep_new_page
> > > [ ... ]
> > >
> > > This machine is Pentium IV with 512 MB of RAM, IDE & SATA disks, RAID 0 over the
> > > 2 SATA disks, vanilla 2.6.0-test11, Debian testing, apache2 and proftpd.
> >
> > Interesting. Another RAID 0 problem report..
>
> Hmm did _all_ reports include raid-0, or just "some" raid? I'm looking
> at the bio_pair stuff which raid-0 is the only user of, something looks
> fishy there.

Looks like some raid-5 and others were involved in other reports, so
that's probably not it.

The bio_pair stuff has a few flaws, but nothing that could bad things to
happen.

Linus Torvalds

nepřečteno,
3. 12. 2003 11:53:2403.12.03
komu:

On Wed, 3 Dec 2003, Jens Axboe wrote:
> >
> > Interesting. Another RAID 0 problem report..
>
> Hmm did _all_ reports include raid-0, or just "some" raid? I'm looking
> at the bio_pair stuff which raid-0 is the only user of, something looks
> fishy there.

The ones I've seen seem to be raid-0, but Nathan (nat...@sgi.com)
reported problems in RAID-5 under load. I didn't decode the full oops on
that one, but it really looked like a stale "bi" bio that trapped on the
PAGE_ALLOC debug code.

So we may have more than one problem.

Linus

David Martínez Moreno

nepřečteno,
3. 12. 2003 11:53:2303.12.03
komu:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

El Miércoles, 3 de Diciembre de 2003 16:59, Linus Torvalds escribió:
> On Wed, 3 Dec 2003, David [iso-8859-15] Martínez Moreno wrote:
> > Hello again. I'm testing 2.6.0-test11 in one of my servers. In about a
> > day or so under a web/FTP server load, the kernel starts to spit
> > messages:
> >
> > Dec 2 22:07:25 ulises kernel: Bad page state at prep_new_page
> > [ ... ]
> >
> > This machine is Pentium IV with 512 MB of RAM, IDE & SATA disks, RAID 0
> > over the 2 SATA disks, vanilla 2.6.0-test11, Debian testing, apache2 and
> > proftpd.
>

> Interesting. Another RAID 0 problem report..
>

> Is there any way you can test the same setup _without_ using RAID? We seem
> to be narrowing down the current 2.6.x problems to RAID usage, but it
> would be good to verify that.
>
> Linus

I've just rebooted about six hours ago, and it's giving panics elsewhere:

[...]
Ending XFS recovery on filesystem: md0 (dev: md0)
b44: eth0: Link is down.
b44: eth0: Link is up at 100 Mbps, full duplex.
b44: eth0: Flow control is on for TX and on for RX.
eth0: no IPv6 routers present
Unable to handle kernel paging request at virtual address 00100104
printing eip:
c012d195
*pde = 00000000
Oops: 0002 [#1]
CPU: 0
EIP: 0060:[<c012d195>] Not tainted
EFLAGS: 00010286
EIP is at __remove_from_page_cache+0x2d/0x62
eax: 00100100 ebx: c120d5c8 ecx: c120d5d0 edx: 00200200
esi: dbcb1d28 edi: dfdc5e64 ebp: dbcb1d28 esp: dfdc5d8c


ds: 007b es: 007b ss: 0068

Process kswapd0 (pid: 8, threadinfo=dfdc4000 task=dfdcacc0)
Stack: dbcb1d2c 0001b18a 00000001 c120d5c8 c0135b56 c120d5c8 00000088 0000001e
00000001 00000070 00000000 dfdc5db8 dfdc5db8 dfd16000 0000b400 dfdc5e44
c11632d0 c0463af4 00000002 c0548540 c0134efe 00000000 00000001 c13efaf8
Call Trace:
[<c0135b56>] shrink_list+0x2af/0x44b
[<c0134efe>] __pagevec_lru_add+0x84/0x93
[<c0135e53>] shrink_cache+0x161/0x27d
[<c0135737>] shrink_slab+0x75/0x153
[<c01367ac>] balance_pgdat+0x178/0x1f0
[<c0136936>] kswapd+0x112/0x122
[<c0119533>] autoremove_wake_function+0x0/0x4f
[<c0119533>] autoremove_wake_function+0x0/0x4f
[<c0136824>] kswapd+0x0/0x122
[<c0106fa5>] kernel_thread_helper+0x5/0xb

Code: 89 50 04 89 02 c7 41 04 00 02 20 00 c7 43 08 00 01 10 00 c7


Bad page state at prep_new_page

flags:0x0102000c mapping:dbcb1d28 mapped:0 count:2
Backtrace:
Call Trace:
[<c01306c5>] bad_page+0x5d/0x85
[<c01309b7>] prep_new_page+0x32/0x51
[<c0130dd8>] buffered_rmqueue+0xa5/0x108
[<c0130ee1>] __alloc_pages+0xa6/0x314
[<c0132a9a>] do_page_cache_readahead+0xdd/0x109
[<c0362f99>] sock_sendpage+0x61/0x70
[<c0132b84>] page_cache_readahead+0xbe/0x14f
[<c012daeb>] do_generic_mapping_read+0xba/0x39f
[<c012e1c8>] file_send_actor+0x0/0x73
[<c012e2a3>] generic_file_sendfile+0x68/0x75
[<c012e1c8>] file_send_actor+0x0/0x73
[<c0277b72>] xfs_sendfile+0xb8/0x195
[<c012e1c8>] file_send_actor+0x0/0x73
[<c0272889>] linvfs_sendfile+0x57/0x5f
[<c012e1c8>] file_send_actor+0x0/0x73
[<c0145aab>] do_sendfile+0x1de/0x2b5
[<c012e1c8>] file_send_actor+0x0/0x73
[<c0145be6>] sys_sendfile+0x64/0xd0
[<c0108cd1>] sysenter_past_esp+0x52/0x71

Trying to fix it up, but a reboot is needed

[...more errors...]


I can rebuild the Debian mirror for not using the RAID and using the SATA
disks separately, but will be tomorrow, it's a lot of space to move, and I
need remote intervention.

Anyway I'd love to know before doing if it will be useful, looking at what
Jens has said just ten minutes ago about RAIDs 0/5. Will it help to you? Say
so and I'll go for it.

I attach the .config, but I think it's nothing strange.

Regards,


Ender.
- --
And need I remind you that I am naked in the snow...? I
can't feel any of my extremities, and I mean *any* of them...
-- Skinner (The League of Extraordinary Gentlemen)


- --
Servicios de red - Network services
Centro de Comunicaciones CSIC/RedIRIS
Spanish Academic Network for Research and Development
Madrid (Spain)
Tlf (+34) 91.585.49.05
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)

iD8DBQE/zhOUWs/EhA1iABsRAoF/AKCT8AmLwTj4EIge67VmL29uTrZFFgCeMZ09
xtvZBp7h53LONzbBPX+45d8=
=FHph
-----END PGP SIGNATURE-----

.config

David Martínez Moreno

nepřečteno,
3. 12. 2003 11:57:0403.12.03
komu:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

El Miércoles, 3 de Diciembre de 2003 14:31, William Lee Irwin III escribió:
> On Wed, Dec 03, 2003 at 02:17:18PM +0100, David Mart?nez Moreno wrote:

> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1

> > Hello again. I'm testing 2.6.0-test11 in one of my servers. In about a
> > day or so under a web/FTP server load, the kernel starts to spit
> > messages: Dec 2 22:07:25 ulises kernel: Bad page state at prep_new_page

> > Dec 2 22:07:25 ulises kernel: flags:0x0102002c mapping:d50c8a28 mapped:0

> > count:1 Dec 2 22:07:25 ulises kernel: Backtrace:
> > Dec 2 22:07:25 ulises kernel: Call Trace:
> > Dec 2 22:07:25 ulises kernel: [bad_page+93/133] bad_page+0x5d/0x85
> > Dec 2 22:07:25 ulises kernel: [prep_new_page+50/81]
> > prep_new_page+0x32/0x51 Dec 2 22:07:25 ulises kernel:

> > [buffered_rmqueue+165/264] buffered_rmqueue+0xa5/0x108
>
> Someone is freeing pages still in the pagecache.
>
> What filesystem? What .config? Could you try CONFIG_DEBUG_PAGEALLOC?

Ext3 in /, XFS over RAID 0:

ulises:/home# mount
/dev/hda5 on / type ext3 (rw,errors=remount-ro)
proc on /proc type proc (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
none on /sys type sysfs (rw)
/dev/hda1 on /boot type ext3 (rw)
/dev/hda6 on /var type ext3 (rw)
/dev/hda8 on /home type ext3 (rw)
/dev/md0 on /mirror type xfs (rw)

Attached is .config. I'll try to build another kernel with
CONFIG_DEBUG_PAGEALLOC support enabled today or tomorrow.

Regards,


Ender.
- --
El conceto es el conceto.
-- Pazos (Airbag)


- --
Servicios de red - Network services
Centro de Comunicaciones CSIC/RedIRIS
Spanish Academic Network for Research and Development
Madrid (Spain)
Tlf (+34) 91.585.49.05
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)

iD8DBQE/zhQcWs/EhA1iABsRAv7+AKDSTmSrYxYU6y9JhLZZyIXq+SPN/gCgrTH1
dFcP3l+sT4SRVYOZghR8ves=
=ji2Q
-----END PGP SIGNATURE-----

.config

Kevin P. Fleming

nepřečteno,
3. 12. 2003 12:29:4303.12.03
komu:
Linus Torvalds wrote:

> The ones I've seen seem to be raid-0, but Nathan (nat...@sgi.com)
> reported problems in RAID-5 under load. I didn't decode the full oops on
> that one, but it really looked like a stale "bi" bio that trapped on the
> PAGE_ALLOC debug code.

The problem I reported was also with RAID-5, and I have also found a
problem similar to Nathan's (probably the same one) by just trying to
run bonnie++ on an XFS filesystem on DM over RAID-5, even after
formatting the XFS filesystem to forcibly align everything to RAID-5
stripes (64K units).

Linus Torvalds

nepřečteno,
3. 12. 2003 12:29:4303.12.03
komu:

On Wed, 3 Dec 2003, David Martínez Moreno wrote:
>
> I've just rebooted about six hours ago, and it's giving panics elsewhere:
>
> [...]
> Ending XFS recovery on filesystem: md0 (dev: md0)
> b44: eth0: Link is down.
> b44: eth0: Link is up at 100 Mbps, full duplex.
> b44: eth0: Flow control is on for TX and on for RX.
> eth0: no IPv6 routers present
> Unable to handle kernel paging request at virtual address 00100104

That's the LIST_POISON stuff: 00100100 is the "bad list pointer". Somebody
tried to remove a page twice.

Doesn't mean a lot - if your "struct page" got corrupted, anything can
happen. Quite possibly it's a double free.

> I can rebuild the Debian mirror for not using the RAID and using the SATA
> disks separately, but will be tomorrow, it's a lot of space to move, and I
> need remote intervention.
>
> Anyway I'd love to know before doing if it will be useful, looking at what
> Jens has said just ten minutes ago about RAIDs 0/5. Will it help to you? Say
> so and I'll go for it.

It might be more useful to leave it as RAID0, if you're willing to try out
patches to try to debug this. The slab-debugging thing I sent out earlier
is one such patch (but may well cause out-of-memory problems under load),
and possibly the atomic-decrement checker patch (appended). And maybe Jens
and Neil can come up with something..

Linus

----
===== arch/i386/lib/dec_and_lock.c 1.1 vs edited =====
--- 1.1/arch/i386/lib/dec_and_lock.c Tue Feb 5 09:40:21 2002
+++ edited/arch/i386/lib/dec_and_lock.c Sun Nov 2 09:07:53 2003
@@ -19,7 +19,7 @@
counter = atomic_read(atomic);
newcount = counter-1;

- if (!newcount)
+ if (newcount <= 0)
goto slow_path;

asm volatile("lock; cmpxchgl %1,%2"
===== include/asm-i386/atomic.h 1.5 vs edited =====
--- 1.5/include/asm-i386/atomic.h Mon Aug 18 19:46:23 2003
+++ edited/include/asm-i386/atomic.h Sun Nov 2 09:40:42 2003
@@ -2,6 +2,8 @@
#define __ARCH_I386_ATOMIC__

#include <linux/config.h>
+#include <linux/kernel.h>
+#include <asm/bug.h>

/*
* Atomic operations that C can't guarantee us. Useful for
@@ -136,12 +138,17 @@
*/
static __inline__ int atomic_dec_and_test(atomic_t *v)
{
- unsigned char c;
+ static int count = 2;
+ unsigned char c, neg;

__asm__ __volatile__(
- LOCK "decl %0; sete %1"
- :"=m" (v->counter), "=qm" (c)
+ LOCK "decl %0; sete %1; sets %2"
+ :"=m" (v->counter), "=qm" (c), "=qm" (neg)
:"m" (v->counter) : "memory");
+ if (count && neg) {
+ count--;
+ WARN_ON(neg);
+ }
return c != 0;

Neil Brown

nepřečteno,
3. 12. 2003 15:08:5803.12.03
komu:
On Wednesday December 3, ax...@suse.de wrote:
> >
> > Interesting. Another RAID 0 problem report..
>
> Hmm did _all_ reports include raid-0, or just "some" raid? I'm looking
> at the bio_pair stuff which raid-0 is the only user of, something looks
> fishy there.

I think this is the first raid0 related problem I have seen for a
while. Others were raid1 and raid5.

Once an array is set up and running there is minimal common code
between the different levels so it is very unlikely to be the same
problem in all three cases.

xfs seems to figure almost as prominantly as raid (the raid1 bug was
ext3), but maybe it's just that xfs over raid is a popular
configuration.

NeilBrown

Neil Brown

nepřečteno,
3. 12. 2003 15:12:1103.12.03
komu:
On Wednesday December 3, torv...@osdl.org wrote:

>
>
> On Wed, 3 Dec 2003, Jens Axboe wrote:
> > >
> > > Interesting. Another RAID 0 problem report..
> >
> > Hmm did _all_ reports include raid-0, or just "some" raid? I'm looking
> > at the bio_pair stuff which raid-0 is the only user of, something looks
> > fishy there.
>
> The ones I've seen seem to be raid-0, but Nathan (nat...@sgi.com)
> reported problems in RAID-5 under load. I didn't decode the full oops on
> that one, but it really looked like a stale "bi" bio that trapped on the
> PAGE_ALLOC debug code.
>

Nathan's had a second oops that turned out to be a bi_next pointer
being bad in a bio that raid5 had just about finished writing out.
So there does seem to be something wrong with bio handling, quite
possibly in raid5.

The only thing I could find was that if raid5 received two overlapping
bios concurrently (or atleast received the second before it had
finished with the first) it could get confused. I've asked Nathan to
try a patch that BUGs when that happens.

Linus Torvalds

nepřečteno,
3. 12. 2003 15:16:3103.12.03
komu:


On Thu, 4 Dec 2003, Neil Brown wrote:
>
> xfs seems to figure almost as prominantly as raid (the raid1 bug was
> ext3), but maybe it's just that xfs over raid is a popular
> configuration.

I suspect that is the case - people seem to be reporting it with other
filesystems too, and it's more likely just a case of "somebody who feels
he needs to set up raid probably is also likely to feel he needs XFS".

That effect is then quite possibly exaggerated by XFS having different IO
patterns than some other loads (ie it is likely that a lot of RAID
developers are themselves using just ext3 and may thus have hit the
"normal" bugs and fixed them already).

Linus

Nathan Scott

nepřečteno,
4. 12. 2003 1:45:2504.12.03
komu:
On Thu, Dec 04, 2003 at 07:09:10AM +1100, Neil Brown wrote:
> On Wednesday December 3, torv...@osdl.org wrote:
> >
> >
> > On Wed, 3 Dec 2003, Jens Axboe wrote:
> > > >
> > > > Interesting. Another RAID 0 problem report..
> > >
> > > Hmm did _all_ reports include raid-0, or just "some" raid? I'm looking
> > > at the bio_pair stuff which raid-0 is the only user of, something looks
> > > fishy there.
> >
> > The ones I've seen seem to be raid-0, but Nathan (nat...@sgi.com)
> > reported problems in RAID-5 under load. I didn't decode the full oops on
> > that one, but it really looked like a stale "bi" bio that trapped on the
> > PAGE_ALLOC debug code.
> >
>
> Nathan's had a second oops that turned out to be a bi_next pointer
> being bad in a bio that raid5 had just about finished writing out.
> So there does seem to be something wrong with bio handling, quite
> possibly in raid5.
>
> The only thing I could find was that if raid5 received two overlapping
> bios concurrently (or atleast received the second before it had
> finished with the first) it could get confused. I've asked Nathan to
> try a patch that BUGs when that happens.

I haven't tripped the bug so far today, although have been
running with page-sized fs blocksize so far - perhaps that
is implicated, and makes it less likely to trigger (when I
say "the bug" there, I mean neither the panic, nor the new
BUG_ON(); I'll revert back to smaller block sizes next).

That error path bio_put issue you spotted in XFS, Neil, I
think is a valid problem - I'm not sure that is reachable
code in practice (possibly overly defensive XFS bio code),
I'll go investigate that some more.

cheers.

--
Nathan

Jens Axboe

nepřečteno,
4. 12. 2003 7:46:5904.12.03
komu:

I can reproduce on raid5 with linear dm on top (using XFS). I need to
kill the slab and memory debugging, I've put some bio debugging in there
instead (the memory debugging interferes with it). It's definitely a bio
use after free case, clone_endio() ends up with a freed bio.

Program received signal SIGEMT, Emulation trap.
0xc02dd454 in handle_stripe (sh=0xc17cf630) at drivers/md/raid5.c:1009
1009 wbi = wbi2;
(gdb) bt
#0 0xc02dd454 in handle_stripe (sh=0xc17cf630) at drivers/md/raid5.c:1009
#1 0xc02de31f in raid5d (mddev=0xdfd2d200) at drivers/md/raid5.c:1436
#2 0xc02e675a in md_thread (arg=0xdffdc1a0) at drivers/md/md.c:2692
#3 0xc010752d in kernel_thread_helper () at arch/i386/kernel/process.c:226

wbi (dev->written) has already been freed by someone else.

My puny 512MB test box cannot use your slab-debug patch :). The
atomic-checker didn't catch anything.

--
Jens Axboe

David Martínez Moreno

nepřečteno,
4. 12. 2003 7:59:2804.12.03
komu:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

El Miércoles, 3 de Diciembre de 2003 18:25, Linus Torvalds escribió:
> It might be more useful to leave it as RAID0, if you're willing to try out
> patches to try to debug this. The slab-debugging thing I sent out earlier
> is one such patch (but may well cause out-of-memory problems under load),
> and possibly the atomic-decrement checker patch (appended). And maybe Jens
> and Neil can come up with something..

Ok, Linus, I've rebooted the server with your patch applied. It'll be very
easy to check. People are hitting all the time the archive, downloading the
Debian CD images, so in several hours I'll tell you the result.

Thanks in advance,


Ender.

- --
Oh, I saw...Very American. Fire enough bullets and hope
they hit the target!
-- Allan Quatermain (The League of Extraordinary Gentlemen)


- --
Servicios de red - Network services
Centro de Comunicaciones CSIC/RedIRIS
Spanish Academic Network for Research and Development
Madrid (Spain)
Tlf (+34) 91.585.49.05
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)

iD8DBQE/zy5eWs/EhA1iABsRAslYAKDP9AbmxeYkYMX1VDMbyQvLNhJzEQCgslrF
ZHEPuaAf52qTJ+hfOwZETIQ=
=VmD7
-----END PGP SIGNATURE-----

Jens Axboe

nepřečteno,
4. 12. 2003 9:12:0704.12.03
komu:
On Thu, Dec 04 2003, Jens Axboe wrote:
> On Wed, Dec 03 2003, Linus Torvalds wrote:
> >
> >
> > On Wed, 3 Dec 2003, David Martínez Moreno wrote:
> > >
> > > I've just rebooted about six hours ago, and it's giving panics elsewhere:
> > >
> > > [...]
> > > Ending XFS recovery on filesystem: md0 (dev: md0)
> > > b44: eth0: Link is down.
> > > b44: eth0: Link is up at 100 Mbps, full duplex.
> > > b44: eth0: Flow control is on for TX and on for RX.
> > > eth0: no IPv6 routers present
> > > Unable to handle kernel paging request at virtual address 00100104
> >
> > That's the LIST_POISON stuff: 00100100 is the "bad list pointer". Somebody
> > tried to remove a page twice.
> >
> > Doesn't mean a lot - if your "struct page" got corrupted, anything can
> > happen. Quite possibly it's a double free.
> >
> > > I can rebuild the Debian mirror for not using the RAID and using the SATA
> > > disks separately, but will be tomorrow, it's a lot of space to move, and I
> > > need remote intervention.
> > >
> > > Anyway I'd love to know before doing if it will be useful, looking at what
> > > Jens has said just ten minutes ago about RAIDs 0/5. Will it help to you? Say
> > > so and I'll go for it.
> >
> > It might be more useful to leave it as RAID0, if you're willing to try out
> > patches to try to debug this. The slab-debugging thing I sent out earlier
> > is one such patch (but may well cause out-of-memory problems under load),
> > and possibly the atomic-decrement checker patch (appended). And maybe Jens
> > and Neil can come up with something..
>
> I can reproduce on raid5 with linear dm on top (using XFS). I need to
> kill the slab and memory debugging, I've put some bio debugging in there
> instead (the memory debugging interferes with it). It's definitely a bio
> use after free case, clone_endio() ends up with a freed bio.
>
> Program received signal SIGEMT, Emulation trap.
> 0xc02dd454 in handle_stripe (sh=0xc17cf630) at drivers/md/raid5.c:1009
> 1009 wbi = wbi2;
> (gdb) bt
> #0 0xc02dd454 in handle_stripe (sh=0xc17cf630) at drivers/md/raid5.c:1009
> #1 0xc02de31f in raid5d (mddev=0xdfd2d200) at drivers/md/raid5.c:1436
> #2 0xc02e675a in md_thread (arg=0xdffdc1a0) at drivers/md/md.c:2692
> #3 0xc010752d in kernel_thread_helper () at arch/i386/kernel/process.c:226
>
> wbi (dev->written) has already been freed by someone else.
>
> My puny 512MB test box cannot use your slab-debug patch :). The
> atomic-checker didn't catch anything.

I can just as easily reproduce with ext2, so I don't think XFS has
anything to do with it. This is still raid5 with dm linear on top.

(gdb) bt
#0 0x0b10dead in ?? ()
#1 0xc0170e25 in bio_endio (bio=0xda37fae0, bytes_done=0, error=0)
at fs/bio.c:689
#2 0xc02e8a0d in clone_endio (bio=0xda38c120, done=7168, error=0)
at drivers/md/dm.c:266
#3 0xc02dd78c in handle_stripe (sh=0xdfc18de0) at drivers/md/raid5.c:1209
#4 0xc02de38f in raid5d (mddev=0xdfd62200) at drivers/md/raid5.c:1437
#5 0xc02e67da in md_thread (arg=0xdfd58260) at drivers/md/md.c:2692
#6 0xc010752d in kernel_thread_helper () at arch/i386/kernel/process.c:226
(gdb) p *(struct bio *) 0xda37fae0
$1 = {bi_sector = 42974, bi_next = 0x0, bi_bdev = 0xb10dead, bi_flags =
1041,
bi_rw = 1, bi_vcnt = 3, bi_idx = 0, bi_phys_segments = 0,
bi_hw_segments = 0, bi_size = 0, bi_max_vecs = 0, bi_io_vec =
0xda357ce0,
bi_end_io = 0xb10dead, bi_cnt = {counter = 0}, bi_private = 0xb10dead,
bi_destructor = 0xc0170050 <bio_destructor>, free_eip = 0xc02e8a26}

EIP is bad, bio debug magic 0x0b10dead. bi_flags has the uptodate bit
set, the clone bit, and the 10th bit (debug dead bit). The bio itself
was freed by 0xc02e8a26, dm.c:clone_endio().

--
Jens Axboe

Jens Axboe

nepřečteno,
4. 12. 2003 9:18:2404.12.03
komu:

As far as I can see, dm bio handling looks perfectly fine (no bad usage
in there wrt references). So this looks more and more like a raid
problem. I tried looking at handle_strip() and associated raid5
functions, but I think I'll leave that to Neil... It's not a straight
forward read.

Neil Brown

nepřečteno,
4. 12. 2003 22:11:2004.12.03
komu:
On Thursday December 4, ax...@suse.de wrote:
>
> I can just as easily reproduce with ext2, so I don't think XFS has
> anything to do with it. This is still raid5 with dm linear on top.
>

Thanks. The more evidence the better....

After staring at the code over and ove again, I finally saw the
assumption that I was making that was invalid. I also found a
possible data-corruption bug in the process.

If you have been having problems with raid5, please try this patch and
see if it helps.

Thanks,
NeilBrown

### Comments for ChangeSet
Fix bugs in md/raid5

This patch does three things:
1/ make sure raid5 doesn't try to handle multiple overlaping
requests at the same time as this would confuse things badly.
Currently it justs BUGs if this is attempted.
2/ Fix a possible data-loss-on-write problem. If two or
more bio's that write to the same page are processed at the
same time, only the first was actually commited to storage.
3/ Fix a use-after-free bug. raid5 keeps the bio's it is given
in linked lists when more than one bio touch a single page.
In some cases the tail of this list can be freed, and
the current test for 'are we at the end' isn't reliable.
This patch strengths the test to make it reliable.


----------- Diffstat output ------------
./drivers/md/raid5.c | 34 +++++++++++++++++++++++++---------
1 files changed, 25 insertions(+), 9 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~ 2003-12-03 16:25:54.000000000 +1100
+++ ./drivers/md/raid5.c 2003-12-05 13:27:49.000000000 +1100
@@ -40,6 +40,16 @@

#define stripe_hash(conf, sect) ((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK])

+/* bio's attached to a stripe+device for I/O are linked together in bi_sector
+ * order without overlap. There may be several bio's per stripe+device, and
+ * a bio could span several devices.
+ * When walking this list for a particular stripe+device, we must never proceed
+ * beyond a bio that extends past this device, as the next bio might no longer
+ * be valid.
+ * This macro is used to determine the 'next' bio in the list, given the sector
+ * of the current stripe+device
+ */
+#define r5_next_bio(bio, sect) ( ( bio->bi_sector + (bio->bi_size>>9) < sect + STRIPE_SECTORS) ? bio->bi_next : NULL)
/*
* The following can be used to debug the driver
*/
@@ -613,7 +623,7 @@ static void copy_data(int frombio, struc
int i;

for (;bio && bio->bi_sector < sector+STRIPE_SECTORS;
- bio = bio->bi_next) {
+ bio = r5_next_bio(bio, sector) ) {
int page_offset;
if (bio->bi_sector >= sector)
page_offset = (signed)(bio->bi_sector - sector) * 512;
@@ -738,7 +748,11 @@ static void compute_parity(struct stripe
for (i = disks; i--;)
if (sh->dev[i].written) {
sector_t sector = sh->dev[i].sector;
- copy_data(1, sh->dev[i].written, sh->dev[i].page, sector);
+ struct bio *wbi = sh->dev[i].written;
+ while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) {
+ copy_data(1, wbi, sh->dev[i].page, sector);
+ wbi = r5_next_bio(wbi, sector);
+ }

set_bit(R5_LOCKED, &sh->dev[i].flags);
set_bit(R5_UPTODATE, &sh->dev[i].flags);
@@ -791,8 +805,10 @@ static void add_stripe_bio (struct strip
bip = &sh->dev[dd_idx].towrite;
else
bip = &sh->dev[dd_idx].toread;
- while (*bip && (*bip)->bi_sector < bi->bi_sector)
+ while (*bip && (*bip)->bi_sector < bi->bi_sector) {
+ BUG_ON((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector);
bip = & (*bip)->bi_next;
+ }
/* FIXME do I need to worry about overlapping bion */
if (*bip && bi->bi_next && (*bip) != bi->bi_next)
BUG();
@@ -813,7 +829,7 @@ static void add_stripe_bio (struct strip
for (bi=sh->dev[dd_idx].towrite;
sector < sh->dev[dd_idx].sector + STRIPE_SECTORS &&
bi && bi->bi_sector <= sector;
- bi = bi->bi_next) {
+ bi = r5_next_bio(bi, sh->dev[dd_idx].sector)) {
if (bi->bi_sector + (bi->bi_size>>9) >= sector)
sector = bi->bi_sector + (bi->bi_size>>9);
}
@@ -883,7 +899,7 @@ static void handle_stripe(struct stripe_
spin_unlock_irq(&conf->device_lock);
while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) {
copy_data(0, rbi, dev->page, dev->sector);
- rbi2 = rbi->bi_next;
+ rbi2 = r5_next_bio(rbi, dev->sector);
spin_lock_irq(&conf->device_lock);
if (--rbi->bi_phys_segments == 0) {
rbi->bi_next = return_bi;
@@ -928,7 +944,7 @@ static void handle_stripe(struct stripe_
if (bi) to_write--;

while (bi && bi->bi_sector < sh->dev[i].sector + STRIPE_SECTORS){
- struct bio *nextbi = bi->bi_next;
+ struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
clear_bit(BIO_UPTODATE, &bi->bi_flags);
if (--bi->bi_phys_segments == 0) {
md_write_end(conf->mddev);
@@ -941,7 +957,7 @@ static void handle_stripe(struct stripe_
bi = sh->dev[i].written;
sh->dev[i].written = NULL;
while (bi && bi->bi_sector < sh->dev[i].sector + STRIPE_SECTORS) {
- struct bio *bi2 = bi->bi_next;
+ struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
clear_bit(BIO_UPTODATE, &bi->bi_flags);
if (--bi->bi_phys_segments == 0) {
md_write_end(conf->mddev);
@@ -957,7 +973,7 @@ static void handle_stripe(struct stripe_
sh->dev[i].toread = NULL;
if (bi) to_read--;
while (bi && bi->bi_sector < sh->dev[i].sector + STRIPE_SECTORS){
- struct bio *nextbi = bi->bi_next;
+ struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
clear_bit(BIO_UPTODATE, &bi->bi_flags);
if (--bi->bi_phys_segments == 0) {
bi->bi_next = return_bi;
@@ -1000,7 +1016,7 @@ static void handle_stripe(struct stripe_
wbi = dev->written;
dev->written = NULL;
while (wbi && wbi->bi_sector < dev->sector + STRIPE_SECTORS) {
- wbi2 = wbi->bi_next;
+ wbi2 = r5_next_bio(wbi, dev->sector);
if (--wbi->bi_phys_segments == 0) {
md_write_end(conf->mddev);
wbi->bi_next = return_bi;

Nathan Scott

nepřečteno,
4. 12. 2003 22:27:0504.12.03
komu:
On Wed, Dec 03, 2003 at 10:25:27AM -0700, Kevin P. Fleming wrote:
> Linus Torvalds wrote:
>
> >The ones I've seen seem to be raid-0, but Nathan (nat...@sgi.com)
> >reported problems in RAID-5 under load. I didn't decode the full oops on
> >that one, but it really looked like a stale "bi" bio that trapped on the
> >PAGE_ALLOC debug code.
>
> The problem I reported was also with RAID-5, and I have also found a
> problem similar to Nathan's (probably the same one) by just trying to
> run bonnie++ on an XFS filesystem on DM over RAID-5, even after
> formatting the XFS filesystem to forcibly align everything to RAID-5
> stripes (64K units).

FWIW, this doesn't align _everything_ (space allocations done
through the XFS allocator are influenced, which means "most")
-- log IO is still going to be sector aligned, as are any IOs
to the four XFS allocation group header metadata structures.

cheers.

--
Nathan

Kevin P. Fleming

nepřečteno,
4. 12. 2003 22:51:5504.12.03
komu:
Nathan Scott wrote:

>>The problem I reported was also with RAID-5, and I have also found a
>>problem similar to Nathan's (probably the same one) by just trying to
>>run bonnie++ on an XFS filesystem on DM over RAID-5, even after
>>formatting the XFS filesystem to forcibly align everything to RAID-5
>>stripes (64K units).
>
>
> FWIW, this doesn't align _everything_ (space allocations done
> through the XFS allocator are influenced, which means "most")
> -- log IO is still going to be sector aligned, as are any IOs
> to the four XFS allocation group header metadata structures.

OK, I thought that "-l sunit=..." being set to a block-size multiple
would take care of that as well, but apparently not.

Nathan Scott

nepřečteno,
4. 12. 2003 23:35:0904.12.03
komu:
On Fri, Dec 05, 2003 at 02:07:35PM +1100, Neil Brown wrote:
> On Thursday December 4, ax...@suse.de wrote:
> >
> > I can just as easily reproduce with ext2, so I don't think XFS has
> > anything to do with it. This is still raid5 with dm linear on top.
> >
>
> Thanks. The more evidence the better....
>
> After staring at the code over and ove again, I finally saw the
> assumption that I was making that was invalid. I also found a
> possible data-corruption bug in the process.
>
> If you have been having problems with raid5, please try this patch and
> see if it helps.

Yes, thats fixed it. That reproducible test case I had no longer
shows any problems.

cheers.

--
Nathan

Kevin P. Fleming

nepřečteno,
4. 12. 2003 23:35:1304.12.03
komu:
Neil Brown wrote:

> After staring at the code over and ove again, I finally saw the
> assumption that I was making that was invalid. I also found a
> possible data-corruption bug in the process.
>
> If you have been having problems with raid5, please try this patch and
> see if it helps.

Initial indications are good here, my system (using XFS on linear DM
over a 6-disk RAID-5) just passed a bonnie++ run while the array was
resyncing, which it has not done without oopsing before.

Odpovědět všem
Odpovědět autorovi
Přeposlat
0 nových zpráv