[Lustre-discuss] OST I/O problems


Heiko Schröter

Dec 4, 2009, 11:59:33 AM
to lustre-...@lists.lustre.org, sysa...@iup.physik.uni-bremen.de
Hello,

we see the messages below on our OSTs under heavy _read_ load (i.e. when 60+ jobs try to read data at approximately the same time).
The OSTs freeze, and even console output slows to a few bytes per minute.
After some time the OSTs recover.

How should we interpret these messages to get a clue where to look further?

1) Is the underlying RAID too slow to handle the number of requests?
1a) Is the MDS too slow to keep up?
2) Is it related to network issues (network too slow or too busy)?
3) Would upgrading to 1.8.x cure this?

Status -5 looks to us like an I/O error:
~ # grep '5' /usr/include/*/*errno*
/usr/include/asm-generic/errno-base.h:#define EIO 5 /* I/O error */

Together with 3ware support we are fairly sure we have replaced all suspect disks, and data transfer looks fine when Lustre is not involved (i.e. verification runs show 30-90 MB/s throughput per disk).

OSTs with 2GB RAM on the 16-port controller, 4GB RAM on the 24-port controller
RAID6 (write-once, read-often archive system)
lustre-1.6.6
vanilla kernel 2.6.22.19
3ware 9650SE (16- and 24-port), latest 9.5.3 firmware
Seagate 31000340NS disks, Hitachi 1TB disks

Thanks and Regards
Heiko


Dec 4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810070c9e000
Dec 4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810076b78000
Dec 4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810071e88000
Dec 4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810070c80000
Dec 4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810076a80000
Dec 4 12:42:56 sadosrd24 LustreError: 4744:0:(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s req@ffff81007efa7e00 x7869690/t0 o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0@NET_0x20000c0a8106f_UUID:0/0 lens 384/336 e 0 to 0 dl 1259926976 ref 1 fl Interpret:/0/0 rc 0/0
Dec 4 12:42:56 sadosrd24 Lustre: 4744:0:(ost_handler.c:939:ost_brw_read()) scia-OST001d: ignoring bulk IO comm error with eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0@NET_0x20000c0a8106f_UUID id 12345-192.168.16.111@tcp - client will retry
Dec 4 12:42:57 sadosrd24 LustreError: 4754:0:(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s req@ffff81007dedb400 x7869691/t0 o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0@NET_0x20000c0a8106f_UUID:0/0 lens 384/336 e 0 to 0 dl 1259926977 ref 1 fl Interpret:/0/0 rc 0/0
Dec 4 12:42:57 sadosrd24 Lustre: 4754:0:(ost_handler.c:939:ost_brw_read()) scia-OST001d: ignoring bulk IO comm error with eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0@NET_0x20000c0a8106f_UUID id 12345-192.168.16.111@tcp - client will retry
Dec 4 12:43:00 sadosrd24 LustreError: 4757:0:(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s req@ffff81007e8de000 x7869700/t0 o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0@NET_0x20000c0a8106f_UUID:0/0 lens 384/336 e 0 to 0 dl 1259926980 ref 1 fl Interpret:/0/0 rc 0/0
Dec 4 12:43:00 sadosrd24 LustreError: 4757:0:(ost_handler.c:882:ost_brw_read()) Skipped 2 previous similar messages
Dec 4 12:43:00 sadosrd24 Lustre: 4757:0:(ost_handler.c:939:ost_brw_read()) scia-OST001d: ignoring bulk IO comm error with eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0@NET_0x20000c0a8106f_UUID id 12345-192.168.16.111@tcp - client will retry
Dec 4 12:43:00 sadosrd24 Lustre: 4757:0:(ost_handler.c:939:ost_brw_read()) Skipped 2 previous similar messages
Dec 4 12:43:16 sadosrd24 LustreError: 4746:0:(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 94+0s req@ffff81006de99850 x7869735/t0 o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0@NET_0x20000c0a8106f_UUID:0/0 lens 384/336 e 0 to 0 dl 1259926996 ref 1 fl Interpret:/0/0 rc 0/0
Dec 4 12:43:16 sadosrd24 LustreError: 4746:0:(ost_handler.c:882:ost_brw_read()) Skipped 2 previous similar messages
Dec 4 12:43:16 sadosrd24 Lustre: 4746:0:(ost_handler.c:939:ost_brw_read()) scia-OST001d: ignoring bulk IO comm error with eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0@NET_0x20000c0a8106f_UUID id 12345-192.168.16.111@tcp - client will retry
Dec 4 12:43:16 sadosrd24 Lustre: 4746:0:(ost_handler.c:939:ost_brw_read()) Skipped 2 previous similar messages
Dec 4 12:44:13 sadosrd24 Lustre: 4728:0:(ldlm_lib.c:538:target_handle_reconnect()) scia-OST001d: eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0 reconnecting

_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Bernd Schubert

Dec 4, 2009, 1:35:01 PM
to lustre-...@lists.lustre.org, sysa...@iup.physik.uni-bremen.de, Bernd Schubert
On Friday 04 December 2009, Heiko Schröter wrote:
> Hello,
>
> we see the messages below on our OSTs under heavy _read_
> load (i.e. when 60+ jobs try to read data at approximately the same time).
> The OSTs freeze, and even console output slows to a few bytes per
> minute. After some time the OSTs recover.

> ...(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after
> 100+0s req@ffff81007efa7e00 x7869690/t0

This error message means "you have a flaky network". For example, it comes up
if you set a high MTU but your switch does not support it.
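One quick way to test the jumbo-frame theory (a sketch; the client address is taken from the logs above, and a 9000-byte MTU is an assumption about the setup) is a full-size, non-fragmenting ping across the path:

```shell
# 8972 = 9000-byte MTU minus 20 bytes IP header minus 8 bytes ICMP header.
# With -M do (don't fragment), this only succeeds if every hop in the
# path actually passes 9000-byte frames.
ping -M do -s 8972 -c 3 192.168.16.111
```

If this fails while a plain `ping` works, some device in the path is dropping or fragmenting the large frames.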


Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks

Andreas Dilger

Dec 5, 2009, 3:59:55 AM
to Heiko Schröter, sysa...@iup.physik.uni-bremen.de, lustre-...@lists.lustre.org
On 2009-12-04, at 04:59, Heiko Schröter wrote:
> we see the messages below on our OSTs under heavy
> _read_ load (i.e. when 60+ jobs try to read data at approximately the
> same time).
> The OSTs freeze, and even console output slows to a few bytes per
> minute.
> After some time the OSTs recover.
>
> Together with 3ware support we are fairly sure we have
> replaced all suspect disks, and data transfer looks fine when Lustre
> is not involved (i.e. verification runs show 30-90 MB/s throughput per disk).
>
> OSTs with 2GB RAM on 16port controller, 4GB RAM on 24port controller
> Raid6 (write once, read often archive system)
> lustre-1.6.6
> vanilla-kernel 2.6.22.19
> 3ware 9650se (16 and 24port) latest 9.5.3 Version
> Seagate 31000340NS disks, HITACHI 1TB disks

We've had problems with 3ware controllers at other sites in the past -
the performance is not as good as expected, since they rely heavily on
readahead to get good performance.

That said:

> Dec 4 12:42:56 sadosrd24 LustreError: 4744:0:(ost_handler.c:
> 882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s
> req@ffff81007efa7e00 x7869690/t0 o3->eb2e7e64-c1d9-
> d1f6-8f9d-1ba9629ff4c0@NET_0x20000c0a8106f_UUID:0/0 lens 384/336 e 0
> to 0 dl 1259926976 ref 1 fl Interpret:/0/0 rc 0/0

This means that the IO didn't complete before the timeout. This could
be because the OST IO is so slow that no RPC can complete before the
timeout, or because there is packet loss.

Some things to try:
- reduce the number of OSS threads via a module parameter:
  options ost oss_num_threads=N
- increase the Lustre timeout (details in the manual)
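Concretely, these tunings might look as follows on 1.6.x (a sketch: the thread count and timeout are illustrative values, not recommendations, and "scia" is the fsname visible in the logs above):

```shell
# On each OSS, cap the number of OST service threads,
# e.g. in /etc/modprobe.d/lustre.conf (takes effect on module reload):
options ost oss_num_threads=128

# On the MGS, raise the filesystem-wide RPC timeout (in seconds):
lctl conf_param scia.sys.timeout=300
```

Fewer service threads reduce the depth of the I/O queue on the RAID, so each individual RPC is more likely to finish inside the timeout window.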

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Heiko Schröter

Dec 7, 2009, 10:03:26 AM
to Bernd Schubert, sysa...@iup.physik.uni-bremen.de, lustre-...@lists.lustre.org
On Friday, 04 December 2009 14:35:01, Bernd Schubert wrote:

> > ...(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after
> > 100+0s req@ffff81007efa7e00 x7869690/t0
>
> This error message means "you have a flaky network". For example, it comes up
> if you set a high MTU but your switch does not support it.
>

Thank you very much for the hint.
The root cause of our problem was a D-Link Gigabit switch whose QoS had been set to 'Normal'. Additionally, two old 100Mbit devices were plugged into this switch.


Best Regards
Heiko

Heiko Schröter

Dec 7, 2009, 10:09:41 AM
to Andreas Dilger, sysa...@iup.physik.uni-bremen.de, lustre-...@lists.lustre.org
On Saturday, 05 December 2009 04:59:55, Andreas Dilger wrote:
>
> We've had problems with 3ware controllers at other sites in the past -
> the performance is not as good as expected, since they rely heavily on
> readahead to get good performance.

True. But we had serious problems (including data loss) with Adaptec controllers before.
Their performance is superb, but usability and maintenance are a nightmare in a Linux environment.

>
> That said:
>
> > Dec 4 12:42:56 sadosrd24 LustreError: 4744:0:(ost_handler.c:
> > 882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s
> > req@ffff81007efa7e00 x7869690/t0 o3->eb2e7e64-c1d9-
> > d1f6-8f9d-1ba9629ff4c0@NET_0x20000c0a8106f_UUID:0/0 lens 384/336 e 0
> > to 0 dl 1259926976 ref 1 fl Interpret:/0/0 rc 0/0
>
> This means that the IO didn't complete before the timeout. This could
> be because the OST IO is so slow that no RPC can complete before the
> timeout, or because there is packet loss.

In our case a misconfigured D-Link switch caused the problems, so it was the 'packet loss' option for us.

>
> Some things to try:
> - reduce the number of OSS threads via module parameter:
> >   options ost oss_num_threads=N
> - increase the lustre timeout (details in the manual)

Thank you very much, also to Bernd Schubert, for pointing us in the right direction.

Best Regards
Heiko
