[Lustre-discuss] tuning for small I/O


Jay Christopherson

Jan 7, 2010, 4:17:44 PM
to lustre-...@lists.lustre.org
I'm attempting to run a pair of ActiveMQ Java instances, using a shared Lustre filesystem mounted with flock for failover purposes. There are lots of ways to do ActiveMQ failover, and shared filesystem just happens to be the easiest.

ActiveMQ, at least the way we are using it, does a lot of small I/Os, like 600-800 IOPS worth of 6K I/Os. When I attempt to use Lustre as the shared filesystem, I see major I/O wait time on the CPUs, around 40-50%. My OSSs and MDS don't seem to be particularly busy, being 90% idle or more while this is running. If I remove Lustre from the equation and simply write to local disk or to an iSCSI-mounted SAN disk, my ActiveMQ instances don't seem to have any problems.

The disks backing the OSSs are all 15K SAS disks in a RAID1 config. The OSSs (two of them) each have 8GB of memory and 4 CPU cores and are doing nothing else except being OSSs. The MDS has one CPU and 4GB of memory and is 98% idle while under this ActiveMQ load. The network I am using for Lustre is dedicated gigabit ethernet, and there are 8 clients, two of which are these ActiveMQ clients.

So, my question is:

1.  What should I be looking at to tune my Lustre FS for this type of IO?  I've tried upping the lru_size of the MDT and OST namespaces in /proc/fs/lustre/ldlm to 5000 and 2000 respectively, but I don't really see much difference.  I have also ensured that striping is disabled (lfs setstripe -d) on the shared directory.
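
For reference, what I did is roughly equivalent to the following (the directory is just an example path; I actually echoed the values into /proc/fs/lustre/ldlm):

  lctl set_param ldlm.namespaces.*mdc*.lru_size=5000   # MDT lock namespace LRU
  lctl set_param ldlm.namespaces.*osc*.lru_size=2000   # OST lock namespace LRUs
  lfs setstripe -d /mnt/lustre/activemq                # clear default striping on the shared dir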

I guess I am just not experienced enough yet with Lustre to know how to track down and resolve this issue. I would think Lustre should be able to handle this load, but I must be missing something. For the record, NFS was not able to handle this load either, at least with default export settings (async improved things, but async is not an option).

- Jay

Atul Vidwansa

Jan 8, 2010, 4:49:34 AM
to Jay Christopherson, lustre-...@lists.lustre.org
Hi Jay,

There are multiple ways to tune Lustre for small IO. If you search the
lustre-discuss archives, you will find many threads on the same topic.
I have some suggestions below:

Jay Christopherson wrote:
> I'm attempting to run a pair of ActiveMQ java instances, using a
> shared Lustre filesystem mounted with flock for failover purposes.
> There's lots of ways to do ActiveMQ failover and shared filesystem
> just happens to be the easiest.
>
> ActiveMQ, at least the way we are using it, does a lot of small I/O's,
> like 600 - 800 IOPS worth of 6K I/O's. When I attempt to use Lustre
> as the shared filesystem, I see major IO wait time on the cpu's, like
> 40 - 50%. My OSS's and MDS don't seem to be particularly busy being
> 90% idle or more while this is running. If I remove Lustre from the
> equation and simply write to local disk OR to an iSCSI mounted SAN
> disk, my ActiveMQ instances don't seem to have any problems.
>
> The disk that is backing the OSS's are all SAS 15K disks in a RAID1
> config. The OSS's (2 of them) each have 8GB of memory and 4 cpu cores
> and are doing nothing else except being OSS's. The MDS has one cpu
> and 4G of memory and is 98% idle while under this ActiveMQ load. The
> network I am using for Lustre is dedicated gigabit ethernet and there
> are 8 clients, two of which are these ActiveMQ clients.

First of all, I would suggest benchmarking your Lustre setup for a
small-file workload. For example, use Bonnie++ in IOPS mode to create
small files on Lustre. That will tell you the limits of your Lustre
setup. I got about 6000 creates/sec on my 12-disk (Seagate SAS 15K RPM
300 GB) RAID10 setup.
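
Something along these lines should do (the mount point and user are
just examples; the first -n number is in units of 1024 files):

  bonnie++ -d /mnt/lustre/bench -s 0 -n 16:6144:6144:16 -u nobody

That skips the large-file throughput tests (-s 0) and just creates,
stats and deletes lots of ~6K files.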

>
> So, my question is:
>
> 1. What should I be looking at to tune my Lustre FS for this type of
> IO? I've tried upping the lru_size of the MDT and OST namespaces in
> /proc/fs/lustre/ldlm to 5000 and 2000 respectively, but I don't really
> see much difference. I have also ensured that striping is disabled
> (lfs setstripe -d) on the shared directory.

Try disabling Lustre debug messages on all clients:

sysctl -w lnet.debug=0

Try increasing dirty cache on client nodes:

lctl set_param osc.*.max_dirty_mb=256

Also, you can bump up max RPCs in flight from 8 to 32, but given that
you have a gigabit ethernet network, I don't think it will improve
performance.
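
For completeness, that would be something like:

  lctl set_param osc.*.max_rpcs_in_flight=32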

Cheers,
-Atul


>
> I guess I am just not experienced enough yet with Lustre to know how
> to track down and resolve this issue. I would think Lustre should be
> able to handle this load, but I must be missing something. For the
> record, NFS was not able to handle this load either, at least with
> default export settings (async was improved, but async is not an option).
>
> - Jay


Sheila Barthel

Jan 8, 2010, 10:31:27 AM
to Atul Vidwansa, lustre-...@lists.lustre.org
Also, the Lustre manual includes a section on improving performance when
working with small files:

http://manual.lustre.org/manual/LustreManual18_HTML/LustreTroubleshootingTips.html#50532481_pgfId-1291398

Sheila

Peter Grandi

Jan 10, 2010, 5:19:05 PM
to Lustre discussion
The subject of this email "[Lustre-discuss] tuning for small I/O" is
a bit in the category of "tuning jackhammers to cut diamonds".

Lustre has been designed for massive streaming parallel IO, and does
OK-ish for traditional ("home dir") situations. Not necessarily
for shared message databases.

>>> I'm attempting to run a pair of ActiveMQ java instances,

Life will improve I hope :-).

>>> using a shared Lustre filesystem mounted with flock for failover
>>> purposes.

The 'flock' is the key issue here, probably even more than the
"small I/O" issue.

Consider this thread on a very similar topic:

http://lists.lustre.org/pipermail/lustre-discuss/2008-October/009001.html
"The other alternative is "-o flock", which is coherent locking
across all clients, but has a noticable performance impact"

http://lists.lustre.org/pipermail/lustre-discuss/2009-February/009690.html
"It is both not very optimized and slower than local system since
it needs to send network rpcs for locking (Except for the
localflock which is same speed as for local fs)."

http://lists.lustre.org/pipermail/lustre-discuss/2004-August/000425.html
"We faced similar issues when we tried to access/modify a single
file concurrently from multiple processes (across multiple
clients) using the MPI-IO interfaces. We faced similar issues
with other file systems as well, so we resorted to implementing
our own file/record-locking in the MPI-IO middleware (on top of
file-systems)."

http://lists.lustre.org/pipermail/lustre-discuss/2009-February/009679.html

etc.

>>> There's lots of ways to do ActiveMQ failover and shared
>>> filesystem just happens to be the easiest.

The easiest option is very tempting, but perhaps not the most
effective. If you really cared about getting meaningful replies, you
would have provided these links, BTW:

http://activemq.apache.org/shared-file-system-master-slave.html
http://activemq.apache.org/replicated-message-store.html

Doing a bit more searching, it turns out that there are several
ways to tune ActiveMQ, and this may reduce the number of barrier
operations/committed transactions. Maybe. There seems to be
something vaguely interesting here:

http://fusesource.com/docs/broker/5.0/persistence/persistence.pdf

Otherwise I'd use the master/slave replication feature, but this
is just an impression.

>>> ActiveMQ, at least the way we are using it, does a lot of small
>>> I/O's, like 600 - 800 IOPS worth of 6K I/O's.

That seems pretty reasonable. I guess that is a few hundred/s worth
of journal updates. The problem is, they will be mostly hitting the
same files, thus the need for 'flock' and synchronous updates.

So it matters *very much* how many of those 6K IOs are
transactional, that is, involve locking and flushing to disk.
I suspect from your problems and the later statement "async is
not an option" that each of them is a transaction.

>>> When I attempt to use Lustre as the shared filesystem, I see
>>> major IO wait time on the cpu's, like 40 - 50%.

Why do many people fixate on IO wait? Just because it is easy to
see? Bah!

If there is one, what is the performance problem *on the client*
in terms of *client application issues*? That's what matters.

>>> My OSS's and MDS don't seem to be particularly busy

Unsurprisingly. How many OSSes and how many OSTs per OSS, and
how many disks? Just curiosity, it is not that important.

>>> being 90% idle or more while this is running.

Ideally they would be 100% idle :-).

>>> If I remove Lustre from the equation and simply write to local
>>> disk OR to an iSCSI mounted SAN disk, my ActiveMQ instances
>>> don't seem to have any problems.

And which problems do you have when running with Lustre? You haven't
said. "major IO wait" and "90% idle" are not problems, they are
statistics, and they could mean something else.

>>> The disk that is backing the OSS's are all SAS 15K disks in a
>>> RAID1 config.

RAID1 is nice, but how many? That would be a very important detail.

>>> 1. What should I be looking at to tune my Lustre FS for this
>>> type of IO?

Not really. It is both a storage system problem and a network
protocol problem.

The "small I/O" problem is the least of the two, the real problem is
that you have "small I/O on a shared filesystem with distributed
interlocked updates to the same files", that is network protocol
problem.

The network protocol problem is very, very difficult, because the
server needs to synchronize two clients and present a current image
of the files to both; that is, when one client does an update, the
other client must be able to see it "immediately", which is not easy.
For example, I have heard reports that when writing from a client to
a Lustre server, sometimes (in a small percentage of cases) another
client only sees the update dozens of seconds later (though your use
of locking may help with that). I wonder if locking is actually
enabled and used on that system, BTW.
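
(That is easy to check on a client:

  grep lustre /proc/mounts

and look for 'flock' or 'localflock' among the mount options.)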

>>> [ ... ] I have also ensured that striping is disabled (lfs
>>> setstripe -d) on the shared directory.

Unless your files are really big, that does not matter. Uhm, the
message store actually seems to use a few biggish (32MB?) journal
files plus (perhaps smaller) indices:

http://activemq.apache.org/persistence.html
http://activemq.apache.org/kahadb.html
http://activemq.apache.org/amq-message-store.html
http://activemq.apache.org/should-i-use-transactions.html
http://activemq.apache.org/how-lightweight-is-sending-a-message.html

So perhaps the striping does have an effect.
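
(A quick check, with the shared directory path being just an example:

  lfs getstripe /mnt/lustre/activemq

shows the stripe count and objects the journal files actually got.)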

>>> I guess I am just not experienced enough yet with Lustre to know
>>> how to track down and resolve this issue. I would think Lustre
>>> should be able to handle this load, but I must be missing
>>> something.

Sure, it is able to handle that load -- it does, at great effort and
going against the grain of what Lustre has been designed for.

The basic problem is that instead of using a low latency distributed
communication system for interlocked updating of the message store,
you are attempting to use the implicit one in a filesystem because
"shared filesystem just happens to be the easiest", even if they are
not meant to give you a high transaction rate with low latency for a
shared database.

In ordinary shared filesystems, locking is provided for coarse
protection and IO is expected to be of fairly coarse granularity
too, and more so for Lustre and other cluster systems.

>>> For the record, NFS was not able to handle this load either, at
>>> least with default export settings

It is very, very difficult to handle that workload across multiple
clients in a distributed filesystem. Then NFSv3 or older versions
have their own additional issues.

>>> (async was improved, but async is not an option).

If 'async' is not an option you have a big problem in general as
hinted above.

Also, but not very related here, the NFS client for Linux has some
significant performance problems with writing. To the point that
sometimes I think that Lustre can be used to replace NFS even when
no clustering is desired (single OSS), simply because its protocol
is better (and there is a point also to LNET).

>> First of all, I would suggest benchmarking your Lustre setup for
>> small file workload.

I may have misunderstood, but the original poster nowhere wrote
"small file workload"; he wrote "small I/O", which is quite different.

The shared message store he has set up is updated concurrently with
small I/O transactions, but it is contained in journals of probably
a few dozen MB each.

>> For example, use Bonnie++ in IOPS mode to create small sized
>> files on Lustre. That will tell you limit of Lustre setup. I got
>> about 6000 creates/sec on my 12 disk (Seagate SAS 15K RPM 300 GB)
>> RAID10 setup.

Small files and creates/sec seem not to be what the original poster
is worried about, even if 1000 metadata operations/s per disk pair
seems nice indeed.

>> Try disabling Lustre debug messages on all clients: sysctl -w
>> lnet.debug=0

That may help, I hadn't thought of that.

>> Try increasing dirty cache on client nodes: lctl set_param
>> osc.*.max_dirty_mb=256 Also, you can bump up max rpcs in flight
>> from 8 to 32 but given that you have gigabit ethernet network, I
>> don't think it will improve performance.

That can be counterproductive, as the problem seems to be concurrent
interlocked updates from multiple clients to the persistent database
of a shared queue system (as is clear from the point "async is not an
option").

> Also, the Lustre manual includes a section on improving performance when
> working with small files:
> http://manual.lustre.org/manual/LustreManual18_HTML/LustreTroubleshootingTips.html#50532481_pgfId-1291398

The real problem, as hinted above, is that interlocking is of the
essence in the application above, for a message store used by many
distributed clients, and the message store is not made up of small
files.

However, there is an interesting point there about precisely the
type of issue I was alluding to above about interlocking:

"By default, Lustre enforces POSIX coherency semantics, so it
results in lock ping-pong between client nodes if they are all
writing to the same file at one time."

Perhaps the other advice may also be relevant:

"Add more disks or use SSD disks for the OSTs. This dramatically
improves the IOPS rate."

but I think it is mostly a locking latency issue. If it were a
small-transaction issue, a barrier every 6K may not work well with
SSDs, which usually have an erase page size of around 32KiB.

Good luck ;-).

Peter Grandi

Jan 16, 2010, 12:58:51 PM
to Lustre discussion
I have received some offline updates about this story:

>>> I'm attempting to run a pair of ActiveMQ java instances,
>>> using a shared Lustre filesystem mounted with flock for
>>> failover purposes.

> The 'flock' is the key issue here, probably even more than the
> "small I/O" issue. [ ... ]

>>> [ ... ] ActiveMQ, at least the way we are using it, does a
>>> lot of small I/O's, like 600 - 800 IOPS worth of 6K I/O's.

> That seems pretty reasonable. I guess that is a few hundred/s
> worth of journal updates. The problem is, they will be mostly
> hitting the same files, thus the need for 'flock' and
> synchronous updates. So it matters *very much* how many of
> those 6K IOs are transactional, that is, involve locking and
> flushing to disk. I suspect from your problems and the later
> statement "async is not an option" that each of them is a
> transaction.

>>> The disk that is backing the OSS's are all SAS 15K disks in
>>> a RAID1 config.

> RAID1 is nice, but how many? That would be a very important
> detail.

This apparently is a 14-drive RAID10 (hopefully a true RAID10
7x(1+1) rather than the RAID01 7+7 mentioned offline).

That means a total rate of perhaps 100-120 6K transactions per
disk, if lucky (that depends on the number of log files and
their spread).

The total data rate over Lustre is around 5MB/s, and even with
just 6K per operation Lustre should be doing that, even if I
suspect that the achievable 'flock' rate depends more on the MDS
storage system than the OSS one.
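
In rough numbers (assuming the writes spread evenly over the pairs):

  800 IOPS x 6KB      ~= 4.7MB/s over the wire
  800 IOPS / 7 pairs  ~= 115 IOPS per mirror pair (each write hits both disks)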

If every write is a transaction, and (hopefully) ActiveMQ
requests committing every transaction to stable storage, then it
is both a 'flock' and an 'fsync' problem.

Then, depending on the size of the queue, I'd also look, if not
already done, at using host adapters with a fairly large battery-backed
buffer/cache for both the MDS and the OSSes, as latency may be due
to waiting for uncached writes. Sure, the setup already seems to
work fast enough when the disks are local, which may mean
over-the-wire latencies add too much, but reducing the storage
system latency may help, even if it is not needed in the local case.

That is purely a storage layer issue (for both MDTs and OSTs),
and has nothing to do with Lustre itself, while the 'flock' issue
(and flushing from the *clients*) has to do with Lustre (even if
it too *may* be alleviated by very low latency battery-backed
buffers/caches).

Again, interlocked stable ('flock'/'fsync') storage operations
between two clients via a third server are difficult to make
fast, because of latency and flushing issues, in the context of
remote file access, either general purpose like NFS or parallel
bulk streaming like Lustre.
