I am looking for information on how Lustre assigns and holds pages on client nodes across jobs. The motivation is that we want to make "huge" pages available to users. We have found that it is almost impossible to allocate very many huge pages, since Lustre holds scattered small pages across jobs; typically only about 1/3 of compute-node memory can be allocated as huge pages.
We have done quite a lot of performance studies which show that a substantial percentage of jobs on Ranger have TLB misses as a major performance bottleneck. We estimate we might recover as much as an additional 5%-10% throughput if users could use huge pages.
Therefore we would like to find a way to minimize the client memory which Lustre holds across jobs.
Have you had anyone else mention this situation to you?
Regards,
Jim Browne
Kevin
_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Actually, my understanding is that /proc/sys/vm/drop_caches is NOT safe for production usage in all cases (i.e. there are bugs in some kernels, and it isn't actually meant for regular use from what I've read).
Others use huge pages in their configuration, but they reserve them at node boot time. See https://bugzilla.lustre.org/show_bug.cgi?id=14323 for details.
If you want to flush all the memory used by a Lustre client between jobs, you can do "lctl set_param ldlm.namespaces.*.lru_size=clear". Unlike Kevin's suggestion, this is Lustre-specific; drop_caches tries to flush cached memory from every filesystem.
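A rough (untested) epilogue sketch along these lines, for SGE or any other resource manager; it defaults to a dry run that only prints the commands, since the real thing needs root on a node with Lustre mounted:

```shell
#!/bin/sh
# Hypothetical job-end epilogue sketch: flush a Lustre client's caches
# between jobs. Set DRYRUN=0 and run as root to actually execute.

run() {
    if [ "${DRYRUN:-1}" -eq 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Push dirty data to the servers before dropping anything.
run sync

# Drop all client DLM locks; the cached pages they cover go with them.
# (lctl expands the wildcard itself, hence the quotes.)
run lctl set_param 'ldlm.namespaces.*.lru_size=clear'
```
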
Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
On Aug 19, 2010, at 7:07 PM, Andreas Dilger wrote:
> If you want to flush all the memory used by a Lustre client between jobs, you can do "lctl set_param ldlm.namespaces.*.lru_size=clear". Unlike Kevin's suggestion it is Lustre-specific, while drop_caches will try to flush memory from everything.
Actually, there is one extra bit that won't get freed by dropping locks: the Lustre debug logs (assuming a non-zero debug level).
They can be cleared with "lctl clear".
Bye,
Oleg
If you do not have an lwn.net account you might need to wait a few weeks:
http://lwn.net/Articles/398846/
It links to an older article about it, which should already be available to all:
http://lwn.net/Articles/359158/
And another one:
http://lwn.net/Articles/374424/
Cheers,
Bernd
--
Bernd Schubert
DataDirect Networks
On 08/19/2010 06:07 PM, Andreas Dilger wrote:
> On 2010-08-19, at 16:44, Kevin Van Maren wrote:
>> Easy way to reduce the client memory used by "Lustre" is to have
>> an Epilogue script run by SGE (or whatever scheduler/resource
>> manager) that does something like this on every node:
>>   # sync ; sleep 1 ; sync
>>   # echo 3 > /proc/sys/vm/drop_caches
>
> Actually, my understanding is that /proc/sys/vm/drop_caches is NOT
> safe for production usage in all cases (i.e. there are bugs in some
> kernels, and it isn't actually meant for regular use from what I've
> read).
That's good to know. But there are two parts to drop_caches, depending
on what you write: do you know if the unsafety comes from the part that
calls the 'slab' shrinkers, or the part that calls
invalidate_inode_pages()? I suppose it's the latter. Do you have a
pointer to a more specific description? I'm curious about which kernels
are affected; I looked but didn't turn up much.
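For reference, the two parts are selected by a bitmask (per the kernel's Documentation/sysctl/vm.txt). A sketch; the actual writes need root, so they are only printed here rather than executed:

```shell
#!/bin/sh
# The two halves of drop_caches, selected by bits of the written value:
PAGECACHE=1   # bit 0: invalidate clean pagecache pages
SLAB=2        # bit 1: run the slab shrinkers (dentries, inodes, ...)
BOTH=$((PAGECACHE | SLAB))

# drop_caches only discards *clean* pages, so flush dirty data first.
sync

echo "as root: echo $PAGECACHE > /proc/sys/vm/drop_caches   # pagecache only"
echo "as root: echo $SLAB > /proc/sys/vm/drop_caches   # slab only"
echo "as root: echo $BOTH > /proc/sys/vm/drop_caches   # both"
```
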
Thanks,
-John
--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jham...@ices.utexas.edu
(512) 471-9304
Indeed, thanks. On Ranger, the compute nodes use CompactFlash drives
for /, and so they depend on tmpfs mounts for /tmp, /var/run, /var/log,
and of course /dev/shm. So cleaning up these RAM-backed filesystems as
much as practical before asking for any hugepages is also a win.
Also, in imitation of the systems that pre-allocate all needed hugepages
at boot time, we are considering first pre-allocating a large chunk of
memory (say 7/8) in hugepages, then mounting the Lustre filesystems,
then releasing the hugepages. The hope is that Lustre's persistent
structures will thereby fit into a more compact region of memory.
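A sketch of the ordering we have in mind (untested; the mount target is made up, and the real steps need root on an actual node):

```shell
#!/bin/sh
# Sketch: reserve ~7/8 of RAM as huge pages, mount Lustre, then
# release the reservation so Lustre's long-lived allocations land
# in the remaining 1/8 of memory.

mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
hp_kb=$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo)
[ -n "$hp_kb" ] || hp_kb=2048   # assume 2 MB huge pages if not reported

# Number of huge pages covering 7/8 of RAM.
want=$(( mem_kb * 7 / 8 / hp_kb ))
echo "would reserve $want huge pages"

# As root, on the real node:
#   echo $want > /proc/sys/vm/nr_hugepages
#   mount -t lustre mgs@tcp:/fsname /mnt/lustre   # hypothetical target
#   echo 0 > /proc/sys/vm/nr_hugepages            # release the reservation
```
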
The main obstacle in testing all of this is that benchmarking the gains
from a particular approach is difficult, since we have not yet found an
easy way of producing external fragmentation of physical memory in
short order. Suggestions are welcome.
Best,
-John
--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jham...@ices.utexas.edu
(512) 471-9304
Thanks for the clarification.
Jim Browne
At 11:10 PM 8/19/2010, Oleg Drokin wrote:
>Hello!
>
>On Aug 19, 2010, at 7:07 PM, Andreas Dilger wrote:
> > If you want to flush all the memory used by a Lustre client
> between jobs, you can do "lctl set_param
> ldlm.namespaces.*.lru_size=clear". Unlike Kevin's suggestion it is
> Lustre-specific, while drop_caches will try to flush memory from everything.
>
>
>Actually there is one extra bit that won't get freed by dropping
>locks that is lustre debug logs (assuming non-zero debug level).
>It could be cleared with lctl clear
>
>Bye,
> Oleg
James C. Browne
Department of Computer Science
University of Texas
Austin, Texas 78712
Phone - 512-471-9579
Fax - 512-471-8885
bro...@cs.utexas.edu
http://www.cs.utexas.edu/users/browne
As discussed in https://bugzilla.lustre.org/show_bug.cgi?id=14323 that I previously referenced, the Lustre tunables are based on the total number of pages, and do not take huge pages into account.
Also, if the hugepages are released, there is no guarantee that you will be able to allocate them all again due to small pinned memory structures _somewhere_ in the middle of each huge page.
If you are running a prologue/epilogue script, then you should tune the Lustre cache size based on the number of huge pages that will be used. The last time this was investigated, there was no way for Lustre to know from within the kernel how many huge pages were allocated without patching it. If that has changed in newer kernels, it would be possible to dynamically adjust the cache size based on this.
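An untested prologue sketch of that tuning; max_cached_mb is a real llite tunable, but the sizing policy here (half of whatever is left after the huge-page reservation) is just an illustrative assumption:

```shell
#!/bin/sh
# Sketch: size the Lustre client cache around the job's huge pages.

mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)

# kB currently reserved as huge pages (0 if the kernel reports none).
huge_kb=$(awk '/^HugePages_Total:/ {n=$2} /^Hugepagesize:/ {hp=$2}
               END {print n * hp}' /proc/meminfo)

# Illustrative policy: let Lustre cache half of the non-hugepage memory.
cache_mb=$(( (mem_kb - huge_kb) / 1024 / 2 ))
echo "as root: lctl set_param llite.*.max_cached_mb=$cache_mb"
```
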
> The main obstacle in testing all of this is that benchmarking the gains gotten by a particular approach is difficult, since we have not yet found an easy way of producing external fragmentation of physical memory in short order. Suggestions are welcome.
Running something like "slocate" across multiple filesystems will fill all of RAM with inodes/dentries, and if you pin some of these in memory (e.g. start a shell with some deep directory as CWD), you should quickly be able to fragment your memory with unfreeable inode/dentry allocations.
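Something like the following (untested; "find" stands in for slocate on the assumption that any full metadata walk will do, and the heavyweight steps are shown but not executed):

```shell
#!/bin/sh
# Sketch: fragment memory with unfreeable inode/dentry slab allocations.

# 1. Fill RAM with dentries/inodes by statting every file:
#      find / -xdev -exec stat {} + > /dev/null 2>&1
#
# 2. Pin some of them: a process's CWD holds its dentry chain, so a
#    sleeping shell parked deep in the tree keeps those slab objects
#    (and the pages containing them) unfreeable:
#      ( cd /usr/share/doc && sleep 3600 ) &
#
# 3. Watch the reclaimable slab grow and (partially) shrink:
srec=$(awk '/^SReclaimable:/ {print $2}' /proc/meminfo)
echo "reclaimable slab: ${srec} kB"
```
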
Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.