I am looking for information on how Lustre assigns and holds pages on client nodes across jobs. The motivation is that we want to make "huge" pages available to users. We have found that it is almost impossible to allocate very many huge pages, since Lustre holds scattered small pages across jobs; typically only about 1/3 of compute-node memory can be allocated as huge pages.
We have done quite a lot of performance studies which show that a substantial percentage of jobs on Ranger have TLB misses as a major performance bottleneck. We estimate we might recover as much as an additional 5%-10% throughput if users could use huge pages.
Therefore we would like to find a way to minimize the client memory which Lustre holds across jobs.
Have you had anyone else mention this situation to you?
Regards,
Jim Browne
Kevin
_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Actually, my understanding is that /proc/sys/vm/drop_caches is NOT safe for production usage in all cases (i.e. there are bugs in some kernels, and it isn't actually meant for regular use from what I've read).
Others use huge pages in their configuration, but they reserve them at node boot time. See https://bugzilla.lustre.org/show_bug.cgi?id=14323 for details.
If you want to flush all the memory used by a Lustre client between jobs, you can do "lctl set_param ldlm.namespaces.*.lru_size=clear". Unlike Kevin's suggestion, this is Lustre-specific; drop_caches tries to flush cached memory from every filesystem.
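A rough (untested) epilogue sketch along these lines, for SGE or any other resource manager; it defaults to a dry run that only prints the commands, since the real thing needs root on a node with Lustre mounted:

```shell
#!/bin/sh
# Hypothetical job-end epilogue sketch: flush a Lustre client's caches
# between jobs. Set DRYRUN=0 and run as root to actually execute.

run() {
    if [ "${DRYRUN:-1}" -eq 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Push dirty data to the servers before dropping anything.
run sync

# Drop all client DLM locks; the cached pages they cover go with them.
# (lctl expands the wildcard itself, hence the quotes.)
run lctl set_param 'ldlm.namespaces.*.lru_size=clear'
```
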
Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
On Aug 19, 2010, at 7:07 PM, Andreas Dilger wrote:
> If you want to flush all the memory used by a Lustre client between jobs, you can do "lctl set_param ldlm.namespaces.*.lru_size=clear". Unlike Kevin's suggestion it is Lustre-specific, while drop_caches will try to flush memory from everything.
Actually, there is one extra bit that won't get freed by dropping locks: the Lustre debug logs (assuming a non-zero debug level).
They can be cleared with "lctl clear".
Bye,
Oleg
If you do not have an lwn.net account you might need to wait a few weeks:
http://lwn.net/Articles/398846/
It links to an older article about it, which should already be available to all:
http://lwn.net/Articles/359158/
And another one:
http://lwn.net/Articles/374424/
Cheers,
Bernd
--
Bernd Schubert
DataDirect Networks
On 08/19/2010 06:07 PM, Andreas Dilger wrote:
> On 2010-08-19, at 16:44, Kevin Van Maren wrote:
>> Easy way to reduce the client memory used by "Lustre" is to have
>> an Epilogue script run by SGE (or whatever scheduler/resource
>> manager) that does something like this on every node:
>>   # sync ; sleep 1 ; sync
>>   # echo 3 > /proc/sys/vm/drop_caches
>
> Actually, my understanding is that /proc/sys/vm/drop_caches is NOT
> safe for production usage in all cases (i.e. there are bugs in some
> kernels, and it isn't actually meant for regular use from what I've
> read).
That's good to know. But there are two parts to drop_caches, depending
on what you write: do you know if the unsafety comes from the part that
calls the 'slab' shrinkers, or the part that calls
invalidate_inode_pages()? I suppose it's the latter. Do you have a
pointer to a more specific description? I'm curious about which kernels
are affected; I looked but didn't turn up much.
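For reference, the two parts are selected by a bitmask (per the kernel's Documentation/sysctl/vm.txt). A sketch; the actual writes need root, so they are only printed here rather than executed:

```shell
#!/bin/sh
# The two halves of drop_caches, selected by bits of the written value:
PAGECACHE=1   # bit 0: invalidate clean pagecache pages
SLAB=2        # bit 1: run the slab shrinkers (dentries, inodes, ...)
BOTH=$((PAGECACHE | SLAB))

# drop_caches only discards *clean* pages, so flush dirty data first.
sync

echo "as root: echo $PAGECACHE > /proc/sys/vm/drop_caches   # pagecache only"
echo "as root: echo $SLAB > /proc/sys/vm/drop_caches   # slab only"
echo "as root: echo $BOTH > /proc/sys/vm/drop_caches   # both"
```
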
Thanks,
-John
--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jham...@ices.utexas.edu
(512) 471-9304
Indeed, thanks. On Ranger, the compute nodes use CompactFlash drives
for /, and so they depend on tmpfs mounts for /tmp, /var/run, /var/log,
and of course /dev/shm. So cleaning up these RAM-backed filesystems as
much as practical before asking for any hugepages is also a win.
Also, in imitation of the systems that pre-allocate all needed hugepages
at boot time, we are considering first pre-allocating a large chunk of
memory (say 7/8) in hugepages, then mounting the Lustre filesystems,
then releasing the hugepages. The hope is that Lustre's persistent
structures will thereby fit into a more compact region of memory.
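A sketch of the ordering we have in mind (untested; the mount target is made up, and the real steps need root on an actual node):

```shell
#!/bin/sh
# Sketch: reserve ~7/8 of RAM as huge pages, mount Lustre, then
# release the reservation so Lustre's long-lived allocations land
# in the remaining 1/8 of memory.

mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
hp_kb=$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo)
[ -n "$hp_kb" ] || hp_kb=2048   # assume 2 MB huge pages if not reported

# Number of huge pages covering 7/8 of RAM.
want=$(( mem_kb * 7 / 8 / hp_kb ))
echo "would reserve $want huge pages"

# As root, on the real node:
#   echo $want > /proc/sys/vm/nr_hugepages
#   mount -t lustre mgs@tcp:/fsname /mnt/lustre   # hypothetical target
#   echo 0 > /proc/sys/vm/nr_hugepages            # release the reservation
```
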
The main obstacle in testing all of this is that benchmarking the gains
from a particular approach is difficult, since we have not yet found an
easy way of producing external fragmentation of physical memory in
short order. Suggestions are welcome.
Best,
-John
--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jham...@ices.utexas.edu
(512) 471-9304
Thanks for the clarification.
Jim Browne
At 11:10 PM 8/19/2010, Oleg Drokin wrote:
>Hello!
>
>On Aug 19, 2010, at 7:07 PM, Andreas Dilger wrote:
> > If you want to flush all the memory used by a Lustre client
> between jobs, you can do "lctl set_param
> ldlm.namespaces.*.lru_size=clear". Unlike Kevin's suggestion it is
> Lustre-specific, while drop_caches will try to flush memory from everything.
>
>
>Actually there is one extra bit that won't get freed by dropping
>locks that is lustre debug logs (assuming non-zero debug level).
>It could be cleared with lctl clear
>
>Bye,
> Oleg
James C. Browne
Department of Computer Science
University of Texas
Austin, Texas 78712
Phone - 512-471-9579
Fax - 512-471-8885
bro...@cs.utexas.edu
http://www.cs.utexas.edu/users/browne
As discussed in https://bugzilla.lustre.org/show_bug.cgi?id=14323 that I previously referenced, the Lustre tunables are based on the total number of pages, and do not take huge pages into account.
Also, if the hugepages are released, there is no guarantee that you will be able to allocate them all again due to small pinned memory structures _somewhere_ in the middle of each huge page.
If you are running a prologue/epilogue script, then you should tune the Lustre cache size based on the number of huge pages that will be used. The last time this was investigated, there was no way for Lustre to know from within the kernel how many huge pages were allocated without patching it. If that has changed in newer kernels, it would be possible to dynamically adjust the cache size based on this.
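An untested prologue sketch of that tuning; max_cached_mb is a real llite tunable, but the sizing policy here (half of whatever is left after the huge-page reservation) is just an illustrative assumption:

```shell
#!/bin/sh
# Sketch: size the Lustre client cache around the job's huge pages.

mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)

# kB currently reserved as huge pages (0 if the kernel reports none).
huge_kb=$(awk '/^HugePages_Total:/ {n=$2} /^Hugepagesize:/ {hp=$2}
               END {print n * hp}' /proc/meminfo)

# Illustrative policy: let Lustre cache half of the non-hugepage memory.
cache_mb=$(( (mem_kb - huge_kb) / 1024 / 2 ))
echo "as root: lctl set_param llite.*.max_cached_mb=$cache_mb"
```
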
> The main obstacle in testing all of this is that benchmarking the gains gotten by a particular approach is difficult, since we have not yet found an easy way of producing external fragmentation of physical memory in short order. Suggestions are welcome.
Running something like "slocate" across multiple filesystems will fill all of RAM with inodes/dentries, and if you pin some of these in memory (e.g. start a shell with some deep directory as CWD), you should quickly be able to fragment your memory with unfreeable inode/dentry allocations.
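Something like the following (untested; "find" stands in for slocate on the assumption that any full metadata walk will do, and the heavyweight steps are shown but not executed):

```shell
#!/bin/sh
# Sketch: fragment memory with unfreeable inode/dentry slab allocations.

# 1. Fill RAM with dentries/inodes by statting every file:
#      find / -xdev -exec stat {} + > /dev/null 2>&1
#
# 2. Pin some of them: a process's CWD holds its dentry chain, so a
#    sleeping shell parked deep in the tree keeps those slab objects
#    (and the pages containing them) unfreeable:
#      ( cd /usr/share/doc && sleep 3600 ) &
#
# 3. Watch the reclaimable slab grow and (partially) shrink:
srec=$(awk '/^SReclaimable:/ {print $2}' /proc/meminfo)
echo "reclaimable slab: ${srec} kB"
```
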
Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.