profiling memory


Nicholas Clark

Dec 14, 2009, 10:54:20 AM
to develnyt...@googlegroups.com
So, at work we have some server processes which are gluttons for memory.
We'd like to work out why, which is tricky. At least, I think that it's tricky.

So, I considered what would be the simplest thing that *might* work.
Devel::NYTProf can measure the elapsed time at a statement granularity, so
why not use that framework to also measure memory usage?

First go was trying to measure VMSize, which is relatively easy on Linux from
/proc, but turned out not to be that useful, because glibc on the work x86_64
systems was allocating 33 *pages* at a time. Pah!
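
(For reference, "relatively easy" means something as small as this sketch -
the first field of /proc/self/statm is the total program size, in pages:)

    #include <stdio.h>

    /* Sketch only: read the process's total VM size, in pages, from /proc.
       Returns -1 if /proc/self/statm can't be read (i.e. not Linux). */
    static long vm_size_pages(void) {
        long pages = -1;
        FILE *f = fopen("/proc/self/statm", "r");
        if (f) {
            if (fscanf(f, "%ld", &pages) != 1)
                pages = -1;
            fclose(f);
        }
        return pages;
    }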

So, I've reached iteration two.

I've written a malloc() interception library, which lets me keep track of how
much memory is currently allocated by malloc()/realloc(), and a *bodge* to
NYTProf.xs to interrogate it in all the places where NYTProf.xs currently
reads the time.

This seems to work "reasonably" well, although it turns out that considerable
hoop-jumping is needed on x86_64 Linux, because dlsym() calls malloc(), which
isn't "helpful" when you're in your own malloc() calling dlsym() to find the
real malloc(). But the shared library works on Linux (x86 and x86_64 tested)
and FreeBSD without recompiling perl, via LD_PRELOAD.
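
(To give a flavour of it - the attached malloc_intercept.c is more involved,
but a minimal sketch of the idea looks like this; the static bootstrap buffer
is the hoop-jumping for dlsym() calling malloc(), it's not thread-safe, and
free()/realloc() tracking is omitted:)

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stddef.h>

    static void *(*real_malloc)(size_t);
    static size_t total_allocated;

    /* small static pool to satisfy any malloc() that dlsym() itself makes */
    static char bootstrap[4096];
    static size_t bootstrap_used;

    size_t get_total_allocated(void) { return total_allocated; }

    void *malloc(size_t size) {
        if (!real_malloc) {
            static int resolving;
            if (resolving) {
                /* dlsym() has re-entered us - hand out bootstrap memory */
                void *p = bootstrap + bootstrap_used;
                bootstrap_used += (size + 15) & ~(size_t)15;
                return p;
            }
            resolving = 1;
            real_malloc = (void *(*)(size_t)) dlsym(RTLD_NEXT, "malloc");
            resolving = 0;
        }
        total_allocated += size;  /* the real library also adjusts on free()/realloc() */
        return real_malloc(size);
    }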

Sadly OS X doesn't have LD_PRELOAD, so you have to explicitly link with the
interception library.

Should also work on Solaris, or anywhere else where DynaLoader uses
dl_dlopen.xs, but not tested.


I'm not sure where to go from here, or whether it's really something that
Devel::NYTProf would want to support as an option. Right now the
implementation (appended) is definitely a "prototype":

* whilst both memory deltas and times are written to/read from the profile file,
the report generation is unchanged - right now the memory delta is bodged
in, in place of the time, so the line-by-line columns in "seconds" are
actually bytes.
(and have large integers, which the formatting code doesn't expect,
and can be negative, which the colouring code doesn't expect)

I'm not sure how to proceed here. 1 or 2 more columns in the HTML reports?
And hence modification to all the layers between the stream reader and HTML?

* callbacks don't change their tag, but have 1 extra parameter with the memory
size delta or undef if there isn't a delta present.
Likewise the block_line_num and sub_line_num are undef if they're not
present, to keep the memory delta at the same position.

This is completely untested and feels like a bit of a bodge.

* hardcoded prototype for get_total_allocated();
* no good handling of output format overflows


The appended patch *will* fail tests, because the tests expect (positive)
seconds, not signed memory deltas. :-)
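
(In case it's not obvious from the patch: the signed delta is squeezed into
the existing unsigned-int stream format written by output_int() as
sign-and-magnitude - magnitude shifted left one bit, sign in the low bit -
so, ignoring the clipping of huge deltas, it's roughly:)

    /* pack a signed memory delta into the unsigned int stream format */
    static unsigned int pack_delta(long delta) {
        return (delta >= 0) ? (unsigned int)delta << 1
                            : ((unsigned int)(-delta) << 1) | 1;
    }
    /* and the reader side undoes it */
    static long unpack_delta(unsigned int raw) {
        return (raw & 1) ? -(long)(raw >> 1) : (long)(raw >> 1);
    }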


To build the shared library, you need something like

gcc -shared -Os -Wall -L/usr/local/lib -fstack-protector malloc_intercept.c -o malloc_intercept.so

or

gcc -fPIC -shared -g -Wall -L/usr/local/lib -fstack-protector malloc_intercept.c -o malloc_intercept.so -ldl

(or something different on OS X, but that command line is on a laptop at home)

and you need to run your profile something like this:

LD_PRELOAD=~/Perl/malloc_intercept/malloc_intercept.so PERL5OPT=-d:NYTProf script


(or under OS X you need to explicitly link your perl binary with
malloc_intercept.dylib)


I'm still digesting whether it actually produces useful-enough results to be
good at pointing out where memory is going. I'm confident that the
interception library is recording exactly what *is* allocated. However, I'm
not sure if this is really that useful a metric for solving actual problems.

Nicholas Clark

Index: NYTProf.xs
===================================================================
--- NYTProf.xs (revision 949)
+++ NYTProf.xs (working copy)
@@ -19,6 +19,8 @@
#define PERL_NO_GET_CONTEXT /* we want efficiency */
#endif

+#define SIZES_TOO
+
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
@@ -33,6 +35,10 @@
# include "ppport.h"
#endif

+#ifdef SIZES_TOO
+size_t get_total_allocated();
+#endif
+
#ifdef USE_HARD_ASSERT
#undef NDEBUG
#include <assert.h>
@@ -124,6 +130,8 @@
#define NYTP_TAG_COMMENT '#' /* till newline */
#define NYTP_TAG_TIME_BLOCK '*'
#define NYTP_TAG_TIME_LINE '+'
+#define NYTP_TAG_TIME_BLOCK_MEM '('
+#define NYTP_TAG_TIME_LINE_MEM ')'
#define NYTP_TAG_DISCOUNT '-'
#define NYTP_TAG_NEW_FID '@'
#define NYTP_TAG_SRC_LINE 'S' /* fid, line, str */
@@ -342,6 +350,10 @@
#endif
static time_of_day_t start_time;
static time_of_day_t end_time;
+#ifdef SIZES_TOO
+static size_t start_size;
+static size_t end_size;
+#endif

static unsigned int last_executed_line;
static unsigned int last_executed_fid;
@@ -1851,14 +1863,51 @@
if (overflow) /* XXX later output overflow to file */
logwarn("profile time overflow of %ld seconds discarded\n", overflow);

+#ifdef SIZES_TOO
+ end_size = get_total_allocated();
+#endif
+
reinit_if_forked(aTHX);

if (last_executed_fid) {
+#ifdef SIZES_TOO
+ if (start_size && end_size) {
+ unsigned int output;

- output_tag_int((unsigned char)((profile_blocks)
- ? NYTP_TAG_TIME_BLOCK : NYTP_TAG_TIME_LINE), elapsed);
- output_int(last_executed_fid);
- output_int(last_executed_line);
+ if (end_size >= start_size) {
+ size_t increase = end_size - start_size;
+ if (increase > (UINT_MAX >> 1)) {
+ warn("profile size delta of +%lu seconds clipped",
+ (unsigned long) increase);
+ increase = UINT_MAX >> 1;
+ }
+ output = increase << 1;
+ } else {
+ size_t decrease = start_size - end_size;
+ if (decrease > (UINT_MAX >> 1)) {
+ warn("profile size delta of -%lu seconds clipped",
+ (unsigned long) decrease);
+ decrease = UINT_MAX >> 1;
+ }
+ output = 1 | (decrease << 1);
+ }
+ output_tag_int((unsigned char)((profile_blocks)
+ ? NYTP_TAG_TIME_BLOCK_MEM : NYTP_TAG_TIME_LINE_MEM), elapsed);
+
+ output_int(last_executed_fid);
+ output_int(last_executed_line);
+ output_int(output);
+ } else
+#endif
+ {
+ output_tag_int((unsigned char)((profile_blocks)
+ ? NYTP_TAG_TIME_BLOCK : NYTP_TAG_TIME_LINE), elapsed);
+
+ output_int(last_executed_fid);
+ output_int(last_executed_line);
+ }
+
+
if (profile_blocks) {
output_int(last_block_line);
output_int(last_sub_line);
@@ -1934,6 +1983,10 @@
get_ticks_between(end_time, start_time, elapsed, overflow);
cumulative_overhead_ticks += elapsed;

+#ifdef SIZES_TOO
+ start_size = get_total_allocated();
+#endif
+
SETERRNO(saved_errno, 0);
return;
}
@@ -3075,6 +3128,9 @@

/* discard time spent since profiler was disabled */
get_time_of_day(start_time);
+#ifdef SIZES_TOO
+ start_size = get_total_allocated();
+#endif

return prev_is_profiling;
}
@@ -3161,6 +3217,8 @@
PL_perldb |= PERLDBf_SINGLE; /* start (after BEGINs) with single-step on XXX still needed? */
}

+ start_size = get_total_allocated();
+
if (profile_opts & NYTP_OPTf_OPTIMIZE)
PL_perldb &= ~PERLDBf_NOOPT;
else PL_perldb |= PERLDBf_NOOPT;
@@ -3990,6 +4048,10 @@
break;
}

+#ifdef SIZES_TOO
+ case NYTP_TAG_TIME_LINE_MEM: /*FALLTHRU*/
+ case NYTP_TAG_TIME_BLOCK_MEM: /*FALLTHRU*/
+#endif
case NYTP_TAG_TIME_LINE: /*FALLTHRU*/
case NYTP_TAG_TIME_BLOCK:
{
@@ -4001,11 +4063,29 @@
unsigned int ticks = read_int();
unsigned int file_num = read_int();
unsigned int line_num = read_int();
+#ifndef SIZES_TOO
+ const bool is_block = (c == NYTP_TAG_TIME_BLOCK);
+#else
+ const bool is_block = (c == NYTP_TAG_TIME_BLOCK || c == NYTP_TAG_TIME_BLOCK_MEM);
+ int allocated_diff = 0;
+ bool got_diff;
+#endif

+#ifdef SIZES_TOO
+ if (c == NYTP_TAG_TIME_LINE_MEM || c == NYTP_TAG_TIME_BLOCK_MEM) {
+ const unsigned int diff_raw = read_int();
+ allocated_diff = diff_raw >> 1;
+ if (diff_raw & 1)
+ allocated_diff = -allocated_diff;
+ got_diff = TRUE;
+ } else
+ got_diff = FALSE;
+#endif
+
if (cb) {
PUSHMARK(SP);

- XPUSHs(c == NYTP_TAG_TIME_BLOCK ? cb_TIME_BLOCK_tag : cb_TIME_LINE_tag);
+ XPUSHs(is_block ? cb_TIME_BLOCK_tag : cb_TIME_LINE_tag);

i = 0;
sv_setiv(cb_args[i], eval_file_num); XPUSHs(cb_args[i++]);
@@ -4014,10 +4094,25 @@
sv_setiv(cb_args[i], file_num); XPUSHs(cb_args[i++]);
sv_setiv(cb_args[i], line_num); XPUSHs(cb_args[i++]);

- if (c == NYTP_TAG_TIME_BLOCK) {
+ if (is_block) {
sv_setiv(cb_args[i], read_int()); XPUSHs(cb_args[i++]); /* block_line_num */
sv_setiv(cb_args[i], read_int()); XPUSHs(cb_args[i++]); /* sub_line_num */
+#ifdef SIZES_TOO
+ if (got_diff) {
+ sv_setiv(cb_args[i], allocated_diff); XPUSHs(cb_args[i++]);
+ }
+#endif
}
+#ifdef SIZES_TOO
+ else {
+ if (got_diff) {
+ SvOK_off(cb_args[i]); XPUSHs(cb_args[i++]); /* block_line_num */
+ SvOK_off(cb_args[i]); XPUSHs(cb_args[i++]); /* sub_line_num */
+ sv_setiv(cb_args[i], allocated_diff); XPUSHs(cb_args[i++]);
+ }
+ }
+#endif
+ assert(i <= C_ARRAY_LENGTH(cb_args));

PUTBACK;
call_sv(cb, G_DISCARD);
@@ -4051,25 +4146,37 @@
}

add_entry(aTHX_ fid_line_time_av, file_num, line_num,
+#ifdef SIZES_TOO
+ allocated_diff, eval_file_num, eval_line_num,
+#else
seconds, eval_file_num, eval_line_num,
+#endif
1-statement_discount
);

- if (c == NYTP_TAG_TIME_BLOCK) {
+ if (is_block) {
unsigned int block_line_num = read_int();
unsigned int sub_line_num = read_int();

if (!fid_block_time_av)
fid_block_time_av = newAV();
add_entry(aTHX_ fid_block_time_av, file_num, block_line_num,
+#ifdef SIZES_TOO
+ allocated_diff, eval_file_num, eval_line_num,
+#else
seconds, eval_file_num, eval_line_num,
+#endif
1-statement_discount
);

if (!fid_sub_time_av)
fid_sub_time_av = newAV();
add_entry(aTHX_ fid_sub_time_av, file_num, sub_line_num,
+#ifdef SIZES_TOO
+ allocated_diff, eval_file_num, eval_line_num,
+#else
seconds, eval_file_num, eval_line_num,
+#endif
1-statement_discount
);

@@ -4078,7 +4185,11 @@
}

total_stmts_measured++;
+#ifdef SIZES_TOO
+ total_stmts_duration += allocated_diff;
+#else
total_stmts_duration += seconds;
+#endif
statement_discount = 0;
last_file_num = file_num;
last_line_num = line_num;
[attachment: malloc_intercept.c]

Tim Bunce

Dec 14, 2009, 3:26:09 PM
to develnyt...@googlegroups.com
On Mon, Dec 14, 2009 at 03:54:20PM +0000, Nicholas Clark wrote:
> So, at work we have some server processes which are gluttons for memory.
> We'd like to work out why, which is tricky. At least, I think that it's tricky.
>
> So, I considered what would be the simplest thing that *might* work.
> Devel::NYTProf can measure the elapsed time at a statement granularity, so
> why not use that framework to also measure memory usage?

You've read my messages in this thread, right?...
http://groups.google.com/group/develnytprof-dev/browse_thread/thread/c711c132216a3cea

With your more detailed knowledge of the internals I'd be grateful for
any details and observations you could add to that thread. (I'm pretty
sure some of my assumptions are flawed; I didn't research it in any great
depth.)

> I've written a malloc() interception library,
> This seems to work "reasonably" well

A good solution should be able to account for 'unused' memory in places
like arenas.

> Sadly OS X doesn't have LD_PRELOAD, so you have to explicitly link with the
> interception library.

On OS X you'd use DYLD_INSERT_LIBRARIES.
http://stackoverflow.com/questions/929893/how-can-i-override-malloc-calloc-free-etc-under-os-x
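
Presumably (untested) the profiling run would then look something like:

    DYLD_INSERT_LIBRARIES=./malloc_intercept.dylib DYLD_FORCE_FLAT_NAMESPACE=1 \
        PERL5OPT=-d:NYTProf perl script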

> I'm not sure where to go from here, or whether it's really something that
> Devel::NYTProf would want to support as an option.

It's where I want NYTProf to go for v4, so I'm delighted you're
blazing a trail in that direction.

> Right now the
> implementation (appended) is definitely a "prototype":
>
> * whilst both memory deltas and times are written to/read from the profile file,
> the report generation is unchanged - right now the memory delta is bodged
> in, in place of the time, so the line-by-line columns in "seconds" are
> actually bytes.
> (and have large integers, which the formatting code doesn't expect,
> and can be negative, which the colouring code doesn't expect)
>
> I'm not sure how to proceed here. 1 or 2 more columns in the HTML reports?
> And hence modification to all the layers between the stream reader and HTML?
>
> * callbacks don't change their tag, but have 1 extra parameter with the memory
> size delta or undef if there isn't a delta present.
> Likewise the block_line_num and sub_line_num are undef if they're not
> present, to keep the memory delta at the same position.
>
> This is completely untested and feels like a bit of a bodge.
>
> * hardcoded prototype for get_total_allocated();
> * no good handling of output format overflows
>
> The appended patch *will* fail tests, because the tests expect (positive)
> seconds, not signed memory deltas. :-)

:)

> I'm still digesting whether it actually produces useful-enough results to be
> good at pointing out where memory is going. I'm confident that the
> interception library is recording exactly what *is* allocated. However, I'm
> not sure if this is really that useful a metric for solving actual problems.

I'd be interested in your post-digested thoughts on this. Any chance you
could post a link to a report? (If your app isn't open source then
perhaps profile some utility that is, like perldoc (small) or perlcritic
(big)).

The addition of your thoughts to the earlier thread would be good, and
help pin down the issues of what's possible and expand the use-cases.
(Perhaps you could, for example, list the kinds of memory allocations
that don't use arenas. Ops spring to mind but I'm sure there are others.)

Is statement level detail important for memory profiling?
Statement level seems like 'too much detail' to me.

I was thinking of adding memory profiling to the subroutine profiler
rather than the statement profiler. So we'd get inclusive and exclusive
memory usage but only as an end-of-run overview. It would also be less
expensive.

I'm also thinking of optionally adding sub call info into the data
stream as the calls/returns happen. That might be relevant here also.

Overall, though, I suspect we'd get best results from walking the arenas
and optrees (at finish_profile time) and dumping detailed information
about what we find. That sounds simple but to be really useful we'd need
to be able to do things like tell which SVs are in pads of which subs in
which packages.
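
(For what it's worth, the walk itself is the same shape as visit() in perl's
sv.c - roughly this untested sketch, which just tallies live SVs by type;
the pad/package attribution is the part that's actually hard:)

    /* Untested sketch, modelled on visit() in sv.c: walk the SV arenas
       and count the live SVs of each type. */
    static void count_arena_svs(pTHX)
    {
        UV counts[SVt_LAST];
        SV *sva;
        Zero(counts, SVt_LAST, UV);
        for (sva = PL_sv_arenaroot; sva; sva = (SV *) SvANY(sva)) {
            const SV *const svend = &sva[SvREFCNT(sva)];
            SV *sv;
            for (sv = sva + 1; sv < svend; ++sv) {
                if (SvTYPE(sv) != (svtype)SVTYPEMASK && SvREFCNT(sv))
                    counts[SvTYPE(sv)]++;
            }
        }
        /* report counts[] in whatever form the profiler output prefers */
    }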

Tim.

p.s. Is integer byte level detail important for memory profiling?
I was thinking of using doubles instead of ints (so we'd avoid
overflow problems).

Nicholas Clark

Dec 15, 2009, 6:34:11 AM
to develnyt...@googlegroups.com
On Mon, Dec 14, 2009 at 08:26:09PM +0000, Tim Bunce wrote:
> On Mon, Dec 14, 2009 at 03:54:20PM +0000, Nicholas Clark wrote:
> > So, at work we have some server processes which are gluttons for memory.
> > We'd like to work out why, which is tricky. At least, I think that it's tricky.
> >
> > So, I considered what would be the simplest thing that *might* work.
> > Devel::NYTProf can measure the elapsed time at a statement granularity, so
> > why not use that framework to also measure memory usage?
>
> You've read my messages in this thread, right?...
> http://groups.google.com/group/develnytprof-dev/browse_thread/thread/c711c132216a3cea
>
> With your more detailed knowledge of the internals I'd be grateful for
> any details and observations you could add to that thread. (I'm pretty
> sure some of my assumptions are flawed; I didn't research it in any great
> depth.)

Um, I think I read it. "It's hard" and "I can't see a way to do it efficiently"
was all I really thought. In particular, measuring elapsed time for a perl
OP is a simple, scalar problem. Measuring total memory usage is analogous.

Anything more than that is not taking and recording a scalar value, so is
in a completely different problem space. I'd prefer to explore the problem
space by working out what *is* measurable in practical ways by experimentation.

(And this is very much work driven, so experimentation is likely to stop or
scale down once we think we've identified the cause of our RAM consumption)

> > I've written a malloc() interception library,
> > This seems to work "reasonably" well
>
> A good solution should be able to account for 'unused' memory in places
> like arenas.

I thought about this. Right now I wasn't too worried about this, as almost
every SV type other than undef and numbers allocates more memory via malloc,
so I assumed that I'd be able to spot problems via that.

The only way I can see to do it efficiently would be
to patch the core to track how much memory is allocated to arenas, and how
many items of each type.

Or, alternatively, recompile perl with -DPURIFY which causes it to use malloc
everywhere rather than arenas. Which seemed a much simpler solution.
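
(For anyone wanting to try that, it should just be a matter of something like
"sh Configure -des -Accflags=-DPURIFY && make" - -Accflags appends to the
compiler flags.)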

> > Sadly OS X doesn't have LD_PRELOAD, so you have to explicitly link with the
> > interception library.
>
> On OS X you'd use DYLD_INSERT_LIBRARIES.
> http://stackoverflow.com/questions/929893/how-can-i-override-malloc-calloc-free-etc-under-os-x

The key part seems to be "export DYLD_FORCE_FLAT_NAMESPACE=1", which isn't
needed on an ld.so based system. Otherwise, I think, you can only override
libraries that you're already actually linked against.

So it's not directly analogous (and why I couldn't work it out when I was
offline with the Mac laptop writing the code)


> > I'm still digesting whether it actually produces useful-enough results to be
> > good at pointing out where memory is going. I'm confident that the
> > interception library is recording exactly what *is* allocated. However, I'm
> > not sure if this is really that useful a metric for solving actual problems.
>
> I'd be interested in your post-digested thoughts on this. Any chance you
> could post a link to a report? (If your app isn't open source then
> perhaps profile some utility that is, like perldoc (small) or perlcritic
> (big)).

The apps are servers used for work, and sadly most definitely not publishable.
I was initially testing by profiling installman. That's definitely open
source, and definitely slow. But I don't think I'll have anything to write
home about for a while yet. In particular, we know they go bloaty in
production, but I don't think that we yet know *why*, or what test data to
feed them to replicate this on a development environment.

> The addition of your thoughts to the earlier thread would be good, and
> help pin down the issues of what's possible and expand the use-cases.
> (Perhaps you could, for example, list the kinds of memory allocations
> that don't use arenas. Ops spring to mind but I'm sure there are others.)
>
> Is statement level detail important for memory profiling?
> Statement level seems like 'too much detail' to me.

I'd have to have some thoughts before I can add them.

Although currently I was envisaging total memory profiling at a statement
level as a relatively fast way to locate potential points of interest,
where more invasive techniques could then be used.

> p.s. Is integer byte level detail important for memory profiling?
> I was thinking of using doubles instead of ints (so we'd avoid
> overflow problems).

I suspect that integers are not important. By the time you get to 2**53
bytes allocated, you're not really going to care about the odd byte or two.
(2**53 bytes is 8 petabytes. I doubt that Perl 5 programs will get to be
that large in the foreseeable future)

Nicholas Clark

Tim Bunce

Dec 15, 2009, 9:21:30 AM
to develnyt...@googlegroups.com
On Tue, Dec 15, 2009 at 11:34:11AM +0000, Nicholas Clark wrote:
> On Mon, Dec 14, 2009 at 08:26:09PM +0000, Tim Bunce wrote:
> > On Mon, Dec 14, 2009 at 03:54:20PM +0000, Nicholas Clark wrote:
> > > So, at work we have some server processes which are gluttons for memory.
> > > We'd like to work out why, which is tricky. At least, I think that it's tricky.
> > >
> > > So, I considered what would be the simplest thing that *might* work.
> > > Devel::NYTProf can measure the elapsed time at a statement granularity, so
> > > why not use that framework to also measure memory usage?
> >
> > You've read my messages in this thread, right?...
> > http://groups.google.com/group/develnytprof-dev/browse_thread/thread/c711c132216a3cea
> >
> > With your more detailed knowledge of the internals I'd be grateful for
> > any details and observations you could add to that thread. (I'm pretty
> > sure some of my assumptions are flawed; I didn't research it in any great
> > depth.)
>
> Um, I think I read it. "It's hard" and "I can't see a way to do it efficiently"
> was all I really thought. In particular, measuring elapsed time for a perl
> OP is a simple, scalar problem. Measuring total memory usage is analogous.
>
> Anything more than that is not taking and recording a scalar value, so is
> in a completely different problem space. I'd prefer to explore the problem
> space by working out what *is* measurable in practical ways by experimentation.

The lessons learnt from your experimentation will certainly be useful.

> (And this is very much work driven, so experimentation is likely to stop or
> scale down once we think we've identified the cause of our RAM consumption)

Understood.

> Or, alternatively, recompile perl with -DPURIFY which causes it to use malloc
> everywhere rather than arenas. Which seemed a much simpler solution.

Yeap.

> > > Sadly OS X doesn't have LD_PRELOAD, so you have to explicitly link with the
> > > interception library.
> >
> > On OS X you'd use DYLD_INSERT_LIBRARIES.
> > http://stackoverflow.com/questions/929893/how-can-i-override-malloc-calloc-free-etc-under-os-x
>
> The key part seems to be "export DYLD_FORCE_FLAT_NAMESPACE=1", which isn't
> needed on an ld.so based system. Otherwise, I think, you can only override
> libraries that you're already actually linked against.
> So it's not directly analogous (and why I couldn't work it out when I was
> offline with the Mac laptop writing the code)

Perhaps not directly analogous, but hopefully close enough to work in this case.
(Though I've not tried it myself yet.)


> > > I'm still digesting whether it actually produces useful-enough results to be
> > > good at pointing out where memory is going. I'm confident that the
> > > interception library is recording exactly what *is* allocated. However, I'm
> > > not sure if this is really that useful a metric for solving actual problems.
> >
> > I'd be interested in your post-digested thoughts on this. Any chance you
> > could post a link to a report? (If your app isn't open source then
> > perhaps profile some utility that is, like perldoc (small) or perlcritic
> > (big)).
>
> The apps are servers used for work, and sadly most definitely not publishable.
> I was initially testing by profiling installman. That's definitely open
> source, and definitely slow. But I don't think I'll have anything to write
> home about for a while yet. In particular, we know they go bloaty in
> production, but I don't think that we yet know *why*, or what test data to
> feed them to replicate this on a development environment.

[A side note (wherein I'm probably saying what you already know):
One of my favourite tricks for identifying bloaty code is to wait till a
process is very big, then get a core file, then 'simply' look for
patterns in the core file. I usually start with
strings core | sort | uniq -c | sort -rn | less
It's amazing how often that reveals the answer, or at least a useful
clue. If it doesn't then I'll quickly skim through the core file using
less, or similar, looking for repeating patterns of similarly sized
gibberish.]

> I'd have to have some thoughts before I can add them.
>
> Although currently I was envisaging total memory profiling at a statement
> level as a relatively fast way to locate potential points of interest,
> where more invasive techniques could then be used.

I look forward to your thoughts when they arrive :)

Tim.