How to profile code in scalding?

Jing Lu

Jun 21, 2018, 6:28:19 PM
to Scalding Development
Hey,

Is there any easy way to profile code with scalding?


Thanks,
Jing

Alex Levenson

Jun 21, 2018, 6:57:15 PM
to Jing Lu, Scalding Development
One very easy thing you can do for anything that runs in hadoop (or the jvm for that matter) is to turn on xrpof.

I'm sure there are better docs for it, but here's a short slide on how to do it in Hadoop:

It'll print profiling info into your mapper / reducer stdout logs, and it has so little impact on performance that you can just leave it on all the time. It only tells you where time is being spent; it won't tell you the full call stack. So it can tell you that you're spending lots of time in method X, but not who called method X.

--
You received this message because you are subscribed to the Google Groups "Scalding Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scalding-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Alex Levenson
@THISWILLWORK

Jing Lu

Jun 21, 2018, 7:04:17 PM
to alexle...@twitter.com, scaldi...@googlegroups.com
Awesome, thanks! 

Alex Levenson

Jun 21, 2018, 7:11:50 PM
to Jing Lu, Scalding Development
I spelled "xprof" wrong in my email. It's a JVM flag, "-Xprof", that you can enable directly, but the slides show how to configure it via the Hadoop configuration. The next few slides show what the output looks like.
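
The slides are not reproduced in this thread, so here is only a minimal sketch of one way the flag could be wired in, not necessarily the setup the slides describe. It assumes the Hadoop 2+ property names mapreduce.map.java.opts and mapreduce.reduce.java.opts, and a JVM old enough to still accept -Xprof (as far as I know the flag was removed in JDK 10). The class XprofOpts and its methods are hypothetical helpers, not part of Hadoop or Scalding. They only build the option strings, appending to any existing value, because setting these properties replaces whatever options were there before (a heap size, for example).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: add -Xprof to the JVM options Hadoop passes to map and reduce tasks. */
final class XprofOpts {
    // Hadoop 2+ property names for task JVM options (assumed to apply to your cluster).
    static final String MAP_OPTS = "mapreduce.map.java.opts";
    static final String REDUCE_OPTS = "mapreduce.reduce.java.opts";

    /** Appends -Xprof to an existing options string unless it is already present. */
    static String withXprof(String existingOpts) {
        String opts = existingOpts == null ? "" : existingOpts.trim();
        for (String token : opts.split("\\s+")) {
            if (token.equals("-Xprof")) {
                return opts; // already enabled, leave the options untouched
            }
        }
        return opts.isEmpty() ? "-Xprof" : opts + " -Xprof";
    }

    /** Returns a copy of the job settings with -Xprof enabled for both task types. */
    static Map<String, String> enableXprof(Map<String, String> jobConf) {
        Map<String, String> out = new LinkedHashMap<>(jobConf);
        out.put(MAP_OPTS, withXprof(jobConf.get(MAP_OPTS)));
        out.put(REDUCE_OPTS, withXprof(jobConf.get(REDUCE_OPTS)));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(MAP_OPTS, "-Xmx2g"); // hypothetical existing heap setting
        // Print the settings in the form of -D properties for a job launch command.
        enableXprof(conf).forEach((k, v) -> System.out.println("-D" + k + "=\"" + v + "\""));
    }
}
```

The printed values could then be passed as -D properties when launching the job, assuming the job is started through Hadoop's generic option parsing.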

Jing Lu

Jun 21, 2018, 7:34:44 PM
to alexle...@twitter.com, scaldi...@googlegroups.com
Thanks! Is there any way to know which higher-level functions are inefficient?

Alex Levenson

Jun 21, 2018, 8:00:46 PM
to Jing Lu, Scalding Development
To do that, I think you need a more invasive profiler, like YourKit. I haven't tried to set that up; maybe someone else on this list has. The problem with those is that they can often degrade or otherwise affect performance enough that what you're measuring is very different from what happens in production (for example, the profiler can interfere with important optimizations that the JVM runtime normally makes). One of the nice things about -Xprof is that it has very little impact on performance and doesn't interfere with JVM optimizations. It's actually implemented by taking advantage of data the JVM already tracks to do its runtime optimizations.

I think if you search around for ways to profile Hadoop, they will also apply to Scalding.

Jing Lu

Jun 21, 2018, 9:58:15 PM
to Alex Levenson, Scalding Development
Thanks for the tip! I will add the -Xprof argument to YARN first. Xprof is already pretty informative.


Best,
Jing

P. Oscar Boykin

Jun 21, 2018, 11:55:16 PM
to Jing Lu, Alex Levenson, Scalding Development
It’s almost always serialization. Registering serializers, using OrderedSerialization, or sending less data are usually the best bets for optimization.
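
To give a rough feel for why per-record serialization overhead matters, here is a small sketch that compares the encoded size of one tiny record under two encodings. It uses only JDK classes as a stand-in and does not use Scalding's own serialization machinery or OrderedSerialization. The Click record and the SerializedSizeDemo class are hypothetical, and the exact overhead in a real job will differ; the point is only that a generic encoding carries class metadata while a purpose-built one writes just the fields.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Sketch: compare encoded sizes of one small record under two encodings. */
final class SerializedSizeDemo {
    /** Hypothetical record type standing in for a value sent through the shuffle. */
    static final class Click implements Serializable {
        private static final long serialVersionUID = 1L;
        final long userId;
        final int itemId;

        Click(long userId, int itemId) {
            this.userId = userId;
            this.itemId = itemId;
        }
    }

    /** Size in bytes with default Java serialization, which also writes class metadata. */
    static int javaSerializedSize(Click c) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    /** Size in bytes with a hand-written fixed-width encoding: 8 bytes + 4 bytes. */
    static int compactSize(Click c) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(c.userId);
            out.writeInt(c.itemId);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        Click c = new Click(42L, 7);
        System.out.println("java serialization: " + javaSerializedSize(c) + " bytes");
        System.out.println("compact encoding:   " + compactSize(c) + " bytes");
    }
}
```

Multiplied across every record in a shuffle, the gap between the two numbers is the kind of cost that a registered or purpose-built serializer, or simply sending fewer fields, is meant to avoid.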

Alex Levenson

Jun 22, 2018, 9:38:27 PM
to P. Oscar Boykin, Jing Lu, Scalding Development
It’s worth measuring whether you’re IO- or CPU-constrained as well. That affects which changes you need to make.
--
Sent from my phone

Jing Lu

Jun 23, 2018, 12:18:09 AM
to Alex Levenson, oscar....@gmail.com, Scalding Development
How do I measure whether a job is IO- or CPU-constrained? With Xprof as well?

Oscar Boykin

Jun 23, 2018, 12:54:21 AM
to Jing Lu, Alex Levenson, P. Oscar Boykin, Scalding Development
You can get an idea from how many records per second you are processing and how much total data you are reading and writing at each phase.

If the job has a low number of records/sec (e.g. 100), you probably have a CPU issue. If you are shuffling a ton of data (say more than you are reading) you may have more of an IO issue.
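
As a rough illustration of that rule of thumb, the sketch below turns it into a tiny triage helper. Everything in it is an assumption for illustration: the class BottleneckHint is hypothetical, the cutoff of 100 records/sec is simply the example figure from above, and checking throughput before shuffle volume is an arbitrary ordering. The inputs would come from the job's record and byte counters for the phase you are looking at.

```java
/** Rough first-pass triage of one job phase from its counters (a heuristic, not a measurement). */
final class BottleneckHint {
    enum Hint { LIKELY_CPU_BOUND, LIKELY_IO_BOUND, UNCLEAR }

    // Assumed cutoff, borrowed from the "e.g. 100" records/sec figure; tune per job.
    static final double LOW_RECORDS_PER_SEC = 100.0;

    static double recordsPerSecond(long records, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return records / seconds;
    }

    static Hint classify(long records, double seconds, long bytesRead, long bytesShuffled) {
        // Low throughput suggests most time goes into computation per record.
        if (recordsPerSecond(records, seconds) <= LOW_RECORDS_PER_SEC) {
            return Hint.LIKELY_CPU_BOUND;
        }
        // Shuffling more data than was read suggests data movement dominates.
        if (bytesShuffled > bytesRead) {
            return Hint.LIKELY_IO_BOUND;
        }
        return Hint.UNCLEAR;
    }

    public static void main(String[] args) {
        // Hypothetical numbers for two phases.
        System.out.println(classify(6_000, 120.0, 1_000_000L, 200_000L));                  // LIKELY_CPU_BOUND
        System.out.println(classify(50_000_000, 600.0, 10_000_000_000L, 25_000_000_000L)); // LIKELY_IO_BOUND
    }
}
```

A result of UNCLEAR just means this simple check found nothing obvious, not that the phase is healthy.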

There are more things to look at, but I probably can't give a very complete MapReduce optimization guide in this email.
