Is anyone doing anything with SWIM, or active in this group?


D Boyd

Jan 3, 2013, 12:54:41 PM
to swimapredu...@googlegroups.com
Hi:
   I have been playing with SWIM and would love to discuss it and see how others are using it.

One tweak I have made is to move everything from the bash shell script approach to a multi-threaded approach
that reads its parameters from a config file and submits all the jobs from a single hadoop jar instance.
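
In outline, the runner looks something like this (a simplified sketch rather than my actual code; the property keys and the submitJob() helper are illustrative):

    import java.io.FileInputStream;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Replaces the generated shell scripts: one JVM reads every job's
    // parameters from a properties file and submits them all, sleeping
    // until each job's arrival time from the trace.
    public class WorkloadRunner {

        public static void main(String[] args) throws Exception {
            Properties conf = new Properties();
            try (FileInputStream in = new FileInputStream(args[0])) {
                conf.load(in);                       // e.g. workload.properties
            }
            int numJobs = Integer.parseInt(conf.getProperty("workload.numJobs"));
            int threads = Integer.parseInt(conf.getProperty("workload.threads", "20"));
            ExecutorService pool = Executors.newFixedThreadPool(threads);

            final long start = System.currentTimeMillis();
            for (int i = 0; i < numJobs; i++) {
                // Per-job keys (job.<i>.*) mirror the arguments the shell
                // scripts used to pass on the command line.
                final long arrivalMs =
                        Long.parseLong(conf.getProperty("job." + i + ".arrivalMs"));
                final String inputPath = conf.getProperty("job." + i + ".inputPath");
                pool.submit(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            long wait = arrivalMs - (System.currentTimeMillis() - start);
                            if (wait > 0) Thread.sleep(wait); // honor trace arrival time
                            submitJob(inputPath);             // configure + run one job
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }

        private static void submitJob(String inputPath) {
            // placeholder: build a JobConf for the synthetic job and
            // call JobClient.runJob(...), as SWIM's job driver does
        }
    }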

I am specifically interested in how folks have approached analyzing and comparing the output, and in
how they have approached tuning.

Yanpei Chen

Jan 3, 2013, 1:50:28 PM
to swimapredu...@googlegroups.com
Thanks for your note. This group has been pretty quiet. I would also love to hear how other folks are using SWIM!

Here at Cloudera we're doing a lot with SWIM. I wish I could say more, but performance and performance methodology really do give competitive advantages. Become a partner and you'll find out very quickly!

The particular change you're proposing sounds super useful! We've been thinking about something like that for a while but haven't got around to building it. Please don't hesitate to submit patches. I'll look at it in a timely manner.

Re tuning - Your best bet for a proof-of-concept type deployment would be to deploy CDH using Cloudera Manager, because that would give you sensible default configurations based on our field experience, and verified using SWIM and other tools. Large-scale clusters benefit from additional per-workload, per-hardware type tuning, but that's customer-specific enough that our field/support folks handle it.

Re analyzing - In general, it often makes sense to compare per-workload metrics, such as median job duration, 99th percentile job duration, or average/peak task-slot utilization. I assume the script included with SWIM is inadequate for you. Are you allowed to talk about why? If so I hope I can suggest something more specific.
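
For example, computing those metrics from a list of per-job durations pulled out of the job history is only a few lines (a quick sketch, not something that ships with SWIM):

    import java.util.Arrays;

    // Quick sketch: per-workload summary metrics over job durations
    // (milliseconds), for comparing two SWIM replay runs.
    public class WorkloadStats {

        // nearest-rank percentile over an already-sorted array
        static long percentile(long[] sorted, double p) {
            int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
            return sorted[Math.max(0, idx)];
        }

        public static void main(String[] args) {
            long[] durations = {42000, 61500, 58000, 240000, 39000}; // sample data
            Arrays.sort(durations);
            System.out.println("median job duration:   " + percentile(durations, 50) + " ms");
            System.out.println("99th pct job duration: " + percentile(durations, 99) + " ms");
        }
    }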

To give you a flavor of some things we're doing with SWIM, see this (circa 2011) post from one of our partners.




Andrew Ferguson

Jan 3, 2013, 3:50:35 PM
to swimapredu...@googlegroups.com
On Jan 3, 2013, at 12:54 PM, D Boyd wrote:
> Hi:
>    I have been playing with SWIM and would love to discuss it and see how others are using it.

at Brown, we have been using it over the last few months to generate workloads that are more representative of the "real world" than the benchmarks previously used in research papers, which chiefly consisted of wordcount, sort, grep, etc. on, say, Wikipedia or Shakespeare, and had no sensible schedule of job arrival times.

we have two ongoing projects in this area. the first is investigating the suitability of current networks (TCP parameters, switch & QoS configurations, etc.) and evaluating some proposed improvements we have developed. the second is developing improved debug/tracing infrastructure for MapReduce application developers, based on our group's previous X-Trace project. if anyone is interested in further details, feel free to get in touch.

> One tweak I have made is to move everything from the bash shell script approach to a multi-threaded approach
> that reads its parameters from a config file and submits all the jobs from a single hadoop jar instance.

cool. is your code on Github anywhere?

we've recently added support for creating the randomwriter_conf.xsl file automatically, based on the values provided to "java GenerateReplayScript"   (this would replace part of "Step 4" on the "Performance measurement" wiki page). for the impatient, this change is in a repo here: https://github.com/jeffra/SWIM ... as long as Yanpei doesn't object, I am hoping to merge that particular change into the upstream master in the next few days.
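
in outline, that change amounts to something like this (a sketch, not the actual patch; the test.randomwrite.* keys are the properties Hadoop's RandomWriter reads in 1.x-era releases, so double-check them against your version):

    import java.io.FileOutputStream;
    import org.apache.hadoop.conf.Configuration;

    // Sketch: emit randomwriter_conf.xsl from the same sizing values
    // given to GenerateReplayScript, instead of hand-editing it as in
    // "Step 4" of the Performance measurement wiki page.
    public class WriteRandomWriterConf {
        public static void main(String[] args) throws Exception {
            long totalBytes  = Long.parseLong(args[0]);  // total input data to write
            long bytesPerMap = Long.parseLong(args[1]);  // split across map tasks

            Configuration conf = new Configuration(false);  // skip default resources
            conf.setLong("test.randomwrite.total_bytes", totalBytes);
            conf.setLong("test.randomwrite.bytes_per_map", bytesPerMap);
            try (FileOutputStream out = new FileOutputStream("randomwriter_conf.xsl")) {
                conf.writeXml(out);
            }
        }
    }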

we've also added support for jobs to be submitted to different queues (as you might want if you are using the Capacity or Fair Scheduler instead of the FIFO scheduler). at the moment we aren't planning to push this upstream quickly, since it's a somewhat incompatible change, but if others are interested, we can clean it up for faster submission.
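
for reference, the heart of that change is just setting the queue name on each JobConf before submission, roughly like so (illustrative sketch, not the actual diff):

    import org.apache.hadoop.mapred.JobConf;

    // Route a synthetic SWIM job to a named scheduler queue. This sets
    // "mapred.job.queue.name", which the Capacity and Fair Schedulers
    // honor in Hadoop 1.x; the FIFO scheduler ignores it.
    public class QueuedSubmit {
        static JobConf withQueue(JobConf job, String queue) {
            job.setQueueName(queue);   // same as -Dmapred.job.queue.name=<queue>
            return job;
        }
    }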


cheers,
Andrew

David Boyd

Jan 4, 2013, 10:35:43 AM
to swimapredu...@googlegroups.com
Yanpei:
     Data Tactics is in fact a Cloudera partner; a number of our folks work with Joey Echeverria and Erin Hawley, and our CEO meets pretty regularly with Mike Olson. We should have a more detailed discussion via private e-mail.

      I still have some tuning and enhancements to make to how the tool handles job output and histories, and I am getting rid of the separate input path files as well. Once that is complete I will probably push the whole thing up to my own GitHub repo and/or generate a patch file for you. Right now one of my clusters is just finishing up a run based on the 2010 Facebook traces.

     Right now I am looking to accomplish a couple of different things with this and other benchmarks (TPC-H, PigMix, etc.). The first, obviously, is to test and verify my clusters. I currently have two clusters: a 36-node one with 64GB of RAM per node, and a 20-node one with 382GB of RAM per node. (I will soon have a 10-node cluster with dual GPUs on each node as well.) The end goal is to have a turn-key suite of benchmarks/tests we can use for all of our customer deployments.

    My second is to develop some rules and ideas for how different tuning parameters affect different clusters, e.g., does increasing the per-map or per-reduce heap size have the same effect on large-memory machines as on smaller ones? (In my case it benefited the high-memory cluster but hurt the lower-memory one.) I have also looked at the effects of tuning on the performance of Java M/R jobs versus equivalent Pig jobs.
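
For concreteness, the knobs in question are the Hadoop 1.x per-map/per-reduce task JVM options, set roughly like this (a sketch; the -Xmx values are examples, not recommendations):

    import org.apache.hadoop.mapred.JobConf;

    // Sketch: set separate heap sizes for map and reduce task JVMs.
    // These are the Hadoop 1.x property names; both fall back to the
    // older mapred.child.java.opts when unset.
    public class HeapTuning {
        static void apply(JobConf job, String mapXmx, String reduceXmx) {
            job.set("mapred.map.child.java.opts", "-Xmx" + mapXmx);       // e.g. "2g"
            job.set("mapred.reduce.child.java.opts", "-Xmx" + reduceXmx); // e.g. "4g"
        }
    }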

    My final goal is to be able to compare my clusters against other clusters running similar workloads.

    The scripts that ship with SWIM are fine. I guess I was hoping to find out whether anyone had canned analysis spreadsheets or graphs that I could use with my results.
-- 
========= mailto:db...@data-tactics.com ============
David W. Boyd                     
Director, Engineering, Research and Development       
Data Tactics Corporation    
7901 Jones Branch, Suite 240   
Mclean, VA 22102         
office:   +1-703-506-3735, ext 308    
fax:     +1-703-506-6703    
cell:     +1-703-402-7908
============== http://www.data-tactics.com/ ============
 


 