Disco benchmark suite?

Samuel Harrold

Mar 14, 2014, 2:54:48 PM

to disc...@googlegroups.com

Hello,

I'm evaluating Disco for use by researchers here at UT Austin, and I'd like to make sure I'm testing appropriately. I'm conducting tests similar to those in the post "RE: Performance comparison - Disco vs Hadoop" from 1/2012. My project is to compare Disco's performance with that of Hadoop as measured by the HiBench Hadoop benchmark suite.

Within the standard HiBench Hadoop benchmarks, the most important ones for these research applications are WordCount, TeraSort, and K-means Clustering. Could you recommend any analogous benchmark programs for Disco? If there are none, could you recommend any guidelines for adapting the example scripts wordcount.py/count_words.py and kclustering.py?

Thanks for any advice you can offer, and thanks for making such a useful tool!

Sam

--------------------

Samuel Harrold

Intern, PhD student

Texas Advanced Computing Center

University of Texas at Austin

Shayan Pooya

Mar 19, 2014, 4:19:12 PM

to samuel....@gmail.com, disc...@googlegroups.com

Hello Sam,

* There is no counterpart to HiBench for Disco at the moment.

* Using the examples as benchmarks should be straightforward. Just run the example on your favorite dataset.

* I just added some comments to the kclustering example that should clear things up a little bit:
https://github.com/discoproject/disco/blob/develop/examples/datamining/kclustering.py

* We will be adding more examples soon. The Disco integration tests are also worth consulting; they show some of the trickier things that can be done with Disco.

* The current kclustering example uses map-reduce and is not very efficient. This example will be ported to Disco pipelines to show a better way to implement such an algorithm.
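
As a rough illustration of running an example on a dataset and timing it, here is a minimal sketch of a timing harness. It does not call Disco at all: `map_words`, `reduce_counts`, `local_wordcount`, and `time_workload` are hypothetical names made up for this sketch, and `local_wordcount` is only a local stand-in with the same map/reduce shape as a word count. To benchmark Disco itself, the assumption is that you would pass in your own function that submits the job and waits for its results.

```python
import time
from collections import defaultdict


def map_words(line):
    """Emit (word, 1) pairs for one input line, like a word-count map step."""
    for word in line.split():
        yield word, 1


def reduce_counts(pairs):
    """Sum the counts for each word, like a word-count reduce step."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def local_wordcount(lines):
    """Hypothetical stand-in workload; swap in a function that runs a real job."""
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs)


def time_workload(workload, data, repeats=3):
    """Run workload(data) `repeats` times.

    Returns the last result and a list of wall-clock durations in seconds.
    """
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = workload(data)
        timings.append(time.perf_counter() - start)
    return result, timings


if __name__ == "__main__":
    sample = ["to be or not to be", "that is the question"]
    counts, timings = time_workload(local_wordcount, sample)
    print("count of 'to':", counts["to"])
    print("best of %d runs: %.6f s" % (len(timings), min(timings)))
```

Reporting the minimum (or median) of several runs rather than a single run helps reduce noise from caching and other cluster activity, which matters when comparing against numbers from another framework.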

Regards.

Samuel Harrold

Mar 19, 2014, 4:28:51 PM

to Shayan Pooya, disc...@googlegroups.com

Hi Shayan,

Thank you for the pointers. I'll certainly keep them in mind as I compare the test results. Thanks for adding the comments to kclustering. I look forward to more examples of good Disco integration.

Thank you

Parkway

Mar 28, 2014, 4:26:49 AM

to disc...@googlegroups.com, Shayan Pooya

Samuel: Will the Disco vs. Hadoop benchmarks be published when completed? I'm very interested in the performance difference between the Erlang/Python and Java implementations.

Vivian Delplace

Mar 28, 2014, 5:25:56 AM

to disc...@googlegroups.com

Samuel: For my master's thesis, I did similar work comparing Disco to Mars, a MapReduce implementation for GPUs (link), although my point of view was energy consumption. I hope it can help you.

Samuel Harrold

Apr 25, 2014, 8:42:51 PM

to disc...@googlegroups.com

Hi Parkway,

Sorry for missing your message. The results of my tests will be incorporated into a guide for users at the Texas Advanced Computing Center. As the project stands, researchers on our end will probably choose between Disco and other tools based on which they can program in fastest rather than on computational performance. My tests are meant to be quick examples on big data, not authoritative assessments. If/when the guide is in a useful final state, I'll post a link back to this thread.

Samuel Harrold

Apr 25, 2014, 8:46:37 PM

to disc...@googlegroups.com

Hi Vivian,

Thanks for the link!
