Spark 100TB sort record using Netty


Reynold Xin

Nov 7, 2014, 4:19:02 AM
to ne...@googlegroups.com
Hi Netty group,

Just want to keep you guys posted about this. We recently participated in the Sort Benchmark using the latest version of Spark. Netty played a nontrivial part in the performance improvements!

Norman / Trustin: Thanks for fixing the bugs I reported so quickly, and for the feedback on our code! In the Spark project, we are switching the data plane (shuffle and broadcast) to Netty, and possibly the control plane messaging in the future too.

---------- Forwarded message ----------
From: Reynold Xin <rx...@databricks.com>
Date: Wed, Nov 5, 2014 at 3:11 PM
Subject: Re: Breaking the previous large-scale sort record with Spark
To: user <us...@spark.apache.org>, "d...@spark.apache.org" <d...@spark.apache.org>


Hi all,

We are excited to announce that the benchmark entry has been reviewed by the Sort Benchmark committee and Spark has officially won the Daytona GraySort contest in sorting 100TB of data.

Our entry tied with that of a UCSD research team building high-performance systems, and we jointly set a new world record. This is an important milestone for the project, as it validates the amount of engineering work put into Spark by the community.

As Matei said, "For an engine to scale from these multi-hour petabyte batch jobs down to 100-millisecond streaming and interactive queries is quite uncommon, and it's thanks to all of you folks that we are able to make this happen."

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia <matei....@gmail.com> wrote:
Hi folks,

I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which is that we've been able to use Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer nodes. There's a detailed writeup at http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html. Summary: while Hadoop MapReduce held last year's 100 TB world record by sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.

I want to thank Reynold Xin for leading this effort over the past few weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for providing the machines to make this possible. Finally, this result would of course not be possible without the many, many other contributions, testing and feature requests from throughout the community.

For an engine to scale from these multi-hour petabyte batch jobs down to 100-millisecond streaming and interactive queries is quite uncommon, and it's thanks to all of you folks that we are able to make this happen.

Matei



Norman Maurer

Nov 7, 2014, 4:25:49 AM
to ne...@googlegroups.com, Reynold Xin
That is awesome! Thanks a lot for the email. This really made my day :)

-- 
Norman Maurer

이희승 (Trustin Lee)

Nov 19, 2014, 4:57:59 AM
to ne...@googlegroups.com, Reynold Xin
Yeah, I'm very happy to hear this great news, Reynold! Thank you for sharing. :-)
 