|Some Performance Results on Shark/Impala/Hive||Shane Huang||11/8/12 11:35 PM|
We did a simple performance comparison between Shark, Impala (the real-time query engine recently open-sourced by Cloudera) and Hive on a 4-node in-house cluster. The results look interesting (as shown in the chart below).
We tested 3 input datasets of different sizes. On each dataset we ran 3 pre-defined queries.
The results show Shark works quite well for small jobs (see the 2.5G and 24G results, and the 56G count result). With the input table cached, Shark performs much faster than, or at least on par with, Impala. Even without caching, Shark is better in most cases (except the 2.5G count case). It is quite compelling that in the 56G count test, Shark with caching outperforms the other three by a wide margin. On the other hand, for more complex queries with larger inputs, the behavior of Shark with the table cached becomes somewhat unstable. Specifically, in the 56G groupby test, Shark with caching runs significantly slower than MapReduce Hive, and in the 56G/2.3G join, Shark with caching fails with FileNotFoundExceptions around the shuffle stage (probably a memory issue). It looks like delicate memory tuning is needed when memory is a potential bottleneck in these systems. In any case, I haven't applied the Shark patch that allows disk spilling yet. I'll rerun the tests after applying the patch and see if that helps.
|Re: [shark-users] Some Performance Results on Shark/Impala/Hive||Matei Zaharia||11/8/12 11:41 PM|
Not bad -- thanks for posting this! We've been meaning to set up Impala ourselves. We are also working on a couple of things that will improve performance for large shuffles and add disk spilling by default.
Just to double-check, this is Shark 0.2 with Spark 0.6?
Also, in general, there are some low-hanging optimizations in Shark (e.g. compiling Hive's expressions to Java bytecode) that we'll work on to make it even faster :).
|Re: [shark-users] Some Performance Results on Shark/Impala/Hive||Shane Huang||11/8/12 11:55 PM|
Yes, Shark 0.2 with Spark 0.6 in standalone mode. SPARK_MEM is set to 28G and Spark worker memory to 42G.
I'm still trying to figure out why the join kept failing when the input table is cached.
BTW, Impala also fails to respond when we increase the input dataset to 110G/4.4G. Maybe we can run the tests again once Cloudera releases a stable version of Impala.
|Re: [shark-users] Some Performance Results on Shark/Impala/Hive||Matei Zaharia||11/9/12 12:00 AM|
As a first cut, I'd suggest increasing the number of reduce tasks in the join:
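A minimal sketch of what that setting might look like in a Shark/Hive CLI session (240 is an arbitrary illustrative value, not a recommendation; it should be tuned to the cluster):

```sql
-- Increase the number of reduce tasks before running the join.
-- 240 is only an example value; a common rule of thumb is a small
-- multiple of the total number of cores in the cluster.
set mapred.reduce.tasks=240;
-- then run the join query in the same session
```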
This decreases the amount of "scratch space" needed per task to do its join, which is the main reason why Spark/Shark would run out of memory. The disk spilling patch will likely also help.
To view this discussion on the web visit https://groups.google.com/d/msg/shark-users/-/tYhEBcV3M8AJ.
|Re: [shark-users] Some Performance Results on Shark/Impala/Hive||Reynold Xin||11/9/12 12:12 AM|
Thanks for posting this - it is encouraging! As Matei said, we've already identified some pretty low hanging fruit for performance and will implement them soon.
Out of curiosity, did you set mapred.reduce.tasks when you ran the large groupby and join? By default it is set to 1, so all data gets routed to a single node for the groupby and join, which would be slow.
AMPLab, UC Berkeley
|Re: [shark-users] Some Performance Results on Shark/Impala/Hive||Shane Huang||11/9/12 1:34 AM|
Sure. I actually described my problem in more detail in my other thread, "weird problem of FileNotFoundException".
I've tried setting mapred.reduce.tasks to 240 and 960, and both failed.
Without input caching I used 240 reducers for all tests.
|Re: [shark-users] Some Performance Results on Shark/Impala/Hive||Henry Robinson||11/9/12 11:24 AM|
Hey all -
Great to see the comparison!
Full disclosure: I work on Impala, so I'd like to understand those green bars in particular. Without wanting to hijack the thread, Shane, would you mind posting the actual queries that you ran? In particular, I'm curious about join ordering, so if you could tell us the order in which you joined the smaller and larger tables, that would be very helpful.
|Re: [shark-users] Some Performance Results on Shark/Impala/Hive||Shane Huang||11/11/12 5:24 PM|
I took the groupby and join queries from HiBench/hivebench and simplified them.
You can get the latest HiBench from https://github.com/hibench/HiBench-2.1
As for join ordering, the small table comes first in the join.
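To illustrate the ordering (a sketch only; the table and column names here are hypothetical, not the actual HiBench query):

```sql
-- The smaller table is listed first in the join, the larger second.
-- small_table, large_table, and the column names are placeholder
-- names for illustration.
SELECT s.key, s.val, l.payload
FROM small_table s
JOIN large_table l ON s.key = l.key;
```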
|Re: [shark-users] Some Performance Results on Shark/Impala/Hive||Shane Huang||11/11/12 5:29 PM|
Sorry, a mistake about HiBench. The latest available version, 2.2, can be downloaded from https://github.com/intel-hadoop/HiBench.
|Re: Some Performance Results on Shark/Impala/Hive||Hobin Yoon||5/17/13 11:55 AM|
Very interesting! Do you have any updates with the latest versions?
|Re: Some Performance Results on Shark/Impala/Hive||miket...@googlemail.com||7/15/13 8:27 AM|
Thanks for sharing!
Where can I find more details about the data set and queries?
I would like to run the same queries on ParStream and see the difference our indexing approach makes.