Performance & Programming Comparison of JAQL, Hive, Pig and Java

Rob Stewart

Mar 23, 2010, 10:11:55 AM
to jaql-...@googlegroups.com
Hi folks,

As promised, today I have made available the findings and experimental results from my research project, which examines the high-level languages Pig, Hive and JAQL.

The project extends existing studies by evaluating scale-up, scale-out, and runtime for three benchmark applications. It also examines the ease of programming and the computational power of each language.

I've created two documents:
- Publication: a slide-by-slide presentation, 16 slides (*suitable for most readers*)
- Dissertation results chapter: 18 pages of text

You can find these documents at: http://www.macs.hw.ac.uk/~rs46/publications.html

Excuse the .html link; it lets me record the number of hits the publication receives.

I welcome any feedback, either on this mailing list, or to my University email address for direct correspondence. Any questions regarding the benchmarks should be sent to my University email address.


Thanks for taking an interest,


Rob Stewart

vuk.ercegovac

Mar 23, 2010, 7:27:26 PM
to Jaql Users
Thanks for putting this together.

A couple of things have changed, possibly since you wrote the draft
(all projects are a moving target):

1. Options for the number of reducers (along with any option you want
to put into the conf) are now exposed to the language.

2. We're surprised by the join results: quite poor for both uniform
and skewed data. Do you have the data generators and queries available
for us to have a look at?

Also, it may be useful to separate extensibility, support for
embedding, and language expressiveness. For example, all the languages
surveyed can be extended via UDFs/UDAs. Jaql has some extras for
defining (higher-order) functions (in Jaql itself) and modules.
Yes, we have recursion as a result, but what we're primarily after is
reuse and modularity for scripts, so that we can use the right level of
abstraction to manage complex tasks.

Thanks!

Vuk

Rob Stewart

Mar 23, 2010, 7:58:51 PM
to Jaql Users
Hi Vuk,

Re: Controlling reducers
------------------------
I was going on the information you posted on 19 January
here:
http://groups.google.com/group/jaql-users/browse_thread/thread/7fbc0c3fffe9dbc6/a71006554befd114

I realize that it is a moving target, and I will revise the document
(1.1) to say that this functionality now exists.


Re: Join
--------
Sure, no problem. I've been using the DataGenerator package put
together by the devs over at Pig: http://wiki.apache.org/pig/DataGeneratorHadoop

This creates two files, each with a single column. That column is used
to join the datasets together. Here is the JAQL script:

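// Read each single-column input file as records with one field, "name".
$dir1 = read(del("Inputs/join/file1.dat", { fields: ["name"] } ));
$dir2 = read(del("Inputs/join/file2.dat", { fields: ["name"] } ));

// Equi-join on "name", keep only the join key, and write the result to HDFS.
join $dir1, $dir2 where $dir1.name == $dir2.name
  into {$dir1.name}
  -> write(hdfs('Outputs/join/join_output.jaql'));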
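
For anyone checking the output of the script above on a small sample, here is a minimal Python sketch of what an equi-join on "name" should return, assuming ordinary inner-join semantics (one output row per matching pair of input rows). The function name `equi_join_names` and the sample values are made up for illustration; this is not how Jaql, Pig, or Hive implement the join.

```python
from collections import Counter


def equi_join_names(left, right):
    """Inner equi-join of two single-column inputs, keeping only the key.

    Emits one row per (left row, right row) pair whose names are equal,
    in the order of the left input.
    """
    right_counts = Counter(right)  # how many times each name occurs on the right
    joined = []
    for name in left:
        # A left row matches every right row with the same name
        # (Counter returns 0 for names that never occur on the right).
        joined.extend([name] * right_counts[name])
    return joined


# Hypothetical sample data, not taken from the benchmark inputs.
left = ["alice", "bob", "bob", "carol"]
right = ["bob", "carol", "carol", "dave"]
result = equi_join_names(left, right)
print(result)  # ['bob', 'bob', 'carol', 'carol']
```

Note that a key occurring m times on one side and n times on the other contributes m * n output rows, which is why skewed inputs can produce much larger join outputs than uniform inputs of the same size.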

Kevin Beyer

Apr 5, 2010, 11:28:52 PM
to Jaql Users
Hi, Rob --

What options did you use to generate the data? Can you send me the
exact command-line arguments you used for the generator? Can you also
please share the Pig scripts? Your results differ from those of similar
experiments that we ran, so we are trying to understand the
differences.

Thanks a bunch.

-Kevin

Rob Stewart

Apr 30, 2010, 7:16:30 PM
to Jaql Users
Hi Kevin, and others.

You can find the complete code at the link below. As you may have
realized, the tool I used (developed by the Pig developers) does not
make it easy to generate two files to "join", i.e. two inputs with
some common values in both. So I created a "made to fit"
generating script.
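
To illustrate the idea of two inputs with some common values in both, here is a minimal Python sketch. It is not the makeTestData script from the link below; the function names, key format, and the `skew` parameter are all assumptions made for this example. Both inputs draw from the same key space, either uniformly (`skew=0`) or with Zipf-like weights (`skew>0`).

```python
import random


def make_join_inputs(n_rows, n_keys, skew=0.0, seed=0):
    """Return two lists of join keys drawn from one shared key space.

    skew=0.0 draws keys uniformly; skew>0 weights the key of rank r
    by 1 / r**skew, which gives a Zipf-like (skewed) distribution.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    keys = [f"key{i:06d}" for i in range(n_keys)]
    weights = [1.0 / (rank ** skew) for rank in range(1, n_keys + 1)]
    left = rng.choices(keys, weights=weights, k=n_rows)
    right = rng.choices(keys, weights=weights, k=n_rows)
    return left, right


def write_column(path, values):
    """Write one value per line, i.e. a single-column text file."""
    with open(path, "w") as f:
        for value in values:
            f.write(value + "\n")


# Hypothetical sizes; real benchmark inputs would be far larger.
uniform_left, uniform_right = make_join_inputs(1000, 100, skew=0.0)
skewed_left, skewed_right = make_join_inputs(1000, 100, skew=1.5)
print(len(set(uniform_left) & set(uniform_right)), "keys shared (uniform)")
print(len(set(skewed_left) & set(skewed_right)), "keys shared (skewed)")
```

Calling `write_column("file1.dat", uniform_left)` and `write_column("file2.dat", uniform_right)` would then give two single-column files ready to upload.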
Usage:
1. Download all 5 files.
2. Generate the test data: run makeTestData and upload all files to
HDFS.
3. Once that is complete, benchmark the join applications:
3.a) Run javaJoin
3.b) Run jaqlJoin
4. Step 3 should produce a set of readable files containing the runtime
for each operation.

Hopefully, with a bit of intuition, what I was trying to do will make
sense. Give the scripts a try.

There is a fair chance that these scripts will not run on your first
try, because I've tidied them up somewhat, and my runtime environment
is almost certainly different from yours, e.g. the naming convention of
input files in the HDFS directory structure.

Let me know if you encounter any problems or have any further
questions, and I will do my very best to help you out.

http://www.macs.hw.ac.uk/~rs46/files/publications/MapReduce-Languages/source_code/

NOTE: You will *definitely* need to edit the classpaths in these
files, e.g. for the pig jar, the zipfjar jar, etc. The scripts will
not execute otherwise.

Rob Stewart
