[GSOC Report] Update on Use zero-copy read path in new Hadoop APIs


sunyu duan

Aug 13, 2014, 12:42:23 AM8/13/14
to twitte...@googlegroups.com

Hi everyone,


As most of the work is now done, I am writing this report to summarize what I have done and the results.

My project is "Use zero-copy read path in new Hadoop APIs". The goal is to exploit the Zero-Copy API introduced by Hadoop to improve the read performance of Parquet tasks running locally. My contribution is to replace the byte-array-based API with a ByteBuffer-based API in the read path, avoiding byte-array copies while staying compatible with the old APIs. Here is the complete pull request: https://github.com/apache/incubator-parquet-mr/pull/6

My work consists of two parts.

  1. Make the whole read path use ByteBuffer directly (a rough sketch of the idea follows this list).
  • Introduce an initFromPage interface in ValuesReader and implement it in each concrete values reader.
  • Introduce a ByteBufferInputStream.
  • Introduce a ByteBufferBytesInput.
  • Replace the unpack8Values method with a ByteBuffer version.
  • Use the introduced ByteBuffer-based methods in the read path.
  2. Introduce a compatibility layer to stay compatible with the old Hadoop API (see the second sketch after this list).
  • Introduce a CompatibilityUtil.
  • Use the CompatibilityUtil to perform the read action.
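
To give a feel for the first part, here is a minimal sketch of the ByteBufferInputStream idea: wrapping a page ByteBuffer (possibly backed by memory that HDFS handed out via the Zero-Copy API) so downstream readers can consume it as a stream without first copying it into a byte[]. This is only the shape of the idea; the actual class in the pull request differs.

    import java.io.InputStream;
    import java.nio.ByteBuffer;

    // Sketch only: an InputStream view over a ByteBuffer, so page bytes can be
    // consumed without being copied into an intermediate byte[] first.
    public class ByteBufferInputStream extends InputStream {
      private final ByteBuffer buffer;

      public ByteBufferInputStream(ByteBuffer buffer) {
        // Duplicate so the caller's position and limit are left untouched.
        this.buffer = buffer.duplicate();
      }

      @Override
      public int read() {
        return buffer.hasRemaining() ? (buffer.get() & 0xFF) : -1;
      }

      @Override
      public int read(byte[] dst, int off, int len) {
        if (!buffer.hasRemaining()) {
          return -1;
        }
        int n = Math.min(len, buffer.remaining());
        buffer.get(dst, off, n);
        return n;
      }

      @Override
      public int available() {
        return buffer.remaining();
      }
    }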
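For the second part, the compatibility layer boils down to detecting at runtime whether the running Hadoop exposes a ByteBuffer-based read on FSDataInputStream, and falling back to the byte[] path otherwise. The sketch below is only an approximation of that idea; the method name getBuf and the exact fallback are illustrative, not the code in the pull request.

    import java.io.IOException;
    import java.nio.ByteBuffer;

    import org.apache.hadoop.fs.FSDataInputStream;

    // Sketch only: detect the ByteBuffer read API at runtime and fall back to the
    // byte[] API on older Hadoop versions. Names here are illustrative.
    public final class CompatibilityUtil {
      private static final boolean HAS_BYTEBUFFER_READ;

      static {
        boolean found;
        try {
          // FSDataInputStream.read(ByteBuffer) is only present on newer Hadoop.
          FSDataInputStream.class.getMethod("read", ByteBuffer.class);
          found = true;
        } catch (NoSuchMethodException e) {
          found = false;
        }
        HAS_BYTEBUFFER_READ = found;
      }

      private CompatibilityUtil() {}

      // Fill 'target' from the stream, avoiding an extra copy when possible.
      public static int getBuf(FSDataInputStream in, ByteBuffer target) throws IOException {
        if (HAS_BYTEBUFFER_READ) {
          return in.read(target);              // new API: reads straight into the buffer
        }
        byte[] tmp = new byte[target.remaining()];
        int n = in.read(tmp, 0, tmp.length);   // old API: byte[] read plus one copy
        if (n > 0) {
          target.put(tmp, 0, n);
        }
        return n;
      }
    }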

 

After coding, I started benchmarking the improvement. After discussing with my mentor, I modified the TestInputOutputFormat test to inherit from ClusterMapReduceTestCase, which starts a MiniCluster for the unit test. In the unit test I enabled HDFS caching and short-circuit reads (a rough sketch of the settings involved follows this paragraph). I created a 500MB and a 1GB log file on my dev box for the test. The test reads in the log file and writes it to a temporary Parquet-format file using MapReduce, then reads from that temporary Parquet file and writes to an output file. I inserted a time counter in the latter MapReduce job and used the time spent on this second job as the indicator. I ran the unit test with and without the Zero-Copy API enabled on the 500MB and 1GB log files and compared the time spent in each case. The results are shown in the table below.
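
For reference, the short-circuit and caching settings involved look roughly like the following. The exact configuration used in the benchmark is in the pull request; the keys below are standard HDFS ones and the values are placeholders from my setup.

    import org.apache.hadoop.conf.Configuration;

    public class BenchmarkConf {
      // Rough sketch of the HDFS settings involved; values are placeholders.
      public static Configuration create() {
        Configuration conf = new Configuration();
        // Let the DFS client bypass the DataNode and read local block files directly.
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
        // Allow DataNodes to lock block data in memory so cached reads can be mmapped.
        conf.setLong("dfs.datanode.max.locked.memory", 64L * 1024 * 1024);
        return conf;
      }
    }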

 

                            File Size    Average Reading Time (s)    Improvement

    Without Zero-Copy API   500 MB       576
    With Zero-Copy API      500 MB       394                          46%
    Without Zero-Copy API   1024 MB      1080
    With Zero-Copy API      1024 MB      781                          38%

 

As we can see, there is roughly a 30-50% improvement in read performance, which shows the project has reached its goal. However, the benchmark is limited: my dev box has very limited resources, and a 1GB file is the largest I can fit. After GSoC, it would be good to invite more people to try it out on a real cluster with larger files, to benchmark its effect in a realistic setting.


Best,

Sunyu

Chris Aniszczyk

Aug 13, 2014, 12:51:03 PM8/13/14
to twitte...@googlegroups.com
Awesome job, great to see these improvements!

Can you send these details to the new Parquet mailing list too (link to your code too)?

You can subscribe by emailing 'dev-su...@parquet.incubator.apache.org'





--
Cheers,

Chris Aniszczyk | Open Source | Twitter, Inc.
@cra | +1 512 961 6719