Hi,
I have a question about the design for reading many files/keys from HDFS or HBase. The problem is as follows:
I have many single-line JSON records, each about 150 KB. I want to map over the records, parsing each one to extract a value, and then reduce to get the total sum of that value across all records.
There are two approaches that I can think of:
1. Store the records in HDFS, then use Spark to read all the files and do a map-reduce. In this case many JSON lines would be packed into one file, since each line is only about 150 KB and the HDFS block size is 64 MB. (A sketch of what I have in mind is below.)
2. Store the records in HBase as key-value pairs, where each pair's value is one single-line JSON record, then use Spark to read all the records and do a map-reduce. (Also sketched below.)
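
For concreteness, here is a minimal sketch of what I have in mind for approach 1. The HDFS path, the json4s parsing (json4s ships with Spark), and the numeric field name "value" are all placeholders, not the real details of my data:

import org.apache.spark.{SparkConf, SparkContext}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object HdfsJsonSum {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HdfsJsonSum"))

    // textFile gives one partition per HDFS block, so parsing and summing
    // are spread across all workers.
    val total = sc.textFile("hdfs:///path/to/json-records")   // placeholder path
      .map { line =>
        implicit val formats = DefaultFormats
        // Parse one single-line JSON record and pull out the numeric field.
        (parse(line) \ "value").extract[Double]                // "value" is a placeholder field name
      }
      .reduce(_ + _)                                           // total sum across all records

    println(s"Total = $total")
    sc.stop()
  }
}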
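
And a similar sketch for approach 2, reading HBase through TableInputFormat. The table name "records", column family "cf", qualifier "json", and field "value" are again just placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object HBaseJsonSum {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseJsonSum"))

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "records")     // placeholder table name

    // One Spark partition per HBase region, so the scan is distributed.
    val rows = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    val total = rows
      .map { case (_, result) =>
        implicit val formats = DefaultFormats
        // Each cell value is one single-line JSON record.
        val json = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("json")))
        (parse(json) \ "value").extract[Double]                // placeholder field name
      }
      .reduce(_ + _)

    println(s"Total = $total")
    sc.stop()
  }
}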
My fundamental question is: are these approaches reasonable in terms of speed and space? Can Spark handle both cases well and distribute the work across all workers?
Please let me know if you have any questions or concerns.
Thank you for your help,
Eason