Hi,
I have a question about the design for reading many files/keys from HDFS or HBase. The problem is as follows:
I have many single-line JSON records, each about 150 KB. I want to map over the records, parsing each one to extract a value, and then reduce to get the total sum of that value across all records.
There are two approaches that I can think of:
1. Store the records in HDFS, then use Spark to read all the files and do a map-reduce. In this case many JSON lines would be packed into one file, since each line is only about 150 KB and the HDFS block size is 64 MB. (A sketch of what I have in mind is below.)
2. Store the records in HBase as key-value pairs, where each pair's value is one single-line JSON record, then use Spark to read all the records and do a map-reduce. (Also sketched below.)
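
For concreteness, here is a minimal sketch of what I have in mind for approach 1. The HDFS path, the json4s parsing (json4s ships with Spark), and the numeric field name "value" are all placeholders, not the real details of my data:

import org.apache.spark.{SparkConf, SparkContext}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object HdfsJsonSum {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HdfsJsonSum"))

    // textFile gives one partition per HDFS block, so parsing and summing
    // are spread across all workers.
    val total = sc.textFile("hdfs:///path/to/json-records")   // placeholder path
      .map { line =>
        implicit val formats = DefaultFormats
        // Parse one single-line JSON record and pull out the numeric field.
        (parse(line) \ "value").extract[Double]                // "value" is a placeholder field name
      }
      .reduce(_ + _)                                           // total sum across all records

    println(s"Total = $total")
    sc.stop()
  }
}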
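
And a similar sketch for approach 2, reading HBase through TableInputFormat. The table name "records", column family "cf", qualifier "json", and field "value" are again just placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object HBaseJsonSum {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseJsonSum"))

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "records")     // placeholder table name

    // One Spark partition per HBase region, so the scan is distributed.
    val rows = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    val total = rows
      .map { case (_, result) =>
        implicit val formats = DefaultFormats
        // Each cell value is one single-line JSON record.
        val json = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("json")))
        (parse(json) \ "value").extract[Double]                // placeholder field name
      }
      .reduce(_ + _)

    println(s"Total = $total")
    sc.stop()
  }
}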
My fundamental question is: are these approaches reasonable in terms of speed and space? Can Spark handle both cases well and distribute the work across all workers?
Please let me know if you have any questions or concerns.
Thank you for your help,
Eason