to chenn...@googlegroups.com
Hi,
I'm trying to bulk-import data into HBase tables with MapReduce by creating HFiles directly in the Mapper job, since that is reported to be more efficient than writing through the HTable client.
The examples I found online all import data from CSV/TSV files. In my case, I want to use a third-party REST-based API (XML/JSON) as the InputFormat, and I haven't found any leads on this. If anyone has thought about this problem, please share your suggestions.
to chenn...@googlegroups.com
On Tue, Jan 22, 2013 at 12:16 PM, ArunDhaJ <arun...@gmail.com> wrote:
For my case I wanted to make a 3rd party REST based API's (XML/JSON) as InputFormat
Bulk import is generally meant for importing data exported by the Export utility. That's why it supports only TSV/CSV or, by default, Sequence Files.
If you still want to use MR, you can write a custom InputFormat that reads from an external API, but I am curious: why would you want to do that?
A few points to ponder before you go ahead with an implementation:
If it is a one-time process, why not fetch the data serially once and import it into the HBase table?
Does the API let you split the data (say, by offset) and fetch it in parts?
Does your API service provider allow multiple concurrent connections (equal to the number of Mappers in your job)?
If you are making the actual API requests inside the job, you can use context.write to write the data to the HBase table directly; you don't need to write HFiles and then run a separate import.
If you are really interested, have a look at the implementation of DBInputFormat. It uses the LIMIT and OFFSET parameters in a SELECT query to select the subset of data each task processes. You might use the same approach: instead of issuing a SQL query, you would make a request to your API provider.
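To make the offset idea concrete, here is a rough sketch in plain Java of how the record range could be divided into one (offset, limit) pair per Mapper. It is not the DBInputFormat code and deliberately avoids the Hadoop classes; the class name ApiSplitPlanner, the Range type, and the "offset"/"limit" query parameter names are all hypothetical, and it assumes your API can report the total record count up front. In a real custom InputFormat, getSplits would return something like these ranges wrapped in InputSplit objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: divides a known record count into offset/limit ranges. */
final class ApiSplitPlanner {

    /** One contiguous range of records, playing the role of an input split. */
    static final class Range {
        final long offset;
        final long limit;

        Range(long offset, long limit) {
            this.offset = offset;
            this.limit = limit;
        }
    }

    /**
     * Spreads totalRecords over at most numSplits ranges. The first
     * (totalRecords % numSplits) ranges get one extra record, and
     * empty ranges are never emitted.
     */
    static List<Range> plan(long totalRecords, int numSplits) {
        if (totalRecords < 0 || numSplits < 1) {
            throw new IllegalArgumentException("need totalRecords >= 0 and numSplits >= 1");
        }
        List<Range> ranges = new ArrayList<>();
        long chunk = totalRecords / numSplits;
        long remainder = totalRecords % numSplits;
        long offset = 0;
        for (int i = 0; i < numSplits && offset < totalRecords; i++) {
            long limit = chunk + (i < remainder ? 1 : 0);
            ranges.add(new Range(offset, limit));
            offset += limit;
        }
        return ranges;
    }

    /** Builds the request for one range; the parameter names are assumptions. */
    static String requestUrl(String baseUrl, Range range) {
        return baseUrl + "?offset=" + range.offset + "&limit=" + range.limit;
    }

    public static void main(String[] args) {
        // Example: 10 records spread over 3 mappers.
        for (Range r : plan(10, 3)) {
            System.out.println(requestUrl("http://api.example.com/records", r));
        }
    }
}
```

Each Mapper would then request only its own range, so the number of ranges is also the number of concurrent connections your API provider has to tolerate.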
to chenn...@googlegroups.com, ashwan...@googlemail.com
Hi Ashwanth,
Thanks for your detailed response. You have given me good reason to rethink the approach. I'll work from your suggestions and get back to you.