Input to rmr2 - Questions


Panagiotis Tzirakis

Mar 2, 2014, 1:29:21 AM3/2/14
to rha...@googlegroups.com
Hello,

I would like to ask a few questions related to the input side of rmr2. My questions are:

1. In the tutorial ( https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md ) it is mentioned that "It is not possible to write out big data with to.dfs, not in a scalable way." What does this mean? If I want to write, for example, 4 TB of data, is it simply inefficient?
2. The default when to.dfs is used is a null key, with the value containing the data. This way only one map function is called. If I have a large amount of data (e.g. 4 TB), what will happen? Will only one map function be called?
3. If I want to change the default key-value pair and define my own key while keeping the same value, how can I do this efficiently? For example, if I have a dataset and the key identifies different columns of the dataset while the value is the whole dataset, how can I do it efficiently?
4. If I change the input format for mapreduce, will the whole dataset be processed? E.g. if I use make.input.format.

These are some questions that have been bothering me for some time now.
If someone knows, please answer.

Thank you in advance


Antonio Piccolboni

Mar 2, 2014, 1:55:21 AM3/2/14
to RHadoop Google Group
On Sat, Mar 1, 2014 at 10:29 PM, Panagiotis Tzirakis <tzir...@gmail.com> wrote:
Hello,

I would like to ask a few questions related to the input side of rmr2. My questions are:

1. In the tutorial ( https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md ) it is mentioned that "It is not possible to write out big data with to.dfs, not in a scalable way." What does this mean? If I want to write, for example, 4 TB of data, is it simply inefficient?

It's not even a matter of efficiency. The data would need to be in memory first. At current RAM prices, that would set you back some $65K, assuming you could even mount all those modules in one system. If you have 4 TB of data, your only option is mapreduce (the call).
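To illustrate the distinction (a sketch, not from the original thread; the HDFS path and CSV layout are made-up assumptions):

```r
library(rmr2)

# Data of this size should already live on HDFS (loaded with hdfs dfs -put,
# Sqoop, etc.). mapreduce() then streams it through the map function one
# split at a time, never pulling the whole dataset into R's memory, which
# is what to.dfs would have to do.
result <- mapreduce(
  input        = "/data/big_input",                 # hypothetical HDFS path
  input.format = make.input.format("csv", sep = ","),
  map          = function(k, v) keyval(1, nrow(v))  # e.g. count rows per split
)
```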
 
2. The default when to.dfs is used is a null key, with the value containing the data. This way only one map function is called.

I don't believe there is a relation between the two, but if you want to change the key, be my guest. It's only a default.
 
If I have a large amount of data (e.g. 4 TB), what will happen? Will only one map function be called?

No, the machine would have crashed long before you could even type to.dfs. You will never be in a position to test this.
 
3. If I want to change the default key-value pair and define my own key while keeping the same value, how can I do this efficiently?

You can't change the default, but you can provide the first argument, which is indeed the key, as in

to.dfs(keyval(key, value))

 
For example, if I have a dataset and the key identifies different columns of the dataset while the value is the whole dataset, how can I do it efficiently?

You can't, because you don't have enough RAM. You are completely misunderstanding the role of to.dfs: it ships data from memory to the distributed file system, and memory is limited to tens of GBs in most systems. 10 GB is not big data, unless you ask some marketer with a fervent imagination.
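For data that does fit in memory, the keyed-write pattern looks like this (a small sketch using the built-in `mtcars` data frame as a stand-in dataset; the output path is hypothetical):

```r
library(rmr2)

# keyval() matches a key vector against the rows of a data-frame value,
# yielding one key-value pair per row; to.dfs() then writes the pairs to
# the distributed file system. This only works when the whole data frame
# fits in local memory, which is the limitation discussed above.
to.dfs(keyval(mtcars$cyl, mtcars), output = "/tmp/mtcars.kv")
```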

4. If I change the input format for mapreduce, will the whole dataset be processed?

That depends on the input format. The input format function could stop reading before the first record, but most reasonable input formats will try to read all the way to the end, unless there is an error, of course.
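For instance, using one of the built-in formats (a sketch; the HDFS path and the tab-separated layout are assumptions for the example):

```r
library(rmr2)

# make.input.format("csv", ...) wraps read.table-style options. mapreduce()
# feeds every record the format function returns through the map function,
# split by split, until the input is exhausted (or the format errors out).
fmt <- make.input.format("csv", sep = "\t")
out <- mapreduce(
  input        = "/data/input.tsv",   # hypothetical HDFS path
  input.format = fmt,
  map          = function(k, v) keyval(nrow(v), 1)
)
```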
 
E.g. if I use make.input.format.

These are some questions that have been bothering me for some time now.
If someone knows, please answer.

I hope this helps.

Antonio

 

Thank you in advance


