Headers in Equijoin


Raphael R.

Jul 30, 2014, 8:47:18 AM
to rha...@googlegroups.com
Right now rmr2 cannot parse the header row of a CSV file. As already mentioned in another post, a workaround is to pass the col.names parameter of read.table() through make.input.format(). Unfortunately this does not work when the input files have different layouts, because equijoin() takes a single input.format for both sides. As a result, column names cannot be used in the equijoin() function. This is quite annoying, because the code becomes hard to read and hard to maintain.

Example:
joined <- equijoin(left.input=left.csv, right.input=right.csv,
                map.left=function(k,x){keyval(x$V3,x)},
                map.right=function(k,x){keyval(x$V2,x)},
                input.format=my.format, output.format=my.output.format)


Is there any workaround or is there a plan to fix this?


Antonio Piccolboni

Jul 30, 2014, 1:12:29 PM
to RHadoop Google Group
Hi,
there are no plans to fix this. You could open an issue, but that doesn't guarantee it will be picked up, only that it won't be forgotten. While this is a very reasonable suggestion, and while the input format subsystem exists precisely to encourage modularity and the separation of concerns such as format handling and statistics, you have only yourself to blame if the code becomes unreadable and unmaintainable. What you would like to write is, if I understand correctly:

 equijoin(left.input=left.csv, right.input=right.csv,
                map.left=function(k,x){keyval(x$V3,x)},
                map.right=function(k,x){keyval(x$V2,x)},
                left.input.format=my.left.input.format, 
                right.input.format=my.right.input.format,
                output.format=my.output.format)

What is possible instead is:

 equijoin(left.input=left.csv, right.input=right.csv,
                map.left=function(k,x){keyval(x$V3, structure(x, names = names.left))},
                map.right=function(k,x){keyval(x$V2, structure(x, names = names.right))},
                input.format=my.format, output.format=my.output.format)

This is hardly the bane of software engineering. If the parsing logic for the two sides had been more radically different and more complex, you would have had to define helper functions to keep the complexity in check. That is, even if rmr2 fails to encourage good software engineering practices in this case, you can always follow them. Conversely, I've seen and complained about many examples of people conflating parsing and statistics in the map function, side-stepping or under-using the IO format features even with a single input. You can write good code in plain assembly, it's just quite a bit harder. OK, a lot harder.
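
As a rough illustration of the "define functions" route, here is a minimal sketch in plain R. Everything in it is made up for the example: the helper make.map, the column names, and the sample data. The emit argument is a local stand-in for rmr2's keyval(), so the sketch runs without Hadoop.

```r
# Hypothetical helper (not part of rmr2): builds a map function that first
# applies the side-specific column names, then keys records by a named column.
# `emit` is a stand-in for rmr2's keyval(); the default just returns a list.
make.map <- function(col.names, key.col,
                     emit = function(k, v) list(key = k, val = v)) {
  function(k, x) {
    x <- structure(x, names = col.names)  # rename V1, V2, ... per side
    emit(x[[key.col]], x)                 # key by name, not by V-number
  }
}

# Assumed column layouts for the two inputs (illustrative only)
names.left  <- c("order.id", "amount", "customer.id")
names.right <- c("name", "customer.id")

# Sample chunks shaped like what read.table() yields without headers
left.chunk  <- data.frame(V1 = 1:2, V2 = c(9.5, 20), V3 = c("c1", "c2"))
right.chunk <- data.frame(V1 = c("Ann", "Bob"), V2 = c("c1", "c2"))

map.left  <- make.map(names.left,  "customer.id")
map.right <- make.map(names.right, "customer.id")

out.left  <- map.left(NULL, left.chunk)
out.right <- map.right(NULL, right.chunk)
```

With rmr2 loaded, one would presumably pass emit = keyval and hand the two resulting functions to equijoin() as map.left and map.right, keeping the per-side naming out of the join call itself.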


Antonio



