How to get the offset in a big file

J. Pablo Redondo

Jan 7, 2014, 3:42:06 AM
to rha...@googlegroups.com
Hi, I'm trying to transform a big tall-and-skinny matrix into a block diagonal matrix, but to do so I need to keep the order of the pieces that each mapper receives. For example:

If I have a matrix

A1
A2
A3
A4

where the Ai are the parts of the matrix that the different mappers take, I want to perform a multiplication with another matrix, but for that the matrix has to be transformed into block diagonal form, like this:

A1   0   0   0
 0  A2   0   0
 0   0  A3   0
 0   0   0  A4

The data within matrix A is not sorted, so a block cannot be identified from the information it contains.
With plain Hadoop I would use the offset of the data, which comes as the key in TextInputFormat. But I think that in RHadoop the key in the mapper comes as null instead of the offset. Am I correct? How would you solve this?

Sorry for my English; I hope I have explained myself properly.

Thanks.
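
To make the layout in the question concrete, here is a minimal sketch in plain Python (not RHadoop) of assembling a block diagonal matrix from pieces that arrive out of order. It assumes each piece carries an explicit block index as its key, which is exactly the information the question says is missing; `block_diag` and `pieces` are hypothetical names used only for illustration.

```python
def block_diag(blocks):
    """Place matrices (lists of rows) along the diagonal of a zero matrix."""
    total_cols = sum(len(b[0]) for b in blocks)
    result = []
    col_offset = 0
    for b in blocks:
        width = len(b[0])
        for row in b:
            # Zeros to the left of the block, the block row, zeros to the right.
            right = total_cols - col_offset - width
            result.append([0] * col_offset + list(row) + [0] * right)
        col_offset += width
    return result


# Hypothetical mapper outputs keyed by block index, arriving out of order.
pieces = {2: [[5, 6]], 0: [[1, 2], [3, 4]], 1: [[7]]}

# Sorting by the key restores the order before the blocks are placed.
ordered = [pieces[k] for k in sorted(pieces)]
D = block_diag(ordered)

for row in D:
    print(row)
```

Without some such key the pieces cannot be ordered, which is why the rest of the thread turns to offsets and row numbers.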

Antonio Piccolboni

Jan 7, 2014, 9:34:35 AM
to RHadoop Google Group
Hi,
TextInputFormat is for natural language, so it's not the right choice here. I would guess that what you have in this case is a csv format. Second, an offset in a file doesn't mean much in Hadoop, since data sets are in general stored in multiple files, called parts. The offset therefore doesn't provide a total order: multiple records in different files could have the same offset. I would recommend adding a column with the row number, but the exact steps to achieve that are unclear from the information you provide, at least to me.

Antonio
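
The point about offsets can be illustrated with a small sketch in plain Python. It shows two part files whose lines have identical byte offsets, then one possible way to add a global row number column as a preprocessing step. This assumes the parts have a known order (for example by file name) and that their line counts are available, neither of which is established in the thread; `byte_offsets` and `add_row_numbers` are hypothetical helpers, not part of Hadoop or rmr2.

```python
from itertools import accumulate


def byte_offsets(lines):
    """Byte offset of each line within its own file."""
    offsets, pos = [], 0
    for line in lines:
        offsets.append(pos)
        pos += len(line.encode("utf-8")) + 1  # +1 for the newline
    return offsets


def add_row_numbers(parts):
    """Prefix each csv line with a global row number (parts in a known order)."""
    counts = [len(p) for p in parts]
    # Starting row number of each part: running total of the preceding counts.
    starts = [0] + list(accumulate(counts))[:-1]
    return [
        [f"{start + i},{line}" for i, line in enumerate(p)]
        for start, p in zip(starts, parts)
    ]


# Two hypothetical part files of the same data set.
parts = [["1.0,2.0", "3.0,4.0"], ["5.0,6.0", "7.0,8.0"]]

offsets = [byte_offsets(p) for p in parts]
print(offsets)   # the same offsets appear in both parts

numbered = add_row_numbers(parts)
print(numbered)  # row numbers are unique across parts
```

The row number then travels with each record, so any mapper can tell which block a row belongs to regardless of how the input is split.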


J. Pablo Redondo

Jan 7, 2014, 10:49:02 AM
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Antonio. 

Thanks for the quick answer. The main idea was to use the offset because it keeps information about the actual line being processed. I also thought of adding a column, but I was hoping for a solution that doesn't involve changing the stored data by adding new columns.

What I am trying to obtain is a numerically stable way to calculate Q in a QR decomposition. Using the separate Q factors generated by each process, it's possible to calculate a global, stable Q. The process involves creating a block diagonal matrix from all the sub-Q matrices created during the map phase, but so far I have been unable to implement this particular algorithm.

Thanks anyway.
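
The computation described in this last message can be sketched on a single machine in plain Python: factor each block as A_i = Q_i R_i, stack the R_i in block order, factor the stack as Q2 R, and form Q = blockdiag(Q_1, ..., Q_p) Q2. This is only an illustration under assumptions. Blocks are held in a list whose position stands in for the block index, every block has at least as many rows as columns and full column rank, and a small modified Gram-Schmidt routine (`qr_mgs`, hypothetical) stands in for a library QR. It is not an RHadoop implementation.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def qr_mgs(a):
    """Thin QR by modified Gram-Schmidt; assumes full column rank."""
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    q_cols, r = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            # Remove the component of v along the k-th orthonormal column.
            r[k][j] = sum(q_cols[k][i] * v[i] for i in range(m))
            v = [v[i] - r[k][j] * q_cols[k][i] for i in range(m)]
        r[j][j] = sum(x * x for x in v) ** 0.5
        q_cols.append([x / r[j][j] for x in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r


def block_diag(blocks):
    """Place matrices along the diagonal of a zero matrix."""
    total = sum(len(b[0]) for b in blocks)
    out, off = [], 0
    for b in blocks:
        w = len(b[0])
        out += [[0.0] * off + list(row) + [0.0] * (total - off - w) for row in b]
        off += w
    return out


def tsqr(blocks):
    """QR of the vertically stacked blocks via per-block factorizations."""
    local = [qr_mgs(b) for b in blocks]                # "map": A_i = Q_i R_i
    stacked_r = [row for _, r in local for row in r]   # R_1..R_p in block order
    q2, r = qr_mgs(stacked_r)                          # "reduce": stacked R = Q2 R
    q = matmul(block_diag([q_i for q_i, _ in local]), q2)
    return q, r


blocks = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0]],
]
q, r = tsqr(blocks)
```

The sketch shows why the block order matters: the Q_i on the diagonal must appear in the same order as the R_i in the stack, so each block needs an identifier that is used consistently in both places.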