How to get the offset of a big file

J. Pablo Redondo

Jan 7, 2014, 3:42:06 AM
Hi, I'm trying to transform a big tall-and-skinny matrix into a block diagonal matrix, but to do so I need to keep the order of the pieces that each mapper takes. For example:

If I have a matrix A

A1
A2
A3
A4

where the Ai are the parts of the matrix that the different mappers take, I want to multiply it by another matrix, but A first has to be transformed into a block diagonal matrix, like this:

A1  0   0   0
0   A2  0   0
0    0  A3  0
0    0   0  A4
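
The layout above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper `block_diag` and the `(block_index, block)` pairs are hypothetical names, not part of RHadoop, and the sketch presumes each block already carries an index that says where it belongs, which is exactly the piece of information the question is about.

```python
def block_diag(blocks):
    """Place each block on the diagonal of a zero matrix, in the order given."""
    total_rows = sum(len(b) for b in blocks)
    total_cols = sum(len(b[0]) for b in blocks)
    out = [[0] * total_cols for _ in range(total_rows)]
    r0 = c0 = 0  # top-left corner of the next block
    for b in blocks:
        for i, row in enumerate(b):
            for j, v in enumerate(row):
                out[r0 + i][c0 + j] = v
        r0 += len(b)
        c0 += len(b[0])
    return out

# Hypothetical mapper output: (block_index, block) pairs in arbitrary order.
mapper_output = [(2, [[5, 6]]), (0, [[1, 2]]), (1, [[3, 4]])]

# Restore the intended order using the index, then assemble the matrix.
ordered = [blk for _, blk in sorted(mapper_output, key=lambda kv: kv[0])]
result = block_diag(ordered)
```

Without the index, the blocks could still be assembled, but their positions on the diagonal would depend on the order in which the mappers happened to finish.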

The data within matrix A is not sorted, so a block can't be identified from the information it contains.
With plain Hadoop I would use the offset of the data, which comes as the key in TextInputFormat. But I think that in RHadoop the key in the mappers comes as null instead of the offset. Am I correct? How would you solve this?

Sorry for my English; I hope I have explained myself properly.


Antonio Piccolboni

Jan 7, 2014, 9:34:35 AM
to RHadoop Google Group
First, TextInputFormat is for natural language, so that's not the correct choice. I would guess that a CSV format is what you have in this case. Second, an offset in a file doesn't mean much in Hadoop, since data sets are in general stored in multiple files, called parts. The offset therefore doesn't provide a total order: multiple records could have the same offset in different files. I would recommend adding a column with the row number, but the exact steps to achieve that are unclear from the information you provide, at least to me.
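
To make the point about offsets concrete, here is a minimal plain-Python sketch. The part names and contents are made up, and it assumes that sorting the part names reproduces the intended row order, which a real data set does not guarantee. It shows that byte offsets repeat across parts, while the pair (part name, offset) is unique and can be turned into a row number.

```python
def line_offsets(text):
    """Yield (byte_offset, line) pairs, mimicking offset-keyed line input."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))

# Hypothetical part files of one data set (names and contents are made up).
parts = {
    "part-00000": "1,2\n3,4\n",
    "part-00001": "5,6\n7,8\n",
}

records = [(name, off, line)
           for name, text in parts.items()
           for off, line in line_offsets(text)]

# The offset alone collides across parts ...
offsets_only = [off for _, off, _ in records]

# ... while (part name, offset) is unique and sorts into global row numbers.
numbered = [(row, line) for row, (_, _, line) in enumerate(sorted(records))]
```

Writing the resulting row number into the data as an extra column, as recommended above, removes the dependence on how the files happen to be split.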


J. Pablo Redondo

Jan 7, 2014, 10:49:02 AM
Hi Antonio. 

Thanks for the quick answer. The main idea was to use the offset because it keeps information about the actual line being processed. I also thought about adding a column, but I was hoping for a solution that doesn't involve changing the stored data by adding new columns.

What I am trying to obtain is a numerically stable way to calculate Q in a QR decomposition. Using the separate Q factors generated by each process, it's possible to calculate a global, stable Q. The process involves creating a block diagonal matrix from all the sub-Q matrices created during the map phase, but so far I have been unable to implement this particular algorithm.
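
For reference, the identity behind this kind of scheme can be written out as follows. This is a sketch under the assumption that the blocks are the four row blocks A_1, ..., A_4 from the first message; the symbols R_i, \tilde{Q} and R are introduced here for illustration and do not appear in the thread.

```latex
\begin{align*}
A &= \begin{bmatrix} A_1 \\ A_2 \\ A_3 \\ A_4 \end{bmatrix},
  \qquad A_i = Q_i R_i \quad (i = 1, \dots, 4)
  && \text{(local QR in each mapper)} \\
\begin{bmatrix} R_1 \\ R_2 \\ R_3 \\ R_4 \end{bmatrix} &= \tilde{Q} R
  && \text{(QR of the stacked } R_i \text{)} \\
A &= \begin{bmatrix} Q_1 & & & \\ & Q_2 & & \\ & & Q_3 & \\ & & & Q_4 \end{bmatrix}
     \begin{bmatrix} R_1 \\ R_2 \\ R_3 \\ R_4 \end{bmatrix}
   = \operatorname{diag}(Q_1, \dots, Q_4) \, \tilde{Q} \, R \\
Q &= \operatorname{diag}(Q_1, \dots, Q_4) \, \tilde{Q}
\end{align*}
```

Because the i-th row block of \tilde{Q} multiplies Q_i, the blocks must be placed on the diagonal in the same order in which the R_i were stacked, which is why the order of the mapper pieces matters here.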

Thanks anyway.