Line numbers in Hadoop

Philippe Laflamme

May 31, 2012, 4:23:40 PM

to cascading-user

Hi,

We're using cascading for validating files submitted by users. We want
to report errors with line numbers to the users. So if they wrote a
string where an int is expected, we'd like to say "Line 45: field X
should be an int".

I understand that hadoop cannot provide this information since it
splits files in arbitrary chunks. My question is how best to restore
these line numbers. I know we can simply add a column to the submitted
files that would contain that line number, but users will actually be
writing their files directly to HDFS, so I was wondering if we could
do this in our Cascade. We're trying not to modify the
submitted files in any way.

I played around with it and came up with the following solution
(beware, I'm a newb):

* group by the "offset" field, effectively getting one "line" per
group, sorted by their "offset" (right?)
* create a custom Buffer that has a context object containing a long
and make an Every pipe after the GroupBy
* in the operate method of the Buffer, increment the context value and
emit it as a result

Now, I've tested that this works on a single-node cluster, but I'm not
sure it will work in a real distributed environment. For example, could
there be several of these Buffer instances, each with its own Context
(effectively creating several lines with the same number)?

I don't fully understand how things are distributed, so probably
that's what I need to know.

Thanks!
Philippe

Philippe Laflamme

Jun 5, 2012, 9:32:35 AM

to cascading-user

Sorry for bumping this message, I'd just like to get a little help.

I suspect that if I didn't get an answer it's because my strategy
won't work at all. I was just wondering if someone could tell me why
that is or maybe point me to some documentation I can read. I've read
the whole Cascading documentation, but maybe I need to learn more
about Map/Reduce itself to fully understand the issues.

Anyway, any pointers would be welcome!

Thanks,
Philippe

Ken Krugler

Jun 5, 2012, 10:01:20 AM

to cascadi...@googlegroups.com

Hi Philippe,

The simplest (though inefficient) approach I can think of is…

- Have an initial function that generates per-line tuples with three fields - file name, byte offset and text of the line

- Then do a GroupBy(filename field) where you sort by byte offset

- Followed by your parsing code in a Buffer

That should give you everything you need, since you can have a line number in your Buffer that always starts at one and increments in the iterator.
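
The three steps above can be sketched in plain Java, without Cascading, to show the shape of the logic. This is only an illustration under stated assumptions: the class LineNumbering, the Line type, and the number method are hypothetical names, the TreeMap stands in for GroupBy on the filename field, and the in-memory sort stands in for the secondary sort on byte offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LineNumbering {
    /** One input tuple: source file, byte offset of the line, and the line text. */
    static final class Line {
        final String file;
        final long offset;
        final String text;

        Line(String file, long offset, String text) {
            this.file = file;
            this.offset = offset;
            this.text = text;
        }
    }

    /** Returns "file:lineNumber: text" for every input line, in file and offset order. */
    static List<String> number(List<Line> lines) {
        // Stand-in for GroupBy(filename): all lines of one file end up in one group.
        Map<String, List<Line>> groups = new TreeMap<>();
        for (Line line : lines) {
            groups.computeIfAbsent(line.file, k -> new ArrayList<>()).add(line);
        }

        List<String> out = new ArrayList<>();
        for (List<Line> group : groups.values()) {
            // Stand-in for the secondary sort on byte offset.
            group.sort(Comparator.comparingLong((Line l) -> l.offset));
            // The counter is local to the group, so it restarts for every file.
            long lineNumber = 0;
            for (Line line : group) {
                lineNumber++;
                out.add(line.file + ":" + lineNumber + ": " + line.text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Lines arrive in arbitrary order, as they might from different splits.
        List<Line> shuffled = Arrays.asList(
            new Line("a.txt", 13, "third"),
            new Line("b.txt", 0, "only"),
            new Line("a.txt", 0, "first"),
            new Line("a.txt", 6, "second"));
        for (String s : number(shuffled)) {
            System.out.println(s);
        }
    }
}
```

Because the counter lives inside the per-group loop, numbering does not depend on how many groups are processed or where; the cost is that every line of a file has to pass through a single group.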

-- Ken

--------------------------

Ken Krugler

http://www.scaleunlimited.com

custom big data solutions & training

Hadoop, Cascading, Mahout & Solr

Philippe Laflamme

Jun 5, 2012, 10:50:37 AM

to cascadi...@googlegroups.com

Thanks, that's similar to what I did, but instead of grouping on filename and sorting on offset, I grouped on offset. This strategy won't work in a distributed environment, will it?

I just read the GroupBy Javadoc more closely and it says this: "It should be noted for MapReduce systems, distributed group sorting is not 'total'. That is groups are sorted as seen by each Reducer, but they are not sorted across Reducers. See the MapReduce algorithm for details."

So to get a 'total' sort, you need to group on another field (one that puts ALL the values I want sorted into a single group, the filename in your example) and add a secondary sort. I can see how inefficient that is.
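
To make the failure mode concrete, here is a small plain-Java simulation of grouping on offset with a counter kept per reducer. It is a sketch under assumptions, not Cascading or Hadoop code: the class PartitionedCounter and the method numberByOffset are hypothetical names, and assigning an offset to a reducer by hash modulo the reducer count is a simplification chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionedCounter {
    /**
     * Maps each byte offset to the "line number" it would receive when every
     * reducer sorts only its own partition and keeps its own counter.
     */
    static Map<Long, Long> numberByOffset(long[] offsets, int reducers) {
        List<List<Long>> partitions = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            partitions.add(new ArrayList<>());
        }
        // Assumed partitioning rule: hash of the grouping key modulo reducer count.
        for (long offset : offsets) {
            int reducer = Math.floorMod(Long.hashCode(offset), reducers);
            partitions.get(reducer).add(offset);
        }

        Map<Long, Long> assigned = new TreeMap<>();
        for (List<Long> partition : partitions) {
            // Sorted as seen by this reducer only, not across reducers.
            Collections.sort(partition);
            long counter = 0; // one counter per reducer
            for (long offset : partition) {
                assigned.put(offset, ++counter);
            }
        }
        return assigned;
    }

    public static void main(String[] args) {
        long[] offsets = {0, 6, 13, 19};
        System.out.println(numberByOffset(offsets, 1)); // {0=1, 6=2, 13=3, 19=4}
        System.out.println(numberByOffset(offsets, 2)); // {0=1, 6=2, 13=1, 19=2}
    }
}
```

With one reducer the numbers come out right, which matches what the single-node test showed. With two reducers, each one starts counting from 1 over its own share of the offsets, so the same number is handed out more than once.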