ArrayIndexOutOfBoundsException on Tuple.get


VegHead

Dec 29, 2008, 6:47:03 PM
to cascading-user
So I'm running into an ArrayIndexOutOfBoundsException when using an
Identity operation with a field selector like [0, 1, -1]. Of course,
it doesn't happen all the time, and I'm not certain what data is
causing the problem.

Caused by: cascading.pipe.OperatorException: [mycascade]
[com.foo.Bar.run(Bar.java:292)] operator Each failed executing
operation
at cascading.pipe.Each$EachHandler.operate(Each.java:413)
at cascading.flow.stack.EachMapperStackElement.operateEach(EachMapperStackElement.java:86)
... 21 more
Caused by: cascading.tuple.TupleException: unable to select from:
[UNKNOWN], using selector: [0, 1, -1]
at cascading.tuple.TupleEntry.selectTuple(TupleEntry.java:372)
at cascading.flow.Scope.getArgumentsEntry(Scope.java:281)
at cascading.pipe.Each$EachHandler.operate(Each.java:400)
... 22 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.get(ArrayList.java:324)
at cascading.tuple.Tuple.get(Tuple.java:352)
at cascading.tuple.Tuple.get(Tuple.java:370)
at cascading.tuple.TupleEntry.selectTuple(TupleEntry.java:368)
... 24 more

I've got some crummy data in a few spots, so I pull the data in
through a RegexSplitter without specifying the Fields (e.g.
Fields.UNKNOWN), pass it through a custom filter that skips any Tuples
that don't have the right number of fields and then through an
Identity operation to extract the fields I actually need (and rename
them in the process):

Pipe pipe = new Pipe("foo");
pipe = new Each(pipe, new Fields("line"), new RegexSplitter("\t"));
pipe = new Each(pipe, new InvalidSplitFilter(10));
pipe = new Each(pipe, new Fields(0, 1, -1), new Identity(new Fields("foo", "bar", "text")));

So, I'm confused on two counts. First, how can I possibly have a Tuple
with the wrong number of fields? Second, where is it coming up with an
ArrayList index of -1? I looked through the TupleEntry, Tuple, and
Fields code ... the only thing I can come up with is that it got
cached erroneously, although I don't know how that could happen.

Is this a by-product of using RegexSplitter and Fields.UNKNOWN? Is
there a better approach to handling crummy data?
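For what it's worth, one plausible way to end up with short tuples even from well-formed lines: Java's pattern-based splits drop trailing empty fields by default, so a row whose last columns are empty yields fewer fields than expected. A minimal plain-Java sketch (no Cascading involved, just illustrating the split behavior; whether RegexSplitter splits this way internally is my guess, not something I've verified):

```java
public class TabSplitDemo
{
  // split a line on tabs the way String.split( "\t" ) does:
  // with the default limit of 0, trailing empty fields are silently dropped
  public static String[] split( String line )
  {
    return line.split( "\t" );
  }

  public static void main( String[] args )
  {
    // a fully populated line yields the expected 3 fields
    System.out.println( split( "a\tb\tc" ).length ); // 3

    // a line whose last field is empty yields only 2
    System.out.println( split( "a\tb\t" ).length ); // 2

    // a negative limit preserves trailing empty fields
    System.out.println( "a\tb\t".split( "\t", -1 ).length ); // 3
  }
}
```

If that is what's happening, lines with empty trailing columns would produce narrower tuples than the ones the selector expects.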

Any advice would be greatly appreciated!

-Sean

Chris K Wensel

Dec 29, 2008, 6:53:10 PM
to cascadi...@googlegroups.com
on first blush, sounds like we have a bug. let me get back to you on
this.

cheers
ckw
--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/

VegHead

Dec 29, 2008, 7:05:05 PM
to cascading-user
On Dec 29, 5:53 pm, Chris K Wensel <ch...@wensel.net> wrote:
> on first blush, sounds like we have a bug. let me get back to you on  
> this.

Heh. :) Anything's possible, I suppose... :)

But, I really wouldn't be surprised in the slightest if it's due to
bad data. Since I'm working with small files (less than 1MB each),
each Map job basically corresponds to a single file - so I can say
confidently that it's currently happening on approximately 118 files
out of 909, but I can't of course tell ~which~ 118 files. :(

I did try another variation of the selector:

pipe = new Each(pipe, new Fields(0, 1, Fields.LAST), new Identity(new Fields("foo", "bar", "text")));

That gives me a variation on the exception that makes even less sense:

Caused by: cascading.tuple.TupleException: unable to select from:
[UNKNOWN], using selector: [0:1, [-1]]
at cascading.tuple.TupleEntry.selectTuple(TupleEntry.java:372)
at cascading.flow.Scope.getArgumentsEntry(Scope.java:281)
at cascading.pipe.Each$EachHandler.operate(Each.java:400)
... 22 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at cascading.tuple.Tuple.get(Tuple.java:352)
at cascading.tuple.Tuple.get(Tuple.java:370)
at cascading.tuple.TupleEntry.selectTuple(TupleEntry.java:368)
... 24 more

Which leads me to another interesting question: is it possible to tell
Cascading/Hadoop to ignore errors? For example, as much as I'd like to
process all 909 files... I honestly don't care if I use all of them.
I'd be perfectly happy processing only 700 of them. :) So, is there a
way to continue even if a few map tasks fail?
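One thing I might try (purely a guess on my part, not something I've confirmed works through Cascading): Hadoop of this vintage has a per-job property, mapred.max.map.failures.percent, that lets a job succeed even if some percentage of its map tasks fail. Assuming the Properties handed to a FlowConnector get forwarded into the JobConf, a sketch might look like:

```java
import java.util.Properties;

public class FailureToleranceSketch
{
  // build a Properties object that asks Hadoop to tolerate map failures;
  // the property name is Hadoop's, but whether Cascading forwards it to
  // the JobConf is an assumption I haven't verified
  public static Properties tolerantProperties( int percentAllowedToFail )
  {
    Properties properties = new Properties();
    properties.setProperty( "mapred.max.map.failures.percent",
      Integer.toString( percentAllowedToFail ) );
    return properties;
  }

  public static void main( String[] args )
  {
    Properties properties = tolerantProperties( 25 );
    System.out.println( properties.getProperty( "mapred.max.map.failures.percent" ) ); // 25
  }
}
```

These would then be passed to the FlowConnector's Properties-based constructor.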

-Sean

Chris K Wensel

Dec 29, 2008, 7:21:40 PM
to cascadi...@googlegroups.com
Let me fix the -1 out of bounds exception (which I can reproduce).
then I'll try to wrap my head around the remaining issues in your email.

should have a fix by tomorrow early.

ckw

VegHead

Dec 29, 2008, 7:27:14 PM
to cascading-user
On Dec 29, 6:21 pm, Chris K Wensel <ch...@wensel.net> wrote:
> Let me fix the -1 out of bounds exception (which I can reproduce).  
> then I'll try to wrap my head around the remaining issues in your email.

Sweet. What was it? (if you don't mind me asking)

As far as my other issues (and I have a lot, I know *grin*), maybe
it's due to a misunderstanding on my part. Just to revisit, I'm using
a RegexSplitter to split apart my data:

pipe = new Each(pipe, new Fields("line"), new RegexSplitter("\t"));

Since I'm not explicitly specifying the names (and number) of the
Fields returned by the RegexSplitter, I'm assuming it will be treated
as Fields.UNKNOWN. I pass that into a filter I wrote that is
~supposed~ to filter out tuples that don't have enough fields. Does
the following snippet look like it's on the right track?

public boolean isRemove(FlowProcess flowProcess, FilterCall<Text> filterCall)
{
  // get the arguments TupleEntry
  TupleEntry arguments = filterCall.getArguments();

  // remove the tuple if it doesn't have the expected number of fields
  return arguments.size() != this.expectedFieldCount;
}

More to the point, does/should TupleEntry.size() work as expected and
return the number of actual fields stored in the Tuple, if the Fields
associated with the TupleEntry is set to Fields.UNKNOWN?

-Sean

Chris K Wensel

Dec 30, 2008, 11:40:51 AM
to cascadi...@googlegroups.com
> On Dec 29, 6:21 pm, Chris K Wensel <ch...@wensel.net> wrote:
>> Let me fix the -1 out of bounds exception (which I can reproduce).
>> then I'll try to wrap my head around the remaining issues in your
>> email.
>
> Sweet. What was it? (if you don't mind me asking)
>

in theory, you can't select field names from Fields.UNKNOWN, since
there aren't any names, and this is somewhat true for positional
selectors.

so strictly speaking, the bug was not returning a valid exception that
gives you more information. but since in this case we know how big the
tuple is at runtime, we can use the positional selectors.

so I updated the code to pass the actual tuple size down the stack
where appropriate. so you can now pick the last element from a tuple
of varying width. but note this position data isn't cached, so it must
be calculated on every operation call. not a big deal, but in the
large, it adds up. so do try to declare all fields whenever possible.
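The runtime resolution described above amounts to something like the following simplified sketch (not Cascading's actual code; the real logic lives inside its Fields and Tuple classes): a negative position is translated against the tuple's actual size on every call, which is why -1 can mean "last" for tuples of varying width.

```java
public class PositionResolver
{
  // translate a possibly-negative positional selector into a concrete
  // index against the tuple's actual runtime size (illustrative only)
  public static int resolve( int pos, int tupleSize )
  {
    int index = pos < 0 ? tupleSize + pos : pos;

    if( index < 0 || index >= tupleSize )
      throw new IndexOutOfBoundsException( "position: " + pos + ", size: " + tupleSize );

    return index;
  }

  public static void main( String[] args )
  {
    // -1 selects the last element regardless of tuple width
    System.out.println( resolve( -1, 3 ) ); // 2
    System.out.println( resolve( -1, 10 ) ); // 9
  }
}
```

Because the result depends on the tuple seen at each call, it can't be computed once up front, which is the caching cost mentioned above.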

> As far as my other issues (and I have a lot, I know *grin*), maybe
> it's due to a misunderstanding on my part. Just to revisit, I'm using
> a RegexSplitter to split apart my data:
>
> pipe = new Each(pipe, new Fields("line"), new RegexSplitter("\t"));
>
> Since I'm not explicitly specifying the names (and number) of the
> Fields returned by the RegexSplitter, I'm assuming it will be treated
> as Fields.UNKNOWN. I pass that into a filter I wrote that is
> ~supposed~ to filter out tuples that don't have enough fields. Does
> the following snippet look like it's on the right track?
>
> public boolean isRemove(FlowProcess flowProcess, FilterCall<Text> filterCall)
> {
>   // get the arguments TupleEntry
>   TupleEntry arguments = filterCall.getArguments();
>
>   // remove the tuple if it doesn't have the expected number of fields
>   return arguments.size() != this.expectedFieldCount;
> }
>
> More to the point, does/should TupleEntry.size() work as expected and
> return the number of actual fields stored in the Tuple, if the Fields
> associated with the TupleEntry is set to Fields.UNKNOWN?


yes, it should work fine. I'll write a test to confirm it's working in
this particular case.

all my fixes are in the git repo in the 'working' branch. i'll move
them to master and svn sometime today.

Chris K Wensel

Dec 30, 2008, 2:34:34 PM
to cascadi...@googlegroups.com
I've pushed this to svn and tested it. let me know how it works for you.

ckw

VegHead

Dec 30, 2008, 3:50:50 PM
to cascading-user
On Dec 30, 1:34 pm, Chris K Wensel <ch...@wensel.net> wrote:
> I've pushed this to svn and tested it. let me know how it works for you.

It looks like those were committed to trunk in r588 and r587?

------------------------------------------------------------------------
r588 | cwensel | 2008-12-30 11:22:52 -0600 (Tue, 30 Dec 2008) | 1 line

added assertion to test to validate tuple size
------------------------------------------------------------------------
r587 | cwensel | 2008-12-30 11:22:41 -0600 (Tue, 30 Dec 2008) | 1 line

Fixed bug where positional selectors failed against Fields.UNKNOWN.
------------------------------------------------------------------------

I'll build it and give it a shot.

Thanks!

-Sean