Bertrand - thank you for your response. I thought the same, and have cleaned the input files, essentially escaping every non-letter/non-numeric character.
I am still seeing the error, but now it seems to have moved into java.util.Arrays.copyOfRange, which suggests the results collection is simply too large for the heap I am allocating. I will certainly try a larger heap (just to see if it works), but I would rather fix the underlying issue. Any recommendations?
So more details:
A simple outerJoin with a custom filter that removes rows matching known-good criteria.
Two CSV files, LHS and RHS, each with 2,986,923 rows and 25 columns. Note: the same flow worked fine on files with 1,322,267 rows and 25 columns.
Heap: 24 GB
File size: 588 MB (each)
Only 1 pipe running
Operations:
CoGroup with outerJoin
Each with new Identity
Each with a custom filter
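For scale, a rough back-of-envelope on those numbers. This is only a sketch: the ~8 chars per field falls out of 588 MB / 2,986,923 rows / 25 columns, and the ~40 bytes of per-String overhead (object header plus backing char[] header) is an assumption for a typical 64-bit JVM, not a measured figure:

```java
public class HeapEstimate {
    public static void main(String[] args) {
        long rows = 2_986_923L; // rows per side
        int cols = 25;
        int sides = 2;          // LHS and RHS both materialized for an outer join

        // Assumed per-String cost on a 64-bit JVM: ~40 bytes of object/array
        // overhead plus 2 bytes per char; ~8 chars/field is the file average.
        long bytesPerField = 40 + 2 * 8;

        long fieldObjects = rows * cols * sides;
        long estimatedBytes = fieldObjects * bytesPerField;

        System.out.printf("field Strings: %,d%n", fieldObjects);
        System.out.printf("estimated heap for field data alone: ~%.1f GB%n",
                estimatedBytes / 1e9);
    }
}
```

That lands around 8-9 GB just for the field Strings, before Tuple arrays, the grouping structures CoGroup needs, and any intermediate copies, so exhausting even a 24 GB heap at this row count is not surprising.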
Also, does this mean I have reached some sort of limit on what Cascading can do in local mode?
Partial stack trace:
java.util.Arrays.copyOfRange(char[], int, int)
java.lang.String.<init>(char[], int, int)
java.lang.String.substring(int, int)
java.lang.String.subSequence(int, int)
java.util.regex.Pattern.split(CharSequence, int)
cascading.scheme.util.DelimitedParser.createSplit(String, Pattern, int)
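For what it's worth, that allocation path is just per-row tokenizing, not a parser bug: Pattern.split produces each field via String.substring, which (on Java 7u6 and later) copies its own char[] range via Arrays.copyOfRange. A minimal stand-alone illustration with a hypothetical 25-field row like the ones in these CSVs:

```java
import java.util.regex.Pattern;

public class SplitDemo {
    public static void main(String[] args) {
        Pattern comma = Pattern.compile(",");

        // Build a hypothetical 25-field row (field0,field1,...,field24).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 25; i++) {
            if (i > 0) sb.append(',');
            sb.append("field").append(i);
        }
        String row = sb.toString();

        // Each call allocates ~25 fresh String objects, each backed by its
        // own char[] copied out of the row (the copyOfRange in the trace).
        String[] fields = comma.split(row);
        System.out.println(fields.length); // prints 25
    }
}
```

So the OutOfMemoryError surfacing inside copyOfRange only tells you where most allocation happens while rows are being parsed and buffered; the real question is how much of the join result has to be held in memory at once.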