TextDelimited Redux


JPatrick Davenport

Aug 13, 2012, 2:04:01 PM
to cascadi...@googlegroups.com
Hello,
My question might be in line with the "TextDelimited, strictness and error trap" thread, but I don't want to hijack it.

I have a set of CSV files that all contain the same domain data. When we first go to production there will be 93 columns. In the next release there will be 95 columns. The rules of the file are such that all new columns go at the end of the row. The older 93-column data is still valid, but it doesn't have the extra 2 columns of data.

What I want to do is read in a standard set of fields that are defined statically. The Tap will use a TextDelimited scheme to read the files in. This scheme will be set to strict = false. If the file has 93 columns, the last two fields should be set to null. If the file has 95 columns, the last two columns will be read in.
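
To make the intended behavior concrete, here is a minimal plain-Java sketch of "pad short rows with null". It is not Cascading code and not how TextDelimited is implemented; the class name NonStrictRow and the method parse are hypothetical and exist only for illustration. Rows longer than the declared field count are truncated here purely to keep the sketch short.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class NonStrictRow {
    // Hypothetical helper (not part of Cascading): split a delimited line and
    // fit the result to the declared number of fields.
    static String[] parse(String line, String delimiter, int expectedFields) {
        // A limit of -1 keeps trailing empty values instead of dropping them.
        String[] values = line.split(Pattern.quote(delimiter), -1);
        // Arrays.copyOf pads with null when expectedFields is larger than
        // values.length, and truncates when it is smaller.
        return Arrays.copyOf(values, expectedFields);
    }

    public static void main(String[] args) {
        // A 3-column row read against 5 declared fields.
        String[] row = parse("a,b,c", ",", 5);
        System.out.println(Arrays.toString(row)); // prints [a, b, c, null, null]
    }
}
```

With 93 columns in the file and 95 declared fields, the same idea would leave the last two values null.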

My unit test method where I'm trying to prove this out is below.
    @Test
    @SuppressWarnings("rawtypes")
    public void load93File() throws IOException {

        final String intest = TestHelper.getPathFromCP("/cascading/93columns.dat");
        final Tap _93File = getPlatform().getTap(Helper.getLocalEnhancedClaimLineScheme(), intest, SinkMode.UPDATE);
        final Tap _93Out = getPlatform().getTap(Helper.getLocalEnhancedClaimLineScheme(), "/tmp/93columnsOut.dat", SinkMode.REPLACE);

        class CheckFunction extends BaseOperation<Tuple> implements Function<Tuple> {
            private static final long serialVersionUID = 1L;

            @Override
            public void operate(final FlowProcess arg0, final FunctionCall<Tuple> call) {
                final TupleEntry arguments = call.getArguments();
                assertEquals(95, call.getArgumentFields().size());
                //assertEquals("setEnhancementId", arguments.getString(EnhancedClaimLine.ENHANCEMENT_ID));
                for (int i = 0; i < 95; i++) {
                    System.out.println(arguments.getObject(i));
                }
                call.getOutputCollector().add(arguments);
            }

            @Override
            public Fields getFieldDeclaration() {
                return EnhancedClaimLine.getEnhancedClaimLineFields();
            }
        }

        final Pipe p = new Each("JustChecking", new CheckFunction());
        final FlowDef flowDef = FlowDef.flowDef();
        flowDef.addSource(p, _93File);
        flowDef.addTailSink(new Pipe("out", p), _93Out);
        final FlowConnector flowConnector = getPlatform().getFlowConnector();
        final Flow connect = flowConnector.connect(flowDef);
        connect.complete();
    }

Helper.getLocalEnhancedClaimLineScheme() is:
    @SuppressWarnings("rawtypes")
    public static Scheme getLocalEnhancedClaimLineScheme() {
        return new cascading.scheme.local.TextDelimited(EnhancedClaimLine.getEnhancedClaimLineFields(), true, false, ",", false, "", null, true);
    }

Whenever I run this with data, the output is a row with all null values, like:
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

The point of the unit test is to read in the file and just write it out.

What am I missing?

JPatrick Davenport

Aug 13, 2012, 2:06:30 PM
to cascadi...@googlegroups.com
To add a point of clarification, there is no header involved. The file contains only data, so line one is data. No header should be written out either; the first line of the output should be data.

JPatrick Davenport

Aug 13, 2012, 2:10:24 PM
to cascadi...@googlegroups.com

One more point of information: I get this in the console before the output is attempted.
WARN  c.scheme.util.DelimitedParser - did not parse correct number of values from input data, expected: 95, got: 1:

JPatrick Davenport

Aug 13, 2012, 2:17:39 PM
to cascadi...@googlegroups.com
Another piece of debug information.

If the input file has only one line of data, the output is empty.

If the input file has two lines of data, the output has one line of data: the second line of the input.

JPatrick Davenport

Aug 13, 2012, 2:33:48 PM
to cascadi...@googlegroups.com
Okay, I got it.

If you want to do what I did, watch the constructor arguments on TextDelimited:
fields - the fields the file matches.
skipHeader = false if you want the first row to be read as data.
writeHeader = false if you want the first row written out to be data.
delimiter = your delimiter.
strict = false if you want the scheme to float, that is, to accept rows whose column count differs from the number of fields.
quote = still not really sure what this is.
types - can be null if you are fine with everything being a String internally. This is what I do, since I have 90+ columns that are perfectly fine being strings.
safe = true if you are fine with values being set to null when they can't be coerced to their type.

This might be a rehash of the javadoc, but it helped me to write it out like this.
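
For anyone matching the symptoms to the arguments: the scheme call earlier in the thread passes true as the second argument (skipHeader), and the earlier observations (one input line gives empty output, two input lines give only the second) are what you would expect if the first line is consumed as a header. Below is a minimal plain-Java sketch of that effect. It is not Cascading code; the class SkipHeaderDemo and the method read are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SkipHeaderDemo {
    // Hypothetical reader (not part of Cascading): returns the lines that
    // would be treated as data for a given skipHeader setting.
    static List<String> read(List<String> lines, boolean skipHeader) {
        List<String> data = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (skipHeader && i == 0) {
                continue; // the first line is consumed as a header
            }
            data.add(lines.get(i));
        }
        return data;
    }

    public static void main(String[] args) {
        List<String> one = Arrays.asList("row1");
        List<String> two = Arrays.asList("row1", "row2");
        System.out.println(read(one, true));  // prints []
        System.out.println(read(two, true));  // prints [row2]
        System.out.println(read(two, false)); // prints [row1, row2]
    }
}
```

In terms of the scheme call shown earlier, that would mean passing false for both the second and third arguments (skipHeader and writeHeader) so that the first row is treated as data on the way in and on the way out.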

Bertrand Dechoux

Aug 14, 2012, 3:49:08 AM
to cascading-user
I am kinda 'happy' I was not the only one struggling with it.

I think you got the right answers.
The 'quote' argument is explained in the javadoc. By default there is no
escaping, so your delimiter should not appear in your values.

Bertrand