// Source: delimited text with a header row, read as a single "line" field
String TargetFileName = "/user/userid/testpartitiontapoutput.snappy.parquet";
Boolean InputFileHasHeaderRowFlag = true;
Scheme SourceScheme = new TextDelimited(new Fields("line"), true, "\n");
// Target: pipe-delimited text carrying the data line plus the loaddate field
Scheme TargetScheme = new TextDelimited(new Fields("line", "loaddate"), false, "|");
Tap Source = new Hfs(SourceScheme, SourceFileName);
Tap Target = new Hfs(TargetScheme, TargetFileName);
// Partition on loaddate with "_" as the path delimiter, so the date prefix
// of each input file name (e.g. 20160101) becomes the loaddate field
Partition SourcePartition = new DelimitedPartition(new Fields("loaddate"), "_");
Source = new PartitionTap((Hfs) Source, SourcePartition);
FlowDef ConversionFlowDef = BuildCSVToParquetConversionFlowDef(Source, Target, "|");
Hadoop2MR1FlowConnector ConversionFlow = new Hadoop2MR1FlowConnector(properties);
ConversionFlow.connect(ConversionFlowDef).complete();
The source location contains multiple files named 20160101_testinput.csv, 20160102_testinput.csv, and 20160103_testinput.csv.
The end goal is to pull the date out of each file name into the data, do some intermediate processing, and then write the output in snappy parquet format.
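To make the intent concrete, here is a minimal plain-Java sketch of what the DelimitedPartition with a "_" delimiter is expected to do with these file names: split on the first "_" and surface the leading date token as the loaddate value. The extractLoadDate helper is hypothetical, purely for illustration; it is not part of Cascading.

```java
public class PartitionPathDemo {
    // Hypothetical helper illustrating the expected behavior of
    // DelimitedPartition(new Fields("loaddate"), "_") on these file names:
    // take everything before the first "_" as the loaddate value.
    public static String extractLoadDate(String fileName) {
        int idx = fileName.indexOf('_');
        return idx >= 0 ? fileName.substring(0, idx) : fileName;
    }

    public static void main(String[] args) {
        // prints 20160101
        System.out.println(extractLoadDate("20160101_testinput.csv"));
    }
}
```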
The process works fine with a single input file and a single output file.
It also works fine with a TargetPartition when I write from a single input file and add the loaddate field to the data manually.
I am using Cascading 3.0.2. Has anyone come across this error during reads?