How to skip regex mismatches and make the job continue without failing?


Bots A

Mar 26, 2015, 8:22:03 AM

to cascadi...@googlegroups.com

Hello,

As shown below, I'm parsing my input data with the following regex, but the job fails when the 5th group's field is null. All I want is to skip those records and let the job continue. Can someone help?

// Declare the field names used to parse out of the loan performance file
Fields loanPerfFields = new Fields( "LoanID", "month", "day", "year", "name", "intRate", "upBalance" );

// Define the regular expression used to parse the input file
String loanPerfRegex = "([A-z,0-9].*)\\|(\\d{2})\\/(\\d{2})\\/(\\d{4})\\|([A-z,0-9]+)\\|([0-9\\.]+)\\|([0-9\\.]+).*";

// Declare the groups from the above regex. Each group will be given a field name from 'loanPerfFields'
int[] allGroups = {1, 2, 3, 4, 5, 6, 7};

// Create the parser
RegexParser parser = new RegexParser( loanPerfFields, loanPerfRegex, allGroups );

// Create the main import pipe element, with the input argument named "line"
Pipe processPipe = new Each( "processPipe", new Fields( "line" ), parser, Fields.RESULTS );

// Create unique tuples of the LoanID + month combination
// Pipe uniquePipe = new Unique( processPipe, new Fields( "LoandID","month") );

Also, as you may have noticed in the last line, I'm trying to use the "Unique" pipe to remove duplicate lines that may exist in the input, but I'm still getting errors. Any advice on that would be appreciated too.

Thanks.

Ken Krugler

Mar 26, 2015, 10:32:38 AM

to cascadi...@googlegroups.com


Since this is a text file, I assume you mean that the 5th group is empty, not null.

This seems like a regex issue, not a Cascading issue.

For example, changing the 5th group to ([a-zA-Z0-9,]*) would also match an empty field.

And then you could filter out records where that field is empty.
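A minimal sketch of that suggestion, written against plain java.util.regex rather than Cascading so that it can be run on its own. The class name, the method names, and any sample lines fed to it are hypothetical; only the overall shape of the pattern comes from the question. The 5th group uses * instead of +, and [A-Za-z0-9,] stands in for [A-z,0-9], since the A-z range also admits a few punctuation characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Cascading: shows the "match empty, then filter" idea.
class LoanLineParser {
    // Same layout as the regex in the question, except that group 5 uses '*'
    // so an empty "name" field still matches.
    static final Pattern LOAN_LINE = Pattern.compile(
        "([A-Za-z0-9,].*)\\|(\\d{2})/(\\d{2})/(\\d{4})\\|([A-Za-z0-9,]*)\\|([0-9.]+)\\|([0-9.]+).*");

    // Returns the seven captured fields, or null when the line does not match at all.
    static String[] parse(String line) {
        Matcher m = LOAN_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[7];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    // Keeps only the lines that match and whose 5th field ("name") is non-empty.
    static List<String[]> parseAndFilter(List<String> lines) {
        List<String[]> kept = new ArrayList<>();
        for (String line : lines) {
            String[] fields = parse(line);
            if (fields != null && !fields[4].isEmpty()) {
                kept.add(fields);
            }
        }
        return kept;
    }
}
```

In a Cascading flow the same two steps would presumably be the relaxed regex passed to RegexParser, followed by a filter on the "name" field; the null return above is only a convenience of this standalone sketch.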

As for the Unique pipe issue, it looks like the field name has a typo: I see "LoandID", not "LoanID".
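For reference, here is a rough plain-Java sketch of the de-duplication that the commented-out Unique call appears to be after: keep the first record seen for each LoanID + month pair. It does not use Cascading, and the class name, method name, and record layout (LoanID at index 0, month at index 1) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Cascading.
class LoanDedup {
    // Keeps the first record for each (LoanID, month) key and drops later repeats.
    // Assumes each record holds LoanID at index 0 and month at index 1.
    static List<String[]> uniqueByLoanAndMonth(List<String[]> records) {
        Set<List<String>> seen = new HashSet<>();
        List<String[]> unique = new ArrayList<>();
        for (String[] record : records) {
            List<String> key = Arrays.asList(record[0], record[1]);
            // Set.add returns false when the key was already present.
            if (seen.add(key)) {
                unique.add(record);
            }
        }
        return unique;
    }
}
```

The key is built from the correctly spelled field positions; in the Cascading version the equivalent fix is simply to pass "LoanID" instead of "LoandID" in the Fields given to Unique.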

Regards,

-- Ken


--------------------------

Ken Krugler

+1 530-210-6378

http://www.scaleunlimited.com

custom big data solutions & training

Hadoop, Cascading, Cassandra & Solr
