How to skip regex mismatches and make the job continue without failing?


Bots A

Mar 26, 2015, 8:22:03 AM
to cascadi...@googlegroups.com
Hello,

As shown below, I'm parsing my input data with the following regex, but it fails when there are null fields for the 5th group. All I want is to skip those records and let the job continue. Can someone help?

    // Declare the field names used to parse out of the loan performance file
    Fields loanPerfFields = new Fields( "LoanID", "month", "day", "year", "name", "intRate", "upBalance" );

    // Define the regular expression used to parse the input file
    String loanPerfRegex = "([A-z,0-9].*)\\|(\\d{2})\\/(\\d{2})\\/(\\d{4})\\|([A-z,0-9]+)\\|([0-9\\.]+)\\|([0-9\\.]+).*";

    // Declare the groups from the above regex. Each group will be given a field name from 'loanPerfFields'
    int[] allGroups = {1, 2, 3, 4, 5, 6, 7};

    // Create the parser
    RegexParser parser = new RegexParser( loanPerfFields, loanPerfRegex, allGroups );

    // Create the main import pipe element, with the input argument named "line"
    Pipe processPipe = new Each( "processPipe", new Fields( "line" ), parser, Fields.RESULTS );

    // Creating unique tuples of LoanID + Month combination
    // Pipe uniquePipe = new Unique( processPipe, new Fields( "LoandID","month") );


Also, as you may have noticed in the last line, I'm trying to use the "Unique" pipe to remove duplicate lines that may exist in the input. I'm still getting errors when I enable it. Any advice on that would be appreciated too.

Thanks.
A.

Ken Krugler

Mar 26, 2015, 10:32:38 AM
to cascadi...@googlegroups.com


Bots A wrote:

> As shown below, I'm parsing my input data with the following regex. But it fails when there are null fields for the 5th group. All I want is to skip those records and continue the job. Can someone help?

Since this is a text file, I assume you mean that the 5th group is empty, not null.

This seems like a regex issue, not a Cascading issue.

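For example, writing the 5th group as ([a-zA-Z0-9,]*) instead of ([A-z,0-9]+) would match an empty field. (The range A-z also covers the ASCII punctuation characters that sit between Z and a, so a-zA-Z is the safer way to say "letters".)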

And then you could filter out records where that field is empty.
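
Here is a minimal sketch of that idea in plain Java, using only java.util.regex so it can be run without Cascading. The class name LoanLineFilter, the method parseOrSkip, and the sample lines are made up for illustration; the pattern is the one from the question with the 5th group changed from + to * and the letter ranges written as a-zA-Z. It is meant to show the regex behaviour, not how the filtering has to be wired into a Cascading flow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Cascading.
class LoanLineFilter {
    // Pattern from the question, with group 5 relaxed to '*' so that an
    // empty name still matches, and letter ranges spelled out as a-zA-Z.
    static final Pattern LOAN_PERF = Pattern.compile(
        "([a-zA-Z0-9,].*)\\|(\\d{2})/(\\d{2})/(\\d{4})\\|([a-zA-Z0-9,]*)\\|([0-9.]+)\\|([0-9.]+).*");

    // Returns the seven captured groups, or null when the line should be
    // skipped: either it does not match at all, or the name (group 5) is empty.
    static String[] parseOrSkip(String line) {
        Matcher m = LOAN_PERF.matcher(line);
        if (!m.matches()) {
            return null;
        }
        if (m.group(5).isEmpty()) {
            return null;
        }
        String[] fields = new String[7];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample input: one good line, one with an empty name, one junk line.
        List<String> lines = Arrays.asList(
            "L001|03/26/2015|Smith|4.25|1000.00",
            "L002|03/26/2015||4.25|1000.00",
            "not a loan record");
        List<String[]> kept = new ArrayList<>();
        for (String line : lines) {
            String[] fields = parseOrSkip(line);
            if (fields != null) {
                kept.add(fields);
            }
        }
        System.out.println(kept.size() + " of " + lines.size() + " lines kept");
    }
}
```

With the relaxed pattern the empty-name line matches and is then dropped explicitly, rather than failing to match in the first place.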

As for the Unique pipe issue, it looks like the field name has a typo: I see "LoandID", not "LoanID".
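
To make the intended result of de-duplicating on LoanID plus month concrete, here is a small plain-Java sketch that does not use Cascading at all. The class LoanDedup and its method are hypothetical names, and it assumes each record is a String[] in the field order from the question (LoanID at index 0, month at index 1). It keeps the first record seen for each key pair, which is one reasonable reading of "unique" here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Cascading.
class LoanDedup {
    // Keeps the first record seen for each (LoanID, month) pair and
    // preserves the order in which those first records appeared.
    static List<String[]> uniqueByLoanAndMonth(List<String[]> records) {
        Map<List<String>, String[]> firstSeen = new LinkedHashMap<>();
        for (String[] record : records) {
            // A two-element list works as a composite key because List
            // defines equals() and hashCode() over its elements.
            List<String> key = Arrays.asList(record[0], record[1]);
            firstSeen.putIfAbsent(key, record);
        }
        return new ArrayList<>(firstSeen.values());
    }
}
```

Note that both key names have to match the declared field names exactly, which is why the "LoandID" spelling matters.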

Regards,

-- Ken




--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
