Extracting only the years out of test string

405 views
Skip to first unread message

Cameron Stewart

unread,
May 1, 2017, 10:59:14 PM5/1/17
to OpenRefine
Hi I have a whole heap of junky strings

02budgetsfinanceaccountspayableutilityutilityuploadfinancialyear20122013nov12aglgasaglbill00400006459120121031zip 
02budgetsfinanceaccountspayableutilityutilityuploadfinancialyear20122013aug12addinvoicenumberfor777accounttxt 
19subcontractscontractsapprovalsreactiveworks2013fullfieldcleaningfullfieldcleaningformserverfilesblankgif 
11correspondencetenantfootwaysfootwayapplicationformsdcp2012sydneydcp2012mapscityofsydneyfilesjot

Hoping to extract years out of them...
eg 2012|2013
eg 2012|2013
eg 2012
eg 2012


GREL Using gets me first match.

value.match(/.*?([2][0][01][0-9]).*?/)[0]


Was hoping to use this idea in split as per previous to get all as per this thread https://groups.google.com/forum/#!topic/openrefine/J70OBqs9qSA 

filter(value.split(/\s/),v,isNonBlank(v.match(/[A-Z]{2,}/))).join("|")


Having trouble with the split part as no spaces, could be digit or number.  Any ideas?



Have consulted also



Thad Guidry

unread,
May 1, 2017, 11:05:59 PM5/1/17
to OpenRefine
Years are 4 digits.   \d\d\d\d

They could start with a 1 or 19 or 2 or 20 currently in our millennial.    1\d\d\d    2\d\d\d  19\d\d   20\d\d

experiment :) and you will remember and learn..


--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Thad Guidry

unread,
May 1, 2017, 11:11:23 PM5/1/17
to OpenRefine
don't try to do things in 1 big fat GREL statement.

Create the array with your match statement and store it as a string with GREL.

value.match(blahblah).toString()

Then use the Edit column - split multi value cell to split by comma.

Ettore Rizza

unread,
May 2, 2017, 1:14:02 AM5/2/17
to OpenRefine
value.match() is a tricky function. To extract all the occurrences of a pattern in a string, I personally prefer to use Jython/Python.

For example, here is how to extract all 4-digit series.

import re

myregex
= re.compile(r"\d{4}")

return "|".join(myregex.findall(value))



And here is an excellent website to test regex in Python : https://regex101.com/

Cameron Stewart

unread,
May 2, 2017, 1:48:20 AM5/2/17
to OpenRefine
Thanks for tips - was trying to do it all in one step which was hard....jython method is good to know too!

Owen Stephens

unread,
May 3, 2017, 3:56:00 AM5/3/17
to OpenRefine
As Ettore says matching for all occurrences in a string is hard with GREL - unfortunately 'match' simply doesn't do this. 
Equally unfortunately 'split' doesn't preserve the characters you actually split on - so although you could split the strings where there was a series of 4 digits starting with 20 or 19, the result would be the rest of the string (which you don't want)

I think you can either:

Use repeated 'match' operations to extract potential years one at a time
Use the Jython approach described by Ettore

There are some other approaches I can think of, but they are all more complicated!

Owen

Thad Guidry

unread,
May 3, 2017, 7:38:45 AM5/3/17
to openr...@googlegroups.com
Reminder to all...

There are reasons why we have additional functions inside GREL that compliment each other...for even more flexibility...

You can preserve separator chars using partition() instead of just using split() or smartSplit()

-Thad

Owen Stephens

unread,
May 3, 2017, 9:39:37 AM5/3/17
to OpenRefine
Thanks Thad - I'd completely forgotten the partition function. In this case I think this won't work directly because partition only finds the first occurrence, but it could be used as part of the solution.

Owen
Reply all
Reply to author
Forward
0 new messages