regex "g" flag modifier

Andrea Borruso

Sep 30, 2014, 9:08:37 AM
to openr...@googlegroups.com
Hi all,
Is it possible to use the "g" (global) flag modifier in OpenRefine regex?

I want match to keep going until no more matches can be found.

Thank you

Andrea Borruso

Sep 30, 2014, 9:15:21 AM
to openr...@googlegroups.com
Something like http://regex101.com/r/uO5qO9/1

Thank you

Owen Stephens

Sep 30, 2014, 10:06:36 AM
to openr...@googlegroups.com
I think the quick answer is 'no' - not within GREL.
Jython would give you more options - see http://schoolofdata.org/2013/06/04/analysing-uk-lobbying-data-using-openrefine/ which includes information on using Jython to parse text

Thad Guidry

Sep 30, 2014, 12:12:06 PM
to openrefine
OpenRefine uses Java Regex and does use the Matcher class in the GREL
match() function.

We do this by:
import java.util.regex.Matcher;

as you can see in the sourcecode here:
https://github.com/OpenRefine/OpenRefine/blob/8517111fc4e2c940ff3c935d68d6acce53dd5c84/main/src/com/google/refine/expr/functions/strings/Match.java

There are notes for Perl differences with Java Regex here:
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

Of note there is this:

Perl uses the g flag to request a match that resumes where the last
match left off. This functionality is provided implicitly by the
Matcher class: Repeated invocations of the find method will resume
where the last match left off, unless the matcher is reset.

However, the GREL match() and partition() functions do not currently
support find()
(http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#find()).
We could add it, though, and give users something like this (proposed
syntax only, not implemented):

value.match(/(\d+)/, Repeated)

or

value.match(/(\d+)/, g)

Remember that match() does output an Array currently.
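
To illustrate the "resume where the last match left off" behaviour quoted above, here is a minimal sketch in plain Python rather than Java or GREL. The helper name find_all_resuming is made up for this example; it only mimics what repeated find() calls do and is not how OpenRefine implements anything.

```python
import re

def find_all_resuming(pattern, text):
    """Collect every match by restarting the search where the previous one ended."""
    regex = re.compile(pattern)
    matches = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        matches.append(m.group(0))
        # Resume after this match; step one extra character on an empty match
        # so the loop cannot get stuck at the same position.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return matches

print(find_all_resuming(r"\d+", "a1 b22 c333"))  # ['1', '22', '333']
```

For a pattern like this one, the loop returns the same list as re.findall(), which is the ready-made way to get "global" matching in Python.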

So... can you provide your use case pattern(s) and the kind of Regex
that you are looking for matching ?
Is the issue close to this use case and does the answer help somewhat
https://github.com/OpenRefine/OpenRefine/issues/647 ?

Once we decide on the Enhancement functionality that you're looking
for, you could then sponsor that Enhancement issue and create a
Bounty for it at
https://www.bountysource.com/trackers/32795-openrefine

HOWEVER:
If you just want to get some work done FAST,
then my suggestion is to use Jython as your expression language
in OpenRefine for this, perhaps using re.search() instead:
http://www.jython.org/docs/library/re.html#search-vs-match
But also make sure to read through that whole Jython reference document
to see if you can make it work for you.

The basic wiring for Regex functions with Jython as your expression
language will look something like this:

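import re
# Search anywhere in the cell for the text between an em dash and ", BWV".
m = re.search(ur"\u2014 (.*),\s*BWV", value)
# re.search() returns None when nothing matches, so guard before calling group().
return m.group(1) if m else None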
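
For the original question (getting every match rather than only one), Python's re.findall() is probably the closest thing to the "g" flag. The sketch below wraps the logic in a hypothetical function, all_matches, so it runs on its own; in OpenRefine's Jython editor you would use the function body directly with the cell's value and finish with return. The default pattern and the "|" separator are placeholders, and joining into one string assumes you want a single storable cell value.

```python
import re

def all_matches(value, pattern=r"\d+", sep="|"):
    # re.findall returns every non-overlapping match, left to right,
    # which is the behaviour the Perl-style "g" flag asks for.
    # Note: if the pattern contains capture groups, findall returns the
    # groups instead of the whole match, so this sketch uses none.
    found = re.findall(pattern, value)
    # Join into one string so the result fits in a single cell.
    return sep.join(found)

print(all_matches("order 12, order 345, order 6"))  # 12|345|6
```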

andy

Sep 30, 2014, 5:15:38 PM
to openr...@googlegroups.com
Hi,
and thank you.

On Tue, Sep 30, 2014 at 6:12 PM, Thad Guidry <thadg...@gmail.com> wrote:
Remember that match() does output an Array currently.

I know that it's an array, but its length is always 1, so I think I am making a mistake somewhere.
 
So... can you provide your use case pattern(s) and the kind of Regex
that you are looking for matching ?

I'm attaching a sample project. I apply "length(value.match(/.*(<a href=".*?" id=".*?" class="dldlnk".*?>).*?/))" to the first column and I obtain "1".

But as you can see here (the same source code) I should obtain "2". 

Best regards,

Andrea




gflag.google-refine.tar.gz

Owen Stephens

Sep 30, 2014, 5:40:56 PM
to openr...@googlegroups.com
There are different ways of doing this - one is to split the string into an array, and then use the match statement against each item in the array - something like:

forEach(value.split("<a "),v,v.match(/.*(href=".*?" id=".*?" class="dldlnk").*?>.*?/)[0])

If you apply this to the string
<div> <p>bla bla bla</p><a href="opcgali.php" style="text-decoration:underline;" target="_blank"> Nxcvali</a> </div> </div> <div class="clear"></div> <div style="padding:15px;float:left;width:200px;font-size:14px;">Scarica xcvformati:</div><div style="padding:5px;float:left;width:50px;font-size:15px;"> <a href="http://www.xcv.it/js/server/uploads/xc/zxczx.xls" id="276" class="dldlnk" title="xcv" target="_blank"><img src="img/excel_big.png" width="32" height="32" alt="xls" /></a> </div> <div style="padding:5px;float:left;width:50px;font-size:15px;"> <a href="http://www.zxczx.it/js/server/uploads/zxczx/xzczxc.xls" id="295" class="dldlnk" title="dsfsd" target="_blank"> <p>bla bla bla</p></div>

You get the output:

[ null, null, "href=\"http://www.xcv.it/js/server/uploads/xc/zxczx.xls\" id=\"276\" class=\"dldlnk\"", "href=\"http://www.zxczx.it/js/server/uploads/zxczx/xzczxc.xls\" id=\"295\" class=\"dldlnk\"" ]
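
The same split-then-match idea can be sketched in Python, for anyone who prefers Jython as the expression language. The function name links_per_chunk and the shortened sample string are made up for illustration; also note that Python's re.match() anchors only at the start of the string, whereas GREL's match() must cover the whole string, which makes no difference for this pattern.

```python
import re

# Same pattern as the GREL expression above.
LINK = re.compile(r'.*(href=".*?" id=".*?" class="dldlnk").*?>.*?')

def links_per_chunk(value):
    results = []
    # Splitting on "<a " leaves at most one link per chunk, so a
    # one-group pattern is enough for each piece.
    for chunk in value.split("<a "):
        m = LINK.match(chunk)
        results.append(m.group(1) if m else None)
    return results

sample = ('<p>intro</p><a href="other.php" target="_blank">x</a> '
          '<a href="http://example.org/a.xls" id="276" class="dldlnk" title="a">a</a> '
          '<a href="http://example.org/b.xls" id="295" class="dldlnk" title="b">b</a>')

for item in links_per_chunk(sample):
    print(item)
```

The first two chunks contain no download link, so they come back as None, mirroring the two null entries in the GREL output.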

Hope this helps 

andy

Sep 30, 2014, 5:47:18 PM
to openr...@googlegroups.com
Hi,

On Tue, Sep 30, 2014 at 11:40 PM, Owen Stephens <ow...@ostephens.com> wrote:
Hope this helps 

I'm sure that this will help me.

I have a question for you. In the "match" documentation I read (match(string s, regexp p)):
Attempts to match the string s in its entirety against the regex pattern p and returns an array of capture groups. 

So with my regex I should obtain an array with two items. Am I wrong? If so, why?

Thank you, and please be patient with me :)

Thad Guidry

Sep 30, 2014, 6:10:41 PM
to openrefine
Andy,
Since your use case has some HTML, I thought you might want to know
about this, just in case you are not aware of the built-in HTML GREL
functions...

https://github.com/OpenRefine/OpenRefine/wiki/StrippingHTML#extract-html-attributes-text-links-with-integrated-grel-commands

Thad Guidry

Sep 30, 2014, 6:14:25 PM
to openrefine
>
> I have a question for you. In the "match" documentation I read (match(string
> s, regexp p)):
>>
>> Attempts to match the string s in its entirety against the regex pattern p
>> and returns an array of capture groups.
>
>
> Then I should obtain an array with two items with my regex. Am I wrong? And
> why?

Yes, you will get an indexed array...correct.
I recently added some improved documentation for the match() function
here: https://github.com/OpenRefine/OpenRefine/wiki/GREL-String-Functions#matchstring-s-regexp-p

andy

Sep 30, 2014, 6:14:55 PM
to openr...@googlegroups.com
Hi Thad,
once again thank you!

On Wed, Oct 1, 2014 at 12:10 AM, Thad Guidry <thadg...@gmail.com> wrote:
Since your use case has some HTML, I thought you might want to know
about this, just in case your are not aware of the built-in HTML GREL
functions...

OK, so I should not use match on HTML text :)

Best regards 

andy

Sep 30, 2014, 6:18:48 PM
to openr...@googlegroups.com

On Wed, Oct 1, 2014 at 12:14 AM, Thad Guidry <thadg...@gmail.com> wrote:
Yes, you will get an indexed array...correct.

Then why is the length of mine "1"?

Owen Stephens

Sep 30, 2014, 6:32:58 PM
to openr...@googlegroups.com
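Because the global flag doesn't apply, match() applies your regular expression once to the whole string and returns one array item per capture group. Your current expression contains a single capture group, so the array has length 1 however many links the cell contains.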
You could amend the expression to find more than one group. I haven't checked the following, but something like:

value.match(/.*(<a href=".*?" id=".*?" class="dldlnk".*?>).*(<a href=".*?" id=".*?" class="dldlnk".*?>).*?/)

would capture two groups. If you know that 'value' will only ever contain two relevant groupings then this would do the job. However, you can't do this when you don't know how many times the relevant grouping will be repeated. I don't think 'match' is the right tool for capturing an arbitrary number of repeated groupings.
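
A rough way to see the difference is to try the same patterns with Python's re module (a sketch only: Python's regex engine stands in for Java's here, and the two-link html string is a made-up, shortened sample). The array length follows the number of capture groups in the pattern, not the number of links in the text, while findall() counts actual occurrences.

```python
import re

# Hypothetical, shortened stand-in for the cell content.
html = ('<a href="a.xls" id="1" class="dldlnk" title="a">a</a> '
        '<a href="b.xls" id="2" class="dldlnk" title="b">b</a>')

link = r'<a href=".*?" id=".*?" class="dldlnk".*?>'

# One capture group in the pattern gives one captured item.
one_group = re.match(r'.*(' + link + r').*?', html)
# Two capture groups give two captured items.
two_groups = re.match(r'.*(' + link + r').*(' + link + r').*?', html)

print(len(one_group.groups()))      # 1
print(len(two_groups.groups()))     # 2
# findall needs no fixed number of groups: it counts every occurrence.
print(len(re.findall(link, html)))  # 2
```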

Thad's suggestion of using parseHtml is an excellent one - you may be able to do what you need with something simple like:

value.parseHtml().select("a.dldlnk")
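
For comparison, here is roughly what a selector like "a.dldlnk" does, sketched with Python's standard-library HTML parser. This is not how parseHtml() is implemented; the class DownloadLinkCollector, the function download_links and the sample markup are all invented for this illustration. The point is that a parser finds every matching tag regardless of how many there are.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 / Jython

class DownloadLinkCollector(HTMLParser):
    """Collect the href of every <a> tag whose class list includes 'dldlnk'."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "dldlnk" in classes and attributes.get("href"):
            self.hrefs.append(attributes["href"])

def download_links(html):
    collector = DownloadLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs

sample = ('<div><a href="x.php" target="_blank">x</a>'
          '<a href="http://example.org/a.xls" id="276" class="dldlnk" title="a">'
          '<img src="i.png" alt="xls" /></a></div>')

print(download_links(sample))  # ['http://example.org/a.xls']
```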

Thad Guidry

Sep 30, 2014, 7:11:07 PM
to openrefine
Andy,

For example, you are not iterating over each instance of an <a> link.

Try looking at how this GREL expression works and copy and paste it
and play with it...

-----BEGIN GREL----

forEach(
value.split("<a"),v,v.match(/.*(class="dldlnk").*/)[0]
)

------END GREL----

andy

Oct 2, 2014, 5:38:24 AM
to openr...@googlegroups.com
Just to say thank you very much, guys.