Extracting a Phrase based on Keywords

804 views
Skip to first unread message

Christina Miskey

unread,
Jul 18, 2017, 12:12:08 PM7/18/17
to OpenRefine
Hello everyone, 

I'm new to OpenRefine but attempting to use it for a large data project I am working on. I have a bunch of text that I need to extract only a portion of, and am a bit stuck. I'm hoping for some help!

Here's an example of my text: 
Crittenden, A.N., Metabolism, Anthropometry, and Nutrition Laboratory, Department of Anthropology, University of Nevada Las Vegas, NV, United States; Sorrentino, J., D.M.D. Family and Cosmetic Dentistry, New York, United States; Moonie, S.A., Epidemiology and Biostatistics, School of Community Health Sciences, University of Nevada Las Vegas, NV, United States; Peterson, M., Dorobo Fund, Arusha, Tanzania; Mabulla, A., Department of Archaeology, University of Dar es Salaam, Dar es Salaam, Tanzania; Ungar, P.S., Department of Anthropology, University of Arkansas, Fayetteville, AK, United States


I need to extract the bolded text only into a separate column. The below topic is (I believe) close to what I want. I can't figure out how to adjust the regex to get just the text string I need using a keyword. I don't need the "join" portion at the end.


Some important information: 
The keywords I need to look for are "university", "college", "school" and "institute". 
The university names can vary in length, but are always between two commas. 
The university name can appear in various places in the text, because of this I can't just "Split into several columns" and delete the ones I don't need.

Thanks in advance!!!

Owen Stephens

unread,
Jul 18, 2017, 1:20:28 PM7/18/17
to OpenRefine
If you know that the value you want is always between two commas, then I'd suggest using an approach where you first 'split' the string into an array (a list of things), then check each value in the array to see if it is one of the strings you want to preserve - you can do this with the 'filter' function which takes in an array (list) and outputs only the items in the array that match some criteria you set.

The way I'd do this is something like:

Use 'Edit Column->Add column based on this column'
Call new column "Universities"
filter(value.split(","),v,v.trim().startsWith("University"))

Owen Stephens

unread,
Jul 18, 2017, 1:22:09 PM7/18/17
to OpenRefine

I've tried to do a screen capture here hopefully it will help



On Tuesday, July 18, 2017 at 5:12:08 PM UTC+1, Christina Miskey wrote:

Owen Stephens

unread,
Jul 18, 2017, 1:29:07 PM7/18/17
to OpenRefine
In my original reply I forgot - you need to do a 'join' at the end of the GREL expression because OpenRefine can't store arrays as arrays - you always need to convert it into a string.

So the GREL I use is
filter(value.split(","),v,v.trim().startsWith("University")).join("|")


This gives you a list of Universities extracted from the string, in a list, separated by the pipe "|" character
You could do the same with Colleges, Schools, Institutes, etc.

You could then use 'Edit Cells -> Split multi-valued cells' - with the | character as the separator - this will break the names into separate cells in a column.
I'd recommend then using the '
You can then use OpenRefine faceting and clustering on this column to clean up the data and make sure you have consistent naming of universities etc.

Ettore Rizza

unread,
Jul 18, 2017, 2:05:32 PM7/18/17
to OpenRefine
Example of alternative solution using Python/Jython if you want more flexibility (you said the phrases beginning with "school" or "college", etc, are also interesting):

import re
pattern
= re.compile(r"((university|college|school|institute).+?),", re.I)
list
= []
for i in pattern.findall(value):
     list
.append(i[0])
return ":::".join(list)

Then, you can split the new column using ::: as separator.

Thad Guidry

unread,
Jul 18, 2017, 3:50:51 PM7/18/17
to OpenRefine
Hi Christina,

You may also have a need for doing some faceting work to discover other things as well as duplicates or the variations of Educational Institutions.
I've done similar work against a bag of words such as your usecase before but within the Beef industry :)  Go figure.


In the recipe above I describe how to create a Custom Text facet and then drop in that Jython code and choose Jython as your expression language.

It would be interesting to see what that facet might discover for you.

Enjoy !
-Thad

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Christina Miskey

unread,
Jul 19, 2017, 9:10:06 AM7/19/17
to OpenRefine
Thank you all for your help! Both formulas work, though Ettore Rizza's does pull in a bit more of the text I need. Either way, these are both great solutions and will help me move forward with the project.

Thanks again!
Message has been deleted

M

unread,
Dec 28, 2018, 9:25:48 PM12/28/18
to OpenRefine
Hey all,

Great to see the support - as a follow-on to this question.

If I want to extract a list of words that occur from a (large) standard list - how would I achieve this? 

i.e I have Freeform Addresses in 'Address' Column.

I have a list of Countrynames.csv & CityNames.csv

I want to extract the list of Countries and Citys into distinct columns similar to the above (but in reference to a long csv list).

How would you achieve that in GREL or Jython?

Wes Shepherd

unread,
Dec 28, 2018, 11:04:42 PM12/28/18
to openr...@googlegroups.com
Guys,

Outside of building rules to do this, I’ve had success with Stanford NLP or Open NLP which can do places like states and countries and universities among other things.  The thing to do it hit an api with this function and return an array of values since you’ll likely have multiple ones.  Naturally, more specific or exacting extraction is much more difficult to achieve. 

Wes

Sent from my iPhone
--

Ettore Rizza

unread,
Dec 29, 2018, 9:35:01 AM12/29/18
to OpenRefine
@M: This OpenRefine Recipe looks pretty close of what you want to do.

BR,

Ettore

M

unread,
Dec 29, 2018, 10:43:26 AM12/29/18
to OpenRefine
Thanks Ettore and Wes!

I've tried to run the recipe:
- Add column based on this column 
- I've attached the csv for reference (The data is in column 1)

 but it doesn't pull the Countries into the Column. 

Any thoughts on how to resolve it?

Location.png

countriesoftheworld.csv

Ettore Rizza

unread,
Dec 29, 2018, 11:30:58 AM12/29/18
to OpenRefine
Re M,

This is because the Python script above uses a too simplistic way to tokenize sentences (ie, to split a sentence into words). 

This one is probably slightly better (but not perfect) for your use case :



import csv
import re

#remove punctuation from the value string
value = re.sub(ur'[^\w\d\s]+', '', value)

with open(r"C:\Users\Ettore\Desktop\countriesoftheworld.csv",'r') as f:
    reader = csv.reader(f)   
    words_to_match = [col[0].strip().lower() for col in reader]

return ",".join([x for x in value.split(' ') if x.lower().strip() in words_to_match])

Screenshot :

téléchargement.png

M

unread,
Dec 30, 2018, 2:55:12 PM12/30/18
to OpenRefine
Thanks Etorre - that worked great for my needs.

Much appreciated!

Vidal Santos

unread,
Mar 16, 2020, 10:18:30 AM3/16/20
to OpenRefine
Hi guys, 

I need help with something similar.

Got some affiliations to extract and tried with this one, but i'd like to add more variables to be joined in concatenation:

filter(cells["Affiliation_Norm"].value.split(","),v,v.trim().contains("Univer")).join("|")I've just have somithe needs of 

Just as how the python script on this same thread does,  but being able to recover all between the separation marks after the parsing 
(ex: " _ Institute for _, _ Univers _ ," -> _ Institute for  _| _ Univers _ )  

import re
pattern 
= re.compile(r"((university|college|school|institute).+?),", re.I)
list 
= []
for i in pattern.findall(value):
     list
.append(i[0])
return ":::".join(list)

Could it be?

Thanks, 

Thad Guidry

unread,
Mar 16, 2020, 10:44:46 AM3/16/20
to openr...@googlegroups.com
I think perhaps our existing "Word facet" might help?

1. Facet -> Customized facets -> Word facet
2. Click count, to sort by the number of each word
3. Then you can click on each individual word, like University, College, etc. to include or exclude to filter on only those rows that contain them.

image.png


Ettore Rizza

unread,
Mar 16, 2020, 6:04:43 PM3/16/20
to OpenRefine
Got some affiliations to extract and tried with this one, but i'd like to add more variables to be joined in concatenation:

filter(cells["Affiliation_Norm"].value.split(","),v,v.trim().contains("Univer")).join("|")I've just have somithe needs of

 
@Vidal : Something like this?

list_of_terms = ["univer","college","school","institute"]

to_check
= value.split(",")

final_list
= []

for term in list_of_terms:
   
for word in to_check:
       
if term in word.lower():
            final_list
.append(word.strip())

return "|".join(final_list)

Vidal Santos

unread,
Mar 18, 2020, 12:53:43 PM3/18/20
to OpenRefine
Thanks Thad and Ettore,

That script was what I was looking for (and faceting by word will be very helpful after sorting them)

Have a nice day

Vidal Santos

unread,
Apr 30, 2020, 7:07:20 AM4/30/20
to OpenRefine
Hi guys, 

I'm wondering how to see what's left in the original column after the jython script is done to check if there's still some entities left (Acronyms, most probably)
How should I tell the replace function to look for each one of the extracted values and then replace them with blank?

Thanks again

Thad Guidry

unread,
Apr 30, 2020, 10:43:51 AM4/30/20
to openr...@googlegroups.com
You should have created a new column based on that has those extracted values, YES?

If YES, then use our Facets to filter down the values that you wish to inspect in that column of extracted values, you then have a choice of:
1. Editing and replacing values directly in the Facet itself with the little edit link on each value.
2. Use Column menu to perform Custom Transform with value.replace("")



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.

Vidal Santos

unread,
Apr 30, 2020, 12:13:10 PM4/30/20
to OpenRefine
Thanks Thad, 

What I was asking for is a bit different. Let me explain it better. 
I've got to extract Institution from a set of citation data. In order to check there's no missing affiliations left I just want to erase every extracted keyword to get only the remaining string in the original column.

After the jython script you proposed I got something like this: 

COLUMN A="School of Economics and Business, University of Navarra, Campus Universitario, Pamplona, 31009, Spain"
=>
COLUMN B ="University of Navarra|School of Economics and Business|Campus Universitario"

What I need is to have a column with what's left: "Pamplona, 31009, Spain". 

Is there a way to add all the values of column B to a value.replace( , "") function? 

To unsubscribe from this group and stop receiving emails from it, send an email to openr...@googlegroups.com.

Vidal Santos

unread,
Apr 30, 2020, 12:24:07 PM4/30/20
to OpenRefine
value.replace(cells["Extract"].value.split("|")[0], "")

Something like this but with all of them at once. I've got no clue on how to call them alltogether 

Thad Guidry

unread,
Apr 30, 2020, 2:02:52 PM4/30/20
to openr...@googlegroups.com
I think you are looking for something like this

     forEach(
       value.split(","),
       i,
       if(
         inArray(cells.B.value.split("|"),i),
         "",
         i
       )
     )
     .join(",")



To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/d501acc6-2a6d-473b-89a4-adf5746dcb72%40googlegroups.com.

Thad Guidry

unread,
Apr 30, 2020, 2:12:56 PM4/30/20
to openr...@googlegroups.com
Hmm, it looks like you might have to Trim the leading and trailing whitespace as part of the initial forEach() .... so, try this additionally:

forEach(

  forEach(
    value.split(","),
    i,
    i.trim()

  ),
  i,
  if(
    inArray(cells.B.value.split("|"),i),
    "",
    i
    )
)
.join(",")

Vidal Santos

unread,
Apr 30, 2020, 2:38:11 PM4/30/20
to OpenRefine
That works like a charm.
Thanks Thad, you've saved me a lot of time.

Best wishes, 

Thad Guidry

unread,
Apr 30, 2020, 3:28:26 PM4/30/20
to openr...@googlegroups.com
Looks like we are missing this nice function on a higher level however.

Both of those functions might be valuable, I would say, to OpenRefine users such as your use case presented to me.
I'll open a new issue for us to add those 2 as new GREL String functions.

Then your GREL would have been much simpler with something like this for example:

value.trim().replaceEach(cells.B.value.split("|"), "")

1 line only compared to the previous "GREL snake" ;-)


Thad Guidry

unread,
Apr 30, 2020, 3:36:33 PM4/30/20
to openr...@googlegroups.com
I have opened Issue #2606 for your use case Vidal.


Michael Beckett

unread,
Apr 30, 2020, 4:59:24 PM4/30/20
to OpenRefine
Side question: is the inArray function part of core OpenRefine, and is it documented? I didn't find it in the GREL documentation on github, though I did see a discussion about implementing it here: https://github.com/OpenRefine/OpenRefine/issues/1853.

Thanks!

-- Michael

Tom Morris

unread,
Apr 30, 2020, 5:08:46 PM4/30/20
to openr...@googlegroups.com
"School of Economics and Business, University of Navarra, Campus Universitario, Pamplona, 31009, Spain".replace(/(University of Navarra|School of Economics and Business|Campus Universitario)/,'') yields ", , , Pamplona, 31009, Spain"

which seems pretty close to what you want, but the thing I think we're missing is the ability to use a cell value as a regex pattern.

Note that if we followed the Apache pattern the third argument to replaceEach is an array, not a string, and for this case would require constructing a variable length array of empty strings to match the length of the search array.

Tom

On Thu, Apr 30, 2020 at 3:28 PM Thad Guidry <thadg...@gmail.com> wrote:
Looks like we are missing this nice function on a higher level however.

Both of those functions might be valuable, I would say, to OpenRefine users such as your use case presented to me.
I'll open a new issue for us to add those 2 as new GREL String functions.

Then your GREL would have been much simpler with something like this for example:

value.trim().replaceEach(cells.B.value.split("|"), "")


Thad Guidry

unread,
Apr 30, 2020, 5:25:09 PM4/30/20
to openr...@googlegroups.com
Yes inArray() and randomNumber() were added last year by Owen (as part of the initial GOKb Utils extension support into OpenRefine core)
You can read further details and issues referenced in the PR itself #1993

Feel free to document them both in our GREL docs !
Our docs will be getting a major overhaul this summer anyways, but don't let that stop you from contributing now to our base wiki docs, WE NEED YOU!


Thad Guidry

unread,
Apr 30, 2020, 5:37:57 PM4/30/20
to openr...@googlegroups.com
On Thu, Apr 30, 2020 at 4:08 PM Tom Morris <tfmo...@gmail.com> wrote:
"School of Economics and Business, University of Navarra, Campus Universitario, Pamplona, 31009, Spain".replace(/(University of Navarra|School of Economics and Business|Campus Universitario)/,'') yields ", , , Pamplona, 31009, Spain"

which seems pretty close to what you want, but the thing I think we're missing is the ability to use a cell value as a regex pattern.

Note that if we followed the Apache pattern the third argument to replaceEach is an array, not a string, and for this case would require constructing a variable length array of empty strings to match the length of the search array.

Tom


Yeah, how it eventually gets implemented (or limited) is probably good to discuss in the new issue I made.
Let me add a note to that issue of my thoughts additionally as to what I was thinking...


Vidal Santos

unread,
May 3, 2020, 2:54:35 AM5/3/20
to OpenRefine
Thad, 
Please consider I was asking to replace them on a "raw" string of text and, as Tom noted, that leaves some noise in the example.
If it can do the work if the source data have a more robust structure (as it should be), it would be very useful to non advanced users as me.

Thanks again, 

Antoine Beaubien

unread,
May 7, 2020, 8:11:54 PM5/7/20
to OpenRefine
You can actually get rid of the spaces with a Regex Split() instead, and one trim before:

forEach(
 
(value.trim()).split(/\s*,\s*/),
  vI
,
 
if(
    inArray
(cells.B.value.split("|"), vI),
   
"",
    vI
 
)
)
.join("|")

No need for the double nested forEach…

Regards,
   Antoine
Reply all
Reply to author
Forward
0 new messages