filter(value.split(","),v,v.trim().startsWith("University"))
filter(value.split(","),v,v.trim().startsWith("University")).join("|")
This gives you the universities extracted from the string, as a single list separated by the pipe "|" character. You could do the same with Colleges, Schools, Institutes, etc.
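For anyone who prefers the Python/Jython expression language, the same filter can be sketched as a plain function. The name `extract_by_prefix` and its parameters are made up for illustration, and unlike the GREL expression this sketch also trims the values it keeps.

```python
def extract_by_prefix(value, prefix="University", sep="|"):
    """Keep the comma-separated parts of value that start with prefix."""
    # Trim each part first so leading spaces after commas do not block the match
    parts = [part.strip() for part in value.split(",")]
    return sep.join(part for part in parts if part.startswith(prefix))


affiliation = "School of Economics and Business, University of Navarra, Pamplona, Spain"
print(extract_by_prefix(affiliation))  # University of Navarra
```

Inside an OpenRefine Jython expression the function body would be used directly, ending with `return` instead of `print`.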
You could then use 'Edit Cells -> Split multi-valued cells', with the | character as the separator, to break the names into separate cells in a column. I'd recommend then using OpenRefine faceting and clustering on this column to clean up the data and make sure you have consistent naming of universities etc.
--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--

import csv
import re

#remove punctuation from the value string
value = re.sub(ur'[^\w\d\s]+', '', value)

with open(r"C:\Users\Ettore\Desktop\countriesoftheworld.csv",'r') as f:
    reader = csv.reader(f)
    words_to_match = [col[0].strip().lower() for col in reader]

return ",".join([x for x in value.split(' ') if x.lower().strip() in words_to_match])

I have some affiliations to extract and tried this one, but I'd like to add more terms to be matched and joined in the result:
filter(cells["Affiliation_Norm"].value.split(","),v,v.trim().contains("Univer")).join("|")
I've just had similar needs:
list_of_terms = ["univer","college","school","institute"]
to_check = value.split(",")
final_list = []
for term in list_of_terms:
    for word in to_check:
        if term in word.lower():
            final_list.append(word.strip())
return "|".join(final_list)
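One thing to watch with the nested loops above: because the outer loop runs over the terms, the results come out grouped by term rather than in their original order, and a segment containing two terms (say "University College London") is appended twice. A possible variant, sketched here as a standalone function with made-up names (`extract_institutions`, `TERMS`), loops over the segments instead.

```python
TERMS = ["univer", "college", "school", "institute"]


def extract_institutions(value, terms=TERMS, sep="|"):
    """Return the comma-separated segments of value that contain any term."""
    matches = []
    for segment in value.split(","):
        cleaned = segment.strip()
        # any() stops at the first matching term, so each segment is kept once
        if cleaned and any(term in cleaned.lower() for term in terms):
            matches.append(cleaned)
    return sep.join(matches)


sample = "Dept. of Physics, University College London, Institute of Physics, UK"
print(extract_institutions(sample))
# University College London|Institute of Physics
```

In an OpenRefine Jython expression you would call it with `value` and `return` the result.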
--
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/fc336089-bfba-4eda-8bc1-8ccd73dbc769%40googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/d501acc6-2a6d-473b-89a4-adf5746dcb72%40googlegroups.com.
Looks like we are missing this nice function at a higher level, however. I noticed that in Apache StringUtils we have replaceEach() and replaceEachRepeatedly() available:
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#replaceEach-java.lang.String-java.lang.String:A-java.lang.String:A-
Both of those functions might be valuable, I would say, to OpenRefine users with use cases like the one you presented to me. I'll open a new issue for us to add those 2 as new GREL String functions. Then your GREL would have been much simpler, with something like this for example:
value.trim().replaceEach(cells.B.value.split("|"), "")
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/CAChbWaMh6dhU%2BNLfr_S_GbuTAZZKjksYObfTnQgzxcmFrWJROQ%40mail.gmail.com.
"School of Economics and Business, University of Navarra, Campus Universitario, Pamplona, 31009, Spain".replace(/(University of Navarra|School of Economics and Business|Campus Universitario)/,'') yields ", , , Pamplona, 31009, Spain"which seems pretty close to what you want, but the thing I think we're missing is the ability to use a cell value as a regex pattern.Note that if we followed the Apache pattern the third argument to replaceEach is an array, not a string, and for this case would require constructing a variable length array of empty strings to match the length of the search array.Tom
forEach(
  (value.trim()).split(/\s*,\s*/),
  vI,
  if(
    inArray(cells.B.value.split("|"), vI),
    "",
    vI
  )
)
.join("|")
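For comparison, the same whole-segment matching can be sketched in Python. The function name `drop_listed_segments` is made up for illustration, and the sketch differs from the GREL above in one respect: matching segments are dropped rather than replaced with empty strings, so the output has no empty slots.

```python
import re


def drop_listed_segments(value, listed_cell, sep="|"):
    """Split value on commas and drop segments that appear in listed_cell."""
    listed = set(listed_cell.split(sep))
    segments = re.split(r"\s*,\s*", value.strip())
    # Only exact, whole-segment matches are removed
    return sep.join(s for s in segments if s not in listed)


address = "School of Economics and Business, University of Navarra, Pamplona, 31009, Spain"
listed = "University of Navarra|School of Economics and Business"
print(drop_listed_segments(address, listed))
# Pamplona|31009|Spain
```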