Filtering by upper case

Gustavo Magalhães

May 23, 2014, 4:50:43 PM

to openr...@googlegroups.com

Hello guys,

I'm working on a data set where the description of an item contains a few words in upper case. I need to create a new column containing only those words, the ones typed in upper case. Any ideas?

Thx.

Thad Guidry

May 23, 2014, 9:15:06 PM

to openr...@googlegroups.com

You can use a GREL statement such as:

value.toUppercase()

Documented here:

https://github.com/OpenRefine/OpenRefine/wiki/GREL-String-Functions#case-conversion

-Thad


Owen Stephens

May 30, 2014, 8:15:21 AM

to openr...@googlegroups.com

Slightly late in responding to this. I think I understood the question differently from Thad, so I'm posting my answer just in case:

My understanding was that if the original value is:

MONDAY Tuesday WEDNESDAY Thursday Friday

You want to get

MONDAY WEDNESDAY

If this is the case, I think you need to split the original data into an array of words (each element in the array is a word from your original data), then filter that array to drop any word containing a lowercase character. You can do this with something like:

filter(value.split(/[^\w]/),v,isNonBlank(v.match(/([^a-z]*)/)[0]))

This splits the original data on non-word characters (using 'split') to get the array of words, then filters out any members of the array that contain a lowercase character. You are left with an array only containing the uppercase words. If you want to get this back to a string you can use 'join' on the array.
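
For anyone who wants to check the split/filter/join logic outside OpenRefine, here is a minimal Python sketch of the same idea. It is not GREL, and the function name is made up for illustration. It assumes the GREL 'match' call tests the whole word against the pattern, which is why the sketch uses a full-string match.

```python
import re

def keep_non_lowercase_words(text):
    """Hypothetical helper mirroring the split-then-filter approach."""
    # Split on non-word characters, as the GREL split on /[^\w]/ does.
    # re.ASCII keeps \w limited to [A-Za-z0-9_].
    words = re.split(r"[^\w]", text, flags=re.ASCII)
    # Keep non-empty words in which no character is a lowercase a-z.
    return [w for w in words if w and re.fullmatch(r"[^a-z]*", w)]

sample = "MONDAY Tuesday WEDNESDAY Thursday Friday"
# Join the surviving words back into a single string.
print(" ".join(keep_non_lowercase_words(sample)))  # prints: MONDAY WEDNESDAY
```
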

Owen

Tom Morris

May 30, 2014, 8:51:38 AM

to openr...@googlegroups.com

I think your interpretation is closer to what the OP intended, but this:

On Fri, May 30, 2014 at 8:15 AM, Owen Stephens <ow...@ostephens.com> wrote:

You can do this with something like:

filter(value.split(/[^\w]/),v,isNonBlank(v.match(/([^a-z]*)/)[0]))

This splits the original data on non-word characters (using 'split') to get the array of words, then filters out any members of the array that contain a lowercase character. You are left with an array only containing the uppercase words.

Is going to result in words which are "not lowercase," rather than those which are "uppercase". Two different things.

You can use [A-Z] if you're sure there are no non-ASCII characters in your data, but you'd probably be better served by something like \p{Lu}, which matches any Unicode uppercase letter.

http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
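
To make the difference concrete, here is a small Python sketch comparing the three tests. It is only an illustration: Python's built-in re module has no \p{Lu}, so the sketch checks the Unicode category "Lu" via unicodedata instead, and all three function names are hypothetical.

```python
import re
import unicodedata

def is_not_lowercase(word):
    # "Not lowercase": no character in a-z (digits and symbols pass too).
    return re.fullmatch(r"[^a-z]*", word) is not None

def is_ascii_uppercase(word):
    # ASCII-only test, equivalent to matching the whole word against [A-Z]+.
    return re.fullmatch(r"[A-Z]+", word) is not None

def is_unicode_uppercase(word):
    # Every character is in Unicode category Lu, the set \p{Lu} covers in Java.
    return bool(word) and all(unicodedata.category(ch) == "Lu" for ch in word)

# "\u00c9COLE" is ECOLE with an accented capital E;
# "\u03b1\u03b2\u03b3" is three lowercase Greek letters.
for w in ["BRAND", "123", "\u00c9COLE", "\u03b1\u03b2\u03b3"]:
    print(w, is_not_lowercase(w), is_ascii_uppercase(w), is_unicode_uppercase(w))
```

Only the last test accepts the accented uppercase word while rejecting both "123" and the lowercase Greek letters.
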

Tom

Owen Stephens

May 30, 2014, 12:16:16 PM

to openr...@googlegroups.com

Thanks Tom - good point.

The expression I've offered may need tweaking based on Gustavo's actual data. If the 'words' can contain characters other than uppercase letters (e.g. if you have hyphenated words in uppercase and want to preserve them in their hyphenated form), you'd need to ensure the expression you use to split the string doesn't split on hyphens (which mine does, I think) and include the hyphen as an allowed character in your filter.

The same would be true of other characters, of course. The hyphen just seems one of the more obvious possibilities.
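
As a rough illustration of that tweak, the Python sketch below (again not GREL; the function name and sample text are invented) keeps the hyphen out of the split pattern and allows it inside the filter pattern. It uses [A-Z] for brevity, so it assumes ASCII-only data.

```python
import re

def uppercase_words_keeping_hyphens(text):
    """Hypothetical variant that preserves hyphenated uppercase words."""
    # Split on runs of characters that are neither word characters nor hyphens.
    words = re.split(r"[^\w-]+", text)
    # Keep words made of uppercase letters, optionally joined by single hyphens.
    return [w for w in words if re.fullmatch(r"[A-Z]+(?:-[A-Z]+)*", w)]

print(uppercase_words_keeping_hyphens("product x COCA-COLA, well-known BRAND"))
# prints: ['COCA-COLA', 'BRAND']
```
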

Gustavo Magalhães

Jun 20, 2014, 6:05:22 PM

to openr...@googlegroups.com

Thank you all for the help (and I'm sorry for taking so long to give you this feedback)

Owen, your expression works perfectly. Originally, I had a dataset with products and brands. Something like:

BRAND product

product x BRAND

product y BRAND

Now I have products and brands in different columns, as expected.
