Expression for word count and creating a new column based on this count

Lukas Bechera

Jun 10, 2015, 9:54:39 AM
to openr...@googlegroups.com
Hi guys,

I am relatively new to OpenRefine. I use it for PPC and I have one problem. Let me describe what I want to achieve:
I'd like to create a new column based on the column "search term", where each cell value is the number of words in that search term. So the search term "big red car" would have the value 3 in the new column. Is something like that possible? All I found was "word facet" and "text length facet". I'd like to do an analysis based on the number of words used in search terms.

Thanks for any help
Lukas

Owen Stephens

Jun 10, 2015, 10:26:02 AM
to openr...@googlegroups.com
Assuming you are OK with saying that whitespace defines the boundary between words this is simple. Use the GREL expression:

value.split(" ").length()

This takes the string and splits it up into words by looking for spaces between them. The result is an array of words (which is what is used to create the 'word facet'). The 'length()' counts the number of items in the array.
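For anyone who prefers the Python / Jython expression language, the same idea can be sketched in plain Python. This is only an illustration of the split-and-count logic, not a statement about how GREL behaves internally; the function names are made up for this example. It also shows one pitfall of splitting on a single space: in Python, two spaces in a row produce an empty string that gets counted.

```python
def word_count_single_space(value):
    # Split on every single space character. In Python, consecutive
    # spaces produce empty strings, which are counted as "words" here.
    return len(value.split(" "))


def word_count_whitespace(value):
    # split() with no argument splits on runs of whitespace and drops
    # leading/trailing whitespace, so stray spaces do not inflate the count.
    return len(value.split())


print(word_count_single_space("big red car"))   # 3
print(word_count_single_space("big  red car"))  # 4 (empty string between the two spaces)
print(word_count_whitespace("big  red car"))    # 3
```

Inside OpenRefine's Python / Jython expression box the cell is available as `value` and the expression has to end with a `return`, so the one-line equivalent would be `return len(value.split())`.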

If you want to worry about things like hyphenation or punctuation forming word boundaries, then you may need a more sophisticated approach to the 'split' part of this expression.

Owen

Thad Guidry

Jun 10, 2015, 12:09:51 PM
to openrefine
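Also, you may want to replace non-breaking spaces with regular spaces as a very first pass. (The first argument below is meant to be a non-breaking space, U+00A0, which looks the same as a regular space here.)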

value.replaceChars(" ", " ").split(" ").length()
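Because a non-breaking space is invisible on the page, here is a small Python sketch of the same first pass with the character written out as an escape. It assumes the non-breaking space in question is U+00A0; `NBSP` and `word_count_nbsp` are names invented for this example.

```python
NBSP = u"\u00a0"  # non-breaking space, written as an escape so it is visible


def word_count_nbsp(value):
    # Turn non-breaking spaces into regular spaces first, then split
    # on a regular space and count the pieces.
    return len(value.replace(NBSP, u" ").split(u" "))


term = u"big\u00a0red car"
print(len(term.split(u" ")))  # 2: the non-breaking space is not a split point
print(word_count_nbsp(term))  # 3
```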


Thad Guidry

Jun 10, 2015, 12:19:02 PM
to openrefine
Actually, just use regex to do the heavy lifting...

Partition on a word boundary (where \W will treat non-breaking spaces automatically as regular spaces):

value.partition(/\W/).length()

Thad Guidry

Jun 10, 2015, 12:36:43 PM
to openrefine
Hmm... \W does not treat non-breaking spaces automatically as regular spaces in Java... bummer... use \u00A0 instead.

Tom Morris

Jun 10, 2015, 1:12:31 PM
to openr...@googlegroups.com
I would either use split as suggested by Owen or split with a regex to
do fancier splitting, such as

value.split(/\W+/).length()

I don't think partition will work in this particular case. Also, be
careful of using pre-defined character classes like \W if you're
dealing with non-ASCII characters. You can use Unicode character
classes instead or just roll your own with the characters that you
consider to be word separators.
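
To make the non-ASCII caveat concrete, here is a rough Python 3 sketch. It is not GREL: Python's `re.ASCII` flag is used only to imitate an ASCII-only `\W` (Java's predefined classes are ASCII-only unless Unicode character classes are enabled), and `word_count_regex` is a name made up for this example. Empty tokens are filtered out because a separator at the start or end of the string leaves an empty piece behind.

```python
import re


def word_count_regex(value, ascii_only=False):
    # With re.ASCII, \W treats every non-ASCII character (such as an
    # accented letter) as a separator; without it, \W is Unicode-aware.
    flags = re.ASCII if ascii_only else 0
    tokens = re.split(r"\W+", value, flags=flags)
    # Drop empty strings left by leading or trailing separators.
    return len([t for t in tokens if t])


accented = "caf\u00e9 cr\u00e8me"  # "café crème"
print(word_count_regex("big red-car!"))             # 3
print(word_count_regex(accented))                   # 2
print(word_count_regex(accented, ascii_only=True))  # 3: "caf", "cr", "me"
```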

Tom

raja kumar dash

Apr 3, 2017, 4:39:29 PM
to OpenRefine
I did the following:

1. Make a copy of the text column to be analyzed.
2. Strip punctuation: value.replace(/\p{Punct}/,' ')
3. Trim leading/trailing spaces and collapse consecutive spaces
4. Count words: value.split(" ").length()
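
The steps above can be sketched as one Python function. This is an approximation rather than the exact OpenRefine operations: `string.punctuation` stands in for the `\p{Punct}` class (both cover the ASCII punctuation characters), and `count_words_clean` is a hypothetical name.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def count_words_clean(value):
    # Step 2: replace each punctuation character with a space.
    no_punct = _PUNCT.sub(" ", value)
    # Step 3: collapse runs of whitespace and trim both ends.
    collapsed = re.sub(r"\s+", " ", no_punct).strip()
    # Step 4: count the pieces; an empty string has zero words.
    return len(collapsed.split(" ")) if collapsed else 0


print(count_words_clean("big, red-car!"))  # 3
print(count_words_clean("..."))            # 0
```

Note that replacing punctuation with a space means a hyphenated term such as "red-car" counts as two words; replace with an empty string instead if it should count as one.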

As for the rest of your requirement, you'd have to write a bit of code (e.g., in Jython) to match search words as a tuple against target words tuple.

I suppose you might be able to use the forEach() function, but it'll probably get hairy.