Count how many times each word appears in a list of tweets


Simon Breton

Feb 2, 2017, 5:06:52 AM
to OpenRefine
Hello, 

I'm new to OpenRefine. I've successfully completed some tutorials and I know how to use the basic functions. However, I'm not yet comfortable building or designing my own recipes.

I have an Excel file with one column of 10k rows. Each row is a tweet. I would like to do a text facet on all my tweets and count how many times each word appears.

How can I do that? What are the main steps I should follow?

I hope I'm clear. 

Thanks. 

Ettore RIZZA

Feb 2, 2017, 5:12:25 AM
to openr...@googlegroups.com
Hi Simon,

You can facet by words on your column, using the column menu: Facet > Customized facets > Word facet.

The output will be a list of words with their counts.

Hope this helps.
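For readers who want to reproduce that list outside OpenRefine, here is a minimal Python sketch of the same idea: split each tweet on single spaces and tally the pieces. The sample tweets and the function name `word_counts` are made up for illustration, and this is not OpenRefine's own implementation.

```python
from collections import Counter

def word_counts(tweets):
    """Tally every space-separated piece across all tweets."""
    counts = Counter()
    for tweet in tweets:
        # Naive split on single spaces; drop the empty strings
        # produced by consecutive spaces.
        counts.update(w for w in tweet.split(" ") if w)
    return counts

# Hypothetical sample data
tweets = ["open refine is great", "refine your data", "great data"]
counts = word_counts(tweets)
for word, n in counts.most_common(3):
    print(word, n)
```

Note that this tallies every occurrence, whereas a facet should count matching rows, so a word repeated inside a single tweet may give different numbers in the two approaches.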

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ettore RIZZA

Feb 2, 2017, 5:23:37 AM
to openr...@googlegroups.com

I forgot to specify that you can then copy and paste this list of words.



Simon Breton

Feb 2, 2017, 5:39:44 AM
to OpenRefine
Hello. Thanks a lot. I've already done this, but I get the following error: "1585 choices total, too many to display". And I get this error even when working on a sample of my data (500 rows, not the full 10k).

Ettore RIZZA

Feb 2, 2017, 5:47:25 AM
to openr...@googlegroups.com
OR limits the number of choices to save memory. You can raise this limit by clicking on "Set choice count limit", but in that case make sure you have allocated more memory (see here for details: https://github.com/OpenRefine/OpenRefine/wiki/FAQ:-Allocate-More-Memory).




Ettore RIZZA

Feb 2, 2017, 6:47:02 AM
to openr...@googlegroups.com
By the way, the "facet by words" function is a simple shortcut for:

GREL: value.split(' ')

It's not an advanced tokenization.

You can do it in GREL by following this procedure:

The GREL formula used is:

value.split(' ').join('||||')
Then I split the multi-valued cells in the new column "words" using the |||| separator. The "cluster and edit" function showed that a lot of words could be normalized. Finally, I filled down the ID column, so that each word in the new column is associated with a tweet.

Imho, the easiest way to do frequency analysis would be to export this file to .xls and use Pivot tables in Excel.
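The procedure above (one row per word, each row keeping its tweet ID, then a pivot-style count) can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample tweets, the function name `explode_words`, and the use of lowercasing as a stand-in for the "cluster and edit" normalization are all hypothetical.

```python
from collections import Counter

def explode_words(tweets):
    """Return (tweet_id, word) rows, one row per word."""
    rows = []
    for tweet_id, text in enumerate(tweets, start=1):
        for word in text.split(" "):
            if word:  # skip empty strings from repeated spaces
                # Lowercasing is an assumed, very crude normalization.
                rows.append((tweet_id, word.lower()))
    return rows

# Hypothetical sample data
tweets = ["Refine refine data", "clean data"]
rows = explode_words(tweets)

# Pivot-table style summary: total occurrences per word,
# and the number of distinct tweets containing each word.
occurrences = Counter(word for _, word in rows)
tweet_counts = Counter(word for _, word in set(rows))

for word in sorted(occurrences):
    print(word, occurrences[word], tweet_counts[word])
```

Keeping the tweet ID on every row is what makes the second count possible: it separates "how often is this word used" from "in how many tweets does it appear".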

Don't hesitate to ask for clarification if the screencast (or my International Broken English) is not clear.

Owen Stephens

Feb 2, 2017, 7:07:31 AM
to OpenRefine
+1 to pretty much everything that Ettore has said. What I'd add is that if you want to do textual analysis to any degree of sophistication you might want to look at tools that are more focussed on this task.

One that is easy to get started with is Voyant Tools https://voyant-tools.org. This online tool gives you information on the most common words, term counts, and phrase counts, lets you see key words in context (KWIC), and incorporates some visualisations like word clouds and spark lines (which show how words are distributed throughout the corpus, so e.g. if tweets are in chronological order you can see whether some terms appear more commonly during specific times).

AntConc is another text analysis tool with similar functions, but this one you run locally on your computer. There is a good introduction at http://programminghistorian.org/lessons/corpus-analysis-with-antconc

If this is new to you, I'd start with Voyant Tools and see if that does the job, and move on to AntConc only if I hit limitations with Voyant.

Owen

Ettore Rizza

Feb 3, 2017, 9:38:29 AM
to OpenRefine

Hi Owen. You are right of course, and +2 for these excellent tools that are Voyant and the ANT collection. There are many tools better suited to text processing than OpenRefine, at least for a beginner. But, and you know this far better than I do, OR offers a bunch of possibilities, whether with GREL or Jython (I do not know Clojure at all). It also has the advantage of being able to work with structured text in a spreadsheet, which is not the case with Voyant. And finally, it allows very precise fine-tuning when you know a little about regular expressions (the basis of tokenization, after all).

In Jython, tokenizing the tweets might be as simple as this:

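import re
# Match words, then dollar amounts, then any other run of non-space characters.
tokens = re.findall(r"\w+|\$[\d\.]+|\S+", value)
# Join the tokens with the separator used to split the multi-valued cells.
return "||||".join(tokens)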


By modifying the regular expression, you could handle the problem of a URL being split into separate tokens, such as "http" and "://...".
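As a rough sketch of such a modification, the standalone Python snippet below compares the pattern from this post with a variant that tries to match a whole URL first. The `https?://\S+` alternative, the names `BASIC`, `WITH_URLS` and `tokenize`, and the sample tweet are assumptions made for illustration. Real tweets will likely need more cases (mentions, hashtags, emoji).

```python
import re

# Pattern from the post: words, dollar amounts,
# then any other run of non-space characters.
BASIC = r"\w+|\$[\d\.]+|\S+"
# Assumed variant: match a whole URL first so it stays one token.
WITH_URLS = r"https?://\S+|" + BASIC

def tokenize(text, pattern=WITH_URLS):
    """Return the list of tokens matched by the given pattern."""
    return re.findall(pattern, text)

# Hypothetical sample tweet
tweet = "Reading https://t.co/abc now, cost $3.50"
print(tokenize(tweet, BASIC))
# ['Reading', 'https', '://t.co/abc', 'now', ',', 'cost', '$3.50']
print(tokenize(tweet))
# ['Reading', 'https://t.co/abc', 'now', ',', 'cost', '$3.50']
```

The order of the alternatives matters: the regex engine takes the first alternative that matches at a given position, so the URL branch has to come before `\w+`.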

If the punctuation is not important, value.fingerprint() is a straightforward method of normalization, and I suspect that Voyant uses something similar.
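To give an idea of what this kind of normalization does, here is a rough Python approximation of a fingerprint key: trim, lowercase, fold accents to ASCII, strip punctuation, then split, de-duplicate, sort and re-join the tokens. It is a sketch written for this illustration, not OpenRefine's actual code, and the exact rules in OpenRefine may differ in detail.

```python
import string
import unicodedata

def fingerprint(text):
    """Rough approximation of a fingerprint key (not OpenRefine's exact code)."""
    text = text.strip().lower()
    # Fold accented characters to plain ASCII where possible.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate, sort, and re-join.
    return " ".join(sorted(set(text.split())))

print(fingerprint("Café, café!  OPEN refine"))
# cafe open refine
```

Because the tokens are sorted and de-duplicated, applying this to a whole tweet discards word order and repeated words. For frequency counts it is safer to apply it to each word after splitting.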

Last but not least, OR lets you work with Natural Language Processing APIs. One famous example is the "Named Entity Recognition" extension.

In short, although I agree totally with you, I'd advise Simon to try to reach the OR limits before moving on to a more specific tool. :)
