How can I extract text snippets with Regex and Open Refine?

cosmin

Mar 12, 2013, 6:45:15 AM
to openr...@googlegroups.com
Hello Open Refine Ninjas,

What's the problem?


I'm analyzing tweets. As you know, some tweets contain hashtags and links, and extracting them with Python is very easy.
The problem is that I don't want to switch from one tool to another to analyze the data. Is it possible to do these
steps with Open Refine?

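                # Assumes "import re" at the top of the script; result is one tweet, data is the output row.
                # Hashtags: "#name" -> "name", joined with "::"
                thashtags=re.findall("#([a-z0-9]+)", result['text'], re.I)
                data['hashtags']='::'.join(thashtags)

                # Mentioned users: "@name" -> "name"
                tuser=re.findall("@([a-z0-9]+)", result['text'], re.I)
                data['user identified']='::'.join(tuser)

                # Retweeted users: "RT @name" -> "name"
                tretweets=re.findall("RT @([a-z0-9]+)", result['text'], re.I)
                data['retweets']='::'.join(tretweets)

                # Links (http or https); raw string so the backslashes reach the regex engine unchanged
                tlinks=re.findall(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", result['text'], re.I)
                data['links']='$$$'.join(tlinks)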


Thanks,
Cosmin

Tom Morris

Mar 12, 2013, 8:45:50 AM
to openr...@googlegroups.com
On Tue, Mar 12, 2013 at 6:45 AM, cosmin <cosmi...@googlemail.com> wrote:

I'm analyzing tweets. As you know, some tweets contain hashtags and links, and extracting them with Python is very easy.
The problem is that I don't want to switch from one tool to another to analyze the data. Is it possible to do these
steps with Open Refine?

                thashtags=re.findall("#([a-z0-9]+)", result['text'], re.I)
                data['hashtags']='::'.join(thashtags)

Sounds like Refine's match() function might be appropriate.  Have you tried that?

If you select Edit cells -> Transform..., the preview window where you enter expressions has a Help tab listing all the available functions and controls. That will give you an idea of what you can use.
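
For readers more at home in Python, here is a rough sketch of how match() seems to behave: as far as I understand it, the pattern has to match the entire cell value, and the result is an array of the capture groups, so an index such as [0] picks a capture group. The helper name match_like_grel and the sample string are made up for illustration; this is an emulation with Python's re module, not Refine's own code.

```python
import re

def match_like_grel(value, pattern):
    """Return the capture groups as a list if the WHOLE string matches, else None.

    Hypothetical helper that mimics the assumed behavior of Refine's match().
    """
    m = re.fullmatch(pattern, value)
    return list(m.groups()) if m else None

# Two capture groups, so the result has two elements
groups = match_like_grel("user: cosmin, id: 42", r"user: (\w+), id: (\d+)")
print(groups)

# The pattern does not cover the whole string, so there is no match
print(match_like_grel("no match here", r"(\d+)"))
```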

Tom

cosmin

Mar 12, 2013, 5:01:33 PM
to openr...@googlegroups.com
Hi Tom,

Thanks for the pointer. The following expression extracts a single hashtag from a tweet:

value.match(/.*(#([a-z0-9]+)).*/)[0]


But how can I extract more than one hashtag from a tweet and store them all in one cell?

Input >> Test Tweet with #hashtag1 #hashtag2 #hashtag3
Output >> #hashtag1 #hashtag2 #hashtag3
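
To make the goal concrete, here is a small Python sketch using the sample tweet above. It compares a full-string pattern in the style of the match() expression with re.findall. In Python the leading `.*` is greedy, so the full-string pattern captures only one hashtag (the last one); I would expect Refine's regex engine to behave similarly, but that is an assumption. The variable names are illustrative only.

```python
import re

tweet = "Test Tweet with #hashtag1 #hashtag2 #hashtag3"

# Full-string pattern in the style of the match() expression:
# the greedy leading ".*" swallows everything up to the LAST hashtag.
single = re.fullmatch(r".*(#([a-z0-9]+)).*", tweet).group(1)
print(single)

# findall returns every non-overlapping match, which is the desired behavior
all_tags = re.findall(r"#[a-z0-9]+", tweet, re.I)
output = " ".join(all_tags)
print(output)
```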

Martin Magdinier

Mar 12, 2013, 11:39:23 PM
to openr...@googlegroups.com

The [0] at the end of the expression indicates which hashtag you want to extract:
[0] for the first element
[1] for the second
and so on ...

To find out how many hashtags your string contains, you can use a countif-style facet (see http://googlerefine.blogspot.ca/2011/09/countif-in-google-refine-with.html)
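
For comparison, the count itself is a one-liner in Python. This is only a sketch of the counting idea, not the facet technique from the linked post, and count_hashtags is a made-up name.

```python
import re

def count_hashtags(text):
    """Count tokens that look like hashtags: '#' followed by letters or digits."""
    return len(re.findall(r"#[a-z0-9]+", text, re.I))

print(count_hashtags("Test Tweet with #hashtag1 #hashtag2 #hashtag3"))
```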

Martin


Tom Morris

Mar 13, 2013, 1:03:31 AM
to openr...@googlegroups.com
There's an open issue for that, which I forgot about.  The workaround listed here might help:


Tom

cosmin

Mar 13, 2013, 5:22:17 PM
to openr...@googlegroups.com
Hi Martin,

I need an expression that extracts all elements, something like:

value.match(/.*(#([a-z0-9]+)).*/)[all].join("addSpecialCharacterBetweenEachElement")
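
In Python terms, the pseudo-expression above corresponds roughly to the sketch below. The function name join_all_hashtags and the default separator are assumptions chosen for illustration.

```python
import re

def join_all_hashtags(value, separator="::"):
    """Collect every hashtag name (without the '#') and join them with a separator."""
    tags = re.findall(r"#([a-z0-9]+)", value, re.I)
    return separator.join(tags)

print(join_all_hashtags("Test Tweet with #hashtag1 #hashtag2 #hashtag3"))
print(join_all_hashtags("Test Tweet with #hashtag1 #hashtag2 #hashtag3", " "))
```

One possible way to stay inside Refine: the expression editor also offers a Python / Jython language option, where the expression is written as a function body that reads `value` and ends with `return`. Under that assumption, the import line and the two body lines of the function could be pasted in more or less as they are, with the separator written out as a literal.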
[Attached image: orefine.png]

Martin Magdinier

Mar 20, 2013, 9:19:56 AM
to openr...@googlegroups.com

OK. I have no idea how to do that in a single expression.
