Perhaps a bug in the Ruby regex for ALLCAPS?

chris...@timelinelabs.com

Sep 2, 2014, 8:41:41 PM
to Global...@googlegroups.com
Hi Jeffrey, thanks for posting the Ruby preprocessing code. I'm excited to work on reproducing this!
Could the regex for ALLCAPS have a typo?

.gsub(/[^a-z0-9()<>'`\-]{2,}\)/){ |word|

When I run:
echo -n "HOWDY"| ruby -n preprocess-twitter.rb

I get "HOWDY" back.

Should the regex be:
.gsub(/([^a-z0-9()<>'`\-]){2,}/){ |word|

With this I get:
echo -n "HOWDY"| ruby -n preprocess-twitter.rb
howdy <ALLCAPS>
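
To see the difference outside the full script, here is a minimal stand-alone sketch. Only the two patterns are taken from this thread; the helper name `apply_allcaps` and the constants `BROKEN` and `FIXED` are made up for illustration. As far as I can tell, the original pattern ends in an escaped `\)`, so it only matches when a literal ")" follows the capitals, which "HOWDY" never has.

```ruby
# Hypothetical stand-alone check; only the two patterns come from the thread.
BROKEN = /[^a-z0-9()<>'`\-]{2,}\)/  # trailing \) demands a literal ")" after the run
FIXED  = /([^a-z0-9()<>'`\-]){2,}/  # the parentheses form a group instead

# Lower-case each match and append the tag, as the block in the script does.
def apply_allcaps(text, pattern)
  text.gsub(pattern) { |word| "#{word.downcase} <ALLCAPS>" }
end

puts apply_allcaps("HOWDY", BROKEN)  # prints: HOWDY
puts apply_allcaps("HOWDY", FIXED)   # prints: howdy <ALLCAPS>
```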

Regards,

Chris

Jeffrey Pennington

Sep 2, 2014, 9:19:04 PM
to Global...@googlegroups.com, chris...@timelinelabs.com
Yes, you're right, the original ALLCAPS regex is broken. Thanks for pointing this out and providing a fix! I updated the link with a corrected version.

ian....@insight-centre.org

Feb 20, 2017, 10:29:21 AM
to GloVe: Global Vectors for Word Representation, chris...@timelinelabs.com
First, a question: I think this Ruby script may not actually have been used to prepare the embeddings on the Stanford webpage. Am I right?
If so, what WAS used? Was it prone to any of the problems below (e.g. "HOwdy" changed to "ho <allcaps> wdy")?
I'm interested in the emoticons in particular: some, like ':o)' and ':-\', are in the embedding files, while '(:' and '(-:' are not, yet the Ruby script doesn't transform them either.

A few remaining problems with the Ruby script (the one currently on the Stanford page):
 - the allcaps expression actually catches any previous tags in the text, so for example "#mytag" gets changed to "<hashtag <ALLCAPS>> mytag"
 - it detects any combination of two or more capitals as an allcaps word, so "THis" gets changed to "th <ALLCAPS>is"
 - it concatenates the <ALLCAPS> tag with the following word, so "THIS is" gets changed to "this <ALLCAPS>is"
 - strings of 2 or more whitespace characters are marked as allcaps, so "  " gets changed to "   <ALLCAPS>"
 - the tags are uppercase, but the embedding files have lower-case tags

These can be fixed by using lower-case tags in the replacement expressions and changing the allcaps regex to /\b([^a-z0-9()<>'`\-\s]){2,}\b/ (note the word boundaries \b and the \s added to the negated character class).
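
A minimal sketch of that fix in isolation, assuming the same block-style replacement as the script's allcaps step. The pattern is the one proposed above; the constant `ALLCAPS_FIXED`, the method name `mark_allcaps` and the sample strings are only for illustration, and the other tagging steps of the script are not reproduced here.

```ruby
# Pattern proposed above: word boundaries on both sides, whitespace excluded.
ALLCAPS_FIXED = /\b([^a-z0-9()<>'`\-\s]){2,}\b/

# Hypothetical helper: lower-case each all-caps word and append a lower-case tag.
def mark_allcaps(text)
  text.gsub(ALLCAPS_FIXED) { |word| "#{word.downcase} <allcaps>" }
end

# Inputs taken from the problem list; "<hashtag> mytag" assumes the earlier
# steps already emit lower-case tags.
["HOWDY", "THIS is", "THis", "  ", "<hashtag> mytag"].each do |sample|
  puts "#{sample.inspect} => #{mark_allcaps(sample).inspect}"
end
```

In this sketch "HOWDY" and "THIS is" become "howdy <allcaps>" and "this <allcaps> is", while "THis", the double space and the lower-case hashtag tag pass through unchanged. An upper-case tag such as "<HASHTAG>" would still be caught by this pattern, which is why the replacement expressions need lower-case tags as well.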

Cheers
Ian