Perhaps bug in ruby regex for ALLCAPS?

chris...@timelinelabs.com

Sep 2, 2014, 8:41:41 PM

to Global...@googlegroups.com

Hi Jeffrey, thanks for posting the Ruby preprocessing code. I'm excited to work on reproducing this!

Could the regex for ALLCAPS have a typo?

.gsub(/[^a-z0-9()<>'`\-]{2,}\)/){ |word|

When I run:

echo -n "HOWDY"| ruby -n preprocess-twitter.rb

I get "HOWDY" back.

Should the regex be:

.gsub(/([^a-z0-9()<>'`\-]){2,}/){ |word|

With this I get:

echo -n "HOWDY"| ruby -n preprocess-twitter.rb

howdy <ALLCAPS>
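
A minimal sketch comparing the two patterns quoted above. The helper `tag_allcaps` and the constant names are hypothetical, and the replacement body (`word.downcase + " <ALLCAPS>"`) is an assumption inferred from the output shown here, not copied from preprocess-twitter.rb. As quoted, the first pattern ends in `\)`, so it only matches a run that is followed by a literal closing parenthesis.

```ruby
# The two regexes are the ones quoted in this post; everything else is illustrative.
BROKEN_ALLCAPS = /[^a-z0-9()<>'`\-]{2,}\)/  # needs a literal ")" after the run
FIXED_ALLCAPS  = /([^a-z0-9()<>'`\-]){2,}/

# Hypothetical helper: lower-case each match and append the tag (assumed replacement).
def tag_allcaps(text, pattern)
  text.gsub(pattern) { |word| word.downcase + " <ALLCAPS>" }
end

puts tag_allcaps("HOWDY", BROKEN_ALLCAPS)  # prints: HOWDY
puts tag_allcaps("HOWDY", FIXED_ALLCAPS)   # prints: howdy <ALLCAPS>
```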

Regards,

Chris

Jeffrey Pennington

Sep 2, 2014, 9:19:04 PM

to Global...@googlegroups.com, chris...@timelinelabs.com

Yes, you're right, the original ALLCAPS regex is broken. Thanks for pointing this out and providing a fix! I updated the link with a corrected version.

ian....@insight-centre.org

Feb 20, 2017, 10:29:21 AM

to GloVe: Global Vectors for Word Representation, chris...@timelinelabs.com

First, a question: I think this Ruby script may not actually have been used to prepare the embeddings on the Stanford webpage. Am I right?

If so, what WAS used? Was it prone to any of the problems below (e.g. "HOwdy" changed to "ho <allcaps> wdy")?

I'm interested in the emoticons in particular: some, like ':o)' and ':-\', are in the embedding files, while '(:' and '(-:' are not, yet the Ruby script doesn't transform them either.

A few remaining problems with the Ruby script (the one currently on the Stanford page):

- the allcaps expression actually catches any previous tags in the text, so for example "#mytag" gets changed to "<hashtag <ALLCAPS>> mytag"

- it detects any combination of two or more capitals as an allcaps word, so "THis" gets changed to "th <ALLCAPS>is"

- it concatenates the <allcaps> tag with the following word, so "THIS is" gets changed to "this <ALLCAPS>is"

- strings of 2 or more whitespace characters are marked as allcaps, so " " gets changed to " <ALLCAPS>"

- the tags are uppercase, but the embedding files have lower-case tags

These can be fixed by using lower-case tags in the replacement expressions and changing the allcaps regex to /\b([^a-z0-9()<>'`\-\s]){2,}\b/ (note the word boundaries \b and added \s to the negated character group).
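
A rough sketch of the proposed pattern applied to the problem cases listed above. The helper name `tag_allcaps_v2` and the replacement body are assumptions for illustration; only the regex is taken from this post. Note that the tags must already be lower-case when this step runs, since an upper-case tag such as `<HASHTAG>` would still match between the word boundaries.

```ruby
# Regex as proposed above: word boundaries added, \s excluded from the run.
PROPOSED_ALLCAPS = /\b([^a-z0-9()<>'`\-\s]){2,}\b/

# Hypothetical helper: lower-case each match and append a lower-case tag.
def tag_allcaps_v2(text)
  text.gsub(PROPOSED_ALLCAPS) { |word| word.downcase + " <allcaps>" }
end

puts tag_allcaps_v2("THIS is")          # prints: this <allcaps> is
puts tag_allcaps_v2("THis")             # prints: THis (mixed case is left alone)
puts tag_allcaps_v2("<hashtag> mytag")  # prints: <hashtag> mytag
puts tag_allcaps_v2("a  b")             # prints: a  b (whitespace run is not tagged)
```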