First, a question: I think this Ruby script may not actually have been used to prepare the embeddings on the Stanford webpage. Am I right?
If so, what WAS used? Was it prone to any of the problems below (e.g. "HOwdy" changed to "ho <allcaps> wdy")?
I'm interested in the emoticons in particular: some, like ':o)' and ':-\', are in the embedding files, while '(:' and '(-:' are not, yet the Ruby script doesn't transform them either.
A few remaining problems with the Ruby script (the one currently on the Stanford page):
- the allcaps expression actually catches any previous tags in the text, so for example "#mytag" gets changed to "<hashtag <ALLCAPS>> mytag"
- it detects any combination of two or more capitals as an allcaps word, so "THis" gets changed to "th <ALLCAPS>is"
- it concatenates the <allcaps> tag with the following word, so "THIS is" gets changed to "this <ALLCAPS>is"
- strings of 2 or more whitespace characters are marked as allcaps, so two consecutive spaces get an <ALLCAPS> tag appended
- the tags are uppercase, but the embedding files have lower-case tags
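These can be fixed by using lower-case tags in the replacement expressions and changing the allcaps regex to /\b([^a-z0-9()<>'`\-\s]){2,}\b/ (note the word boundaries \b and the \s added to the negated character class). Both changes are needed: the word boundaries alone would still match an upper-case tag name such as HASHTAG inside its angle brackets.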
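Here is a minimal sketch of that fix in isolation, in case it helps. The regex is the one proposed above and the tag is lower-case; the constant ALLCAPS_RE and the helper mark_allcaps are names I made up for illustration, and this is only the allcaps step, not the whole tokenizer.

```ruby
# Proposed allcaps regex: word boundaries on both sides, and whitespace (\s)
# added to the negated character class so runs of spaces never match.
# ALLCAPS_RE and mark_allcaps are hypothetical names, not from the original script.
ALLCAPS_RE = /\b([^a-z0-9()<>'`\-\s]){2,}\b/

# Lower-cases each all-caps word and appends a lower-case <allcaps> tag.
def mark_allcaps(text)
  text.gsub(ALLCAPS_RE) { |word| "#{word.downcase} <allcaps>" }
end

puts mark_allcaps("THIS is")          # this <allcaps> is
puts mark_allcaps("THis")             # THis (mixed case is left alone)
puts mark_allcaps("<hashtag> mytag")  # <hashtag> mytag (lower-case tags are not re-tagged)
```

Mixed-case words like "THis" are simply left untouched here; whether they should be lower-cased anyway is a separate decision.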
Cheers
Ian