Hi Owen. You are right of course, and +2 for those excellent tools, Voyant and the ANT collection. There are many tools better suited to text processing than OpenRefine, at least for a beginner. But, as you know far better than I do, OR offers a range of possibilities, whether with GREL or Jython (I do not know Clojure at all). It also has the advantage of being able to work with structured text in a spreadsheet, which Voyant cannot do. And finally, it allows very precise fine-tuning once you know a little about regular expressions (the basis of tokenization, after all).
In Jython, tokenizing the tweets might be as simple as this:
import re
tokens = re.findall(r"\w+|\$[\d\.]+|\S+", value)
return "||||".join(tokens)
By modifying the regular expression, you could handle the problem of "http", ":", and "//" being treated as three different tokens.
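For instance, putting a URL alternative at the front of the pattern keeps a web address together as a single token. This is just a minimal sketch: the simplified URL pattern and the sample tweet are my own assumptions, not something from OR itself.

```python
import re

# Match whole URLs first, then words, dollar amounts, and leftover symbols.
# The URL pattern here is deliberately simplified for illustration.
pattern = r"https?://\S+|\w+|\$[\d\.]+|\S+"

tweet = "Check http://example.com now! $3.50"
tokens = re.findall(pattern, tweet)
print(tokens)
# ['Check', 'http://example.com', 'now', '!', '$3.50']
```

Because alternatives in a regex are tried left to right, the URL branch wins before `\w+` gets a chance to split "http" off on its own.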
If punctuation is not important, value.fingerprint() is a straightforward method of normalization, and I suspect that Voyant uses something similar.
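For anyone curious what fingerprint() roughly does under the hood, here is a sketch in plain Python. It is an approximation of OpenRefine's fingerprint keying (lowercase, strip accents and punctuation, sort the unique tokens), not the exact implementation.

```python
import re
import unicodedata

def fingerprint(value):
    # Approximation of OR's fingerprint keying:
    # trim and lowercase, strip accents, drop punctuation,
    # then sort the unique tokens and rejoin them with spaces.
    value = value.strip().lower()
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s]", "", value)
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

print(fingerprint("Voyant, voyant Tools!"))
# tools voyant
```

Two strings that differ only in case, punctuation, or word order end up with the same key, which is exactly why it works so well for clustering near-duplicates.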
Last but not least, OR lets you work with Natural Language Processing APIs. One famous example is the "Named-Entity Recognition" extension.
In short, although I totally agree with you, I'd advise Simon to try to reach OR's limits before moving on to a more specific tool. :)