Hey,
I am trying to word-tokenize a text file in the most standard way possible, i.e., with a method that is simple to apply but still highly accurate.
The documentation for the NLTK tokenize package describes many methods for doing this. However, I'd still appreciate some outside discussion or advice so that I can choose among them with more confidence.
At one point the documentation says "tokenization is considered a solved problem because rule-based tokenizers are highly accurate". Does that mean I should use the Regexp tokenizer? If so, do these classes come with built-in regexes for word tokenization, or do I have to supply one myself? Where would I find a standard word tokenization regex?
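To make the question concrete, here is a minimal sketch of the kind of regex tokenizer I have in mind. It uses only Python's re module. The pattern is the one the NLTK documentation gives for WordPunctTokenizer; the function name tokenize is just my own placeholder, not an NLTK API, and I am not claiming this is the "standard" regex.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer: a token is either a run
# of word characters or a run of non-space, non-word characters (punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def tokenize(text):
    """Return all non-overlapping matches of the pattern, left to right."""
    return WORD_PUNCT.findall(text)


print(tokenize("Good muffins cost $3.88 in New York."))
# ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.']
```

Note that a pattern this simple splits "$3.88" into four tokens, which is part of why I am unsure whether it is good enough.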
I also noticed that there are other word tokenizers, such as REPP. Are there others I should consider? For example, is there a neural network or other machine learning based word tokenizer that offers any advantage over the rule-based ones?
Thanks very much,
Julius