Word tokenization

Julius Hamilton

Jul 26, 2021, 10:55:38 AM
I am trying to word-tokenize a text file in the most standard way possible, i.e., with a method that is not too complicated and, hopefully, highly accurate.

The documentation for the NLTK tokenize package provides a lot of interesting information and various methods for doing this. However, I'd still like some outside discussion or assistance so that I can choose an approach with more confidence.

At one point in the documentation it was interesting to read "tokenization is considered a solved problem because rule-based tokenizers are highly accurate". I assume then that I should probably use the Regexp tokenizer? However, do these classes come with built-in regexes for word tokenization, or do I have to supply one myself? Where would I find a standard word tokenization regex?

I also noticed there are other word tokenizers, such as REPP. Are there other word tokenizers I should consider using? For example, is there a neural-network or machine-learning-based word tokenizer that offers any advantage?

Jul 26, 2021, 4:57:48 PM
You are asking several questions at the same time. Let's go step by step.
What, precisely, is the first step you want to do?
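
As a starting point for the regex question: NLTK's RegexpTokenizer expects you to pass the pattern yourself, while WordPunctTokenizer (and the wordpunct_tokenize function) is a RegexpTokenizer with the pattern \w+|[^\w\s]+ already built in. The sketch below applies that same pattern with only the standard library, so you can see what it does without installing anything. The function name simple_word_tokenize and the sample sentence are made up for illustration; this is not NLTK's recommended tokenizer, just a minimal rule-based one.

```python
import re

# The pattern WordPunctTokenizer is built on: a run of word characters,
# or a run of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def simple_word_tokenize(text):
    """Split text into alphanumeric tokens and punctuation tokens.

    Hypothetical helper for illustration; not part of NLTK.
    """
    return WORD_PUNCT.findall(text)


sample = "Good muffins cost $3.88 in New York. Don't buy two!"
tokens = simple_word_tokenize(sample)
print(tokens)
```

Note that this pattern splits "$3.88" and "Don't" into several pieces. If that is too aggressive for your text, nltk.tokenize.word_tokenize is probably the more standard choice: it handles contractions and similar cases with Treebank-style rules, though it needs the Punkt sentence model to be downloaded first.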
