Word tokenization

Julius Hamilton

Jul 26, 2021, 10:55:38 AM
to nltk-...@googlegroups.com

I am trying to word-tokenize a text file in the most standard way possible: ideally something not too complicated and highly accurate.

The documentation for NLTK's tokenize package describes many methods for doing this. However, I'd still like some outside discussion or assistance so I can choose one with more confidence.

At one point the documentation says that "tokenization is considered a solved problem because rule-based tokenizers are highly accurate". I assume, then, that I should probably use the Regexp tokenizer? However, do these classes come with built-in regexes for word tokenization, or do I have to supply one myself? Where would I find a standard word tokenization regex?

I also noticed other word tokenizers, such as REPP. Are there others I should consider? For example, is there a neural network or machine learning based word tokenizer that offers any advantage?

Thanks very much,


Jul 26, 2021, 4:57:48 PM
to nltk-...@googlegroups.com
You are asking several questions at the same time. Let's go step by step.
What precisely is the first step you want to do?
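
If the first step is simply splitting text into words with a rule-based tokenizer, here is a minimal sketch using only Python's standard `re` module. The first pattern is the one shown in the example in NLTK's RegexpTokenizer documentation; the second is the pattern behind NLTK's WordPunctTokenizer. The names `regex_tokenize`, `DOC_EXAMPLE_PATTERN` and `WORD_PUNCT_PATTERN` are made up for this illustration and are not part of NLTK. Treat it as a starting point under those assumptions, not as the one standard regex.

```python
import re

# Pattern from the example in NLTK's RegexpTokenizer documentation:
# a run of word characters, a dollar amount, or any other non-space run.
DOC_EXAMPLE_PATTERN = re.compile(r"\w+|\$[\d\.]+|\S+")

# Pattern used by NLTK's WordPunctTokenizer: a run of word characters,
# or a run of punctuation characters.
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def regex_tokenize(text, pattern=DOC_EXAMPLE_PATTERN):
    """Return all non-overlapping matches of `pattern` in `text` as tokens."""
    return pattern.findall(text)


# Illustrative usage on two short sample strings.
print(regex_tokenize("Good muffins cost $3.88 in New York."))
print(regex_tokenize("Don't stop.", WORD_PUNCT_PATTERN))
```

With NLTK installed, `RegexpTokenizer(pattern).tokenize(text)` takes the same kind of pattern, so you do have to supply one for that class. `nltk.word_tokenize(text)` comes with built-in rules and needs no regex from you, though it requires the Punkt sentence tokenizer data to be downloaded first.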
Best regards
