Word tokenization


Julius Hamilton

Jul 26, 2021, 10:55:38 AM
to nltk-...@googlegroups.com
Hey,

I am trying to word-tokenize a text file in the most standard way possible, i.e., without too much complexity and, hopefully, with high accuracy.

The documentation for NLTK's tokenize package provides a lot of interesting information and various methods for doing this. However, I'd still like some outside discussion or help so that I can understand it with more confidence.

At one point in the documentation I read that "tokenization is considered a solved problem because rule-based tokenizers are highly accurate". I assume, then, that I should probably use the Regexp tokenizer? Do these classes come with built-in regexes for word tokenization, or do I have to supply one myself? If so, where would I find a standard word tokenization regex?
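
To make the question concrete, here is a rough sketch of what I imagine a regex-based word tokenizer looks like. It uses only Python's standard `re` module rather than NLTK itself. The pattern and the function name `simple_word_tokenize` are my own placeholders, not an official standard, and I am only assuming that a regexp tokenizer given a similar pattern would behave the same way.

```python
import re

# Placeholder pattern (an assumption, not an official standard):
# match either a run of word characters or a run of punctuation.
WORD_PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_word_tokenize(text):
    """Return all non-overlapping matches of WORD_PATTERN in text."""
    return WORD_PATTERN.findall(text)

print(simple_word_tokenize("Hello, world! It's 3.5 degrees."))
```

A pattern this simple splits "It's" into three tokens and "3.5" into three as well, which is part of why I am asking whether a better standard pattern exists.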

I also noticed there are other word tokenizers, such as REPP. Are there others I should consider using? For example, is there a neural network or machine learning based word tokenizer that offers any advantage?

Thanks very much,
Julius


brou.christ...@gmail.com

Jul 26, 2021, 4:57:48 PM
to nltk-...@googlegroups.com
Hey,
You are asking several questions at the same time. Let's go step by step.
What precisely is the first thing you want to do?
Best regards
Christ
