I have a question about Building Dictionary

Shantanu Nath

May 18, 2020, 10:48:15 AM
to Nematus Support
Dear Sir,
As I understand from your scripts, building a dictionary involves these steps:
  - first, tokenize the source and target data
  - learn BPE on the joint vocabulary
  - finally, build the dictionary

If I want to use SentencePiece instead, which steps should I follow?

Best Regards, Shantanu Nath

Rico Sennrich

May 19, 2020, 4:17:21 AM
to nematus...@googlegroups.com
Hello Shantanu,

sentencepiece applies its own simple tokenization (basically splitting on whitespace and whenever the Unicode script changes; Latin characters belong to one script, special characters to another). So if you use sentencepiece, you can skip the tokenization step. You'll still want to use nematus/data/build_dictionary.py to create a dictionary in the right format for Nematus.

best wishes,
Rico
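
For readers following along, the workflow in the reply above could be sketched roughly as below. This is only an illustration under stated assumptions, not the Nematus scripts themselves. The function names, file suffix, model prefix, and vocabulary size are hypothetical. train_and_segment() relies on the sentencepiece Python package and trains one joint model on the raw (untokenized) source and target files. build_vocab() only shows the general idea of a frequency-sorted piece-to-id mapping; the reserved symbols it uses are an assumption, so the real dictionary should still be produced with nematus/data/build_dictionary.py.

```python
from collections import Counter


def train_and_segment(src_path, tgt_path, model_prefix="joint_sp", vocab_size=32000):
    """Train one joint SentencePiece model on raw text, then segment both files.

    No separate tokenization step is run beforehand. All names and default
    values here are hypothetical.
    """
    import sentencepiece as spm  # lazy import; needs `pip install sentencepiece`

    spm.SentencePieceTrainer.train(
        input=f"{src_path},{tgt_path}",  # comma-separated list of training files
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
    )
    sp = spm.SentencePieceProcessor(model_file=f"{model_prefix}.model")
    outputs = []
    for path in (src_path, tgt_path):
        out_path = path + ".sp"  # hypothetical suffix for the segmented file
        with open(path, encoding="utf-8") as fin, \
                open(out_path, "w", encoding="utf-8") as fout:
            for line in fin:
                pieces = sp.encode(line.strip(), out_type=str)
                fout.write(" ".join(pieces) + "\n")
        outputs.append(out_path)
    return outputs


def build_vocab(lines, reserved=("<EOS>", "<GO>", "<UNK>")):
    """Map space-separated pieces to ids, most frequent first.

    Illustration only: the reserved symbols are an assumption, and the real
    Nematus dictionary should come from nematus/data/build_dictionary.py.
    """
    counts = Counter(piece for line in lines for piece in line.split())
    vocab = {symbol: i for i, symbol in enumerate(reserved)}
    for piece, _ in counts.most_common():
        vocab.setdefault(piece, len(vocab))
    return vocab
```

The segmented files written by train_and_segment() are what you would then pass to build_dictionary.py, in place of the tokenized and BPE-processed files from the original pipeline.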