Re: [kaldi-help] Google Ngram

Daniel Povey

Nov 19, 2022, 10:41:59 AM

to kaldi...@googlegroups.com

I'm afraid the approach of using FSTs that are explicitly expanded in memory, as Kaldi does, will make it hard to use extremely large LMs and vocabularies of more than a million or so words.

What you are talking about sounds a bit like class language models. As long as you can turn it into a G.fst, Kaldi could probably create a graph out of it. There may be things in SRILM and/or Thrax that would enable you to estimate a class-based language model and turn it into an FST.
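To make the class language model idea concrete, here is a minimal sketch of a class bigram model, where P(word | previous word) is approximated as P(class | previous class) * P(word | class). It is only an illustration: the function name, the toy sentences and the word-to-class map are made up, there is no smoothing, and it does not show how SRILM or Thrax would estimate or compile such a model.

```python
from collections import Counter

def train_class_bigram(sentences, word2class):
    """Maximum-likelihood class bigram model (toy version, no smoothing)."""
    class_bigrams = Counter()  # counts of (previous class, class)
    class_context = Counter()  # counts of a class used as history
    word_counts = Counter()    # counts of (class, word)
    class_totals = Counter()   # counts of each class over all words

    for sent in sentences:
        classes = ["<s>"] + [word2class[w] for w in sent] + ["</s>"]
        for prev, cur in zip(classes, classes[1:]):
            class_bigrams[(prev, cur)] += 1
            class_context[prev] += 1
        for w in sent:
            c = word2class[w]
            word_counts[(c, w)] += 1
            class_totals[c] += 1

    def prob(prev_word, word):
        # P(word | prev_word) ~= P(class | prev_class) * P(word | class)
        c_prev = "<s>" if prev_word is None else word2class[prev_word]
        c = word2class[word]
        p_class = class_bigrams[(c_prev, c)] / class_context[c_prev]
        p_word = word_counts[(c, word)] / class_totals[c]
        return p_class * p_word

    return prob

# Hypothetical toy data: part-of-speech tags play the role of classes.
word2class = {"dogs": "NOUN", "cats": "NOUN", "run": "VERB", "sleep": "VERB"}
sentences = [["dogs", "run"], ["cats", "run"], ["dogs", "sleep"]]
prob = train_class_bigram(sentences, word2class)

# "cats sleep" never occurs in the data, yet it gets a non-zero probability
# because the NOUN -> VERB class transition was observed.
print(prob("cats", "sleep"))
```

The model stores one distribution over class sequences plus one word distribution per class, which is why it can be much smaller than a word n-gram model over the same vocabulary.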

But these days your best bet would probably be to use some approach with BPE pieces as the vocabulary, if you want to handle a super large vocabulary.
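For anyone unfamiliar with BPE (byte-pair encoding), below is a rough, self-contained sketch of how the pieces are learned: start from characters and repeatedly merge the most frequent adjacent pair. The function names and the toy word frequencies are made up for illustration, and this is not the code of any particular toolkit; in practice you would use an existing subword tool rather than this.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged = apply_merge(symbols, best)
            new_vocab[merged] = new_vocab.get(merged, 0) + freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Split a (possibly unseen) word into BPE pieces by replaying the merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

# Hypothetical toy frequencies.
merges = learn_bpe({"run": 5, "runs": 3, "running": 2}, num_merges=3)
print(merges)
print(segment("rung", merges))  # an unseen word is still covered by known pieces
```

The point for large vocabularies is that the piece inventory has a fixed size chosen in advance (the number of merges plus the base characters), and any word, seen or unseen, can be written as a sequence of those pieces.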

On Sat, Nov 19, 2022 at 12:05 AM www.e...@gmail.com <www.e...@gmail.com> wrote:

Hello.
I'm trying to create a language model based on a data slice from Google Ngram Exports (https://storage.googleapis.com/books/ngrams/books/datasetsv3.html). However, I ran into the problem that language models in Kaldi require a lexicon.txt file containing the full list of available words. Google uses "tags" that indicate a word's part of speech and allow you to reduce the size of the final model. So my question is: is there any way to represent data in this way for training a language model in Kaldi?

It is also interesting how the search for word forms is implemented in Google Ngrams: for example, if you enter the query "run_INF", it will find the word forms "run", "ran", "running", "runs".
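As a rough illustration of what handling the tagged data could look like, the sketch below separates the part-of-speech suffix from each token and sums counts per (word, tag) pair, from which a plain word list for a lexicon can be collected. It rests on assumptions that should be checked against the dataset documentation: that each line looks like `ngram<TAB>year,match_count,volume_count<TAB>...`, and that the tag set is the one listed in `POS_TAGS`. The function names and sample lines are invented.

```python
from collections import Counter

# Assumed tag set; verify against the dataset documentation.
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "CONJ", "PRT"}

def split_tag(token):
    """Return (word, tag) for tokens like 'run_VERB', or (token, None) if untagged."""
    word, sep, tag = token.rpartition("_")
    if sep and word and tag in POS_TAGS:
        return word, tag
    return token, None

def unigram_counts(lines):
    """Sum match counts over all years for each (word, tag) unigram."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        ngram, year_fields = fields[0], fields[1:]
        tokens = ngram.split(" ")
        if len(tokens) != 1:
            continue  # this sketch only handles unigrams
        word, tag = split_tag(tokens[0])
        # Assumed per-year field layout: year,match_count,volume_count
        counts[(word, tag)] += sum(int(f.split(",")[1]) for f in year_fields)
    return counts

# Invented sample lines in the assumed format.
sample = [
    "run_VERB\t2000,10,5\t2001,20,8",
    "run\t2000,15,6",
    "run_NOUN\t2001,3,2",
]
counts = unigram_counts(sample)
word_list = sorted({word for word, _tag in counts})
print(counts)
print(word_list)
```

The untagged words collected this way are what a lexicon.txt would need pronunciations for; the tags themselves could serve as the classes of a class-based model.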

--
Go to http://kaldi-asr.org/forums.html to find out how to join the kaldi-help group
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/e45264ff-fc61-4685-8c12-f4fff3260db3n%40googlegroups.com.

Эрнест Касимов

Nov 22, 2022, 10:43:50 AM

to kaldi...@googlegroups.com

Thank you very much, but my message still does not appear in the list of questions on the forum.

