Persian N-gram Data?

Masoud Komeily

Sep 20, 2014, 12:12:49 PM

to common...@googlegroups.com

Hi,

I am building a specialized N-gram language model for Persian, a language with relatively free word order, and I need to filter out low-frequency sentence permutations. To do so, I need web-scale bigram/trigram data. Can anyone help me? Can Common Crawl provide such data? Is there a script available for producing it?

For example, among the permutations below, I want to see which ones a Persian speaker uses more or less frequently (all six permutations are grammatical in Persian):

من علی را دیدم >> I Ali saw
من دیدم علی را >> I saw Ali
علی را من دیدم >> Ali I saw
علی را دیدم من >> Ali saw I
دیدم من علی را >> Saw I Ali
دیدم علی را من >> Saw Ali I
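
To make the idea concrete, here is a minimal sketch of ranking the six permutations with bigram counts. It assumes plain whitespace tokenization and add-one smoothing, and the three-sentence corpus is only a placeholder I made up for illustration; real counts would have to come from web-scale data. The function names are hypothetical as well.

```python
from collections import Counter
from itertools import permutations
import math


def bigram_counts(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of one sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab_size = len(unigrams)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def rank_permutations(chunks, unigrams, bigrams):
    """Return all orderings of the chunks, most probable first."""
    candidates = [" ".join(p) for p in permutations(chunks)]
    return sorted(
        candidates,
        key=lambda s: log_prob(s, unigrams, bigrams),
        reverse=True,
    )


# Placeholder corpus (an assumption, not real frequency data).
toy_corpus = [
    "من علی را دیدم",
    "من علی را دیدم",
    "علی را من دیدم",
]

unigrams, bigrams = bigram_counts(toy_corpus)

# "علی را" is kept together as one chunk (object plus its marker).
ranked = rank_permutations(["من", "علی را", "دیدم"], unigrams, bigrams)

for sentence in ranked:
    print(round(log_prob(sentence, unigrams, bigrams), 3), sentence)
```

Permutations whose score falls below some threshold could then be dropped; where to put that threshold is a separate question that this sketch does not answer.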

Thanks in advance,

Masoud

Kenneth Heafield

Sep 20, 2014, 6:39:43 PM

to common...@googlegroups.com

You can find raw text by language at: http://statmt.org/ngrams/raw/

We don't have a good tokenizer for Persian, so it wouldn't make sense to post bad n-grams.
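
If you want to count n-grams from the raw text yourself, something along these lines might be a starting point. This is only a rough sketch: it assumes the input is plain UTF-8 text read line by line, and the "tokenizer" is just whitespace splitting after mapping Arabic yeh/kaf to their Persian forms and blanking out a small set of punctuation marks. That is not a proper Persian tokenizer, and the names below are made up for illustration.

```python
from collections import Counter

# Map Arabic yeh and kaf to the Persian code points commonly used instead.
NORMALIZE = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

# Replace a small, non-exhaustive set of punctuation marks with spaces.
PUNCT = str.maketrans({c: " " for c in ".,!?:;()\"«»،؛؟"})


def tokenize(line):
    """Naive tokenizer: normalize characters, drop punctuation, split on whitespace."""
    return line.translate(NORMALIZE).translate(PUNCT).split()


def count_ngrams(lines, n):
    """Count n-grams (as tuples of tokens) over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        # zip over n shifted views of the token list yields every n-gram;
        # lines shorter than n tokens contribute nothing.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Tiny in-memory example; for a real file, pass an open file object instead,
# e.g. count_ngrams(open(path, encoding="utf-8"), 3) with your own path.
sample_lines = ["من علی را دیدم.", "علی را من دیدم!"]
trigrams = count_ngrams(sample_lines, 3)

for ngram, freq in trigrams.most_common():
    print(freq, " ".join(ngram))
```

Because `count_ngrams` only iterates over lines, it can stream a large file without loading it into memory, though the counter itself will grow with the number of distinct n-grams.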

Masoud Komeily

Sep 21, 2014, 1:29:06 AM

to common...@googlegroups.com

Hi,

Thanks for your help.

Does 'fa' at the link you mentioned stand for Farsi (Persian)?
My training set for the LM is a 100-million-word corpus called 'Peykare', from which I will generate dependency-based permutations for every sentence. Can I use the corpus itself to filter out low-frequency permutations? Wouldn't that introduce bias?
Or do you think I really need web-scale 2-gram/3-gram data to select the high-frequency permutations?

Thanks,

Masoud Komeily

M.Sc, Computational Linguistics

Artificial Intelligence & Linguistics Program

University of Isfahan

Isfahan, Iran

Oscar Kjell

Nov 5, 2015, 2:28:12 PM

to Common Crawl

Hi Masoud,
I am a Swedish PhD student in Psychology, and I am interested in finding an n-gram corpus for Persian (Farsi). Have you been able to find one, or perhaps created one yourself?

Any help is much appreciated.
Kind Regards,
Oscar
