Persian N-gram Data?


Masoud Komeily

Sep 20, 2014, 12:12:49 PM
to common...@googlegroups.com
Hi,

For a specialized N-gram language model of Persian, a language with relatively free word order, I need to filter out low-frequency sentence permutations. To do so, I need web-scale bigram/trigram data. Can anyone help me? Can Common Crawl provide such data? Is there a script available for doing that?

For example, among the permutations below, I want to see which ones are used more or less frequently by Persian speakers (all six permutations are grammatical in Persian):
  • من علی را دیدم  >> I Ali saw
  • من دیدم علی را  >> I saw Ali
  • علی را من دیدم  >> Ali I saw
  • علی را دیدم من  >> Ali saw I
  • دیدم من علی را  >> Saw I Ali
  • دیدم علی را من  >> Saw Ali I
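
To make the example concrete, here is a minimal sketch of how the six orderings above could be enumerated. It assumes the sentence has already been split into three constituents and that the object with its marker ("علی را") is treated as a single unit; the variable and function names are hypothetical.

```python
from itertools import permutations

# Constituents of the example sentence. The object plus its marker
# ("علی را") is kept together as one unit (an assumption of this sketch).
constituents = ["من", "علی را", "دیدم"]

def sentence_permutations(units):
    """Return every ordering of the units as a space-joined string."""
    return [" ".join(order) for order in permutations(units)]

candidates = sentence_permutations(constituents)
for sentence in candidates:
    print(sentence)
```

With three constituents this yields 3! = 6 candidate sentences, which then need frequency information to be ranked.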

Thanks in advance,
Masoud

Kenneth Heafield

Sep 20, 2014, 6:39:43 PM
to common...@googlegroups.com
You can find raw text by language at: http://statmt.org/ngrams/raw/

We don't have a good tokenizer for Persian, so it wouldn't make sense to post bad n-grams. 
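
For anyone who wants to count n-grams from raw text themselves, the following is a rough sketch only. It assumes plain whitespace splitting, which is a crude stand-in and exactly the kind of tokenization the caveat above is about; the function name and the sample lines are made up for illustration.

```python
from collections import Counter

def ngram_counts(lines, n):
    """Count n-grams over an iterable of text lines.

    Tokenization is a plain whitespace split (an assumption of this
    sketch, not a proper Persian tokenizer).
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Two invented sample lines standing in for a raw-text file.
sample = ["من علی را دیدم", "علی را من دیدم"]
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
```

For a real file, `lines` could be an open file handle read line by line, so the whole text never has to fit in memory at once.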

Masoud Komeily

Sep 21, 2014, 1:29:06 AM
to common...@googlegroups.com
Hi,

Thanks for your help.

  • Does 'fa' in the mentioned link stand for Farsi (Persian)?
  • My training set for the LM is a 100-million-word corpus called 'Peykare', from which I will generate dependency-based permutations for every sentence. Can I use the corpus itself to filter out low-frequency permutations? Wouldn't that introduce bias?
  • Or do you think I really need web-scale 2-gram/3-gram data to select the high-frequency permutations?
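
Whichever reference data is used, the filtering step itself could look roughly like the sketch below. It is not a settled method: it assumes whitespace tokenization, add-one smoothing, and a simple "keep the best k" rule, and all names and the toy reference sentences are invented for illustration.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical sentence-boundary markers

def train_bigram_counts(sentences):
    """Collect unigram and bigram counts from whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = [BOS] + sentence.split() + [EOS]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_score(sentence, unigrams, bigrams):
    """Add-one-smoothed bigram log-probability of one candidate ordering."""
    vocab_size = len(unigrams)
    tokens = [BOS] + sentence.split() + [EOS]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )

def keep_frequent(candidates, unigrams, bigrams, keep=2):
    """Rank candidates by score (best first) and keep the `keep` best ones."""
    ranked = sorted(
        candidates,
        key=lambda s: log_score(s, unigrams, bigrams),
        reverse=True,
    )
    return ranked[:keep]

# Toy reference data, invented for illustration only.
reference = ["من علی را دیدم", "من علی را دیدم", "علی را من دیدم"]
unigrams, bigrams = train_bigram_counts(reference)

candidates = ["دیدم من علی را", "من علی را دیدم", "علی را من دیدم"]
kept = keep_frequent(candidates, unigrams, bigrams, keep=2)
```

The `reference` argument is where the two options differ: it could be the training corpus or counts from web-scale text. If it is the same corpus the permutations were generated from, the attested order of each sentence has already contributed to the counts, which is the source of the bias concern.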


Thanks,
Masoud Komeily

M.Sc, Computational Linguistics
Artificial Intelligence & Linguistics Program
University of Isfahan
Isfahan, Iran



Oscar Kjell

Nov 5, 2015, 2:28:12 PM
to Common Crawl
Hi Masoud,
I am a Swedish PhD student in Psychology, and I am interested in finding an n-gram corpus for Persian (Farsi). Have you been able to find one, or perhaps created one yourself?

Any help is much appreciated.
Kind Regards,
Oscar