Hi,
For a specialized N-gram language model of Persian, a language with relatively free word order, I need to filter out low-frequency sentence permutations. To do so, I need web-scale bigram/trigram data. Can anyone help me? Can Common Crawl provide such data? Is there an existing script for extracting it?
For example, among the permutations below, I want to see which ones are used more or less frequently by Persian speakers (all six permutations are indeed grammatical in Persian):
- من علی را دیدم >> I Ali saw
- من دیدم علی را >> I saw Ali
- علی را من دیدم >> Ali I saw
- علی را دیدم من >> Ali saw I
- دیدم من علی را >> Saw I Ali
- دیدم علی را من >> Saw Ali I
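
To make the goal concrete, here is a minimal sketch of the kind of ranking I have in mind, once bigram counts are available. Everything in it is an assumption for illustration: the three-sentence toy_corpus is a placeholder and not real frequency data, the function names are hypothetical, add-one smoothing is just the simplest choice, and the object marker "را" is kept attached to "علی" so that only the six permutations above are generated.

```python
import math
from collections import Counter
from itertools import permutations


def bigram_counts(sentences):
    """Count context unigrams and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of one token sequence."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )


def rank_permutations(chunks, unigrams, bigrams, vocab_size):
    """Score every ordering of the chunks, most probable first."""
    scored = []
    for perm in permutations(chunks):
        tokens = [tok for chunk in perm for tok in chunk]
        score = log_prob(tokens, unigrams, bigrams, vocab_size)
        scored.append((score, " ".join(tokens)))
    return sorted(scored, reverse=True)


# Placeholder corpus (NOT real data); web-scale counts would replace it.
toy_corpus = [
    ["من", "علی", "را", "دیدم"],
    ["من", "علی", "را", "دیدم"],
    ["علی", "را", "من", "دیدم"],
]

unigrams, bigrams = bigram_counts(toy_corpus)
# Predictable symbols: the corpus word types plus the end-of-sentence marker.
vocab_size = len({tok for sent in toy_corpus for tok in sent} | {"</s>"})

# Subject, object + marker, verb: three chunks give six permutations.
chunks = [("من",), ("علی", "را"), ("دیدم",)]
ranking = rank_permutations(chunks, unigrams, bigrams, vocab_size)

for score, sentence in ranking:
    print(f"{score:.3f}\t{sentence}")
```

With real web-scale counts in place of toy_corpus, permutations whose score falls below some threshold could then be filtered out; trigram counts would slot in the same way with a longer context.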
Thanks in advance,
Masoud