Accurate Root to Word mapping and Word frequency in Quran database

Omar Al Zabir

Sep 30, 2012, 12:35:23 PM
to united...@googlegroups.com
Both qurandev.github.com/corpus/ (QuranDev) and quran.azurewebsites.net (QRT) have an inaccurate Root database because both are based on corpus.quran.com data.

One example: yawm appears only 405 times on corpus.quran.com. They have not mapped forms other than Noun; for example, adverbial forms are missing.

corpus.quran.com/qurandictionary.jsp?q=ywm

Shows 405 times.

The accurate count is 475, as seen on Tanzil:

http://tanzil.net/#search/root/ يوم

We need to prepare an accurate Root database and root to word mapping so that we can show accurate Root information in qurandev and QRT.

Brother Ali has generously taken on this challenge to produce an accurate mapping of Roots and Root to word.

Once he produces that, we will add the grammar forms to that output using corpus.quran.com. This will mean many roots will not have a grammar mapping. We will try to identify those missing mappings and fix them ourselves.

And then we can offer that to corpus.quran.com team.


q d

Sep 30, 2012, 12:51:51 PM
to united...@googlegroups.com
wasalaam Omar,

Good catch. Do you have an exact ayah or word reference where this happens, e.g. an adverbial yawm that's not listed in Corpus, so I can take a deeper look at this?


salaam/QD

Omar Al Zabir

Sep 30, 2012, 12:57:43 PM
to united...@googlegroups.com
http://quran.azurewebsites.net/2/85

http://corpus.quran.com/qurandictionary.jsp?q=ywm

q d

Sep 30, 2012, 2:04:23 PM
to united...@googlegroups.com
wasalaam,

2:85 is there in Corpus if you scroll down. Let's see if we can find an example of something that's missing...

(2) Time adverb

(2:85:39) wayawma

and (on the) Day

وَيَوْمَ الْقِيَامَةِ يُرَدُّونَ إِلَىٰ أَشَدِّ الْعَذَابِ


q d

Sep 30, 2012, 4:02:33 PM
to united...@googlegroups.com
wasalaam again,

It gets trickier and trickier. I did some more investigating, and I don't think we should write off Corpus data just yet. We haven't found a discrepancy, at least related to this thread, just yet.

I compared Tanzil and Corpus for all the other roots under YA, and at least the counts match for all of them. So Corpus isn't just counting the Nominals, but ALL forms.


The question remains for ywm: why is the count off by a lot? Does it have something to do with the weak letter y? Perhaps for some words, Tanzil and Corpus have a different root? I have seen some instances earlier where a root would interchangeably have one of the weak letters waav/yaa/alif. Khair, I don't want to jump to conclusions.


Can you find a specific instance of any root which occurs in one of Corpus/Tanzil and not in the other?

salaam/QD

Omar Al Zabir

Sep 30, 2012, 5:46:39 PM
to united...@googlegroups.com
Attached are two files: one contains the Tanzil occurrences of the root and the other contains the Corpus occurrences of the root.

If you compare the two files, you will see the missing ones. Some examples:

3:167
4:42
6:16
7:8
8:16
corpus_yawm.txt
tanzil_yawm.txt
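
A comparison like the one described above could be sketched roughly as follows. The layout of the two attachments is not shown in the thread, so this assumes one chapter:verse reference per line; the function names and the tiny sample lists are illustrative only.

```python
from collections import Counter


def load_refs(lines):
    """Count chapter:verse references, assuming one reference per non-empty line."""
    return Counter(line.strip() for line in lines if line.strip())


def missing_refs(tanzil_lines, corpus_lines):
    """Return references that occur more often in the Tanzil list than in the Corpus list."""
    # Counter subtraction keeps only positive counts, so a verse that holds the
    # root twice in Tanzil but once in Corpus is still reported once.
    diff = load_refs(tanzil_lines) - load_refs(corpus_lines)
    return sorted(diff.elements(), key=lambda ref: tuple(int(p) for p in ref.split(":")))


# Illustrative sample only; the real input would be the lines of the two files.
tanzil_sample = ["2:85", "3:167", "4:42"]
corpus_sample = ["2:85"]
print(missing_refs(tanzil_sample, corpus_sample))
```

With the real files, the lists would come from something like `open("tanzil_yawm.txt", encoding="utf-8")`, provided the attachments really do hold one reference per line.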

q d

Sep 30, 2012, 7:41:08 PM
to Omar Al Zabir, united...@googlegroups.com
wasalaam

jazakallah khair. You are right. Root info is missing in Corpus for the ones you pointed out: http://corpus.quran.com/wordbyword.jsp?chapter=3&verse=167#(3:167:20)


In all these cases, it's in "T - Time Adverb" (orange color), and what's interesting is that there's Lemma info but no Root info. This form occurs exactly 70 times, which seems to add up to match Tanzil: 405 + 70 = 475.
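
One way to look for such entries in the text export is sketched below. It assumes each data line is tab-separated as location, form, tag, features, with the features joined by `|` and carrying `LEM:` and `ROOT:` keys; that layout and the two sample lines are assumptions for illustration, not copied from the actual export.

```python
def parse_features(feature_field):
    """Split a field like 'STEM|POS:N|LEM:x|ROOT:y' into a dict of key -> value."""
    features = {}
    for part in feature_field.split("|"):
        key, sep, value = part.partition(":")
        features[key] = value if sep else True  # bare flags such as 'STEM' map to True
    return features


def lemma_without_root(lines):
    """Return (location, lemma) pairs for segments that have a lemma but no root."""
    found = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments, headers, and anything not shaped like a data row.
        if len(fields) != 4 or not fields[0].startswith("("):
            continue
        location, _form, _tag, feature_field = fields
        features = parse_features(feature_field)
        if "LEM" in features and "ROOT" not in features:
            found.append((location, features["LEM"]))
    return found


# Made-up sample rows in the assumed layout (not real export data).
sample = [
    "(1:4:2:1)\tyawomi\tN\tSTEM|POS:N|LEM:yawom|ROOT:ywm|M|GEN",
    "(3:167:20:1)\tyawoma}i*K\tT\tSTEM|POS:T|LEM:yawoma}i*",
]
print(lemma_without_root(sample))
```

If the assumed layout holds, counting the returned pairs for the relevant lemma should reproduce the 70 that closes the gap between 405 and 475.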

Another note: I am still leaning towards Corpus being a good dataset, perhaps needing some spot corrections done carefully. For Mutaradifaat purposes, the Lemma form might be a better choice than Root anyway. (Root leads to diverging Lemma forms too different in meaning?)
Also, while checking the above, I stumbled on a bug: at http://quran.azurewebsites.net/3/167, click on the yawm word and it gives a 500 error. Earlier, at http://quran.azurewebsites.net/1/7, I noticed the root letter being wrong on the last word.
Do you want to discuss separately? Rather than screen scraping, did you have difficulty using their txt data export? That's what I am using, and I have not caught any errors yet against their site.


salaam


ps: Just wondering aloud.. always wanted to crosscheck corpus against the word-by-word translation of Mohar Ali (available as PDFs online). Had tried once.. maybe it might help catch more issues like this.




Omar AL Zabir

Oct 1, 2012, 2:47:56 PM
to q d, united...@googlegroups.com

On QRT, the words that have a black transliteration rather than a blue link are missing roots in the Corpus data. Corpus does not have a root->word mapping for those occurrences.

 

I did parse the HTML output of the Corpus dictionary site and built the database from it. But looking at the Corpus text export, similar problems still exist. I believe they are building the site from the same data, so the same problem exists on their website as well.

 

I needed to grab the full Arabic word and the transliteration, so I went for HTML scraping on the Corpus site. Their text data does not have the complete Arabic form; instead it has the word broken down in Buckwalter form, and no transliteration either.
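
For reference, turning Buckwalter text back into Arabic script is a per-character table lookup. The sketch below covers only the standard Buckwalter table; the Corpus export may use extra symbols for Quranic marks that are not handled here (they are passed through unchanged), and the function name is illustrative.

```python
# Standard Buckwalter transliteration table (ASCII character -> Arabic code point).
BUCKWALTER_TO_ARABIC = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624", "<": "\u0625",
    "}": "\u0626", "A": "\u0627", "b": "\u0628", "p": "\u0629", "t": "\u062a",
    "v": "\u062b", "j": "\u062c", "H": "\u062d", "x": "\u062e", "d": "\u062f",
    "*": "\u0630", "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638", "E": "\u0639",
    "g": "\u063a", "_": "\u0640", "f": "\u0641", "q": "\u0642", "k": "\u0643",
    "l": "\u0644", "m": "\u0645", "n": "\u0646", "h": "\u0647", "w": "\u0648",
    "Y": "\u0649", "y": "\u064a",
    # Diacritics: tanween, short vowels, shadda, sukun, dagger alif, alif wasla.
    "F": "\u064b", "N": "\u064c", "K": "\u064d", "a": "\u064e", "u": "\u064f",
    "i": "\u0650", "~": "\u0651", "o": "\u0652", "`": "\u0670", "{": "\u0671",
}


def buckwalter_to_arabic(text):
    """Map each Buckwalter character to Arabic; unknown characters pass through unchanged."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


print(buckwalter_to_arabic("ywm"))    # the root discussed in this thread
print(buckwalter_to_arabic("yawom"))  # the same letters with fatha and sukun
```

Rebuilding a full word would additionally require joining the converted segments of each word in order, which depends on how the export splits words into segments.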

 

I would like to start over importing the data directly from corpus text data. For that, I need two things as described on this post:

https://groups.google.com/forum/#!topic/united-quran/eOhPm8qrKBQ

 

I believe Br Ali is working on it. If you have already made progress on this, we would love to leverage your data.

Qd

Oct 1, 2012, 4:07:07 PM
to Omar AL Zabir, united...@googlegroups.com
Wasalaam all

I need to read this in depth later when free. One comment though: a realization that I came to myself recently.

A lot of scholarly work in printed books is available that's well accepted. However, there are hardly any electronic data exports for grammar. Corpus has its issues, but it's a great start: the FULL TIME effort of PhD students vs. part-time hobbyist coders like us. I would just advise treading carefully so we don't introduce other issues.

Salam/qd

Ali Adams

Oct 2, 2012, 6:53:12 PM
to Qd, united...@googlegroups.com, Omar AL Zabir
Here is, in sha Allah, a useful link to an Arabic search engine over the following digitized references:

لسان العرب (Lisan al-Arab)
مقاييس اللغة (Maqayis al-Lugha)
الصّحّاح في اللغة (Al-Sihah fi al-Lugha)
القاموس المحيط (Al-Qamus al-Muhit)
العباب الزاخر (Al-Ubab al-Zakhir)

Thanks to brother Ramadhaan on Zekr group

Salam

Ali


Ali Adams

Oct 2, 2012, 6:53:45 PM
to Qd, united...@googlegroups.com, Omar AL Zabir
Here is the link
http://www.baheth.info/