just how much can frequency ordering be improved on?

James Tauber

Mar 26, 2008, 10:13:06 AM

to graded...@googlegroups.com

Here's a quick demonstration. Recall that in my previous post, I
pointed out that learning the top 100 inflected forms gives you 0
(zero, nada) target verses in the GNT.

I showed that, for example target 130528 (1 Thessalonians 5.28) gets
excluded because of one form that is #235 while the other eight forms
appear in the top 66.
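
To make the exclusion effect concrete, here is a minimal Python sketch of the check being described: a target counts as readable only if every one of its forms is already known. The function names and the toy data (forms a, b, c, x, y, z and targets t1 to t3) are made up for illustration; they are not the GNT data and not the code from the previous post.

```python
from collections import Counter


def frequency_order(targets):
    """List every form, most frequent first (ties broken alphabetically)."""
    counts = Counter(form for forms in targets.values() for form in forms)
    return sorted(counts, key=lambda form: (-counts[form], form))


def readable_targets(targets, known):
    """Return ids of targets whose forms are all in `known`."""
    known = set(known)
    return sorted(tid for tid, forms in targets.items() if set(forms) <= known)


def missing_forms(targets, known):
    """For each target, list the forms that still block it."""
    known = set(known)
    return {tid: sorted(set(forms) - known) for tid, forms in targets.items()}


# Hypothetical toy corpus: each target shares common forms with the others
# but also needs one rare form of its own.
toy_targets = {
    "t1": ["a", "b", "x"],
    "t2": ["a", "c", "y"],
    "t3": ["b", "c", "z"],
}

top3 = frequency_order(toy_targets)[:3]
print(top3)                                  # ['a', 'b', 'c']
print(readable_targets(toy_targets, top3))   # []
print(missing_forms(toy_targets, top3))      # {'t1': ['x'], 't2': ['y'], 't3': ['z']}

# Learning three forms chosen with a target in mind does better.
print(readable_targets(toy_targets, ["a", "b", "x"]))  # ['t1']
```

In the toy corpus, the three most frequent forms unlock nothing because each target is blocked by a single rare form, which is the same situation as 130528 and its #235 form.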

Well, what if those 9 forms were learnt first? That is:

Χριστοῦ, κυρίου, Ἰησοῦ, ὑμῶν, μετά,
τοῦ, χάρις, ἡ, ἡμῶν

Not only could 130528 be read, but so could 071623.

Now if the reader learnt πάντων (just one more form), they could
read three more verses: 140318, 191325 and 272221.

Now introduce these six forms:

καί, ὑμῖν, ἀπό, εἰρήνη, πατρός, θεοῦ

and suddenly *seven* more verses are readable: 140102, 070103,
100102, 110102, 090103, 180103, 080102

This was just with one algorithm I'm experimenting with (which I'll
explain and provide code for soon) and there are likely others that do
better.

So instead of 100 forms giving 0 verses, we now have just 16 forms
giving us 12 entire verses from an actual corpus.
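
For readers who want to experiment, here is one possible target-driven ordering, sketched in Python. It is not necessarily the algorithm used for the numbers above (that one has not been described yet); it is a simple greedy heuristic assumed for illustration: repeatedly pick the unread target that needs the fewest new forms, learn those forms, and record which targets become readable. The name greedy_order and the toy data are hypothetical.

```python
def greedy_order(targets):
    """Greedy sketch: always learn the forms of the cheapest unread target.

    Returns (forms in learning order, list of (new_forms, unlocked_targets)).
    """
    known = []            # forms in the order they are learnt
    known_set = set()
    remaining = {tid: set(forms) for tid, forms in targets.items()}
    steps = []
    while remaining:
        # Cheapest target = fewest forms not yet known; ties broken by id.
        tid = min(remaining, key=lambda t: (len(remaining[t] - known_set), t))
        new_forms = sorted(remaining[tid] - known_set)
        known.extend(new_forms)
        known_set.update(new_forms)
        # Other targets may be unlocked as a side effect of the same forms.
        unlocked = sorted(t for t, forms in remaining.items() if forms <= known_set)
        for t in unlocked:
            del remaining[t]
        steps.append((new_forms, unlocked))
    return known, steps


# Hypothetical toy corpus, not GNT data.
toy_targets = {
    "t1": ["a", "b", "x"],
    "t2": ["a", "c", "y"],
    "t3": ["b", "c", "z"],
}

order, steps = greedy_order(toy_targets)
print(order)  # ['a', 'b', 'x', 'c', 'y', 'z']
for new_forms, unlocked in steps:
    print(new_forms, "->", unlocked)
# ['a', 'b', 'x'] -> ['t1']
# ['c', 'y'] -> ['t2']
# ['z'] -> ['t3']
```

The heuristic is short-sighted: it looks only at the next target, so an ordering that plans several targets ahead could well unlock more with the same number of forms.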

The usual caveats apply: items are considered independent and equally
easy to learn, there's no consideration of morphology, syntax or idiom,
and this is using verses as targets. We'll fix all that over time.