if only they knew that one rare word...

James Tauber

Mar 26, 2008, 9:53:41 AM
to graded...@googlegroups.com
I'm going to talk in more detail about alternatives to frequency order
in a different thread but I wanted to share the results of a quite
striking little test I did.

In my last post, I showed the vocab/coverage table applied to fully
inflected forms in the Greek NT rather than lexemes. You may have
noticed that the 100% coverage column and even the 95% coverage column
said 0.0% of verses for the 100 most frequent forms.

If you did, you might then have wondered: is this just a rounding
error? The answer is no. Even if you knew the 100 most frequent
inflected forms in the GNT, there is not a single verse in which you
would know every form (assuming, of course, that you couldn't guess).

I wanted to test whether this was because of just one outlier in each
verse. So I modified the code that produced the table (adding 4 extra
lines) to instead output a list of the top ten targets (i.e. verses)
whose *second least* frequent item (i.e. form) is most frequent overall.

Here are the results:

032030 2 [1, 2, 1077]
030146 35 [1, 35, 524]
041135 46 [2, 46, 14597]
130528 66 [5, 19, 38, 45, 49, 59, 65, 66, 235]
071623 66 [5, 19, 38, 45, 59, 66, 235]
070323 68 [3, 3, 29, 65, 68, 131]
020940 72 [8, 18, 22, 22, 44, 49, 49, 72, 102]
012425 78 [36, 78, 2846]
060211 96 [8, 14, 18, 22, 79, 96, 4276]
130519 98 [7, 17, 98, 14731]

What this listing is showing is that, for example, target 032030 (Luke
20.30) consists of the 1st, 2nd and 1077th most frequent forms; target
030146 (Luke 1.46) consists of the 1st, 35th and 524th most frequent
forms. So if the rarest word weren't needed, these two verses would
drop from needing the top 1077 forms to just the top 2, and from
needing the top 524 forms to just the top 35, respectively.
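
For anyone who wants to try the same test on their own data, the
selection described above can be sketched roughly as follows. This is
an illustrative reconstruction, not the code that produced the
listings: the function names (`rank_items`, `if_only`), the toy
`targets` dictionary, and the handling of ties and repeated items are
all assumptions made for the example.

```python
from collections import Counter


def rank_items(targets):
    """Map each item to its 1-based frequency rank (1 = most frequent).

    Assumption: items with equal counts get distinct consecutive ranks,
    in whatever order Counter.most_common() returns them.
    """
    counts = Counter(item for items in targets.values() for item in items)
    return {
        item: rank
        for rank, (item, _count) in enumerate(counts.most_common(), start=1)
    }


def if_only(targets, top=10):
    """Return the `top` targets whose second-rarest item is most frequent.

    Each row is (target_id, rank_of_second_rarest_item, sorted_ranks).
    Assumption: repeated items stay in the rank list, so "second rarest"
    simply means the second-to-last entry of the sorted list.
    """
    ranks = rank_items(targets)
    rows = []
    for target_id, items in targets.items():
        item_ranks = sorted(ranks[item] for item in items)
        if len(item_ranks) < 2:
            continue  # a one-item target has no second-rarest item
        rows.append((target_id, item_ranks[-2], item_ranks))
    rows.sort(key=lambda row: row[1])  # lowest rank number first
    return rows[:top]


# Hypothetical toy data: target id -> list of items (forms or lexemes).
targets = {
    "v1": ["a", "b", "rare1"],
    "v2": ["a", "c", "d", "b"],
    "v3": ["a", "a", "b", "rare2"],
    "v4": ["a", "b", "c"],
}

for target_id, second_rarest, item_ranks in if_only(targets):
    print(target_id, second_rarest, item_ranks)
```

Each output line has the same three fields as the listings: the
target, the rank of its second-rarest item, and the sorted ranks of
all its items. Passing lexemes instead of inflected forms as the items
gives the lexeme-based variant.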

Now you may argue that many of these are bad examples because the
verse doesn't make sense in isolation (a good reason to be more
careful about what to use as targets) or that the one rare word is
actually the one carrying most of the semantic weight.

But this little test demonstrates that sometimes a single rare item
can massively delay reading an otherwise quite readable target unit.

By the way, here's the same listing based on *lexemes* rather than
fully inflected forms:

032030 2 [1, 2, 346]
030146 9 [2, 9, 509]
011615 9 [3, 4, 5, 7, 8, 9, 9, 33]
032448 13 [4, 13, 415]
090124 14 [1, 2, 6, 7, 14, 267]
021337 16 [4, 5, 9, 9, 12, 16, 588]
040620 17 [1, 3, 5, 7, 8, 9, 17, 180]
041135 19 [1, 19, 4752]
040426 19 [1, 1, 3, 4, 7, 8, 9, 19, 56]
031934 24 [1, 1, 3, 5, 9, 15, 23, 24, 311]


I'll check in the code that produces this shortly.


James

James Tauber

Mar 29, 2008, 3:52:36 PM
to graded...@googlegroups.com

On Mar 26, 2008, at 9:53 AM, James Tauber wrote:
> I'll check in the code that produces this shortly.

It's now available at

http://code.google.com/p/graded-reader/source/browse/trunk/code/if-only.py

James
