GNT verse coverage with frequency ordering

James Tauber

Mar 25, 2008, 4:13:16 AM
[if you'll indulge me, I'm trying to get all my thoughts and previous
writing on these topics in one place and this list is a good place to
do it]

[this is based on a post to b-greek[1] and my blog[2]. I hope the
table comes out! ]

It is fairly common, in the context of learning vocabulary for a
particular corpus like the Greek New Testament, to talk about what
proportion of the text one could read if one learnt the top N words.
I even produced such a table for the GNT back in 1996—see New
Testament Vocabulary Count Statistics[3].

But these sorts of numbers are highly misleading because they don't
tell you what proportion of sentences (or, as a rough proxy in the GNT
case, verses) you could read, only what proportion of words.

Reading theorists have suggested that you need to know 95% of the
vocabulary of a sentence to comprehend it. So a more interesting set
of statistics would be the number of verses in which one knows 95% of
the vocab, given knowledge of a certain number of words. Of course, there's a
lot more to reading comprehension than knowing the vocab. But it was
enough for me to decide to write some code yesterday afternoon to run
against my MorphGNT database.

To give you a flavour with a specific example before moving to
the final numbers, consider John 3.16, which is, from a vocabulary
point of view, a very easy verse to read.

To be able to read 50% of it, you only need to know the top 28 lexemes
in the GNT. To read 75% you only need the top 85 (up to κόσμος).
With the top 204 lexemes, you can read 90% of the verse, and only a few
more, up to 236 (αἰώνιος), gives you 95%. The only word you
would not have come across learning the top 236 words would be
μονογενής, but even that is in the top 1,200.
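
The John 3.16 walk-through can be expressed as a small calculation. What follows is only a minimal sketch, not the code behind the numbers above. It assumes coverage is counted over word tokens (the post does not say whether repeated words count once or every time) and that `rank` maps each lexeme to its frequency rank. The function name and the toy ranks are invented for illustration and are not MorphGNT figures.

```python
def items_needed(target_items, rank, percent):
    """Smallest N such that knowing the top N items covers at least
    `percent` percent of the tokens in `target_items`."""
    if not target_items:
        raise ValueError("target has no items")
    # Frequency rank of every token in the target, most common first.
    ranks = sorted(rank[item] for item in target_items)
    # Number of tokens that must be known (ceiling, in integer arithmetic).
    needed = -(-(percent * len(ranks)) // 100)
    if needed == 0:
        return 0
    # The least common token among those `needed` sets the vocabulary size.
    return ranks[needed - 1]


# Toy data: hypothetical ranks, not real GNT counts.
toy_rank = {"a": 1, "b": 2, "c": 10, "d": 500}
toy_verse = ["a", "b", "a", "c", "d"]

for pct in (50, 75, 100):
    print(pct, items_needed(toy_verse, toy_rank, pct))
```

With the toy verse, one rare word ("d") dominates the 100% figure in the same way μονογενής does for John 3.16.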

This example does highlight some of the shortcomings of this sort of
analysis. There's no consideration of necessary knowledge of
morphology, syntax, idioms, etc., nor of the fact that the meaning of
something like μονογενής is fairly easy to guess from
knowledge of more common words. But I still think it's much more
useful than the pure word coverage statistics I linked to above.

So let's actually run the numbers on the complete GNT. If you know the
top N words, how many verses could you understand 50% of, 75%, 90% or
95% of...

vocab / coverage       any     50%     75%     90%     95%    100%

100                  99.9%   91.3%   24.4%    2.1%    0.6%    0.4%
200                  99.9%   96.9%   51.8%    9.8%    3.4%    2.5%
500                  99.9%   99.1%   82.3%   36.5%   18.0%   13.9%
1,000               100.0%   99.7%   93.6%   62.3%   37.3%   30.1%
1,500               100.0%   99.8%   97.2%   76.3%   53.5%   44.8%
2,000               100.0%   99.9%   98.4%   85.1%   65.5%   56.5%
3,000               100.0%  100.0%   99.4%   93.6%   81.0%   74.1%
4,000               100.0%  100.0%   99.7%   97.4%   90.0%   85.5%
5,000               100.0%  100.0%  100.0%   99.4%   96.5%   94.5%
all                 100.0%  100.0%  100.0%  100.0%  100.0%  100.0%

What this means is that, purely from a vocabulary point of view, if you
knew the top 1,000 lexemes, then 37.3% of verses in the GNT would be at
least 95% familiar to you.
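
For anyone wanting to reproduce a single cell of such a table, here is a rough sketch of the calculation as I read it from the description. It again assumes token-based coverage. `targets` maps each verse reference to its list of lexemes and `rank` maps each lexeme to its frequency rank. All names and the toy data are hypothetical.

```python
def share_of_targets(targets, rank, top_n, percent):
    """Fraction of targets in which at least `percent` percent of the
    tokens are among the `top_n` most frequent items."""
    reached = 0
    for items in targets.values():
        known = sum(1 for item in items if rank[item] <= top_n)
        # Integer comparison avoids floating-point trouble at the boundary.
        if known * 100 >= percent * len(items):
            reached += 1
    return reached / len(targets)


# Toy data: three made-up "verses" over five made-up items.
toy_targets = {
    "v1": ["a", "b", "a", "c"],
    "v2": ["a", "d"],
    "v3": ["c", "d", "e", "e"],
}
toy_rank = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}

for top_n in (2, 5):
    row = [share_of_targets(toy_targets, toy_rank, top_n, p) for p in (50, 75, 100)]
    print(top_n, ["{:.1%}".format(x) for x in row])
```

The "any" column would need a slightly different test (at least one known token) rather than a percentage threshold.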

Note that this uses:
1) verses as the reading target
2) lexemes as the individual items to be learnt
3) frequency of lexemes as the ordering
It is possible to alter any of these variables and in subsequent posts
I will do this.

[3] (via Internet Archive's Wayback Machine)

Karyn Traphagen

Mar 25, 2008, 7:21:00 AM
What database are you using for searching? What is the search string or method you used to generate the table? I want to do that for Biblical Hebrew and the Hebrew text.

Your point about needing to know 95% of vocab for reading comprehension is something we have referenced in our own efforts at better ideas for vocabulary acquisition. The data set you provide is extremely helpful.

James Tauber

Mar 25, 2008, 8:06:30 AM
The database is MorphGNT. The table is generated from code I wrote
myself. I've just spent the last few hours making the code more
generic and I will be checking it in shortly.

All it requires as input is a file with a line of the form
<target> <item> for every word in the text. <target> in my case is the
book/chapter/verse reference and <item> is the lemma of the word.
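
To make the input format concrete, here is a minimal sketch of reading such lines and deriving a frequency ordering. It is not the checked-in code. It assumes the item is the last whitespace-separated field on each line, and it breaks frequency ties alphabetically, which is an arbitrary choice of mine. The function names and the sample lines are invented for illustration.

```python
from collections import Counter, defaultdict


def read_pairs(lines):
    """Yield (target, item) pairs from lines of the form '<target> <item>'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: the item is the last whitespace-separated field.
        target, item = line.rsplit(None, 1)
        yield target, item


def group_and_rank(pairs):
    """Return (targets, rank): the items of each target, in order, and
    each item's frequency rank (1 = most frequent)."""
    targets = defaultdict(list)
    counts = Counter()
    for target, item in pairs:
        targets[target].append(item)
        counts[item] += 1
    # Most frequent first; ties broken alphabetically (an arbitrary choice).
    ordered = sorted(counts, key=lambda item: (-counts[item], item))
    rank = {item: position for position, item in enumerate(ordered, start=1)}
    return dict(targets), rank


# Made-up sample lines, not taken from MorphGNT.
sample = """\
v1 ὁ
v1 θεός
v1 ὁ
v2 θεός
v2 ὁ
v2 κόσμος
"""
targets, rank = group_and_rank(read_pairs(sample.splitlines()))
print(targets)
print(rank)
```

Swapping the lemma for the surface form in the second field would give a forms-based ranking without any change to the code.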


James Tauber

Mar 25, 2008, 8:26:20 AM
I've checked in my Python code as:

If you're not comfortable running it yourself, I can run it on any
data you provide.

(if you send data, I suggest you do it off-list and be careful because
a "reply" will go to the entire mailing list)

Remember that, as I said in my post, there's no consideration of
necessary knowledge of morphology, syntax, idioms, etc. Over time, we
can incorporate that, but for now the results are limited to the
somewhat naïve assumptions that:

(a) comprehension is only at the level of the target (the verse in my
example data)
(b) learning the items (lexemes in the example table I gave) is all
that matters to comprehending the target
(c) all items are equally easy to learn
(d) there is no dependency between items

and, of course, the table assumes a frequency ordering of items. Soon
I'll be starting a separate thread on alternative orderings.

But all that said, the numbers produced are far more useful than
misleading notions like "the top 10 words account for 37% of the text".

Incidentally, here is the table when applied to *forms* in the Greek
NT rather than lexemes:

forms / coverage        0%     50%     75%     90%     95%    100%

100                  99.8%   57.7%    1.1%    0.0%    0.0%    0.0%
200                  99.8%   79.2%    6.4%    0.3%    0.0%    0.0%
500                  99.9%   93.0%   27.0%    2.2%    0.5%    0.4%
1,000                99.9%   96.9%   51.4%    7.9%    2.3%    1.7%
2,000                99.9%   98.7%   72.5%   21.8%    7.9%    5.7%
5,000                99.9%   99.7%   91.0%   52.3%   28.6%   21.5%
8,000                99.9%   99.9%   96.7%   71.6%   47.6%   37.8%
12,000              100.0%   99.9%   99.2%   86.3%   64.8%   54.2%
16,000              100.0%  100.0%   99.9%   97.9%   88.9%   82.4%
20,000              100.0%  100.0%  100.0%  100.0%  100.0%  100.0%

The fact that it takes 1,000 forms just to get 2.3% of verses at 95%
coverage suggests that frequency alone is not the way
to go. Soon, I'll also produce similar tables using clauses (in the sense), rather than verses, as the target.

