On Wednesday, 1 July 2015 21:00:48 UTC+2, Tom Morris wrote:
When I look at the word frequencies in eng.cube.word-freq, they look more like what I would expect from a web corpus than from a corpus of printed materials (of any era).
The list starts off okay:
#1 the 13675
#2 of 15222
#3 and 15473
#4 to 15694
#5 a 17149
but then we have:
#29 Links 24448
#34 Search 25779
#37 Home 25853
which seem suspiciously like high frequency terms from web boilerplate.
Yes, this seems biased. I got similar effects with Wikipedia dumps, where wiki-syntax keywords were inflated in the same way.
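If you want to check your own word-freq file for this kind of contamination, here is a minimal Python sketch. It assumes the file is plain "word count" pairs, one per line (which may not match the actual cube format), and the boilerplate list is just an illustrative guess:

# Sketch: flag likely web-boilerplate entries in a word-freq file.
# The one-"word count"-pair-per-line format is an assumption; check your file.
BOILERPLATE = {"links", "search", "home", "edit", "login",
               "contact", "privacy", "copyright", "sitemap"}

with open("eng.cube.word-freq", encoding="utf-8") as f:
    for rank, line in enumerate(f, start=1):
        parts = line.split()
        if len(parts) < 2:
            continue
        word, count = parts[0], parts[1]
        if word.lower() in BOILERPLATE:
            print(f"#{rank}\t{word}\t{count}  <- likely web boilerplate")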
If you look at the Google N-grams data, you can see that the frequency of "Links" is orders of magnitude lower.
How much of an impact does the word-freq list have on the OCR? Would I get better results on printed documents if I tuned the word frequencies to match their contents?
It has some impact, but you can only find out by running the OCR with two different corpora and comparing the results.
Whether a dedicated frequency list provides better results depends on the vocabulary of your documents.
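A minimal way to build such a dedicated frequency list from your own plain-text documents, for a side-by-side comparison (the corpus path and the "word count" output format are assumptions; adapt them to what your engine actually expects):

# Sketch: count word frequencies over a folder of plain-text files and
# write them out as "word count" lines, most frequent first.
import collections
import glob
import re

counts = collections.Counter()
for path in glob.glob("corpus/*.txt"):  # hypothetical corpus location
    with open(path, encoding="utf-8") as f:
        counts.update(re.findall(r"[A-Za-z]+", f.read().lower()))

with open("my.word-freq", "w", encoding="utf-8") as out:
    for word, count in counts.most_common():
        out.write(f"{word} {count}\n")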
Badly recognized words are mostly outside the main language or not in the basic vocabulary. For example, in my work on old scientific nature books I have to deal with different languages (eng, deu, lat, greek), different orthographies, and different fonts (Fraktur, Antiqua, sans-serif, italic, Greek), so I do most of the spelling correction after OCR, with dozens of lexicons of millions of words each.
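As a rough illustration of that lexicon-based correction step (the similarity cutoff and the tiny lexicon are made up; with lexicons of millions of words you would replace the linear scan with an indexed structure such as a BK-tree):

import difflib

def correct(token, lexicon, cutoff=0.8):
    # Keep tokens that are already in the lexicon; otherwise take the
    # closest entry above the similarity cutoff, or leave the token alone.
    if token in lexicon:
        return token
    match = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return match[0] if match else token

# Tiny in-memory lexicon for illustration; real ones are loaded from files.
lexicon = ["Naturgeschichte", "Beschreibung", "Pflanzen"]
print(correct("Naturgeschichtc", lexicon))  # -> "Naturgeschichte"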