On Wednesday, 1 July 2015 21:00:48 UTC+2, Tom Morris wrote:
When I look at the word frequencies in eng.cube.word-freq, they look more like what I would expect from a web corpus than from a corpus of printed materials (of any era).
The list starts off okay:
#1 the 13675
#2 of 15222
#3 and 15473
#4 to 15694
#5 a 17149
but then we have:
#29 Links 24448
#34 Search 25779
#37 Home 25853
which seem suspiciously like high frequency terms from web boilerplate.
Yes, this seems biased. I got similar effects with Wikipedia dumps, where wiki-syntax keywords were inflated in the same way.
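If you want to check your own word-freq file for this kind of contamination, here is a minimal Python sketch. It assumes the file is plain "word count" pairs, one per line (which may not match the actual cube format), and the boilerplate list is just an illustrative guess:

# Sketch: flag likely web-boilerplate entries in a word-freq file.
# The one-"word count"-pair-per-line format is an assumption; check your file.
BOILERPLATE = {"links", "search", "home", "edit", "login",
               "contact", "privacy", "copyright", "sitemap"}

with open("eng.cube.word-freq", encoding="utf-8") as f:
    for rank, line in enumerate(f, start=1):
        parts = line.split()
        if len(parts) < 2:
            continue
        word, count = parts[0], parts[1]
        if word.lower() in BOILERPLATE:
            print(f"#{rank}\t{word}\t{count}  <- likely web boilerplate")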
If you look at the Google N-grams data, you can see that the frequency of "Links" is orders of magnitude lower.
How much of an impact does the word-freq list have on the OCR? Would I get better results on printed documents if I tuned the word frequencies to match their contents?
It has some impact, but you can only find out by running the OCR with two different corpora and comparing the results.
Whether a dedicated frequency list provides better results depends on the vocabulary of your documents.
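A minimal way to build such a dedicated frequency list from your own plain-text documents, for a side-by-side comparison (the corpus path and the "word count" output format are assumptions; adapt them to what your engine actually expects):

# Sketch: count word frequencies over a folder of plain-text files and
# write them out as "word count" lines, most frequent first.
import collections
import glob
import re

counts = collections.Counter()
for path in glob.glob("corpus/*.txt"):  # hypothetical corpus location
    with open(path, encoding="utf-8") as f:
        counts.update(re.findall(r"[A-Za-z]+", f.read().lower()))

with open("my.word-freq", "w", encoding="utf-8") as out:
    for word, count in counts.most_common():
        out.write(f"{word} {count}\n")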
Badly recognized words are mostly outside the main language or not in the basic vocabulary. For example, in my work on old scientific nature books I have to deal with different languages (eng, deu, lat, greek), different orthographies, and different fonts (Fraktur, Antiqua, sans-serif, italic, Greek), so I do most of the spelling correction after OCR, with dozens of lexicons of millions of words each.
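As a rough illustration of that lexicon-based correction step (the similarity cutoff and the tiny lexicon are made up; with lexicons of millions of words you would replace the linear scan with an indexed structure such as a BK-tree):

import difflib

def correct(token, lexicon, cutoff=0.8):
    # Keep tokens that are already in the lexicon; otherwise take the
    # closest entry above the similarity cutoff, or leave the token alone.
    if token in lexicon:
        return token
    match = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return match[0] if match else token

# Tiny in-memory lexicon for illustration; real ones are loaded from files.
lexicon = ["Naturgeschichte", "Beschreibung", "Pflanzen"]
print(correct("Naturgeschichtc", lexicon))  # -> "Naturgeschichte"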