Hi Ben,
On 10/05/2021 15:09, Ben Crowell wrote:
> Hi Merlijn,
>
> Thanks very much for your reply. It's encouraging that you were able to get
> somewhat better results. However, I'm not able to reproduce them. When I
> use -l eng+ell, the results are still very poor:
>
> 1. Evverre declare wot to me, Movca Muse,
> avopa the man voAvtpotrov of many fortunes,
> ὁς Νο πλαγχθη παπἀρτεάἁ µαλα πολλα very
> much, eves when ewepoev he had destroyed
> i d city T { Troy:
> lepov troAscOpor the sacred city Tons of Troy :
> we Se and saw aorea towns «at and eyvo
> learnt vooy the mood πολλων ανθρωπων οἳ
>
> The text uses ancient Greek vocabulary and accentuation, so it actually
> makes sense to use grc, not ell.
Ah, my bad.
>
> I didn't understand what you meant by "using the Archive.org Tesseract
> stack," but a web search on your name led me to archive-pdf-tools, which
> you're the author of. It's great to have help from someone who's clearly
> very expert. I just don't know how to diagnose what is different between
> your setup and mine. It looks like you did the whole first page rather than
> the piece I posted, so there may be a difference in how the image was
> prepared. I just zoomed in on the archive.org page, took a screenshot,
> cropped it, and changed it to grayscale. I'm running tesseract 4.1.1, which
> seems to be the latest official release. Are you running a version compiled
> from the latest source or something? My
> file /usr/share/tesseract-ocr/4.00/tessdata/grc.traineddata , which came
> from installing the debian package tesseract-ocr-grc, is dated 2017, which
> seems old, and is 2.2 Mb. The version at
> https://github.com/tesseract-ocr/tessdata is 7 Mb and looks like it was
> changed around 2018. I could try just replacing the file with the newer
> version, but I have no idea whether that's a reasonable thing to do, since
> I don't know anything about how the software is designed.
"using the Archive.org Tesseract stack" means that
archive.org automatically runs Tesseract OCR on uploaded content and makes
the results available, so you can compare them with your local results.
Because this book predates the Tesseract integration, I submitted it for
re-OCRing with Tesseract in an attempt to reproduce your results.
I'm rerunning the item now with Ancient Greek "grc" as opposed to Greek
"ell".
The version in use there is Tesseract "5.0.0-alpha-20201231" [1], and I
believe the language packs are the latest ones from Git. It may be worth
trying the latest version to see whether it yields better results. There is
an Ubuntu PPA [2] with development snapshots. If the latest version still
gives unsatisfying results, it would be worth investigating why.
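In case it helps with comparing setups, here is a minimal Python sketch that
reports the installed Tesseract version and builds the command lines for
"eng+ell" and "eng+grc". It is only an illustration, not what archive.org
runs: "page.png", the output names, and the two helper functions are
placeholders I made up. The optional --tessdata-dir argument lets you point
Tesseract at a separately downloaded set of traineddata files without
touching the packaged ones.

```python
import shutil
import subprocess


def build_tesseract_cmd(image, out_base, langs, tessdata_dir=None):
    """Return the argument list for one Tesseract run (hypothetical helper)."""
    # Languages are combined with "+", e.g. "eng+grc".
    cmd = ["tesseract", image, out_base, "-l", "+".join(langs)]
    if tessdata_dir is not None:
        # Use traineddata files from this directory instead of the default one.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def installed_version():
    """Return the first line of `tesseract --version`, or None if not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    proc = subprocess.run(["tesseract", "--version"],
                          capture_output=True, text=True)
    # Depending on the release, the version may go to stdout or stderr.
    lines = (proc.stdout + proc.stderr).splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print("installed:", installed_version())
    for langs in (["eng", "ell"], ["eng", "grc"]):
        # "page.png" is a placeholder for your cropped screenshot.
        cmd = build_tesseract_cmd("page.png", "out_" + "_".join(langs), langs)
        print(" ".join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment once page.png exists
```

Running both language combinations on the same image, with the same
Tesseract version, should make it easier to tell whether the difference
comes from the language data, the version, or the image preparation.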
Hope this helps,
Cheers,
Merlijn
[1] https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0-alpha-20201231
[2] http://ppa.launchpad.net/alex-p/tesseract-ocr-devel