
Using tessdata_best (or other models?) for 18th-century English printed text


Massimiliano Carloni

Apr 21, 2025, 12:03:33 PM
to tesseract-ocr
Hello everyone,

A quick question regarding the use of the tessdata_best models. I have simply copied the eng.traineddata file into the local directory Tesseract loads models from (the one shown when running --list-langs: in my case, /opt/homebrew/share/tessdata/), replacing the standard model that comes with the tesseract Homebrew package. Should I adjust some other configuration in order to get better results (apart from --oem 1)?
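For reference, here is the check-and-run sequence I'm using - a minimal sketch, where the path comes from my Homebrew setup and page.png / out are placeholder names:

```shell
# Confirm which model file Tesseract will load, then run with the LSTM engine.
tessdata_dir="/opt/homebrew/share/tessdata"
model="$tessdata_dir/eng.traineddata"
if [ -f "$model" ]; then
  # tessdata_best models are noticeably larger than the "fast" ones,
  # so the file size is a quick sanity check that the copy worked.
  echo "model present: $(wc -c < "$model") bytes"
fi

# --oem 1 selects the LSTM engine, which the tessdata_best models require.
if command -v tesseract >/dev/null 2>&1 && [ -f page.png ]; then
  tesseract page.png out --oem 1 -l eng
fi
```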

Honestly, I am getting the same number of errors as with the standard model (or even more). I am trying to automatically transcribe documents such as the one attached (a simple excerpt from a longer file; see also e.g. https://royalsocietypublishing.org/doi/epdf/10.1098/rstl.1720.0013). Any idea if there are more suitable models for this kind of 18th-century document? (It seems to be an 18th-century Caslon font, which uses the long s quite often.)

Thank you for any kind of help you can provide!
Best,
Massimiliano
18_century_extract.pdf

RuePat07

Apr 21, 2025, 3:02:02 PM
to tesseract-ocr
Try preprocessing your documents. Create a black-and-white image first and crop the images to the text area. Try to enhance the text by thresholding. In my experience, Tesseract does not do so well when there are stray lines or boxes. You can also experiment with different --psm modes; I found changing them useful in my application. You could also fine-tune the eng/latin model for that font, if all the documents use a similar one.
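Something like this, as a starting point - a sketch using ImageMagick, where scan.png is a placeholder input and the 55% threshold and --psm 4 are values to experiment with, not recommendations:

```shell
# Greyscale, deskew, global threshold, then OCR with a chosen page
# segmentation mode.
src="scan.png"
prepped="prepped.png"
psm=4   # 4 = single column of text; try 3, 6, 11 and compare results

if command -v convert >/dev/null 2>&1 && [ -f "$src" ]; then
  convert "$src" -colorspace Gray -deskew 40% -threshold 55% "$prepped"
  tesseract "$prepped" result --oem 1 --psm "$psm" -l eng
fi
```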

Graham Toal

Apr 21, 2025, 5:34:43 PM
to tesser...@googlegroups.com
On Mon, Apr 21, 2025 at 2:02 PM RuePat07 <patil.ruc...@gmail.com> wrote:
> Try preprocessing your documents. Create a black and white image first and crop the images for text area. [...]

Actually, that document looks like one of the ones prepared with whatever tool it is that creates three layers for every page, where one layer is the text-only layer in greyscale with the background already removed (although it is inverted, white on black, which is easily fixed). You can extract those images from the file and keep every third one, which will be the text. I don't know which tool is creating PDFs in this format, but it's similar to the way DjVu originally pioneered separating out the background and replacing it with a more compact version. I've seen it in files from both Google Books and archive.org.

In my current project, this was all I found necessary to apply to those extracted layers - basically just removing a little noise:
    # Close small gaps in the glyphs, then erode stray noise, using the
    # original image (stashed in the MPR: register) as a clip mask so the
    # letterforms themselves survive the erosion.
    convert \
        "$1" \
        -write MPR:source \
        -morphology close rectangle:3x4 \
        -clip-mask MPR:source \
        -morphology erode:8 square \
        +clip-mask \
        scan_intermediate.jpg
    # Shave the page margins, then auto-trim the remaining border.
    convert scan_intermediate.jpg -shave 150x150 -fuzz 20% -trim +repage "../images/$1"
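The extract-and-keep-every-third-image step can be sketched with pdfimages (from poppler-utils). document.pdf is a placeholder name, and the offset (index 2 within each group of three) is an assumption - inspect the first few extracted images to see which position the text layer actually occupies:

```shell
# Hypothetical helper: is image number $1 (0-based, in pdfimages output
# order) the text layer, assuming the text is every third image?
is_text_layer() { [ $(( $1 % 3 )) -eq 2 ]; }

pdf="document.pdf"
if command -v pdfimages >/dev/null 2>&1 && [ -f "$pdf" ]; then
  pdfimages -png "$pdf" layer      # writes layer-000.png, layer-001.png, ...
  i=0
  for f in layer-*.png; do
    is_text_layer "$i" || rm -f "$f"   # keep only the text layers
    i=$((i + 1))
  done
fi
```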
By the way, while I'm posting... some 'gotchas' to look out for, which I've come across myself recently when OCRing and proofreading similar 18th- and 19th-century documents, some of which were due to the typesetter substituting what was available for a less common character:

- The actual letter 'f' substituted for the long medial s.
- 'y' substituted for thorn - the old-style thorn that looks like a y or a gamma, not the representation used by UTF-8 that looks somewhat like a p or b or beta (example: "for the using þe way of witchcraft of moudiwart's feet upon him in his purse given to him þe Satan for the cause that sa lang as he had them upon him he sould never want siller."). This 'þe' is frequently erroneously rendered (and mispronounced) as 'ye'.
- An apostrophe used in Scottish names like M`Donald in place of a superscript 'c'.
- Various ligatures that you don't see much nowadays (eg ct), and more common use of ligatures generally (eg Æneas).
- Much more common use of superscripts where in modern times we'd use an apostrophe to denote missing letters before the word-final cluster of letters.
- u for v and vice versa; Qu for W.
- Thin spaces before some punctuation, caused by mechanical issues with the type - eg ' ;', which should be OCR'd as just ';'.
- Use of the old-style '&' which looks more like the letters "Et".
- Use of accents that you might not be expecting and might dismiss as bad OCR, eg "We hairtlie thank thé Hevinlie Father".
- Use of vulgar fractions with a horizontal bar, which cannot be represented in UTF-8, which only supports a diagonal bar.
- The old letter yogh, which is written with a descender and often rendered as (and similarly mispronounced as) 'z', as in the surname 'Menzies', which is pronounced 'meengis' - the name of the American jazz musician Charlie Mingus actually preserves the pronunciation, but not the spelling, of the original name Menzies.

Few of these are caught by Tesseract, and they will require manual proofreading.
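A couple of the simpler substitutions can at least be normalized mechanically after OCR - a sketch, with a made-up sample string, covering only the long s and thorn mappings from the list:

```shell
# Map long s (ſ) to 's' and thorn (þ) to 'th' in OCR output.
# Only lowercase forms are handled; this is illustrative, not a full table.
sample='Moſt of þe Members'
normalized=$(printf '%s' "$sample" | sed 's/ſ/s/g; s/þ/th/g')
echo "$normalized"    # -> Most of the Members
```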

Good luck with your project.

Graham