
Using tessdata_best (or other models?) for 18th-century English printed text


Massimiliano Carloni

Apr 21, 2025, 12:03:33 PM
to tesseract-ocr
Hello everyone,

A quick question regarding the use of the tessdata_best models. I have simply copied the eng.traineddata file into the local directory Tesseract loads models from (the one shown when running --list-langs: in my case, /opt/homebrew/share/tessdata/), replacing the standard model that comes with the tesseract Homebrew package. Should I adjust some other configuration in order to get better results (apart from --oem 1)?
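For reference, here is the check-and-run sequence I'm using - a minimal sketch, where the path comes from my Homebrew setup and page.png / out are placeholder names:

```shell
# Confirm which model file Tesseract will load, then run with the LSTM engine.
tessdata_dir="/opt/homebrew/share/tessdata"
model="$tessdata_dir/eng.traineddata"
if [ -f "$model" ]; then
  # tessdata_best models are noticeably larger than the "fast" ones,
  # so the file size is a quick sanity check that the copy worked.
  echo "model present: $(wc -c < "$model") bytes"
fi

# --oem 1 selects the LSTM engine, which the tessdata_best models require.
if command -v tesseract >/dev/null 2>&1 && [ -f page.png ]; then
  tesseract page.png out --oem 1 -l eng
fi
```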

Honestly, I am getting the same number of errors as with the standard model (or even more). I am trying to automatically transcribe documents such as the one attached (a simple excerpt from a longer file; see also e.g. https://royalsocietypublishing.org/doi/epdf/10.1098/rstl.1720.0013). Any idea if there are more suitable models for this kind of 18th-century document? (It seems to be an 18th-century Caslon font, which uses the long s quite often.)

Thank you for any kind of help you can provide!
Best,
Massimiliano
18_century_extract.pdf

RuePat07

Apr 21, 2025, 3:02:02 PM
to tesseract-ocr
Try preprocessing your documents. Create a black-and-white image first and crop the images to the text area. Try to enhance the text by thresholding. In my experience, Tesseract does not do so well when there are stray lines or boxes. You can also experiment with different --psm modes; I found changing them useful in my application. You could also fine-tune the eng/latin model for that font, if all the documents use a similar one.
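Something like this, as a starting point - a sketch using ImageMagick, where scan.png is a placeholder input and the 55% threshold and --psm 4 are values to experiment with, not recommendations:

```shell
# Greyscale, deskew, global threshold, then OCR with a chosen page
# segmentation mode.
src="scan.png"
prepped="prepped.png"
psm=4   # 4 = single column of text; try 3, 6, 11 and compare results

if command -v convert >/dev/null 2>&1 && [ -f "$src" ]; then
  convert "$src" -colorspace Gray -deskew 40% -threshold 55% "$prepped"
  tesseract "$prepped" result --oem 1 --psm "$psm" -l eng
fi
```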

Graham Toal

Apr 21, 2025, 5:34:43 PM
to tesser...@googlegroups.com
On Mon, Apr 21, 2025 at 2:02 PM RuePat07 <patil.ruc...@gmail.com> wrote:
> Try preprocessing your documents. Create a black and white image first and crop the images for text area. [...]

Actually, that document looks like one of the ones prepared with whatever tool it is that creates three layers for every page, where one layer is the text-only layer in greyscale with the background already removed (although it is inverted, white on black, which is easily fixed). You can extract those images from the file and keep every third one, which will be the text. I don't know which tool is creating PDFs in this format, but it's similar to the way DjVu originally pioneered separating out the background and replacing it with a more compact version. I've seen it in files from both Google Books and archive.org.

In my current project, this was all I found necessary to apply to those extracted layers - basically just removing a little noise:
    # Close small gaps in the glyphs, then erode stray noise, using the
    # original image (stashed in the MPR: register) as a clip mask so the
    # letterforms themselves survive the erosion.
    convert \
        "$1" \
        -write MPR:source \
        -morphology close rectangle:3x4 \
        -clip-mask MPR:source \
        -morphology erode:8 square \
        +clip-mask \
        scan_intermediate.jpg
    # Shave the page margins, then auto-trim the remaining border.
    convert scan_intermediate.jpg -shave 150x150 -fuzz 20% -trim +repage "../images/$1"
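The extract-and-keep-every-third-image step can be sketched with pdfimages (from poppler-utils). document.pdf is a placeholder name, and the offset (index 2 within each group of three) is an assumption - inspect the first few extracted images to see which position the text layer actually occupies:

```shell
# Hypothetical helper: is image number $1 (0-based, in pdfimages output
# order) the text layer, assuming the text is every third image?
is_text_layer() { [ $(( $1 % 3 )) -eq 2 ]; }

pdf="document.pdf"
if command -v pdfimages >/dev/null 2>&1 && [ -f "$pdf" ]; then
  pdfimages -png "$pdf" layer      # writes layer-000.png, layer-001.png, ...
  i=0
  for f in layer-*.png; do
    is_text_layer "$i" || rm -f "$f"   # keep only the text layers
    i=$((i + 1))
  done
fi
```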
By the way, while I'm posting... some 'gotchas' to look out for, which I've come across myself recently when OCRing and proofreading similar 18th- and 19th-century documents, some of which were due to the typesetter substituting what was available for a less common character:

- The actual letter 'f' substituted for the long medial s.
- 'y' substituted for thorn - the old-style thorn that looks like a y or a gamma, not the representation used by UTF-8 that looks somewhat like a p or b or beta (example: "for the using þe way of witchcraft of moudiwart's feet upon him in his purse given to him þe Satan for the cause that sa lang as he had them upon him he sould never want siller."). This 'þe' is frequently erroneously rendered (and mispronounced) as 'ye'.
- An apostrophe used in Scottish names like M`Donald in place of a superscript 'c'.
- Various ligatures that you don't see much nowadays (eg ct), and more common use of ligatures generally (eg Æneas).
- Much more common use of superscripts where in modern times we'd use an apostrophe to denote missing letters before the word-final cluster of letters.
- u for v and vice versa; Qu for W.
- Thin spaces before some punctuation, caused by mechanical issues with the type - eg ' ;', which should be OCR'd as just ';'.
- Use of the old-style '&' which looks more like the letters "Et".
- Use of accents that you might not be expecting and might dismiss as bad OCR, eg "We hairtlie thank thé Hevinlie Father".
- Use of vulgar fractions with a horizontal bar, which cannot be represented in UTF-8, which only supports a diagonal bar.
- The old letter yogh, which is written with a descender and often rendered as (and similarly mispronounced as) 'z', as in the surname 'Menzies', which is pronounced 'meengis' - the name of the American jazz musician Charlie Mingus actually preserves the pronunciation, but not the spelling, of the original name Menzies.

Few of these are caught by Tesseract, and they will require manual proofreading.
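A couple of the simpler substitutions can at least be normalized mechanically after OCR - a sketch, with a made-up sample string, covering only the long s and thorn mappings from the list:

```shell
# Map long s (ſ) to 's' and thorn (þ) to 'th' in OCR output.
# Only lowercase forms are handled; this is illustrative, not a full table.
sample='Moſt of þe Members'
normalized=$(printf '%s' "$sample" | sed 's/ſ/s/g; s/þ/th/g')
echo "$normalized"    # -> Most of the Members
```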

Good luck with your project.

Graham