[questions] what happened to `tessdata_best` in Tesseract 5?


Alessandro Griseta

Jul 6, 2025, 10:18:32 AM
to tesseract-ocr
I tried manually adding files I needed from https://github.com/tesseract-ocr/tessdata_best (`equ.traineddata`, `osd.traineddata`, `ita.traineddata`) inside `/usr/share/tesseract-ocr/5/tessdata`: unfortunately I then found out the hard way that these only work on Tesseract 4 XD. 

1. It seems funny though: does that really mean I'll get better results by downgrading so that I can actually use these files?

I understand the performance loss, but I'm particularly interested in getting the most out of `equ.traineddata`, which to my understanding handles math characters. Those are often a challenge for OCR engines, so I was trying to get the best scan possible for them.

2. Also, I wasn't able to specify `-l equ`, as the error told me Tesseract is supposed to deal with that on its own. If that's the case, is `equ` installed by default with `sudo apt-get install tesseract-ocr`? I couldn't find it in the `tessdata` folder and don't know where else to look for it.

3. I also tested the Docker image: if I put `equ.traineddata` and `osd.traineddata` inside the `tessdata` folder, will those files (which I chose manually) actually be used?

Hope this all makes sense, don't be afraid to ask :)
Alessandro

Zdenko Podobny

Jul 6, 2025, 10:21:06 AM
to tesser...@googlegroups.com
What is `Tesseract 4 XD`? And what does `I then found out the hard way that ...` mean?

Zdenko


On Sun, Jul 6, 2025 at 16:18, Alessandro Griseta <alegr...@gmail.com> wrote:

Alessandro Griseta

Jul 8, 2025, 4:26:02 AM
to tesseract-ocr
Oh nothing, `XD` was just an exclamation of laughter in place of the emoji!
I found it out the hard way after a lot of fiddling with Docker containers. I was using the `ocrmypdf` tool and originally thought the problem was in that tool itself, until I saw the same behaviour in Tesseract.

Milan Hauth

Aug 30, 2025, 5:46:52 AM
to tesseract-ocr
Works for me with Tesseract 5.5.1:

tesseract src.tiff - -c tessedit_create_hocr=1 --dpi 300 -l eng \
  --oem 1 --psm 6 --tessdata-dir tessdata_best >dst.hocr
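
Because `--tessdata-dir` points Tesseract at a specific data directory, it can help to confirm that every model named in `-l` is actually present there before running OCR. The following is a minimal sketch, not part of Tesseract: `have_traineddata` is a hypothetical helper, and it assumes models are stored as `<lang>.traineddata` directly inside the directory, with several languages joined by `+` as in `ita+eng`.

```shell
# Hypothetical helper (not a Tesseract command): check that each language in a
# "-l"-style spec such as "ita+eng" has a <lang>.traineddata file in the given
# tessdata directory. Returns 0 if all are present, 1 otherwise.
have_traineddata() {
    dir=$1
    spec=$2
    missing=0
    old_ifs=$IFS
    IFS='+'                      # split "ita+eng" into "ita" and "eng"
    for lang in $spec; do
        if [ ! -f "$dir/$lang.traineddata" ]; then
            echo "missing: $dir/$lang.traineddata" >&2
            missing=1
        fi
    done
    IFS=$old_ifs                 # restore the caller's field separator
    return "$missing"
}
```

For example, `have_traineddata tessdata_best ita+eng` could be run first, and the `tesseract` command above only if it succeeds.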

> I then found out the hard way that these only work on Tesseract 4

How did you call tesseract?

Tesseract can fail if you also pass "magic" positional arguments like "hocr" or "quiet". It then prints a warning like
"read_params_file: Can't open <ARG>"
for each unexpected argument:

$ tesseract src.tiff - --dpi 300 -l eng \
  --oem 1 --psm 6 --tessdata-dir tessdata_best hocr >dst.hocr
read_params_file: Can't open hocr

$ tesseract src.tiff - hocr --dpi 300 -l eng \
  --oem 1 --psm 6 --tessdata-dir tessdata_best >dst.hocr
read_params_file: Can't open --dpi
read_params_file: Can't open 300
read_params_file: Can't open -l
read_params_file: Can't open eng
...
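
The two failing calls above suggest a pattern: once a bare word such as `hocr` appears after the output base, it and everything after it is read as a parameter file name. The sketch below only models that observed behaviour; it is not Tesseract's real argument parser. `config_args` is a hypothetical function, and the list of options that take a value is an assumption limited to the options used in this thread.

```shell
# Model of the behaviour seen in the output above, NOT Tesseract's own parser.
# Usage: config_args IMAGE OUTPUTBASE [ARGS...]
# Prints, one per line, the arguments that would be read as parameter files.
config_args() {
    [ "$#" -ge 2 ] || return 2   # need at least image and output base
    shift 2                      # drop image and output base
    while [ "$#" -gt 0 ]; do
        case $1 in
            --dpi|-l|--oem|--psm|--tessdata-dir|-c)
                # Assumed value-taking options: skip the option and its value.
                shift
                if [ "$#" -gt 0 ]; then shift; fi
                ;;
            -*) shift ;;         # any other option
            *)  break ;;         # first bare word ends option parsing
        esac
    done
    for arg in "$@"; do
        printf '%s\n' "$arg"
    done
}
```

With the working call (`-c tessedit_create_hocr=1` and no trailing `hocr`) the function prints nothing, which matches the absence of `read_params_file` warnings there.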
