Different results from Tesseract OCR after upgrading (v4 to v5) together with Python (3.6 to 3.10)

Prashant Sharma

Feb 27, 2023, 2:03:13 AM
to tesseract-ocr
Hi All,

I am trying to upgrade the software versions of an in-house text extraction application built with Python, the tesserocr Python module, and Tesseract OCR, as follows:

  • Existing versions (outdated) : Python v3.6.5 + tesserocr v2.4.0 + Tesseract OCR v4
  • Target versions (latest)     : Python v3.10.7 + tesserocr v2.5.2 + Tesseract OCR v5

However, the two sets of versions give different results when I call the GetHOCRText method: the bounding box coordinates differ, the extracted text shows minor changes, and other numerical metadata differs as well.

I need the upgraded software to produce exactly the same metadata (e.g. bounding boxes) as before, because downstream steps after the text extraction depend on it.

Could you please advise?

Prashant Sharma

Zdenko Podobny

Mar 11, 2023, 1:03:26 PM
to tesser...@googlegroups.com
First of all: it is good manners to provide a test case (working code + input & output).
Next: there have been improvements to the bounding boxes (e.g. https://github.com/tesseract-ocr/tesseract/commit/3a5e5089343798932d9952628acfdf56f3108c43), so to get the old coordinates you will need to make a custom build that reverts the respective commits.
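
A test case of the kind asked for here could be put together roughly as follows. This is only a sketch, not code from the original application: `dump_hocr`, `parse_word_boxes` and `diff_boxes` are hypothetical helper names, the image path and `lang="eng"` are placeholders, and pairing words by their position in the output is an assumption that breaks down if the two versions segment the page differently. It captures the hOCR output under each environment and lists the words whose text or bounding box changed.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def dump_hocr(image_path, lang="eng"):
    """Return (tesseract version string, hOCR text) for one image.

    Run this once in the old environment and once in the new one.
    tesserocr is imported here so the parsing helpers below work without it.
    """
    import tesserocr  # third-party; must be installed in each environment

    with tesserocr.PyTessBaseAPI(lang=lang) as api:
        api.SetImageFile(image_path)
        return tesserocr.tesseract_version(), api.GetHOCRText(0)


class _WordBoxParser(HTMLParser):
    """Collect (text, bbox) for every ocrx_word span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._text = []
        self._nested = 0    # spans opened inside the current word

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            if tag == "span":
                self._nested += 1
            return
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None or tag != "span":
            return
        if self._nested:
            self._nested -= 1
            return
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def parse_word_boxes(hocr):
    """Return a list of (word_text, (x0, y0, x1, y1)) in document order."""
    parser = _WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def diff_boxes(old_hocr, new_hocr, tolerance=0):
    """List words whose text differs or whose bbox moved more than `tolerance` px.

    Words are paired by position (an assumption); extra words on either side
    are ignored, so compare the word counts separately.
    """
    diffs = []
    pairs = zip(parse_word_boxes(old_hocr), parse_word_boxes(new_hocr))
    for index, ((old_text, old_box), (new_text, new_box)) in enumerate(pairs):
        shift = max(abs(a - b) for a, b in zip(old_box, new_box))
        if old_text != new_text or shift > tolerance:
            diffs.append((index, old_text, old_box, new_text, new_box))
    return diffs
```

Saving the output of `dump_hocr` from both environments and passing the two hOCR strings to `diff_boxes` gives a concrete list of differences to post along with the input image. A non-zero `tolerance` may also help to judge whether the downstream steps could accept small coordinate shifts instead of requiring a custom build.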


On Mon, Feb 27, 2023 at 8:03, Prashant Sharma <prashants...@gmail.com> wrote: