I am trying to upgrade the software versions of an in-house text extraction application built with Python, the tesserocr Python module, and Tesseract OCR, as follows:
- Existing software versions (outdated): Python (v3.6.5) + tesserocr (v2.4.0) + Tesseract OCR (v4)
- Target software versions (latest): Python (v3.10.7) + tesserocr (v2.5.2) + Tesseract OCR (v5)
However, the two sets of versions above give different results when I call the GetHOCRText method: the bounding box coordinates, the extracted text (minor changes), and other numerical metadata all differ.
I need the upgraded software to produce exactly the same metadata (e.g. bounding boxes) as before, because processing that runs after the text extraction depends on it.
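To make the differences concrete, the two hOCR outputs can be compared word by word. Below is a minimal sketch using only the standard library. The helper names `word_boxes` and `diff_boxes` are hypothetical, and it assumes both outputs contain the same words in the same order and use `ocrx_word` spans with a `bbox` entry in the `title` attribute.

```python
import re

# Assumed hOCR word markup, e.g.
# <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
    re.S,
)


def word_boxes(hocr):
    """Return a list of (text, (x0, y0, x1, y1)) for every word span found."""
    words = []
    for x0, y0, x1, y1, inner in WORD_RE.findall(hocr):
        # Drop nested tags such as <strong> or <em> around the word text.
        text = re.sub(r"<[^>]+>", "", inner).strip()
        words.append((text, (int(x0), int(y0), int(x1), int(y1))))
    return words


def diff_boxes(old_words, new_words, tol=0):
    """Pair words by position and report those whose text differs or whose
    box moved by more than `tol` pixels on any coordinate."""
    mismatches = []
    for i, ((t_old, b_old), (t_new, b_new)) in enumerate(zip(old_words, new_words)):
        moved = any(abs(a - b) > tol for a, b in zip(b_old, b_new))
        if t_old != t_new or moved:
            mismatches.append((i, t_old, t_new, b_old, b_new))
    return mismatches
```

Here `old_words` and `new_words` would come from calling `word_boxes` on the strings returned by GetHOCRText in each environment. Because words are paired by position, a differing word count should be checked separately (for example by comparing `len(old_words)` and `len(new_words)`) before trusting the pairing.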
Could you please advise?