Hi,
I’m working on OCR for Arabic digits (also known as Hindi numbers) extracted from table cells in old documents. After cropping the table cells, I run OCR on each cell individually. However, some digits are repeated in the recognized output.
When I visualized the coordinates of each digit, I found that extra bounding boxes had been generated. Do you have any suggestions for resolving this issue?
I’ve attached the visual results, highlighting the inaccuracies with red circles, along with the original cell image for your reference.
I am using a fine-tuned version of the Arabic.traineddata model, trained further on lines from the same collection of books that I’m OCRing. OCR is run with PSM=6 (assume a single uniform block of text) and OEM=1 (LSTM engine only).
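As a possible stopgap, duplicate boxes could be filtered after recognition. The following is only a minimal sketch, not a fix for the underlying cause. It assumes each recognized symbol is available as a (text, confidence, (x1, y1, x2, y2)) tuple, for example collected from the symbol-level result iterator, that each cell holds a single line of digits, and that the extra boxes overlap the genuine ones heavily. The names `iou` and `drop_duplicate_symbols` and the 0.5 overlap threshold are illustrative choices, not part of tesserocr.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
Symbol = Tuple[str, float, Box]   # (text, confidence, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_duplicate_symbols(symbols: List[Symbol],
                           threshold: float = 0.5) -> List[Symbol]:
    """Keep the highest-confidence symbol among heavily overlapping boxes."""
    kept: List[Symbol] = []
    # Visit the most confident symbols first so they win over duplicates.
    for sym in sorted(symbols, key=lambda s: s[1], reverse=True):
        if all(iou(sym[2], k[2]) < threshold for k in kept):
            kept.append(sym)
    # Assumption: one line per cell, so order the survivors by left edge.
    return sorted(kept, key=lambda s: s[2][0])


# Hypothetical data: the second entry is a repeated box over the first digit.
symbols = [
    ("٣", 95.0, (10, 5, 30, 40)),
    ("٣", 60.0, (12, 6, 31, 41)),
    ("٧", 90.0, (40, 5, 60, 40)),
]
cleaned = drop_duplicate_symbols(symbols)
print("".join(text for text, _, _ in cleaned))
```

If the repeated boxes do not overlap the genuine ones, this filter will not remove them, and the threshold would need tuning on real cells either way.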
Tesseract 5.5.0
Python 3.13.1
tesserocr 2.7.1
leptonica-1.82.0
Thank you!
Sara Elshobaky