Digits only recognized when mixed with letters

Iman Firouzian

Feb 28, 2024, 3:28:51 AM
to tesseract-ocr
Dear Friends,

Farsi digits are not recognized when they appear on their own, with no letters next to them.
For example,
OnlyDigits.jpg
is recognized as: ۷۵ ۱ب
which is not correct.

But when the digits are mixed with letters, as in:
DigitsMixedWithLetters.jpg
it is recognized as: ماه ۸,۱۷۵,۰۰۰
which is correct.

Please help me with this

Tom Morris

Feb 28, 2024, 5:20:50 PM
to tesseract-ocr
On Wednesday, February 28, 2024 at 3:28:51 AM UTC-5 Iman Firouzian wrote:

Please help me with this

Please include more details about what version of the software you are using and which language (or script) model(s).
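
One way to gather those details is sketched below, assuming the `tesseract` binary is on the PATH. It relies on the `--version` and `--list-langs` command-line options; the helper names `parse_list_langs` and `tesseract_report` are made up for this illustration and are not part of Tesseract.

```python
import subprocess


def parse_list_langs(output: str) -> list:
    """Extract model names from the text printed by `tesseract --list-langs`.

    The first non-empty line is a header that starts with
    'List of available languages'; each remaining line is one model name.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.lower().startswith("list of")]


def tesseract_report() -> dict:
    """Collect the version banner and installed models (hypothetical helper)."""
    def run(*args):
        proc = subprocess.run(["tesseract", *args], capture_output=True, text=True)
        # Older builds print this information on stderr, so keep both streams.
        return (proc.stdout + proc.stderr).strip()

    return {
        "version": run("--version"),
        "languages": parse_list_langs(run("--list-langs")),
    }
```

Calling `tesseract_report()` in the same environment that runs the OCR (for example a Colab cell) and pasting the result into the post gives readers both pieces of information at once.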

Tom

Iman Firouzian

Feb 29, 2024, 1:20:33 AM
to tesseract-ocr
I installed it in Google Colab using:
!sudo apt install tesseract-ocr

It reports that this is the latest version: tesseract-ocr is already the newest version (4.1.1-2.1build1).

The language model is "fas", installed with:
!sudo apt install tesseract-ocr-fas

Thanks for helping.

Iman Firouzian

Feb 29, 2024, 3:45:53 AM
to tesseract-ocr
Hi again,
I've also tested it on Windows, from PyCharm.
The Tesseract version there is v5.0.0-alpha.20200328.

The result is roughly the same:
digits are recognized correctly only when they are mixed with letters.
Are any specific configurations needed?

Thanks for helping.

Philippe Argouarch

Feb 29, 2024, 10:17:39 AM
to tesseract-ocr
I have a similar problem with the Breton language: the library does not recognize the verbal particle "o" and replaces it with a zero (0). For example, "oa", which means "was" in English, becomes "0a".
Philippe

Tom Morris

Feb 29, 2024, 4:06:00 PM
to tesseract-ocr
Thanks for the version and model information. That'll be useful for anyone trying to help.

My best guess is that there's something about the Farsi training data which is causing this, but I don't know what (and I don't speak Farsi). One thing you might try is using the Arabic script model and see if that's any better. Other than that, I'm afraid I don't have any good suggestions.
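
A minimal sketch of that comparison follows, assuming the `tesseract` binary is on the PATH and that the Arabic script model is installed under the name `script/Arabic` (the name depends on the tessdata layout, so it is worth checking against `tesseract --list-langs`). The helper names are made up for this illustration. The optional `psm` argument maps to the `--psm` page segmentation option; trying a value such as 7 (treat the image as a single text line) for a short digits-only image is an extra assumption, not something confirmed in this thread.

```python
import subprocess


def build_ocr_command(image_path, lang, psm=None):
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if psm is not None:
        # Optional page segmentation mode, e.g. 7 = single text line.
        cmd += ["--psm", str(psm)]
    return cmd


def compare_models(image_path, langs=("fas", "script/Arabic"), psm=None):
    """Run the same image through several models and collect the output.

    The default model names are assumptions about what is installed locally.
    """
    results = {}
    for lang in langs:
        proc = subprocess.run(
            build_ocr_command(image_path, lang, psm),
            capture_output=True,
            text=True,
        )
        if proc.returncode == 0:
            results[lang] = proc.stdout.strip()
        else:
            # Keep the error text so a missing model is easy to spot.
            results[lang] = "error: " + proc.stderr.strip()
    return results
```

For example, `compare_models("OnlyDigits.jpg")` returns a dictionary with one entry per model, so the "fas" and Arabic script outputs for the same image can be compared directly.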

Tom

Tom Morris

Feb 29, 2024, 4:07:16 PM
to tesseract-ocr