Arabic characters and numbers

Mostafa Abdo

Nov 1, 2023, 6:43:31 AM

to tesseract-ocr

Is there a trained data file (.traineddata) that contains both Arabic characters and Arabic numerals?
I can only get either the characters or the numbers, not both.

Also, I am using this from Java, not the OCR command-line tool.

Des Bw

Nov 1, 2023, 8:09:45 AM

to tesseract-ocr

Doesn't the official Arabic model include the numerals?

Arabic numerals are supposed to be part of almost all the models. The Amharic model I am working on, for example, does recognize Arabic numerals (along with the regular letter characters, of course).

Mostafa Abdo

Nov 1, 2023, 10:02:22 AM

to tesseract-ocr

I tried ara.traineddata, Arabic.traineddata, and ara-Amiri.traineddata. None of them has the Arabic (Indian) numbers, but they all have the normal (English) numbers.

Tom Morris

Nov 1, 2023, 12:11:45 PM

to tesseract-ocr

On Wednesday, November 1, 2023 at 10:02:22 AM UTC-4 mosta....@gmail.com wrote:

I tried ara.traineddata, Arabic.traineddata, and ara-Amiri.traineddata. None of them has the Arabic (Indian) numbers, but they all have the normal (English) numbers.

You might want to clarify whether you are referring to: https://en.wikipedia.org/wiki/Arabic_numerals

or https://en.wikipedia.org/wiki/Eastern_Arabic_numerals

Or better yet, include pictures or Unicode code points, so that people can give you a precise answer.

Tom
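To make the "Unicode code points" suggestion concrete, here is a minimal Java sketch that lists the code point and digit block of every decimal digit in a string. The class and method names (DigitInspector, describeDigits) are made up for illustration and are not part of Tesseract or Tess4J. The block ranges are the standard Unicode ones: U+0030..U+0039 (ASCII), U+0660..U+0669 (Arabic-Indic), and U+06F0..U+06F9 (Extended Arabic-Indic).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Tesseract or Tess4J.
final class DigitInspector {

    // Returns one description per decimal digit found in the text,
    // for example "U+0663 ARABIC-INDIC (value 3)".
    static List<String> describeDigits(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            if (!Character.isDigit(cp)) {
                return; // skip letters, punctuation, and directional marks
            }
            String block;
            if (cp >= 0x0030 && cp <= 0x0039) {
                block = "ASCII";
            } else if (cp >= 0x0660 && cp <= 0x0669) {
                block = "ARABIC-INDIC";
            } else if (cp >= 0x06F0 && cp <= 0x06F9) {
                block = "EXTENDED ARABIC-INDIC";
            } else {
                block = "OTHER";
            }
            // Locale.ROOT keeps the formatted value in ASCII digits.
            out.add(String.format(Locale.ROOT, "U+%04X %s (value %d)",
                    cp, block, Character.digit(cp, 10)));
        });
        return out;
    }

    public static void main(String[] args) {
        // Escapes stand for Arabic-Indic three, ASCII seven, Extended Arabic-Indic eight.
        describeDigits("\u0663 7 \u06F8").forEach(System.out::println);
    }
}
```

Pasting the OCR output (or the expected text) through something like this shows exactly which digit set is involved, which is easier to answer than "Indian" or "English" numbers.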

Mostafa Abdo

Nov 2, 2023, 12:01:28 PM

to tesseract-ocr

I need to get Eastern Arabic numerals.

I tried ara.traineddata, Arabic.traineddata, and ara-Amiri.traineddata from the tessdata folder. None of them has the Eastern Arabic numerals.
I am using Java with the tess4j-5.8.0 library.

Mostafa Abdo

Nov 2, 2023, 12:03:53 PM

to tesseract-ocr

I tried to extract the text from this picture, and this is the output:

7 عضو عامل
السيدة : نهى الامام الشيخ

رقم العضوية : ??????? .
رقم الايصال : ??????????

تاريخ السداد : ?????/??/???

ا » | >ا.ن.»»» رتيس معا ادارة
“ ا. 7 الاعلامي د/ اسفة

id2.jpg

Tom Morris

Nov 7, 2023, 4:16:26 PM

to tesseract-ocr

Strangely, the ara language model only seems to have the first two numerals, but the Arabic script model has them all:

$ grep '[٠١٢٣٤٥٦٧٨٩]' ara.lstm-unicharset
٠ 8 0,255,0,255,0,0,0,0,0,0 Arabic 82 5 82 ٠ # ٠ [660 ]0
١ 8 0,255,0,255,0,0,0,0,0,0 Arabic 83 5 83 ١ # ١ [661 ]0
$ grep '[٠١٢٣٤٥٦٧٨٩]' script/Arabic.lstm-unicharset
٠ 8 0,255,0,255,0,0,0,0,0,0 Arabic 82 5 82 ٠ # ٠ [660 ]0
١ 8 0,255,0,255,0,0,0,0,0,0 Arabic 83 5 83 ١ # ١ [661 ]0
٩ 8 0,255,0,255,0,0,0,0,0,0 Arabic 102 5 102 ٩ # ٩ [669 ]0
٨ 8 0,255,0,255,0,0,0,0,0,0 Arabic 217 5 217 ٨ # ٨ [668 ]0
٣ 8 0,255,0,255,0,0,0,0,0,0 Arabic 218 5 218 ٣ # ٣ [663 ]0
٢ 8 0,255,0,255,0,0,0,0,0,0 Arabic 219 5 219 ٢ # ٢ [662 ]0
٧ 8 0,255,0,255,0,0,0,0,0,0 Arabic 222 5 222 ٧ # ٧ [667 ]0
٥ 8 0,255,0,255,0,0,0,0,0,0 Arabic 223 5 223 ٥ # ٥ [665 ]0
٤ 8 0,255,0,255,0,0,0,0,0,0 Arabic 224 5 224 ٤ # ٤ [664 ]0
٦ 8 0,255,0,255,0,0,0,0,0,0 Arabic 300 5 300 ٦ # ٦ [666 ]0
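Since the original question came from Java, the same check can be sketched there too. This assumes the .lstm-unicharset file has already been extracted from the .traineddata file (for example with combine_tessdata -u) and that the character is the first space-separated field on each line, as in the listing above. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Tesseract or Tess4J.
final class UnicharsetDigits {

    // Returns the Arabic-Indic digits (U+0660..U+0669) that never appear
    // as the first field of a unicharset line.
    static Set<Integer> missingDigits(List<String> lines) {
        Set<Integer> missing = new TreeSet<>();
        for (int cp = 0x0660; cp <= 0x0669; cp++) {
            missing.add(cp);
        }
        for (String line : lines) {
            int space = line.indexOf(' ');
            String unichar = space < 0 ? line : line.substring(0, space);
            // Only single-code-point entries can be one of the ten digits.
            if (unichar.codePointCount(0, unichar.length()) == 1) {
                missing.remove(Integer.valueOf(unichar.codePointAt(0)));
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: UnicharsetDigits <file.lstm-unicharset>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (int cp : missingDigits(lines)) {
            System.out.printf("missing U+%04X%n", cp);
        }
    }
}
```

An empty result means all ten digits are present in that model's character set.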

With -l Arabic using the tessdata_best model, I get:

السيدة : نهى الامام الشيخ
رقم العضوية : ۸ ‎٥۸٥‏ 6
رقم الايصال : ‎۱۰۹۲١٤‏
‏تاريخ السداد : ‎۲٠۲۰/۰۱/۰٢‏

www.alshams.club ‏ا‎ ۱

ئيس مجلس الادارة
‎@alshams.club goo‏
‎info “f m‏

الاعلامي د/ انامه ابو
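The digits in the output above appear to mix more than one digit set (Arabic-Indic U+066x, Extended Arabic-Indic U+06Fx, and an ASCII 6). If a single set is needed downstream, one option is to normalize after OCR. The following is only a sketch of that post-processing step, with a made-up class name. It does not correct misrecognized digits, it only maps each digit to the same value in the chosen block.

```java
// Hypothetical helper, not part of Tesseract or Tess4J.
final class DigitNormalizer {

    // Rewrites every Unicode decimal digit in the text to the digit with the
    // same value in the block that starts at zeroCodePoint
    // (0x0660 for Arabic-Indic, '0' for ASCII). Other characters are kept.
    static String normalizeDigits(String text, int zeroCodePoint) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (Character.isDigit(cp)) {
                sb.appendCodePoint(zeroCodePoint + Character.digit(cp, 10));
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Mixed Extended Arabic-Indic and Arabic-Indic digits, mapped to ASCII.
        String mixed = "\u06F1\u06F0\u06F9\u06F2\u0661\u0664";
        System.out.println(normalizeDigits(mixed, '0'));
    }
}
```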

Mostafa Abdo

Nov 8, 2023, 6:23:46 AM

to tesseract-ocr

That is exactly what I want. Can you share this tessdata model here?

I mean the .traineddata file, as I am using it with Java, not other tools.

Tom Morris

Nov 8, 2023, 11:11:52 AM

to tesseract-ocr

On Wednesday, November 8, 2023 at 6:23:46 AM UTC-5 mosta....@gmail.com wrote:

That is exactly what I want. Can you share this tessdata model here?

tessdata_best is the name of the repo for the standard models, and the Arabic model contains the language-independent Arabic script training.

https://github.com/tesseract-ocr/tessdata_best/blob/main/script/Arabic.traineddata

Tom

Mostafa Abdo

Nov 9, 2023, 10:22:03 AM

to tesseract-ocr

I still get the same result. What's wrong?

Tom Morris

Nov 10, 2023, 11:59:50 AM

to tesseract-ocr

Please don't use screenshots to represent text. They can't be searched, quoted, or edited, and are generally much more difficult for people to deal with.

My results were with -l Arabic. It looks like your Java program is doing the equivalent of -l ara, which isn't the same thing. I suspect that if you use the correct model you'll get the results you desire.

Tom
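For the Java side, a minimal sketch of selecting the script model rather than ara might look like the following. It uses only the standard library to confirm that the model file is where the program expects it. The class name, the "tessdata" default directory, and the requireModel helper are assumptions for illustration. The commented lines show how the same two values would typically be handed to Tess4J through setDatapath and setLanguage.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tesseract or Tess4J.
final class TessModelCheck {

    // Returns the path of <language>.traineddata under tessdataDir, or throws
    // if no such file exists. "language" is the value that would follow -l.
    static Path requireModel(Path tessdataDir, String language) {
        Path model = tessdataDir.resolve(language + ".traineddata");
        if (!Files.isRegularFile(model)) {
            throw new IllegalArgumentException("No model file at " + model);
        }
        return model;
    }

    public static void main(String[] args) {
        // Assumed layout: Arabic.traineddata copied directly into this folder.
        Path tessdata = Paths.get(args.length > 0 ? args[0] : "tessdata");
        String language = "Arabic"; // the script model, not "ara"
        System.out.println("Using " + requireModel(tessdata, language));

        // With Tess4J on the classpath, the same values would be passed as:
        //   net.sourceforge.tess4j.Tesseract t = new net.sourceforge.tess4j.Tesseract();
        //   t.setDatapath(tessdata.toString());
        //   t.setLanguage(language);
        //   String text = t.doOCR(new java.io.File("id2.jpg"));
    }
}
```

If the file is kept in a script subfolder, as it is in the tessdata_best repository, the language value would be "script/Arabic" instead.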
