Arabic characters and numbers


Mostafa Abdo

Nov 1, 2023, 6:43:31 AM
to tesseract-ocr
Is there a traineddata file that contains both Arabic characters and numbers?
I can get only characters or only numbers, not both.
Also, I am using this from Java, not the OCR command-line tool.

Des Bw

Nov 1, 2023, 8:09:45 AM
to tesseract-ocr
Doesn't the official Arabic model include the numerals?
Arabic numerals are supposed to be part of almost all the models. The Amharic model I am working on, for example, recognizes Arabic numerals (along with the regular letter characters, of course).

Mostafa Abdo

Nov 1, 2023, 10:02:22 AM
to tesseract-ocr
I tried ara.traineddata, Arabic.traineddata, and ara-Amiri.traineddata. None of them outputs the Arabic (Indian) numbers; they only output the normal (English) numbers.

Tom Morris

Nov 1, 2023, 12:11:45 PM
to tesseract-ocr
On Wednesday, November 1, 2023 at 10:02:22 AM UTC-4 mosta....@gmail.com wrote:
I tried ara.traineddata , Arabic.traineddata and ara-Amiri.traineddata all don't have the Arabic (Indian) numbers but have the normal (English) numbers

You might want to clarify whether you are referring to: https://en.wikipedia.org/wiki/Arabic_numerals

Or better yet, include pictures or Unicode code points, so that people can give you a precise answer.

Tom 
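A minimal Java sketch of one way to report the code points asked for above. "Arabic numerals" can mean ASCII digits (U+0030 to U+0039), Arabic-Indic digits (U+0660 to U+0669), or Extended Arabic-Indic digits (U+06F0 to U+06F9), so listing the code points removes the ambiguity. The class and method names are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class DigitInspector {
    /** Names the digit range a code point falls in, or returns null for any other character. */
    static String digitBlock(int cp) {
        if (cp >= 0x0030 && cp <= 0x0039) return "ASCII digit";
        if (cp >= 0x0660 && cp <= 0x0669) return "Arabic-Indic digit";
        if (cp >= 0x06F0 && cp <= 0x06F9) return "Extended Arabic-Indic digit";
        return null;
    }

    /** Returns one "U+XXXX <range name>" entry per digit in the text, in logical (stored) order. */
    static List<String> describeDigits(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            String block = digitBlock(cp);
            if (block != null) {
                out.add(String.format("U+%04X %s", cp, block));
            }
        });
        return out;
    }

    public static void main(String[] args) {
        // Sample input: the digit seven from each of the three ranges.
        describeDigits("7 \u0667 \u06F7").forEach(System.out::println);
    }
}
```

Pasting the resulting list into a message tells readers exactly which digits are expected or were produced, even when the glyphs look alike.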

Mostafa Abdo

Nov 2, 2023, 12:01:28 PM
to tesseract-ocr

I need to get Eastern Arabic numbers.
I tried ara.traineddata, Arabic.traineddata, and ara-Amiri.traineddata from the tessdata folder. None of them outputs the Arabic (Eastern) numbers.
I am using Java with the tess4j-5.8.0 library.

Mostafa Abdo

Nov 2, 2023, 12:03:53 PM
to tesseract-ocr
I tried to extract the text from this picture, and this is the output:

7 عضو عامل
السيدة : نهى الامام الشيخ

رقم العضوية : ??????? .
رقم الايصال : ??????????

تاريخ السداد : ?????/??/???

ا » | >ا.ن.»»» رتيس معا ادارة
“ ا. 7 الاعلامي د/ اسفة

id2.jpg

Tom Morris

Nov 7, 2023, 4:16:26 PM
to tesseract-ocr
Strangely, the ara language model only seems to have the first two numerals, but the Arabic script model has them all:

$ grep '[٠١٢٣٤٥٦٧٨٩]' ara.lstm-unicharset
٠ 8 0,255,0,255,0,0,0,0,0,0 Arabic 82 5 82 ٠ # ٠ [660 ]0
١ 8 0,255,0,255,0,0,0,0,0,0 Arabic 83 5 83 ١ # ١ [661 ]0
$ grep '[٠١٢٣٤٥٦٧٨٩]' script/Arabic.lstm-unicharset
٠ 8 0,255,0,255,0,0,0,0,0,0 Arabic 82 5 82 ٠ # ٠ [660 ]0
١ 8 0,255,0,255,0,0,0,0,0,0 Arabic 83 5 83 ١ # ١ [661 ]0
٩ 8 0,255,0,255,0,0,0,0,0,0 Arabic 102 5 102 ٩ # ٩ [669 ]0
٨ 8 0,255,0,255,0,0,0,0,0,0 Arabic 217 5 217 ٨ # ٨ [668 ]0
٣ 8 0,255,0,255,0,0,0,0,0,0 Arabic 218 5 218 ٣ # ٣ [663 ]0
٢ 8 0,255,0,255,0,0,0,0,0,0 Arabic 219 5 219 ٢ # ٢ [662 ]0
٧ 8 0,255,0,255,0,0,0,0,0,0 Arabic 222 5 222 ٧ # ٧ [667 ]0
٥ 8 0,255,0,255,0,0,0,0,0,0 Arabic 223 5 223 ٥ # ٥ [665 ]0
٤ 8 0,255,0,255,0,0,0,0,0,0 Arabic 224 5 224 ٤ # ٤ [664 ]0
٦ 8 0,255,0,255,0,0,0,0,0,0 Arabic 300 5 300 ٦ # ٦ [666 ]0
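For readers working from Java rather than a shell, the same check can be sketched as below. It assumes, as in the grep output above, that each unicharset entry starts with the character itself followed by a space; the class and method names are hypothetical, and the file path is whatever unpacked unicharset file you pass on the command line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class UnicharsetDigits {
    /**
     * Returns the Arabic-Indic digits (U+0660..U+0669) that do not occur
     * as a single-character first field on any line of the unicharset.
     */
    static Set<Character> missingDigits(List<String> unicharsetLines) {
        Set<Character> missing = new TreeSet<>();
        for (char c = '\u0660'; c <= '\u0669'; c++) {
            missing.add(c);
        }
        for (String line : unicharsetLines) {
            // Assumption: the first space-separated field is the character entry.
            String first = line.split(" ", 2)[0];
            if (first.length() == 1) {
                missing.remove(first.charAt(0));
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to an unpacked unicharset file, e.g. ara.lstm-unicharset
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        System.out.println("Missing Arabic-Indic digits: " + missingDigits(lines));
    }
}
```

An empty result means all ten digits are present as entries; it says nothing about how well the model recognizes them.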


With -l Arabic using the tessdata_best model, I get:

السيدة : نهى الامام الشيخ
رقم العضوية : ۸ ‎٥۸٥‏ 6
رقم الايصال : ‎۱۰۹۲١٤‏
‏تاريخ السداد : ‎۲٠۲۰/۰۱/۰٢‏

www.alshams.club ‏ا‎ ۱

ئيس مجلس الادارة
‎@alshams.club goo‏
‎info “f m‏

الاعلامي د/ انامه ابو
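The digits in the output above appear to come from two Unicode ranges: Arabic-Indic (U+0660 to U+0669) and Extended Arabic-Indic (U+06F0 to U+06F9), which look nearly identical in many fonts. If that matters downstream, a small post-processing step can unify them. This is a sketch under that assumption, with hypothetical class and method names; it is not part of Tesseract or Tess4J.

```java
class DigitNormalizer {
    /** Rewrites Extended Arabic-Indic digits as Arabic-Indic digits; all other characters are kept. */
    static String unifyArabicDigits(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp >= 0x06F0 && cp <= 0x06F9) {
                sb.appendCodePoint(cp - 0x06F0 + 0x0660);
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    /** Rewrites digits from both Arabic ranges as ASCII digits, e.g. before parsing a number. */
    static String toAsciiDigits(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp >= 0x0660 && cp <= 0x0669) {
                sb.appendCodePoint(cp - 0x0660 + '0');
            } else if (cp >= 0x06F0 && cp <= 0x06F9) {
                sb.appendCodePoint(cp - 0x06F0 + '0');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Six digits mixing both ranges, written with escapes so the source stays unambiguous.
        String mixed = "\u06F1\u06F0\u06F9\u06F2\u0661\u0664";
        System.out.println(toAsciiDigits(mixed)); // prints 109214
    }
}
```

ASCII digits are deliberately left alone by `unifyArabicDigits`, so Latin text such as URLs in the same image is not altered.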

Mostafa Abdo

Nov 8, 2023, 6:23:46 AM
to tesseract-ocr
That is exactly what I want. Can you share this tessdata model here?
I mean the file (.traineddata),
as I am using it from Java, not other tools.

Tom Morris

Nov 8, 2023, 11:11:52 AM
to tesseract-ocr
On Wednesday, November 8, 2023 at 6:23:46 AM UTC-5 mosta....@gmail.com wrote:
That exactly I want, can you share this tessdata model here.

tessdata_best is the name of the repo for the standard models, and its Arabic model contains the language-independent Arabic script training.


Tom

Mostafa Abdo

Nov 9, 2023, 10:22:03 AM
to tesseract-ocr
I still get the same result. What is wrong with that?
osr1.JPG osr2.JPG osr3.JPG

Tom Morris

Nov 10, 2023, 11:59:50 AM
to tesseract-ocr
Please don't use screenshots to represent text. They can't be searched, quoted, edited, and are generally much more difficult for people to deal with.

My results were with -l Arabic. It looks like your Java program is doing the equivalent of -l ara, which isn't the same thing. I suspect that if you use the correct model you'll get the results you desire.

Tom
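A hedged Java sketch of the model choice described above: it checks which .traineddata files sit in a tessdata directory and prefers the script model "Arabic" over "ara". The class and method names are hypothetical, and the Tess4J calls appear only in comments, as an assumption about how the chosen name would be applied. The directory is assumed to contain Arabic.traineddata directly (in the tessdata_best repo that file lives under script/).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

class ModelPicker {
    /** Lists model names (file name without ".traineddata") found directly in the directory. */
    static Set<String> availableModels(Path tessdataDir) throws IOException {
        Set<String> names = new TreeSet<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tessdataDir, "*.traineddata")) {
            for (Path f : files) {
                String n = f.getFileName().toString();
                names.add(n.substring(0, n.length() - ".traineddata".length()));
            }
        }
        return names;
    }

    /** Prefers "Arabic" (script model); falls back to "ara"; fails if neither is installed. */
    static String pickArabicModel(Set<String> models) {
        if (models.contains("Arabic")) return "Arabic";
        // Per the thread, "ara" only lists two Arabic-Indic digits, so this is a last resort.
        if (models.contains("ara")) return "ara";
        throw new IllegalStateException("No Arabic model found in tessdata directory");
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the tessdata directory
        String language = pickArabicModel(availableModels(Paths.get(args[0])));
        System.out.println("Language to use: " + language);
        // Assumption: with Tess4J on the classpath, the value would be applied roughly as
        //   Tesseract t = new Tesseract();
        //   t.setDatapath(args[0]);
        //   t.setLanguage(language);   // the Java equivalent of "-l Arabic"
        //   String text = t.doOCR(new java.io.File("id2.jpg"));
    }
}
```

Printing the chosen name before running OCR makes it easy to confirm that the program is not silently using ara.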
