Tesseract-OCR Training Arabic text & numbers


Eliyaz L

Jul 12, 2020, 10:05:46 AM
to tesseract-ocr

Hi,


My use case is Arabic documents. The pretrained ara.traineddata is good but not perfect, so I wish to fine-tune it. If the results are not satisfying, I will have to train my own custom model.


Please advise on the following:

  1. In my Arabic text, the problem is one character that is always predicted wrong. Do I need to add the document font (Traditional Arabic) and train? If so, please provide the procedure or a link for adding one font to the pretrained ara.traineddata.
  2. Whether fine-tuning or training from scratch, how many gt.txt files do I need, and how many characters should each file contain? Also the approximate number of iterations, if you know.
  3. For numbers, the prediction is totally wrong on Arabic numbers. Do I need to start from scratch or fine-tune? Either way, how do I prepare the datasets?
  4. How do I decide max_iterations? Is there a ratio between dataset size and iterations?


Below are my trials:


For Arabic Numbers:


-> I tried to custom-train on Arabic numbers only.
-> I wrote a script that writes 100,000 numbers into multiple gt.txt files, with hundreds of characters in each gt.txt file.
-> Then one script converts the text to images (text2image), which should look more like scanned images.
-> Parameters used, in the order below:

text2image --text test.gt.txt --outputbase /home/user/output --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial' --degrade_image false --rotate_image --exposure 2 --resolution 300

  1. How much data do I need to prepare for Arabic numbers? As of now it is required only for 2 specific fonts, which I already have.
  2. Will the dataset contain duplicates if I follow this procedure? If yes, is there any way to avoid it?
  3. Is it better to create more gt.txt files with fewer characters in each (e.g. 50,000 gt files with 10 numbers each) or fewer gt.txt files with more characters (e.g. 1,000 gt files with 500 numbers each)?

If possible, please guide me through the dataset preparation procedure.

For testing I tried 50,000 English numbers, one number per gt.txt file (e.g. "2500" written to 2500.gt.txt), with 20,000 iterations, but it fails.
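
On the duplicate question above, a minimal sketch of how number ground truth could be generated without repeats. The function name `write_number_gt`, the output layout, the line length, and the use of Arabic-Indic digits (U+0660 to U+0669) are assumptions for illustration. The one thing taken from tesstrain is that each .gt.txt transcription is a single line of text.

```python
import random
from pathlib import Path

# Map ASCII digits to Arabic-Indic digits (U+0660..U+0669).
ARABIC_INDIC = str.maketrans(
    "0123456789", "".join(chr(0x0660 + d) for d in range(10))
)

def write_number_gt(out_dir, n_files, numbers_per_line=10,
                    max_value=10**7, seed=0):
    """Write n_files single-line .gt.txt files of unique random numbers."""
    rng = random.Random(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # sample() draws without replacement, so no number repeats across files.
    values = rng.sample(range(max_value), n_files * numbers_per_line)
    paths = []
    for i in range(n_files):
        chunk = values[i * numbers_per_line:(i + 1) * numbers_per_line]
        line = " ".join(str(v) for v in chunk).translate(ARABIC_INDIC)
        path = out / f"{i:06d}.gt.txt"
        path.write_text(line + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Changing `numbers_per_line` makes it easy to try both layouts from question 3 (many short lines versus fewer long lines) on the same pool of numbers.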


For Arabic Text:


-> Prepared around 23k gt.txt files, each containing one sentence.

-> Generated .box and small .tif files for all gt.txt files using 1 font (Traditional Arabic).

-> Used the tesstrain repo and trained for 20,000 iterations.

-> After training, generated foo.traineddata with a 0.03 error rate.

-> Ran prediction on the real data. It works perfectly for the particular character on which the pretrained ara.traineddata fails, but in overall accuracy the pretrained ara.traineddata performs better, except for that one character.
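
The training error rate is measured on the synthetic training lines, so it says little about real scans. One way to put a number on "overall accuracy" is a character error rate (CER) over real lines with hand-checked transcriptions. The sketch below is plain Python and not part of tesstrain; `edit_distance` and `character_error_rate` are illustrative names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def character_error_rate(pairs):
    """pairs: iterable of (ground_truth, ocr_output) strings."""
    errors = total = 0
    for truth, predicted in pairs:
        errors += edit_distance(truth, predicted)
        total += len(truth)
    return errors / total if total else 0.0
```

Running both ara.traineddata and foo.traineddata over the same real lines and comparing the two CER values would show whether the custom model is worse overall or only on certain pages.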



Summary:


  • How do I fix one character in the pretrained ara.traineddata model? If that is not possible, how do I custom-train from scratch, or is there a way to annotate real images and prepare a dataset? Please suggest the best practice.
  • How do I prepare an Arabic number dataset and train on it? If custom training on numbers is not possible, can Arabic numbers be added to the pretrained model (ara.traineddata)?

 

GitHub link used for custom training Arabic text and numbers: https://github.com/tesseract-ocr/tesstrain

Shree Devi Kumar

Jul 12, 2020, 11:00:40 AM
to tesseract-ocr
What character are you trying to add?
Please share the training data to try and replicate the issue.


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com.

Eliyaz L

Jul 12, 2020, 12:31:14 PM
to tesseract-ocr
The letter "لا" is always predicted as "ال".

My training data here
My prediction document will be in Traditional Arabic font here.

Below shell command used to generate tif and box file from gt file: 

for i in $(seq -f "%06g" 006601 006798)
do
  echo "$i"
  text2image --xsize 3600 --ysize 300 --text "$i.gt.txt" --outputbase "/home/user/Desktop/$i" --font 'Traditional Arabic' --fonts_dir /home/user/.local/share/fonts/
done
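
The same loop could also be driven from Python, which makes it easier to stop at the first file that fails to render. This is only a sketch: `text2image_args` and `render_all` are hypothetical helper names, and the flags are simply the ones from the shell command above, not a recommended set.

```python
import shutil
import subprocess
from pathlib import Path

def text2image_args(gt_file, out_dir, font="Traditional Arabic",
                    fonts_dir="/home/user/.local/share/fonts/"):
    """Build the text2image command line for one .gt.txt file."""
    gt_file = Path(gt_file)
    base = gt_file.name[:-len(".gt.txt")]   # "006601.gt.txt" -> "006601"
    return [
        "text2image",
        "--xsize", "3600", "--ysize", "300",
        "--text", str(gt_file),
        "--outputbase", str(Path(out_dir) / base),
        "--font", font,
        "--fonts_dir", fonts_dir,
    ]

def render_all(gt_dir, out_dir):
    """Run text2image on every *.gt.txt in gt_dir; stop at the first failure."""
    if shutil.which("text2image") is None:
        raise RuntimeError("text2image not found on PATH")
    for gt_file in sorted(Path(gt_dir).glob("*.gt.txt")):
        subprocess.run(text2image_args(gt_file, out_dir), check=True)
```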


Input Image:

firstName.jpg


Rainer Verteidiger

Jul 12, 2020, 1:15:01 PM
to tesseract-ocr

>Always the letter "لا" is predicted as "ال" .

Not sure how relevant this is in the context of training models, but لا is not a letter! It's a ligature ("Arabic Ligature Lam with Alef") formed by combining ل ("Arabic Letter Lam") with ا ("Arabic Letter Alef"), whereas ال is ا followed by ل (so, exactly the other way around; no ligature). Both are incredibly common in Arabic texts, and although I have no clue about machine learning, I'm surprised that the training could miss the difference between them.
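
To make the difference concrete, the two sequences can be inspected with Python's standard `unicodedata` module. This is just an illustration of the code points involved; it says nothing about how Tesseract represents them internally.

```python
import unicodedata

LAM, ALEF = "\u0644", "\u0627"
lam_alef = LAM + ALEF   # stored lam first, then alef; rendered as the ligature
alef_lam = ALEF + LAM   # stored alef first, then lam; no ligature

def describe(text):
    """Return (code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

print(describe(lam_alef))
print(describe(alef_lam))

# Unicode also has a single presentation-form code point for the ligature.
# NFKC normalization maps it back to the two-letter sequence.
ligature = "\uFEFB"
print(unicodedata.name(ligature))
print(unicodedata.normalize("NFKC", ligature) == lam_alef)
```

So a ground-truth file may contain either the two-letter sequence or the presentation form, and normalizing the text before comparing OCR output against it avoids counting those as different.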

Shree Devi Kumar

Jul 12, 2020, 1:23:24 PM
to tesseract-ocr
@Eliyaz What version of tesseract are you using? Which traineddata?

>Always the letter "لا" is predicted as "ال" .

I think this was fixed by Ray Smith in 2017 and should be OK in the traineddata files in the tessdata_fast and tessdata_best repos.


--

____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

Shree Devi Kumar

Jul 12, 2020, 1:27:07 PM
to tesseract-ocr

Eliyaz L

Jul 12, 2020, 3:22:39 PM
to tesseract-ocr
Hi Shree,

I was using the version below. I guess you are right, it is a 2016 file. Let me test with the latest traineddata.


Meanwhile, can you please help me with Arabic numbers?
I tried ara_number.traineddata from here. It works for numbers, but I am unable to get the date format with slashes.
I also searched for similar issues here and here.

The main problem is with dates. I am trying to run prediction on an Arabic date in the format below.

Input image: 

date.jpg






Shree Devi Kumar

Jul 13, 2020, 4:24:57 AM
to tesseract-ocr
If I recall correctly, ara_number.traineddata was trained for the legacy engine. You cannot use two traineddata files that each use a different engine.

Regarding training of Arabic numbers and punctuation, it is currently an open issue. If you use the latest code from the tesstrain repo, it should automatically apply the bidi algorithm to handle Arabic text as well as numbers correctly. I am not so sure about punctuation such as ( ) etc. and whether it needs to be reversed or not.

I suggest that you use the latest code from the tesseract and tesstrain repos with the latest traineddata and try.
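
For background on why numbers, slashes and brackets behave differently from letters: the Unicode bidi algorithm works from a per-character class, which can be inspected from Python's standard library. The sketch below only prints those classes. It does not run the bidi algorithm and makes no claim about what tesstrain does with them; the sample date is made up.

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# 12/07/2020 written with Arabic-Indic digits (illustrative value only).
arabic_date = "\u0661\u0662/\u0660\u0667/\u0662\u0660\u0662\u0660"

print(bidi_classes("\u0644\u0627"))  # Arabic letters
print(bidi_classes(arabic_date))     # Arabic-Indic digits and slashes
print(bidi_classes("(1)"))           # brackets around a European digit
```

Arabic letters report class AL, Arabic-Indic digits AN, European digits EN, the slash CS (a number separator) and brackets ON (neutral), so each group is ordered by different rules within a right-to-left line.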


Eliyaz L

Jul 13, 2020, 8:55:41 AM
to tesseract-ocr
Thanks for the support, it saves a lot of time and effort.

I tried the latest tesseract. It works fine with Arabic text and numbers; the only issue is with Arabic dates.
So if the issue is still open, can I prepare a dataset and train a separate custom model for numbers and dates only?

If possible, please help me with a sample dataset. Can I use this repo to train? An approximate dataset size and iteration count would also be helpful.
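
As a starting point for a date dataset, here is a rough sketch that produces distinct dates with slashes in Arabic-Indic digits, one per ground-truth line. The DD/MM/YYYY layout, the date range and the name `random_dates` are assumptions; the order in which the parts should be stored for right-to-left text is exactly the open question, so the rendered images should be checked by eye.

```python
import random
from datetime import date, timedelta

# Map ASCII digits to Arabic-Indic digits (U+0660..U+0669).
ARABIC_INDIC = str.maketrans(
    "0123456789", "".join(chr(0x0660 + d) for d in range(10))
)

def random_dates(n, start=date(1950, 1, 1), end=date(2030, 12, 31), seed=0):
    """Return n distinct dates as DD/MM/YYYY strings in Arabic-Indic digits."""
    rng = random.Random(seed)
    span = (end - start).days + 1
    offsets = rng.sample(range(span), n)   # without replacement: no duplicates
    return [
        (start + timedelta(days=k)).strftime("%d/%m/%Y").translate(ARABIC_INDIC)
        for k in offsets
    ]
```

Each returned string would go into its own .gt.txt file and be rendered with text2image in the target fonts.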

Eliyaz L

Jul 14, 2020, 7:48:34 AM
to tesseract-ocr
Hi, sorry to bother, just a follow-up.

I tried the latest tesseract. It works fine with Arabic text and numbers; the only issue is with Arabic dates.
So if the issue is still open, can I prepare a dataset and train a separate custom model for numbers and dates only?

If possible, please help me with a sample dataset. Can I use this repo to train? An approximate dataset size and iteration count would also be helpful.

Shree Devi Kumar

Jul 14, 2020, 8:34:54 AM
to tesseract-ocr
@Eliyaz I do not know Arabic or any other RTL language.
I suggest you try running training with the latest code and tesstrain. You may have to experiment to get the best result.

I will try to do a test run with the data you provided. Does it include numbers and dates?


Eliyaz L

Jul 14, 2020, 10:27:26 AM
to tesseract-ocr
Sure, I'll try experimenting with Arabic numbers.
The data I shared is for Arabic text. Anyway, it is not required, as the latest ara.traineddata itself works for that particular Arabic letter "لا".
So if I need to train on English numbers only, for example eng_number.traineddata, can you please share 1 or 2 sample datasets showing how they should look? I'll follow the same for Arabic numbers.

Below is a sample with 3 datasets. If this format is correct, I can make 10k or 50k datasets with different numbers:

Screenshot from 2020-07-14 13-23-48.png

Anuradha B

Aug 13, 2020, 11:41:12 AM
to tesseract-ocr
I am trying to extract the Arabic dates and numbers from a national ID card. I am using the following code in an Anaconda Jupyter Notebook. I have also attached the image I used and the outputs from the grayscale, threshold, Canny, etc. functions. But none of the extracted text includes the dates and numerals. [I have also installed the Tesseract 4.0 alpha version.] Please suggest.

import cv2
import numpy as np
import pytesseract
from PIL import Image
from matplotlib import pyplot as plt

# Path to the Tesseract executable (Windows install)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# get grayscale image
def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# noise removal
def remove_noise(image):
    return cv2.medianBlur(image,5)
 
#thresholding
def thresholding(image):
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

#dilation
def dilate(image):
    kernel = np.ones((5,5),np.uint8)
    return cv2.dilate(image, kernel, iterations = 1)
    
#erosion
def erode(image):
    kernel = np.ones((5,5),np.uint8)
    return cv2.erode(image, kernel, iterations = 1)

#opening - erosion followed by dilation
def opening(image):
    kernel = np.ones((5,5),np.uint8)
    return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)

#canny edge detection
def canny(image):
    return cv2.Canny(image, 100, 200)

#skew correction
def deskew(image):
    coords = np.column_stack(np.where(image > 0))
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    (h, w) = image.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    return rotated

#template matching
def match_template(image, template):
    return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED) 

image = cv2.imread('image2.jpg')
if image is None:
    raise FileNotFoundError('image2.jpg could not be read')

gray = get_grayscale(image)
thresh = thresholding(gray)
# Use new names so the opening() and canny() functions are not overwritten
opened = opening(gray)
edges = canny(gray)

# Run OCR on each variant and print the results separated by a rule
variants = [image, gray, thresh, opened, edges]
for i, variant in enumerate(variants):
    if i > 0:
        print('----------------------------------------------------------------')
    print(pytesseract.image_to_string(variant, lang='eng+ara'))
Screenshot (164).png
Screenshot (162).png
Screenshot (161).png
Screenshot (163).png
image1.jpg
Screenshot (165).png

Mahmoud Mabrouk

Aug 13, 2020, 12:31:19 PM
to tesseract-ocr
For numbers I used this, and it works fine with AEN numbers:
https://github.com/ahmed-tea/tessdata_Arabic_Numbers

Anuradha B

Aug 19, 2020, 7:06:46 AM
to tesseract-ocr
Thanks Mahmoud. Do we just have to copy the ara_number.traineddata file from https://github.com/ahmed-tea/tessdata_Arabic_Numbers to the tessdata folder on the local system? I am using a Google Colab Jupyter notebook. How do we add these files in Colab? Please guide me. How do I make Tesseract in Colab pick up this newly trained numbers file?
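
In general, Tesseract looks for `<lang>.traineddata` in its tessdata directory, and the `--tessdata-dir` option points it at a different folder, so the file does not have to go into the system install. A minimal sketch under those assumptions: the helper names and any paths are hypothetical, and it presumes the file has already been downloaded into the notebook's filesystem.

```python
import shutil
from pathlib import Path

def install_traineddata(src, tessdata_dir):
    """Copy a .traineddata file into tessdata_dir and return its language name."""
    src = Path(src)
    if src.suffix != ".traineddata":
        raise ValueError(f"not a traineddata file: {src}")
    dest_dir = Path(tessdata_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, dest_dir / src.name)
    return src.stem   # "ara_number.traineddata" -> "ara_number"

def tesseract_command(image, lang, tessdata_dir):
    """Command line that reads models from tessdata_dir instead of the default."""
    return ["tesseract", str(image), "stdout",
            "--tessdata-dir", str(tessdata_dir), "-l", lang]
```

With pytesseract, the same option can be passed through the `config` argument, for example `config='--tessdata-dir "/path/to/tessdata"'` together with `lang='ara_number'`.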

write2...@gmail.com

Oct 27, 2020, 10:06:51 AM
to tesseract-ocr
I am not able to extract the text from these images. Is anyone able to extract it?
dob.jpg
doi.jpg

Sorosh Shiwa

Oct 27, 2020, 12:10:37 PM
to tesser...@googlegroups.com
Hello,
Thanks a lot for the information, but how can I use it in Flutter?
Please reply to my question.
Sorosh Shiwa
