How to prepare fonts folder to train from scratch

Essam Zaky

Mar 24, 2020, 4:05:03 PM

to tesseract-ocr

Hi all,

I would like to build *.traineddata files from scratch, specifically for English and Arabic.

Let's take English as an example.

My question is: how should I prepare the fonts folder?

I read the https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh file.

I found that this file contains only about 32 font names.

Should I add the other Latin fonts installed on the training machine to that "language-specific.sh" file?

I used the "Font Manager" tool and found about 147 fonts installed on the training machine.

I opened https://github.com/tesseract-ocr/langdata_lstm/blob/master/eng/okfonts.txt and it contains 4567 font names.

Should I search for, download, and install all the missing fonts on the training machine?

Should I collect all the font files from the training machine, create a new fonts folder "$HOME/.fonts", and copy all the fonts into that folder?

I see that fonts have different extensions (*.ttf, *.otf, *.afm, ...).

Do all font types work for training, or do I need a specific type?
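
One way to judge how far the installed fonts are from okfonts.txt is to compare the two name lists. The following is a minimal sketch, assuming okfonts.txt holds one font name per line and that the installed names were collected separately (for example from the output of `fc-list : family`). The helper names `normalize` and `missing_fonts` are hypothetical and not part of Tesseract.

```python
from typing import Iterable, List


def normalize(name: str) -> str:
    """Lower-case a font name and collapse runs of whitespace."""
    return " ".join(name.split()).lower()


def missing_fonts(wanted: Iterable[str], installed: Iterable[str]) -> List[str]:
    """Return the wanted font names that have no match among installed names.

    Matching is case-insensitive and ignores extra whitespace. Blank lines
    and duplicate entries in `wanted` are skipped.
    """
    have = {normalize(n) for n in installed if n.strip()}
    seen = set()
    missing = []
    for name in wanted:
        key = normalize(name)
        if key and key not in have and key not in seen:
            seen.add(key)
            missing.append(" ".join(name.split()))
    return missing


if __name__ == "__main__":
    # Hypothetical sample data; in practice, read okfonts.txt and the
    # installed-font list from files instead.
    wanted = ["Arial Bold", "FreeSerif", "Arial Bold"]
    installed = ["freeserif", "DejaVu Sans"]
    for font in missing_fonts(wanted, installed):
        print(font)
```

Exact name matching is only an approximation, since the same font can be reported under slightly different names by different tools.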

I will post another question about the required text data.

Thanks for your help.

Regards

Essam

Shree Devi Kumar

Mar 25, 2020, 1:14:05 AM

to tesseract-ocr

As far as I know, no one has replicated the from-scratch LSTM training done by Ray.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e605a197-000c-444a-9969-dd10346f2028%40googlegroups.com.

Essam Zaky

Mar 25, 2020, 2:11:27 AM

to tesseract-ocr

Thanks @shreeshrii

Could you answer the questions based on your experience?

Also, is it possible to get help from Ray?

Shree Devi Kumar

Mar 25, 2020, 2:50:42 AM

to tesseract-ocr

AFAIK Ray is involved in other projects at Google. Unlikely to get a reply from him.

See https://github.com/tesseract-ocr/tesstrain/wiki for training done by @stweil on similar scale for Fraktur. The pages list the hardware requirements, time taken etc.

Please check that you have enough resources to try and replicate the LSTM training.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Lorenzo Bolzani

Mar 25, 2020, 3:50:47 AM

to tesser...@googlegroups.com

Why do you want to do this?

Essam Zaky

Mar 25, 2020, 4:54:39 AM

to tesseract-ocr

@Lorenzo

I need to do this because the accuracy of the current Arabic model is not as good as that of the English one, and I have a lot of fonts that I need to add to the Arabic model.

Adding them by fine-tuning will affect the model, so I need to build it from scratch and make the model more general.

So I need to know how the English model was built and use it as a reference for a new Arabic model.

Lorenzo Bolzani

Mar 25, 2020, 5:31:15 AM

to tesser...@googlegroups.com

I think fine tuning may work very well in this case, no need to train from scratch. Training from scratch does not guarantee better results, especially if you don't do it correctly.

I suggest trying fine tuning first and seeing if the results are good enough for you. That way you also get comfortable with the training process.

Training from scratch is the same thing but more difficult, because you only see the results after many hours or days, and if you messed something up you need to start over. You also need to change the learning rate during training and monitor the training curves. I don't think there is a simple recipe.

If you want to preserve what the model learned so far as much as possible you can try two things:

1. fine tune with the new fonts and the old fonts (or similar ones).

2. try this: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00#training-just-a-few-layers

I recommend option 1 first: make it work correctly, then try option 2 and see if it makes things better.
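
For option 2, the linked page replaces the top layers of an existing model by passing `--append_index` and `--net_spec` to `lstmtraining` together with `--continue_from`. The sketch below only assembles such a command line. Every path is a hypothetical placeholder, and the `append_index` and `net_spec` values are copied from that page's example as an illustration, so they would need to be adapted to the actual model. The function name `few_layers_command` is made up for this sketch.

```python
import shlex
from typing import List


def few_layers_command(
    continue_from: str,
    traineddata: str,
    model_output: str,
    train_listfile: str,
    eval_listfile: str,
    append_index: int,
    net_spec: str,
    max_iterations: int = 3000,
) -> List[str]:
    """Build an lstmtraining argument list that retrains only the top layers.

    Layers above `append_index` are cut off the existing network and
    replaced by the layers described in `net_spec`.
    """
    return [
        "lstmtraining",
        "--continue_from", continue_from,
        "--traineddata", traineddata,
        "--append_index", str(append_index),
        "--net_spec", net_spec,
        "--model_output", model_output,
        "--train_listfile", train_listfile,
        "--eval_listfile", eval_listfile,
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    # All paths below are placeholders, not real files.
    cmd = few_layers_command(
        continue_from="ara.lstm",
        traineddata="train/ara/ara.traineddata",
        model_output="output/ara_layers",
        train_listfile="train/ara.training_files.txt",
        eval_listfile="eval/ara.training_files.txt",
        append_index=5,
        net_spec="[Lfx256 O1c111]",
    )
    # Print the command for review instead of running it.
    print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without Tesseract's training tools installed.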

Just make sure to split your data into training data and testing data at the very beginning and monitor the test accuracy to limit overfitting. You need a reliable way to compare results.
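
The split described above can be sketched as follows. This is a minimal example, assuming the training data is a list of per-line file names; the `.lstmf` names and the `split_train_eval` helper are hypothetical. A fixed seed keeps the split reproducible, so results from different runs stay comparable.

```python
import random
from typing import List, Sequence, Tuple


def split_train_eval(
    items: Sequence[str], eval_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split items into (train, eval) lists, deterministically for a given seed."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(items)
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one item for evaluation.
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


if __name__ == "__main__":
    files = [f"line_{i:03d}.lstmf" for i in range(20)]
    train, evaluation = split_train_eval(files, eval_fraction=0.1, seed=42)
    print(len(train), "training files,", len(evaluation), "evaluation files")
```

The two lists can then be written to separate text files and passed to `lstmtraining` as `--train_listfile` and `--eval_listfile`. The evaluation list should stay the same for every experiment being compared.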

Bye

Lorenzo

Essam Zaky

Mar 25, 2020, 6:02:24 AM

to tesseract-ocr

Thanks @Lorenzo and @Shree

I will give fine-tuning a try, and if the results are still not satisfactory I will go back to building from scratch.

Shree Devi Kumar

Mar 25, 2020, 6:15:42 AM

to tesseract-ocr

The issue with Arabic is related to RTL processing and how punctuation and digits are handled. If your training text does not have them, you will have greater success.

Essam Zaky

Mar 25, 2020, 6:34:14 AM

to tesseract-ocr

My target is to recognize Arabic with numbers and punctuation, plus English.
Some English lines contain an Arabic word,

and some Arabic lines contain an English word.

I did some page layout analysis, split the text into lines, and try to detect the language of each word based on the word geometry in the line.

If a line contains both Arabic and English, I pass the line to both the English engine and the Arabic engine, then select the final result based on the returned confidence.
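
The selection step can be sketched as below. This is only an illustration of picking the higher-confidence result: the `LineResult` type and `pick_best` function are hypothetical, and the recognition calls themselves are left out. It also assumes that the confidences from the two engines are on the same scale, which may not hold for separately trained models.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LineResult:
    lang: str          # engine that produced the text, e.g. "eng" or "ara"
    text: str          # recognized text for the line
    confidence: float  # confidence reported by that engine


def pick_best(results: Iterable[LineResult]) -> LineResult:
    """Return the result with the highest confidence (first one wins on ties)."""
    candidates = list(results)
    if not candidates:
        raise ValueError("no recognition results to choose from")
    return max(candidates, key=lambda r: r.confidence)


if __name__ == "__main__":
    # Hypothetical outputs of the two engines for the same line image.
    english = LineResult(lang="eng", text="Hello", confidence=62.0)
    arabic = LineResult(lang="ara", text="مرحبا", confidence=91.5)
    best = pick_best([english, arabic])
    print(best.lang, best.text)
```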
