How to prepare fonts folder to train from scratch


Essam Zaky

Mar 24, 2020, 4:05:03 PM
to tesseract-ocr
Hi all,

I would like to build *.traineddata files from scratch, specifically for English and Arabic.

Let's take English as an example. My question is: how should I prepare the fonts folder?

I found that the file "language-specific.sh" lists only about 32 font names. Should I add the other Latin fonts installed on the training machine to that file?

Using the "font manager" tool, I found about 147 fonts installed on the training machine. Should I search for, download, and install all the missing fonts on the training machine?

Should I collect all the font files from the training machine, create a new fonts folder "HOME/.fonts", and paste all the fonts into it?

I see that fonts have different extensions ("*.ttf", "*.otf", "*.afm", ...). Do all font types work for training, or do I need a specific type?


I will post a separate question about the required text data.

Thanks for your help.



Regards
Essam
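
On the font-format question above, one way to see what is actually sitting in a fonts folder is to inventory it by extension. The sketch below is only an illustration under stated assumptions: it assumes that TrueType/OpenType files (.ttf, .otf, .ttc) are the ones a fontconfig-based renderer can draw, while .afm files carry metrics only and no glyph outlines. The function names and the folder path are hypothetical, not part of the Tesseract tooling.

```python
from collections import defaultdict
from pathlib import Path

# Assumption: only these extensions hold glyph outlines usable for rendering
# training images; .afm files contain font metrics only.
RENDERABLE_EXTENSIONS = {".ttf", ".otf", ".ttc"}


def inventory_fonts(fonts_dir):
    """Group every file under fonts_dir (recursively) by lowercase extension."""
    by_extension = defaultdict(list)
    for path in sorted(Path(fonts_dir).rglob("*")):
        if path.is_file():
            by_extension[path.suffix.lower()].append(path)
    return dict(by_extension)


def renderable_fonts(fonts_dir):
    """Return only the files whose extension is in RENDERABLE_EXTENSIONS."""
    inventory = inventory_fonts(fonts_dir)
    return [p for ext in sorted(RENDERABLE_EXTENSIONS) for p in inventory.get(ext, [])]


if __name__ == "__main__":
    # Hypothetical location; point this at wherever the fonts were collected.
    fonts_dir = Path.home() / ".fonts"
    for ext, paths in sorted(inventory_fonts(fonts_dir).items()):
        print(f"{ext or '(no extension)'}: {len(paths)} file(s)")
    print(f"candidate fonts for rendering: {len(renderable_fonts(fonts_dir))}")
```

A listing like this only shows which files are present; whether a given font is actually picked up for rendering still has to be checked with the training tools themselves.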

Shree Devi Kumar

Mar 25, 2020, 1:14:05 AM
to tesseract-ocr
As far as I know no one has replicated the LSTM training done from scratch by Ray. 



--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e605a197-000c-444a-9969-dd10346f2028%40googlegroups.com.

Essam Zaky

Mar 25, 2020, 2:11:27 AM
to tesseract-ocr
Thanks @shreeshrii

Could you answer the questions based on your experience?

Also, is it possible to get help from Ray?

Shree Devi Kumar

Mar 25, 2020, 2:50:42 AM
to tesseract-ocr
AFAIK Ray is involved in other projects at Google. Unlikely to get a reply from him.

See https://github.com/tesseract-ocr/tesstrain/wiki for training done by @stweil on similar scale for Fraktur. The pages list the hardware requirements, time taken etc. 

Please check that you have enough resources to try and replicate the LSTM training.



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Lorenzo Bolzani

Mar 25, 2020, 3:50:47 AM
to tesser...@googlegroups.com
Why do you want to do this? 


Essam Zaky

Mar 25, 2020, 4:54:39 AM
to tesseract-ocr
@Lorenzo
I need to do this because the accuracy of the current Arabic model is not as good as the English one, and I have a lot of fonts that need to be added to the Arabic model.
Adding them by fine-tuning will affect the model, so I need to build from scratch and make the model more general.
So I need to know what was done for the English model and use it as a reference for a new Arabic model.



Lorenzo Bolzani

Mar 25, 2020, 5:31:15 AM
to tesser...@googlegroups.com
I think fine tuning may work very well in this case, no need to train from scratch. Training from scratch does not guarantee better results, especially if you don't do it correctly.

I suggest trying fine-tuning first and seeing whether the results are good enough for you. That way you also get comfortable with the training process.

Training from scratch is the same process but more difficult: you only see the results after many hours or days, and if you got something wrong you need to start over. You also need to change the learning rate during training and monitor the training curves. I don't think there is a simple recipe.

If you want to preserve what the model learned so far as much as possible you can try two things:

1. fine tune with the new fonts and the old fonts (or similar ones).


I recommend trying option 1 first and making it work correctly, then trying option 2 to see whether it improves things.

Just make sure to split your data into training data and testing data at the very beginning and monitor the test accuracy to limit overfitting. You need a reliable way to compare results.
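
A minimal sketch of such a one-time split, assuming the training data is a list of line-level sample names (the file names below are hypothetical). A fixed seed keeps the split reproducible, so results from different training runs are always compared on the same held-out set.

```python
import random


def split_train_test(items, test_fraction=0.1, seed=42):
    """Shuffle a copy of items reproducibly and split it once into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(items)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one sample for testing.
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # Hypothetical sample names; in practice these would be the ground-truth lines.
    samples = [f"line_{i:03d}.gt.txt" for i in range(100)]
    train, test = split_train_test(samples)
    print(len(train), "training samples,", len(test), "test samples")
```

If the goal is to generalize to unseen fonts, it may be worth splitting by font instead of by line, so that the test set contains fonts the model never saw during training.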


Bye

Lorenzo



Essam Zaky

Mar 25, 2020, 6:02:24 AM
to tesseract-ocr
Thanks @Lorenzo and @Shree
I will give fine-tuning a try, and if the results are still not satisfactory I will switch back to building from scratch.



Shree Devi Kumar

Mar 25, 2020, 6:15:42 AM
to tesseract-ocr
The issue with Arabic is related to RTL processing and how punctuation and digits are handled. If your training text does not have them, you will have greater success. 


Essam Zaky

Mar 25, 2020, 6:34:14 AM
to tesseract-ocr
My target is to recognize Arabic with numbers and punctuation, plus English.
Some English lines contain an Arabic word, and some Arabic lines contain an English word.

I did some page layout analysis, split the text into lines, and tried to detect the language of each word from the word geometry in the line.
If a line contains both Arabic and English, I pass it to both the English engine and the Arabic engine, then select the final result based on the returned confidence.
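
The selection step described above could be sketched roughly as follows. This is only an illustration: `EngineResult`, `pick_best`, the sample values, and the 0 to 100 confidence scale are assumptions, and the actual OCR calls are left out.

```python
from dataclasses import dataclass


@dataclass
class EngineResult:
    lang: str          # e.g. "eng" or "ara"
    text: str          # recognized text for one line
    confidence: float  # assumed scale: 0 to 100, higher is better


def pick_best(results, min_confidence=0.0):
    """Return the highest-confidence result, or None if none reaches min_confidence."""
    candidates = [r for r in results if r.confidence >= min_confidence]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.confidence)


if __name__ == "__main__":
    # Hypothetical outputs of the two engines for the same mixed-script line.
    line_results = [
        EngineResult("eng", "Invoice no. 123", 61.5),
        EngineResult("ara", "فاتورة رقم 123", 84.0),
    ]
    best = pick_best(line_results, min_confidence=50.0)
    print(best.lang if best else "no reliable result")
```

One caveat with this approach is that confidence values from two different models are not necessarily calibrated against each other, so it may help to check the selection against a small hand-labelled set of mixed lines.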