Incremental Training of Tesseract 4.0+ for Fraktur


Val LNB

Jan 28, 2020, 12:16:43 PM
to tesseract-ocr
How to perform incremental training on Tesseract 4.0+?


I want to improve the existing Fraktur (frk) model with some 6000 hand-curated lines from our library.

The ground truth for these lines contains 10 new Unicode characters that are not present in the German Fraktur model.
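
One way to see exactly which characters are new is to compare the character set of the ground truth with a reference list. The following is only a sketch: gt_charset and new_chars are hypothetical helper names, the reference list is assumed to be a plain file with one character per line (how it is produced from the model is left open here), and a UTF-8 locale is assumed so that multi-byte characters stay whole.

```shell
# Print each distinct character found in the ground-truth files, one per line.
# grep runs in the caller's locale (assumed UTF-8) so multi-byte characters
# are not split; sort uses byte order so the result is stable across locales.
gt_charset() {
    cat "$1"/*.gt.txt | grep -o . | LC_ALL=C sort -u
}

# Print characters that occur in the ground truth ($1 = directory) but not
# in the reference list ($2 = file with one known character per line).
new_chars() {
    known_sorted=$(mktemp)
    LC_ALL=C sort -u "$2" > "$known_sorted"
    gt_charset "$1" | LC_ALL=C comm -23 - "$known_sorted"
    rm -f "$known_sorted"
}
```

Called as new_chars GROUND_TRUTH_DIR known_chars.txt, it prints one line per character that appears in the transcriptions but not in the list.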


How can I continue training from the existing German fraktur model without full retraining?


Progress so far:


What, if anything, do I enter into START_MODEL?


It would be fantastic to see an example CLI command used for your incremental training. :)













Shree Devi Kumar

Jan 28, 2020, 12:24:11 PM
to tesseract-ocr
Please see https://github.com/tesseract-ocr/tesstrain/wiki

There are already newly trained models by @stweil for Fraktur.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1e79c1d6-de0c-4c87-b07c-9455b90cfef4%40googlegroups.com.

Val LNB

Jan 29, 2020, 9:02:40 AM
to tesseract-ocr
Thank you for the link!


Here are instructions that I have figured out so far for fine-tuning an existing model:

On Ubuntu 18.04, I first double-checked that the right packages were installed:
dpkg -s tesseract-ocr
dpkg -s tesseract-ocr-frk (not used, as I actually grabbed the latest model from https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_best/ and placed it in ~/train/tessdata/script under the name Fraktur.traineddata)
dpkg -s libtesseract-dev (unsure whether this package is necessary, but I installed it a while ago)

~$ tesseract --version
tesseract 4.0.0-beta.1


cd to the tesstrain directory

Then start the training process with the following command:

make -r training START_MODEL=Fraktur TESSDATA=~/train/tessdata/script GROUND_TRUTH_DIR=~/train/data_train_2020_1_28_16_49_54 MODEL_NAME=Frak_LV_J29

So ~/train/tessdata/script/Fraktur.traineddata will be used as the starting model, while GROUND_TRUTH_DIR holds 6k pairs of .gt.txt and .tif files.
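
Since every transcription needs a matching line image, a quick check of GROUND_TRUTH_DIR before starting a long run may be worthwhile. This is a minimal sketch, not part of tesstrain: list_unpaired is a hypothetical helper, and it assumes images sit next to the text files and share the base name with a .tif or .png extension.

```shell
# List ground-truth transcriptions that have no matching line image.
# Assumption: line_0001.gt.txt belongs to line_0001.tif or line_0001.png
# in the same directory.
list_unpaired() {
    dir=$1
    for txt in "$dir"/*.gt.txt; do
        [ -e "$txt" ] || continue      # the glob matched nothing
        base=${txt%.gt.txt}            # strip the .gt.txt suffix
        if [ ! -e "$base.tif" ] && [ ! -e "$base.png" ]; then
            printf '%s\n' "$txt"
        fi
    done
}
```

Running list_unpaired ~/train/data_train_2020_1_28_16_49_54 should print nothing if all pairs are complete.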

Defaults: a 10,000-epoch run, with 10% of GROUND_TRUTH_DIR used for testing, assuming the wiki is correct.
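
To make the split concrete for roughly 6,000 lines, here is a small arithmetic sketch. The function name split_counts and its arguments are illustrative only and are not tesstrain settings; the 10% figure is the one quoted above.

```shell
# Print how many lines end up in training vs. testing for a given
# total and a held-out percentage, using integer arithmetic.
split_counts() {
    total=$1
    eval_percent=$2
    eval_lines=$(( total * eval_percent / 100 ))
    train_lines=$(( total - eval_lines ))
    echo "train=$train_lines eval=$eval_lines"
}

split_counts 6000 10    # prints: train=5400 eval=600
```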

My only worry is that my .tif files apparently have no DPI information, so the default of 70 is used.

Are the warnings about the lack of DPI a bad sign?


Interestingly, .png files are used when running the training, so I could perhaps have skipped the conversion to .tif, since I started with .png! :)

Now, the big question: how long will it take to run 10,000 epochs on an average 4-core Xeon v3 VM?



 


Shree Devi Kumar

Jan 29, 2020, 9:40:38 AM
to tesseract-ocr
> tesseract 4.0.0-beta.1

This is quite old. I suggest you use the latest build.

Not sure if @stweil is actively watching this forum. You can post a question in the tesstrain repo.

 



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

hmaster

Apr 3, 2020, 7:10:18 AM
to tesseract-ocr
Hi Val,

How did you generate the 6k .gt.txt files from the tif files?

Thank you.