Need Help Learning How to Train Tesseract OCR on Fraktur Fonts - Mac - VietOCR v5.5.2 and Tesseract 4.1.0

Akos Simon

Oct 2, 2019, 1:26:45 AM
to tesseract-ocr

I am looking for good OCR recognition of Fraktur fonts with Tesseract. I installed VietOCR v5.5.2 and Tesseract 4.1.0 on my Mac, and now I am trying to find help on how to train it to do better. There are too many OCR errors.

How would I go about training the software? Can anyone help?

I am a complete beginner, sadly, and I do not even know how I managed to install the two components so far. This training step is not explained anywhere.

Any help in the right direction would be greatly appreciated.

Zdenko Podobny

Oct 2, 2019, 1:38:08 AM
to tesser...@googlegroups.com
Why do you think training will help you? What other options have you tried?

Zdenko


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/cb69ba1b-7539-4157-9b0f-698b82466f1b%40googlegroups.com.

Shree Devi Kumar

Oct 2, 2019, 3:48:43 AM
to tesseract-ocr


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Akos Simon

Oct 2, 2019, 5:58:34 AM
to tesseract-ocr
Training Tesseract

Tesseract is OCR (text recognition) software that can be trained.
I have gotten as far as installing Tesseract with a GUI on my iMac. But when I launch the GUI and open a scanned image set in Fraktur type, I find no option to train Tesseract and make it better at recognizing this very difficult, very old European typeface, which was used over the last 1000 years, but mostly before 1900.

So I wonder how one can train the software. As I mentioned, I am a novice who only started 3 days ago, and I am very confused here.

Hopefully this will change with your help? ;)

Thanks, Zdenko!





Zdenko Podobny

Oct 2, 2019, 10:54:08 AM
to tesser...@googlegroups.com
If you are a novice, the most stupid way to start (and to waste time) is with training.
Spend some time on research first. Maybe you will find that Tesseract is already trained for Fraktur. Did you try deu_frak.traineddata [1]?

If you still get bad results, please read the wiki [2] or post an example image. There are some known issues [3]; I am not sure how critical they will be for you.
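
As an illustration of trying an existing Fraktur model before any training, here is a minimal Python sketch that builds the `tesseract` command line for a given language code. The file name `page.png` and the helper names are made up for this example. `deu_frak` is the model named above; `frk` is, as far as I know, another Fraktur model in the tessdata repositories. The chosen `.traineddata` file has to sit in the tessdata directory, and `tesseract --list-langs` shows what is installed. Whether a given model works with the LSTM engine or only with the legacy engine is something to check for your installation.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="frk", psm=3):
    """Return the argument list for one tesseract CLI run.

    image       -- path of the scanned page (placeholder in this sketch)
    output_base -- tesseract writes its text to <output_base>.txt
    lang        -- name of the traineddata model, e.g. "deu_frak" or "frk"
    psm         -- page segmentation mode (3 is the CLI default)
    """
    return ["tesseract", str(image), str(output_base),
            "-l", lang, "--psm", str(psm)]


def ocr_page(image, output_base, lang="frk"):
    """Run tesseract if it is installed; return the text, else None."""
    if shutil.which("tesseract") is None:
        return None  # tesseract is not on PATH
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Show the commands that would be run for both candidate models.
    for lang in ("deu_frak", "frk"):
        print(" ".join(build_tesseract_command("page.png", f"page_{lang}", lang)))
```

Running the same page through both models and comparing the two text files is a cheap way to see how far the pre-trained data gets before thinking about training.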



Akos Simon

Oct 2, 2019, 5:09:59 PM
to tesseract-ocr
Thanks, Zdenko,

Tesseract does not "yet" have a well-trained, well-working Fraktur model available, not even the version from the University of Mannheim.

That is why I told myself I need to learn this and do it myself.
I have 30 years of background in graphics and studied physics before that, which will surely help here.

But I will naturally need a massive amount of guidance and pointers to get started.

I will read through your links. Thanks for those!

Helmut Wollmersdorfer

Oct 5, 2019, 1:02:23 PM
to tesseract-ocr
Hi Akos,

It depends on which period of Fraktur you want to OCR. For material from before 1750 you cannot expect very good results.

This one is from around 1770, set in Fraktur (similar to Breitkopffraktur), and the result is not so bad:

https://github.com/wollmers/ocr-deu-bio-testfiles/blob/master/naturgeschichte00gt/naturgeschichte00gt_0014.diff.txt


             lines   words   chars
items ocr:      26     164    1122
items grt:      26     160    1115
matches:         8     137    1074
edits:          18      28      57
subss:          18      22      32
inserts:         0       5      16
deletions:       0       1       9
precision:  0.3077  0.8354  0.9572
recall:     0.3077  0.8562  0.9632
accuracy:   0.3077  0.8303  0.9496
f-score:    0.3077  0.8457  0.9602
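
For readers wondering how the ratios relate to the counts, the short Python sketch below reproduces the character column of the figures above. The formulas are inferred from the posted numbers (they match to four decimals), not taken from the tool that produced the report, so treat them as an assumption: precision = matches / OCR items, recall = matches / ground-truth items, accuracy = matches / (matches + edits), and f-score as the harmonic mean of precision and recall. The function name `ocr_metrics` is made up for this example.

```python
def ocr_metrics(ocr_items, truth_items, matches, edits):
    """Derive the four ratios from the raw counts (inferred formulas)."""
    precision = matches / ocr_items        # share of OCR output that is correct
    recall = matches / truth_items         # share of ground truth that was found
    accuracy = matches / (matches + edits)
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f-score": f_score,
    }


# Character column of the page from around 1770.
chars = ocr_metrics(ocr_items=1122, truth_items=1115, matches=1074, edits=57)
for name, value in chars.items():
    print(f"{name}: {value:.4f}")
```

The line-level figures are much lower than the character-level ones because a single wrong character makes the whole line count as an edit.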

Even this one from around 1830, which mixes Schwabacher, Antiqua, cursive and Fraktur, gives a good result:

https://github.com/wollmers/ocr-deu-bio-testfiles/blob/master/isisvonoken00oken/isisvonoken00oken_0153.diff.txt


             lines   words   chars
items ocr:     108     766    5085
items grt:     105     765    5080
matches:        58     688    4956
edits:          50      86     185
subss:          47      69      68
inserts:         3       9      61
deletions:       0       8      56
precision:  0.5370  0.8982  0.9746
recall:     0.5524  0.8993  0.9756
accuracy:   0.5370  0.8889  0.9640
f-score:    0.5446  0.8988  0.9751
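
Counts like the ones in these reports can be produced by aligning the OCR output against a hand-checked ground truth. The sketch below does this at character level with `difflib.SequenceMatcher` from the Python standard library. It is only an illustration: the tool behind the reports above is not named in this thread and may align differently. The function name `count_edits` and the sample word are made up for this example. Following the posted numbers, "inserts" are characters present only in the OCR output and "deletions" are characters present only in the ground truth.

```python
from difflib import SequenceMatcher


def count_edits(ocr, truth):
    """Count matching, substituted, inserted and deleted characters."""
    counts = {"matches": 0, "subss": 0, "inserts": 0, "deletions": 0}
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, t1, t2, o1, o2 in matcher.get_opcodes():
        truth_len, ocr_len = t2 - t1, o2 - o1
        if tag == "equal":
            counts["matches"] += truth_len
        elif tag == "replace":
            # Pair up characters as substitutions; any surplus on one
            # side counts as inserts (OCR) or deletions (ground truth).
            paired = min(truth_len, ocr_len)
            counts["subss"] += paired
            counts["inserts"] += ocr_len - paired
            counts["deletions"] += truth_len - paired
        elif tag == "insert":
            counts["inserts"] += ocr_len
        elif tag == "delete":
            counts["deletions"] += truth_len
    return counts


# A typical Fraktur confusion: the long s read as "f".
print(count_edits(ocr="Naturgefchichte", truth="Naturgeschichte"))
```

Summing substitutions, inserts and deletions gives the "edits" row, which is the input needed for the accuracy figure.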
