best way to train german gothic font model?


Phillip Ströbel

Aug 29, 2019, 3:43:51 AM
to tesseract-ocr
dear tesseract community

at the moment, i'm trying to compare the performance of different ocr engines, one of which is tesseract.
i already have ground truth in PAGE XML files, where lines look as follows:

<TextLine id="tl_7" primaryLanguage="German" custom="readingOrder {index:0;}">
  <Coords points="1281,594 1734,594 1734,657 1281,657"/>
  <Baseline points="1282,642 1734,645"/>
  <Word id="w_w1aab1b9b2b3b1ab1" language="German" custom="readingOrder {index:0;}">
    <Coords points="1281,594 1360,594 1360,643 1281,643"/>
    <TextEquiv>
      <Unicode>den</Unicode>
    </TextEquiv>
    <TextStyle fontFamily="Times New Roman" fontSize="16.0"/>
  </Word>
  <Word id="w_w1aab1b9b2b3b1b1b1" language="German" custom="readingOrder {index:1;}">
    <Coords points="1391,597 1452,597 1452,657 1391,657"/>
    <TextEquiv>
      <Unicode>19.</Unicode>
    </TextEquiv>
    <TextStyle fontFamily="Times New Roman" fontSize="15.0"/>
  </Word>
  <Word id="w_w1aab1b9b2b3b1b2b1" language="German" custom="readingOrder {index:2;}">
    <Coords points="1467,597 1734,597 1734,657 1467,657"/>
    <TextEquiv>
      <Unicode>Heumonat.</Unicode>
    </TextEquiv>
    <TextStyle fontFamily="Times New Roman" fontSize="16.0"/>
  </Word>
  <TextEquiv>
    <Unicode>den 19. Heumonat.</Unicode>
  </TextEquiv>
</TextLine>

i tried to follow the tesseract tutorial to train a model from scratch. since i already have the coordinates of the line boxes, i created the .box file from the points attribute of the TextLine Coords (line-based, so it was something like "WordStr <left> <bottom> <right> <top> 0 #text \n <left> <bottom> <right> <top>"). when i try to produce the lstmf files, however, i get many warnings that no box overlaps the text line.
i already segmented the data for another ocr system, which expects line images and a text file with the corresponding transcription for each. i also found that the --psm option lets tesseract take single lines as input.
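
a minimal sketch of the conversion described above, from a PAGE XML points string to a WordStr box entry. PAGE XML measures y downwards from the top-left corner, while tesseract box files measure y upwards from the bottom-left corner, so the y values have to be flipped against the image height. the function name, the image height of 2000 and the exact layout of the closing tab line are assumptions for illustration, not something taken from the tutorial; whether a missing flip is behind the warnings here is only a guess.

```python
def page_points_to_box(points: str, image_height: int, text: str) -> str:
    """Build a WordStr box entry from a PAGE XML `points` string."""
    # PAGE XML stores "x,y" pairs with the origin at the top-left corner.
    xs, ys = zip(*(map(int, pair.split(",")) for pair in points.split()))
    left, right = min(xs), max(xs)
    # Box files count y upwards from the bottom edge, so flip the y values.
    bottom = image_height - max(ys)
    top = image_height - min(ys)
    coords = f"{left} {bottom} {right} {top} 0"  # trailing 0 = page number
    # The WordStr line carries the text; the tab line marks the end of the
    # text line (assumed layout, check it against your tesseract version).
    return f"WordStr {coords} #{text}\n\t {coords}\n"


entry = page_points_to_box(
    "1281,594 1734,594 1734,657 1281,657",
    image_height=2000,  # hypothetical page height in pixels
    text="den 19. Heumonat.",
)
print(entry, end="")
```

for a line image that has already been cut out, the same call with points spanning the whole image (0,0 to width,height) and image_height set to the line image height gives a box that covers the full image.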

since not everything in the tutorial is clear to me, i would like to ask the following questions:

  1. if i have line tiffs and the corresponding text, say line0001.tif and the text "den 19. Heumonat" in line0001.txt:
    1. how do i produce the unicharset file? i ran unicharset_extractor over all the .box files i had and this worked, but is it the right way?
    2. what do the .box files need to look like? which coordinates do i need to use, given that the line is the whole image?
    3. what data do i use to produce the required files? do i download the *.traineddata from tesseract, or do i produce it myself? does tesseract need a text file with all the text from the training data, a wordlist from the training data, and so on?
i'm sorry, i know there is documentation, but i find it very confusing. thanks in advance for any helpful hints.

best,

phillip

Shree Devi Kumar

Aug 29, 2019, 4:02:08 AM
to tesseract-ocr
Use https://github.com/OCR-D/ocrd-train since you have line images and transcription.
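
A rough sketch of staging existing line images and transcriptions for that repository. It assumes the layout described in the project README at the time: a `data/ground-truth` folder holding `NAME.tif` next to a single-line `NAME.gt.txt`. Please verify this against the current README. The function name and the source folder `lines` are hypothetical.

```python
import shutil
from pathlib import Path


def stage_ground_truth(src: Path, dest: Path) -> int:
    """Copy NAME.tif / NAME.txt pairs from src into dest as NAME.tif / NAME.gt.txt."""
    dest.mkdir(parents=True, exist_ok=True)
    staged = 0
    for image in sorted(src.glob("*.tif")):
        text_file = image.with_suffix(".txt")
        if not text_file.exists():
            continue  # skip images that have no transcription
        # Collapse the transcription to one line with single spaces.
        line = " ".join(text_file.read_text(encoding="utf-8").split())
        shutil.copy(image, dest / image.name)
        (dest / f"{image.stem}.gt.txt").write_text(line + "\n", encoding="utf-8")
        staged += 1
    return staged
```

Called as `stage_ground_truth(Path("lines"), Path("data/ground-truth"))`, it returns the number of pairs copied, so a count lower than the number of images points to missing transcriptions.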




Phillip Ströbel

Aug 30, 2019, 3:39:04 AM
to tesseract-ocr
thanks a lot for your rapid answer!

do i have to change the psm to 7 (treat the image as a single text line) in the makefile, or does this not matter much?

