Training tesseract for image with complex layout


Mozhi

Sep 29, 2021, 5:28:50 AM
to tesseract-ocr
Hi, 
I would like to fine-tune/train Tesseract on scanned documents similar to those in the FUNSD dataset: https://guillaumejaume.github.io/FUNSD/
So far I have found the tesstrain repository: https://github.com/tesseract-ocr/tesstrain
I looked at examples for this repo online, and they say that each training sample should contain only a single line of text, like the image below:


Screen Shot 2021-09-29 at 9.35.27 AM.png

But I would like to use data such as the forms in the FUNSD dataset, where JSON files contain the boxes and their text. How can I do end-to-end training for Tesseract, including the detection and line-finding phase that locates the boxes around the text?
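For reference, one possible way to reuse FUNSD-style annotations with tesstrain is to crop every annotated box into its own image and write the text next to it as a `.gt.txt` file, which is the image/transcription pairing tesstrain reads from its ground-truth folder. The sketch below is only an illustration under several assumptions: that each annotation file has a top-level `"form"` list whose entries carry `"box"` as `[x0, y0, x1, y1]` and `"text"`, that each box covers a single text line, and that Pillow is installed. The function names `funsd_lines` and `export_ground_truth` are hypothetical. It prepares data for the recognition model only; it does not train layout analysis or line finding.

```python
import json
from pathlib import Path


def funsd_lines(annotation: dict):
    """Yield (box, text) for each usable entity of one FUNSD-style annotation.

    Assumes entries look like {"box": [x0, y0, x1, y1], "text": "..."}.
    Entries with empty text or an empty/invalid box are skipped.
    """
    for entity in annotation.get("form", []):
        text = entity.get("text", "").strip()
        box = entity.get("box")
        if not text or not box or len(box) != 4:
            continue
        x0, y0, x1, y1 = box
        if x1 > x0 and y1 > y0:
            yield (x0, y0, x1, y1), text


def export_ground_truth(image_path, json_path, out_dir):
    """Write one cropped PNG plus one .gt.txt file per annotated box."""
    from PIL import Image  # Pillow (third-party), imported only when cropping

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    annotation = json.loads(Path(json_path).read_text(encoding="utf-8"))
    stem = Path(image_path).stem
    with Image.open(image_path) as page:
        for i, (box, text) in enumerate(funsd_lines(annotation)):
            name = f"{stem}_{i:04d}"
            # Image.crop takes (left, upper, right, lower), same order as the box.
            page.crop(box).save(out_dir / f"{name}.png")
            (out_dir / f"{name}.gt.txt").write_text(text + "\n", encoding="utf-8")
```

Calling `export_ground_truth(image, annotation, "data/MODEL-ground-truth")` for each page (with hypothetical paths) would fill a folder of line images and transcriptions; boxes that span several lines would need to be split or filtered out first.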

Thanks in advance!

Zdenko Podobny

Oct 1, 2021, 2:36:36 AM
to tesser...@googlegroups.com
I do not think you need to do any training.
Try the Fraktur language data (https://github.com/tesseract-ocr/tessdata_best/blob/main/script/Fraktur.traineddata) or frk.traineddata.
E.g.:
tesseract "Screen Shot 2021-09-29 at 9.35.27 AM.png" - -l script/Fraktur 
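If the same call is needed for many images, it could be scripted roughly as below. This is only a sketch: it assumes the `tesseract` executable is on the PATH and that `script/Fraktur.traineddata` is available in the tessdata directory. The helper names `tesseract_cmd` and `ocr` are made up for illustration. The optional `--psm 7` setting tells Tesseract to treat the image as a single text line, which may suit a crop like the one attached.

```python
import subprocess


def tesseract_cmd(image, lang="script/Fraktur", psm=None):
    """Build the argument list for: tesseract IMAGE - -l LANG [--psm N]."""
    cmd = ["tesseract", str(image), "-", "-l", lang]  # "-" sends the text to stdout
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def ocr(image, lang="script/Fraktur", psm=None):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_cmd(image, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, such as the screenshot above.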

As far as I remember (but I cannot find the link ;-) ), our German friends did quite complex training for this type of text/font.


Zdenko


On Wed, Sep 29, 2021 at 11:28, Mozhi <mozhgan...@gmail.com> wrote: