Training Tesseract for images with complex layouts

Mozhi

Sep 29, 2021, 5:28:50 AM

to tesseract-ocr

Hi,

I would like to fine-tune/train Tesseract on scanned documents similar to those in the FUNSD dataset, for example: https://guillaumejaume.github.io/FUNSD/

So far, what I have found is the tesstrain git repo: https://github.com/tesseract-ocr/tesstrain

I looked at examples for this repo on the internet, and they say that each training sample should be only a single line of text, like the image below:

Screen Shot 2021-09-29 at 9.35.27 AM.png

But I would like to provide data like the forms in the FUNSD dataset, where JSON files contain the boxes and their text. How can I do end-to-end training for Tesseract, including the detection phase and line finding, to find the boxes around the text?
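To make the data question concrete, here is a minimal sketch of how FUNSD annotations might be turned into the image/transcription pairs that tesstrain reads. It assumes each FUNSD JSON file has a top-level "form" list whose entries carry a "box" ([x0, y0, x1, y1]) and a "text" field, and that tesstrain picks up PNG images with a matching .gt.txt file. The function names and output naming scheme are made up for illustration, and entities that span several text lines would still need to be split or skipped by hand.

```python
import json
from pathlib import Path


def funsd_lines(annotation):
    """Yield (box, text) for each entity with non-empty text in a FUNSD dict."""
    for entity in annotation.get("form", []):
        text = entity.get("text", "").strip()
        box = entity.get("box")
        if text and box and len(box) == 4:
            yield tuple(box), text


def write_ground_truth(image_path, json_path, out_dir):
    """Crop every annotated box and write <name>.png plus <name>.gt.txt."""
    # Pillow is imported here so that funsd_lines() has no third-party dependency.
    from PIL import Image

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    annotation = json.loads(Path(json_path).read_text(encoding="utf-8"))
    stem = Path(image_path).stem

    with Image.open(image_path) as page:
        for i, (box, text) in enumerate(funsd_lines(annotation)):
            name = f"{stem}_{i:04d}"  # hypothetical naming scheme
            page.crop(box).save(out_dir / f"{name}.png")
            (out_dir / f"{name}.gt.txt").write_text(text + "\n", encoding="utf-8")
```

The output directory would then be something like data/MODEL_NAME-ground-truth in a tesstrain checkout. This only prepares recognition data; it does not touch the detection or line-finding step.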

Thanks in advance!

Zdenko Podobny

Oct 1, 2021, 2:36:36 AM

to tesser...@googlegroups.com

I do not think you need to do training.

Try to use Fraktur language data (https://github.com/tesseract-ocr/tessdata_best/blob/main/script/Fraktur.traineddata) or frk.traineddata.

E.g.

tesseract "Screen Shot 2021-09-29 at 9.35.27 AM.png" - -l script/Fraktur
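If you want to call this from a script, a rough Python equivalent could look like the sketch below. It assumes the tesseract binary is on your PATH and that Fraktur.traineddata sits in a script/ subfolder of your tessdata directory. The helper names are made up, and the tessdata_dir argument is only needed if the model lives somewhere non-default.

```python
import subprocess


def build_command(image_path, lang="script/Fraktur", tessdata_dir=None):
    """Build the tesseract call; '-' as output base sends the text to stdout."""
    cmd = ["tesseract", str(image_path), "-", "-l", lang]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def ocr_to_text(image_path, **kwargs):
    """Run tesseract and return the recognized text as a string."""
    result = subprocess.run(
        build_command(image_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```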

As far as I remember (but I cannot find the link ;-) ), our German friends did quite complex training for this type of text/font.

Zdenko

On Wed, Sep 29, 2021 at 11:28, Mozhi <mozhgan...@gmail.com> wrote:
