Line level training


fav...@gmail.com

Nov 11, 2018, 11:42:29 PM
to tesseract-ocr
Dear All,

Currently, Tesseract training is based on image/box pairs (TIFF and box files). Character-level box files are hard to create for scanned document images that were not generated by a program.
My question: is there a plan to support line-level training in the future? Thanks!

Regards,
Jun

Lorenzo Bolzani

Nov 12, 2018, 4:26:48 AM
to tesser...@googlegroups.com

Tesseract 4.x (the LSTM engine) trains on text lines, not individual characters.


Bye

Lorenzo

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/94b51a88-0b6b-4382-8551-430e5fe3841f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

fav...@gmail.com

Nov 12, 2018, 5:53:42 AM
to tesseract-ocr
Does that mean we can label existing images with text-line boxes instead of individual character boxes in the current Tesseract 4.0? I checked the box files generated by the training process and found that they still contain character boxes.

Thanks,
Jun


Lorenzo Bolzani

Nov 12, 2018, 6:38:19 AM
to tesser...@googlegroups.com
On Mon, Nov 12, 2018 at 11:53, <fav...@gmail.com> wrote:
Does that mean we can label existing images with text-line boxes instead of individual character boxes in the current Tesseract 4.0? I checked the box files generated by the training process and found that they still contain character boxes.

Yes, it is confusing. I use ocrd-train, and it generates boxes that span whole lines.

This is an example generated by a small Python script from ocrd-train:

M 0 0 244 50 0
I 0 0 244 50 0
T 0 0 244 50 0
- 0 0 244 50 0
U 0 0 244 50 0
C 0 0 244 50 0
O 0 0 244 50 0
     244 50 245 51 0

The ground truth is MIT-UCO and the image size is 244x50. Each character is listed individually, but every box covers the full line.

I use pre-cut images containing single lines, which is why the boxes cover the whole image. The same thing should work for a large image with multiple lines (but I have never done it myself).
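
For illustration, the listing above can be reproduced with a few lines of Python. This is only a minimal sketch, not the actual ocrd-train script: the function name `line_box_entries` is made up here, the image size is passed in directly instead of being read from the image file, and the handling of combining marks and the tab-based end-of-line entry are assumptions about what such a script does.

```python
import unicodedata


def line_box_entries(text, width, height):
    """Build box-file lines where every symbol gets the full-line box.

    Hypothetical helper for illustration; width and height are the
    pixel dimensions of the single-line image.
    """
    symbols = []
    for ch in text:
        if symbols and unicodedata.combining(ch):
            # Assumption: a combining mark stays attached to its base character.
            symbols[-1] += ch
        else:
            symbols.append(ch)

    # Every symbol shares the same box: the whole line image.
    entries = [f"{s} 0 0 {width} {height} 0" for s in symbols]

    # Assumption: a tab "symbol" with a 1x1 box just outside the image
    # marks the end of the text line.
    entries.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return entries


if __name__ == "__main__":
    print("\n".join(line_box_entries("MIT-UCO", 244, 50)))
```

Run with the ground truth MIT-UCO and a 244x50 image, this prints seven character entries followed by the end-of-line entry, which starts with a tab character.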

You could try using hOCR to split the file into lines; see here: https://github.com/OCR-D/ocrd-train/issues/7#issuecomment-419714852


BTW, the coords look like left, top, right, bottom and not <left> <bottom> <right> <top> as in the docs. Am I missing something?
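
One possible reading of the coordinate question, sketched under the assumption that the docs measure box coordinates from the bottom-left corner of the image: a box that covers the whole image comes out as the same four numbers in either convention, so the example above cannot tell the two apart. The helper name `to_box_coords` is hypothetical and is not part of Tesseract or ocrd-train.

```python
def to_box_coords(left, top, right, bottom, image_height):
    """Convert top-left-origin pixel coords to bottom-left-origin box coords.

    Hypothetical helper. Returns (left, bottom, right, top), with the
    vertical values measured upward from the bottom edge of the image.
    """
    return (left, image_height - bottom, right, image_height - top)


if __name__ == "__main__":
    # Full-image box on a 244x50 image: the numbers do not change.
    print(to_box_coords(0, 0, 244, 50, image_height=50))
    # A smaller box: the vertical values are flipped.
    print(to_box_coords(10, 5, 100, 20, image_height=50))
```

Only a box that does not span the full image height shows which convention a tool is really using.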


Bye

Lorenzo


 


fav...@gmail.com

Nov 12, 2018, 8:38:13 PM
to tesseract-ocr
It's clear now. Thanks for the information.

Jun
