Training Tesseract 4.0 (LSTM) on word-level bounding boxes

Shoaib

Aug 10, 2017, 6:08:05 PM

to tesseract-ocr

Hi everyone,

I would like to train Tesseract on my own dataset consisting of word images. I have bounding box information, but for whole words rather than individual characters. I referred to the following documentation on training Tesseract 4.0:

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

The documentation states: "The boxes only need to be at the textline level. It is thus far easier to make training data from existing image data." But later in the wiki, the box format that allows boxes at the text line level is said not to be implemented yet ("Box File Format - Second Option (NOT YET IMPLEMENTED)"). I would therefore like to know whether there is any way to train Tesseract using only word-level bounding boxes instead of character-level ones.

Thank you for your time.

Tao Shatoo

May 22, 2018, 12:45:47 AM

to tesseract-ocr

Not yet. I tried but failed. I'm waiting for the same API as you.

On Friday, August 11, 2017 at 6:08:05 AM UTC+8, Shoaib wrote:

ShreeDevi Kumar

May 22, 2018, 1:12:16 AM

to tesser...@googlegroups.com

You can see if generate_line_box.py from https://github.com/OCR-D/ocrd-train is helpful.

It requires single line images and matching ground truth to create the box files.
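As a rough illustration of what a line-level box file could look like, here is a minimal Python sketch. It is not the actual generate_line_box.py; the function name `line_box_entries` is hypothetical, and the layout it emits (every character assigned the bounding box of the whole line image, followed by a tab entry marking the end of the line) is an assumption about the format that script produces. The image width and height are passed in directly so the sketch needs no imaging library.

```python
import unicodedata


def line_box_entries(text, width, height):
    """Build box-file entries for one single-line image.

    Assumption: each character gets the box of the whole line
    (0 0 width height) on page 0, and a tab entry closes the line.
    """
    entries = []
    glyph = ""
    for char in text.strip():
        # Keep combining marks attached to the character they modify.
        if glyph and unicodedata.combining(char):
            glyph += char
            continue
        if glyph:
            entries.append(f"{glyph} 0 0 {width} {height} 0")
        glyph = char
    if glyph:
        entries.append(f"{glyph} 0 0 {width} {height} 0")
    if entries:
        # Tab entry marking the end of the text line.
        entries.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return entries


if __name__ == "__main__":
    # Hypothetical 100x20 pixel line image containing the text "ab".
    print("\n".join(line_box_entries("ab", 100, 20)))
```

With word images, each word image could be treated as its own one-word "line" in this scheme, though whether that trains well is not something this thread confirms.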

ShreeDevi

nick

May 22, 2018, 11:15:13 PM

to tesseract-ocr

Hi,

How can we train the Tesseract 4 beta with our own dataset of line images?

shree

May 23, 2018, 12:28:41 AM

to tesseract-ocr

On Wednesday, May 23, 2018 at 8:45:13 AM UTC+5:30, nick wrote:

Hi,
How can we train the Tesseract 4 beta with our own dataset of line images?

See https://github.com/OCR-D/ocrd-train
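Since that workflow depends on each line image having a matching ground-truth transcription, a quick consistency check before training may help. The sketch below is only an illustration: the helper name `find_unpaired` is hypothetical, and the `.tif` / `.gt.txt` naming convention is an assumption, so check the repository's README for the layout it actually expects.

```python
from pathlib import Path


def find_unpaired(ground_truth_dir, image_ext=".tif", text_ext=".gt.txt"):
    """Return (images without text, texts without image) as sorted base names.

    Assumption: a pair shares one base name, e.g. line1.tif and line1.gt.txt.
    """
    root = Path(ground_truth_dir)
    # Strip the full extension so "line1.gt.txt" and "line1.tif" both map to "line1".
    images = {p.name[: -len(image_ext)] for p in root.glob(f"*{image_ext}")}
    texts = {p.name[: -len(text_ext)] for p in root.glob(f"*{text_ext}")}
    return sorted(images - texts), sorted(texts - images)


if __name__ == "__main__":
    # Hypothetical directory name; adjust to wherever the line images live.
    missing_text, missing_image = find_unpaired("data/ground-truth")
    print("Images without transcription:", missing_text)
    print("Transcriptions without image:", missing_image)
```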

nick

May 23, 2018, 1:42:01 AM

to tesseract-ocr

Should all images have a constant height?
And do we need images in different font sizes?
