Bounding box

Jennil Thiyam

Jun 9, 2019, 4:50:34 AM

to tesser...@googlegroups.com

ই 110 4657 137 4701 0
ম্ফা 131 4660 191 4693 0
ল 185 4660 217 4689 0
, 217 4654 226 4667 0
226 4650 240 4689 0
জু 240 4650 277 4689 0
ন 269 4660 298 4689 0
298 4660 316 4689 0
১ 316 4660 332 4689 0
৩ঃ 334 4661 376 4688 0
376 4655 394 4701 0
হৌ 394 4655 441 4701 0
জি 436 4660 482 4701 0
ক 477 4660 512 4688 0

Each unit's bounding box is given by four coordinates. When we draw the boxes from these coordinates, some neighbouring boxes overlap by a small area. I want to know when such boxes should be considered bad. Does it matter if some boxes overlap by only a small area, or do the boxes have to be completely non-overlapping?
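To put a number on "a small area", here is a minimal Python sketch that measures the overlap between consecutive boxes. It assumes each line follows the usual box-file layout `<symbol> <left> <bottom> <right> <top> <page>`; the helper names (`Box`, `parse_box_line`, `overlap_area`, `box_area`) are made up for this illustration and are not part of Tesseract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one line of the form '<symbol> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so the symbol itself may be a space or be missing.
    fields = line.rstrip("\r\n").rsplit(" ", 5)
    numbers = [int(value) for value in fields[-5:]]
    # Assumption: a missing or empty symbol field stands for the space character.
    symbol = fields[0] if len(fields) == 6 and fields[0] else " "
    return Box(symbol, *numbers)


def box_area(box: Box) -> int:
    return (box.right - box.left) * (box.top - box.bottom)


def overlap_area(a: Box, b: Box) -> int:
    """Area shared by two axis-aligned boxes (0 if they only touch or are apart)."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.top, b.top) - max(a.bottom, b.bottom)
    return max(0, width) * max(0, height)


# The first four entries from the box file quoted above.
lines = [
    "ই 110 4657 137 4701 0",
    "ম্ফা 131 4660 191 4693 0",
    "ল 185 4660 217 4689 0",
    ", 217 4654 226 4667 0",
]
boxes = [parse_box_line(line) for line in lines]

for prev, cur in zip(boxes, boxes[1:]):
    shared = overlap_area(prev, cur)
    smaller = min(box_area(prev), box_area(cur))
    ratio = shared / smaller if smaller else 0.0
    print(f"{prev.symbol!r} / {cur.symbol!r}: overlap {shared} px^2 "
          f"({ratio:.0%} of the smaller box)")
```

For the first pair the boxes share a strip 6 pixels wide. Whether that is acceptable is a judgment call; the script only reports the overlap and does not decide whether a box is "good".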

Lorenzo Bolzani

Jun 9, 2019, 5:22:51 AM

to tesser...@googlegroups.com

I think you are talking about preparing the training data. With Tesseract 4.x you do not need to define a box for each character, just one big box for the whole line.

Bye

Lorenzo

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJxgoofQvKbRnRTEFzS5Y-tmtg9u9A1j5WWCp9eUHPQ1WKTHfQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Jennil Thiyam

Jun 9, 2019, 5:33:02 AM

to tesser...@googlegroups.com

After running tesstrain.sh to create the starter training data we get a .box file, right? That file contains the coordinates of each unit (exactly the bounding box of each unit). Could you please elaborate on that file, and send me a link explaining that bounding boxes are not needed for every unit but only for the whole line?

Lorenzo Bolzani

Jun 9, 2019, 6:11:13 AM

to tesser...@googlegroups.com

I do not use tesstrain.sh for training, but I assume it does the right thing, so a little overlap is unlikely to be a problem. In the many messages I have read on this mailing list, I have never seen this reported as an issue.

I use ocrd-train, which generates boxes for the whole line rather than for individual characters, and it works perfectly, at least for Latin characters. Also, the way an LSTM OCR works makes me think the boxes might get joined together into a single line, but I'm just speculating here.
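To illustrate what "one box for the whole line" means geometrically, here is a small Python sketch that joins per-character boxes into a single enclosing box. It is only an illustration under the assumption that boxes are (left, bottom, right, top) tuples; `enclosing_box` is a hypothetical helper, not something taken from ocrd-train or Tesseract, and it says nothing about what either tool does internally.

```python
from typing import Iterable, Tuple

# (left, bottom, right, top), the same order as in the box file above.
Rect = Tuple[int, int, int, int]


def enclosing_box(boxes: Iterable[Rect]) -> Rect:
    """Return the smallest rectangle that contains every given box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one box")
    lefts, bottoms, rights, tops = zip(*boxes)
    return (min(lefts), min(bottoms), max(rights), max(tops))


# A few character boxes from the line quoted at the top of the thread.
char_boxes = [
    (110, 4657, 137, 4701),
    (131, 4660, 191, 4693),
    (226, 4650, 240, 4689),
    (477, 4660, 512, 4688),
]

line_box = enclosing_box(char_boxes)
print(line_box)
```

Small overlaps between neighbouring character boxes make no difference to this result, since only the outermost edges are used.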

Bye

Lorenzo
