Box file layout for training tesseract4


Jul ius

Jan 25, 2019, 5:56:59 AM
to tesseract-ocr
Hi,

I'm interested in training Tesseract 4 with real data. The documentation is sparse and only covers training from font files, so I have a general question.


It says that the boxes need to cover the whole line in tesseract 4. 

When looking inside the linked box file I can clearly see that every box covers a single character.

Can anyone verify which layout for the boxes is right?

Timothy Snyder

Jan 25, 2019, 9:47:47 AM
to tesser...@googlegroups.com
I have successfully trained Tesseract 4.0 using boxes that cover an entire line. I was similarly confused by the mismatch between the docs and that example. I haven't tested training with character-bounding boxes, but I can confirm that textline boxes work fine.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1ab1e0b0-a70a-456b-ab58-2f240a3b479f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Li-Chung Chou

Jan 27, 2019, 12:20:06 PM
to tesseract-ocr
Hi Timothy,

I have the same question as Jul. Would you kindly share one 'textline' box file and the corresponding image file that you used? I assume that if I have one image containing one 'textline' reading "Thanks", its box file will have the following contents:

Thanks 10 10 500 30 0  //the 10 10 500 30 rectangle contains whole "Thanks" text?

But I was wondering: if my 'textline' has a space character in it, does it still work? For example, if I have an image containing one 'textline' reading "Thank you", will its box file look like this?

Thank you 10 10 800 30 0 //the 10 10 800 30 rectangle contains whole "Thank you" text?

I'm not sure whether my understanding is correct. It would be highly appreciated if you could share some examples or experience with us. Thank you very much!

Li-Chung

On Friday, January 25, 2019 at 10:47:47 PM UTC+8, Timothy Snyder wrote:

Jul ius

Jan 28, 2019, 9:01:49 AM
to tesseract-ocr
Hi,

That would also be my next question. Don't we need something like a separator? Some examples would be great. There is very little information on the internet, as Tesseract 4 is new.

Jul ius

Jan 30, 2019, 6:06:04 AM
to tesseract-ocr
Still interested in an example of box files for Tesseract 4...

Does anyone have an example for us? It would be great to see how we have to handle spaces in textlines.

Shree Devi Kumar

Jan 30, 2019, 6:45:59 AM
to tesser...@googlegroups.com
AFAIK the textline option for box files (WordStr) has NOT been implemented.

The workaround has been to use the bounding box of the whole line for every character on that line. Ref: ocrd-train project

Example:

च 0 0 1965 128 0
त् 0 0 1965 128 0
व 0 0 1965 128 0
ा 0 0 1965 128 0
र 0 0 1965 128 0
ि 0 0 1965 128 0
ं 0 0 1965 128 0
श 0 0 1965 128 0
त् 0 0 1965 128 0
स 0 0 1965 128 0
ह 0 0 1965 128 0
स् 0 0 1965 128 0
र 0 0 1965 128 0
ा 0 0 1965 128 0
ब् 0 0 1965 128 0
द 0 0 1965 128 0
ं 0 0 1965 128 0
  0 0 1965 128 0
व 0 0 1965 128 0
ा 0 0 1965 128 0
य् 0 0 1965 128 0
व 0 0 1965 128 0
ा 0 0 1965 128 0
ह 0 0 1965 128 0
ा 0 0 1965 128 0
र 0 0 1965 128 0
ा 0 0 1965 128 0
  0 0 1965 128 0
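
[Editor's sketch] The rows above can be generated from a ground-truth string plus the pixel size of the line image. This is a minimal Python sketch, not the ocrd-train implementation. The function names are hypothetical, and the grouping rule (attach any character with a nonzero Unicode combining class, such as the virama, to the preceding character) is an assumption inferred from the example, where त् stays together but vowel signs like ा get their own row.

```python
import unicodedata


def split_symbols(text):
    """Split text into box-file symbols.

    Assumption: a character with a nonzero canonical combining class
    (for example the Devanagari virama) is attached to the previous symbol.
    """
    symbols = []
    for ch in text:
        if symbols and unicodedata.combining(ch):
            symbols[-1] += ch
        else:
            symbols.append(ch)
    return symbols


def line_box_entries(text, width, height, page=0):
    """Return one box-file row per symbol, all sharing the whole-line box."""
    return [f"{sym} 0 0 {width} {height} {page}" for sym in split_symbols(text)]


if __name__ == "__main__":
    # Width and height are taken from the example above.
    for row in line_box_entries("चत्वारिंशत्", 1965, 128):
        print(row)
```

A space in the ground truth simply becomes a row whose symbol is a space, as in the example.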

text2image creates tif and box files when given a training text and a font. Those box files have one bounding box per character.

Example:

d 111 4658 135 4698 0
i 137 4658 148 4698 0
f 149 4658 163 4698 0
f 163 4658 177 4698 0
e 178 4657 202 4690 0
r 204 4657 221 4689 0
e 222 4657 246 4689 0
n 248 4657 272 4689 0
t 273 4657 288 4694 0
  288 4657 299 4697 0
N 299 4657 323 4697 0
e 325 4657 349 4689 0
w 349 4657 383 4689 0
  383 4656 390 4697 0
A 390 4656 418 4697 0
r 417 4656 434 4688 0
t 435 4656 450 4693 0
i 451 4656 462 4696 0
c 464 4656 487 4688 0
l 489 4656 500 4696 0
e 502 4656 526 4688 0
s 528 4656 550 4688 0
  550 4651 561 4688 0
p 561 4651 585 4688 0
a 587 4656 610 4688 0
g 612 4649 636 4688 0
e 638 4655 662 4687 0
  662 4655 674 4696 0
2 674 4655 696 4696 0
3 699 4654 723 4696 0
  723 4654 734 4696 0
a 734 4655 757 4687 0
  757 4655 767 4695 0
T 767 4655 791 4695 0
o 791 4655 815 4687 0
  815 4653 826 4696 0
S 826 4653 851 4696 0
e 852 4654 876 4686 0
r 878 4654 895 4686 0
v 895 4654 919 4686 0
i 919 4654 930 4694 0
c 932 4654 955 4686 0
e 957 4654 981 4686 0
  981 4654 994 4686 0
~ 994 4669 1016 4680 0
~ 1020 4669 1042 4680 0
  1042 4653 1053 4685 0
a 1053 4653 1076 4685 0
  1076 4653 1087 4693 0
d 1087 4653 1111 4693 0
e 1113 4653 1137 4685 0
t 1138 4653 1153 4690 0
a 1154 4653 1177 4685 0
i 1179 4653 1190 4693 0
l 1192 4653 1203 4693 0
s 1205 4653 1227 4685 0
  1227 4653 1239 4693 0
D 1239 4653 1264 4693 0
C 1267 4651 1292 4693 0
  1292 4651 1302 4693 0
t 1302 4652 1317 4689 0
h 1318 4652 1342 4692 0
a 1344 4652 1367 4684 0
t 1368 4652 1383 4689 0
  1383 4652 1393 4692 0
d 1393 4652 1417 4692 0
o 1419 4652 1443 4684 0
n 1445 4652 1469 4684 0
' 1472 4680 1479 4692 0
t 1479 4651 1494 4689 0
  1494 4651 1504 4689 0
a 1504 4651 1527 4683 0
s 1529 4651 1551 4683 0
  1551 4651 1561 4691 0
7 1561 4651 1582 4691 0
  1582 4651 1591 4691 0
« 1591 4654 1609 4682 0
« 1610 4654 1628 4682 0
  1628 4651 1639 4691 0
D 1639 4651 1664 4691 0
a 1666 4651 1689 4683 0
t 1690 4650 1705 4688 0
e 1706 4650 1730 4682 0
: 1733 4650 1741 4676 0
  1741 4650 1751 4685 0
# 1751 4650 1781 4685 0
1 1781 4650 1799 4690 0
  1799 4650 1811 4690 0
: 1811 4650 1819 4676 0
  1819 4650 1827 4690 0
A 1827 4650 1855 4690 0
Z 1854 4650 1875 4690 0
1875 4689 1876 4690 0
_ 110 4559 138 4561 0
_ 138 4559 166 4561 0
_ 166 4558 194 4561 0
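
[Editor's sketch] To inspect a box file like the one above programmatically, something along these lines may help. It assumes each row is "symbol left bottom right top page" with the origin at the bottom-left of the image, and that a row whose symbol is a tab character ends a text line (as described later in this thread). The `Box` type and function names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_row(row):
    """Parse one 'symbol left bottom right top page' row.

    Splitting from the right preserves a symbol that is itself a
    space or a tab character.
    """
    symbol, *numbers = row.rstrip("\r\n").rsplit(" ", 5)
    if len(numbers) != 5:
        raise ValueError(f"not a box row: {row!r}")
    return Box(symbol, *map(int, numbers))


def group_text_lines(boxes):
    """Split boxes into text lines, ending a line at each tab symbol."""
    lines, current = [], []
    for box in boxes:
        if box.symbol == "\t":
            lines.append(current)
            current = []
        else:
            current.append(box)
    if current:
        lines.append(current)
    return lines
```

A row that contains only five numbers and no symbol raises ValueError rather than being silently misread.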




--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Jan 30, 2019, 6:48:43 AM
to tesser...@googlegroups.com

tc...@zips.uakron.edu

Jan 30, 2019, 2:34:17 PM
to tesseract-ocr

Each textline in the image has one line in the box file for each character in the textline. The box dimensions following a character are not for that single character but for the whole textline. After all the characters of one textline are written to the box file, you need to include a tab character to flag the end of the line. It's a little tricky to explain, but the examples I provided should be mostly self-explanatory. Let me know if you have any questions.
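
[Editor's sketch] The layout described above could be written out roughly as follows. This is a Python sketch under stated assumptions: the function names are hypothetical, the text is split per code point for simplicity (scripts with combining marks need grouping, as in Shree's Devanagari example), and the 1x1 box given to the tab row at the right edge of the line is modelled on the text2image output earlier in the thread rather than on any documented requirement.

```python
def box_rows_for_page(text_lines, page=0):
    """Build box-file rows for several text lines on one page.

    text_lines: iterable of (text, (left, bottom, right, top)) tuples,
    with coordinates measured from the bottom-left of the image.
    """
    rows = []
    for text, (left, bottom, right, top) in text_lines:
        # Every character repeats the bounding box of its whole text line.
        for ch in text:
            rows.append(f"{ch} {left} {bottom} {right} {top} {page}")
        # A tab symbol flags the end of the text line (box size is an assumption).
        rows.append(f"\t {right} {top - 1} {right + 1} {top} {page}")
    return rows


def write_box_file(path, text_lines, page=0):
    """Write the rows as UTF-8 with Unix line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for row in box_rows_for_page(text_lines, page):
            handle.write(row + "\n")
```

For example, `write_box_file("page.box", [("Thank you", (10, 10, 800, 30))])` would produce nine character rows, including one for the space, followed by a single tab row.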

Li-Chung Chou

Feb 2, 2019, 9:12:15 PM
to tesseract-ocr
Hi Shree,

Thanks for your kind response! It's very clear. I was also curious about languages in which a "character" may consist of multiple "glyphs" (I'm not sure these are the correct English terms, so apologies in advance). Your example covers this part too. Thank you so much!

Best Regards,
Li-Chung

On Wednesday, January 30, 2019 at 7:48:43 PM UTC+8, shree wrote:

Li-Chung Chou

Feb 2, 2019, 9:15:48 PM
to tesseract-ocr
Hi Timothy,

Yes, your example is awesome! I was also wondering about the "using a tab to separate multiple text lines" part, and your example explains this perfectly. I really appreciate your kind information and response. Thanks again!

Best Regards,
Li-Chung

On Thursday, January 31, 2019 at 3:34:17 AM UTC+8, tc...@zips.uakron.edu wrote:

Shree Devi Kumar

Feb 3, 2019, 12:21:44 PM
to tesser...@googlegroups.com
The easiest way to see the box file layout for any language is to run 'text2image' on a training text sample of 2-3 lines.
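
[Editor's sketch] One way to script that, assuming text2image from the Tesseract training tools is installed and on PATH. The helper names are hypothetical, and the file names, font name, and fonts directory are placeholders to replace with your own. The four flags used are the basic text2image options as I understand them, so check `text2image --help` for your build.

```python
import shutil
import subprocess


def text2image_command(text_path, output_base, font, fonts_dir):
    """Build the text2image argument list; all values are caller-supplied."""
    return [
        "text2image",
        f"--text={text_path}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


def render_sample(text_path, output_base, font, fonts_dir):
    """Run text2image if available and return the expected box-file path.

    Returns None when the text2image binary is not found on PATH.
    """
    cmd = text2image_command(text_path, output_base, font, fonts_dir)
    if shutil.which(cmd[0]) is None:
        return None
    subprocess.run(cmd, check=True)
    # text2image writes <output_base>.tif and <output_base>.box.
    return f"{output_base}.box"
```

Calling `render_sample("sample.txt", "sample", "Arial", "/usr/share/fonts")` with a 2-3 line sample.txt should leave a sample.box next to sample.tif that you can open in a text editor.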

mohito

Feb 25, 2019, 8:10:16 AM
to tesseract-ocr
Hi,

Would you be so kind as to make this link public or give me permission to see your examples?
Seeing an example would help so much.

Best Regards

Timothy Snyder

Mar 1, 2019, 10:43:18 AM
to tesser...@googlegroups.com
Sorry for the delay. You have access now. I need to set the link to public!


mohitolp .

Apr 9, 2019, 4:36:09 AM
to tesser...@googlegroups.com
thank you very much, that helped a lot :D
