BOX File Automatic Generation using the word coordinates


eng.ahmed....@gmail.com

Aug 23, 2018, 7:33:07 AM
to tesseract-ocr
I want to train Tesseract 4 using images and ground-truth text. I have generated the box file for a page in the format below.


D 1107 191 1167 209 0
a 1107 191 1167 209 0
t 1107 191 1167 209 0
e 1107 191 1167 209 0
: 1107 191 1167 209 0
  1107 191 1167 209 0
2 1202 192 1294 209 0
0 1202 192 1294 209 0
1 1202 192 1294 209 0
8 1202 192 1294 209 0
- 1202 192 1294 209 0
1 1202 192 1294 209 0
- 1202 192 1294 209 0
3 1202 192 1294 209 0
  1294 209 1295 210 0
W 157 237 313 323 0
a 157 237 313 323 0
l 157 237 313 323 0
  157 237 313 323 0
m 321 256 402 322 0
  321 256 402 322 0
a 406 256 454 323 0
  406 256 454 323 0
r 460 237 525 323 0
t 460 237 525 323 0
  460 237 525 323 0
e 967 261 1041 280 0
- 967 261 1041 280 0
S 967 261 1041 280 0
D 967 261 1041 280 0
R 967 261 1041 280 0
  967 261 1041 280 0
s 1049 261 1113 281 0
e 1049 261 1113 281 0
r 1049 261 1113 281 0
i 1049 261 1113 281 0
a 1049 261 1113 281 0
l 1049 261 1113 281 0
  1049 261 1113 281 0
n 1123 267 1167 281 0
o 1123 267 1167 281 0
. 1123 267 1167 281 0
: 1123 267 1167 281 0
  1123 267 1167 281 0
  1203 263 1372 281 0
C 1203 263 1372 281 0
A 1203 263 1372 281 0
1 1203 263 1372 281 0
8 1203 263 1372 281 0
0 1203 263 1372 281 0
1 1203 263 1372 281 0
0 1203 263 1372 281 0
3 1203 263 1372 281 0
0 1203 263 1372 281 0
6 1203 263 1372 281 0
2 1203 263 1372 281 0
2 1203 263 1372 281 0
3 1203 263 1372 281 0
  1372 281 1373 282 0


Here I assigned the word's coordinates to every letter of that word (for example, every letter of "Date:" gets the same box) and marked the end of each text line with \t.
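The scheme described above can be sketched roughly as follows. This is only an illustration of how such a file might be produced, not a confirmed recipe for a valid training file. The function names `box_line` and `words_to_box_lines` and the `Word` tuple are hypothetical, and the sketch assumes the word coordinates are already in the coordinate system Tesseract expects. It also assumes, following the data above, that a space reuses the preceding word's box and that the end-of-line tab gets a 1-pixel box at the last word's corner.

```python
from typing import List, Tuple

# Hypothetical input type: (text, x1, y1, x2, y2) for one word.
Word = Tuple[str, int, int, int, int]


def box_line(glyph: str, x1: int, y1: int, x2: int, y2: int, page: int) -> str:
    """Format one box-file entry: glyph, four coordinates, page number."""
    return f"{glyph} {x1} {y1} {x2} {y2} {page}"


def words_to_box_lines(lines: List[List[Word]], page: int = 0) -> List[str]:
    """Give every character the box of its word; end each text line with a tab."""
    out: List[str] = []
    for words in lines:
        for i, (text, x1, y1, x2, y2) in enumerate(words):
            for ch in text:
                out.append(box_line(ch, x1, y1, x2, y2, page))
            # Assumption: the space between two words reuses the preceding word's box.
            if i < len(words) - 1:
                out.append(box_line(" ", x1, y1, x2, y2, page))
        if words:
            # Assumption: the end-of-line marker is a tab with a 1-pixel box
            # placed at the (x2, y2) corner of the last word.
            _, _, _, x2, y2 = words[-1]
            out.append(box_line("\t", x2, y2, x2 + 1, y2 + 1, page))
    return out


if __name__ == "__main__":
    sample = [[("Date:", 1107, 191, 1167, 209), ("2018-1-3", 1202, 192, 1294, 209)]]
    print("\n".join(words_to_box_lines(sample)))
```

Run on the first text line of the page, this reproduces the entries listed above for "Date:" and "2018-1-3", including the final 1-pixel box.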

Attached is an example TIFF and box file. The problem is that I get a CTC compute failure, and I have the same issue when I generate the box file with Tesseract itself.


How do I make a valid box file for a page?



 

train_sample.zip

James Q

Aug 24, 2018, 5:23:13 AM
to tesseract-ocr
Correct me if I am wrong, but shouldn't each character be bound by its own box? Try opening this in JTessBoxEditor ( http://vietocr.sourceforge.net/training.html ).
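If per-character boxes do turn out to be what is needed, one rough way to approximate them from word coordinates is to divide each word box evenly among its characters. The sketch below is only that, an approximation: the function name `split_word_box` is hypothetical, and the equal-width assumption is wrong for proportional fonts, so the result would still need checking in a box editor.

```python
from typing import List


def split_word_box(text: str, x1: int, y1: int, x2: int, y2: int,
                   page: int = 0) -> List[str]:
    """Split one word box into equal-width boxes, one per character.

    Assumption: every glyph is equally wide, which is only a rough guess.
    """
    n = len(text)
    if n == 0:
        return []
    width = x2 - x1
    lines: List[str] = []
    for i, ch in enumerate(text):
        # Integer arithmetic keeps adjacent boxes contiguous and ends exactly at x2.
        left = x1 + (width * i) // n
        right = x1 + (width * (i + 1)) // n
        lines.append(f"{ch} {left} {y1} {right} {y2} {page}")
    return lines


if __name__ == "__main__":
    for line in split_word_box("Date:", 1107, 191, 1167, 209):
        print(line)
```

The vertical extent is left unchanged, so each character inherits the height of the whole word.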