I want to train Tesseract 4 using images and ground-truth text. I have generated the BOX file for a page in the format below.
D 1107 191 1167 209 0
a 1107 191 1167 209 0
t 1107 191 1167 209 0
e 1107 191 1167 209 0
: 1107 191 1167 209 0
1107 191 1167 209 0
2 1202 192 1294 209 0
0 1202 192 1294 209 0
1 1202 192 1294 209 0
8 1202 192 1294 209 0
- 1202 192 1294 209 0
1 1202 192 1294 209 0
- 1202 192 1294 209 0
3 1202 192 1294 209 0
1294 209 1295 210 0
W 157 237 313 323 0
a 157 237 313 323 0
l 157 237 313 323 0
157 237 313 323 0
m 321 256 402 322 0
321 256 402 322 0
a 406 256 454 323 0
406 256 454 323 0
r 460 237 525 323 0
t 460 237 525 323 0
460 237 525 323 0
e 967 261 1041 280 0
- 967 261 1041 280 0
S 967 261 1041 280 0
D 967 261 1041 280 0
R 967 261 1041 280 0
967 261 1041 280 0
s 1049 261 1113 281 0
e 1049 261 1113 281 0
r 1049 261 1113 281 0
i 1049 261 1113 281 0
a 1049 261 1113 281 0
l 1049 261 1113 281 0
1049 261 1113 281 0
n 1123 267 1167 281 0
o 1123 267 1167 281 0
. 1123 267 1167 281 0
: 1123 267 1167 281 0
1123 267 1167 281 0
1203 263 1372 281 0
C 1203 263 1372 281 0
A 1203 263 1372 281 0
1 1203 263 1372 281 0
8 1203 263 1372 281 0
0 1203 263 1372 281 0
1 1203 263 1372 281 0
0 1203 263 1372 281 0
3 1203 263 1372 281 0
0 1203 263 1372 281 0
6 1203 263 1372 281 0
2 1203 263 1372 281 0
2 1203 263 1372 281 0
3 1203 263 1372 281 0
1372 281 1373 282 0
Here I gave every letter the coordinates of the word it belongs to (for example, each character of "Date:" carries the box of the whole word), and I ended each text line with a \t entry.
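To make the layout above concrete, here is a minimal Python sketch of how such rows could be produced. The function name `box_rows` and the `Word` tuple are hypothetical and for illustration only. The rows follow the `<symbol> <left> <bottom> <right> <top> <page>` pattern of the listing. Putting a space row after every word except the last, and a 1x1 tab row at the top-right corner of the last word, are assumptions taken from the first text line of the listing, not a confirmed description of what the LSTM trainer accepts.

```python
from typing import List, Tuple

# One word: (text, left, bottom, right, top), coordinates in pixels.
Word = Tuple[str, int, int, int, int]


def box_rows(lines: List[List[Word]], page: int = 0) -> List[str]:
    """Build box-file rows where every character reuses its word's box."""
    rows: List[str] = []
    for words in lines:
        if not words:
            continue  # nothing to emit for an empty text line
        for i, (text, left, bottom, right, top) in enumerate(words):
            # One row per character, all sharing the word's coordinates.
            for ch in text:
                rows.append(f"{ch} {left} {bottom} {right} {top} {page}")
            if i < len(words) - 1:
                # Space symbol (first blank) plus field separator (second blank),
                # reusing the box of the word that was just written.
                rows.append(f"  {left} {bottom} {right} {top} {page}")
        # End-of-line marker: a tab with a 1x1 box at the last word's
        # top-right corner (assumption, mirroring the listing above).
        _, _, _, right, top = words[-1]
        rows.append(f"\t {right} {top} {right + 1} {top + 1} {page}")
    return rows


# First text line of the page shown above.
example = [[("Date:", 1107, 191, 1167, 209), ("2018-1-3", 1202, 192, 1294, 209)]]
rows = box_rows(example)
print("\n".join(rows))
```

Joining the rows with "\n" and writing them to a .box file keeps the leading space and tab symbols intact. Those are easy to lose when the file is pasted or edited by hand.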
Here is an example of the tif and box files. The problem is that training fails with a CTC compute failure, and I get the same issue when I generate the BOX file with Tesseract itself.
How do I make a valid BOX file for a page?