Get line or letter height?

John Muccigrosso

Jun 6, 2017, 6:08:32 PM
to tesseract-ocr
The wiki suggests making sure that the x-height of text is at least 20 px. Is there a fairly straightforward way to estimate this without manually examining the image? Getting the average or median from hOCR output or something?

John Muccigrosso

Oct 22, 2017, 1:49:05 PM
to tesseract-ocr
On Tuesday, June 6, 2017 at 6:08:32 PM UTC-4, John Muccigrosso wrote:
The wiki suggests making sure that the x-height of text is at least 20 px. Is there a fairly straightforward way to estimate this without manually examining the image? Getting the average or median from hOCR output or something?

Months later...

It looks like what I want to do is create a box file, so checking out the wiki, I modified the instructions to create this command, which seems to do what I want:

tesseract text_image_file output_file_name makebox

Output looks like this:

C 261 2453 285 2480 0
A 287 2454 312 2480 0
P 315 2454 334 2479 0
I 337 2454 347 2480 0
T 349 2454 372 2481 0
O 374 2454 402 2480 0
L 406 2454 426 2480 0
I 429 2454 439 2480 0
N 442 2454 471 2480 0
E 473 2454 494 2480 0


So now I need to process this output to get the letter heights (on each line, the fourth number minus the second, i.e. top minus bottom) and then grab the median.
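That processing step could look something like the rough Python sketch below. It assumes each box line has the form "character left bottom right top page", as in the sample output above; the function name box_heights is just made up for illustration. Note that for an all-caps word like this one the box heights are capital-letter heights, which are larger than the x-height, so the median is only a rough stand-in.

```python
import statistics


def box_heights(box_text):
    """Return the height (top - bottom) of every box in makebox-style text."""
    heights = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        try:
            # Last five fields are assumed to be: left bottom right top page
            left, bottom, right, top, page = map(int, parts[-5:])
        except ValueError:
            continue  # skip lines whose coordinates are not integers
        heights.append(top - bottom)
    return heights


# Sample taken from the output shown above.
sample = """\
C 261 2453 285 2480 0
A 287 2454 312 2480 0
P 315 2454 334 2479 0
I 337 2454 347 2480 0
T 349 2454 372 2481 0
O 374 2454 402 2480 0
L 406 2454 426 2480 0
I 429 2454 439 2480 0
N 442 2454 471 2480 0
E 473 2454 494 2480 0
"""

heights = box_heights(sample)
median_height = statistics.median(heights)
print("median letter height:", median_height)
```

To run it on a real page, the sample string could be replaced with the contents of the .box file that the makebox command writes (output_file_name.box for the command above).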