Tesseract gets space wrong

Jeremy Young

Jun 3, 2021, 1:45:51 PM

to tesseract-ocr

Hi

The attached test image (one of potentially a million in a batch, so I need a generalised fix) is being processed with Tess4J, but I get the same issue with the Mannheim Windows build, version:

C:\temp>tesseract --version

tesseract v5.0.0-alpha.20210506

leptonica-1.78.0

libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

Found AVX2

Found AVX

Found FMA

Found SSE4.1

Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5

Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

When I execute "tesseract test1.png test1" the output contains at line 21 "PartyA | PartyB | Valuation". "Party A" should be two words as should "Party B".

When I output the hOCR using Tess4J, I can see that the gaps between the characters are 4, 6, 2, 2, 12,

i.e. the gap between the "y" and the "A" is much bigger than the others.

P a r t y A
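One way to act on those gap figures is to treat any gap that is much wider than the typical one as a word break. The sketch below is only an illustration: `split_on_wide_gaps` and its `factor` of 2 are hypothetical choices, not something Tesseract or Tess4J provides, and the threshold would need tuning on real data.

```python
from statistics import median


def split_on_wide_gaps(chars, gaps, factor=2.0):
    """Insert a space wherever a gap exceeds factor * median gap.

    chars: the recognised characters of one "word", e.g. "PartyA".
    gaps:  horizontal gaps in pixels between consecutive characters,
           so len(gaps) == len(chars) - 1.
    factor is an assumed tuning knob, not a Tesseract parameter.
    """
    if len(gaps) != len(chars) - 1:
        raise ValueError("expected one gap between each pair of characters")
    if not gaps:
        return chars
    threshold = factor * median(gaps)
    pieces = [chars[0]]
    for ch, gap in zip(chars[1:], gaps):
        if gap > threshold:
            pieces.append(" ")
        pieces.append(ch)
    return "".join(pieces)


# Gaps reported in the post for "PartyA": 4, 6, 2, 2, 12
print(split_on_wide_gaps("PartyA", [4, 6, 2, 2, 12]))  # Party A
```

A median-based threshold is used here because a single wide gap does not shift it much; whether it holds up on tightly kerned or justified text is untested.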

Any suggestions what I could do?

Thx

test1.png

Jeremy Young

Jun 4, 2021, 11:13:25 AM

to tesseract-ocr

It looks like there's a bug of some sort here. Attached is another image. When I OCR it with

"tesseract test.png test -c tessedit_create_hocr=1 -c hocr_char_boxes=1"

the hOCR for "Party A" looks like this:

P a r t y A

i.e. the x-coordinates of the "y" overlap those of the preceding and following characters.
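A check for this kind of overlap can be scripted against the hOCR file. The following is a rough sketch, assuming that with `hocr_char_boxes=1` each character sits in a `span` of class `ocrx_cinfo` whose `title` starts with `x_bboxes left top right bottom`; the function names and the sample coordinates are invented for illustration and are not taken from test.hocr.

```python
import re

# Assumed markup: with hocr_char_boxes=1, each character is expected in a span
# like <span class='ocrx_cinfo' title='x_bboxes L T R B; x_conf C'>y</span>.
CINFO = re.compile(
    r"<span class=['\"]ocrx_cinfo['\"] title=['\"]x_bboxes "
    r"(\d+) (\d+) (\d+) (\d+)[^'\"]*['\"]>(.*?)</span>"
)


def char_boxes(hocr):
    """Return (char, left, top, right, bottom) tuples in document order."""
    return [
        (m.group(5), *map(int, m.group(1, 2, 3, 4)))
        for m in CINFO.finditer(hocr)
    ]


def overlapping_pairs(boxes):
    """Return adjacent character pairs whose x-ranges overlap.

    boxes should come from a single word or line; across a line break the
    next character naturally starts further left and would be a false hit.
    """
    pairs = []
    for prev, cur in zip(boxes, boxes[1:]):
        if cur[1] < prev[3]:  # current left edge is before previous right edge
            pairs.append((prev[0], cur[0]))
    return pairs


# Invented coordinates, for illustration only (not taken from test.hocr).
SAMPLE = "".join(
    f"<span class='ocrx_cinfo' title='x_bboxes {l} 0 {r} 10; x_conf 99'>{c}</span>"
    for c, l, r in [("P", 10, 20), ("a", 24, 30), ("r", 36, 40),
                    ("t", 42, 46), ("y", 44, 60), ("A", 58, 70)]
)
print(overlapping_pairs(char_boxes(SAMPLE)))  # [('t', 'y'), ('y', 'A')]
```

A regular expression is enough for a quick check like this; a proper HTML parser would be the safer choice if the markup varies between Tesseract versions.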

test.png

test.hocr

Zdenko Podobny

Jun 4, 2021, 11:29:29 AM

to tesser...@googlegroups.com

Search the issue tracker and forum for "table".

Zdenko

On Fri, Jun 4, 2021 at 17:13, Jeremy Young <jeremy...@likezero.co.uk> wrote:


Jeremy Young

Jun 4, 2021, 11:41:44 AM

to tesseract-ocr

Hmmm. I had a quick look. The results don't seem to be too helpful. Could you be a little more precise as to what I'm looking for?

Thx

Ger Hobbelt

Jun 6, 2021, 8:35:47 AM

to tesser...@googlegroups.com

Don't know why it happens precisely, but tesseract gets a little wonky when you feed it tables (with borders/lines).

Another good test would be to clip out the text of each cell, e.g. "Party A" only, etc. and feed those to tesseract one after the other.

If the text comes out properly then, at least you'll have "proof" that this is triggered by the table layout. That would imply "image segmentation" is the next subject to look at, though it can be argued that fiddling with that is a "workaround" rather than a "fix". Either way, this is complicated and I don't have the answers. It's just a direction to look in: categorize the problem, and then maybe you'll have something to help you reduce it.
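The per-cell test could be scripted roughly as follows. This is a sketch under several assumptions: Pillow is installed, a `tesseract` binary is on PATH, and the cell boxes are supplied by hand or by some other table detector. The function names are made up, and `--psm 7` (treat the image as a single text line) is just one plausible choice for a single cell.

```python
import subprocess
import tempfile
from pathlib import Path


def line_command(image_path):
    """Build a tesseract CLI call that treats the image as one text line."""
    return ["tesseract", str(image_path), "stdout", "--psm", "7"]


def ocr_cells(image_path, cell_boxes):
    """Crop each (left, top, right, bottom) box and OCR it on its own.

    Needs Pillow and a tesseract binary on PATH; the cell boxes themselves
    are assumed to come from elsewhere (hand-picked or a table detector).
    """
    from PIL import Image  # imported lazily so line_command has no extra deps

    results = []
    with Image.open(image_path) as img, tempfile.TemporaryDirectory() as tmp:
        for i, box in enumerate(cell_boxes):
            crop_path = Path(tmp) / f"cell_{i}.png"
            img.crop(box).save(crop_path)
            done = subprocess.run(
                line_command(crop_path), capture_output=True, text=True, check=True
            )
            results.append(done.stdout.strip())
    return results
```

Something like `ocr_cells("test1.png", [(0, 0, 120, 30)])` would then return one string per cell; the coordinates in that call are placeholders, not measured from the attached image.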

Cheers,

Ger


Jeremy Young

Jun 7, 2021, 10:11:34 AM

to tesseract-ocr

Thanks for your input, Ger.

Tesseract is definitely a bit rubbish when there is complex spacing to deal with.

Test1.png attached is just the "Party A Party B Valuation" line.

The offsets are still overlapping.

Test2.png is just the "Party A" and Tesseract gets it right.

I already have some code which looks for low confidence characters/words and reOCRs those areas in a different psm.

I stopped using it because the confidence cannot be relied upon, especially when multiple languages come into play.

However, I could reuse that code to re-OCR, word by word, just those sections which appear to have overlapping character coordinates.
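Picking out the sections to redo might look something like the sketch below. It assumes the per-character boxes have already been read from the hOCR into plain tuples; `regions_to_reocr`, the `pad` margin and the sample coordinates are all hypothetical.

```python
def has_overlap(char_boxes):
    """True if any character starts left of where the previous one ends."""
    return any(cur[0] < prev[2] for prev, cur in zip(char_boxes, char_boxes[1:]))


def regions_to_reocr(words, pad=2):
    """Return (text, padded box) for every word with overlapping characters.

    words: list of (text, [(left, top, right, bottom), ...]) with one box per
           character, as could be read from hOCR character boxes.
    pad:   assumed margin in pixels so the crop does not clip the glyphs.
    """
    regions = []
    for text, boxes in words:
        if not has_overlap(boxes):
            continue
        left = min(b[0] for b in boxes) - pad
        top = min(b[1] for b in boxes) - pad
        right = max(b[2] for b in boxes) + pad
        bottom = max(b[3] for b in boxes) + pad
        regions.append((text, (max(left, 0), max(top, 0), right, bottom)))
    return regions


# Invented coordinates, for illustration only.
words = [
    ("PartyA", [(10, 0, 20, 10), (24, 0, 30, 10), (36, 0, 40, 10),
                (42, 0, 46, 10), (44, 0, 60, 12), (58, 0, 70, 10)]),
    ("B", [(90, 0, 100, 10)]),
]
print(regions_to_reocr(words))  # [('PartyA', (8, 0, 72, 14))]
```

Each returned box can then be cropped and passed back through Tesseract in a single-word or single-line page segmentation mode, which sidesteps the unreliable confidence values entirely.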

Yuk!

J

test2.png

test2.hocr

test1.hocr

test1.png
