Problem reading text in two columns


Brooks Johnson

May 6, 2018, 3:51:39 PM
to tesseract-ocr

I was experimenting with an image of a receipt, but Tesseract seems to have trouble reading its two columns. I'm including a sample image so you can see what I was working with. The output I get from running "tesseract receipt.png out" is this:


CUL DAIRY
CHOBANI VOG

PRODUCE

HONEVURISP APPLES

0.93 lb 6 $2.29/ 1b
{are Weyght: 0.011b

BANANAS

3.16 lb 9 $0,59/ lb
Tare Weight: 0.01m

BALANCEDlE

$2.13

$1.86

$9.88



There are a few typos, but my biggest concern is that the $5.89 is nowhere to be found, even though the prices below it are included. That first price is still missing after I processed the image, and even when I used a different image taken under different lighting. Am I doing something wrong here?

ShreeDevi Kumar

May 7, 2018, 12:16:39 AM
to tesser...@googlegroups.com
Which version of tesseract are you using?

Which traineddata (from which repo)?

Try --psm 6 if you are using the Tesseract 4 beta. It treats the page as a single uniform block of text, so each whole line is recognised rather than each column separately.
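For anyone scripting this, here is a minimal sketch of building that command line. It assumes the file name receipt.png from the original post; the helper names tesseract_command and run_ocr are made up for illustration. Note that Tesseract 3.x spells the option -psm (single dash), while 4.x uses --psm.

```python
import subprocess


def tesseract_command(image, output_base="out", psm=6, major_version=4):
    """Build the argument list for a tesseract run with a page segmentation mode."""
    # Tesseract 3.x uses "-psm"; 4.x and later use "--psm".
    psm_flag = "-psm" if major_version < 4 else "--psm"
    return ["tesseract", image, output_base, psm_flag, str(psm)]


def run_ocr(image, psm=6, major_version=4):
    """Run tesseract and return the recognised text.

    Requires the tesseract binary on PATH; "stdout" as the output base
    makes tesseract write the text to standard output instead of a file.
    """
    cmd = tesseract_command(image, "stdout", psm, major_version)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Print the command that would be run for the receipt in this thread.
    print(" ".join(tesseract_command("receipt.png")))
```

Mode 6 is worth trying first on receipts because the item name and its price sit on the same physical line, and the default automatic layout analysis may split them into separate columns.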

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fd5f8596-7f21-42d6-a7bb-0dcafa113a4a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brooks Johnson

May 7, 2018, 9:53:14 AM
to tesseract-ocr
Sorry, I forgot to specify that.

Tesseract 3.04.01

I'm using the data from tesseract-ocr-eng

shree

May 9, 2018, 4:21:12 AM
to tesseract-ocr
Please try building the latest version of Tesseract from GitHub, or install it from the links given in https://github.com/tesseract-ocr/tesseract/wiki

I get the following output using the default eng.traineddata from each of the three repos (tessdata, tessdata_best, tessdata_fast), without any preprocessing of the image.

# tesseract receipt.png - --psm 6 --tessdata-dir ./tessdata -c preserve_interword_spaces=1 -c page_separator=''

Warning. Invalid resolution 0 dpi. Using 70 instead.
CUL DAIRY

CHOBANI Y0G              $5.89 F
PRODUCE

HONEYCRTSP APPLES

0.931b@ $2.29/ Ib     $2.13 F
Tare Weight: 0.011b

BANANAS

3.16 1b®  $0.59/ Ib   $1.86 F
Tare Weight: 0.011b

BALANCE DUE               $9.88


# tesseract receipt.png - --psm 6 --tessdata-dir ./tessdata_best -c preserve_interword_spaces=1 -c page_separator=''

Warning. Invalid resolution 0 dpi. Using 70 instead.
CUL DAIRY

CHOBANI Y0G              $5.89 F
PRODUCE

HONEYCRISP APPLES

0.931b8  $2.20/ Ib     $213 F
Tare Weight: 0.011b

BANANAS

3.16 1b8 $0.59 Ib   $1.86 F
Tare Weight: 0.011b

BALANCE DUE               $9.88


# tesseract receipt.png - --psm 6 --tessdata-dir ./tessdata_fast  -c preserve_interword_spaces=1 -c page_separator=''

Warning. Invalid resolution 0 dpi. Using 70 instead.
CUL DAIRY

CHOBANI ¥OG              $5.89 F
PRODUCE

HONEYCRISP APPLES

0.93 Ib @ = $2.29/ Ib     $2.13 F
Tare Weight: 0.011b

BANANAS

3.16 1b @ —$0.59/ Ib   $1.86 F
Tare Weight: 0.01Ib

BALANCE DUE               $9.88
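Since a dropped or misread price was the original complaint, one way to sanity-check output like the above is to pull the price at the end of each line and compare the item total with the balance. This is only a sketch: the regex and the function name check_receipt are my own assumptions about this receipt's layout (a price at the end of the line, optionally followed by a single flag letter such as F), not anything Tesseract provides.

```python
import re
from decimal import Decimal

# A price at the end of a line, optionally followed by one flag letter ("F").
PRICE_AT_END = re.compile(r"\$(\d+\.\d{2})(?:\s+[A-Z])?\s*$")


def check_receipt(ocr_text):
    """Return (item_prices, balance, consistent) for OCR'd receipt text."""
    items = []
    balance = None
    for line in ocr_text.splitlines():
        match = PRICE_AT_END.search(line)
        if not match:
            continue  # no well-formed price at the end of this line
        amount = Decimal(match.group(1))
        if line.strip().upper().startswith("BALANCE"):
            balance = amount
        else:
            items.append(amount)
    consistent = balance is not None and sum(items) == balance
    return items, balance, consistent


# Price-bearing lines copied from the tessdata run above.
sample = """CHOBANI Y0G              $5.89 F
0.931b@ $2.29/ Ib     $2.13 F
3.16 1b®  $0.59/ Ib   $1.86 F
BALANCE DUE               $9.88"""

items, balance, consistent = check_receipt(sample)
# consistent is True here: 5.89 + 2.13 + 1.86 == 9.88
print(items, balance, consistent)
```

Run against the tessdata_best output, the same check fails, because "$213" has lost its decimal point and no longer matches the price pattern. That makes it a cheap way to flag lines that need a second look.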



Brooks Johnson

May 10, 2018, 10:59:10 PM
to tesseract-ocr
I've uninstalled and reinstalled from the PPA, and my results now resemble yours. I used the tessdata_fast file for English. Is it different from the tesseract-ocr-eng data that comes with Ubuntu?

ShreeDevi Kumar

May 11, 2018, 8:17:16 AM
to tesser...@googlegroups.com
> I used the tessdata_fast file for English. Is it different from the tesseract-ocr-eng data that comes with Ubuntu?

The PPA has traineddata files from tessdata_fast. Ubuntu 18.04 will have the same.

Older versions of Ubuntu (without the PPA) will have traineddata files for Tesseract 3.0x.

You can try all three (tessdata_fast, tessdata_best and tessdata) to see which one works best in your case in terms of speed and accuracy.
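A rough sketch of how such a comparison could be scripted, reusing the options from the commands earlier in the thread. It assumes the three repos are checked out as ./tessdata, ./tessdata_best and ./tessdata_fast, each containing eng.traineddata; the function names are made up for illustration.

```python
import subprocess
import time

# Assumed local checkouts of the three traineddata repos.
MODEL_DIRS = ["./tessdata", "./tessdata_best", "./tessdata_fast"]


def command_for(image, tessdata_dir):
    """Build a Tesseract 4 command that writes recognised text to stdout."""
    return [
        "tesseract", image, "-",
        "--psm", "6",
        "--tessdata-dir", tessdata_dir,
        "-c", "preserve_interword_spaces=1",
    ]


def compare_models(image, dirs=MODEL_DIRS):
    """Run tesseract once per model directory; return {dir: (seconds, text)}.

    Requires the tesseract binary on PATH.
    """
    results = {}
    for tessdata_dir in dirs:
        start = time.perf_counter()
        proc = subprocess.run(
            command_for(image, tessdata_dir), capture_output=True, text=True
        )
        results[tessdata_dir] = (time.perf_counter() - start, proc.stdout)
    return results


if __name__ == "__main__":
    for name, (seconds, text) in compare_models("receipt.png").items():
        print(f"== {name}: {seconds:.2f}s ==")
        print(text)
```

A single run per model gives only a rough timing; for a fair speed comparison you would want to repeat each run a few times and judge accuracy on more than one image.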



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
