Unable to detect single digit cells in an invoice


Leo Bergolth

Nov 15, 2016, 7:35:43 AM
to tesseract-ocr
I am trying to use tesseract for processing invoices in tabular format.
As output format, I chose hocr or pdf because I need the positions of the text segments.

Recognition works great, except for cells like the "quantity" column that contain single digit values. Those single digit cells are skipped, while other cells in the same column that contain more than one digit are perfectly recognized.

(See the attached screenshot.)

Is there any configuration parameter, like a "minimum word size", that is tunable?
I've already tried -c load_system_dawg=F -c load_freq_dawg=F
and -psm modes from 3 to 12.
If I use -psm 6 and txt output, those digits are included but unfortunately I need the cell positions.
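For anyone following along who needs the positions: a minimal sketch of pulling word boxes out of hOCR output with the Python standard library. It assumes the usual hOCR layout, where each word is a `span` of class `ocrx_word` whose `title` attribute starts with `bbox x0 y0 x1 y1`. `WordBoxParser` and `parse_hocr_words` are names made up for this illustration, not part of tesseract.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (word, bbox) tuples found in an hOCR document."""
    parser = WordBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Comparing the words returned here against the digits that show up in the plain-text output would make it easy to list exactly which cells went missing.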

Any hints?
Cheers,
--leo

P.S.: I am using tesseract-3.04.
hocr-single-digits.png

Tom Morris

Nov 15, 2016, 10:29:16 AM
to tesseract-ocr
How are you specifying the output format? For example, if you use the default pdf config file, it includes the line:

tessedit_pageseg_mode 1

which may override your intended -psm flag.
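A quick way to check whether a given config file sets the page segmentation mode is sketched below. It assumes the simple "name value" per-line layout shown above; `config_value` is a hypothetical helper, and the sample strings in the usage note are illustrative rather than copied from a real installation.

```python
def config_value(config_text, variable="tessedit_pageseg_mode"):
    """Return the value a config file assigns to `variable`, or None.

    If the variable appears more than once, the last assignment wins
    in this sketch.
    """
    value = None
    for line in config_text.splitlines():
        parts = line.split(None, 1)  # split into name and rest of line
        if len(parts) == 2 and parts[0] == variable:
            value = parts[1].strip()
    return value
```

For example, `config_value("tessedit_create_pdf 1\ntessedit_pageseg_mode 1\n")` returns `"1"`, while a config that never mentions the variable yields `None`.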

Having said that, you probably have more information than tesseract about the page layout, so you may want to try doing page segmentation yourself and feeding the resulting columns or cells to tesseract for recognition individually.
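A rough sketch of what per-cell recognition could look like, assuming the cells have already been cropped to individual image files and their top-left corners on the page are known. The helper names are hypothetical. The flags follow the 3.04-style `-psm` option used in this thread; mode 10 treats the image as a single character, and `tessedit_char_whitelist` restricts recognition to the listed characters. Because hOCR boxes from a cropped cell are relative to the crop, they have to be shifted back to page coordinates.

```python
def cell_command(cell_image, output_base, psm=10, digits_only=True):
    """Build a tesseract command line for one cropped cell image."""
    cmd = ["tesseract", cell_image, output_base, "-psm", str(psm)]
    if digits_only:
        # Restrict recognition to digits for numeric columns.
        cmd += ["-c", "tessedit_char_whitelist=0123456789"]
    cmd.append("hocr")  # stock config file that enables hOCR output
    return cmd


def to_page_coords(bbox, cell_origin):
    """Shift a bbox found inside a cropped cell back onto the full page."""
    x0, y0, x1, y1 = bbox
    ox, oy = cell_origin  # top-left corner of the cell on the page
    return (x0 + ox, y0 + oy, x1 + ox, y1 + oy)
```

The command list can be passed to `subprocess.run(cmd, check=True)`. For cells holding more than one character, a single-word or single-line mode (8 or 7) would be the more natural choice.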

Tom

Leo Bergolth

Nov 16, 2016, 12:04:20 PM
to tesseract-ocr
On Tuesday, November 15, 2016 at 16:29:16 UTC+1, Tom Morris wrote:
> How are you specifying the output format? For example, if you use the default pdf config file, it includes the line:
>
> tessedit_pageseg_mode 1
>
> which may override your intended -psm flag.

Thanks for the hint.
But I've also tried with my own config (named leohocr) that contains only:

load_system_dawg 0
load_freq_dawg 0
tessedit_create_hocr 1

and called it like that:
tesseract clean01.tif t01_3 -c tessedit_pageseg_mode=3 leohocr
tesseract clean01.tif t01_5 -c tessedit_pageseg_mode=5 leohocr
[...]
tesseract clean01.tif t01_11 -c tessedit_pageseg_mode=11 leohocr
tesseract clean01.tif t01_12 -c tessedit_pageseg_mode=12 leohocr

PSM 1, 3, 6, 11 and 12 produce very good results; the only remaining problem is those missing single-digit cells. :-(
 
> Having said that, you probably have more information than tesseract about the page layout, so you may want to try doing page segmentation yourself and feeding the resulting columns or cells to tesseract for recognition individually.

I tried to feed a single column (see the attached input file single_col1.tif) but got the same results:
PSM 1, 3, 6, 11 and 12 produce usable results, but again the single-digit cells are missing.
See the attached screenshot.

I'd greatly appreciate any pointers!

Thanks,
--leo
single_col1.tif
Screenshot_20161116_175351.png

Art Rhyno.

Nov 17, 2016, 11:39:28 AM
to tesser...@googlegroups.com

Hi Leo,

 

Your example has such good contrast that you might consider using the colors to identify single characters. I have attached a quick sample of what I mean. I used opencv and defer greatly to the blog post I reference at the top of the script, but the idea would be to try to catch single characters using opencv’s “inrange” function. I would use tesseract on the image first and weed out blobs for further processing based on the coordinates of what tesseract has already detected. I would then use single character mode on what’s left. Feel free to ping me if you are interested in this approach.
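The "weed out blobs" step can be sketched without any image library, assuming the candidate blob boxes (for example from an OpenCV contour pass) and the word boxes tesseract already reported are both available as `(x0, y0, x1, y1)` tuples. The function names and the 0.5 coverage threshold are assumptions for illustration only; this is not the attached digits.py.

```python
def overlap_area(a, b):
    """Area of the intersection of two (x0, y0, x1, y1) boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def unrecognized_blobs(blob_boxes, word_boxes, min_cover=0.5):
    """Keep blobs that no already-recognized word box covers well enough.

    A blob counts as recognized when a single word box overlaps at least
    `min_cover` of the blob's own area.
    """
    leftovers = []
    for blob in blob_boxes:
        area = (blob[2] - blob[0]) * (blob[3] - blob[1])
        if area <= 0:
            continue  # skip degenerate boxes
        covered = max((overlap_area(blob, w) for w in word_boxes), default=0)
        if covered / area < min_cover:
            leftovers.append(blob)
    return leftovers
```

Whatever remains in the returned list would then be cropped and passed to tesseract in single character mode.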

 

art

digits.py

Leo Bergolth

Nov 18, 2016, 7:41:22 AM
to tesseract-ocr
Could anyone who knows tesseract's internals give me some pointers to the likely location where the single-digit blobs (excuse me if my terminology is wrong) are rejected?

I've played around with the ScrollView debug viewer and noticed that the single digits are contained in the "image blobs" and "initial partitions" displays but not in the resulting partitions shown with textord_tabfind_show_partitions.

What happens between those steps? Maybe I can turn on some additional debugging to find the reason why those digits are removed?

Thanks,
--leo

Leo Bergolth

Nov 18, 2016, 12:30:50 PM
to tesseract-ocr
On Thursday, November 17, 2016 at 17:39:28 UTC+1, Art Rhyno wrote:

> Your example has such good contrast that you might consider using the colors to identify single characters. I have attached a quick sample of what I mean. I used opencv and defer greatly to the blog post I reference at the top of the script, but the idea would be to try to catch single characters using opencv’s “inrange” function. I would use tesseract on the image first and weed out blobs for further processing based on the coordinates of what tesseract has already detected. I would then use single character mode on what’s left. Feel free to ping me if you are interested in this approach.


Thanks for your suggestion! Looks like a neat way to circumvent the problem.
However, I'd prefer to find the reason why tesseract rejects those blobs first.
(See my other post.)
Maybe this can be fixed in tesseract, once I know some background... :-)

Cheers,
--leo

Art Rhyno.

Nov 18, 2016, 2:53:09 PM
to tesser...@googlegroups.com

For sure, best of luck!

 

art

