Passing RegEx to Zone Scans


David Arnold

Jul 29, 2014, 4:27:10 PM
to tesser...@googlegroups.com
Hello,

For a theoretical application of advanced invoice registration/indexing, it would be very useful if, besides training a specific invoice template, one could pass a RegEx filter to zone scans.

Imagine you want to retrieve the date of a receipt, which is in a zone that you either mark by hand or that is fixed. Within a specific invoice template (training file), this might always be in the format:

24/JUL/2014

thus using only the following broader subset:

##/AAA/####

or the following narrower subset:

dd/MMM/YYYY

I think that if it were possible to pass such a regex to the individual scan, telling Tesseract to use only that specific subset of characters when processing the image zone, it would give near 100% accuracy even on dirty receipt scans that have been through a tropical monsoon before being scanned...
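To make the idea concrete, here is a minimal sketch in Python of what the "##/AAA/####" template could mean as a regular expression. It does not touch Tesseract at all; it only checks text that some OCR step has already produced. The template alphabet ('#' for a digit, 'A' for a capital letter) is taken from the example above, while the function names and the month list are assumptions made for illustration.

```python
import re

# Template alphabet from the post: '#' = one digit, 'A' = one capital letter.
# Any other character (such as '/') is matched literally.
TEMPLATE_CLASSES = {"#": "[0-9]", "A": "[A-Z]"}

# Assumed English month abbreviations for the narrower dd/MMM/YYYY check.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

DATE_TEMPLATE = "##/AAA/####"


def template_to_regex(template):
    """Translate a '#'/'A' template into a compiled regular expression."""
    parts = [TEMPLATE_CLASSES.get(ch, re.escape(ch)) for ch in template]
    return re.compile("".join(parts))


def is_plausible_date(text):
    """Broad check (##/AAA/####) followed by the narrower dd/MMM/YYYY check."""
    text = text.strip()
    if not template_to_regex(DATE_TEMPLATE).fullmatch(text):
        return False
    day, month, _year = text.split("/")
    return month in MONTHS and 1 <= int(day) <= 31
```

A check like this can only reject or accept a result after recognition, for example is_plausible_date("24/JU1/2014") is false because of the misread "1". Restricting the recognition itself would have to happen inside Tesseract.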

What do you know about / think about?

Nick White

Aug 12, 2014, 12:57:08 PM
to tesser...@googlegroups.com
Hi David,

You're right, that would be useful. Tesseract has a basic version of
that, called "patterns"; search the manpage for a bit of information
on them.

However at present they can't be assigned per region, only as
possible patterns for the whole OCR job. Also they aren't
restrictive, but more "suggestive".

If you were using the API you could set only the pattern you
wanted, and recognise only the region containing the zone, and that
should work quite well. Give it a try if you have time, and let us
know how it works.
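
A minimal sketch of that API approach, assuming the third-party tesserocr Python binding is installed. Note that it swaps in a character whitelist (the tessedit_char_whitelist variable) in place of a pattern, since that is the simplest hard restriction; the helper names, the box layout and the default template are assumptions for illustration.

```python
import string


def whitelist_for(template):
    """Characters permitted by a '#'/'A' template such as '##/AAA/####'."""
    allowed = set()
    for ch in template:
        if ch == "#":
            allowed.update(string.digits)
        elif ch == "A":
            allowed.update(string.ascii_uppercase)
        else:
            allowed.add(ch)  # literal separator, e.g. '/'
    return "".join(sorted(allowed))


def ocr_zone(image_path, box, template="##/AAA/####"):
    """Recognise one rectangular zone with a restricted character set.

    box is (left, top, width, height) in pixels.
    """
    # Imported lazily so whitelist_for() works without tesserocr installed.
    from tesserocr import PSM, PyTessBaseAPI

    left, top, width, height = box
    with PyTessBaseAPI(psm=PSM.SINGLE_LINE) as api:
        api.SetVariable("tessedit_char_whitelist", whitelist_for(template))
        api.SetImageFile(image_path)
        api.SetRectangle(left, top, width, height)  # limit OCR to the zone
        return api.GetUTF8Text().strip()
```

A call might look like ocr_zone("receipt.png", (120, 40, 300, 60)), where the file name and coordinates are made up. A whitelist limits which characters may appear but not their order, so checking the result against the expected format afterwards is still worthwhile.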

Nick

David Arnold

Aug 17, 2014, 12:52:03 AM
to tesser...@googlegroups.com
Hello Nick,
thank you very much for your answer. I come from the user side and my technical background is limited, so I usually understand only half of the picture, but at least I am getting some direction.

So I found this: http://www.openocr.net/ - I've worked with Docker, so I understand the architecture a bit, and I like that it hides away the more difficult stuff and is nicely boxed :)
If you scroll down, there is a features list. The 4th item says:
  • Pass arguments to Tesseract such as character whitelist and page segment mode
Is this what we are looking for? "page segment mode" and "character whitelist"? If I puzzle that together correctly, this is the API access you talked about... But I don't feel confident enough to draw a conclusion. I would prefer to abstract away the things I don't understand. That's fair, isn't it? :)
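
For what it's worth, here is how I imagine such a request might be put together, as a small Python sketch that only builds the JSON body. The field names (img_url, engine, engine_args, config_vars, psm) are my assumption based on the OpenOCR README and should be checked there; the function name and the example URL are made up.

```python
import json


def build_openocr_request(img_url, whitelist, psm="7"):
    """Build a JSON body for an OCR request (field names are assumptions).

    psm "7" is Tesseract's page segmentation mode for a single text line.
    """
    payload = {
        "img_url": img_url,
        "engine": "tesseract",
        "engine_args": {
            # Restrict recognition to the given characters.
            "config_vars": {"tessedit_char_whitelist": whitelist},
            "psm": psm,
        },
    }
    return json.dumps(payload)
```

For a date zone the whitelist could be "/0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ". As far as I can tell this restricts the characters, not their order, so it is narrower than a real pattern or RegEx.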

Thanks, and I hope this might inspire anyone who happens to read this...