user-patterns details and examples

Andrew McGrath

Jan 20, 2014, 1:18:53 PM

to tesser...@googlegroups.com

Hey Everyone,

This is my first post :-) Thanks for working on and maintaining this excellent tool!

I'm trying to refine the accuracy of the results we're getting back from Tesseract and seem to have encountered a lack of documentation around the user-patterns file.

My belief is that I should be generating this file much like the dawg files and user-words files, and referencing it in my config as such:

user_patterns_suffix user-patterns

At the moment I'm trying to accomplish three things:

1. Ensure that any text string starting with "www." is expected to continue with some text and then end in ".com".

2. Ensure that phone numbers are recognized. The actual text being transcribed is something like "(123) 123-1234". My assumption is that I could tell Tesseract to expect a pair of brackets containing 3 digits, a space, three digits, a dash and then 4 digits. The real issue I'm getting is that it's not aware that this pattern should only contain digits, and it confuses things like the letter D with the digit 0.

3. Inform Tesseract that I'm expecting a lot of prices, for example "$1.12", and that everything after the $ should be digits or periods only.

So my questions are:

Can anyone tell me about the format of the user-patterns file and share a working example, or help me understand how to solve my pattern challenges? Also, if there is anything else I need to do, other than reference this file in the config and include it in the same folder as my training data, that would be great to learn about.

What I've done so far:

I've created a pretty decent training set for my font (around 4000 boxes) and a fairly complete dictionary file. I also defined the ambigchars to improve some of the simple 'find and replace' type scenarios, although I don't think I'm using this as it was intended, as all my '0' type cases seem to do nothing. These things combined have had great results (actually, the dictionary has done the most for me), but I'm really trying to get to the next level by giving it some intelligence around the kinds of patterns it should expect to find. I had some issues with the Tesseract 3.02 training tools, so I checked out the source for v3.03 and compiled it, which resolved the issue I had.

Thanks for your help!

Nick White

Jan 22, 2014, 10:17:29 AM

to tesser...@googlegroups.com

Hi Andrew,

Welcome! Sorry to have been a bit slow to reply.

> I'm trying to refine the accuracy of the results we're getting back from
> Tesseract and seem to have encountered a lack of documentation around the
> user-patterns file.

Yes, it certainly is an area where more documentation is needed.

I'll try to find the time to dig through the code and whatever
documentation exists for it, and get back to you soon.

In the meantime I'll answer some of your other questions and
thoughts.

The main thing I was thinking when reading your email is that you
can use number-dawg for some of these tasks. Going through your list:

> 2. Ensure that phone numbers are recognized. The actual text being transcribed
> is something like "(123) 123-1234". My assumption is that I could tell
> Tesseract to expect a pair of brackets containing 3 digits, a space, three
> digits, a dash and then 4 digits. The real issue I'm getting is that it's not
> aware that this pattern should only contain digits, and it confuses things
> like the letter D with the digit 0.
> 3. Inform Tesseract that I'm expecting a lot of prices, for example "$1.12",
> and that everything after the $ should be digits or periods only.

Take a look at the eng.number-dawg - you can get the wordlist it
uses by running the following:
$ combine_tessdata -u eng.traineddata eng.
$ dawg2wordlist eng.unicharset eng.number-dawg eng.number-wordlist

As described in the combine_tessdata manpage, each digit is
represented by a space. Both of these rules should be really easy to
put into the number-dawg, something like:

( ) -
$ .
$ .
$ .

Note they're untested, and I haven't used number-dawg myself, but
that looks like it ought to work to me.
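
Since these entries are mostly spaces, they are hard to read and easy to mangle when copied around. Here is a minimal Python sketch (not part of Tesseract; `to_number_dawg_entry` is a hypothetical helper name) that derives such entries from sample strings by replacing every digit with a space, which is how I read the manpage. Printing with repr() shows exactly how many spaces each entry contains. The sample prices are made up for illustration.

```python
def to_number_dawg_entry(sample: str) -> str:
    """Replace each ASCII digit with a space (hypothetical helper, not a Tesseract API)."""
    return "".join(" " if "0" <= ch <= "9" else ch for ch in sample)


# Illustrative samples: one phone number and prices with 1-3 digits before the period.
samples = ["(123) 123-1234", "$1.12", "$12.34", "$123.45"]

# A set drops duplicate entries; sorting keeps the wordlist stable between runs.
entries = sorted({to_number_dawg_entry(s) for s in samples})

for entry in entries:
    # repr() makes the otherwise invisible runs of spaces countable.
    print(repr(entry))
```

If you write the entries to a wordlist file, watch out for editors that strip trailing whitespace, since the trailing spaces are part of each entry. The wordlist would then presumably need to go back through wordlist2dawg and be recombined into the traineddata, which I haven't tried.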

> 1. Ensure that any text string starting with "www." is expected to continue
> with some text and then end in ".com".

The punc-dawg may be enough for this. Maybe something like this:

www. .com

> I also defined the ambigchars to improve some
> of the simple 'find and replace' type scenarios, although I don't think I'm
> using this as it was intended, as all my '0' type cases seem to do nothing.

Yes, the '0' type cases don't make a large difference. Arguably they
should make a bit more. I wonder if there's a config variable to
control that...

Anyway, I agree, someone should document the user-patterns stuff.
I'll try to do so if I get time, but if anyone wants to look sooner,
or offer their own experiences with it, do go ahead!

Nick

Nick White

Jan 22, 2014, 10:28:38 AM

to tesser...@googlegroups.com

On Wed, Jan 22, 2014 at 03:17:29PM +0000, Nick White wrote:
> Anyway, I agree, someone should document the user-patterns stuff.
> I'll try to do so if I get time, but if anyone wants to look sooner,
> or offer their own experiences with it, do go ahead!

I knew I'd seen *some* documentation for this feature. Naturally I
found it straight after sending the previous email. As mentioned in
the main tesseract manpage, dict/trie.h documents the format; see
http://code.google.com/p/tesseract-ocr/source/browse/trunk/dict/trie.h?r=999#188

If I'm reading it correctly, you should be able to specify your
rules with something like this:

www.\n\*.com
(\d\d\d) \d\d\d-\d\d\d\d
$\d\*.\d\d
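
One way to sanity-check patterns like these before handing them to Tesseract is to translate them into regular expressions and try them against sample strings. The Python sketch below does that under my reading of the trie.h comment: \c is a letter, \d a digit, \n a letter or digit, \p punctuation, \a lowercase, \A uppercase, and \* lets the preceding item repeat. The function name, the ASCII-only character classes, and the choice to treat \* as "one or more" are all my own assumptions, and this is not how Tesseract matches internally.

```python
import re
import string

# Assumed ASCII approximations of the character classes described in trie.h.
CLASSES = {
    "c": "[A-Za-z]",
    "d": "[0-9]",
    "n": "[A-Za-z0-9]",
    "p": "[" + re.escape(string.punctuation) + "]",
    "a": "[a-z]",
    "A": "[A-Z]",
}


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate one user-patterns line into a regex (hypothetical helper)."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            code = pattern[i + 1]
            i += 2
            if code == "*":
                if not parts:
                    raise ValueError("\\* must follow a character or class")
                # Assumption: \* means the previous item occurs one or more times.
                parts[-1] += "+"
            elif code in CLASSES:
                parts.append(CLASSES[code])
            else:
                # Unknown escape: treat the escaped character as a literal.
                parts.append(re.escape(code))
        else:
            # Everything else is a literal character.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


patterns = [r"www.\n\*.com", r"(\d\d\d) \d\d\d-\d\d\d\d", r"$\d\*.\d\d"]
samples = ["www.example.com", "(123) 123-1234", "$1.12", "(I23) 123-1234"]

for p in patterns:
    rx = pattern_to_regex(p)
    # List which sample strings each pattern accepts in full.
    print(p, [s for s in samples if rx.fullmatch(s)])
```

Note that under this reading \n would not cover hyphens or extra dots, so a hostname like www.my-site.com would need its own pattern.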

Hope that helps!

Nick
