Re: How to improve recognition on TIFF black-and-white Romanian text?


Nick White

Aug 22, 2012, 4:38:30 AM
to tesser...@googlegroups.com
Hi Jani,

Good questions, I'll answer them as best I can below:

> * Is any of the input formats preferable over others? I used PDF to TIFF via
> Ghostscript and I wonder if png/jpeg or other formats could have any advantage.
> If the original text is not color, does the TIFF device chosen matter?

TIFF should be fine. PNG can be easier to work with, as TIFF has so
many variants that it can sometimes cause unexpected problems. But
if Tesseract is opening and processing your TIFFs without trouble,
you can stick with TIFF.

> * Is there a way to ensure optimal quality of the TIFF for purposes of OCR file
> via Ghostscript's command line options? I tried -r600, -r1000, -r1200 just to
> see if there's any difference and while there were improvements in recognition
> in 1000 vs 600 there were also regressions in Tesseract's output.

600DPI is generally recommended. You could try higher, but if you
say there were some improvements and some regressions, I'd just stay
at 600DPI.
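Since your pages are already coming from Ghostscript, a command along
these lines would render 600 DPI black-and-white TIFFs directly; the
tiffg4 device and the other flags here are just common choices, not
the only sensible ones:

```shell
# Render each PDF page as a 600 DPI Group 4 (CCITT fax) monochrome TIFF.
# tiffg4 is one of several TIFF devices; tiffgray or png16m are
# alternatives if you ever want grayscale or PNG output instead.
gs -dNOPAUSE -dBATCH -dSAFER \
   -sDEVICE=tiffg4 -r600 \
   -sOutputFile=page-%03d.tif \
   input.pdf
```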

> * The text is Romanian, so latin characters with a few twists but no complex
> shapes. Is there any extra training to be done or should the available language
> data be enough?

Does "a few twists" mean any extra characters or diacritics? If not,
then just keep the training the same. However, you will get
improvements by creating a 'dictionary' file for Romanian and
telling Tesseract to use it. To do that, get a text file with one
word per line and run 'wordlist2dawg' on it; then use
'combine_tessdata' to unpack the current eng.traineddata, add your
dictionary 'dawg' file, and recompile it with the 'combine_tessdata'
command again (there may be an easier way to do this, but I'm not
sure; Zdenko, correct me if there is.)
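As a rough sketch, assuming a word list called words.txt and the
eng.traineddata mentioned above (substitute whatever traineddata
file you are actually using), the sequence might look like:

```shell
# 1. Unpack the existing traineddata into its component files
#    (produces eng.unicharset, eng.word-dawg, eng.config, etc.):
combine_tessdata -u eng.traineddata eng.

# 2. Compile a one-word-per-line UTF-8 word list into a DAWG, using
#    the unicharset that was just unpacked:
wordlist2dawg words.txt eng.word-dawg eng.unicharset

# 3. Repack everything (including the new eng.word-dawg) into a
#    single traineddata file:
combine_tessdata eng.
```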

> * Is it a common practice that is outside the scope of Tesseract to do
> post-processing/spelling correction if words are incorrectly recognized or is
> that a sign of more training/tweaking needed?

I don't know if it's common practice. Spell checking should really
be handled by training Tesseract appropriately. I have been working
on creating a training file for a while, and have been keen to try
to keep everything possible in the training rather than rely on
post-processing. Is there anything particular you had in mind for
post-processing? We could let you know if it would be possible or
sensible to try fiddling with the training instead.

Nick

Nick White

Aug 22, 2012, 10:13:20 AM
to Jani Monoses, tesser...@googlegroups.com
On Wed, Aug 22, 2012 at 06:58:19AM -0700, Jani Monoses wrote:
> thanks for the prompt answer!

You're welcome. As I said, it's nice to have clear, well-written
questions ;)

> 600DPI is generally recommended. You could try higher, but if you
> say there were some improvements and some regressions, I'd just stay
> at 600DPI.
>
> Alright, although there seemed to be more improvements than regressions at
> 1000dpi.

I don't think there are any fixed rules on this (someone else should
correct me if I'm wrong). So by all means use 1000dpi if it looks better.

> By the available language data I meant the already available /usr/share/
> tesseract-ocr/tessdata/ron.traineddata for Romanian
> that comes in Ubuntu/Debian's packaging of Tesseract.

Aah, OK, forgive me, I didn't realise there was a Romanian training
that you were already using. Good.

> I was wondering if the Romanian dataset needs further training - I am not sure
> what well-trained means in this context.

Probably it wouldn't be worth further training. It isn't really
feasible to just "improve" an existing training at present; you
would have to create a wholly new training, which would take a lot
of effort and probably not have a big impact.

> I only meant spelling corrections in the post processing phase as I see quite a
> few non-words being recognized instead of
> what the original document has, usually one or two edit-distances away.
> Matching with dictionary words could fix these but
> then I wonder if it would not go against the intention of the OCR process,
> which is to recognize what is in the input, and not
> what the correct spelling of the input is. In my case the originals are all
> correctly spelled so I would need a post-processing step
> anyway but maybe it should not be a core part of Tesseract's pipeline.

OK, I see. One thing you could do would be to experiment with
increasing Tesseract's trust in its dictionary. I have done
something similar with my training. Create a file with this in:

language_model_penalty_non_freq_dict_word 0.2
language_model_penalty_non_dict_word 0.3

and save it to tessdata/configs/trustdict - wherever your tessdata
folder is (probably /usr/share/tesseract-ocr/)

The original values for those configuration variables are 0.1 and
0.15 respectively. Play around with increasing them and see whether
it helps.

Then when you run tesseract, do something like this:
tesseract input.png output -l ron trustdict

Hope this helps, and let us know how you get on.

Nick

Jani Monoses

Aug 22, 2012, 10:50:06 AM
to Nick White, tesser...@googlegroups.com
>
> OK, I see. One thing you could do would be to experiment with
> increasing Tesseract's trust in its dictionary. I have done
> something similar with my training. Create a file with this in:
>
> language_model_penalty_non_freq_dict_word 0.2
> language_model_penalty_non_dict_word 0.3
>

Thanks, I tried this and the output is certainly different, but as
with the DPI changes some things got better and others regressed,
with no clear winner.

I tried increasing the values even more, but then the regressions
seemed to multiply too.
What I notice now is that at higher DPI every lowercase 'o' is
recognized as 'e', so I'll probably stick to 600 DPI for now.

So there's no way of just adding new words to the existing dictionary
without redoing the whole training?

Are there any other tunables like the above that you think may be worth looking into?

Jani

Nick White

Aug 22, 2012, 10:58:55 AM
to Jani Monoses, tesser...@googlegroups.com
On Wed, Aug 22, 2012 at 05:50:06PM +0300, Jani Monoses wrote:
> So there's no way of just adding new words to the existing dictionary
> without redoing the whole training?

There is a way, yes. Create a ron.user-words file in your tessdata
directory, and a config file stating:

user_words_suffix user-words

(I think the config file is needed, but I'm not sure.) The
ron.user-words file should have a list of words, one per line, UTF8
encoded.
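A minimal setup sketch, assuming the usual Debian/Ubuntu tessdata
location and a config file name of 'userwords' (both of those are
just placeholders; adjust to your install):

```shell
# TESSDATA path and the 'userwords' config name are assumptions.
TESSDATA="${TESSDATA:-/usr/share/tesseract-ocr/tessdata}"
mkdir -p "$TESSDATA/configs"

# One UTF-8 word per line:
printf 'aerobuz\naltimetrie\n' > "$TESSDATA/ron.user-words"

# Tell Tesseract to look for the .user-words suffix:
printf 'user_words_suffix user-words\n' > "$TESSDATA/configs/userwords"
```

Then invoke tesseract with the config name, e.g.
tesseract input.png output -l ron userwords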

> Are any other tunables such as the above that you think may help looking into?

I found 'enable_new_segsearch 1' to be very helpful, but it might
already be enabled with Romanian (use combine_tessdata -u and check
the .config file if you want to see). Other than that, I can't
really advise. There isn't any documentation for most of the
configuration variables, so they're in the realm of "black magic".
grep -R VAR_H * | grep -v '^Binary ' | grep -v 'svn-base'
on the source tree will give you a listing of things to try if you
feel like exploring.

Nick

Nick White

Aug 22, 2012, 12:53:10 PM
to Jani Monoses, tesser...@googlegroups.com
On Wed, Aug 22, 2012 at 09:43:10AM -0700, Jani Monoses wrote:
> If I only do this I get:
>
> Re-initializing document dictionary...
> Error: word 'aerobuz/P' not in DAWG after adding it
> Error: failed to load /usr/share/tesseract-ocr/tessdata/ron.user-words
>
> So I need to do the wordlist2dawg and recombination command sequence as you
> suggested in your initial reply?

I believe the .user-words file should just be a plain UTF-8 text
file, with one word per line. Is that what you're using?

The wordlist2dawg command is only used for the main dictionaries;
user-words is different.
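If the word list comes from a hunspell dictionary, note that .dic
files start with a word-count line and append affix flags after a
'/'; something like this (file names here are placeholders) should
reduce it to plain words:

```shell
# Drop the leading word-count line of a hunspell .dic file, then
# strip any '/FLAGS' affix metadata, leaving one plain UTF-8 word
# per line (e.g. 'aerobuz/P' becomes 'aerobuz').
tail -n +2 ro_RO.dic | sed 's|/.*||' > ron.user-words
```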

Nick

Jani Monoses

Aug 22, 2012, 1:04:53 PM
to Nick White, tesser...@googlegroups.com
On Wed, Aug 22, 2012 at 7:53 PM, Nick White <nick....@durham.ac.uk> wrote:
> On Wed, Aug 22, 2012 at 09:43:10AM -0700, Jani Monoses wrote:
>> If I only do this I get:
>>
>> Re-initializing document dictionary...
>> Error: word 'aerobuz/P' not in DAWG after adding it
>> Error: failed to load /usr/share/tesseract-ocr/tessdata/ron.user-words
>>
>> So I need to do the wordlist2dawg and recombination command sequence as you
>> suggested in your initial reply?
>
> I believe the .user-words file should just be a plain UTF-8 text
> file, with one word per line. Is that what you're using?

Yes, the dictionary file provided by the hunspell-ro package in Ubuntu.

http://paste.ubuntu.com/1161140/

It is UTF-8 from what I can tell.

Nick White

Sep 6, 2012, 8:24:24 AM
to Piyush Tiwari, tesser...@googlegroups.com
Hi Piyush,

> As you said 600 DPI image would be good for OCRs. But I am not able to relate
> 600 DPI with these parameters. My guess is DPI is same as density. Any
> suggestion would be highly appreciated.

DPI is the same as imagemagick's -density option, at least for what
we're using it for. Your command may be failing because convert
needs to be used like this:

convert inputfile.pdf -option1 -option2 outputfile.png

I.e. the input file needs to come before the output options.

Hope this helps. Other than that, the options you're using look fine
to me. Is there something specific that is causing problems?

Nick

Nick White

Sep 6, 2012, 8:26:07 AM
to Jani Monoses, tesser...@googlegroups.com
On Wed, Aug 22, 2012 at 08:04:53PM +0300, Jani Monoses wrote:
> On Wed, Aug 22, 2012 at 7:53 PM, Nick White <nick....@durham.ac.uk> wrote:
> > On Wed, Aug 22, 2012 at 09:43:10AM -0700, Jani Monoses wrote:
> >> If I only do this I get:
> >>
> >> Re-initializing document dictionary...
> >> Error: word 'aerobuz/P' not in DAWG after adding it
> >> Error: failed to load /usr/share/tesseract-ocr/tessdata/ron.user-words

Hi Jani,

Did you ever work out what was causing this problem and fix it?

If not I'll take another look and see if I have more luck
tracking it down.

Nick

Jani Monoses

Sep 6, 2012, 12:01:23 PM
to Nick White, tesser...@googlegroups.com
Hi Nick,

I removed the bogus words from that file (it is a list of words plus
some suffix metadata for the hunspell dictionary engine, I guess),
but I still get errors. So it is not the '/' character.

$tesseract -l ron 005.tiff output uwo
Re-initializing document dictionary...
Error: word 'altimetrie' not in DAWG after adding it
Error: failed to load /usr/share/tesseract-ocr/tessdata/ron.user-words

$cat uwo
user_words_suffix user-words

Robert Komar

Sep 6, 2012, 1:18:34 PM
to tesser...@googlegroups.com
I think the "-density 600" option should come before the
name of the pdf file. It will then scale the vector
output to that DPI (assuming the PDF file has reasonable
DPI values within it). Putting it after just sets the DPI
tag on the output while rendering the contents at 72 DPI.

I would leave off the -geometry option, as '4000' is
probably not what the width of the scaled contents actually
is. -depth is for telling convert what the input depth is
(if it can't figure it out itself), so I'd leave that off,
too. Try something simple like:

convert -density 600 inputfile.pdf -monochrome -compress \
Group4 outputfile.tif

Cheers,
Rob Komar