On Wed, Aug 22, 2012 at 06:58:19AM -0700, Jani Monoses wrote:
> thanks for the prompt answer!
You're welcome. As I said, it's nice to have clear, well written
questions ;)
> > 600DPI is generally recommended. You could try higher, but if you
> > say there were some improvements and some regressions, I'd just stay
> > at 600DPI.
>
> Alright, although there seemed to be more improvements than regressions at
> 1000dpi.
I don't think there are any fixed rules on this (someone else should
correct me if I'm wrong). So by all means use 1000dpi if it looks better.
> By the available language data I meant the already available
> /usr/share/tesseract-ocr/tessdata/ron.traineddata for Romanian
> that comes in Ubuntu/Debian's packaging of Tesseract.
Aah, OK, forgive me, I didn't realise there was a Romanian training
that you were already using. Good.
> I was wondering if the Romanian dataset needs further training - I am not sure
> what well-trained means in this context.
Further training probably wouldn't be worth it. It isn't really
feasible to just "improve" an existing training at present; you would
have to create a wholly new training, which would take a lot of effort
and probably not have a big impact.
> I only meant spelling corrections in the post processing phase as I see quite a
> few non-words being recognized instead of
> what the original document has, usually one or two edit-distances away.
> Matching with dictionary words could fix these but
> then I wonder if it would not go against the intention of the OCR process,
> which is to recognize what is in the input, and not
> what the correct spelling of the input is. In my case the originals are all
> correctly spelled so I would need a post-processing step
> anyway but maybe it should not be a core part of Tesseract's pipeline.
OK, I see. One thing you could do would be to experiment with
increasing Tesseract's trust in its dictionary. I have done
something similar with my training. Create a file containing these
two lines:

    language_model_penalty_non_freq_dict_word 0.2
    language_model_penalty_non_dict_word 0.3

and save it as configs/trustdict inside your tessdata folder, wherever
that is (going by your ron.traineddata path, the full path is probably
/usr/share/tesseract-ocr/tessdata/configs/trustdict).
The default values for those configuration variables are 0.1 and
0.15 respectively. Try increasing them step by step and see whether
it helps.
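If you want to compare several settings side by side, a small script
can generate the config files for you. This is only a rough sketch in
Python: the function names, the file names (trustdict_1, trustdict_2,
...) and the candidate values beyond the two pairs mentioned above are
my own assumptions, not anything Tesseract prescribes.

```python
from pathlib import Path

# (non_freq_dict_word penalty, non_dict_word penalty) pairs to try.
# The first pair is the defaults and the second is the pair suggested
# above; the third is an arbitrary guess for experimentation.
CANDIDATES = [(0.1, 0.15), (0.2, 0.3), (0.4, 0.6)]


def config_text(non_freq_penalty, non_dict_penalty):
    """Return the config body as two 'name value' lines."""
    return (
        f"language_model_penalty_non_freq_dict_word {non_freq_penalty}\n"
        f"language_model_penalty_non_dict_word {non_dict_penalty}\n"
    )


def write_variants(configs_dir, candidates=CANDIDATES):
    """Write one config file per candidate pair and return the paths."""
    configs_dir = Path(configs_dir)
    configs_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (non_freq, non_dict) in enumerate(candidates, start=1):
        path = configs_dir / f"trustdict_{index}"
        path.write_text(config_text(non_freq, non_dict))
        paths.append(path)
    return paths
```

Calling write_variants("some/scratch/dir") and then copying the files
into tessdata/configs/ (which usually needs root) gives you one config
name per setting to test against the same page.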
Then when you run tesseract, do something like this:

    tesseract input.png output -l ron trustdict
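If that isn't enough and you still want the separate post-processing
step you describe, it could look roughly like the sketch below. It is
plain Python and has nothing to do with Tesseract's internals: the
function names, the max_distance threshold and the tiny word list are
all made up for illustration, and you would load a real Romanian word
list instead. It replaces a non-word only when exactly one dictionary
word is closest and within two edits, and otherwise leaves the OCR
output alone.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def correct_word(word, dictionary, max_distance=2):
    """Return the unique closest dictionary word, or word unchanged."""
    if word in dictionary:
        return word
    scored = sorted((edit_distance(word, w), w) for w in dictionary)
    if not scored or scored[0][0] > max_distance:
        return word  # nothing close enough
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return word  # ambiguous, so keep what the OCR produced
    return scored[0][1]


# Hypothetical three-word dictionary, for illustration only.
words = {"carte", "casa", "frumos"}
fixed = [correct_word(w, words) for w in ["cartc", "casa", "xyzxyz"]]
```

Keeping ambiguous or distant words unchanged is deliberate: it limits
the risk you mention of "correcting" text that was recognized exactly
as printed.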
Hope this helps, and let us know how you get on.
Nick