*** On behalf of Andy Syme, who could not post in this group, probably
due to spam-removal artefacts ***
...my problem is that I have some documents written in 1890-1920 that
I scanned & want to OCR. They are in English, & using the standard
English language file I was getting 40-50% recognition. I then tried
to train a new font. I made an image file with at least 1, often 3 or
4, copies of each character & used pyTesseract to make the box file for
this new font. I rebuilt the trained data file (after some trial &
error), including adding the new font & updating the ambiguous
character sets, e.g. g’ = g, \\’ = W etc.
When I rerun Tesseract the OCR recognition is no better. I then
created a language file which was basically all the English files but
with only the ‘new font’ in it, & OCR accuracy dropped.
Is there something I’m doing wrong? The new box file had all the
letters (upper & lower), numbers & some punctuation, but no newer
symbols (e.g. &s or @s), as they are not present in these docs.
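
For anyone wanting to check a box file like this one, here is a minimal sketch of a coverage count. It assumes the usual box-file layout of one character per line followed by its coordinates (and, in newer versions, a page number). The function names and the min_samples threshold are made up for illustration and are not part of Tesseract or pyTesseract.

```python
from collections import Counter
import string


def count_box_samples(box_text):
    """Count how many boxes each character has in box-file text.

    Assumes each line looks like: <char> <left> <bottom> <right> <top> [<page>]
    """
    counts = Counter()
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            # Skip blank or malformed lines rather than guessing.
            continue
        counts[parts[0]] += 1
    return counts


def coverage_report(counts, expected, min_samples=5):
    """Return (missing, sparse) characters from the expected set.

    min_samples is an illustrative threshold, not a Tesseract setting.
    """
    missing = sorted(ch for ch in expected if counts[ch] == 0)
    sparse = sorted(ch for ch in expected if 0 < counts[ch] < min_samples)
    return missing, sparse


if __name__ == "__main__":
    # Tiny inline sample standing in for a real box file.
    sample = "a 10 10 20 25 0\na 30 10 40 25 0\nb 50 10 60 30 0\n"
    counts = count_box_samples(sample)
    expected = string.ascii_letters + string.digits
    missing, sparse = coverage_report(counts, expected)
    print("missing:", "".join(missing))
    print("sparse:", "".join(sparse))
```

To run it on a real box file, read the file as UTF-8 text and pass the contents to count_box_samples. Characters that come back as missing or sparse are the ones with no or very few training samples.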
I can send the files I made if it will help you.
I will post this again if you prefer, but I am desperately looking for
some help with this.
Andy
*** End Andy Syme ***
The provided file can be downloaded here:
https://docs.google.com/leaf?id=0B4FRY5H4TwI8ZWUzZDkzNjYtZTFiNC00NTBmLWIyY2ItMDFmNDAxZGI1ZTdk&hl=en