OCR on real (dirty) printing


Peter Joh. Brunner

Apr 2, 2015, 10:20:29 AM
to tesser...@googlegroups.com
I have a problem using tesseract with German Fraktur.

First, the text to be OCR'd is real printed text from about 1930.
The printing is a little dirty, i.e. there are small dots and strokes between
the letters.
Although these are far smaller than the actual letters, they are interpreted as
normal letters.

Is it possible to give tesseract parameters so that it
. either ignores letters which do not fit the majority of the other
  letters,
. or only uses letters within a given size range,
. or first makes the boxes,
  then lets me correct the boxes, by hand or by program,
  and finally recognizes the text using the corrected boxes?
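
The first two options can also be approximated outside tesseract, by cleaning the image before OCR. The following is only a rough sketch under assumptions: the page is already binarized into a grid of 0 (paper) and 1 (ink), and the names `connected_components`, `remove_specks` and `min_height_fraction` are made up for this illustration, they are not tesseract parameters.

```python
from collections import deque
from statistics import median


def connected_components(grid):
    """Return the 8-connected ink components as lists of (row, col) pixels."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit all 8 neighbours of the current ink pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_specks(grid, min_height_fraction=0.4):
    """Return a copy of grid without components much shorter than the median.

    min_height_fraction is a hypothetical tuning knob, not a tesseract option.
    """
    cleaned = [row[:] for row in grid]
    components = connected_components(grid)
    if not components:
        return cleaned
    heights = [max(y for y, _ in p) - min(y for y, _ in p) + 1
               for p in components]
    threshold = min_height_fraction * median(heights)
    for pixels, height in zip(components, heights):
        if height < threshold:
            for y, x in pixels:
                cleaned[y][x] = 0  # erase the speck
    return cleaned
```

A purely size-based filter like this would also erase legitimate small marks such as i-dots, umlaut dots and full stops, so the threshold would need tuning and probably an extra rule for marks sitting directly above a letter.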

A dictionary-based solution is not possible, because the text consists only of
names of persons and places.

Another thing I wonder about:
when I OCR an image from .tiff to .txt
and run makebox on the same image,
a few letters are recognized differently!
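
To pin down exactly which letters differ between the two runs, the two outputs could be compared symbol by symbol. This is a minimal sketch, assuming the box file has one symbol per line in the form `symbol left bottom right top page`; the helper names `box_symbols`, `text_symbols` and `diff_symbols` are invented for this example.

```python
import difflib


def box_symbols(box_text):
    """Collect the recognized characters from box-file text, in order."""
    symbols = []
    for line in box_text.splitlines():
        if line.strip():
            # The symbol is everything before the last five fields.
            symbols.extend(line.rsplit(" ", 5)[0])
    return symbols


def text_symbols(txt):
    """Collect the non-whitespace characters of the plain-text output."""
    return [ch for ch in txt if not ch.isspace()]


def diff_symbols(txt, box_text):
    """Return (tag, from_txt, from_box) for every span that differs."""
    a = text_symbols(txt)
    b = box_symbols(box_text)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, "".join(a[i1:i2]), "".join(b[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]
```

Feeding it the contents of the .txt and .box files produced from the same image would list only the places where the two runs disagree.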

Thanks in advance for any help.

Peter Joh. Brunner

Apr 16, 2015, 5:37:31 AM
to tesser...@googlegroups.com
Once again, with more information:


I have a problem using tesseract with German Fraktur.

I work with tesseract 3.02.02 on SUSE Linux 13.2.


First, the text to be OCR'd is real printed text from about 1930.
The printing is a little dirty, i.e. there are small dots and strokes between
the letters.
Although these are far smaller than the actual letters, they are interpreted as
normal letters.


Is it possible to give tesseract parameters so that it
. either ignores letters which do not fit the majority of the other
  letters,
. or only uses letters within a given size range,
. or first makes the boxes,
  then lets me correct the boxes, by hand or by program,
  and finally recognizes the text using the corrected boxes?
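
For the third option, the "correct the boxes by program" step might look roughly like the sketch below. It assumes the box file has one symbol per line as `symbol left bottom right top page`, with the origin at the bottom-left so that `top` is larger than `bottom`. The names `parse_box_line`, `filter_small_boxes` and `min_height_fraction` are hypothetical, and the sketch does not cover how to feed the corrected boxes back into recognition.

```python
from statistics import median


def parse_box_line(line):
    """Split one box-file line into (symbol, left, bottom, right, top, page)."""
    parts = line.rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError("not a box line: %r" % line)
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return parts[0], left, bottom, right, top, page


def filter_small_boxes(lines, min_height_fraction=0.4):
    """Keep only the box lines whose height is close to the typical height.

    min_height_fraction is an illustrative knob, not a tesseract parameter.
    """
    lines = [line for line in lines if line.strip()]
    if not lines:
        return []
    # Box height is top minus bottom because the origin is at the bottom-left.
    heights = []
    for line in lines:
        _, _, bottom, _, top, _ = parse_box_line(line)
        heights.append(top - bottom)
    threshold = min_height_fraction * median(heights)
    return [line for line, h in zip(lines, heights) if h >= threshold]
```

As with any size-only rule, real punctuation and diacritics would be dropped too, so the result would still need a manual check.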

I have already tried setting the following in a config file:
  textord_min_xheight 24
  textord_xheight_mode_fraction 0.9
  textord_xheight_error_margin 0.1
  textord_descx_ratio_min 0.3
  tessedit_redo_xheight FALSE
This changes some things, but does nothing to make tesseract ignore the dots and strokes.

Here is an example:
the attached picture is recognized as the text
  15 Ellser Exdmsund Mögsgzerg
example.tiff