Hopefully this is clearer than my previous mail...
My commandline invocation is as follows...
convert -density 600 mailinglist01.pdf tmp.tif
tesseract -l eng+spa tmp.tif stdout
I'm attaching the "mailinglist01.pdf" file...
I'm using data files downloaded from this section of the wiki...
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-304305
The text generated by tesseract contains the string
"417575 5.1 COMUNICACION, S.A. DE CV."
This is incorrect, as it should say
"417575 5.1 COMUNICACION, S.A. DE C.V."
It's missing a period between the "C" and the "V".
A quick tally tells me that the above commandline sequence triggers
this error 24 times...
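For anyone who wants to reproduce the tally, here is a minimal sketch of one way to count the misreads. The helper name count_cv_errors is made up for illustration, and it assumes the error always shows up literally as "DE CV." in the output.

```shell
# Hypothetical helper: count occurrences of the misread "DE CV." on stdin.
count_cv_errors() {
    # grep -o prints every match on its own line, so wc -l counts
    # matches rather than matching lines.
    grep -o 'DE CV\.' | wc -l
}

# Usage (assumes tmp.tif was produced by the convert step above):
#   tesseract -l eng+spa tmp.tif stdout | count_cv_errors
```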
Can anyone think of any Tesseract tweaks that would fix this?
OTOH it's easy to fix this with text processing after a Tesseract
invocation. Do people usually fix this type of thing with search
and replace?
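The kind of search and replace I have in mind would look roughly like this. It is only a sketch: fix_cv is a made-up helper name, and it assumes the misread always appears as "DE CV." so that other "CV" strings are left alone.

```shell
# Hypothetical helper: restore the period that the OCR step drops in "C.V."
# Reads OCR text on stdin and writes the corrected text to stdout.
fix_cv() {
    # Only rewrite "DE CV." so unrelated "CV" strings are untouched.
    # Text that already reads "DE C.V." does not match and passes through.
    sed 's/DE CV\./DE C.V./g'
}

# Usage (assumes tmp.tif was produced by the convert step above):
#   tesseract -l eng+spa tmp.tif stdout | fix_cv
```

Matching the longer "DE CV." rather than a bare "CV" is deliberate, to reduce the chance of touching text that was recognized correctly.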
These are the software versions...
~% convert --version
Version: ImageMagick 6.7.7-10 2014-08-28 Q16
http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP
~% tesseract --version
tesseract 3.05.01
leptonica-1.74.4
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 :
libtiff 4.0.3 : zlib 1.2.8
--
------------------------------
Ivan Monroy
Desarrollador en Tecnologías para la Transparencia
Datos - https://quienesquien.wiki - @QuienQuienWiki
PODER - http://projectpoder.org - @projectPODER
email - iv...@rindecuentas.org
PGP --- 4EB8 DBD8 12DF 4CE2 D942 5FE6 CFB3 B835 BF0D 6582