Would want a period between C and V

i

Nov 9, 2017, 3:37:33 PM

to tesseract-ocr

Hey!

It's my first time using Tesseract. Apologies if my questions are off-topic.

This is the tesseract version:

tesseract 3.05.01
leptonica-1.74.4
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 :
libtiff 4.0.3 : zlib 1.2.8

A recurring error in the generated text concerns the string "C.V.", which is often not recognized correctly. Quite often, the generated text contains the incorrect string "CV." instead of the correct "C.V.".

I'm attaching a sample PDF.

FWIW, the complete phrase is "S.A. DE C.V.", which is a common type of business entity in Spanish-speaking countries.

Would anyone have any suggestions for fixing this?

mailinglist01.pdf

Ivan

Nov 10, 2017, 12:50:36 PM

to tesser...@googlegroups.com

Hopefully this is clearer than my previous mail...

My commandline invocation is as follows...

convert -density 600 mailinglist01.pdf tmp.tif
tesseract -l eng+spa tmp.tif stdout

I'm attaching the "mailinglist01.pdf" file...

I'm using data files downloaded from this section of the wiki...
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-304305

The text generated by tesseract contains the string
"417575 5.1 COMUNICACION, S.A. DE CV."

This is incorrect, as it should say
"417575 5.1 COMUNICACION, S.A. DE C.V."
It's missing a period between the "C" and the "V".

A quick tally tells me that the above commandline sequence triggers
this error 24 times...

Can anyone think of any Tesseract tweaks that would fix this?

OTOH, it's easy to fix this with text processing after a Tesseract
invocation. Do people usually fix this type of thing with search
and replace?
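
For illustration, the search-and-replace approach mentioned above could look roughly like the Python sketch below. It is only a sketch under stated assumptions: the function name `fix_cv` is hypothetical, and the pattern assumes the error always appears as "CV." directly after "S.A. DE", as in the sample line quoted earlier. It also returns the number of replacements, which gives the same kind of tally as the count of 24 mentioned above.

```python
import re

# Assumption: the OCR error only matters in the phrase "S.A. DE CV.".
# Anchoring on "S.A. DE" avoids touching unrelated uses of "CV".
_SA_DE_CV = re.compile(r"(S\.A\.\s+DE\s+)CV\.")


def fix_cv(text):
    """Return (corrected_text, number_of_replacements).

    Hypothetical helper: rewrites "S.A. DE CV." as "S.A. DE C.V.".
    """
    return _SA_DE_CV.subn(r"\g<1>C.V.", text)


# Example using the line reported in this thread.
ocr_line = "417575 5.1 COMUNICACION, S.A. DE CV."
fixed, count = fix_cv(ocr_line)
print(fixed)
print(count)
```

In practice the text passed to `fix_cv` would be whatever Tesseract wrote to stdout or to its output file.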

These are the software versions...

~% convert --version
Version: ImageMagick 6.7.7-10 2014-08-28 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP

~% tesseract --version

tesseract 3.05.01
leptonica-1.74.4
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 :
libtiff 4.0.3 : zlib 1.2.8

--
------------------------------

Ivan Monroy
Desarrollador en Tecnologías para la Transparencia

Datos - https://quienesquien.wiki - @QuienQuienWiki
PODER - http://projectpoder.org - @projectPODER
email - iv...@rindecuentas.org
PGP --- 4EB8 DBD8 12DF 4CE2 D942 5FE6 CFB3 B835 BF0D 6582

mailinglist01.pdf

Dan9er

Nov 12, 2017, 12:52:01 PM

to tesseract-ocr

Try making a file named spa.user-words in tesseract-ocr/tessdata with this line in it:

C.V.

This tells Tesseract that this is a special word it should also look for. You can add more words, one per line, in order of how frequently they appear in your context. This feature exists so you can add context-specific words to Tesseract's dictionary without having to retrain it.
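
As a rough sketch of the suggestion above, the following Python snippet writes a word list with one word per line and builds a command line that points Tesseract at it. The helper names `write_user_words` and `build_command` are hypothetical. The snippet also assumes that your build accepts a `--user-words` option, which is not mentioned in this thread; please check the help output or man page of your Tesseract version before relying on it. The command is only built here, not executed.

```python
import os
import tempfile
from pathlib import Path


def write_user_words(path, words):
    """Write one word per line to the given file (hypothetical helper)."""
    Path(path).write_text("".join(word + "\n" for word in words), encoding="utf-8")


def build_command(image, words_file, lang="spa"):
    """Build an argument list for tesseract (hypothetical helper).

    Assumption: the installed tesseract accepts --user-words.
    """
    return ["tesseract", image, "stdout", "-l", lang,
            "--user-words", str(words_file)]


# Example: write the list to a temporary directory and show the command.
workdir = tempfile.mkdtemp()
words_path = os.path.join(workdir, "spa.user-words")
write_user_words(words_path, ["C.V."])
print(" ".join(build_command("tmp.tif", words_path)))
```

The resulting list can be passed to `subprocess.run` once you have confirmed the option exists in your installation.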

Ivan

Nov 13, 2017, 1:46:20 PM

to tesser...@googlegroups.com

Thanks for your reply!

I created said spa.user-words file in tesseract-ocr/tessdata but it
didn't help. Maybe I'm doing something wrong...

However, I tried changing the language specification in the
tesseract invocation...

Before, it was:

tesseract -l eng+spa tmp.tif stdout

Now, it's:
tesseract -l spa tmp.tif stdout

For some reason, this solved my issue.

I'm a bit perplexed...

Why did changing the -l flag fix it?
