Would want a period between C and V

i

Nov 9, 2017, 3:37:33 PM
to tesseract-ocr
Hey!

It's my first time using Tesseract. Apologies if my questions are off-topic.

This is the tesseract version:

tesseract 3.05.01
 leptonica-1.74.4
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 :
 libtiff 4.0.3 : zlib 1.2.8

A recurrent error in the generated text concerns the string "C.V."
This string is often not being read / parsed / recognized correctly...

Quite often, the generated text will contain the incorrect "CV." string
instead of the correct "C.V." string.

I'm attaching a sample PDF.
FWIW the complete phrase is "S.A. DE C.V.", which is a common "type
of business entity" in Spanish-speaking geographies...

Would anyone have any suggestions for fixing this?
mailinglist01.pdf

Ivan

Nov 10, 2017, 12:50:36 PM
to tesser...@googlegroups.com
Hopefully this is clearer than my previous mail...

My commandline invocation is as follows...

convert -density 600 mailinglist01.pdf tmp.tif
tesseract -l eng+spa tmp.tif stdout

I'm attaching the "mailinglist01.pdf" file...

I'm using data files downloaded from this section of the wiki...
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-304305

The text generated by tesseract contains the string
"417575 5.1 COMUNICACION, S.A. DE CV."

This is incorrect, as it should say
"417575 5.1 COMUNICACION, S.A. DE C.V."
It's missing a period between the "C" and the "V".

A quick tally tells me that the above commandline sequence triggers
this error 24 times...

Can anyone think of any Tesseract tweaks that would fix this?

OTOH it's easy to fix this with text processing, after a Tesseract
invocation. Do people usually fix this type of thing with search
and replace?
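
For reference, the search-and-replace approach mentioned above could look roughly like the following. This is only a minimal sketch: the function name fix_cv and the regular expression are illustrative assumptions, not anything provided by Tesseract. It rewrites "CV." only when it directly follows "S.A. DE ", so other "CV." strings are left untouched, and it also returns the number of replacements, which gives a tally of the misreads.

```python
import re

# Hypothetical pattern (an assumption for this sketch): match "CV." only
# when it directly follows "S.A. DE ", and capture that prefix.
_MISREAD = re.compile(r"(S\.A\. DE )CV\.")


def fix_cv(text):
    """Return (fixed_text, replacement_count) for OCR output text."""
    # \g<1> re-inserts the captured "S.A. DE " prefix before "C.V."
    return _MISREAD.subn(r"\g<1>C.V.", text)


if __name__ == "__main__":
    sample = "417575 5.1 COMUNICACION, S.A. DE CV."
    fixed, count = fix_cv(sample)
    print(fixed)
    print(count)
```

Text that already contains the correct "S.A. DE C.V." passes through unchanged, so the function can be applied to every page regardless of whether the misread occurred.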

These are the software versions...

~% convert --version
Version: ImageMagick 6.7.7-10 2014-08-28 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP

~% tesseract --version
tesseract 3.05.01
leptonica-1.74.4
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 :
libtiff 4.0.3 : zlib 1.2.8


--
------------------------------




Ivan Monroy
Desarrollador en Tecnologías para la Transparencia

Datos - https://quienesquien.wiki - @QuienQuienWiki
PODER - http://projectpoder.org - @projectPODER
email - iv...@rindecuentas.org
PGP --- 4EB8 DBD8 12DF 4CE2 D942 5FE6 CFB3 B835 BF0D 6582



mailinglist01.pdf

Dan9er

Nov 12, 2017, 12:52:01 PM
to tesseract-ocr
Try making a file named spa.user-words in tesseract-ocr/tessdata with this line in it:
C.V.

This will tell tesseract that this is a special word that it should also look for. You can also add more words, one per line, in order of how frequently they appear in your context. This feature was added so you can add context-specific words to Tesseract's dictionary without having to retrain it.
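
A rough sketch of the user-words idea in script form is below. It rests on an assumption: it passes the file explicitly with the --user-words option listed in the tesseract usage text, rather than relying on a file placed in tessdata, and whether this changes the result for "C.V." is untested here. The helper names write_user_words and build_command are hypothetical.

```python
def write_user_words(path, words):
    """Write a user-words file: plain text, one word per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for word in words:
            handle.write(word + "\n")


def build_command(image, langs, user_words_path):
    """Build (but do not run) a tesseract command line.

    Assumption: the installed tesseract accepts --user-words PATH;
    check `tesseract --help` before relying on it.
    """
    return [
        "tesseract", image, "stdout",
        "-l", langs,
        "--user-words", user_words_path,
    ]


if __name__ == "__main__":
    write_user_words("my.user-words", ["C.V.", "S.A."])
    print(" ".join(build_command("tmp.tif", "spa", "my.user-words")))
```

The command is only printed, not executed, so the sketch runs without Tesseract installed.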

Ivan

Nov 13, 2017, 1:46:20 PM
to tesser...@googlegroups.com
Thanks for your reply!

I created said spa.user-words file in tesseract-ocr/tessdata but it
didn't help. Maybe I'm doing something wrong...

However, I tested changes in the language specification of the
tesseract invocation...

Before, it was:
tesseract -l eng+spa tmp.tif stdout

Now, it's:
tesseract -l spa tmp.tif stdout

For some reason, this solved my issue.

I'm a bit perplexed...

Why did changing the -l flag fix it?
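
One way to compare the two -l settings side by side is to run the same image under each and count how often the misread appears. The following is a sketch only: the helper names are hypothetical, the needle "DE CV." is an assumption about what the misread looks like in the output, and compare() needs tesseract on the PATH and an existing tmp.tif to actually run.

```python
import subprocess


def ocr_command(image, langs):
    """Same invocation as in the thread, with a configurable -l value."""
    return ["tesseract", "-l", langs, image, "stdout"]


def count_misreads(text, needle="DE CV."):
    """Count non-overlapping occurrences of the misread string."""
    return text.count(needle)


def compare(image, lang_specs=("eng+spa", "spa")):
    """Run tesseract once per language spec and tally the misreads.

    Requires tesseract to be installed; not called at import time.
    """
    results = {}
    for langs in lang_specs:
        proc = subprocess.run(
            ocr_command(image, langs),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
        results[langs] = count_misreads(proc.stdout)
    return results
```

Calling compare("tmp.tif") would return a dictionary mapping each language spec to its misread count, which makes the difference between the two settings easy to report.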