Spanish text better processed in eng than in spa


valentin...@gmail.com

Aug 28, 2017, 1:38:03 AM
to tesseract-ocr
After following the quality-improvement instructions at https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality I got what I think is a clean image. I am attaching the tessinput.tif file I received as output.

When I ran tesseract 4.0.0 on the image, I found that the eng model actually gives a better result than the spa (Spanish) model.

What can I do? I have seen recurring errors with the same chart.
out.eng.txt
out.spa.txt
tessinput.tif

ShreeDevi Kumar

Aug 28, 2017, 2:15:41 AM
to tesser...@googlegroups.com
Have you tried the 'best' traineddata files?

What about results using best/Spanish vs best/spa?

I have opened this as an issue at https://github.com/tesseract-ocr/tessdata/issues/77

You can provide additional feedback there.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b1efae89-d9d5-4970-9b3e-5e29f9dd6620%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

valentin...@gmail.com

Aug 28, 2017, 6:24:01 PM
to tesseract-ocr
I had installed the default tessdata used by the installer, which seems to be this one: https://github.com/tesseract-ocr/tessdata/blob/master/spa.traineddata

Following your comment, I installed this file: https://github.com/tesseract-ocr/tessdata/blob/master/best/spa.traineddata

But I could not find best/Spanish. Is it missing from the upload?

The best/spa model is REALLY better, and its quality is comparable to English: the two have more or less the same level of errors.

Where is best/Spanish? Given the improvement, I am really interested in testing it.

Btw, is there any way to tell tesseract that values are in a table, so that it will not make a mistake identifying lines with charts?


ShreeDevi Kumar

Aug 28, 2017, 9:17:40 PM
to tesser...@googlegroups.com
I had not checked the list.

It should actually be Latin.traineddata for all languages written in Latin script. Not Spanish, as I had written.
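
For anyone repeating the comparison, here is a minimal Python sketch that builds one tesseract command per model. It assumes the `tesseract` executable is on the PATH and that the 'best' traineddata files (including Latin.traineddata) were downloaded into a local folder, hypothetically named `tessdata_best` here. The helper names `build_command` and `commands_for_models` are illustrative and not part of Tesseract; `-l` and `--tessdata-dir` are the regular command-line options.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image, out_base, lang, tessdata_dir=None):
    """Return the argument list for one tesseract run (nothing is executed)."""
    cmd = ["tesseract", str(image), str(out_base), "-l", lang]
    if tessdata_dir is not None:
        # Point tesseract at the folder holding the 'best' models (assumed path).
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def commands_for_models(image, langs, tessdata_dir=None):
    """Build one command per model; output goes to out.<lang>.txt."""
    return {
        lang: build_command(image, f"out.{lang}", lang, tessdata_dir)
        for lang in langs
    }


if __name__ == "__main__":
    image = Path("tessinput.tif")
    commands = commands_for_models(image, ["eng", "spa", "Latin"], "tessdata_best")
    for lang, cmd in commands.items():
        print(" ".join(cmd))
        # Only run when tesseract and the image are actually available.
        if shutil.which("tesseract") and image.exists():
            subprocess.run(cmd, check=False)
```

Each run writes a separate text file, so the eng, spa and Latin results can be compared side by side.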


ShreeDevi Kumar

Aug 28, 2017, 11:10:34 PM
to tesser...@googlegroups.com
>Btw, is there any way to tell tesseract that values are in a table, so that it will not make a mistake identifying lines with charts?

I don't think tesseract has that ability.

You will need to preprocess the image to remove lines. Leptonica has functions to do that, as well as a table detector.
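
To illustrate the idea of removing lines before OCR, here is a rough pure-Python sketch. It is not Leptonica's API and not its method: it is a simplistic run-length filter that erases any horizontal or vertical run of ink pixels at least `min_len` long, on the assumption that table rules are much longer than any stroke of a character. The function name `remove_long_runs`, the 0/1 grid representation and the threshold are all assumptions made for the example.

```python
def remove_long_runs(grid, min_len):
    """Erase horizontal and vertical ink runs of at least min_len pixels.

    grid is a list of rows; 1 means ink, 0 means background.
    Returns a new grid; the input is left unchanged.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    erase = [[False] * width for _ in range(height)]

    # Mark long horizontal runs.
    for y in range(height):
        x = 0
        while x < width:
            if grid[y][x]:
                start = x
                while x < width and grid[y][x]:
                    x += 1
                if x - start >= min_len:
                    for i in range(start, x):
                        erase[y][i] = True
            else:
                x += 1

    # Mark long vertical runs.
    for x in range(width):
        y = 0
        while y < height:
            if grid[y][x]:
                start = y
                while y < height and grid[y][x]:
                    y += 1
                if y - start >= min_len:
                    for i in range(start, y):
                        erase[i][x] = True
            else:
                y += 1

    return [
        [0 if erase[y][x] else grid[y][x] for x in range(width)]
        for y in range(height)
    ]


# Tiny example: a top rule, a left rule, and a two-pixel "character".
page = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
cleaned = remove_long_runs(page, min_len=4)
```

In this example the two rules are erased and the short stroke in the third row survives. On a real scan the threshold needs tuning, and characters that touch a rule will lose the pixels they share with it, which is one reason a dedicated library is the better choice.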




ShreeDevi

valentin...@gmail.com

Aug 29, 2017, 10:08:05 AM
to tesseract-ocr
The spa and Latin models in the best folder are more or less equivalent; there is no significant difference. Both still make several mistakes, but the mistakes are quite reasonable. The ones that produce really bad output are the official ones that are installed automatically.

Do you need help training the data? (Is it a neural network?) I can provide examples.
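
A comparison like "more or less equivalent" can be made a little more concrete with the standard-library `difflib` module. The sketch below scores how similar two OCR outputs are and lists the lines that differ. Note that it only compares two outputs with each other; it does not measure accuracy against a ground truth, which would need a hand-corrected transcription. The file names follow the attachments in this thread, and out.Latin.txt is a hypothetical name for the Latin model's output.

```python
import difflib
from pathlib import Path


def similarity(text_a, text_b):
    """Return a ratio in [0, 1]; 1.0 means the two texts are identical."""
    matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
    return matcher.ratio()


def differing_lines(text_a, text_b):
    """Return the lines present in only one of the two texts.

    Lines from text_a are prefixed with "- ", lines from text_b with "+ ".
    """
    diff = difflib.ndiff(text_a.splitlines(), text_b.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    path_a, path_b = Path("out.spa.txt"), Path("out.Latin.txt")
    if path_a.exists() and path_b.exists():
        a = path_a.read_text(encoding="utf-8")
        b = path_b.read_text(encoding="utf-8")
        print(f"similarity: {similarity(a, b):.3f}")
        for line in differing_lines(a, b):
            print(line)
```

The differing lines are also a convenient starting point when collecting examples to attach to the GitHub issue.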

ShreeDevi Kumar

Aug 29, 2017, 10:40:52 AM
to tesser...@googlegroups.com
I have opened this as an issue at https://github.com/tesseract-ocr/tessdata/issues/77

You can provide additional feedback there.

@theraysmith is doing the training at Google. The examples you provide will be helpful to him and will improve future training.

ShreeDevi