Tesseract 4 new Font

Maicon Azevedo

May 17, 2017, 2:02:24 AM

to tesseract-ocr

Hello!

Guys, I have Tesseract 4 on Ubuntu 16.04.

Running tesseract with -l por (Portuguese, from Brazil), I don't get good results. I think the image uses a different font from the trained data.

My question is: is it necessary to train Tesseract again? I created the tif and box files with jtesseditor, but I don't know what I need to do with these files or how to produce good training data. I saw https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 but I didn't find any case similar to mine.

Thanks in advance!

ShreeDevi Kumar

May 17, 2017, 5:46:08 AM

to tesser...@googlegroups.com

1. Which --oem are you using with Tesseract 4: the legacy engine (--oem 0) or LSTM (--oem 1)?

2. Is Brazilian Portuguese very different from Portuguese? Please see the trainingtext and wordlists on https://github.com/tesseract-ocr/langdata/tree/master/por

3. Provide a sample image with its ground truth and point out the errors in it. Is the image at 300 dpi?

4. Please share the box/tiff pair to test for training.
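
For question 1, one way to compare the engines is to run the same image once per --oem value and look at the two outputs side by side. A minimal Python sketch that only builds the two command lines; the image name test.tif and the output base names are placeholders, and it assumes the installed por traineddata contains both the legacy and the LSTM model:

```python
def build_command(image, out_base, lang, oem):
    """Return the tesseract argument list for one engine mode.

    oem 0 selects the legacy engine, oem 1 the LSTM engine.
    """
    return ["tesseract", "--oem", str(oem), image, out_base, "-l", lang]


# Placeholder file names; replace with the real image and output base.
for oem in (0, 1):
    cmd = build_command("test.tif", f"output_oem{oem}", "por", oem)
    print(" ".join(cmd))
```

Each printed line can be pasted into a shell, or the list can be passed to subprocess.run if tesseract is on the PATH.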

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a34d2a11-54d6-416f-87cd-164a8157aed6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Maicon Azevedo

May 17, 2017, 10:37:22 AM

to tesseract-ocr

Hi Shree,

1 - The results are the same with --oem 0 or --oem 1

2 - No, it's very similar. I saw that, and it is why I decided to ask whether it's necessary to train the same language with other fonts. Or do I need to do something with the files in langdata, like copying them to my installation?

3 - I use sample.jpg (attached), which I first convert with this command: convert -density 300 sample.jpg -background white -compress none -colorspace Gray test.tif

Then I run: tesseract --oem 3 test.tif output -l por

The output (attached) is the text extracted by tesseract. As you can see, my name, Maicon, doesn't appear. How can I provide the ground truth data? As a txt file?

4 - I attached the files
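
On the ground truth question in point 3: one common form is a plain UTF-8 text file containing the exact text of the image, typed by hand. A minimal Python sketch of how such a file could be compared with output.txt to list the words that tesseract missed. The sample strings below are made up for illustration and are not taken from the attached files:

```python
import difflib


def missing_words(truth_text, ocr_text):
    """Return ground-truth words that were dropped or changed in the OCR output."""
    truth = truth_text.split()
    ocr = ocr_text.split()
    matcher = difflib.SequenceMatcher(a=truth, b=ocr, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" = word absent from OCR, "replace" = word recognized differently
        if tag in ("delete", "replace"):
            missing.extend(truth[i1:i2])
    return missing


def similarity(truth_text, ocr_text):
    """Character-level similarity ratio between 0 and 1 (1.0 means identical)."""
    return difflib.SequenceMatcher(a=truth_text, b=ocr_text, autojunk=False).ratio()


# Hypothetical strings standing in for the ground truth file and output.txt.
truth = "Nome: Maicon Azevedo"
ocr = "Nome: Azevedo"
print(missing_words(truth, ocr))
print(round(similarity(truth, ocr), 2))
```

With real files, the two strings would be read from the ground truth file and from output.txt, both opened with encoding="utf-8".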

por.courierprime.exp0.box

por.courierprime.exp0.tif

por.font_properties

sample.jpg

output.txt

Ahmad Moawad

Jun 5, 2017, 9:03:32 AM

to tesseract-ocr

Hello,

I have a similar situation to yours. I think you should copy the box/tiff pair first, before doing any training with tesstrain.sh.
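
The message above does not say where the files should be copied, so the following is only a sketch of the copying step in Python. It collects .box/.tif files that share a base name (as por.courierprime.exp0.box and por.courierprime.exp0.tif do) and copies complete pairs into a working directory. The directory names are hypothetical, and whether tesstrain.sh picks the files up from there depends on how it is invoked:

```python
import shutil
from pathlib import Path


def find_pairs(src_dir):
    """Return (box, tif) path pairs in src_dir that share a base name."""
    pairs = []
    for box in sorted(Path(src_dir).glob("*.box")):
        tif = box.with_suffix(".tif")
        if tif.exists():  # skip box files that have no matching image
            pairs.append((box, tif))
    return pairs


def copy_pairs(src_dir, dest_dir):
    """Copy every complete box/tif pair into dest_dir; return the base names copied."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for box, tif in find_pairs(src_dir):
        shutil.copy2(box, dest / box.name)
        shutil.copy2(tif, dest / tif.name)
        copied.append(box.stem)
    return copied
```

A call such as copy_pairs("my_boxes", "training_work") would return the base names it copied, which makes it easy to spot a box file whose image is missing.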
