Tesseract 4 new Font

Maicon Azevedo

May 17, 2017, 2:02:24 AM
to tesseract-ocr
Hello!

I have Tesseract 4 on Ubuntu 16.04.

Running tesseract with -l por (Portuguese, from Brazil), I don't get good results. I think the image uses a different font from the one in the trained data.

My question is: do I need to train Tesseract again? I created the tif and box files with jtesseditor, but I don't know what to do with these files or how to produce good training data. I saw https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 but didn't find any case similar to mine.

Thanks in advance!

ShreeDevi Kumar

May 17, 2017, 5:46:08 AM
to tesser...@googlegroups.com
1. Which --oem are you using with tesseract 4, legacy engine or lstm?

--oem 0 or --oem 1

2. Is Brazilian Portuguese very different from Portuguese? Please see the trainingtext and wordlists on https://github.com/tesseract-ocr/langdata/tree/master/por

3. Provide a sample image with its ground truth and point out the errors in it. Is the image at 300 dpi?

4. Please share the box/tiff pair to test for training.
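
For point 1, a minimal sketch of running both engines on the same image so the two outputs can be compared side by side. The image name `test.tif` and the output base names are assumptions for illustration, and `--oem 0` only works if the installed `por` traineddata includes the legacy model.

```shell
#!/bin/sh
# Sketch: run the legacy engine (--oem 0) and the LSTM engine (--oem 1)
# on one image. File names below are assumed, not prescribed.
image="test.tif"
lang="por"
outputs=""

for oem in 0 1; do
  base="output_oem${oem}"              # tesseract appends .txt to this base
  outputs="${outputs}${outputs:+ }${base}"
  if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
    tesseract "$image" "$base" --oem "$oem" -l "$lang" \
      || echo "engine mode $oem failed" >&2
  else
    # Dry run when tesseract or the image is not available.
    echo "skipping: tesseract $image $base --oem $oem -l $lang"
  fi
done

echo "output bases to compare: $outputs"
```

If both runs succeed, `output_oem0.txt` and `output_oem1.txt` can be read or diffed to see whether the engines really give the same result.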

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a34d2a11-54d6-416f-87cd-164a8157aed6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Maicon Azevedo

May 17, 2017, 10:37:22 AM
to tesseract-ocr
Hi shree,

1 - The results are the same with --oem 0 or --oem 1.

2 - No, it's very similar. I saw that, and that is why I decided to ask whether it's necessary to train the same language with other fonts. Or do I need to do something with the files in langdata, like copying them to my installation?

3 - I used sample.jpg (attached) and converted the image with this command:   convert -density 300 sample.jpg -background white -compress none -colorspace Gray test.tif

Then: tesseract --oem 3 test.tif output -l por

The output (attached) is the text extracted by tesseract. As you can see, my name, Maicon, doesn't appear. How can I provide the ground truth data? As a txt file?

4 - I attached the files
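
On the ground truth question in point 3: one common approach is a plain UTF-8 text file typed by hand from the image, which can then be diffed against the OCR output. A minimal sketch follows. The file names and the two lines of text are invented for illustration and are not the contents of the attached files.

```shell
#!/bin/sh
# Sketch: compare hand-typed ground truth with OCR output using diff.
# All file contents here are made-up placeholders.
workdir=$(mktemp -d)

# Ground truth: what a person reads in the image.
printf 'Name: Maicon\nCity: Example\n' > "$workdir/ground_truth.txt"
# OCR output: what tesseract produced (here the name is missing).
printf 'Name:\nCity: Example\n' > "$workdir/output.txt"

# diff exits with status 1 when the files differ, so tolerate that.
diff -u "$workdir/ground_truth.txt" "$workdir/output.txt" \
  > "$workdir/errors.diff" || true

# Count ground-truth lines that the OCR output did not reproduce.
# (Rough: a removed line that itself starts with "-" is not counted.)
mismatches=$(grep -c '^-[^-]' "$workdir/errors.diff")
echo "ground-truth lines with errors: $mismatches"

cat "$workdir/errors.diff"
rm -rf "$workdir"
```

Posting the ground truth file together with the diff makes it easy for others to see exactly which words were missed.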


por.courierprime.exp0.box
por.courierprime.exp0.tif
por.font_properties
sample.jpg
output.txt

Ahmad Moawad

Jun 5, 2017, 9:03:32 AM
to tesseract-ocr
Hello,

I have a similar situation to yours. I think you should copy the box/tiff pair first, before doing any training with tesstrain.sh.
