Word coordinate for single lines.


ahka.an...@gmail.com

Jun 15, 2018, 8:42:00 AM
to tesseract-ocr
Dear All,

In the project I am currently working on, I have a pure text line cropped from a document image.

As a next step, I need to recognize the text and, at the same time, get the word coordinates.

To get the coordinates I pass the hocr config to the command line and set the page segmentation mode to 7 (single text line):

tesseract file.png out.txt --psm 7 hocr

However, the output is really bad: with these parameters the line seems to be treated as a page, and some words are missing from the output.

Is there another way to get the word coordinates for that line?

ahka.an...@gmail.com

Jun 22, 2018, 7:59:56 AM
to tesseract-ocr
Could someone please try to answer my question?

Shree Devi Kumar

Jun 22, 2018, 9:59:23 AM
to tesser...@googlegroups.com
Please try with a different psm and see if you get better results. If you share a sample image we can test and respond.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d24b268f-5cfa-4d20-89c0-9dfd2360f0dc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



ahka.an...@gmail.com

Jun 22, 2018, 10:05:41 AM
to tesseract-ocr


Thanks for the reply.
Here are two example line images.

Shree Devi Kumar

Jun 22, 2018, 10:11:33 AM
to tesser...@googlegroups.com
Try adding a slight white border to images and see if that helps.
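
A minimal sketch of what adding a white border means, assuming the line image is an 8-bit grayscale bitmap held as a list of pixel rows (255 = white). The function name and the 10px default are illustrative assumptions, not values recommended in this thread. With Pillow installed, `ImageOps.expand(img, border=10, fill=255)` should do the same on a real image file.

```python
def add_white_border(rows, border=10, white=255):
    """Return a copy of a grayscale bitmap padded with `border` white pixels on every side.

    `rows` is a list of equal-length pixel rows (hypothetical in-memory format).
    """
    width = len(rows[0]) if rows else 0
    blank = [white] * (width + 2 * border)

    # White rows above the text line.
    padded = [list(blank) for _ in range(border)]
    # Original rows with white pixels added left and right.
    for row in rows:
        padded.append([white] * border + list(row) + [white] * border)
    # White rows below the text line.
    padded.extend(list(blank) for _ in range(border))
    return padded


if __name__ == "__main__":
    tiny = [[0, 0], [0, 0]]  # a 2x2 all-black "image"
    for row in add_white_border(tiny, border=1):
        print(row)
```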



ahka.an...@gmail.com

Jun 22, 2018, 10:47:36 AM
to tesseract-ocr
I have tried adding margins to the lines, but it did not improve the results.

I also tried other psm values (11, 12, ...), but that did not improve the output either.

It looks like the hocr parameter forces the page segmentation mode to treat the line as a page.

Any ideas on how to improve the results?



Lorenzo Bolzani

Jun 22, 2018, 3:17:18 PM
to tesser...@googlegroups.com

With this configuration:

tesseract 3.05.01
 leptonica-1.75.3
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : zlib 1.2.8


Running:

tesseract --psm 7 -l eng 24-block-0-L-42.png out

gives me:

3765 Sexualhormonbind. Globulin 1, 15 30 , 16


Upscaling the image to height 50px gives me:

3765 Sexualhormonbind. Globulin 1,15 30,16
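
To make the upscaling step concrete, here is a small sketch that only computes the new size for a given target height while keeping the aspect ratio. The 50px target is the value used in the test above; the helper name and the 400x25 sample size are made up for illustration, and the resize itself would be done with whatever image tool you prefer.

```python
def scaled_size(width, height, target_height=50):
    """Return (new_width, new_height) that scales an image to target_height,
    preserving the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    factor = target_height / height
    # Never let the width collapse to zero for very narrow images.
    return max(1, round(width * factor)), target_height


if __name__ == "__main__":
    # Hypothetical 400x25 line crop scaled up to a height of 50px.
    print(scaled_size(400, 25))
```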


Attached is the hOCR output I get with your command.
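
In case it helps with reading the word coordinates out of such a file: a rough sketch, assuming the usual Tesseract hOCR layout where each word is a span with class `ocrx_word` and a `title` attribute that starts with `bbox x0 y0 x1 y1`. The sample markup and its numbers are invented for illustration and are not taken from the attachment.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._text = []
        self._depth = 0     # nesting depth of tags inside the current word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # e.g. <strong> or <em> inside a word
            return
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def word_boxes(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Invented sample in the assumed hOCR layout (not real output from this thread).
SAMPLE = """<span class='ocr_line' title='bbox 0 0 300 30'>
<span class='ocrx_word' id='word_1_1' title='bbox 4 6 52 24; x_wconf 93'>3765</span>
<span class='ocrx_word' id='word_1_2' title='bbox 60 5 210 26; x_wconf 88'><strong>Globulin</strong></span>
</span>"""

if __name__ == "__main__":
    for text, box in word_boxes(SAMPLE):
        print(text, box)
```

To use it on a real file, read the file contents into a string and pass that to `word_boxes`.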

This is the result for the second image (as is):

3620 Risen 1,15 2,68


For images like this you may also cut each one into three parts:

3765
Sexualhormonbind. Globulin
1,15 30,16

and use a different "tessedit_char_whitelist" for each, like this:


tesseract --psm 7 -l eng -c tessedit_char_whitelist="1234567890" crop.png out
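
If you script this, the per-part commands could be assembled roughly as follows. This is only a sketch: the crop file names and output bases are hypothetical, the digits-plus-comma whitelist for the value column is my assumption, and the text part is run without a whitelist. The argument order mirrors the command above.

```python
import shlex


def tesseract_cmd(image, outbase, whitelist=None, lang="eng", psm=7):
    """Build the argument list for one tesseract call on a single crop."""
    cmd = ["tesseract", "--psm", str(psm), "-l", lang]
    if whitelist:
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    cmd += [image, outbase]
    return cmd


# Hypothetical crops of one line: code, name, values.
PARTS = [
    ("crop_code.png", "out_code", "1234567890"),
    ("crop_name.png", "out_name", None),            # free text, no whitelist
    ("crop_values.png", "out_values", "1234567890,"),  # assumed: digits plus comma
]

if __name__ == "__main__":
    for image, outbase, whitelist in PARTS:
        # Print the command; pass the list to subprocess.run(...) to execute it.
        print(shlex.join(tesseract_cmd(image, outbase, whitelist)))
```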



Bye

Lorenzo

out.txt.hocr