Regarding spaces in Punjabi recognition


Vaibhav Kumar

Nov 21, 2018, 1:33:53 AM
to tesseract-ocr
Hi,

I am trying to do text recognition on Punjabi text using Tesseract.
The recognition itself works fine.

But there is one small issue.
The white space between words in the image is not recognised, i.e. the words are written to the output text file without any spaces between them.

Please suggest a solution for this.

Looking for a positive response from your end.




Regards
Vaibhav

Shree Devi Kumar

Nov 21, 2018, 8:25:18 AM
to tesser...@googlegroups.com
Please provide a sample test image and expected ground truth text.

Which version of the trained data did you use?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6d14452b-60b1-43a5-9a22-142e8bc174a2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Vaibhav Kumar

Nov 21, 2018, 9:13:25 AM
to tesseract-ocr
Please find the image attached.

I used the default Punjabi traineddata provided by the Tesseract developers.

pun4.jpg

Shree Devi Kumar

Nov 21, 2018, 11:51:19 AM
to tesser...@googlegroups.com
There are three repositories with trained data files:

tessdata 
tessdata_best
tessdata_fast
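
For reference, a minimal sketch of how the download location of the Punjabi model in each of these repositories could be built. The helper name traineddata_url is hypothetical, and the GitHub URL layout and the "main" branch name are assumptions, not something stated in this thread. Note that tessdata_best and tessdata_fast contain LSTM-only models, which need Tesseract 4.0 or later.

```shell
# Hypothetical helper: print the assumed download URL of a traineddata file.
# The URL layout and the "main" branch name are assumptions.
traineddata_url() {
  repo="$1"   # tessdata, tessdata_best, or tessdata_fast
  lang="$2"   # language code, e.g. pan for Punjabi
  case "$repo" in
    tessdata|tessdata_best|tessdata_fast) ;;
    *) echo "unknown repository: $repo" >&2; return 1 ;;
  esac
  printf 'https://github.com/tesseract-ocr/%s/raw/main/%s.traineddata\n' "$repo" "$lang"
}

# Show where the Punjabi model would be fetched from in tessdata_best.
traineddata_url tessdata_best pan
```

The printed URL could then be passed to a downloader such as wget, with the resulting file placed in the tessdata directory of the installation.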

Please also share the version info and the command used:

tesseract -v

 


Vaibhav Kumar

Nov 21, 2018, 12:07:38 PM
to tesseract-ocr
tesseract -v yields

tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.2

I used the tessdata repository.
The command I used was: tesseract -l pan pun1.jpg pan
Please find the output file attached.


pan.txt

Shree Devi Kumar

Nov 21, 2018, 12:29:46 PM
to tesser...@googlegroups.com
Please try with Tesseract 4.0.0.
It should give you better recognition.
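
A small sketch of how the installed version could be checked before deciding to upgrade. The function name tesseract_major is hypothetical; it only assumes that the first line of the version output looks like "tesseract 3.04.01", as reported earlier in this thread.

```shell
# Hypothetical helper: extract the major version number from the first
# line of `tesseract -v` output passed in as a string.
tesseract_major() {
  printf '%s\n' "$1" | sed -n '1s/^tesseract \([0-9][0-9]*\)\..*/\1/p'
}

# Example using the version string reported in this thread.
major="$(tesseract_major 'tesseract 3.04.01')"
if [ "${major:-0}" -lt 4 ]; then
  echo "Tesseract $major detected: upgrade to 4.0.0 or later"
fi
```

On a real system the argument could be supplied as "$(tesseract -v 2>&1)", which captures the version text whether it is printed to standard output or standard error.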




Vaibhav Kumar

Nov 21, 2018, 12:35:08 PM
to tesseract-ocr
I read that Tesseract 4.x works on Ubuntu 18.
I am using Ubuntu 16.

Is there any other solution?

Shree Devi Kumar

Nov 21, 2018, 12:55:27 PM
to tesser...@googlegroups.com
Read the main wiki page.

You can install it using Alex's PPA on older versions of Ubuntu.


Make sure to uninstall the 3.04 version.
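
The steps above could look roughly like the sketch below. The PPA name ppa:alex-p/tesseract-ocr and the package names are assumptions that should be checked against the wiki page; the helper names run and upgrade_tesseract are hypothetical. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical upgrade sequence; the PPA and package names are assumptions.
upgrade_tesseract() {
  # Remove the 3.04 package, add the PPA, refresh the package lists,
  # then install the newer engine together with the Punjabi language data.
  run sudo apt-get remove -y tesseract-ocr &&
  run sudo add-apt-repository -y ppa:alex-p/tesseract-ocr &&
  run sudo apt-get update &&
  run sudo apt-get install -y tesseract-ocr tesseract-ocr-pan
}

upgrade_tesseract
```

Running tesseract -v afterwards should confirm which version ended up installed.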


Vaibhav Kumar

Nov 21, 2018, 1:07:35 PM
to tesseract-ocr
That worked.

Thanks for the help.