How to add spaces between strings in a document? Punjabi (Gurmukhi) has a spacing issue: after OCR of the image, the output shows no spaces between the text.

Mandeep Singh

May 24, 2017, 1:31:16 AM
to tesseract-ocr
Hello guys,

I am training data for the Punjabi language and I am getting a spacing issue. How do I edit the config file, and how do I make my own personal config file for my own custom language? Please assist me.

Output is: ੳਸਦਡਗ
I want, and expected, output like this => ੳ ਸ ਦ ਡ ਗ
pan.raavi.exp0.tif

ShreeDevi Kumar

May 24, 2017, 2:14:42 AM
to tesser...@googlegroups.com
Which O/S?
Which version of Tesseract?
How are you training?

Have you tried the packaged traineddata for Punjabi? What result do you get with that?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/9e0aa40e-85e8-4659-87fb-9b586817e377%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mandeep Singh

May 31, 2017, 6:17:11 AM
to tesseract-ocr
I am using Windows 8.1 and Tesseract version 3.04.

I am training the data with jTessBoxEditor, and I tried another method with the C# Serak Trainer, but I didn't find any good solution. The major issue is spacing.


ShreeDevi Kumar

May 31, 2017, 6:24:54 AM
to tesser...@googlegroups.com

Is the output you posted from the 3.04 traineddata in the repo?

What PSM did you use?

Try the experimental Tesseract 4 version for Windows; see the wiki for links.


Mandeep Singh

May 31, 2017, 6:46:37 AM
to tesseract-ocr
Kindly provide me your email address; I want to discuss this issue with you. Yes, I used 3.04. What does PSM mean?

ShreeDevi Kumar

May 31, 2017, 7:35:55 AM
to tesser...@googlegroups.com

ShreeDevi Kumar

May 31, 2017, 8:41:10 AM
to tesser...@googlegroups.com
Use --oem 1 (LSTM engine) with Tesseract 4.0. You will get the correct output.

For the command line interface, use the binaries from https://github.com/UB-Mannheim/tesseract/wiki

For a GUI, look for the Tesseract 4.0 versions of gImageReader: https://github.com/manisandro/gImageReader/releases
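
For readers who want to script the suggestion above, here is a minimal sketch of how the command line could be assembled from Python. It only relies on the standard Tesseract 4 options `-l`, `--oem` and `--psm`; the helper names `build_tesseract_command` and `run_ocr`, and the file names `sample.png` and `out`, are hypothetical and used only for illustration. It assumes `tesseract` is on the PATH and that the Punjabi (`pan`) traineddata is installed.

```python
import subprocess
from typing import List, Optional


def build_tesseract_command(image_path: str, output_base: str,
                            lang: str = "pan", oem: int = 1,
                            psm: Optional[int] = None) -> List[str]:
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG --oem N [--psm N]."""
    cmd = ["tesseract", image_path, output_base, "-l", lang, "--oem", str(oem)]
    if psm is not None:
        # Page segmentation mode is optional; Tesseract uses its default if omitted.
        cmd += ["--psm", str(psm)]
    return cmd


def run_ocr(image_path: str, output_base: str) -> None:
    """Run Tesseract with the LSTM engine; writes OUTPUTBASE.txt on success."""
    # Assumption: a Tesseract 4.x binary named "tesseract" is on the PATH.
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)


# Hypothetical file names, shown only to illustrate the resulting command.
print(" ".join(build_tesseract_command("sample.png", "out")))
```

Calling `run_ocr("sample.png", "out")` would then execute the same command and leave the recognized text in `out.txt`.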




Mandeep Singh

Jun 1, 2017, 2:10:22 AM
to tesseract-ocr

There is still a spacing issue. Kindly review the attachment.

Please help me out.

issue.PNG

Mandeep Singh

Jun 1, 2017, 3:43:37 AM
to tesseract-ocr
Kindly look at this issue, or please guide me on how to add a config file for the Punjabi language.

ShreeDevi Kumar

Jun 1, 2017, 4:34:34 AM
to tesser...@googlegroups.com
Are you using the 4.0 version of tesseract with --oem 1 (LSTM engine only)?

Mandeep Singh

Jun 1, 2017, 4:46:04 AM
to tesseract-ocr
I installed tesseract.exe 4.0 on my system, and after that I am using jTessBoxEditor 2.0 to train data for the Punjabi language. That's it. I don't know what LSTM means. Please guide me.

ShreeDevi Kumar

Jun 1, 2017, 5:03:01 AM
to tesser...@googlegroups.com
Please read the wiki links I sent.

If you have installed Tesseract 4.0, please test first with the provided traineddata for Punjabi before trying to train.

Most of the time, the existing traineddata provides the best result.

ShreeDevi Kumar

Jun 1, 2017, 5:04:14 AM
to tesser...@googlegroups.com

has the traineddata for 4.0.

Mandeep Singh

Jun 1, 2017, 5:18:33 AM
to tesseract-ocr

Oh, thank you very much, it is working. Many, many thanks to you.

But I have more questions.

1. If I train on new data, there is still the space problem.

2. How do I add more data to pan.traineddata, or can I edit the existing traineddata?

ShreeDevi Kumar

Jun 1, 2017, 5:24:50 AM
to tesser...@googlegroups.com
Are you training for 3.0 or 4.0?

Do you have spaces between the letters in your training text?
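
One quick way to answer the second question is to scan the training text for lines that contain no whitespace at all. The sketch below is only an illustration: the function name `lines_without_spaces` is hypothetical, and the sample strings are the two outputs quoted at the start of this thread. Note that a line holding a single legitimate word would also be flagged, so the result needs a manual look.

```python
from typing import List


def lines_without_spaces(training_text: str) -> List[str]:
    """Return the non-empty lines that have no whitespace between characters."""
    flagged = []
    for line in training_text.splitlines():
        # str.split() with no arguments splits on any run of whitespace.
        if line.strip() and len(line.split()) < 2:
            flagged.append(line)
    return flagged


# Sample lines taken from the first message in this thread.
sample = "ੳਸਦਡਗ\nੳ ਸ ਦ ਡ ਗ"
for line in lines_without_spaces(sample):
    print("no spaces found in:", line)
```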


Mandeep Singh

Jun 1, 2017, 6:37:55 AM
to tesseract-ocr

I am now using Tesseract 4.0 as per your guidance, and I want to train data for version 4.0. Yes, I am putting spaces between the text, but the output is not showing spaces between the text.
Please tell me how to train the data again for the new version.


ShreeDevi Kumar

Jun 1, 2017, 7:09:57 AM
to tesser...@googlegroups.com

kmpre...@gmail.com

Mar 17, 2019, 3:56:14 PM
to tesseract-ocr
Is your (Mandeep Singh) code working for Punjabi? I am asking because I'm also facing the same problem (the space problem).

neet k

Jan 12, 2020, 7:31:21 AM
to tesseract-ocr
Hi Mandeep Singh,

I am facing the same problem with spaces when using Tesseract to recognize text from images. The spaces between words are ignored for Punjabi text.

Library: Tess-Two

Platform: Android

I would be grateful if you could help me fix the space problem. I am attaching a screenshot with the input and output text.

Regards

Tess OCR.jpg

Suresh Anand

Jan 12, 2020, 10:38:55 AM
to tesser...@googlegroups.com
There's a parameter for preserving word spaces (preserve_interword_spaces). Have a look.
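
As a rough sketch of how such a parameter can be passed on the command line, Tesseract accepts `-c NAME=VALUE` pairs. The helper name `build_command_with_params` and the file names below are hypothetical. This shows the desktop CLI only, not the Tess-Two Android API, and the thread does not confirm whether the setting fixes the missing spaces for Punjabi.

```python
from typing import Dict, List, Optional


def build_command_with_params(image_path: str, output_base: str,
                              lang: str = "pan",
                              params: Optional[Dict[str, object]] = None) -> List[str]:
    """Build a tesseract command, adding one '-c NAME=VALUE' pair per parameter."""
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    for name, value in (params or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Hypothetical file names; preserve_interword_spaces=1 asks Tesseract to keep
# the spacing it detects between words instead of collapsing it.
cmd = build_command_with_params("sample.png", "out",
                                params={"preserve_interword_spaces": 1})
print(" ".join(cmd))
```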

Ravneet Kaur

Jan 13, 2020, 1:03:43 AM
to tesser...@googlegroups.com
Please let me know about the parameter. Thanks.
