Some spaces are not recognized


Sumedhe Dissanayake

May 18, 2018, 8:09:44 AM
to tesseract-ocr
Sometimes spaces between words are ignored when tesseract is used to recognize Sinhala text.

- The traineddata from tesseract does not have a spacing problem, even though there were changes in tesseract since it was uploaded.
- The spacing problem occurs regardless of whether I start the training from scratch or bootstrap with the traineddata from tesseract.
- The spacing problem gets worse with more training.
- Adding more space between the words during training does not make a difference.
- Adding a double space between the words during recognition solves the problem.
- The spacing problem is not consistent, i.e. in the recognition of a text only some of the inter-word spaces are ignored (I could not figure out any logic as to when it happens).

I have attached a screenshot, comparing a sample of input and output text.

Words missing spaces are underlined.
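
To look for a pattern in which spaces get dropped, the comparison between input and output text can be automated instead of underlining by hand. The following is only a rough sketch, not part of tesseract: find_merged_words is a hypothetical helper, and it assumes the OCR output differs from the input text mainly by missing spaces (other recognition errors are simply skipped over).

```python
from typing import List, Tuple


def find_merged_words(expected: str, actual: str) -> List[Tuple[str, ...]]:
    """Return groups of expected words that appear glued together in actual."""
    exp = expected.split()
    merged = []
    i = 0  # index of the next expected word still to be matched
    for token in actual.split():
        if i >= len(exp):
            break
        if token == exp[i]:
            i += 1
            continue
        # Join consecutive expected words until the result is as long as the token.
        joined = exp[i]
        j = i + 1
        while j < len(exp) and len(joined) < len(token):
            joined += exp[j]
            j += 1
        if joined == token and j - i > 1:
            # The token is exactly several expected words with the spaces removed.
            merged.append(tuple(exp[i:j]))
            i = j
        else:
            # Some other recognition error: skip one expected word and move on.
            i += 1
    return merged
```

Running this over the ground-truth text and the recognized text of each test line gives a list of the word pairs that lost their space, which may make it easier to see whether particular characters or word shapes are involved.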


ShreeDevi Kumar

May 18, 2018, 9:02:44 AM
to tesser...@googlegroups.com
The image is not visible.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/dfba845a-abe4-48fa-b834-7c64faf54f13%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sumedhe Dissanayake

May 29, 2018, 6:46:31 AM
to tesseract-ocr

tesseract-spacing-problem.png

ShreeDevi Kumar

May 29, 2018, 7:00:34 AM
to tesser...@googlegroups.com
>The traineddata from tesseract does not have a spacing problem, 

Then the problem is related to training.




ShreeDevi Kumar

May 29, 2018, 7:03:43 AM
to tesser...@googlegroups.com
Set the config variable "preserve_interword_spaces" to 1 for one run and to 0 for another run, and see if that makes any difference.
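
One way to set up the two runs is sketched below. It only builds the tesseract command lines, passing the variable with the -c option. The image name sample.png and the helper build_command are placeholders for illustration, and "sin" is assumed to be the language code of the Sinhala traineddata in use.

```python
from typing import List


def build_command(image: str, lang: str, preserve_spaces: bool) -> List[str]:
    """Build a tesseract command that prints the recognized text to stdout."""
    value = int(preserve_spaces)  # True -> 1, False -> 0
    return [
        "tesseract", image, "-",  # "-" sends the text to stdout
        "-l", lang,
        "-c", f"preserve_interword_spaces={value}",
    ]


# One command per setting, for the same (placeholder) image.
commands = [build_command("sample.png", "sin", flag) for flag in (False, True)]
```

Each command can then be run with subprocess.run(cmd, capture_output=True, text=True) and the two outputs compared line by line.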

neet k

Jan 12, 2020, 7:00:03 AM
to tesseract-ocr

I am using Tesseract to recognize text from images. The spaces between words are ignored for Punjabi text.

Library: Tess-Two

Platform: Android

How can I fix this spacing problem? I have attached a screenshot showing the input and the output text.

Regards

Tess OCR.jpg

Shree Devi Kumar

Jan 12, 2020, 11:39:18 PM
to tesseract-ocr
I am not sure which version of tesseract and which traineddata file you are using. It works fine with the latest code and the traineddata files from all three tessdata repos.

ubuntu@tesseract-ocr:~/TEST$ tesseract pan.png - -l pan --tessdata-dir ~/tessdata --psm 6 --oem 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
ਖੀ ਜ਼ਿੰਦਗੀ ਦਾ
ਤੋਂ ਵੱਡਾ ਗੁਣ
ubuntu@tesseract-ocr:~/TEST$ tesseract pan.png - -l pan --tessdata-dir ~/tessdata_best --psm 6 --oem 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
ਖੀ ਜ਼ਿੰਦਗੀ ਦਾ
ਤੋਂ ਵੱਡਾ ਗੁਣ
ubuntu@tesseract-ocr:~/TEST$ tesseract pan.png - -l pan --tessdata-dir ~/tessdata_fast --psm 6 --oem 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
ਖੀ ਜ਼ਿੰਦਗੀ ਦਾ
ਤੋਂ ਵੱਡਾ ਗੁਣ
ubuntu@tesseract-ocr:~/TEST$

pan.png

neet k

Jan 13, 2020, 1:58:00 AM
to tesseract-ocr
Sir,

I want to develop an Android application for this, so I am using the Android platform.

Firstly, according to the sources I found, the way to integrate Tesseract into Android is the tess-two library (a fork of Tesseract Tools for Android), available at https://github.com/rmtheis/tess-two. Tess-two works with Tesseract 3.05, Leptonica 1.74.1, and the traineddata files from https://github.com/tesseract-ocr/tessdata/tree/3.04.00.

Secondly, yes, I have learned from your previous discussions that this problem is fixed in Tesseract 4.0. But I cannot find any help on using Tesseract 4.0 on Android. Please point me to sources and help for this.

Regards,