Some spaces are not recognized


Sumedhe Dissanayake

18 May 2018, 08:09:44
to tesseract-ocr
Sometimes spaces between words are ignored when tesseract is used to recognize Sinhala text.

- The traineddata from tesseract does not have a spacing problem, even though there were changes in tesseract since it was uploaded.
- The spacing problem occurs regardless of whether I start the training from scratch or bootstrap with the traineddata from tesseract.
- The spacing problem gets worse with more training.
- Adding more space between the words during training does not make a difference.
- Adding double space between the words during recognition solves the problem.
- The spacing problem is not consistent, i.e. in the recognition of a text only some of the inter-word spaces are ignored (could not figure out any logic as to when it happens).

I have attached a screenshot, comparing a sample of input and output text.

Words missing spaces are underlined.
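
One way to narrow down which lines lose spaces is to compare the number of words per line in the ground-truth text and in the OCR output. The sketch below is only an illustration: the function name compare_word_counts and the file names are hypothetical, and it assumes the two files are line-aligned and contain no tab characters.

```shell
# Hypothetical helper: report lines whose word count differs between
# a ground-truth file ($1) and an OCR output file ($2).
# Assumes both files have matching lines and no tab characters.
compare_word_counts() {
  # paste joins line N of both files with a tab between them
  paste "$1" "$2" | awk -F '\t' '{
    # split(..., " ") splits on runs of whitespace and returns the word count
    expected = split($1, a, " ")
    got = split($2, b, " ")
    if (expected != got)
      printf "line %d: expected %d words, got %d\n", NR, expected, got
  }'
}

# Example (hypothetical file names):
# compare_word_counts ground_truth.txt ocr_output.txt
```

A line that reports fewer words than expected is a candidate for a dropped inter-word space, which may help in looking for a pattern.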


ShreeDevi Kumar

18 May 2018, 09:02:44
to tesser...@googlegroups.com
image is not visible.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/dfba845a-abe4-48fa-b834-7c64faf54f13%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Message has been deleted

Sumedhe Dissanayake

29 May 2018, 06:46:31
to tesseract-ocr


tesseract-spacing-problem.png

ShreeDevi Kumar

29 May 2018, 07:00:34
to tesser...@googlegroups.com
>The traineddata from tesseract does not have a spacing problem, 

Then the problem is related to training.




ShreeDevi Kumar

29 May 2018, 07:03:43
to tesser...@googlegroups.com
Set the config variable "preserve_interword_spaces" to 1 for one run and to 0 for another run, and see if that makes any difference.
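
A minimal sketch of that comparison, using tesseract's -c option to set the config variable on the command line. The function name run_both, the output base names, and the example image name are hypothetical; "sin" is assumed to be the language code of the Sinhala traineddata in use.

```shell
# Hypothetical helper: run tesseract on image $1 with language $2,
# once with preserve_interword_spaces=0 and once with =1,
# then show how the two text outputs differ.
run_both() {
  image=$1
  lang=$2
  for v in 0 1; do
    # tesseract writes its result to <outputbase>.txt
    tesseract "$image" "out_pis_$v" -l "$lang" -c "preserve_interword_spaces=$v"
  done
  # diff exits non-zero when the outputs differ
  diff out_pis_0.txt out_pis_1.txt
}

# Example (hypothetical image name):
# run_both page.png sin
```

If diff prints nothing, the variable made no difference for that image.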

neet k

12 Jan 2020, 07:00:03
to tesseract-ocr

I am using Tesseract to recognize text from images. The spaces between words are ignored for Punjabi text.

Library : Tess-Two

Platform : Android

How can I fix the spacing problem? I have attached a screenshot showing the input and output text.

Regards

Tess OCR.jpg

Shree Devi Kumar

12 Jan 2020, 23:39:18
to tesseract-ocr
I am not sure which version of tesseract and which traineddata file you are using. It works fine with the latest code and the traineddata files from all three tessdata repos.

ubuntu@tesseract-ocr:~/TEST$ tesseract pan.png - -l pan --tessdata-dir ~/tessdata --psm 6 --oem 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
ਖੀ ਜ਼ਿੰਦਗੀ ਦਾ
ਤੋਂ ਵੱਡਾ ਗੁਣ
ubuntu@tesseract-ocr:~/TEST$ tesseract pan.png - -l pan --tessdata-dir ~/tessdata_best --psm 6 --oem 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
ਖੀ ਜ਼ਿੰਦਗੀ ਦਾ
ਤੋਂ ਵੱਡਾ ਗੁਣ
ubuntu@tesseract-ocr:~/TEST$ tesseract pan.png - -l pan --tessdata-dir ~/tessdata_fast --psm 6 --oem 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
ਖੀ ਜ਼ਿੰਦਗੀ ਦਾ
ਤੋਂ ਵੱਡਾ ਗੁਣ
ubuntu@tesseract-ocr:~/TEST$
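
To answer the version question on a desktop install, the tesseract CLI can report its own version and the languages it finds in a tessdata directory. This is only a sketch: the function name report_tesseract_setup is hypothetical, and it does not apply directly to an Android build such as Tess-Two.

```shell
# Hypothetical helper: print the tesseract version line and the
# languages available in the tessdata directory given as $1.
report_tesseract_setup() {
  tessdata_dir=$1
  # Some versions print --version to stderr, so merge both streams;
  # keep only the first line, which names the tesseract version.
  tesseract --version 2>&1 | sed -n '1p'
  tesseract --tessdata-dir "$tessdata_dir" --list-langs
}

# Example, using the directory from the runs above:
# report_tesseract_setup ~/tessdata_best
```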





pan.png

neet k

13 Jan 2020, 01:58:00
to tesseract-ocr
Sir,

I want to develop an Android application for this, so I am using the Android platform.

First, from the sources I found, the way to integrate Tesseract into Android is the Tess-Two library (a fork of Tesseract Tools for Android), available at https://github.com/rmtheis/tess-two. Tess-Two works with Tesseract 3.05, Leptonica 1.74.1, and the traineddata files at https://github.com/tesseract-ocr/tessdata/tree/3.04.00.

Second, yes, I have learned from your previous discussions that this problem is fixed in Tesseract 4.0. But I cannot find any help on using Tesseract 4.0 on Android. Please point me to sources and help for this.

Regards,