Tesseract removes spaces when I use LSTM mode


Enzo Merotto

Nov 3, 2020, 3:41:46 AM
to tesseract-ocr
Hello,
I have a problem with LSTM mode: it does not detect spaces and merges all the words into one.
Do you have any idea why it does not detect spaces?

Zdenko Podobny

Nov 3, 2020, 3:52:36 AM
to tesser...@googlegroups.com
Please provide a reproducible example: what you are doing, how you are doing it, the result you get, and the result you want.

Zdenko



Enzo Merotto

Nov 3, 2020, 5:56:32 AM
to tesseract-ocr
We recently upgraded Tesseract from 3.02 to 4.0 to improve performance and speed. When we use LSTM mode, we first get a warning about the DPI: "Invalid resolution 0 dpi. Using 70 instead". We know why this warning appears, but I don't know whether the missing spaces are related to it.
Here is an example with French text:
CaptureText.PNG
The terminal shows the warning and the transcribed text without spaces. We expected:
"En votre aimable règlement,
Cordialement,"

This is how we use Tesseract:
CaptureCode1.PNG
CaptureCode3.PNG
CaptureCode2.PNG
The image is a single-channel cv::Mat (8UC1).

Enzo Merotto

Zdenko Podobny

Nov 3, 2020, 6:31:31 AM
to tesser...@googlegroups.com
IMO this is a problem in your code. Have a look at the tesseract source code to see how it handles spaces.
Here are the results for your image with the different OEM values:

> tesseract test_2020-11-03_122112048.png - --oem 0 -l fra

En votre aimable règlement,
Cordialement,

> tesseract test_2020-11-03_122112048.png - --oem 1 -l fra

En votre aimable règlement,
Cordialement,

> tesseract test_2020-11-03_122112048.png - --oem 2 -l fra

En votre aimable règlement,
Cordialement,






Zdenko



Enzo Merotto

Nov 3, 2020, 6:45:32 AM
to tesseract-ocr
I'm not sure, because in TESSERACT_ONLY mode the spaces are there, so it works. That is not the case in LSTM mode.

Zdenko Podobny

Nov 3, 2020, 7:17:22 AM
to tesser...@googlegroups.com
The tesseract executable (which is also an example of how to use the tesseract library) handles this correctly for both the LSTM and the legacy engine, so check its source code.

Zdenko



Enzo Merotto

Nov 3, 2020, 9:53:17 AM
to tesseract-ocr
We found the problem. In the previous version of Tesseract we set a character whitelist with SetVariable, and it did not include the space character; we forgot to add it after upgrading. We no longer use SetVariable, and now it works. Thank you.

Enzo Merotto
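
For anyone who wants to keep a whitelist rather than drop it, here is a minimal sketch of the fix described above. It assumes the whitelist is set through the `tessedit_char_whitelist` variable. The helper names `AllowsSpace` and `WithSpace` are hypothetical and not part of Tesseract. They only make sure the space character is in the string before it is handed to the library.

```cpp
#include <cassert>
#include <string>

// True when a whitelist lets the space character through.
// An empty whitelist means "no restriction", so it allows spaces too.
bool AllowsSpace(const std::string& whitelist) {
    return whitelist.empty() || whitelist.find(' ') != std::string::npos;
}

// Returns a copy of the whitelist that is guaranteed to allow spaces.
std::string WithSpace(const std::string& whitelist) {
    std::string result = whitelist;
    if (!AllowsSpace(result)) {
        result += ' ';  // append the missing space character
    }
    assert(AllowsSpace(result));
    return result;
}

// Hypothetical call site, with `api` an initialized tesseract::TessBaseAPI
// and `chars` the whitelist used with the old version:
//   api.SetVariable("tessedit_char_whitelist", WithSpace(chars).c_str());
```

With this in place, a whitelist such as "abc" is passed on as "abc ", while an empty whitelist or one that already contains a space is left unchanged.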
