Carriage return after each word


pascal 06

Sep 14, 2025, 7:48:05 AM
to tesseract-ocr
Hello everybody,
I'm using Tesseract with GScan2pdf under Linux.
When I run OCR on a document, Tesseract's recognition is quite good; however, it puts a carriage return after each word. It is very annoying!
Here is an example of what it does:
  1. Scanned document in gscan2pdf:
    Capture d’écran_2025-09-14_12-34-35.png
  2. Text recognition in gscan2pdf:
    Capture d’écran_2025-09-14_12-34-56.png
  3. Generated pdf opened with Okular; copy selected text:
    Capture d’écran_2025-09-14_12-35-25.png
  4. Pasted text:
    Conformément
     à l’article
     12 du
     Règlement du Fonds, le Fonds a procédé à sa deuxième
     distribution.
    Ce deuxième
     remboursement
     de capital
     s'élève
     à
     un
     montant
     de
     5.74
     €
     par part,
     soit
     5,74 % du nominal
    investi.
Here is some information about my system:
pascal@pascal-Latitude-5580:~$ lsb_release -a 
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 24.04.3 LTS 
Release: 24.04 
Codename: noble
pascal@pascal-Latitude-5580:~$ tesseract --version 
tesseract 5.3.4
 leptonica-1.82.0
 libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0 
Found AVX2 
Found AVX 
Found FMA 
Found SSE4.1 
Found OpenMP 201511 
Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5 
Found libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 (+libidn2/2.3.7) libssh/0.10.6/openssl/zlib nghttp2/1.59.0 librtmp/2.3 OpenLDAP/2.6.7
Capture d’écran_2025-09-14_12-44-28.png
Any ideas?
Thanks a lot in advance for your help.

pascal 06

Sep 17, 2025, 6:57:45 AM
to tesseract-ocr
Hello,
Somebody suggested that I run the OCR directly from the command line. Using the example above, I ran:
~~~
tesseract test.tif test.txt -l fra
~~~
and then the result:
>Conformément à l’article 12 du Règlement du Fonds, le Fonds a procédé à sa deuxième distribution.
>Ce deuxième remboursement de capital s'élève à un montant de 5.74 € par part, soit 5,74 % du nominal
>
>investi.

Do you have any idea what's going on?
Thank you in advance for any help you can provide,
Pascal

Tom Morris

Sep 18, 2025, 12:33:35 PM
to tesseract-ocr
Salut Pascal,

I'm glad that you were able to determine that Tesseract is working correctly.

On Wednesday, September 17, 2025 at 6:57:45 AM UTC-4 pascal....@gmail.com wrote:
>Do you have any idea what's going on?

You are working with two different applications that folks in this forum are
unlikely to know much about: 1) gscan2pdf, which uses/embeds Tesseract,
and 2) Okular, which I'm guessing is a PDF viewer.

There are a number of areas where things could go awry, including the way the PDF
is constructed and the way the text is selected and formatted on the clipboard.

I suspect that the text is being split into multiple text blocks and that each of those
text blocks is getting a new line added for "free" at the end. Where in the processing
chain this is happening isn't clear.
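
To make that suspicion concrete, here is a toy sketch in Python. It is not how gscan2pdf or Okular actually work; the `Word` type and both extraction functions are hypothetical and only illustrate how the same recognized words yield one word per line when every word is treated as its own text block.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    line: int  # hypothetical: index of the visual line the word sits on


def extract_per_word(words):
    """Treat every word as its own text block and end each block with a newline."""
    return "".join(w.text + "\n" for w in words)


def extract_per_line(words):
    """Group words by visual line, join them with spaces, one newline per line."""
    lines = {}
    for w in words:
        lines.setdefault(w.line, []).append(w.text)
    return "\n".join(" ".join(ws) for _, ws in sorted(lines.items())) + "\n"


# Made-up layout for a few words from the sample text.
words = [
    Word("Ce", 0), Word("deuxième", 0), Word("remboursement", 0),
    Word("de", 1), Word("capital", 1),
]

print(extract_per_word(words))  # one word per line, like the pasted text
print(extract_per_line(words))  # words grouped by visual line
```

If something in the chain behaves like `extract_per_word`, the pasted text would look like the sample in the first message even though the recognition itself is fine.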

If your goal is simply to get the best rendition of the text, it sounds like you've 
discovered what is needed. If you want to get that specific combination of 
programs to work better, you're probably going to need to address it with
whoever supports them.
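
For the first option, a minimal sketch of running Tesseract directly from Python, assuming the `tesseract` executable is on the PATH. The helper names `build_tesseract_command` and `run_ocr` are made up for illustration. Note that Tesseract appends the extension to the output base itself, so the base should be given without `.txt`; the `txt` and `pdf` config names ask for a plain-text file and a searchable PDF.

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, lang="fra", formats=("txt", "pdf")):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG [configs...]."""
    # Tesseract adds the extension itself, so output_base carries none.
    return ["tesseract", image, output_base, "-l", lang, *formats]


def run_ocr(image, output_base, lang="fra"):
    """Run Tesseract on one image; writes output_base.txt and output_base.pdf."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
```

Calling `run_ocr("test.tif", "test")` should produce `test.txt` and `test.pdf` next to each other, which makes it possible to check whether a PDF written by Tesseract itself shows the same one-word-per-line behavior when text is copied from it.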

Bonne chance!

Tom
 

pascal 06

Sep 18, 2025, 2:31:43 PM
to tesseract-ocr
Good evening Tom,
I suppose you speak French :)
Thanks for your reply!
I'll continue in English so that others can follow more easily :)
So, thanks a lot for your reply.
Indeed, Okular is a PDF reader.
I agree with most of what you wrote. However, I think things already go wrong in the gscan2pdf OCR step. As you wrote: "I suspect that the text is being split into multiple text blocks and that each of those text blocks is getting a new line added for "free" at the end."
I hope I'll get help :)
In conclusion, for the moment the most important thing is that I can search the text of many scanned documents thanks to Tesseract.
Thanks again for your help,
Pascal