Tesseract won't recognise single characters

Iain Downs

Jul 11, 2024, 2:35:50 PM
to tesseract-ocr
I'm trying to extract page numbers from scanned pages of text. Page numbers are either at the top or at the bottom, sometimes with titles, authors or chapters. Occasionally they appear elsewhere, but I don't care about the exceptions.

I've loaded Tesseract 5.4 (Windows) and run some tests using the executable. I'm finding that if the page number is a single digit on the line, tesseract ignores it (but otherwise does a fantastic job of OCR, even with skewed and noisy images).

I've isolated the single line and used that as input, and tesseract tells me 'the page is empty'.

Here is a sample of a single line with a '1' in it; the resolution is 300 dpi.
101_bottom.jpg

Ultimately I would be writing a program using tesseract, but in the first instance I'd like to see it work with the exe.

So, can I tell tesseract to be less fussy with individual characters, and if not, how would I do so programmatically (if that is possible)?

Thanks

Iain
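
For readers following along, the "isolate the line" step described above could look roughly like this in Python. This is only a sketch: the function name `margin_boxes` and the 8% strip height are assumptions made for illustration, not details from the post.

```python
def margin_boxes(width, height, fraction=0.08):
    """Return (top_box, bottom_box) crop rectangles for a page image.

    Each box is (left, upper, right, lower) in pixels, the same layout
    that Pillow's Image.crop() accepts. `fraction` is a guess at how much
    of the page height the header or footer strip occupies.
    """
    if not 0 < fraction < 0.5:
        raise ValueError("fraction must be between 0 and 0.5")
    strip = max(1, round(height * fraction))
    top_box = (0, 0, width, strip)
    bottom_box = (0, height - strip, width, height)
    return top_box, bottom_box


if __name__ == "__main__":
    # An A4 page scanned at 300 dpi is roughly 2480 x 3508 pixels.
    top, bottom = margin_boxes(2480, 3508)
    print("top strip:", top)
    print("bottom strip:", bottom)
```

Each strip can then be cropped out and passed to Tesseract on its own, which keeps titles and body text out of the way of the page number.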

René JM Clais

Jul 13, 2024, 6:20:54 AM
to tesser...@googlegroups.com
Hi,
I tried your example with Tesseract for Python, and it works well.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/c42d435c-4db5-48b5-94d3-5b761d340731n%40googlegroups.com.

Iain Downs

Jul 13, 2024, 12:25:58 PM
to tesseract-ocr
Can you give me some example code? I'm currently trying to get Tesseract working for C++ in Visual Studio and it's a bit of a nightmare. Python seems easier, though it's not one of my main languages. I can try it out though!

Iain

Ger Hobbelt

Jul 13, 2024, 2:51:49 PM
to tesseract-ocr
Have you tried compiling/building the example code that comes with tesseract?

That should give some reasonable initial results. I can't comment on the autoconf or cmake stuff that comes with tesseract, as I have my own C/C++ build rig for MSVC, but the only real nuisance, as far as I am concerned, is the pango lib, and you don't need that unless you want the training tool text2image to work as well.

Also rtfc'ing the tesseract cli source file itself might help, but, yeah, it ain't for rookies, shall we say. If you haven't got experience with other large "technical debt" codebases, then I can full well understand that it isn't easy to get tesseract to complete building.



Iain Downs

Jul 13, 2024, 3:16:00 PM
to tesseract-ocr
ger.h - I'm currently in the process (several hours long so far) of getting ANYTHING using tesseract to build in VS2022.  I've got to the point where I seem to simply have to add the (many) dlls to the project directory to pass that hurdle.  vcpkg inside VS2022 is not at all straightforward, but I finally managed to get the package installed.  Tomorrow perhaps for more. 

But I'm not sure I see why the simple sample code will work in C++ and not using tesseract.exe directly. I have coded up some samples in C# (using tesseract.net) but they are no better than the exe, so far. I was hoping that I could get to the point where I could get the symbol-level confidence and find that the '1' in the file was then found, but that level of the library is poorly documented.

I HAVE dealt with large-scale C++ open source projects and I truly hate it! This is a personal project and I'm not entirely sure I have the will to read that sort of C++. Whilst I was once an expert in C++, that was 20 years ago and I never got on with the STL particularly, so it's a bit daunting!

Thanks for all input!

Iain
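
A note on the confidence question above: as far as I know, pytesseract only exposes word-level confidences, through `image_to_data()`; symbol-level values need the C++ `ResultIterator` (level `RIL_SYMBOL`) or a wrapper that exposes it. A minimal Python sketch of the word-level route follows. The helper names `words_with_confidence` and `ocr_words` are made up for this example, and `psm=6` is an assumption rather than a setting taken from the post.

```python
def words_with_confidence(data, min_conf=0.0):
    """Pair each recognised word with its confidence.

    `data` is a dict shaped like the one pytesseract.image_to_data()
    returns with output_type=Output.DICT: parallel lists under the keys
    "text" and "conf". Rows that are not words carry a confidence of -1.
    """
    results = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        if text.strip() and conf >= min_conf:
            results.append((text, conf))
    return results


def ocr_words(image_path, psm=6):
    """Run Tesseract on an image file and return (word, confidence) pairs."""
    # Imported here so the helper above works without pytesseract installed.
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(
        image_path, config=f"--psm {psm}", output_type=Output.DICT
    )
    return words_with_confidence(data)
```

Calling `ocr_words("101_bottom.jpg")` would then show whether the lone digit was seen at all and how confident the engine was about it.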

Message has been deleted

Iain Downs

Jul 14, 2024, 2:47:47 AM
to tesser...@googlegroups.com

Apologies. The Python file is in the Google Group, but for some reason it didn't come down with the email.

 

Also, I now have a sample program (nearly) working in C++.  My last step was to copy all the dlls from the vcpkg install into the source directory, otherwise they weren’t found when running.  I’m left with setting the location of the language file and it should work.  But the python will be helpful nonetheless.

 

Iain
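
On pointing Tesseract at the language files: Tesseract looks for its `*.traineddata` files in the directory named by the `TESSDATA_PREFIX` environment variable, or in the one given with the `--tessdata-dir` option. A small sketch in Python, where the helper names and any candidate paths are hypothetical:

```python
import os
from pathlib import Path


def find_tessdata_dir(candidates):
    """Return the first candidate directory that contains eng.traineddata."""
    for candidate in candidates:
        path = Path(candidate)
        if (path / "eng.traineddata").is_file():
            return path
    return None


def tessdata_config(tessdata_dir):
    """Build a --tessdata-dir option for the tesseract command line."""
    return f'--tessdata-dir "{tessdata_dir}"'


def use_tessdata_dir(tessdata_dir):
    """Point this process, and anything it launches, at the given tessdata."""
    os.environ["TESSDATA_PREFIX"] = str(tessdata_dir)
```

The same idea applies from C++ or C#: either set the environment variable before the program starts or pass the data path when initialising the engine.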

 

From: tesser...@googlegroups.com [mailto:tesser...@googlegroups.com] On Behalf Of Dominic Mukilan
Sent: 13 July 2024 17:42
To: tesser...@googlegroups.com
Subject: Re: [tesseract-ocr] Tesseract won't recognise single characters

 

Attaching the python file, the supporting files, and requirements.txt

 


Iain Downs

Jul 14, 2024, 3:20:59 AM
to tesseract-ocr
I have FINALLY got the c++ samples working in Visual Studio 2022. The code I am using is the first tesseract sample code from here .

Bizarrely, this simple code finds the page numbers at the bottom of the page perfectly happily, whereas the tesseract executable did not.  This is good news - though confusing...

Thanks to all for your input on this - I think for the moment I'm enough ahead that I can call this issue closed.  I will be seeing if I can replicate this in c# which is a more productive environment for me than C++.

Iain

Iain Downs

Jul 14, 2024, 9:21:15 AM
to tesseract-ocr
For those interested, the C# NuGet package Tesseract.OCR ALSO ignores the page numbers with a simple test program. The possibly slightly older and better-known C# package Tesseract does not load properly from NuGet. That is probably something I'm doing, but I can't imagine what!

Iain

René JM Clais

Jul 14, 2024, 9:56:33 AM
to tesser...@googlegroups.com
import cv2
import pytesseract as tesser

# Load the scanned image (myfile.jpg is the original image)
originalImage = cv2.imread("myfile.jpg")

# Convert to black and white: values above 180 become 255, the rest 0
(thresh, imgbw) = cv2.threshold(originalImage, 180, 255, cv2.THRESH_BINARY)

# Show the result; press any key to close the window
cv2.imshow('Black white image', imgbw)
cv2.waitKey(0)
cv2.destroyAllWindows()

# Tesseract options: English, page segmentation mode 6
# (assume a single uniform block of text)
custom_config = r' -l  ' + 'eng' + '    --psm 6  '

text = tesser.image_to_string(imgbw, config=custom_config)
print(text)  # the recognised text


Zdenko Podobny

Jul 14, 2024, 10:13:15 AM
to tesser...@googlegroups.com
> custom_config = r' -l  ' + 'eng' + '    --psm 6  '

What is the point of this? To slow down the script? 

Zdenko


On Sun, 14 Jul 2024 at 15:56, René JM Clais <renec...@gmail.com> wrote:

René JM Clais

Jul 14, 2024, 1:44:12 PM
to tesser...@googlegroups.com
I don't understand what you mean.

Zdenko Podobny

Jul 14, 2024, 1:47:22 PM
to tesser...@googlegroups.com
So you do not understand the code you posted?

Zdenko


On Sun, 14 Jul 2024 at 19:44, René JM Clais <renec...@gmail.com> wrote:

Ger Hobbelt

Jul 14, 2024, 8:08:54 PM
to tesseract-ocr
> Bizarrely, this simple code finds the page numbers at the bottom of the page perfectly happily, whereas the tesseract executable did not.  This is good news - though confusing...

IIRC, the library has the psm default set to PSM_SINGLE_BLOCK = 6, while tesseract CLI sets psm to PSM_AUTO = 3 when you don't specify it explicitly, hence two different psm 'defaults', which may well explain the discrepancy you observe.
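
One way to check this from a script is to call the executable with the page segmentation mode spelled out. The sketch below assumes `tesseract` is on the PATH; the helper names are invented for the example, and `stdout` as the output base simply tells the CLI to print the text rather than write a file.

```python
import subprocess

PSM_AUTO = 3          # what the tesseract CLI uses when --psm is omitted
PSM_SINGLE_BLOCK = 6  # the library default mentioned above


def tesseract_command(image_path, psm=PSM_SINGLE_BLOCK, lang="eng"):
    """Build the argument list for a tesseract CLI call that prints to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def run_tesseract(image_path, psm=PSM_SINGLE_BLOCK, lang="eng"):
    """Run the CLI and return the recognised text (needs tesseract on PATH)."""
    completed = subprocess.run(
        tesseract_command(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()
```

Running `run_tesseract("101_bottom.jpg", psm=PSM_AUTO)` and `run_tesseract("101_bottom.jpg", psm=PSM_SINGLE_BLOCK)` side by side should show whether the mode alone accounts for the difference.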

René JM Clais

Jul 15, 2024, 6:29:25 AM
to tesser...@googlegroups.com
My code is working well and your remarks are out of context.

Zdenko Podobny

Jul 15, 2024, 10:03:55 AM
to tesser...@googlegroups.com
My remark is about code quality, and code quality is relevant. Code like that can also indicate that somebody is copying and pasting without understanding it, and that is dangerous.


Zdenko


On Mon, 15 Jul 2024 at 12:30, René JM Clais <renec...@gmail.com> wrote:

Mona Dastar

Jul 15, 2024, 10:09:58 AM
to tesser...@googlegroups.com
Hi everyone,
Regarding what Zdenko said: after the first section of module 3 I stopped because I had questions and couldn't understand the code. I am having trouble with the last module. What do you think?
Since then I haven't studied, and I am falling further and further behind.
I appreciate your tips.


Zdenko Podobny

Jul 15, 2024, 10:12:24 AM
to tesser...@googlegroups.com
The code that was posted here is not dangerous. It is just that a Python coder would write it the right way.


Zdenko
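
For anyone wondering what a tidier version might look like, here is one possible reading of the remark, offered as a sketch rather than as what was actually meant. It passes the language through pytesseract's `lang` argument instead of concatenating strings, and reads the image as grayscale before thresholding. The function names are invented for the example; the threshold of 180 is carried over from the posted script.

```python
def ocr_options(lang="eng", psm=6):
    """Keyword arguments for pytesseract.image_to_string()."""
    return {"lang": lang, "config": f"--psm {psm}"}


def read_text(image_path, threshold=180):
    """Binarise an image with OpenCV and return the text Tesseract finds."""
    # Imported here so ocr_options() can be used without OpenCV installed.
    import cv2
    import pytesseract

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(image_path)
    _, binary = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    return pytesseract.image_to_string(binary, **ocr_options())
```

Usage would be a single call such as `print(read_text("myfile.jpg"))`.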


On Mon, 15 Jul 2024 at 16:09, Mona Dastar <mona....@gmail.com> wrote:

Karol Wójcik

Jul 15, 2024, 10:34:02 AM
to tesseract-ocr
A true Python coder would not ever use a terribly written library like pytesseract in the first place. The way of passing command line params is a minor thing, compared to that.

Iain Downs

Jul 15, 2024, 12:40:24 PM
to tesseract-ocr
Thanks ger. I found DefaultPageSegMode (this is in the C# Tesseract package; I've not tried C++ yet, though I don't seem to need to). That worked fine.

Iain
