Tesseract won't recognise single characters

Iain Downs

Jul 11, 2024, 2:35:50 PM

to tesseract-ocr

I'm trying to extract page numbers from scanned pages of text. Page numbers are either at the top or at the bottom, sometimes with titles / authors / chapters. Occasionally they are elsewhere, but I don't care about the exceptions.

I've loaded tesseract 5.4 (Windows) and run some tests using the executable. I'm finding that if the page number is a single digit on the line, tesseract ignores it (but otherwise does a fantastic job of OCR even with skewed and noisy images).

I've isolated the single line and used that as input, and tesseract tells me 'the page is empty'.

Here is a sample of a single line with a '1' in it; the resolution is 300 dpi.

Ultimately I would be writing a program using tesseract, but in the first instance I'd like to see it work with the exe.

So, can I tell tesseract to be less fussy with individual characters, and if not, how would I do so programmatically, if possible?

Thanks

Iain

René JM Clais

Jul 13, 2024, 6:20:54 AM

to tesser...@googlegroups.com

Hi,

I tried your example with tesseract for Python and it works well.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/c42d435c-4db5-48b5-94d3-5b761d340731n%40googlegroups.com.

Iain Downs

Jul 13, 2024, 12:25:58 PM

to tesseract-ocr

Can you give me some example code? I'm currently trying to get tesseract working for C++ in Visual Studio and it's a bit of a nightmare. Python seems easier, though it's not one of my main languages. I can try it out though!

Iain

Ger Hobbelt

Jul 13, 2024, 2:51:49 PM

to tesseract-ocr

Have you tried compiling/building the example code that comes with tesseract?

That should give some reasonable initial results. I can't comment on the autoconf or cmake stuff that comes with tesseract, as I have my own C/C++ build rig for MSVC, but the only real nuisance -- as far as I am concerned -- is the pango lib, and you don't need that unless you want the training tool text2image to work as well.

Also, rtfc'ing the tesseract CLI source file itself might help, but, yeah, it ain't for rookies, shall we say. If you haven't got experience with other large "technical debt" codebases, then I can full well understand that it isn't easy to get tesseract to complete building.


Iain Downs

Jul 13, 2024, 3:16:00 PM

to tesseract-ocr

ger.h - I'm currently in the process (several hours long so far) of getting ANYTHING using tesseract to build in VS2022. I've got to the point where I seem to simply have to add the (many) DLLs to the project directory to pass that hurdle. vcpkg inside VS2022 is not at all straightforward, but I finally managed to get the package installed. Tomorrow perhaps for more.

But I'm not sure I see why the simple sample code would work in C++ and not using tesseract.exe directly. I have coded up some samples in C# (using tesseract.net), but they are no better than the exe so far. I was hoping that I could get to the point where I could get the symbol-level confidence and find that the '1' in the file was then found, but that level of the library is poorly documented.

I HAVE dealt with large-scale C++ open source projects and I truly hate it! This is a personal project and I'm not entirely sure I have the will to read that sort of C++. Whilst I was once an expert in C++, that was 20 years ago and I never got on with STL particularly, so it's a bit daunting!

Thanks for all input!

Iain
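For readers who want confidence values without going through the C++ API, here is a minimal Python sketch. It assumes the pytesseract wrapper is installed, and it reports word-level rather than symbol-level confidence, which is what `image_to_data` exposes. The function names and the 50.0 threshold are hypothetical choices for illustration, not anything from the thread.

```python
def confident_digit_tokens(data, min_conf=50.0):
    """Return (text, confidence) pairs for digit-only tokens at or above min_conf.

    `data` is a dict shaped like the result of
    pytesseract.image_to_data(..., output_type=Output.DICT):
    parallel lists stored under the keys "text" and "conf".
    """
    hits = []
    for text, conf in zip(data["text"], data["conf"]):
        token = text.strip()
        conf = float(conf)  # conf may be an int, float or str depending on the version
        if token.isdigit() and conf >= min_conf:
            hits.append((token, conf))
    return hits


def page_number_candidates(image, psm=6):
    """Run tesseract on an image (PIL image, numpy array or path) and keep digit tokens."""
    # Imported lazily so the filtering helper above can be used without pytesseract.
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(
        image, config=f"--psm {psm}", output_type=Output.DICT
    )
    return confident_digit_tokens(data)
```

Rows that are not words (blocks, paragraphs, lines) come back with empty text and a confidence of -1, so the digit check filters them out as well.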


Iain Downs

Jul 14, 2024, 2:47:47 AM

to tesser...@googlegroups.com

Apologies. The Python file is in the Google Group, but for some reason it didn't come down with the email.

Also, I now have a sample program (nearly) working in C++. My last step was to copy all the DLLs from the vcpkg install into the source directory, otherwise they weren't found when running. I'm left with setting the location of the language file and it should work. But the Python will be helpful nonetheless.

Iain

From: tesser...@googlegroups.com [mailto:tesser...@googlegroups.com] On Behalf Of Dominic Mukilan
Sent: 13 July 2024 17:42
To: tesser...@googlegroups.com
Subject: Re: [tesseract-ocr] Tesseract won't recognise single characters

Attaching the python file, the supporting files, and requirements.txt


Iain Downs

Jul 14, 2024, 3:20:59 AM

to tesseract-ocr

I have FINALLY got the C++ samples working in Visual Studio 2022. The code I am using is the first tesseract sample code from here.

Bizarrely, this simple code finds the page numbers at the bottom of the page perfectly happily, whereas the tesseract executable did not. This is good news - though confusing...

Thanks to all for your input on this - I think for the moment I'm enough ahead that I can call this issue closed. I will be seeing if I can replicate this in C#, which is a more productive environment for me than C++.

Iain

Iain Downs

Jul 14, 2024, 9:21:15 AM

to tesseract-ocr

For those interested, the C# NuGet package Tesseract.OCR ALSO ignores the page numbers with a simple test program. The possibly slightly older and better known C# package Tesseract does not load properly from NuGet - probably something I'm doing, but I can't imagine what!

Iain

René JM Clais

Jul 14, 2024, 9:56:33 AM

to tesser...@googlegroups.com

import cv2
import pytesseract as tesser

originalImage = cv2.imread("myfile.jpg")  # myfile.jpg is the original image

# Threshold to black and white
(thresh, imgbw) = cv2.threshold(originalImage, 180, 255, cv2.THRESH_BINARY)

cv2.imshow('Black white image', imgbw)
cv2.waitKey(0)  # wait for a key press
cv2.destroyAllWindows()

# Run tesseract on the thresholded image
custom_config = r' -l ' + 'eng' + ' --psm 6 '
text = tesser.image_to_string(imgbw, config=custom_config)
print(text)  # the recognised text


Zdenko Podobny

Jul 14, 2024, 10:13:15 AM

to tesser...@googlegroups.com

> custom_config = r' -l ' + 'eng' + ' --psm 6 '

What is the point of this? To slow down the script?

Zdenko


René JM Clais

Jul 14, 2024, 1:44:12 PM

to tesser...@googlegroups.com

I don't understand what you mean.


Zdenko Podobny

Jul 14, 2024, 1:47:22 PM

to tesser...@googlegroups.com

So you do not understand the code you posted?

Zdenko


Ger Hobbelt

Jul 14, 2024, 8:08:54 PM

to tesseract-ocr

> Bizarrely, this simple code finds the page numbers at the bottom of the page perfectly happily, whereas the tesseract executable did not. This is good news - though confusing...

IIRC, the library has the psm default set to PSM_SINGLE_BLOCK = 6, while tesseract CLI sets psm to PSM_AUTO = 3 when you don't specify it explicitly, hence two different psm 'defaults', which may well explain the discrepancy you observe.
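One way to check this is to pass the page segmentation mode to the executable explicitly and compare the results. The sketch below assumes `tesseract` is on the PATH and uses only the documented `-l` and `--psm` options with `stdout` as the output base; the helper names and the image file name are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, psm, lang="eng", exe="tesseract"):
    """Build a tesseract CLI call that prints the recognised text to stdout."""
    return [exe, str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def run_tesseract(image_path, psm, lang="eng"):
    """Run the CLI with an explicit page segmentation mode and return its text."""
    completed = subprocess.run(
        build_tesseract_command(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return completed.stdout.strip()


# Example comparison (hypothetical file name), left commented out because it
# needs tesseract installed:
# for psm in (3, 6, 7):
#     print(psm, repr(run_tesseract("page_number_line.png", psm)))
```

Mode 3 is the automatic segmentation mentioned above and 6 is the single-block mode; 7 (treat the image as a single text line) is another mode that might be worth trying on an isolated page-number line.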

René JM Clais

Jul 15, 2024, 6:29:25 AM

to tesser...@googlegroups.com

My code works well and your remarks are out of context.



Zdenko Podobny

Jul 15, 2024, 10:03:55 AM

to tesser...@googlegroups.com

My remark is about code quality, which is relevant. It may also indicate that somebody is copying and pasting without understanding the code, and that is dangerous.

Zdenko


Mona Dastar

Jul 15, 2024, 10:09:58 AM

to tesser...@googlegroups.com

Hi everyone

Regarding what Zdenko said: after the first section of module 3 I stopped because I had questions and couldn't understand the code. I'm having trouble with the last module. What do you think?

Since then I haven't studied, and I am falling further and further behind.

I appreciate your tips.


Zdenko Podobny

Jul 15, 2024, 10:12:24 AM

to tesser...@googlegroups.com

The code that was posted here is not dangerous. It's just that a Python coder would write it the right way.

Zdenko
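For readers wondering what a more conventional form might look like, here is a sketch under a few assumptions: it uses pytesseract's `lang` and `config` keyword arguments instead of concatenating string fragments, and it optionally restricts output to digits with the `tessedit_char_whitelist` parameter, which is an addition not discussed in the thread. The function names and the `--psm 7` default are hypothetical.

```python
def build_config(psm=6, digits_only=False):
    """Return a tesseract option string for pytesseract's `config` argument."""
    parts = [f"--psm {psm}"]
    if digits_only:
        # Assumption: page numbers are plain Arabic digits.
        parts.append("-c tessedit_char_whitelist=0123456789")
    return " ".join(parts)


def read_page_number(image, psm=7):
    """OCR an isolated page-number line and return the stripped text."""
    # Imported lazily so build_config can be used without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(
        image,
        lang="eng",  # passed as a keyword rather than as ' -l eng' in the config
        config=build_config(psm, digits_only=True),
    ).strip()
```

Passing the language through `lang` keeps the config string for options that have no dedicated argument.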


Karol Wójcik

Jul 15, 2024, 10:34:02 AM

to tesseract-ocr

A true Python coder would not ever use a terribly written library like pytesseract in the first place. The way of passing command line params is a minor thing, compared to that.

Iain Downs

Jul 15, 2024, 12:40:24 PM

to tesseract-ocr

Thanks, Ger. I found DefaultPageSegMode (this is in the C# Tesseract package; I haven't tried C++ yet, though I don't seem to need to). That worked fine.

Iain
