Parameters to improve detection of sparse text


Scaly Green Orc

Apr 25, 2023, 6:22:32 AM
to tesseract-ocr
Hi there hello,

(the text layer is FUBAR so I'm resorting to OCR).

I'm running in sparse text mode (PSM=11). There's a lot of text but I care only about a small subset. I'm running the recognition on grayscale images rendered from the PDF. I reckon I shouldn't have a problem with image quality, although I do notice different results depending on the DPI I render at. It works mostly fine.

But I'm having issues with bits being chopped off / not recognised when (I think) there's too much space or too little text. In the chart linked above, for instance, in the text at the bottom of the second page (numbered list), the numbers of the first column do not get recognised. So, for instance, I get "Exploitant /Operator" instead of "1 - Exploitant /Operator". Then it will work if it's, say "10 - Exploitant /Operator" (two digits). Which leads me to believe that my problem is with small blocks and/or lots of space.

I've tried using parameters `preserve_interword_spaces` and `textord_space_size_is_variable`, seemingly to no avail.

Could someone please tell me which parameters I could play with to improve the detection of sparse chunks or increase the engine's tolerance for whitespace?

If you have any other suggestion as to how to improve the OCR, I'll gladly take it as well.

Kind regards,
 Orc.
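
[Editor's sketch] The setup described in this post (sparse text mode, a chosen DPI, and a couple of engine parameters) can be expressed as a single command line. The helper below is a minimal sketch, not the poster's actual script: the function names and the file name `page2.png` are hypothetical. It relies only on the `--psm`, `--dpi` and `-c name=value` options of the tesseract CLI and uses `-` as the output base to print to stdout, as in the command quoted later in the thread.

```python
import subprocess


def build_tesseract_cmd(image_path, psm=11, dpi=None, params=None):
    """Return the argv list for one tesseract run that prints text to stdout."""
    # "-" as the output base sends the recognised text to stdout.
    cmd = ["tesseract", str(image_path), "-", "--psm", str(psm)]
    if dpi is not None:
        # Tell tesseract the resolution the image was rendered at.
        cmd += ["--dpi", str(dpi)]
    for name, value in (params or {}).items():
        # Each engine parameter is passed as its own "-c name=value" pair.
        cmd += ["-c", f"{name}={value}"]
    return cmd


def run_ocr(image_path, **kwargs):
    """Run tesseract (must be installed and on PATH) and return its text output."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical image name; prints the command without running it.
    print(build_tesseract_cmd(
        "page2.png", psm=11, dpi=300,
        params={"preserve_interword_spaces": 1},
    ))
```

Keeping the command builder separate from the runner makes it easy to loop over several DPI values or parameter combinations and compare the outputs side by side.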

Zdenko Podobny

Apr 25, 2023, 7:06:20 AM
to tesser...@googlegroups.com
First of all: this input is a regular PDF (i.e. it contains text rather than an image). IMO it should be easier to extract accurate text from the file than to OCR it...

Next: tesseract can handle simple layout analysis (e.g. book pages), but for complex layouts like that PDF, you need to use custom page layout analysis/segmentation (i.e. split the input image into homogeneous text blocks/paragraphs/lines). For example, when I OCR just the description on page 2 (where you mentioned errors), I get this output:

> tesseract page2_description.png - --psm 11
1- Exploitant / Operator :

6 - Hangars disponibles / Hangars available : NIL

Centre hospitalier FANNONAY = /FAX : 04 75 67 35 00

7 - Réparations / Repairs facility : NIL

2 - CAA : DSAC Centre-Est (voir/see GEN)

8 -Type de surface / Surface : béton /concrete

3-AVT:NIL

9 - Force portante / Strength: 4 1.

4 - RFFS : 5 extincteurs poudre/powder fire extinguishers 50 kg.

5 - Police - Douanes / Police - Customs : NIL


page2_description.png

Zdenko
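
[Editor's sketch] One way to get a cropped block like the one OCRed above, when every chart in the stack shares the same layout, is to render only that region straight from the PDF. The sketch below assumes poppler's `pdftoppm` is available and uses its `-x`, `-y`, `-W`, `-H` (crop in pixels), `-r`, `-gray`, `-png` and `-singlefile` options. The box coordinates, the file name `chart.pdf` and the function names are hypothetical; the box is assumed to be given in PDF points (1/72 inch) measured from the top-left corner of the page.

```python
def points_to_pixels(value_pt, dpi):
    """Convert a length in PDF points (1/72 inch) to pixels at the given DPI."""
    return round(value_pt * dpi / 72)


def build_crop_cmd(pdf_path, page, box_pt, dpi, out_prefix):
    """Return the pdftoppm argv that renders one cropped region as a grayscale PNG.

    box_pt is (left, top, width, height) in points, measured from the
    top-left corner of the page (an assumption of this sketch).
    """
    x, y, w, h = (points_to_pixels(v, dpi) for v in box_pt)
    return [
        "pdftoppm",
        "-f", str(page), "-l", str(page),   # render a single page
        "-r", str(dpi),                     # resolution in DPI
        "-gray", "-png", "-singlefile",     # one grayscale PNG, no page suffix
        "-x", str(x), "-y", str(y),         # top-left corner of the crop, in pixels
        "-W", str(w), "-H", str(h),         # crop size, in pixels
        str(pdf_path), str(out_prefix),
    ]


if __name__ == "__main__":
    # Hypothetical box roughly covering a block near the bottom of page 2.
    print(build_crop_cmd("chart.pdf", 2, (36, 600, 500, 150), 300, "page2_description"))
```

The resulting PNG can then be passed to tesseract on its own, which is the "homogeneous text block" input suggested above.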



Scaly Green Orc

Apr 25, 2023, 12:30:53 PM
to tesseract-ocr
Zdenko,

Thank you for your reply.

Yes, it's a regular PDF. But most (not always all) of the text is borked. Try to copy/paste text from it and you'll see. I looked around for solutions to salvage it, and it seemed like OCR was what was most consistently recommended in such cases. I need to treat a stack of these programmatically.

I hear you on the segmentation as a solution, i.e. extracting relevant blocks and OCRing those. I was hoping I could avoid that additional effort. What I find vexing is that it /almost/ works. I was hoping there might be things I could tweak about tesseract's analysis. For instance, isn't there a threshold setting somewhere that makes it ignore the "1 - " in Screenshot 2023-04-25 142634.png when it has to consider it as part of the whole page? As in, how much whitespace is acceptable? I've gone through the whole list of tesseract parameters (`tesseract --print-parameters`) and tried to tweak those that seemed promising... but hardly any seemed to make any difference. It's not readily clear which parameters are relevant for which usage.

Orc.
 

Tom Morris

Apr 25, 2023, 4:59:25 PM
to tesseract-ocr
On Tuesday, April 25, 2023 at 8:30:53 AM UTC-4 Scaly Green Orc wrote:

Yes, it's a regular PDF. But most (not always all) of the text is borked. Try to copy/paste text from it and you'll see. I looked around for solutions to salvage it, and it seemed like OCR was what was most consistently recommended in such cases. I need to treat a stack of these programmatically.

Although cut & paste in the browser (and Acrobat Reader) doesn't work, here's the beginning of what gets extracted by `pdftotext -layout`:


                       APPROCHE A VUE                                                Transport public à la demande                                  ANNONAY
                                                                                     Common carriage on request
                       Visual approach                                                                                               Centre hospitalier/Hospital
                                                                                                 16 JUN 22
                                                                                                                                                  AD 3 APP 01
                                                                                                                                 ALT : 1364 (49 hPa)
                                                                                                                                 LAT : 45 14 30 N               VAR : 2° E (20)
                                                                                                                                 LONG : 004 39 57 E

                       FIS : LYON Info 135.525                                                                                                EN TERRASSE / TERRACED

                       COM : TROPHUS 3/TRAVIATA 3 : 85.575
                                                                                           vers/to St Etienne


 which seems like it could be enough to work with. The full text file is attached.

Tom


AD-3.HANO.txt
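
[Editor's sketch] For processing a stack of these files programmatically, the `pdftotext -layout` output shown above can be split into columns on runs of two or more spaces, and the `KEY : value` cells picked out with a regular expression. This is a minimal sketch under stated assumptions: the function names and the key pattern (two to five capital letters) are illustrative guesses based only on the excerpt above, not a description of the full file.

```python
import re
import subprocess

# A cell such as "ALT : 1364 (49 hPa)": short upper-case key, colon, value.
FIELD = re.compile(r"^([A-Z]{2,5})\s*:\s*(.+)$")


def extract_layout_text(pdf_path):
    """Run `pdftotext -layout` (poppler must be installed) and return the text."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_fields(layout_text):
    """Collect KEY : value cells from layout-preserving text into a dict."""
    fields = {}
    for line in layout_text.splitlines():
        # Columns in -layout output are separated by runs of spaces.
        for cell in re.split(r"\s{2,}", line.strip()):
            match = FIELD.match(cell)
            if match:
                # Keep the first occurrence of each key.
                fields.setdefault(match.group(1), match.group(2).strip())
    return fields


if __name__ == "__main__":
    sample = (
        "      ALT : 1364 (49 hPa)\n"
        "      LAT : 45 14 30 N               VAR : 2° E (20)\n"
        "      LONG : 004 39 57 E\n"
        "\n"
        "  FIS : LYON Info 135.525              EN TERRASSE / TERRACED\n"
        "  COM : TROPHUS 3/TRAVIATA 3 : 85.575\n"
    )
    for key, value in parse_fields(sample).items():
        print(key, "=>", value)
```

Cells that do not follow the `KEY : value` shape, such as "EN TERRASSE / TERRACED", are simply skipped; they would need their own handling.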