Parameters to improve detection of sparse text

Scaly Green Orc

Apr 25, 2023, 6:22:32 AM

to tesseract-ocr

Hi there hello,

I'm trying to OCR VA charts such as this one: <https://www.sia.aviation-civile.gouv.fr/dvd/eAIP_20_APR_2023/Atlas-VAC/PDF_AIPparSSection/VACH/AD/AD-3.HANO.pdf>

(the text layer is FUBAR so I'm resorting to OCR).

I'm running in sparse text mode (PSM=11). There's a lot of text, but I only care about a small subset. I'm running the recognition on grayscale images taken from the PDF. I reckon image quality shouldn't be a problem, although I do notice different results depending on the DPI I use. It mostly works fine.

But I'm having issues with bits being chopped off or not recognised when (I think) there's too much space or too little text. In the chart linked above, for instance, in the numbered list at the bottom of the second page, the numbers in the first column do not get recognised. So I get "Exploitant /Operator" instead of "1 - Exploitant /Operator". It does work when the number has two digits, as in "10 - Exploitant /Operator". This leads me to believe that my problem is with small blocks and/or lots of space.

I've tried using parameters `preserve_interword_spaces` and `textord_space_size_is_variable`, seemingly to no avail.
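For reference, here is a minimal Python sketch of how such parameters are usually passed on the command line (`-c name=value`), together with `--psm` and `--dpi`. The helper names `tesseract_cmd` and `run_ocr` and the default of 300 DPI are assumptions made for illustration and do not come from this thread. The sketch only shows the invocation; it is not a fix for the missing digits.

```python
import subprocess


def tesseract_cmd(image, psm=11, dpi=300, variables=None):
    """Build a tesseract command line that prints recognised text to stdout."""
    cmd = ["tesseract", str(image), "stdout", "--psm", str(psm), "--dpi", str(dpi)]
    # Each config variable is passed as its own "-c name=value" pair.
    for name, value in (variables or {}).items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


def run_ocr(image, **kwargs):
    """Run tesseract (assumed to be on PATH) and return the recognised text."""
    result = subprocess.run(
        tesseract_cmd(image, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `run_ocr("page2.png", variables={"preserve_interword_spaces": 1})` would then run the same image with one variable changed, which makes it easier to compare outputs between settings.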

Could someone please tell me which parameters I could play with to improve the detection of sparse chunks or increase the engine's tolerance for whitespace?

If you have any other suggestion as to how to improve the OCR, I'll gladly take it as well.

Kind regards,

Orc.

Zdenko Podobny

Apr 25, 2023, 7:06:20 AM

to tesser...@googlegroups.com

First of all - this input is a regular PDF (i.e. it contains text rather than an image) - IMO it should be easier to extract accurate text from the file instead of OCRing it...

Next: tesseract can handle simple layout analysis (e.g. book pages), but for complex layouts like that PDF, you need to use custom page layout analysis/segmentation (i.e. split the input image into homogeneous text blocks/paragraphs/lines). For example, when I OCR just the description on page 2 (where you mentioned errors) I get this output:

> tesseract page2_description.png - --psm 11

1- Exploitant / Operator :

6 - Hangars disponibles / Hangars available : NIL

Centre hospitalier FANNONAY = /FAX : 04 75 67 35 00

7 - Réparations / Repairs facility : NIL

2 - CAA : DSAC Centre-Est (voir/see GEN)

8 -Type de surface / Surface : béton /concrete

3-AVT:NIL

9 - Force portante / Strength: 4 1.

4 - RFFS : 5 extincteurs poudre/powder fire extinguishers 50 kg.

5 - Police - Douanes / Police - Customs : NIL

Zdenko
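One simple way to approximate the segmentation step described above is a horizontal projection profile: mark which pixel rows contain ink, then cut the page into bands wherever enough blank rows occur in a row. The sketch below is only an illustration under stated assumptions, not the method used for the output above. The function names, the grayscale threshold of 128 and the `min_gap` value are all hypothetical choices.

```python
def ink_profile(pixels, threshold=128):
    """For a 2D list of grayscale values, flag rows holding at least one dark pixel."""
    return [any(p < threshold for p in row) for row in pixels]


def split_rows(ink_rows, min_gap=2):
    """Return (start, end) row ranges, end exclusive, of bands of text.

    Bands are separated by at least min_gap consecutive blank rows;
    shorter gaps (e.g. between lines of one paragraph) stay inside a band.
    """
    bands, start, gap = [], None, 0
    for i, has_ink in enumerate(ink_rows):
        if has_ink:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the band just after its last inked row.
                bands.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        bands.append((start, len(ink_rows) - gap))
    return bands
```

Each band could then be cropped out of the page image and passed to tesseract on its own, with a page segmentation mode suited to a single block or line. Charts with several columns would need the same idea applied vertically as well.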

On Tue, 25 Apr 2023 at 8:22, Scaly Green Orc <npc...@gmail.com> wrote:


Scaly Green Orc

Apr 25, 2023, 12:30:53 PM

to tesseract-ocr

Zdenko,

Thank you for your reply.

Yes, it's a regular PDF. But most (not always all) of the text is borked. Try to copy/paste text from it and you'll see. I looked around for solutions to salvage it, and it seemed like OCR was what was most consistently recommended in such cases. I need to treat a stack of these programmatically.

I hear you on the segmentation as a solution, i.e. extracting relevant blocks and OCRing those. I was hoping I could avoid that additional effort. What I find vexing is that it /almost/ works. I was hoping there might be things I could tweak about tesseract's analysis. For instance, isn't there a threshold setting somewhere that makes it ignore the "1 - " in

when it has to consider it as part of the whole page? As in, how much whitespace is acceptable? I've gone through the whole list of tesseract parameters (tesseract --print-parameters) and tried to tweak those that seemed promising... but hardly any seemed to make any difference. It's not readily clear which parameters are relevant for what usage.
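To make the parameter list easier to search, the output of `tesseract --print-parameters` can be filtered by keyword. The sketch below assumes that each parameter line has the form name, tab, value, tab, description; that format is an assumption worth checking against your own build. The function name `find_params` is hypothetical.

```python
def find_params(listing, *keywords):
    """Return (name, value, description) tuples mentioning any of the keywords."""
    hits = []
    for line in listing.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip header lines or anything not in the assumed format
        name, value, desc = parts
        haystack = f"{name} {desc}".lower()
        if any(k.lower() in haystack for k in keywords):
            hits.append((name, value, desc))
    return hits
```

The listing itself can be captured with `subprocess.run(["tesseract", "--print-parameters"], capture_output=True, text=True).stdout`, and a call such as `find_params(listing, "space", "gap")` narrows it down to spacing-related candidates.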

Orc.

Tom Morris

Apr 25, 2023, 4:59:25 PM

to tesseract-ocr

On Tuesday, April 25, 2023 at 8:30:53 AM UTC-4 Scaly Green Orc wrote:

Yes, it's a regular PDF. But most (not always all) of the text is borked. Try to copy/paste text from it and you'll see. I looked around for solutions to salvage it, and it seemed like OCR was what was most consistently recommended in such cases. I need to treat a stack of these programmatically.

Although cut & paste in the browser (and Acrobat Reader) doesn't work, here's the beginning of what gets extracted by `pdftotext -layout`:

APPROCHE A VUE Transport public à la demande ANNONAY
Common carriage on request
Visual approach Centre hospitalier/Hospital
16 JUN 22
AD 3 APP 01
ALT : 1364 (49 hPa)
LAT : 45 14 30 N VAR : 2° E (20)
LONG : 004 39 57 E

FIS : LYON Info 135.525 EN TERRASSE / TERRACED

COM : TROPHUS 3/TRAVIATA 3 : 85.575
vers/to St Etienne

which seems like it could be enough to work with. The full text file is attached.
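As a rough sketch of how that output could be processed programmatically, the Python below calls `pdftotext -layout` and pulls out a few "LABEL : value" fields like those in the excerpt above. The function names, the choice of fields (ALT, LAT, LONG) and the regular expression are assumptions for illustration. Other charts may lay these fields out differently.

```python
import re
import subprocess

# A value runs until the next "LABEL :" on the same line, or the end of the line.
FIELD = re.compile(
    r"\b(ALT|LAT|LONG)[ \t]*:[ \t]*(.+?)(?=[ \t]+[A-Z]{2,}[ \t]*:|[ \t]*$)",
    re.MULTILINE,
)


def pdf_text(path):
    """Return the text layer of a PDF using pdftotext (poppler-utils assumed installed)."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(path), "-"],  # "-" writes the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def header_fields(text):
    """Map each recognised label to its value string."""
    return {label: value for label, value in FIELD.findall(text)}
```

Running `header_fields(pdf_text("AD-3.HANO.pdf"))` on a local copy of the chart would then return the altitude and coordinates as strings, ready for further parsing.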