How to train tesseract and how to recognize multiple columns and multiple fonts


Guillermo Manglano

Sep 1, 2017, 7:02:59 AM

to tesseract-ocr

First post doesn't show.

I need to convert a PDF containing images into a txt or csv file so the data can be stored in a database. I am trying to run OCR on images like the one attached.

The results are as poor as the following:

`20—0
¿ ABÚEADD LDIDI ALBARH, JDSE
AHTÚHIÚ
—- EnlúndeLarreájzegm25- Sºt] . . . . . 944 355019
: ABDGADD 5E'I'IEH ÁLUAREI 5EUERIHD`

The phone number (944 355019) is especially important. It looks close to correct, but it still contains wrong digits, which makes the whole result useless.

After much reading I still do not know how to train tesseract. I am following these instructions, among others, but when I run:

text2image --text=training_text.txt --outputbase=spa.arial.exp0 --font='Arial' --fonts_dir=/home/Fonts

I get

Could not find font named Nimbus Sans. Pango suggested font

Please correct --font arg.:Error:Assert failed:in file text2image.cpp, line 437

Segmentation fault (core dumped)

How should I approach this problem with multiple fonts, multiple columns, and Spanish as the language?

Dan9er

Sep 1, 2017, 9:22:21 AM

to tesseract-ocr

First of all, there is already finished langdata for Spanish here. Download all the files, then run combine_tessdata spa. (with the trailing period).

Second, the fonts folder you're trying to access is ~/.fonts, NOT /home/Fonts. Better yet, run nautilus (the file browser) as root (via gksudo) and move your fonts to /usr/share/fonts. That is the default system-wide font location, so every user on the system can use the fonts you downloaded.
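
To check the font directory before retrying, something like the following rough Python sketch may help. It looks for font files in a directory, asks text2image which fonts it can see there (via its --list_available_fonts flag), and builds the same command as in the question with a corrected --fonts_dir. The helper names (find_font_files, build_text2image_cmd, list_available_fonts) are made up for illustration, and /usr/share/fonts is only the location suggested above; adjust it if your fonts live elsewhere.

```python
import subprocess
from pathlib import Path

# File extensions treated as fonts here (an assumption; extend if needed).
FONT_SUFFIXES = {".ttf", ".otf"}


def find_font_files(fonts_dir):
    """Return the font files found under fonts_dir, sorted by path."""
    root = Path(fonts_dir).expanduser()
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in FONT_SUFFIXES
    )


def build_text2image_cmd(text_file, outputbase, font, fonts_dir):
    """Build the text2image argument list without running it."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={Path(fonts_dir).expanduser()}",
    ]


def list_available_fonts(fonts_dir):
    """Ask text2image which font names it can see in fonts_dir."""
    cmd = [
        "text2image",
        "--list_available_fonts",
        f"--fonts_dir={Path(fonts_dir).expanduser()}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    # Collect both streams, since the listing may not go to stdout.
    return result.stdout + result.stderr


if __name__ == "__main__":
    fonts_dir = "/usr/share/fonts"  # location suggested in this thread
    files = find_font_files(fonts_dir)
    print(f"{len(files)} font files found under {fonts_dir}")
    if files:
        print(list_available_fonts(fonts_dir))
        cmd = build_text2image_cmd(
            "training_text.txt", "spa.arial.exp0", "Arial", fonts_dir
        )
        print("Would run:", " ".join(cmd))
```

If the name passed to --font does not appear in the listing, text2image will most likely stop with the same "Could not find font named ..." error, so it is worth copying the name exactly as listed.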
