First post doesn't show.
I have the task of converting a PDF containing images into a txt or csv file to store in a database. I am trying to use OCR on images like the one attached.
The results are as poor as the following:
`20—0
¿ ABÚEADD LDIDI ALBARH, JDSE
AHTÚHIÚ
—- EnlúndeLarreájzegm25- Sºt] . . . . . 944 355019
: ABDGADD 5E'I'IEH ÁLUAREI 5EUERIHD`
Of special importance is the phone number (944 355019). It seems close to correct, but it still has wrong digits, which makes the whole thing useless.
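To make the phone number requirement concrete, here is a minimal sketch of a sanity check on the OCR output. It assumes Spanish numbers are 9 digits starting with 6 to 9, optionally split by spaces; the name `find_phone_numbers` is hypothetical. It can only reject malformed candidates (for example, letters where digits should be). It cannot tell whether a well-formed number has wrong digits, so it does not replace better recognition.

```python
import re

# Assumption: a Spanish phone number is 9 digits starting with 6-9,
# possibly printed with spaces between groups of three digits.
PHONE_RE = re.compile(r"(?<!\d)([6-9]\d{2})[ ]?(\d{3})[ ]?(\d{3})(?!\d)")

def find_phone_numbers(ocr_text):
    """Return 9-digit phone candidates found in OCR text, spaces removed."""
    return ["".join(m.groups()) for m in PHONE_RE.finditer(ocr_text)]

if __name__ == "__main__":
    sample = ". . . . . 944 355019"
    print(find_phone_numbers(sample))
```

Lines of OCR output that should contain a number but yield an empty list could then be flagged for manual review.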
After much reading, I still do not know how to train Tesseract. I am following these instructions, among others, but when I try to run:
text2image --text=training_text.txt --outputbase=spa.arial.exp0 --font='Arial' --fonts_dir=/home/Fonts
I get
Could not find font named Nimbus Sans. Pango suggested font
Please correct --font arg.:Error:Assert failed:in file text2image.cpp, line 437
Segmentation fault (core dumped)
How should I approach this problem with multiple fonts, multiple columns, and Spanish as the language?
