I prepared a few sample PNG files containing Polish-language text set in different TeX fonts. I processed them with OCRopus and found that the program ignores all diacritic characters, replacing them with similar-looking ASCII characters. For example, the phrase “pójdź kińże tę chmurność w głąb flaszy” is rendered as “pdjd2 kih2e tg ChmurnosC W glqb flaszy”.
I read a little about the previous OCRopus versions, which used the Tesseract program, and learned that UTF-8 recognition was one of the biggest advantages of these applications. The new OCRopus is still poorly documented, so I don’t know why it ignores UTF-8 encoded characters.
I use the simple ‘ocropus file.png’ command. What should I do to make OCRopus use UTF-8? Maybe I should pass some switch on the command line? Or maybe I should train OCRopus on the TeX fonts? Or maybe I should install some additional packages?
I have no idea what I could do. Any help will be welcome.
Thank you for the information, Tom.
While waiting for the answer I tried Tesseract and compared it to OCRopus. Here is a report on my experiences and conclusions.
I prepared a LaTeX template file that includes text samples in five languages (English, French, German, Polish, and Russian) and uses seven different TeX font families, series, and shapes at 10 pt size: rm, sf, tt, bf, it, sl, and sc. I also prepared a shell script that generates a separate LaTeX file for each language from that template, processes these files to DVI, converts the DVI files to GIF, and finally runs Tesseract on the GIF files. By default the script prepares the GIF files at seven different resolutions, from 150 dpi to 450 dpi.
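The pipeline described above could be sketched roughly as follows. This is not the actual shell script, which is not reproduced here; it is a Python sketch under several assumptions: the per-language LaTeX files already exist as `sample-<lang>.tex` (hypothetical names), the seven resolutions are spaced 50 dpi apart, and the languages are passed to Tesseract as the usual three-letter codes.

```python
import subprocess

# Assumed Tesseract language codes for the five sample languages.
LANGUAGES = {"English": "eng", "French": "fra", "German": "deu",
             "Polish": "pol", "Russian": "rus"}

# Assumption: seven resolutions from 150 to 450 dpi in 50 dpi steps.
RESOLUTIONS = list(range(150, 451, 50))


def build_commands(lang_code, dpi):
    """Return the command lines for one language at one resolution."""
    base = f"sample-{lang_code}"      # hypothetical file naming
    image = f"{base}-{dpi}.gif"
    return [
        # .tex -> .dvi
        ["latex", f"{base}.tex"],
        # .dvi -> .gif at the requested resolution
        ["dvigif", "-D", str(dpi), "-o", image, f"{base}.dvi"],
        # .gif -> .txt (Tesseract appends the .txt extension itself)
        ["tesseract", image, f"{base}-{dpi}", "-l", lang_code],
    ]


def run_all():
    """Run the whole pipeline; needs latex, dvigif and tesseract installed."""
    for lang_code in LANGUAGES.values():
        for dpi in RESOLUTIONS:
            for command in build_commands(lang_code, dpi):
                subprocess.run(command, check=True)
```

Keeping the command construction separate from `run_all()` makes it easy to print the commands first and check them before anything is executed.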
Then I ran the script and started to compare the results. Overall, the results are best at 300 dpi, though some characters are recognized better at other resolutions. It is no surprise that the English-language text gives the best results. The results for German are slightly worse than for English. The results for Polish and Russian are slightly worse than for German. The results for French are much worse than the results for Polish and Russian.
The results for Polish and Russian are not identical, though. The quality of recognition is similar for both languages, but Latin text and some numbers adjacent to the Russian text are recognized by Tesseract as Russian text. For example, the string “OHamburgefonsz” was recognized as “ОНатЬиг5е1соп$2” or “ОНаШЬиг3еҐопЅ2”, or something else; the number “6” was sometimes recognized as the Russian letter “б”; and in the worst case, the tt font, the string “1234567890” was recognized as “12з45в7зэо”.
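This kind of script confusion can be detected automatically. The following is a minimal sketch, not part of my test script: it relies on the fact that Unicode character names begin with the script name, and the function names are made up for illustration.

```python
import unicodedata


def scripts_in(token):
    """Return the set of scripts (e.g. LATIN, CYRILLIC) of the letters in token."""
    found = set()
    for ch in token:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "CYRILLIC SMALL LETTER BE".
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


def has_unexpected_cyrillic(expected, recognized):
    """True if the OCR output contains Cyrillic letters the source text lacks."""
    return "CYRILLIC" in scripts_in(recognized) - scripts_in(expected)
```

A digit string contains no letters at all, so any Cyrillic letter in its recognized form is flagged; the same holds for a purely Latin string such as “OHamburgefonsz”.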
Because you started training OCRopus on German, I’ll show you the results Tesseract produced for the German text.
I used the following text:
FALSCHES ÜBEN VON XYLOPHONMUSIK QUÄLT JEDEN GRÖSSEREN ZWERG.
falsches üben von xylophonmusik quält jeden grösseren zwerg.
OHamburgefonsz 1234567890 !@*()=+[]|;:,./? - #$%&_{} ~^ <> \ " ` ' ‘’«»“„”
Tesseract, analyzing the 300 dpi GIF set in the seven TeX fonts mentioned above, produced the following result:
lrm
FALSCHES ÜBEN VON XYLOPHONMUSIK QUALT JEDEN GRÖSSEREN ZWERG.
falsches üben von Xylophonmusik quält jeden grösseren Zwerg.
()Han1burgef0nsZ 1234567890 l@*():+[]|;:,./? - #$%&_{} M <> \ " ` ' “<<›>“„”
2sf
FALSCHES UBEN VON XYLOPHONIVIUSIK QUALT JEDEN GROSSEREN ZWERG.
falsches üben von xylophonmusik quält jeden grösseren Zwerg.
OHamburgefonsz 1234567890 !©*()=+[]|;:,./? - #$%&_{} M <> \ " ` ' "<<>>“„"
3tt
FALSCHES ÜBEN VON XYLOPHONMUSIK QUALT JEDEN GRÜSSEREN zwERG.
falsches üben von xylophonmusik quält jeden grösseren zwerg.
onamburgefonsz 1234567890 a@*()=+[]|;=,./? _ #$'7„&_{} "^ <> \ " ` ' “<<›>“„”
4 bf
FALSCHES ÜBEN VON XYLOPHONMUSIK QUÄLT JEDEN GRÖSSEREN ZWERG
falsches üben von xylophonmusik quält jeden grösseren Zwerg.
OHamburgefonsz 1234567890 !@*()=+[]|;:,./7 f #$%&_{} M <> \ " ` ' ”<<››“„”
5 it
FALSCHES UBEN VON XYLOPHONMUSJK QUALT JEDEN GROSSEREN ZWERG.
falsches üben von xylophonmusik quält jeden grösseren zwerg.
OHamburgef0nsz 1234567890 /@*():+[//;:,./? - #$%@§_ M <> \ ” ` ' ”<«››“„”
6 sl
FALSCHES UBEN VON XYLOPHONMUSIK QUALT JEDEN GROSSEREN ZWERG.
falsches üben von Xylophonmusík quält jeden grösseren Zwerg.
OHa1nburgef0nsz 1234567890 .'@*():+[]/;:,./? - #$%&_ M <> \ " ` ' ”<<›>“„”
7sc
FALSCHES ÜBEN VON XYLOPHONMUSIK QUÄLT JEDEN GROSSEREN ZWERG.
FALSCHES UBEN VON XYLOPHONMUSIK QUALT JEDEN GROSSEREN ZWERG.
OHAMBURGEEONSZ 1234567890 !@*():+[]|;:,./? - #$%&:_{} “^ <> \ " ` ' “<<>>“„”
The hardest part is, of course, the string of punctuation marks. It is partially language dependent, because Tesseract interprets that string differently depending on the training data used. The string “OHamburgefonsz” was recognized without any errors just two times; that is partially language dependent too (with most languages other than Russian, Tesseract recognized that string properly three times). As for the German text, the biggest problem is uppercase and small-capital letters with diacritical marks.
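To put a number on such comparisons, one could count the character edits between the source text and the recognized text. The sketch below is only an illustration (I compared the results by eye); it uses the plain Levenshtein distance and takes the German uppercase line and its sf result quoted above as the example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, recognized):
    """Edit distance divided by the length of the reference text."""
    return levenshtein(reference, recognized) / len(reference)


reference = "FALSCHES ÜBEN VON XYLOPHONMUSIK QUÄLT JEDEN GRÖSSEREN ZWERG."
sf_output = "FALSCHES UBEN VON XYLOPHONIVIUSIK QUALT JEDEN GROSSEREN ZWERG."
print(levenshtein(reference, sf_output))
```

For this pair the distance is six edits: three lost umlauts plus the “M” read as “IVI”.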
At present, from my point of view, Tesseract has six advantages over OCRopus:
1. It understands a number of languages, including Polish, while OCRopus is still being trained on German.
2. It understands different fonts without training, while OCRopus fails spectacularly with unknown fonts. Because that requires more explanation, I discuss it after this list.
3. It works much faster (for example, Tesseract processed the alice.png file in 1.8 s, while OCRopus took 1 min 42.6 s).
4. It always works (for example, Tesseract correctly processed GIF and PNG files prepared with dvigif and dvipng from the same DVI file, while OCRopus aborted in both cases: it displayed the error message “ValueError: cannot convert float NaN to integer” for the GIF file and “AssertionError: input image is not binary” for the PNG file).
5. It can be installed on different distributions (I use Slackware), while OCRopus uses an install script tailored to Debian and its derivatives, so in order to test it I had to switch to Linux Mint.
6. It works slightly better than OCRopus even on simple text such as the alice.png sample file (Tesseract recognized the text without flaws, while OCRopus changed some lowercase “w” letters into uppercase “W”).
As for font recognition, I tested both Tesseract and OCRopus on the same English text set in the seven TeX fonts mentioned above:
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
the quick brown fox jumps over the lazy dog.
OHamburgefonsz 1234567890 !@*()=+[]|;:,./? - #$%&_{} ~^ <> \ " ` ' ‘’«»“„”
Tesseract recognized it as:
lrm
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
the quick brown fox jumps over the lazy dog.
OHamburgef0nsZ 1234567890 !@*():+[]|;:,./? — #$%&_{} H <> \ " \ ' "<<>>“,,”
OCRopus made a complete mess of it:
lHE dflICR BHO/AZ LOX TflJIL8 OAEH lHE IvNz Doc.
lHE (SCICR BHOH./ EO2 iCyIb8 OAEH lHE FvNz Doc.
i RC
'it; a o ff 'a
o[UmpnL8ceIlRx Ii292-) i(U)m(]=-ii(:.'., .- :Rrr ---- \<r /xi/if'abf,;'
I believe future OCRopus versions will keep getting better, so I’ll follow the development of that program.
Thank you for your assistance once again.
A few days ago my machine stopped working and displayed a screen full of obscure error messages. I took a picture of the screen and rebooted the machine. Because I was too lazy to spend an hour copying out the contents of the screen by hand, I decided to try some OCR engines. I tested gocr 0.49, OCRopus 0.5.4, and Tesseract 3.01.
After four days of intensive work I learned a bit about OCR, and now I know that none of these programs can properly process strings of numbers and letters such as “[226158.728554] [<c1430000>] ? cs5520_init_one+0x14e/0x35f”. Personally, I doubt that any other OCR engine could process such text from a photo of moderate quality. The only solution is to copy out these messages by hand.
It’s an instructive example of what is called the irony of fate.
I studied the “Report on the comparison of Tesseract and ABBYY FineReader OCR engines” by Heliński, Kmieciak, and Parkoła (http://lib.psnc.pl/dlibra/docmetadata?id=358&from=publication&showContent=true). It is very interesting, at least for users of these two programs, though other people interested in OCR engines should find it worth reading as well. The report is very reliable and informative. Thank you, professor, for that valuable link.