Tesseract 3.04 Training for a particular case.

Leonardo Centoventotto

Sep 14, 2017, 2:12:43 AM

to tesseract-ocr

Hi All

I am relatively new to Tesseract. I've been using some Tesseract wrapper libraries from NuGet to get OCR capabilities in some C# applications. Results were fine, but now I need to read records from low-quality scans of some vehicle-related documentation.

MY SCENARIO:

In practice, I have some JPEGs coming out of various scanners (I cannot ask the original scanner users to scan at a given quality), and on those JPEGs there are fields like this:

(D.1) NISSAN
(D.2) J11 R8
(J.2) SJN782837... bla bla ba

Some fields are numeric, some are currency, and some are alphanumeric (like car plates or VIN codes).

There are no lowercase letters and no accented capitals: an accent on a capital is written as an apostrophe, like this --> '
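
Since the character set described above is so restricted, one option worth trying is a character whitelist. The sketch below only builds a Tesseract command line that sets the `tessedit_char_whitelist` variable through the `-c` option; the exact punctuation in `WHITELIST` and the helper name `tesseract_args` are assumptions made for illustration, and any currency symbol used in the scans would have to be added. A whitelist can rule out characters such as `]`, but it cannot help with J read as 3, because both are legitimate characters here.

```python
import string

# Assumed character set: capitals, digits, and the punctuation visible in the
# sample fields. Extend it (for example with a currency symbol) as needed.
WHITELIST = string.ascii_uppercase + string.digits + "().,'"


def tesseract_args(image_path, output_base, lang="eng"):
    """Build (but do not run) a Tesseract command line with a whitelist."""
    return [
        "tesseract", image_path, output_base,
        "-l", lang,
        "-c", "tessedit_char_whitelist=" + WHITELIST,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`; using a list rather than a shell string avoids quoting problems with the parentheses and the apostrophe.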

All of the documents seem to have been produced in "Lucida Console". I took my time to compare them with synthetic material produced in Word, and they match perfectly.

If I use Tesseract 3.04 with the publicly available traineddata files, both the English and the Italian language data produce output that is not accurate enough.

Some examples: J is often read as 3, so (J.2) becomes (3.2). Also, SJN78 could be interpreted as S3N78 or S]N68.

Other problems:

D --> O, 0 (zero)

F --> P and sometimes P --> F

numbers are mostly okay (some 6 --> 5)

Tesseract 4.0 gives better results and seems to be faster. Unfortunately, its improved accuracy is less uniform and consistent: 3.04 is worse, but always worse in the same way.
Also, training Tesseract 4.0 is poorly documented, especially when it comes to training it from non-synthetic, real-world material. So for now I am sticking with 3.04; let me know if you think I should try again with 4.0!

WHAT HAVE I DONE:

I have already set up a quality-improvement pipeline: deskewing, threshold adjustment and rescaling, tuned to give the best results I have been able to obtain.

I also post-process some fields with regex. If Tesseract reads "(3.1) ALPHANUMSEQUENCE", I know I can substitute it with "(J.1) ALPHANUMSEQUENCE", but this only works inside field markers.
The actual field data cannot be fully recovered with such techniques.
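
A minimal sketch of this kind of marker correction, written in Python for illustration. It assumes every field marker has the form (LETTER.DIGIT) at the start of a line, so a digit or bracket in the letter position must be a misread. The confusion table `MARKER_FIXES` and the function name `fix_marker` are hypothetical; the table would need to be filled in from the errors actually observed.

```python
import re

# Hypothetical confusion table: misread character -> intended marker letter.
MARKER_FIXES = {"3": "J", "]": "J"}

# A marker at the start of the line: "(", any one character, ".", a digit, ")".
MARKER_RE = re.compile(r"^\((.)\.(\d)\)")


def fix_marker(line):
    """Correct the letter inside the leading field marker, if it was misread."""
    def repl(match):
        letter = MARKER_FIXES.get(match.group(1), match.group(1))
        return "({}.{})".format(letter, match.group(2))
    return MARKER_RE.sub(repl, line, count=1)
```

For example, `fix_marker("(3.2) S3N78")` repairs only the marker and leaves the field data alone, which is exactly the limitation described above: the 3 inside the data cannot be told apart from a genuine 3.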

WHAT I AM TRYING TO DO:

I am trying to train Tesseract. I understand Windows lacks some support for Tesseract training, so I am using Ubuntu 16.04 for that purpose.
So: Ubuntu 16.04, 64-bit, with Tesseract 3.04 installed from a PPA, and the training tools working.

HOW I AM DOING IT:

I have, let's say, 300 real-world scanned JPEGs. They are already black and white, with the background grain almost completely removed. The resolution is not excellent, but since Tesseract gives good results overall, I hope some training will do.

I have a bash script which takes all of those images and processes them like this:

1) Deskew

2) Place a white overlay on some areas of the documents which don't contain useful data. This also prevents improbable boxes from being generated at later stages.

3) some cosmetics and threshold adjustments

4) invoke tesseract to make boxes.

5) parse each box file and eliminate some boxes which seem useless to me (correct me if I am wrong). For example, Tesseract places a box around long edges, borders and so on.
Those boxes are normally read as tildes (~). I remove them from the box file by looking at their width and height (I spent 10 minutes guessing the box file coordinate system).

5a) this step is complex. It is a pre-automation of the 6th step. Basically, I extract information from the box file, specifically the first vertical column of text (the first characters of each line), and then replace some characters. For example, all (3.[0-9]) are changed to (J.[0-9]) by regex substitution (sed). This is because there are some repetitive errors in the box files and I don't want to fix hundreds of them manually in jTessBoxEditor.

6) the final box file can be loaded inside jTessBoxEditor, so I can fix other errors manually. I also have a chance to make sure past steps worked as expected

7) then I run training from the existing boxes. I tried using both English and Italian as the base starting language.

8) then I find the output in the tessdata folder which is created inside the folder with the boxfiles and I finally test tesseract using that trained stuff.

9) it does not work ;) so I feel badly f*c*ed. results are worse than before training.

10) I post here, hoping I can get some clues.
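
Steps 5 and 5a above can be sketched roughly as follows. This is Python rather than the bash and sed script described, and it is an illustration under stated assumptions, not that script. It relies on the Tesseract 3.x box file layout, one symbol per line as `symbol left bottom right top page`, with the origin at the bottom-left corner of the image. The size limits of 80 pixels, the table `MARKER_FIXES` and all function names are hypothetical. Instead of locating the first text column by position, the sketch looks for the symbol run "(", X, ".", digit, ")" and corrects X.

```python
def parse_box_line(line):
    """Split 'symbol left bottom right top page' into typed fields."""
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return [symbol, int(left), int(bottom), int(right), int(top), int(page)]


def drop_oversized(boxes, max_width=80, max_height=80):
    """Step 5: discard boxes too wide or too tall to be a single character."""
    return [b for b in boxes
            if (b[3] - b[1]) <= max_width and (b[4] - b[2]) <= max_height]


# Hypothetical confusion table for the letter inside a field marker.
MARKER_FIXES = {"3": "J", "]": "J"}


def fix_markers(boxes):
    """Step 5a: in each '(', X, '.', digit, ')' run, correct a misread X."""
    symbols = [b[0] for b in boxes]
    for i in range(len(boxes) - 4):
        if (symbols[i] == "(" and symbols[i + 2] == "."
                and symbols[i + 3].isdigit() and symbols[i + 4] == ")"):
            boxes[i + 1][0] = MARKER_FIXES.get(symbols[i + 1], symbols[i + 1])
    return boxes


def clean_box_lines(lines):
    """Apply both steps to the lines of one box file."""
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    cleaned = fix_markers(drop_oversized(boxes))
    return [" ".join(str(field) for field in box) for box in cleaned]
```

Only the symbol column is changed; the coordinates stay as Tesseract produced them, so the result can still be loaded in jTessBoxEditor for the manual pass in step 6.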

SOME MATERIAL:

I am attaching some material. If you are curious about the scripts, I can publish them, but I am not sure anyone will need them, so let me know.

any clue?

img386.tif

img386.box
