To whom it may concern,

Just created a Tesseract 3.04 traineddata file (see attached). I call it
"MRZ" because OCR-B is its only font and it was trained with the characters
A-Z, 0-9 and the less-than symbol (<). Seems to be fast and accurate in my
projects.

On Monday, June 24, 2013 at 6:58:47 PM UTC+2, Nick White wrote:
>
> Hi Peter,
>
> Sorry for the lack of response, I think us regulars here are all
> quite busy at the moment.
>
> Have you searched the archives of this mailing list? I seem to
> recall someone previously deciding to go with a different project
> which focused just on MRZ recognition.
>
> Tesseract will do a reasonable job, as you have found, but perhaps
> a dedicated program could do even better (and for less effort on
> your part).
>
> As far as improving your Tesseract results, though, I'd recommend
> looking into user_patterns. It isn't well documented, but if the
> format you're expecting is predictable it should help. Also have you
> set up a unicharambigs file? That may help a little too (not much,
> but it's probably worth adding for the common cases of 5 -> S, 8 ->
> B, etc).
>
> > One more unrelated question. How to read data from image with
> > non-standard orientation
> > (upside down, rotated left/right by 90 degrees)? How to use OSD feature?
>
> I confess I don't actually know. I think Tesseract might try to
> guess this entirely by itself. Does anyone else here know any
> better?
>
> Once you're happy your MRZ training is as good as it will get, would
> you be happy to have it added to the main Tesseract repository? If
> so (and it'd be great if you were) open an issue on the bug tracker
> with the training file, and add some comments to the top of
> mrz.config about how it was created and where the source files for
> it are (see my grc.traineddata for an example).
>
> Thanks Peter, and sorry again for not getting back to you sooner,
>
> Nick
>
> P.S.
> One other thing I just thought of: is the DPI you're feeding
> into Tesseract the same as the DPI you trained with (300)? Ideally
> it should be. Also you're right to preprocess using thresholding;
> Tesseract isn't particularly good at that step and it's much better
> if you can do it first.
>
> On Wed, Jun 19, 2013 at 11:45:10PM -0700, Peter wrote:
> > Hello.
> >
> > I'm trying to train Tesseract for OCR. My goal is to be able to recognize
> > text from the MRZ of various documents (mainly national IDs). The training
> > process should be pretty straightforward and I'd expect good results, since
> > all I have to deal with is one font (OCR-B), capital letters of the Latin
> > alphabet (A-Z), digits 0-9 and the "less than" sign (<). Unfortunately the
> > results are worse than expected. While the results for preprocessed images
> > (thresholding using GIMP) are pretty good (not perfect - in many cases
> > Tesseract treats 0 as O, sometimes treats 5 as S and occasionally inserts
> > unexpected whitespace between letters), data taken directly from an
> > unprocessed scanned image is rather poorly recognized. There are many cases
> > where Tesseract thinks O is 0, 5 is S, 8 is S, 2 is Z, H is M, U is W etc.
> > While the 0 vs O case can be tough (OCR-B 0 and O don't look too different)
> > and perhaps beyond Tesseract's capabilities, I think the other ambiguities
> > can and should be eliminated. As I'm new to Tesseract (have been using it
> > for just a few days now) I hope you can suggest the optimal training for
> > the OCR-B font or even provide me with a good training sample.
> >
> > Here's what I did to train Tesseract:
> >
> > 1) Prepared training text with OCR-B font (train1.odt, see attachments),
> > converted it to .pdf with LibreOffice Writer (train1.pdf, see attachments)
> > 2) Opened train1.pdf in GIMP and saved it as 300 dpi tif (resolution:
> > 2479 x 3508), can't attach it as its size is more than 30 MB
> > 3) Prepared font_properties file with the following line: ocrb 0 0 1 0 0
> > 4) Executed the following Tesseract commands:
> >
> > tesseract mrz.ocrb.exp0.tif mrz.ocrb.exp0 batch.nochop makebox
> > (corrected mrz.ocrb.exp0.box included as attachment)
> > tesseract mrz.ocrb.exp0.tif mrz.ocrb.exp0 box.train
> > unicharset_extractor mrz.ocrb.exp0.box
> > shapeclustering -F font_properties -U unicharset mrz.ocrb.exp0.tr
> > mftraining -F font_properties -U unicharset mrz.ocrb.exp0.tr
> > cntraining mrz.ocrb.exp0.tr
> >
> > changed shapetable, normproto, inttemp, pffmtable, unicharset names so
> > that they're prefixed with mrz
> >
> > combine_tessdata mrz.
> >
> > To read data from images, I run:
> >
> > tesseract -l mrz output
> >
> > Is there anything that can be done better? Maybe my training file is not
> > good enough?
> > If that's the case, can you please provide me with a better one?
> >
> > One more unrelated question. How to read data from image with
> > non-standard orientation
> > (upside down, rotated left/right by 90 degrees)? How to use OSD feature?
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To post to this group, send email to tesser...@googlegroups.com
> > To unsubscribe from this group, send email to
> > tesseract-oc...@googlegroups.com
> > For more options, visit this group at
> > http://groups.google.com/group/tesseract-ocr?hl=en
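For anyone repeating the training steps from Peter's quoted message, here is a minimal Python sketch that only assembles the command lines in the order he lists them and prints them; it does not run Tesseract. The function name `training_commands` and its parameters are hypothetical, and using `mv` for the renaming step is my assumption, since the message describes that step in prose only. Note that the box file produced by the first command has to be corrected by hand before the second command is run.

```python
import shlex


def training_commands(lang="mrz", font="ocrb", exp=0):
    """Return the Tesseract 3.x training commands from the thread, in order.

    Hypothetical helper: it only builds argument lists, nothing is executed.
    """
    base = f"{lang}.{font}.exp{exp}"
    cmds = [
        # Step 1: generate a box file; correct it by hand before step 2.
        ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"],
        # Step 2: produce the .tr feature file from the image and box file.
        ["tesseract", f"{base}.tif", base, "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", f"{base}.tr"],
        ["mftraining", "-F", "font_properties", "-U", "unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
    ]
    # Renaming step described in prose in the message; `mv` is an assumption.
    for name in ("shapetable", "normproto", "inttemp", "pffmtable", "unicharset"):
        cmds.append(["mv", name, f"{lang}.{name}"])
    # Combine the prefixed files into <lang>.traineddata.
    cmds.append(["combine_tessdata", f"{lang}."])
    return cmds


if __name__ == "__main__":
    for cmd in training_commands():
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to compare them against the list in the message before running anything, and the manual box correction still has to happen between the first two commands.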