MRZ/MRP (Machine-readable zone/passport) dataset for tesseract v4

Mamadou

May 27, 2019, 1:38:11 AM

to tesseract-ocr

Hello,

We have open-sourced (under a BSD license) an MRZ/MRP (machine-readable zone/passport) dataset and models for Tesseract v4.

The dataset contains more than 7,000 images (.tif) with ground truth (.gt.txt), collected from Google Images and augmented with a small amount of synthetic data.

It's ready to be used to train with Tesseract v4.
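For anyone who wants to sanity-check the data before training, here is a minimal Python sketch that pairs each .tif image with its .gt.txt ground truth. It assumes each `name.tif` sits next to a `name.gt.txt` in the same folder; the function name and that layout are my assumptions, not something taken from the repository.

```python
from pathlib import Path


def find_training_pairs(dataset_dir):
    """Return (image, ground_truth) path pairs plus images with no ground truth."""
    root = Path(dataset_dir)
    pairs, missing = [], []
    for image in sorted(root.rglob("*.tif")):
        # Assumed layout: name.tif is stored next to name.gt.txt.
        truth = image.with_name(image.stem + ".gt.txt")
        if truth.is_file():
            pairs.append((image, truth))
        else:
            missing.append(image)
    return pairs, missing
```

Calling `find_training_pairs("path/to/dataset")` returns the usable pairs and a list of images to fix or drop before training.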

If you don't want to train the models yourself, try the pretrained ones in the tessdata_best (float model) or tessdata_fast (int model) folders.
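A rough sketch of how a pretrained model could be called from Python through the Tesseract command line. The model name `mrz` (that is, a file named `mrz.traineddata` in the chosen folder) and the page segmentation mode 6 (single uniform block of text) are assumptions on my side; check the repository for the actual file name. The input should already be a cropped MRZ image.

```python
import subprocess


def build_tesseract_command(image_path, tessdata_dir, model="mrz", psm=6):
    """Build a Tesseract 4 command line that prints recognized text to stdout."""
    return [
        "tesseract", str(image_path), "stdout",
        "--tessdata-dir", str(tessdata_dir),  # folder holding <model>.traineddata
        "-l", model,                          # assumed model name, see note above
        "--psm", str(psm),                    # 6 = assume a single uniform block of text
    ]


def recognize(image_path, tessdata_dir, model="mrz"):
    """Run Tesseract (must be installed and on PATH) and return its text output."""
    command = build_tesseract_command(image_path, tessdata_dir, model)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Pointing `tessdata_dir` at the tessdata_best folder selects the float model, and at tessdata_fast the int model.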

Accuracy: 99.7%

Source code: https://github.com/DoubangoTelecom/tesseractMRZ

Regards,

Lorenzo Bolzani

May 29, 2019, 4:08:53 AM

to tesser...@googlegroups.com

Hi Mamadou,

This sounds very interesting. How did you do the training and the accuracy measurements? What parameters did you use for the model?

Thanks, bye

Lorenzo

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a92ec47e-5055-4ffe-a174-f437d3c7ccf2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mamadou

May 29, 2019, 5:12:57 PM

to tesseract-ocr

Hello Lorenzo,

We fine-tune eng.traineddata, unmodified, with the character set restricted to [A-Z0-9]. We use the default parameters, and the model converges very quickly.
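To illustrate the charset restriction, here is a small Python sketch that reports which characters in a ground-truth line fall outside [A-Z0-9]. The helper name is made up for this example, and it only checks text; it is not part of the training pipeline.

```python
import string

# The restricted character set described above: A-Z and 0-9.
MRZ_CHARSET = frozenset(string.ascii_uppercase + string.digits)


def characters_outside_charset(text, charset=MRZ_CHARSET):
    """Return the sorted, de-duplicated characters of text not in charset.

    Whitespace (spaces, newlines) is ignored.
    """
    return sorted({ch for ch in text if ch not in charset and not ch.isspace()})
```

An empty result means the line fits the restricted set. If your ground truth keeps the "<" filler character used in printed MRZ lines, pass a larger set through the `charset` argument.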

We test the accuracy on 1,376 images from Google Images. The reported accuracy is min(detector, recognizer), i.e. the lower of the detector and recognizer accuracies. These 1,376 images can't be used directly with Tesseract; they require a detector and a preprocessor first.
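One way to read min(detector, recognizer), sketched in Python: compute an accuracy for each stage over the test set and report the lower one. The function names are hypothetical and this is only my reading of the formula, not the project's evaluation script.

```python
def stage_accuracy(correct, total):
    """Fraction of test images that a single stage handled correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def reported_accuracy(detector_accuracy, recognizer_accuracy):
    """Report the weaker of the two stages, i.e. min(detector, recognizer)."""
    return min(detector_accuracy, recognizer_accuracy)
```

For example, with made-up stage accuracies of 0.9 for the detector and 0.8 for the recognizer, `reported_accuracy(0.9, 0.8)` gives 0.8.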

Shree Devi Kumar

May 30, 2019, 12:22:15 PM

to tesser...@googlegroups.com

Thanks.

Added links in https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-Contributions

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
