Tesseract reads passport MRZ text from an image incorrectly: it identifies <<<<<<<< as kkkk or cccc


sara waheed

Jan 27, 2024, 5:19:42 AM
to tesseract-ocr
I am trying to read the passport MRZ string from an image. I am using Tesseract for OCR and OpenCV for image processing. I have tried three different approaches, and none of them worked.

**Attempt 1**
I have this image. When I run OCR on it, Tesseract reads it as:

    IDAUT10000999<6<<<<<<<<<<<<<<<
    7109094F1112315AUT<<<<<<xcc<<6
    MUSTERFRAU<<ISOLDE<<<<<<<<cc<<

This is incorrect: it treats some of the `<` characters as x, c, or k. When I use the `mrz-java` library to parse the details from the string, it gives the following error:

    [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1 IDAUT10000999<6<<<<<<<<<<<<<<<
    [error] 7109094F1112315AUT<<<<<<xcc<<6
    [error] MUSTERFRAU<<ISOLDE<<<<<<<<cc<<
    [error]  at 24-25,1: Invalid character in MRZ record: x

**Attempt 2**

Then I converted the image to grayscale and binarized it using `OpenCV`. Here is the code:

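    // Assumes the OpenCV Java bindings and Tess4J are on the classpath
    // and that the OpenCV native library has already been loaded.
    import java.io.File
    import net.sourceforge.tess4j.Tesseract
    import org.opencv.core.Mat
    import org.opencv.imgcodecs.Imgcodecs
    import org.opencv.imgproc.Imgproc

    val roiImagePath = "src/main/resources/ocr/passport/two-page-passport-mrz-detected.jpeg"

    // Convert the MRZ region to grayscale and save it
    val roiImage = Imgcodecs.imread(roiImagePath)
    val grayScaleROI = new Mat()
    Imgproc.cvtColor(roiImage, grayScaleROI, Imgproc.COLOR_BGR2GRAY)
    val roiGrayImagePath = "src/main/resources/ocr/passport/two-page-passport-mrz-detected-gray.jpeg"
    Imgcodecs.imwrite(roiGrayImagePath, grayScaleROI)

    // Binarize with a mean adaptive threshold (block size 15, constant 25)
    val binary = new Mat()
    Imgproc.adaptiveThreshold(grayScaleROI, binary, 255, Imgproc.ADAPTIVE_THRESH_MEAN_C, Imgproc.THRESH_BINARY, 15, 25)
    val roiBinaryImagePath = "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary.jpeg"
    Imgcodecs.imwrite(roiBinaryImagePath, binary)

    // Run OCR on the binarized image and strip spaces from the result
    val tesseract = new Tesseract()
    tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata")
    tesseract.setVariable("user_defined_dpi", "600")
    val result = tesseract.doOCR(new File(roiBinaryImagePath))
    val mrzStr = result.replace(" ", "")
    println(s"two page passport mrz string is: $mrzStr")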

It created the following binary image.

Tesseract reads the MRZ string from the binary image as:

    IDAUT1DODD999<E<KK<KKKKEKEKEK
    7AD9D9GF1TEZSISAUTKKKKKKKKKEKG
    MUSTERFRAUSKISOLDEKKKKKKKKKKK

and `mrz-java` fails to parse the string with the following error:

    [error] Error parsing MRZ string: Failed to parse MRZ null IDAUT1DODD999<E<KK<KKKKEKEKEK
    [error] 7AD9D9GF1TEZSISAUTKKKKKKKKKEKG
    [error] MUSTERFRAUSKISOLDEKKKKKKKKKKK
    [error]  at 0-0,0: Different row lengths: 0: 29 and 1: 30
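
The two parser errors so far (an invalid character and mismatched row lengths) can be caught before the string reaches the parser. The following is a minimal sketch in plain Scala, not part of `mrz-java`; the object name `MrzShape` and its `problems` method are hypothetical, and the defaults assume the three-row, 30-column TD1 layout that the parser reported in Attempt 1.

```scala
object MrzShape {
  // The MRZ alphabet: upper-case letters, digits, and the filler character '<'
  private val Allowed: Set[Char] = (('A' to 'Z') ++ ('0' to '9')).toSet + '<'

  /** Returns a list of problems; an empty list means the shape looks plausible. */
  def problems(raw: String, expectedRows: Int = 3, expectedCols: Int = 30): List[String] = {
    // Split into rows, drop spaces (as the original code does) and empty lines
    val rows = raw.split("\\r?\\n").map(_.replace(" ", "")).filter(_.nonEmpty).toList

    val rowCountProblem =
      if (rows.length != expectedRows) List(s"expected $expectedRows rows, got ${rows.length}")
      else Nil

    val perRowProblems = rows.zipWithIndex.flatMap { case (row, i) =>
      val lengthProblem =
        if (row.length != expectedCols) List(s"row $i: length ${row.length}, expected $expectedCols")
        else Nil
      val bad = row.filterNot(Allowed).distinct
      val charProblem =
        if (bad.nonEmpty) List(s"row $i: invalid characters '$bad'")
        else Nil
      lengthProblem ++ charProblem
    }

    rowCountProblem ++ perRowProblems
  }
}
```

A check like this only reports that the OCR output is malformed. It does not repair it, and blindly mapping `K` or `c` back to `<` would also corrupt legitimate letters in the name field.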

**Attempt 3**

Then I resized the image:

    // Also requires: import org.opencv.core.Size
    val width = 1000 // Increase width proportionately (adjust based on your needs)
    val height = (width * binary.rows()) / binary.cols() // Maintain aspect ratio

    val resizedRoiImage = new Mat()
    Imgproc.resize(binary, resizedRoiImage, new Size(width, height), 0.0, 0.0, Imgproc.INTER_NEAREST)

    val resizedImageROIPath = "src/main/resources/ocr/passport/two-page-passport-mrz-detected-binary-resized_image.jpg"
    Imgcodecs.imwrite(resizedImageROIPath, resizedRoiImage)

The MRZ string read by Tesseract:

    TOAUTIOOOOIISKhcceccccddddddce
    FIOPOSAFIFESSISAUTReececeececs
    MUSTERFRAUCCKISOLDECKccccdcddd

and the error is

    [info] 15:54:04.200 633 [main] MrzParser INFO - Check digit verification failed for document number: expected 0 but got h
    [error] Error parsing MRZ string: Failed to parse MRZ MRTD_TD1 TOAUTIOOOOIISKhcceccccddddddce
    [error] FIOPOSAFIFESSISAUTReececeececs
    [error] MUSTERFRAUCCKISOLDECKccccdcddd
    [error]  at 15-16,0: Invalid character in MRZ record: c
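
The "Check digit verification failed" line in the log refers to the MRZ check digit scheme: each character is mapped to a number (digits keep their value, `A` to `Z` map to 10 to 35, `<` maps to 0), multiplied by the repeating weights 7, 3, 1, and the sum is taken modulo 10. Below is a minimal sketch of that calculation in plain Scala. The object name `MrzCheckDigit` and its methods are hypothetical and are not the `mrz-java` API.

```scala
object MrzCheckDigit {
  private val Weights = Array(7, 3, 1)

  // Numeric value of one MRZ character, or None if it is outside the MRZ alphabet
  private def value(c: Char): Option[Int] = c match {
    case d if d >= '0' && d <= '9' => Some(d - '0')
    case l if l >= 'A' && l <= 'Z' => Some(l - 'A' + 10)
    case '<'                       => Some(0)
    case _                         => None
  }

  /** Check digit for a field, or None if the field contains an invalid character. */
  def compute(field: String): Option[Int] = {
    val values = field.map(value)
    if (values.exists(_.isEmpty)) None
    else Some(values.flatten.zipWithIndex.map { case (v, i) => v * Weights(i % 3) }.sum % 10)
  }

  /** True if checkChar is a digit equal to the computed check digit of field. */
  def matches(field: String, checkChar: Char): Boolean =
    checkChar >= '0' && checkChar <= '9' && compute(field).contains(checkChar - '0')
}
```

For the correctly read first row, `MrzCheckDigit.matches("10000999<", '6')` holds. For the Attempt 3 output, the document number field was read as `IOOOOIISK`, whose computed check digit is 0, while the character in the check position is `h`, which is what the parser reports.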

  
can anyone please help how I read the text properly also I have tried one regex to convert c or k back to <<< it did not work either if anyone can suggest some workaround or any improvement in code please help me with that thanks

Zdenko Podobny

Jan 27, 2024, 5:26:40 AM
to tesser...@googlegroups.com
What about reading the docs and a little bit of googling?

    tesseract two-page-passport-mrz-detected.jpeg - --psm 6 -l mrz

    IDAUT10000999<6<<<<<<<<<<<<<<<
    7109094F1112315AUT<<<<<<<<<<<6
    MUSTERFRAU<<ISOLDE<<<<<<<<<<<<
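
The command above can also be driven from Scala code. The following is a minimal sketch that shells out to the same command line through `scala.sys.process`. It assumes the `tesseract` binary is on the `PATH` and that an `mrz.traineddata` file (a third-party model, not shipped with Tesseract) is installed in its tessdata directory. The object name `MrzCli` and its methods are hypothetical.

```scala
import scala.sys.process._

object MrzCli {
  /** Builds the same command line as in the reply: stdout output, page segmentation mode 6, language "mrz". */
  def command(imagePath: String): Seq[String] =
    Seq("tesseract", imagePath, "-", "--psm", "6", "-l", "mrz")

  /** Runs tesseract and returns the non-empty output rows with spaces removed.
    * Throws if the tesseract process exits with a non-zero status. */
  def read(imagePath: String): List[String] =
    command(imagePath).!!
      .split("\\r?\\n")
      .map(_.replace(" ", ""))
      .filter(_.nonEmpty)
      .toList
}
```

Calling `MrzCli.read("two-page-passport-mrz-detected.jpeg")` would run the command on the original, unprocessed image, which is the input that produced the correct result here.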



Zdenko



sara waheed

Jan 27, 2024, 6:02:08 AM
to tesseract-ocr
If I hadn't done any research, how would I know that Tesseract needs image processing? I am new to OCR and still in the learning phase. Please be kind and help. Thanks :)

Zdenko Podobny

Jan 27, 2024, 6:11:18 AM
to tesser...@googlegroups.com
Well, in this case it works without image processing ;-)

Anyway, mrz is not an "official" Tesseract training, and there are people who play with it, so it will take some time to search and dig through their findings/experience/expertise...

Zdenko

