Tesseract performance on ID cards and passports


Alexey Pismenskiy

Sep 1, 2023, 6:03:49 PM
to tesseract-ocr
I'm looking into OCR for ID cards and driver's licenses, and I found that tesseract performs relatively poorly on ID cards compared to other OCR solutions. For this original image: https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png the results are:

tesseract: "4d DL 999 as = Ne allo) 2NICK © , q 12 RESTR oe } lick: 5 DD 8888888888 1234 SZ"
easyocr:  '''9 , ARKANSAS DRIVER'S LICENSE CLAss D 4d DLN 999999999 3 DOB 03/05/1960 ] 2 SCKPLE 123 NORTH STREET CITY AR 12345 ISS 4b EXP 03/05/2018 03/05/2026 15 SEX 16 HGT 18 EYES 5'-10" BRO 9a END NONE 12 RESTR NONE Ylck Sorble DD 8888888888 1234 THE'''
google cloud vision: """SARKANSAS\nSAMPLE\nSTATE O\n9 CLASS D\n4d DLN 9999999993 DOB 03/05/1960\nNick Sample\nDRIVER'S LICENSE\n1 SAMPLE\n2 NICK\n8 123 NORTH STREET\nCITY, AR 12345\n4a ISS\n03/05/2018\n15 SEX 16 HGT\nM\n5'-10\"\nGREAT SE\n9a END NONE\n12 RESTR NONE\n5 DD 8888888888 1234\n4b EXP\n03/05/2026 MS60\n18 EYES\nBRO\nRKANSAS\n0"""

and word accuracy is:

             tesseract  |  easyocr  |  google
words         10.34%    |  68.97%   |  82.76%

This is "out if the box" performance, without any preprocessing. I'm not surprised that google vision is that good compared to others, but easyocr, which is another open source solution performs much better than tesseract is this case. I have the whole project dedicated to this, and all other results are much better for easyocr: https://github.com/apismensky/ocr_id/blob/main/result.json, all input files are files in https://github.com/apismensky/ocr_id/tree/main/images/sources
After digging into it for a little bit, I suspect that bounding box detection is much better in google (https://github.com/apismensky/ocr_id/blob/main/images/boxes_google/AR.png) and easyocr (https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png) than in tesseract (https://github.com/apismensky/ocr_id/blob/main/images/boxes_tesseract/AR.png).
I'm pretty sure about this, because when I manually cut out the text boxes and feed them to tesseract it works much better.


Now questions:

- What part of the tesseract codebase is responsible for text detection, and which algorithm does it use?
- What impacts bounding box detection in tesseract so that it fails on these types of images (complex layouts, background noise, etc.)?
- Is it possible to use the same text detection procedure as easyocr, or improve the existing one?
- Maybe it's possible to switch the text detection algorithm based on the image type, or make it pluggable so the user can configure one of several options A, B, C...


Thanks. 

nguyen ngoc hai

Sep 4, 2023, 4:02:27 AM
to tesseract-ocr
Hi, 
I would like to hear others' opinions on your questions too.
In my case, when I try using Tesseract on Japanese train tickets, I have to do a lot of preprocessing steps (removing background colors, noise and line removal, increasing contrast, etc.) to get satisfactory results.
I am sure what you are doing (locating text boxes, extracting them, and feeding them one by one to tesseract) can get better accuracy. However, when the number of text boxes increases, it will undoubtedly affect your performance.
Could you share the PSM mode you use for getting those text boxes' locations? I usually use AUTO_OSD to get the boxes and expand them a bit at the edges before passing them to Tesseract - roughly as in the sketch below.
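
A rough sketch of what I mean (not my production code; the file name and padding value are just placeholders): get word-level boxes with PSM 1 (auto + OSD), expand each box a little, and OCR each crop separately.

import pytesseract
from pytesseract import Output
from PIL import Image

PAD = 5  # how many pixels to expand each box at the edges (placeholder value)

img = Image.open("ticket.png")  # placeholder file name
data = pytesseract.image_to_data(img, config="--psm 1", output_type=Output.DICT)

texts = []
for i, word in enumerate(data["text"]):
    if not word.strip():
        continue
    left, top = data["left"][i], data["top"][i]
    w, h = data["width"][i], data["height"][i]
    # Expand the box at the edges, clamped to the image borders.
    box = (max(left - PAD, 0), max(top - PAD, 0),
           min(left + w + PAD, img.width), min(top + h + PAD, img.height))
    crop = img.crop(box)
    # psm 8 = treat the crop as a single word.
    texts.append(pytesseract.image_to_string(crop, config="--psm 8").strip())

print(" ".join(t for t in texts if t))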

Regards
Hai
 

Alexey Pismenskiy

Sep 5, 2023, 11:17:23 AM
to tesseract-ocr
These results are for PSM=1. I think I have tried other values, but I haven't noticed any improvement.

Regards, 
Alexey

Alexey Pismenskiy

Sep 5, 2023, 11:32:56 AM
to tesseract-ocr
Hai, could you please tell me what you are doing for pre-processing? 
Do you have any source code you can share? 
Are those results consistently better for images scanned with different quality (resolution, angle, contrast, etc.)?



Alexey Pismenskiy

Sep 5, 2023, 6:20:26 PM
to tesseract-ocr
OK, so EasyOCR is using CRAFT for text detection (https://pypi.org/project/craft-text-detector/), and it gives much better results for my image. Here is the image with bounding boxes from CRAFT: https://github.com/apismensky/ocr_id/blob/main/outputs/AR_text_detection.png
It also produces a folder with a bunch of crops of the original image: https://github.com/apismensky/ocr_id/tree/main/outputs/AR_crops
which can be fed to tesseract with psm=7, giving this output:
crop_0.png:     5ARKANSAS DRIVER’S LICENSE
crop_1.png:
crop_2.png:     9¥ CLASS LD
crop_3.png:     4a DLN. 999999999: pos 03/05/1960
crop_4.png:
crop_5.png:     1 SAMPLE
crop_6.png:     2NICK
crop_7.png:
crop_8.png:     8123 NORTH STREET
crop_9.png:     CITY, AR 12345
crop_10.png:     4bEXP
crop_11.png:     4aiss
crop_12.png:     03/05/2026 \/"— \
crop_13.png:     03/05/2018
crop_14.png:     1SSEX 16HGT
crop_15.png:     18 EYES
crop_16.png:     5'-10*
crop_17.png:     M
crop_18.png:     BRO
crop_19.png:     9a END NONE
crop_20.png:     12 RESTR NONE
crop_21.png:     Vick Cample
crop_22.png:     5 DD 8888888888 1234
CRAFT + tesseract result:     5ARKANSAS DRIVER’S LICENSE      9¥ CLASS LD     4a DLN. 999999999: pos 03/05/1960      1 SAMPLE     2NICK      8123 NORTH STREET     CITY, AR 12345     4bEXP     4aiss     03/05/2026 \/"— \     03/05/2018     1SSEX 16HGT     18 EYES     5'-10*     M     BRO     9a END NONE     12 RESTR NONE     Vick Cample     5 DD 8888888888 1234
which is waaaayyy better than when tesseract tries to detect the bounding boxes itself.
The whole script is here: 
I'm also using psm=0 to detect the image rotation angle and fix the rotation before applying CRAFT.
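
In case it's useful, here is roughly what that detection step looks like (a sketch based on the craft-text-detector package docs, not my exact script; the input path is just an example):

import pytesseract
from PIL import Image
from craft_text_detector import Craft

image_path = "images/sources/AR.png"  # example input

# psm=0 runs OSD only and reports the rotation angle ("Rotate: <degrees>"),
# which can be used to deskew the image before running CRAFT.
print(pytesseract.image_to_osd(Image.open(image_path)))

# CRAFT writes the detected regions and one crop per text box into output_dir.
craft = Craft(output_dir="outputs", crop_type="box", cuda=False)
prediction = craft.detect_text(image_path)
print(prediction["boxes"])  # polygon coordinates of each detected region
craft.unload_craftnet_model()
craft.unload_refinenet_model()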

Would it be possible to use CRAFT in tesseract for bounding boxes? 

nguyen ngoc hai

Sep 6, 2023, 9:07:52 PM
to tesseract-ocr
Hi Apismensky,

Here are the code and a sample I used for preprocessing. I extracted the ticket region of the train ticket from a picture taken with a smartphone, since the angle, distance, brightness, and many other factors can change the picture quality.
I would say scanned images or images taken with a fixed-position camera have more consistent quality.

Here is the original image:

sample_to_remove_lines.png

# Try to remove lines
import cv2
import numpy as np

# cv2_show is a custom helper for displaying images (see the note below the code)
org_image = cv2.imread("/content/sample_to_remove_lines.png")
cv2_show('org_image', org_image)
gray = cv2.cvtColor(org_image, cv2.COLOR_BGR2GRAY)

thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
cv2_show('thresh Otsu', thresh)


# Remove noise dots with a small morphological opening.
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, np.ones((2, 2), np.uint8))
cv2_show('opening', opening)

thresh = opening.copy()
mask = np.zeros_like(org_image, dtype=np.uint8)

# Extract horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (60, 1))
remove_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
cnts = cv2.findContours(remove_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(mask, [c], -1, (255, 255, 255), 8)
# cv2_show('mask extract horizontal lines', mask)

# Extract vertical lines
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 70))
remove_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
cnts = cv2.findContours(remove_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(mask, [c], -1, (255, 255, 255), 8)

cv2_show('mask extract lines', mask)

result = org_image.copy()
# Loop through the pixels of the original image and whiten every pixel
# that is white in the line mask, erasing the detected lines.
for y in range(mask.shape[0]):
    for x in range(mask.shape[1]):
        if np.all(mask[y, x] == 255):  # If pixel is white in mask
            result[y, x] = [255, 255, 255]  # Set pixel to white

cv2_show("result", result)

gray = cv2.cvtColor(result, cv2.COLOR_BGR2GRAY)
_, simple_thresh = cv2.threshold(gray, 195, 255, cv2.THRESH_BINARY)
cv2_show('simple_thresh', simple_thresh)


In the above code, you can ignore the cv2_show function since it is just my custom method for showing images.
You can see that the idea is to remove some noise, remove the lines, and then apply a simple threshold.
extracted_lines.png

removed_lines.png


ready_for_locating_text_box.png

I would say that, from this point, the AUTO_OSD PSM mode of Tesseract can also give the text boxes for the above picture; you also need to check the RIL mode (maybe RIL.WORD or RIL.TEXTLINE) to get the right level of text boxes.
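
For reference, a minimal sketch of what I mean with tesserocr (the file name is just a placeholder):

from tesserocr import PyTessBaseAPI, PSM, RIL

with PyTessBaseAPI(psm=PSM.AUTO_OSD) as api:
    api.SetImageFile("ready_for_locating_text_box.png")  # placeholder file name
    # RIL.TEXTLINE returns line-level boxes; RIL.WORD would return word-level ones.
    boxes = api.GetComponentImages(RIL.TEXTLINE, True)
    for _, box, _, _ in boxes:
        # box is a dict with 'x', 'y', 'w', 'h' in image coordinates.
        print(box)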
In my opinion, the same preprocessing methods can only be applied to a certain group of samples. It is in fact very hard to cover all the cases.  For example: 

black_background.png

I found it difficult to locate the text boxes where the text is white and the background is dark. Black text on a white background is easy to locate and then OCR. I am not sure what a good method would be to locate white text on dark background colors.
I hope to hear your suggestions, as well as others', on this matter.

Regards
Hai

Alexey Pismenskiy

Sep 7, 2023, 4:19:26 PM
to tesseract-ocr
Thanks for sharing, Hai.
Looks like CRAFT can detect regions regardless of the background: https://github.com/apismensky/ocr_id/blob/main/images/boxes_craft/black_background_text_detection.png
It also creates crops for each text region, which can be OCRed separately and then joined together into a result.
When I ran your example with https://github.com/apismensky/ocr_id/blob/main/ocr_id.py I got the following output:

CRAFT + crop result: 下記 の と お り 、 御 請求 申し 上 げ ま す 。 株 式 会 社 ス キャ ナ 保 存 スト レー ジ プ ロ ジェ クト 件 名 T573-0011 2023/4/30 大 阪 市 北 区 大 深町 3-1 支払 期限 山口 銀行 本 店 普通 1111111 グラ ン フ ロン ト 大 阪 タ ワーB 振込 先 TEL : 06-6735-8055 担当 : ICS 太 郎 66,000 円 (税込 ) a 摘要 数 重 単位 単 価 金額 サン プル 1 1 式 32,000 32,000 サン プル 2 1 式 18000 18,000 2,000 2,000' 8g,000' 2,000
crop word_accuracy: 48.78048780487805

I've tried to create a map of boxes using .uzn files and pass it to tesseract, but the results are worse:
CRAFT result: 下記 の と お り 、 御 請求 申し 上 げ ま す 。

株 式 会 社 ス キャ ナ 保 存

スト レー ジ プ ロ ジェ クト

〒573-0011

2023/4/30

大 阪 市 北 区 大 深町 3-1

山口 銀行 本 店 普通 1111111

グラ ン フ ロン ト 大 阪 タ ワーB

TEL : 06-6735-8055

担当 : ICS 太 郎

66,000 円 (税込 )

サン プル 1

1| 式

32,000

32,000

サン プル 2

1| 式

18000

18,000

2,000

2,000.

8,000

8,000

craft word_accuracy: 36.58536585365854. 

Apparently 金額 is not there;
sorry, my Japanese is a little bit rusty :-)
I have the impression that when I pass the map of .uzn text regions to tesseract, it applies one transformation to pre-process the whole image, but when I pass each individual image it preprocesses them separately, applying the best strategy for each region. Of course it is slower this way.

nguyen ngoc hai

Sep 8, 2023, 8:18:42 AM
to tesseract-ocr
Hi Alexey, 

Thank you very much for trying out my sample. It is very informative to understand how CRAFT could correctly extract the text regions. As far as I know, Tesseract has a very nice Python wrapper, tesserocr, which provides many easy-to-use methods to analyze the text in an image with a range of PSM and RIL modes. However, unfortunately, I was not able to find a good method in the API to efficiently extract all the text regions from samples with multiple background and text colors.

The results you provided are actually very promising. I have not read your code carefully yet, but may I ask: after getting all the text regions, did you pass them one by one to Tesseract, or how did you get the CRAFT + crop result (with accuracy 48.78)?

As I noticed, some lines on the sample can add noise to the results. I think that if the line-removal method were applied, the results could be better. I do not quite understand the technique of creating a map of boxes using .uzn files and passing it to Tesseract; can you explain a bit further? And yes, you are right: not only was 金額 missing, but all of the dark-background text regions are missing from the second results (such as 摘要 数 重 単位 単 価 金額, etc.).

Apologies for the conversation getting longer, but your original questions are yet to be answered. I am deeply interested in understanding them too.

Regards
Hai. 

Alexey Pismenskiy

Sep 8, 2023, 11:27:53 AM
to tesseract-ocr
Hai, sorry, I missed a lot of details in my last message, so I will try to clarify.
Disclaimer: I'm not a computer vision guru, nor an ML or data science guy - just a regular software development background.
- API to extract efficiently all the text regions for some multi-background and text colors samples
I don't think that tesseract has decent text region detection out of the box. That is what I'm trying to figure out in my post. The tesseract folks have not responded to it yet; IDK if any of them are in this mailing group. It looks like there are better options out there (CRAFT is just one of them: https://arxiv.org/abs/1904.01941), and IDK why they cannot be integrated into tesseract.
- did you pass them one by one to Tesseract - yes. When CRAFT is executed in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L233 it creates a bunch of crop files; for your ticket example they are in https://github.com/apismensky/ocr_id/tree/main/images/boxes_craft/ticket_crops. It also creates a text region map in https://github.com/apismensky/ocr_id/blob/main/images/boxes_craft/ticket_text_detection.txt. Most of the regions are rectangles (8 numbers in one row: x1,y1,...,x4,y4) but some may be polygons (as you can see in the other files).
Then I sort all the files by crop_NUMBER in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L265C6-L265C6 so that they are ordered by their appearance in the original image.
Then I loop through all of them in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L267 and feed each image to tesseract in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L270. Notice that I'm using psm=7 there, because we already know that each image is a box with a single text line, and then I join them together with crop_result = ' '.join(res) (a simplified sketch of this loop is below).
Also notice that I'm not doing any pre-processing; I wonder what the result would be with some preprocessing for each image - hopefully better?
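
To make it concrete, this is roughly what that loop boils down to (a simplified sketch, not the exact code in ocr_id.py):

import glob
import re
import pytesseract
from PIL import Image

def ocr_crops(crops_dir):
    # Sort crop_0.png, crop_1.png, ... numerically so the output follows
    # the order in which CRAFT emitted the regions.
    paths = sorted(glob.glob(f"{crops_dir}/crop_*.png"),
                   key=lambda p: int(re.search(r"crop_(\d+)", p).group(1)))
    res = []
    for p in paths:
        # psm 7 = treat each crop as a single text line.
        text = pytesseract.image_to_string(Image.open(p), config="--psm 7").strip()
        if text:
            res.append(text)
    return ' '.join(res)

crop_result = ocr_crops("outputs/AR_crops")
print(crop_result)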
I have tried another approach: passing a map of text regions detected by CRAFT to tesseract, so that it will not try to do its own text detection.
The motivation was to avoid one tesseract call per crop (and so reduce the time).
That's what the .uzn files are for.
So for your example it will be something like:
tesseract ticket.png - --psm 4 -l script/Japanese
Notice that https://github.com/apismensky/ocr_id/blob/main/images/sources/ticket.uzn is in the same folder as the original image, and it has the same base name as the image file (minus the extension).
There is a little function that converts CRAFT text boxes to tesseract .uzn files: https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L175
The problem is that you cannot really use pytesseract.image_to_data; I assume this is because of a filename mismatch: image_to_data (most probably) creates a temp file in the filesystem whose name does not match the .uzn file name.
So I did it by calling subprocess.check_output(command, shell=True, text=True) in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L101C18-L101C73 to manually run tesseract as an external process. As I mentioned in my last message, this approach did not give me output for the regions with inverted colors (white letters on a black background).
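
For illustration, a simplified sketch of the conversion and the external call (not the exact ocr_id.py code; it assumes the CRAFT detection file has one comma-separated region per line, x1,y1,...,x4,y4, and that axis-aligned rectangles are good enough):

import subprocess

def craft_to_uzn(craft_txt, uzn_path):
    lines = []
    with open(craft_txt) as f:
        for line in f:
            vals = [int(float(v)) for v in line.strip().split(',') if v.strip()]
            if len(vals) < 8:
                continue
            xs, ys = vals[0::2], vals[1::2]
            left, top = min(xs), min(ys)
            # Each .uzn line is: left top width height label
            lines.append(f"{left} {top} {max(xs) - left} {max(ys) - top} Textline")
    with open(uzn_path, "w") as f:
        f.write("\n".join(lines) + "\n")

# The .uzn file must sit next to the image and share its base name.
craft_to_uzn("images/boxes_craft/ticket_text_detection.txt",
             "images/sources/ticket.uzn")
command = "tesseract images/sources/ticket.png - --psm 4 -l script/Japanese"
print(subprocess.check_output(command, shell=True, text=True))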
Hopefully that makes sense; LMK if you have further questions.


BTW, I was looking for some more or less substantial information about the architecture of tesseract - at least at the level of the main components, pipeline, algorithms, etc. - but could not find any. If you (or anyone) are aware of any, please LMK.

nguyen ngoc hai

Sep 12, 2023, 8:30:30 PM
to tesseract-ocr
Hi Alexey, 
Thank you very much for your detailed explanation. 
Sorry for my late reply. I got dragged into different matters in the last few days. 

Apparently, I was not aware of the .uzn file usage for Tesseract before. Thank you.  

In my previous project, I did apply preprocessing to each block image (as some may have different background noise or be low quality). However, doing that is really not a good approach for large images with a great number of text boxes. I used Python multiprocessing to speed things up a little. With that, depending on the number of CPU cores, we can process multiple images in parallel, roughly as in the sketch below.
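
A rough sketch of the idea (not my original code; the crop paths and the psm value are placeholders):

from multiprocessing import Pool

import pytesseract
from PIL import Image

def ocr_one(path):
    # Each worker preprocesses/OCRs one cropped block independently.
    return pytesseract.image_to_string(Image.open(path), config="--psm 7").strip()

if __name__ == "__main__":
    crop_paths = [f"crops/crop_{i}.png" for i in range(23)]  # placeholder paths
    with Pool() as pool:  # defaults to the number of CPU cores
        texts = pool.map(ocr_one, crop_paths)
    print(' '.join(t for t in texts if t))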

In the above sample of mine, as you almost get 100% correct results from the text boxes. I will try to apply some preprocessing methods, to see if the results can be improved further. 
I will let you know right after that. 

Meanwhile, I still hope to hear updates on your questions.

Regards
Hai. 