unicharset_extractor issue


Stevan Cakic

Aug 25, 2019, 2:02:00 PM
to tesseract-ocr
Hi,

I'm trying to create a .traineddata file for digits (example .tif file):

test.png

First, I run this command: tesseract eng.strangelabelmachinefont.exp0.tif .strangelabelmachinefont.exp0 batch.nochop makebox
After that, I run this: tesseract eng.strangelabelmachinefont.exp0.tif eng.strangelabelmachinefont.exp0 box.train

So far, everything is fine. The file eng.strangelabelmachinefont.exp0.box is created with this content:
1 13 17 34 61 0
8 51 15 81 61 0
3 97 14 125 59 0
5 141 13 170 58 0
0 184 13 216 58 0
3 231 13 261 58 0
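
Each line of a box file holds a symbol followed by the left, bottom, right, and top pixel coordinates and a page number. A minimal Python sketch for reading lines like the ones above; `Box` and `parse_box_line` are hypothetical helper names, not part of Tesseract:

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right: the last five fields are always numeric.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


sample = """\
1 13 17 34 61 0
8 51 15 81 61 0
3 97 14 125 59 0
5 141 13 170 58 0
0 184 13 216 58 0
3 231 13 261 58 0
"""

boxes = [parse_box_line(line) for line in sample.splitlines() if line.strip()]
for box in boxes:
    # Print each symbol with its box width and height in pixels.
    print(box.symbol, box.right - box.left, box.top - box.bottom)
```

Every line here parses cleanly into one symbol and five integers, so the box file itself looks well formed.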

I run into a problem after calling this command: unicharset_extractor eng.strangelabelmachinefont.exp0.box
That command creates a unicharset file with this content:
8
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
1 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 1 # 1 [31 ]0
8 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 8 # 8 [38 ]0
3 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 3 # 3 [33 ]0
5 8 0,255,0,255,0,0,0,0,0,0 Common 6 2 6 5 # 5 [35 ]0
0 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 0 # 0 [30 ]0
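
The first line of a unicharset file is the number of entries, and each following line starts with the symbol itself. A rough Python sketch that checks the declared count against the listing above; `read_unicharset_symbols` is a hypothetical helper, and only the first field of each line is interpreted:

```python
from typing import List


def read_unicharset_symbols(text: str) -> List[str]:
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0])
    # The symbol is the first whitespace-separated field of each entry.
    symbols = [ln.split(None, 1)[0] for ln in lines[1:]]
    if declared != len(symbols):
        raise ValueError(f"header says {declared}, found {len(symbols)} entries")
    return symbols


unicharset_text = """\
8
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
1 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 1 # 1 [31 ]0
8 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 8 # 8 [38 ]0
3 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 3 # 3 [33 ]0
5 8 0,255,0,255,0,0,0,0,0,0 Common 6 2 6 5 # 5 [35 ]0
0 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 0 # 0 [30 ]0
"""

symbols = read_unicharset_symbols(unicharset_text)
print(symbols)
```

The count in the header matches the eight entries, and the five digits from the box file are all present, so by this check the unicharset looks consistent with the box file.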

The problem appears when I run the next command: shapeclustering -F font_properties unicharset file_name.tr

I get lots of errors, mostly about a bad format in the tr file:

Reading unicharset ...
Bad format in tr file, reading fontname, unichar
Bad box coordinates in boxfile string! 0 Common 0

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined     # Joined [4a 6f 69 6e 65 64 ]a

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1       # Broken

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 1 # 1 [31 ]0

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 8 # 8 [38 ]0

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 3 # 3 [33 ]0

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 6 2 6 5 # 5 [35 ]0

Bad format in tr file, reading box coords
Bad box coordinates in boxfile string! 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 0 # 0 [30 ]0

Bad format in tr file, reading box coords
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.263473
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
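
The strings echoed after "Bad box coordinates in boxfile string!" are the unicharset lines listed earlier minus their first field, which suggests that shapeclustering read the unicharset as if it were a .tr file. The Tesseract 3.x training documentation gives the invocation with a -U flag in front of the unicharset. A minimal Python sketch that assembles the command in that form, assuming the file names from this post; `shapeclustering_command` is a hypothetical helper and nothing is executed:

```python
def shapeclustering_command(tr_files, font_properties="font_properties",
                            unicharset="unicharset"):
    """Return the shapeclustering call as an argument list."""
    # -F names the font properties file, -U names the unicharset;
    # everything after that is treated as a .tr training file.
    return ["shapeclustering", "-F", font_properties,
            "-U", unicharset, *tr_files]


cmd = shapeclustering_command(["eng.strangelabelmachinefont.exp0.tr"])
print(" ".join(cmd))

# To actually run it (assumes the training tools are on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Whether this removes all of the errors above is untested here; it only shows where the flag would go.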

I read this article: https://www.systutorials.com/docs/linux/man/5-unicharset/ but it didn't help me.
My configuration: Windows 10 x64, using tesseract-ocr-w64-v5.0.0-alpha.20190708
This number is recognized correctly with pytesseract:

pytesseract.image_to_string(Image.open(image_path), lang="eng", config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

But my goal is to create a dataset to recognize digits in a situation like this, for example:

369490.png

I also tried some algorithms to remove these horizontal lines, but the results were no better, so it seems better to create a custom dataset.
Does anyone have any suggestions? Is this a problem with my version of tesseract, or do I have to do something manually with the unicharset file?
Thanks.

Best Regards,
Stevan

Cailey McVay

Oct 28, 2020, 9:52:07 AM
to tesseract-ocr
Hello,
I have also run into a similar problem. I am trying to create a more accurate unicharset file in order to interpret the image below.
The box file I have created for some sample images looks like this:
0 3 1 14 19 0
9 18 0 29 20 0
3 33 1 46 19 0
. 50 1 56 19 0
2 64 1 75 19 0
5 76 1 93 19 0
2 92 1 111 19 0
0 4 1 15 19 1
8 19 1 30 19 1
3 34 1 46 19 1
. 54 1 57 5 1
4 65 1 77 19 1
1 82 1 91 19 1
4 96 1 107 19 1
However, my unicharset file looks like this:
14
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
0 8 0,255,0,255,0,0,0,0,0,0 Common 3 2 3 0 # 0 [30 ]0
9 8 0,255,0,255,0,0,0,0,0,0 Common 4 2 4 9 # 9 [39 ]0
3 8 0,255,0,255,0,0,0,0,0,0 Common 5 2 5 3 # 3 [33 ]0
. 16 0,255,0,255,0,0,0,0,0,0 Common 6 6 6 . # . [2e ]p
2 8 0,255,0,255,0,0,0,0,0,0 Common 7 2 7 2 # 2 [32 ]0
5 8 0,255,0,255,0,0,0,0,0,0 Common 8 2 8 5 # 5 [35 ]0
8 8 0,255,0,255,0,0,0,0,0,0 Common 9 2 9 8 # 8 [38 ]0
4 8 0,255,0,255,0,0,0,0,0,0 Common 10 2 10 4 # 4 [34 ]0
1 8 0,255,0,255,0,0,0,0,0,0 Common 11 2 11 1 # 1 [31 ]0
6 8 0,255,0,255,0,0,0,0,0,0 Common 12 2 12 6 # 6 [36 ]0
7 8 0,255,0,255,0,0,0,0,0,0 Common 13 2 13 7 # 7 [37 ]0
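
One quick consistency check is to compare the symbols in the box file with the symbols in the unicharset. A minimal Python sketch using only the two listings above; the variable names are illustrative, and the three special entries (NULL, Joined, |Broken|0|1) are assumed not to correspond to box symbols:

```python
BOX_SAMPLE = """\
0 3 1 14 19 0
9 18 0 29 20 0
3 33 1 46 19 0
. 50 1 56 19 0
2 64 1 75 19 0
5 76 1 93 19 0
2 92 1 111 19 0
0 4 1 15 19 1
8 19 1 30 19 1
3 34 1 46 19 1
. 54 1 57 5 1
4 65 1 77 19 1
1 82 1 91 19 1
4 96 1 107 19 1
"""

UNICHARSET_SYMBOLS = ["NULL", "Joined", "|Broken|0|1", "0", "9", "3", ".",
                      "2", "5", "8", "4", "1", "6", "7"]
SPECIAL = {"NULL", "Joined", "|Broken|0|1"}

# The symbol is whatever precedes the last five numeric fields.
box_symbols = {line.rsplit(" ", 5)[0]
               for line in BOX_SAMPLE.splitlines() if line.strip()}
charset_symbols = set(UNICHARSET_SYMBOLS) - SPECIAL

missing = sorted(box_symbols - charset_symbols)
unused = sorted(charset_symbols - box_symbols)
print("in box sample but not in unicharset:", missing)
print("in unicharset but not in box sample:", unused)
```

For these listings, every box symbol appears in the unicharset; the extra unicharset entries presumably come from sample images whose boxes are not shown here.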

We were able to create some other files, such as the normproto file and the trained box file, which I have attached below. I was wondering whether our unicharset file could have affected the other files, because when we run our language on other sets of images it produces no text. Tesseract by itself is able to produce text from our images, but the result is usually off by one number. When we use Tesseract with the trained data we created, we receive no text output. I am wondering if we also have to manually edit our unicharset file.
Best regards,
Cailey

Screen Shot 2020-10-28 at 7.40.24 AM.png
NTS.box.tr