Want to know how user-words affects recognition results


James Liu

Nov 24, 2016, 2:04:27 AM
to tesseract-ocr
Hey,

I'm trying to improve Tesseract's recognition results. My task is to recognize images of terminal output, which should be easy. The result with the default model is just OK.

My idea is to pick out all the unique words in the system's directory paths and put them into user-words. Intuitively this should work, but I only got a negligible improvement.

So I want to know whether this idea is reasonable. If it is, how can I get the most out of it?
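For readers who want to reproduce the word list, here is a minimal sketch of one way it could be built. The helper name paths_to_words is hypothetical, and splitting only on "/" is an assumption; the attached uniq_dir_name may have been produced differently.

```shell
# Split each path on "/" and keep one copy of every component.
# Reads paths on stdin, writes one word per line (sorted, unique).
paths_to_words() {
    tr '/' '\n' | sed '/^$/d' | LC_ALL=C sort -u
}

# Hypothetical usage (can be slow on large trees):
# find /usr -type d 2>/dev/null | paths_to_words > uniq_dir_name
```

Components such as hung_task_timeout_secs stay whole here; whether to split on "_" as well depends on what you expect Tesseract to treat as a word.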

The command I ran was:
tesseract test1.tiff test1 -l eng -psm 4 --user-words uniq_dir_name
where test1.tiff and uniq_dir_name are attached.

I also ran the command without user-words:
tesseract test1.tiff test1_old -l eng -psm 4

test1_old.txt and test1.txt are both uploaded for your reference. With user-words, the word "jbd2" is recognized correctly, while without user-words it was recognized as "jbdZ". That's the only improvement I got.
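A quick way to see exactly which words changed between two runs is a word-level comparison of the two output files. This is only a sketch; ocr_word_diff is a hypothetical helper, and it ignores word order and repeated occurrences.

```shell
# Print words that occur in only one of two OCR output files.
# Lines starting with "<" come from the first file, ">" from the second.
ocr_word_diff() {
    old_words=$(mktemp) || return 1
    new_words=$(mktemp) || return 1
    # One word per line, empty lines dropped, sorted and de-duplicated
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | LC_ALL=C sort -u > "$old_words"
    tr -s '[:space:]' '\n' < "$2" | sed '/^$/d' | LC_ALL=C sort -u > "$new_words"
    # Keep only diff's "<" and ">" lines, dropping hunk headers and separators
    diff "$old_words" "$new_words" | grep '^[<>]'
    rm -f "$old_words" "$new_words"
}

# Usage with the attached files:
# ocr_word_diff test1_old.txt test1.txt
```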

(I hope at least "proc", "INFO", "sys" could be recognized.)

Any comments are welcome. Thanks a lot.
test1.tiff
test1_old.txt
test1.txt
uniq_dir_name

James Liu

Nov 26, 2016, 10:31:38 PM
to tesseract-ocr
I thought the word "sys" should be correctly recognized with the user-words option used. It should be easy...

(Advice is also needed on how to attract people to this thread.)

On Thursday, November 24, 2016 at 3:04:27 PM UTC+8, James Liu wrote:

Allistair

Nov 27, 2016, 5:41:55 AM
to tesser...@googlegroups.com
Hi,

I've had a little play and had some success that may give you some further ideas.

I used the Tesseract manual's way of loading user words; I was not 100% sure yours was correct (but it could be):

CONFIG FILES AND AUGMENTING WITH USER DATA

For me, using brew on a Mac, that meant renaming your user words file to simply user-words and putting it in /usr/local/Cellar/tesseract/3.04.01_1/tessdata. I then put the typical "turn off the English dictionary" settings in a config file called liu, which went into tessdata/configs:

load_system_dawg     0
user_words_suffix    user-words

Then I ran

tesseract terminal.tiff terminal liu
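The setup above could be scripted roughly as follows. This is a sketch: write_liu_config is a hypothetical helper, and the tessdata directory is passed in because it differs between installs.

```shell
# Write a Tesseract config file named "liu" into <tessdata>/configs.
# $1 is the tessdata directory of your install.
write_liu_config() {
    mkdir -p "$1/configs" || return 1
    printf '%s\n' \
        'load_system_dawg     0' \
        'user_words_suffix    user-words' > "$1/configs/liu"
}

# Hypothetical usage, assuming the Homebrew path mentioned above:
# TESSDATA=/usr/local/Cellar/tesseract/3.04.01_1/tessdata
# write_liu_config "$TESSDATA"
# tesseract terminal.tiff terminal liu
```

The word list itself also has to sit in the tessdata directory. As far as I recall, the manual's example names it with the language prefix (e.g. eng.user-words), so check which file name your install actually picks up.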

Whilst I am sure Tesseract is picking up this config (because if I rename the words file it complains), I was unable to get any different result from yours.

Then I took a look at some control parameters.


load_system_dawg is already off but I thought this other one sounded interesting:

language_model_penalty_non_dict_word

I maxed that to 1 but this did not help either.

I then decided to cut out a piece of your terminal.tiff by itself to isolate the line with the issues, e.g. the line with "/proc/sys/kernel/hung_task_timeout_secs".

Still the same issue (I think that was expected, but I find it's sometimes useful to isolate). I also converted your tiff to jpg to get rid of some page errors that Tesseract was giving me.

So now I had Tesseract running cleanly, with no command-line warnings etc., on an isolated bit of text that it still fails to get right.

The input resolution seems OK and doubling it had no effect on the result.

So then I wondered whether Tesseract did not like the fact that the part with sys sits in a continuous stream of characters, i.e. these are not "words" because there are no spaces between them.

To try this out I took my new tiff and doctored the slashes and underscores out of the image. First I re-ran tesseract without ANY of the config, and the same issue occurs:

proc sgs kernel hung task timeout secs

However, now using my config

tesseract term2onlyalpha.jpg term2onlyalpha liu

Yields the correct result

proc sys kernel hung task timeout secs

Just to check what about the config is working I commented out user_words_suffix and it went back to sgs. So I commented that back in and turned the load_system_dawg back on and this maintained the correct result.

SO

I have a correct result out of tesseract but only when I doctored the input image to remove slashes. 

I don't know how important those slashes are to you but you COULD try replacing slashes in your text if you have a possibility to intercept that before it becomes an image that goes to Tesseract. If you need slashes or some notion of slashes you could try using a different separator character like an underscore or something that does not confuse Tesseract whilst it's parsing the characters. 
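If the text can be intercepted before it is rendered to an image, the substitution could look something like this. It is a sketch only; replace_slashes is a hypothetical helper, and it changes just "/" (not underscores), with a space as the default replacement.

```shell
# Replace every "/" on stdin with a substitute character (default: a space).
# $1 is an optional single replacement character.
replace_slashes() {
    tr '/' "${1:- }"
}

# Usage: pass the terminal text through the filter before rendering it.
# printf '%s\n' '/proc/sys/kernel' | replace_slashes
```

If the slashes matter downstream, the same mapping has to be reversed on the OCR output, which only works when the substitute character does not otherwise occur in the text.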

OR

You could try and see if modifying tesseract's internal engine parameters will get you anywhere. I tweaked ONE parameter with no impact but there are a great many.

If you run 

tesseract --print-parameters

You will get a list of all the parameters you can put in your config file which could make a difference to your image as it is. 

Another tip is that you can get debug output from tesseract as it's running. To find these parameters do:

tesseract --print-parameters | grep debug 

Then you can use them with your tesseract runs. For instance I tried word_to_debug=sys and when it was failing I got NO output from this, but when it is working (with the image with slashes removed) I get output:

tesseract -c word_to_debug=sys term2onlyalpha.jpg term2onlyalpha liu

Best Raw Choice : sgs : R=47.8142, C=-3.75406, F=1, Perm=2, xht=[12.2215,15.2019], ambig=0

pos NORM NORM NORM

str s g s

state: 1 1

C -3.377 -3.754 -3.095

etc...

So this lets you see a bit of some of the values being used when it works and then you can go back to the full print-parameters list to actually modify some of those options to see if you can get it working.

For some of the debugging you will need the visual debugging tool ScrollView.jar

https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging

Which will start showing you cool stuff like segmentation (looks OK)


Moral of the story: there is work ahead for you to understand which engine params you can change to tweak this gremlin out. The slashes seem to cause some kind of problem for tesseract around the y and g, and I also noticed proc vs prec sometimes.

Cheers



--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/26291430-30f8-4920-9e63-7296245f8759%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

James Liu

Nov 28, 2016, 4:39:40 AM
to tesseract-ocr
Hi Allistair,

Thank you so much for your thorough experiment.

Based on your finding, I thought we could make the slash a punctuation character so that Tesseract would somehow treat the strings between slashes as words. So I added the slash to eng.punc-dawg and re-ran combine_tessdata. And BOOM, nothing happened.

I'm now buried in the parameter list. Maybe there's something in there that could really help, but I need some more time to dig deeper.

Do you have a bitcoin address? I want to buy you a cup of coffee.😀


On Sunday, November 27, 2016 at 6:41:55 PM UTC+8, Allistair C wrote:

Allistair

Nov 28, 2016, 5:01:01 AM
to tesser...@googlegroups.com
That's very kind of you but I'm in it for the sport, it's good fun trying to make Tesseract work. Do keep us updated on this.

Of course, OCR like Tesseract is now old-skool; most folk are researching or using machine learning techniques in computer vision for OCR. You should try out Google Cloud Vision's text recognition and see how you get on. I'm unsure how good it is at pages of copy like yours, but I had great success with natural world scenes that a previous attempt with Tesseract totally failed at.

Cheers
