Customising Tesseract for character recognition

Saurabh Gandhi

Feb 16, 2011, 11:48:56 PM
to tesser...@googlegroups.com
Hello everyone,

I am currently using tesseract 3.x for license plate recognition.
I have an algorithm which does a good job in pre-processing the input image to localize the plate.
However, when I use the Tesseract OCR engine to classify the plate number, the recognition is not that accurate. I have gone through the tesseract whitepapers as well as some of the threads discussing the LPR using tesseract.

From all this, I have identified the following ways of improving the results:
  1. Customise the tesseract engine to recognize only the characters A-Z, 0-9, . (dot) and (space) by setting the character whitelist. My understanding is that the whitelist is the list of characters that can be recognized. What is the blacklist meant to do?
  2. I have often seen fairly good number plate images being OCRed inaccurately. This could possibly be due to the word recognition stage. Has anyone found a way to disable the dictionary / word recognition?
  3. Then there are the page segmentation modes (PSM_AUTO, PSM_SINGLE_BLOCK, PSM_CHAR etc.). Does PSM_CHAR imply that it will treat the input image as a single character and run the algorithm accordingly, without attempting word recognition?
  4. Another configuration setting that I have seen in the code is AVS_FASTEST = 0, AVS_MOST_ACCURATE = 100. However, I could not find it being used anywhere in the code. Does it have any impact on character recognition accuracy?
  5. Finally, I also plan to use the confidence data. Word confidences can be obtained from TessBaseAPI::AllWordConfidences(). Are there any confidence indicators for individual characters as well?
Awaiting your valuable insights.
Thank you.

Regards,
Saurabh Gandhi

Ray Smith

Feb 18, 2011, 1:27:13 AM
to tesser...@googlegroups.com
> From all this, I have identified the following ways of improving the results:
>   1. Customise the tesseract engine to recognize only the characters from A-Z,0-9,.(dot), (space) by setting the character white-list. My understanding is that the white-list is the list of characters that are going to be sensed. I was inquisitive to know what the blacklist is meant to do?

1. Just the opposite of whitelist. You can disable specific characters from the usual set.

>   2. A lot of times I have seen fairly good number plate images being OCRed inaccurately. This could possibly be due to the word recognition stage. Has anyone found a way to disable the dictionary / word recognition.

2. Play with segment_penalty_dict_*

>   3. Then there are some page segmentation modes (PSM_AUTO,PSM_SINGLE_BLOCK, PSM_CHAR etc). Does PSM_CHAR imply that it will consider the input image as a single character and run the algorithm accordingly without attempting word recognition?

3. Yes.

>   4. Another important configuration macro that I have seen within the code was AVS_FASTEST = 0,  AVS_MOST_ACCURATE = 100. However, I could not find the same being used anywhere in the code. Does this have any impact on the character recognition accuracy?

4. This control is dead in 3.01. Replaced by ocr_engine_mode. It just controls the combination of tesseract vs cube. Cube increases the accuracy slightly, but adds a lot of compute time.

>   5. Finally, I also plan to use the confidence level data. Are there any indicators of confidence for characters as well. There is word confidence data which can be found in TessBaseAPI::AllWordConfidences().

5. Yes, and they are exposed in the new ResultIterator in 3.01, otherwise you have to go down into the guts of the data structures.

Sriranga(78yrsold)

Feb 18, 2011, 4:51:33 AM
to tesser...@googlegroups.com
> Customise the tesseract engine to recognize only the characters from A-Z,0-9,.(dot), (space) by setting the character white-list

Kindly tell me the name of the folder in which the whitelist and the blacklist are located. I want to use them for Kannada script.
-sriranga(78yrs)

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com.
To unsubscribe from this group, send email to tesseract-oc...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.

Saurabh Gandhi

Feb 18, 2011, 4:56:02 AM
to tesser...@googlegroups.com, Sriranga(78yrsold)
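You can simply set the whitelist / blacklist in your program, just after init:

// Initialise the engine first.
api.Init(argv[0], lang, &(argv[arg]), argc-arg, false);
// Recognise only these characters; "tessedit_char_blacklist" excludes characters instead.
api.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZ.0123456789 ");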


--
Regards,
Saurabh Gandhi

Saurabh Gandhi

Feb 18, 2011, 5:53:17 AM
to Jose, tesser...@googlegroups.com, Sriranga(78yrsold)
Hello Jose,

Setting the mode to PSM_SINGLE_BLOCK or PSM_SINGLE_LINE will not force horizontal reading. These modes just assume that your input image is already segmented and consists of a single block or a single line of text. So, if you want horizontal reading, you will have to segment your image yourself and pass each line to Tesseract separately, like this:

WORD1    WORD2 (segmented image 1)

WORD1    WORD2 (segmented image 2)

Hope that answers your question.

--
Regards,
Saurabh Gandhi




On Fri, Feb 18, 2011 at 3:45 PM, Jose <dio...@gmail.com> wrote:
This is what the JPG looks like:

WORD1   WORD2 (the white space is quite "big")
WORD1   WORD2
WORD1   WORD2
WORD1   WORD2
WORD1   WORD2
WORD1   WORD2
WORD1   WORD2

and it reads like:

WORD1 
WORD1 
WORD1 
WORD1 
WORD1 
WORD1 
WORD2
WORD2 
WORD2 
WORD2 
WORD2 
WORD2 
WORD2 

Any help would be really appreciated! I've been stuck with this for a month :(


Saurabh Gandhi

Feb 18, 2011, 6:06:16 AM
to Jose, tesser...@googlegroups.com, Sriranga(78yrsold)
Did you try PSM_SINGLE_COLUMN? I think that is what you need. Could you try it and let us know how it behaves, please?

 PSM_SINGLE_COLUMN,  ///< Assume a single column of text of variable sizes.

--
Regards,
Saurabh Gandhi




On Fri, Feb 18, 2011 at 4:29 PM, Jose <dio...@gmail.com> wrote:
Is there no other workaround? If I reduce the white space between WORD1 and WORD2, it all works fine! This space is making the OCR think it's another column! Is there no other way? Splitting the image into as many rows seems really inefficient.

Saurabh Gandhi

Feb 18, 2011, 6:32:09 AM
to Jose, tesser...@googlegroups.com, Sriranga(78yrsold)
Yes, that's right.

--
Regards,
Saurabh Gandhi




On Fri, Feb 18, 2011 at 4:57 PM, Jose <dio...@gmail.com> wrote:
OK, I'll try that! I have to modify this in tesseractmain.cpp, right? (I'm using command line execution.)

I replace the line api.SetPageSegMode(tesseract::PSM_AUTO); with api.SetPageSegMode(tesseract::PSM_SINGLE_COLUMN); and then recompile, right?

thanks for the help

Saurabh Gandhi

Feb 18, 2011, 6:51:06 AM
to Jose, tesser...@googlegroups.com, Sriranga(78yrsold)
great...

--
Regards,
Saurabh Gandhi




On Fri, Feb 18, 2011 at 5:16 PM, Jose <dio...@gmail.com> wrote:
You know, Saurabh, that was EXACTLY what I was looking for! I couldn't be more thankful to you! That line of code changed my life :D

thank you again :)

Jose

Feb 18, 2011, 5:12:54 AM
to tesser...@googlegroups.com, Saurabh Gandhi, Sriranga(78yrsold)
Saurabh, by setting one of these (PSM_AUTO, PSM_SINGLE_BLOCK, PSM_CHAR), are you forcing the page to be read horizontally? My problem is that I have a column of two words separated by white space (each word is in a different font), and instead of seeing one column of two words, the OCR sees two columns of one word! Any thoughts? Any help would be gladly appreciated.

Jose Granja

Feb 17, 2011, 4:18:07 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
Hi, do you know how to force the page layout to be recognised as horizontal? That is exactly my issue! You'll make me the happiest person on earth.
--

Jose

Feb 18, 2011, 6:38:56 AM
to Saurabh Gandhi, tesser...@googlegroups.com, Sriranga(78yrsold)
Ok I'm recompiling now... I'll let you know when it's done! thanks for the help anyway :)

Jose

Feb 18, 2011, 5:15:12 AM
to tesser...@googlegroups.com, Saurabh Gandhi, Sriranga(78yrsold)

Jose

Feb 18, 2011, 6:27:37 AM
to Saurabh Gandhi, tesser...@googlegroups.com, Sriranga(78yrsold)

Jose

Feb 18, 2011, 5:59:16 AM
to Saurabh Gandhi, tesser...@googlegroups.com, Sriranga(78yrsold)

Jose

Feb 18, 2011, 6:46:45 AM
to Saurabh Gandhi, tesser...@googlegroups.com, Sriranga(78yrsold)

Jose

Feb 24, 2011, 5:05:03 AM
to Saurabh Gandhi, tesser...@googlegroups.com, Sriranga(78yrsold)
Hi, (as you know, Saurabh, because we talked in private the other day) I tried PSM_SINGLE_COLUMN and the accuracy drops dramatically... I can't afford to lose that accuracy. Is it possible to change the way the output is displayed? Looking at the code, it seems rather hard to change... Perhaps I could print the x,y position of each word found and then work out the horizontal/vertical layout myself? What are your thoughts? Regards

Dmitry Silaev

Feb 24, 2011, 5:46:08 AM
to tesser...@googlegroups.com, dio...@gmail.com
I don't know if it's affordable for you, but imho decent results can
only be achieved if you do the segmentation yourself and then pass image
fragments to Tesseract on a word-by-word basis. Problems may appear
when you have words that are too short; however, as far as I can see,
that's not your case.

A long time ago, I started my project relying on Tess's segmentation
and struggled a lot with it, until I moved to a word-by-word approach.
Finally, I even switched to character-wise recognition, which at
last produces decent results. This transition was mostly caused by
the specifics of the input images I'm working on (photos, usually of low
quality), but I think it is almost required for ideally scanned
images too.

There are some fruitful math ideas behind Tess's segmentation, but I
think the current implementation is not mature enough to be used
extensively in the production mode.

Warm regards,
Dmitry Silaev

Jose

Feb 24, 2011, 5:50:57 AM
to Dmitry Silaev, tesser...@googlegroups.com
Dmitry, the recognition works; the only problem is the way it is parsing the layout... :S I think segmenting the images would be too painful! I only want to change the order in which the results are displayed, or get the bounding boxes, so that I know the x and y of each recognized word and can organise the results better myself! Don't you think that's a good approach?

Thank you very much for your help

Dmitry Silaev

Feb 24, 2011, 6:03:33 AM
to Jose, tesser...@googlegroups.com
Unfortunately not only text output order can suffer from Tess's
segmentation, but also extents of some text fragments can be
identified incorrectly (say one "segmented" row can span over two
"real" rows, probably in partial way), and that in turn can lead to
*completely* irrelevant recognition results.

However, you can run as many tests as possible on your images to
"prove" that this is probably not the case, and hope that segmentation
errors won't be "destructive" and will only introduce this kind of
"disorder". Then you can certainly use your (x,y)-sort method and be
happy ))

Warm regards,
Dmitry Silaev

Jose

Feb 24, 2011, 6:17:13 AM
to Dmitry Silaev, tesser...@googlegroups.com
In my particular case it's just that the first word of each column is in one font and the other is in another, so instead of reading column by column it reads all the columns of the first row and then all the columns of the second row! My god, it's really hard to explain in English. I get an accurate result (>90%), but I get column 1 and column 2 concatenated! I'm trying my best to understand the OCR, but it's really hard for me as I don't have any OCR background. I don't see any approach other than printing where each word was read and post-processing all the results afterwards. Please correct me if I'm wrong or if you see improvements that can be made.

Please excuse my bad English

regards,

jose

Dmitry Silaev

Feb 24, 2011, 6:27:27 AM
to tesser...@googlegroups.com
The best way to explain everything would be just to send your source
image examples, describe what information you want to get from them
and provide the community with the code snippets you use to interface
with Tess. And please be as detailed as possible.

Warm regards,
Dmitry Silaev

Jose

Feb 24, 2011, 6:33:13 AM
to tesser...@googlegroups.com, Dmitry Silaev
Ok I'll try to do that this afternoon.

thank you for the help

regards,

jose

Jose

Mar 13, 2011, 3:03:59 PM
to tesser...@googlegroups.com, Dmitry Silaev
Hi Dmitry,

Sorry for the delay... I produced some samples; please see if you can take a look at them!

regards,

jose


3.txt
3.tiff

patrickq

Mar 13, 2011, 3:48:35 PM
to tesseract-ocr
Tesseract 3.00 gets this text 100% correct, including the smudged
numbers at the bottom. See:
http://www.scanbizcards.com/plate1.jpg
http://www.scanbizcards.com/plate2.jpg

(scanning was done with ScanBizCards on an iPhone - if you try it
yourself with the app on Android or iPhone, please disable image
processing in the settings prior to scanning)

Patrick

Jose

Mar 13, 2011, 4:10:43 PM
to tesser...@googlegroups.com, patrickq
Hi Patrick,

Yes, the results are correct, but the format of the results is not! That's my trouble.

patrickq

Mar 13, 2011, 4:18:20 PM
to tesseract-ocr
You expect way too much from Tesseract: it's not Tesseract's job to
slice and dice the text according to various organizational
requirements of applications - that's for the application to handle.
You can get all the coordinates of all characters and easily determine
which ones are in what you consider the first column and which are in
the 2nd column. In ScanBizCards' case, considering our target material,
we treat each line as a single number formed of two sequences - but if
we wanted to treat the input as columns, it would take us a mere 20
minutes of coding to organize the results that way. We actually don't
even pay attention to where Tesseract thinks lines end and start, we
figure that out ourselves based on coordinates. It's not hard.

Patrick
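
As an illustration of the coordinate-based approach described above, here is a minimal C++ sketch. It assumes you already have each word's text and the top-left corner of its bounding box, however you obtained them. The `Word` struct, the `GroupIntoRows` name and the pixel tolerance are hypothetical choices made for this sketch; none of them is part of the Tesseract API.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record: one recognised word plus the top-left corner of its
// bounding box, in image pixels.
struct Word {
    std::string text;
    int left;
    int top;
};

// Group words into rows. A word joins the current row when its top edge is
// at most `tolerance` pixels below the top edge of the first word in that
// row; otherwise it starts a new row. Each row is then ordered left to right.
std::vector<std::vector<Word>> GroupIntoRows(std::vector<Word> words,
                                             int tolerance) {
    assert(tolerance >= 0);
    // Order all words from the top of the image to the bottom.
    std::sort(words.begin(), words.end(),
              [](const Word& a, const Word& b) { return a.top < b.top; });

    std::vector<std::vector<Word>> rows;
    for (const Word& w : words) {
        if (rows.empty() || w.top - rows.back().front().top > tolerance) {
            rows.emplace_back();  // start a new row
        }
        rows.back().push_back(w);
    }

    // Within each row, restore the left-to-right reading order.
    for (auto& row : rows) {
        std::sort(row.begin(), row.end(),
                  [](const Word& a, const Word& b) { return a.left < b.left; });
    }
    return rows;
}
```

For words that arrive column by column, for example `{"A1", 10, 20}, {"A2", 10, 60}, {"B1", 300, 22}, {"B2", 300, 58}` with a tolerance of 10 pixels, the function returns two rows, `A1 B1` and `A2 B2`. A fixed tolerance is the simplest possible choice; skewed or unevenly spaced images would probably need something more careful.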

Dmitry Silaev

Mar 13, 2011, 6:06:51 PM
to tesser...@googlegroups.com, Jose
Jose,

I run Tesseract revision 549 from the command line under Windows with
no special config and get the segmentation which is almost correct.
What language file do you use? I used the following command line

tesseract 3.tiff test3 -l eng

with no pageseg_mode (-psm argument) as well as with it, and always
the result was satisfactory.

Let me know the details on your command line and OS.

Warm regards,
Dmitry Silaev

Jose

Mar 14, 2011, 6:42:56 AM
to Dmitry Silaev, tesser...@googlegroups.com
Hi Dmitry,

thanks for the help!

In the end, what I did was modify the function that returns the result so that it includes the top coordinate of the bounding box. I then get the following result:

<value>xxxxx</value><top>yyyyy</top>
<value>xxxxx1</value><top>yyyyy1</top>
<value>xxxxx2</value><top>yyyyy2</top>
<value>xxxxx3</value><top>yyyyy3</top>
<value>xxxxx4</value><top>yyyyy4</top>
<value>xxxxx5</value><top>yyyyy5</top>
<value>xxxxx6</value><top>yyyyy6</top>
<value>xxxxx7</value><top>yyyyy7</top>

Then I parse the results, and I can tell that xxxxx1 and xxxxx2 were on the same line by looking at the top value. The approach works fine for me, but I had to modify the source code of tesseract.

regards,

jose
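
The post-processing step described in this message might look like the following C++ sketch. It assumes output lines of exactly the form `<value>…</value><top>…</top>` shown above. `ExtractTag`, `JoinByTop` and the tolerance parameter are hypothetical names introduced here, and the order of words within a line is taken from the order of the input, because no left coordinate is available in this format.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Return the text between <tag> and </tag> in `line`, or "" if not found.
std::string ExtractTag(const std::string& line, const std::string& tag) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";
    const std::size_t begin = line.find(open);
    if (begin == std::string::npos) return "";
    const std::size_t start = begin + open.size();
    const std::size_t end = line.find(close, start);
    if (end == std::string::npos) return "";
    return line.substr(start, end - start);
}

struct Entry {
    std::string value;
    int top;            // top coordinate of the bounding box
    std::size_t index;  // position in the original output
};

// Join values whose top coordinates lie within `tolerance` pixels of the
// first entry of a text line. Returns one string per text line, top to bottom.
std::vector<std::string> JoinByTop(const std::vector<std::string>& lines,
                                   int tolerance) {
    assert(tolerance >= 0);
    std::vector<Entry> entries;
    for (const std::string& line : lines) {
        const std::string top = ExtractTag(line, "top");
        if (top.empty()) continue;  // skip lines not in the expected format
        entries.push_back({ExtractTag(line, "value"),
                           std::atoi(top.c_str()), entries.size()});
    }
    std::stable_sort(entries.begin(), entries.end(),
                     [](const Entry& a, const Entry& b) { return a.top < b.top; });

    std::vector<std::vector<Entry>> rows;
    for (const Entry& e : entries) {
        if (rows.empty() || e.top - rows.back().front().top > tolerance) {
            rows.emplace_back();  // start a new text line
        }
        rows.back().push_back(e);
    }

    std::vector<std::string> result;
    for (auto& row : rows) {
        // Keep the original output order within a text line.
        std::sort(row.begin(), row.end(),
                  [](const Entry& a, const Entry& b) { return a.index < b.index; });
        std::string joined;
        for (const Entry& e : row) {
            if (!joined.empty()) joined += ' ';
            joined += e.value;
        }
        result.push_back(joined);
    }
    return result;
}
```

This keeps all of the layout logic outside Tesseract, so the only change needed on the OCR side is emitting the top coordinate next to each value.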

Dmitry Silaev

Mar 14, 2011, 6:52:57 AM
to tesser...@googlegroups.com, Jose
I think the best approach would be to stay as far as possible from
modifying third-party code. Take a closer look at the ResultIterator and
PageIterator classes. Often they suffice for getting all the information
you need about Tess's recognition results.

Warm regards,
Dmitry Silaev

Jose

Mar 14, 2011, 6:55:34 AM
to Dmitry Silaev, tesser...@googlegroups.com
Yes, I got the information from the result! I only modified how the result method prints the result, nothing more of course! I got the information from the bounding box of the result! I'm not modifying it any deeper than that.

Jose

Mar 14, 2011, 6:56:23 AM
to Dmitry Silaev, tesser...@googlegroups.com
*I only modify how the result is printed! Nothing else... I grab all the info from the word and its bounding box! That is OK, right?

Dmitry Silaev

Mar 14, 2011, 7:00:52 AM
to tesser...@googlegroups.com, Jose
Ehmm... I don't get it. If you've succeeded in using iterators, it's
at your full disposal to format the output in any way you want
programmatically, isn't it?

Warm regards,
Dmitry Silaev

Jose

Mar 14, 2011, 7:13:11 AM
to Dmitry Silaev, tesser...@googlegroups.com
I run tesseract from the command line, and I didn't find a way to get more info into the formatted results.

Dmitry Silaev

Mar 14, 2011, 7:18:50 AM
to Jose, tesser...@googlegroups.com
Why don't you consider making your own project and linking Tesseract
into it statically, or using Tesseract as a dynamic link library? That
way you can implement any formatting and other special logic you
wish...

Warm regards,
Dmitry Silaev

Jose

Mar 14, 2011, 7:22:38 AM
to Dmitry Silaev, tesser...@googlegroups.com
In future that will be my preferred approach! For the time being I just need a fast and easy solution! I know it's not the most beautiful approach... but I haven't touched enough of the tesseract framework to break anything! I was just short of time and it was easier for me to modify the source code. As I only modified one function, I can go back to the original code and implement that outside the tesseract framework.

thank you very much for the help!

regards,

jose

Prachi Joshi

Dec 14, 2011, 2:25:36 AM
to tesser...@googlegroups.com
How do I set all these variables?

Aruna Devi

Feb 17, 2012, 12:41:57 AM
to tesser...@googlegroups.com
I also want to know how to make tesseract read my image horizontally. I have an image consisting of 6 rows. After training, I found that my image is read from the right side (it should be from the left), and it also goes down column by column instead of row by row. How can I solve this issue?

Andres

Feb 17, 2012, 8:24:35 AM
to tesser...@googlegroups.com
Just out of curiosity, how did you find that out?


Aruna Devi

Feb 20, 2012, 11:19:47 AM
to tesseract-ocr
By looking at the output I got. My image has 6 rows and 12 columns, but in
my output I got 12 rows and 6 columns, and everything was read from the right
first (it should have started from the left).

zdenko podobny

Oct 14, 2012, 7:46:09 AM
to tesser...@googlegroups.com, saura...@gmail.com

On Sat, Oct 13, 2012 at 10:47 PM, JVIyer <jawa...@gmail.com> wrote:
> > A lot of times I have seen fairly good number plate images being OCRed inaccurately. This could possibly be due to the word recognition stage. Has anyone found a way to disable the dictionary / word recognition.
>
> Saurabh, have you been able to accomplish this? Could you kindly share your insights? I have a similar need.
> Thanks a lot in advance.

First of all, make sure that "fairly good" also holds for the binarized version of your image.
Next, dictionaries can be disabled only at init time [1], [2]. So create a config file in which you specify which dictionaries [3] should not be loaded (for example, load_system_dawg F).
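
One way to follow this advice is sketched below in C++: a small helper that writes such a config file. `load_system_dawg` is the parameter named in this message; adding `load_freq_dawg` (the frequent-word list) is an assumption of this sketch, and the helper name `WriteNoDictConfig` is hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Write a Tesseract config file asking the engine not to load its word
// lists. Returns false if the file could not be written.
bool WriteNoDictConfig(const std::string& path) {
    assert(!path.empty());
    std::ofstream out(path);
    if (!out) return false;
    out << "load_system_dawg F\n"  // do not load the main dictionary
        << "load_freq_dawg F\n";   // assumption: also skip the frequent-word list
    out.flush();
    return static_cast<bool>(out);
}
```

The resulting file can then be named after the output base on the command line, for example `tesseract plate.tif out -l eng nodict` (file names here are made up for illustration). Depending on the version, Tesseract may expect the file to be in its `tessdata/configs` directory.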


-- 
Zdenko
 