Retrain Tesseract 4.0.0 beta to recognise handwritten digits


Ramakant Kushwaha

Jul 17, 2018, 11:13:29 AM
to tesseract-ocr
Hi,

Recently I have been trying to retrain Tesseract 4.0 to recognise handwritten digits. I am following the official page but finding it very difficult. It would be great if someone could elaborate on the steps below.

Lorenzo Bolzani

Jul 17, 2018, 11:34:08 AM
to tesser...@googlegroups.com

Have a look at this thread:



It's easier than it seems: you do not need per-character boxes with 4.0, just one box per line (which ocr-d generates automatically). If your text is already split into lines you do not have to do anything more.

Unicharset and lstmf files are also created by ocr-d.


Feel free to ask if you get stuck; I have this working now, but it's a bumpy road (lots of assertion failures/segmentation faults if you miss something).


Bye

Lorenzo


Ramakant Kushwaha

Jul 17, 2018, 1:17:32 PM
to tesseract-ocr
Thank you so much for guiding me. 

I have read the links and sub-links provided, and as suggested I will use OCR-D (https://github.com/OCR-D/ocrd-train) for training.
I want to know the best way to create pairs of [*.tif, *.gt.txt] from a TIFF image for two or more fonts. Is there any specific tool to generate the line *.tif and *.gt.txt files required by OCR-D?
I have data like the TIFF image below (20 images in total). Please guide me.
Thank you



On Wednesday, July 4, 2018 at 8:20:54 PM UTC+5:30, Joe wrote:
Hi everybody!

I'm trying this tool https://github.com/OCR-D/ocrd-train/ but without success so far. Tesseract and Leptonica are installed by the scripts.
Inspired by the test set provided in that repo, I created pairs of [*.tif, *.gt.txt] with binarized chars and TTF's from two fonts (1869 text lines in total).
You can see an example of my set in attachment that also contains files created by the training process.

My guess is that something is wrong with my data.
Sometimes I can see the char train value increasing instead of decreasing, and the final error rate is still too high (about 60%).

That new training process with LSTM is driving me crazy!
I would appreciate it if anyone with experience could take a look at my data set.


Joe. 



Lorenzo Bolzani

Jul 17, 2018, 2:08:37 PM
to tesser...@googlegroups.com
Generating the training data is a completely different problem from training tesseract.

If you want to recognize full words it's better to have full words (or numbers), not individual characters, so that the process of splitting the words into characters is done by tesseract.

Unless you just want to recognize individual characters; this looks more like an MNIST-style task for a simple neural network.

I think there are tools to cut images into lines but I've never used one, or you could do it programmatically with OpenCV.

There is no tool to generate the gt.txt files; you need to write these by hand. In this case your text is very regular, so you could just create one line manually (1 2 3 4 ...) and duplicate it. Or you could use a very good online OCR service.


But I'm not convinced this data is good for training. What does the real data that you want to recognize look like? Individual digits or full numbers?





Soumik Ranjan Dasgupta

Jul 18, 2018, 2:08:42 AM
to tesser...@googlegroups.com
Try creating a text corpus containing only digits, using various handwritten fonts from fonts.google.com that come close to your dataset.
Use tesstrain.sh to render the images and lstmtraining to train Tesseract; you'll achieve fair accuracy.



--
Regards,
Soumik Ranjan Dasgupta

Ramakant Kushwaha

Jul 18, 2018, 2:33:59 AM
to tesseract-ocr
@Soumik, thanks, but I am not getting it. Please provide me some links to understand it; I am very new to this. Can you guide me in creating a text corpus of digits with different fonts?

@Lorenzo, I want to detect digits written in the boxes of the image below; it's a bank cash deposit form with a very complex layout. I have to capture the account number and PAN number details.

Soumik Ranjan Dasgupta

Jul 18, 2018, 3:50:49 AM
to tesser...@googlegroups.com
I normally use a custom Python script to generate the training text. Attaching a sample text corpus containing only the digits 1234.


eng.training_text
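
For reference, a minimal sketch of such a generator script (the line count, line length and output filename below are arbitrary choices, not taken from the attached corpus):

import random

# Hypothetical parameters; adjust the amount of text to taste.
NUM_LINES = 2000
DIGITS_PER_LINE = 10

random.seed(42)
with open("digits.training_text", "w") as out:
    for _ in range(NUM_LINES):
        line = " ".join(random.choice("0123456789") for _ in range(DIGITS_PER_LINE))
        out.write(line + "\n")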

Soumik Ranjan Dasgupta

Jul 18, 2018, 3:52:49 AM
to tesser...@googlegroups.com
Follow https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 to create the traineddata. Copy the eng.traineddata file to the $TESSDATA_PREFIX directory, and you'll be good to go.

Lorenzo Bolzani

Jul 18, 2018, 5:18:32 AM
to tesser...@googlegroups.com
This is exactly the MNIST problem. I would not use tesseract for this. You can download something like this:


that comes with pre-trained models too.

The problem you'll have is extracting the digits from the boxes. I would use OpenCV, probably SIFT, to align the form. Then you need to delete the black borders, or just leave them there and see what happens, or repeat the training adding random black boxes around the digits.

So I would first try to understand how you want to extract the data: what your REAL data looks like. There is no point in training on something different, unless this is an exercise or an assignment and you'll get the digits already extracted.

If you want to train tesseract on your images, do some blur + threshold so that the numbers become black blobs. Run connected-component extraction with OpenCV. Sort by x and y. Now just iterate over the blobs, crop from the original image and assign the labels (you know the correct one because your sequence is fixed).
Delete or manually straighten the skewed lines with GIMP to keep things simple.

blobs_img = blur + threshold(img)
digits = findComponents(blobs_img) and sort
i = 0
for d in digits:
    tiff = crop from original image using d coordinates
    gt.txt = i
    i = (i + 1) % 10

Now you have the tiff images and the gt.txt files to run ocr-d.
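
For what it's worth, a rough runnable version of the pseudocode above, assuming a grayscale scan where the digit sequence repeats 0-9 across each row (the area filter, row height and file names are guesses to tune, not part of the original recipe):

import cv2

img = cv2.imread("digits_page.tif", cv2.IMREAD_GRAYSCALE)   # hypothetical input scan
blur = cv2.GaussianBlur(img, (5, 5), 0)
_, blobs_img = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Connected components; stats rows are [x, y, w, h, area], label 0 is the background.
n, labels, stats, _ = cv2.connectedComponentsWithStats(blobs_img)
boxes = [tuple(map(int, s[:4])) for s in stats[1:] if s[4] > 20]   # drop tiny specks

# Sort top-to-bottom, then left-to-right (a row height of ~50 px is an assumption).
boxes.sort(key=lambda b: (b[1] // 50, b[0]))

label = 0
for k, (x, y, w, h) in enumerate(boxes):
    cv2.imwrite("digit_%05d.tif" % k, img[y:y + h, x:x + w])
    with open("digit_%05d.gt.txt" % k, "w") as gt:
        gt.write(str(label))
    label = (label + 1) % 10   # ground truth cycles because the sheet's sequence is fixed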

Maybe there are some tools to do this by hand, one digit at a time:



Lorenzo



Ramakant Kushwaha

Jul 18, 2018, 6:20:11 AM
to tesseract-ocr
@Lorenzo
As per my understanding, MNIST is useful for recognising individual characters/digits, so to use MNIST I would have to do the steps below; correct me if I am wrong:
1. Gray + threshold (OpenCV)
2. Extract connected components (MSER, OpenCV)
3. Run a loop over the sorted connected-components list and crop each individual digit or row
4. Pass it to an MNIST-trained model
5. Save the result

NOTE: I do not know how I will distinguish between the account number and the mobile number or date.
I have not tried this; I will try the above method based on your suggestion.

##TESSERACT
I am using Tesseract because I want to extract words (like account, PAN, date, mobile) and their corresponding values (key-value pair extraction); that's why I thought it would be good to use Tesseract.
What I have done till now:
1. I get a scanned image
2. Crop the desired region (PIL + OpenCV)
3. Gray + blur + threshold
4. Connected-component extraction (MSER + OpenCV)
5. Make the text black on a white background
6. Pass the image from the previous step to Tesseract; it only detects printed words and numbers, though it also detects some handwritten digits
7. Please help me with this step (do I need to use MNIST or train Tesseract for handwriting?)
8. I am not able to remove the border around the digits (please suggest some technique)

I am attaching sample images (original image)



Image after blur + threshold + MSER + black & white

  

Final result will look like:
Account : 123456789054321
PAN: CYY******1*
Mobile:7777788888
Date:17/07/2018

Please suggest alternatives for solving this.

Lorenzo Bolzani

Jul 18, 2018, 7:56:05 PM
to tesser...@googlegroups.com

An MNIST-trained model does character recognition, not detection. You first need to isolate the characters to use it. The advantage is that it is already trained, and I think it may work better than fine-tuning tesseract because the handwritten digits are quite different from standard fonts.


The difference between recognizing characters and words is this:
character: you send individual characters to tesseract
word: you send a big image with the whole word

If the image contains "1234" I can call tesseract four times with four images 1,2,3,4 and then join the results. Or I can pass just one big image giving me "1234". You can get the same results in both ways. I know this is very easy for you, just to make sure that we are talking about the same thing.

In this case there are the form boxes to complicate things. If you find a good way to delete the boxes you can do whole words otherwise go for individual characters.

If you want to do whole words it's much better to train on whole words, like a full line from the 20 pages.

In theory, for real text/words, doing words is better but here you have "random" codes and numbers. I think the easier thing here is to go for individual characters.


There are three completely different tasks that are getting mixed:
1. generating the training data (extracting it from the 20 pages)
2. extracting the individual characters (or words) from real forms
3. doing the actual recognition.

Number 1: roughly follow the blur/threshold + findComponent idea to generate tiff and gt.txt files. If you use a pre-trained MNIST model (like in the link I provided) you do not need this step at all, you already have the trained model (and the training data too).

Number 2: it depends a lot on the quality of the forms. There are a lot of different ways to do this part. My first attempt would be to realign the form with a reference template using OpenCV SIFT. This is usually very precise. Then you can just crop the individual boxes because now you know the exact pixel position of the form elements. No need for blur, mser or other things.
Depending on the alignment precision, image quality/size, etc. you may still have some box borders in your crops. You can just move to step 3 and see what happens: maybe it just works and you are done. Otherwise you need to find a way to delete the box borders. I would simply try to take smaller crops; it's no big deal if you sometimes cut off a few pixels from letters. If that fails, try custom code, morphology opening, findComponents wiping the components that are too thin or too small, Hough lines, etc., depending on how big the borders are, how many, how often, on how many sides, etc.
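
A sketch of the SIFT alignment idea with OpenCV (SIFT_create needs a recent OpenCV build or the contrib package; the file names and the final crop coordinates are made-up placeholders):

import cv2
import numpy as np

template = cv2.imread("blank_form.png", cv2.IMREAD_GRAYSCALE)   # hypothetical empty reference form
scan = cv2.imread("filled_form.png", cv2.IMREAD_GRAYSCALE)      # the scan to align

sift = cv2.SIFT_create()
kp_s, des_s = sift.detectAndCompute(scan, None)
kp_t, des_t = sift.detectAndCompute(template, None)

# Ratio-test matching, then a RANSAC homography mapping the scan onto the template.
matches = cv2.BFMatcher().knnMatch(des_s, des_t, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
src = np.float32([kp_s[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp_t[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

aligned = cv2.warpPerspective(scan, H, (template.shape[1], template.shape[0]))

# After alignment the account-number cells sit at known pixel positions,
# so each digit is a plain fixed crop (coordinates below are placeholders).
x, y, w, h = 120, 260, 38, 48
first_digit = aligned[y:y + h, x:x + w]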

Maybe you can do this part (detection/extraction) with tesseract too, try the hocr output. I've never used it, I do not know if it works with tables. Maybe it won't work on the whole page but it will work on two small crops of the upper blocks you are interested in (this means that you still need to align the page unless your scans are already aligned/oriented).

Use GIMP to prepare the images for these tests and see what works best; do not waste time doing this "programming first". Crop the region with GIMP and run the hOCR output from the command line.

You can also do this part (detection/extraction) with hough lines, template matching, findComponents, etc. Use what works best, it also depends on speed requirements.

Number 3: this is easy once you have a small image with just a single character inside. You do not need to do a binary black/white image, gray is fine (at least it is what works best for me). You can use a MNIST trained model or tesseract. If you have enough time try both and see what works best. For tesseract try different image sizes.
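
For the recognition step with an MNIST-style model the glue code is small. A sketch assuming a Keras model you have trained or downloaded yourself (the "mnist_cnn.h5" file name is hypothetical):

import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("mnist_cnn.h5")   # hypothetical pre-trained digit classifier

def classify_digit(crop_gray):
    # crop_gray: a grayscale crop containing one dark-on-light digit.
    # MNIST digits are white on black and 28x28, so invert, resize and normalise first.
    digit = cv2.resize(255 - crop_gray, (28, 28), interpolation=cv2.INTER_AREA)
    digit = digit.astype("float32") / 255.0
    probs = model.predict(digit.reshape(1, 28, 28, 1), verbose=0)
    return int(np.argmax(probs))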


Lorenzo



Ramakant Kushwaha

Jul 19, 2018, 2:02:07 AM
to tesseract-ocr
Thanks Lorenzo,
I will try OpenCV + SIFT + MNIST and will update you soon.

chandra churh chatterjee

Jul 19, 2018, 6:08:34 AM
to tesser...@googlegroups.com
I have already used Tesseract 4.0 for training on handwritten digits.
The steps are as follows:
1. The best way is to use some handwritten fonts from Google or anywhere else.
2. Create a text corpus containing only the digits 0-9 generated by a random function, and use the "tesstrain.sh" script to generate the starter traineddata from it.
3. Use the starter traineddata to generate the final traineddata after LSTM training.


If you want a detailed description, I can supply you with a complete documentation of steps.

Chandra Churh Chatterjee



Ramakant Kushwaha

Jul 19, 2018, 6:32:07 AM
to tesser...@googlegroups.com
Thanks @Chandra. I am a beginner at this; please help me with the complete documentation.



chandra churh chatterjee

Jul 19, 2018, 6:36:45 AM
to tesser...@googlegroups.com
Environment : Ubuntu 16.04 LTS
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Check : Running tesseract -v in terminal gives:
________________________________________________

tesseract 4.0.0-beta.1-376-gb1f79
 leptonica-1.74.1
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8

 Found AVX2
 Found AVX
 Found SSE
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
DOWNLOAD HANDWRITTEN FONTS FROM fonts.google.com AND TRAIN USING THE GENERAL PROCEDURE.

  THE TEXT CORPUS WAS CREATED BY TWEAKING THE CODE OF create_corpus.py AND STORING THE RESULT IN corpus.txt

  WHICH WAS THEN RENAMED TO [lang].training_text AND REPLACED IN langdata/[lang] DIRECTORY.


[Step 1] Download the required fonts and install them on the system. On a Linux machine, copy the fonts to the ~/.fonts directory and run <sudo fc-cache -rv> from there.

[Step 2] Get the fonts you want to train tesseract on by running the following command : 

text2image --find_fonts --fonts_dir /usr/share/fonts --text ./langdata/[lang]/[lang].training_text --min_coverage .9  --outputbase ./langdata/[lang]/[lang] |& grep raw  | sed -e 's/ :.*/@ \\/g'  | sed -e "s/^/  '/"  | sed -e "s/@/'/g" >path/to/langdata/[lang]/fontslist.txt

[Step 3] Go to langdata/[lang]/fontslist.txt, open it and copy the contents. Paste the same in "language-specific.sh" under Latin fonts.
Generate the format of the new fonts according to the convention mentioned in 


and list them. Add the same to langdata/font_properties.


 [Step 4] Generate starter traineddata by running the following command.

  training/tesstrain.sh --lang eng --linedata_only   --noextract_font_properties --langdata_dir ~/langdata --output_dir ~/tesstutorial/newoutput

  [Make sure to mention the full path of tesstrain.sh]


[Step 5]  Run lstmtraining on the starter traineddata with the following command :

training/lstmtraining --debug_interval 0   --traineddata ~/tesstutorial/newoutput/eng/eng.traineddata   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]'   --model_output ~/tesstutorial/newoutput/output/base --learning_rate 20e-4   --train_listfile ~/tesstutorial/newoutput/eng.training_files.txt  --max_iterations 10000 &>~/tesstutorial/newoutput/output/basetrain.log

Follow the tesseract 4 official wiki to get details about all parameters that can be specified. This step will take a long time to complete.
--debug_interval should be kept either 0 or -1 if ScrollView.jar was not made. Also make sure the output and input directories are writable and readable, respectively.

[Step 6] Create the final traineddata that is used by the software by running the following command:

training/lstmtraining --stop_training --continue_from ~/tesstutorial/newoutput/output/base_checkpoint --traineddata ~/tesstutorial/newoutput/eng/eng.traineddata --model_output ~/tesstutorial/newoutput/output/eng.traineddata

[Again, make sure the complete path to lstmtraining is given to ensure the proper version is used.]

[Step 7] Rename the eng.traineddata file to digits.traineddata and copy the same to tessdata directory from where tesseract reads the languages.
To integrate with the Reader (on Windows), copy it to the tessdata directory.
 
Run from ~/tesstutorial/digoutput directory :

sudo cp digits.traineddata /usr/share/tesseract-ocr/tessdata/digits.traineddata


ACCURACY ACHIEVED : ~ 90%-95%
HIGHEST ACCURACY : 100%


hrishikesh kaulwar

Jun 12, 2019, 8:09:30 AM
to tesseract-ocr
Can we use jTessBoxEditor for working with Tesseract 4.0?
I have read that we cannot use it for Tesseract 4.0.