Automate Tesseract 3.01 language data generation process

Quan Nguyen

Mar 27, 2011, 1:21:11 PM
to tesseract-ocr
I created a PowerShell script to automate language data generation for
Tesseract 3.01. Save it as train.ps1 and put it in the tesseract-3.0
directory.

Any feedback or improvements are welcome.

<#

Automate Tesseract 3.01 language data pack generation process.

@author: Quan Nguyen
@date: 27 Mar 2011

The script file should be placed in the same directory as Tesseract's
binary executables.

Run PowerShell as Administrator and allow script execution by running
the following command:

PS > Set-ExecutionPolicy RemoteSigned

Then execute the script by:

PS > .\train.ps1
or
PS > .\train.ps1 yourlang imageFolder

If imageFolder is not specified, it defaults to a yourlang
subdirectory under the Tesseract directory.

Windows PowerShell 2.0 Download: http://support.microsoft.com/kb/968929

#>

$lang = $args[0]
if (!$lang) {
    $lang = Read-Host "Enter a language code"
}

$langDir = $lang

if ($args[1]) {
    $langDir = $args[1]
}

if (!(test-path $langDir))
{
    throw "{0} is not a valid path" -f $langDir
}

echo "=== Generating Tesseract language data for language: $lang ==="

$fullPath = [IO.Path]::GetFullPath($langDir)
echo "** Your training images should be in ""$fullPath"" directory."

$al = New-Object System.Collections.ArrayList

echo "Make Box Files"
$boxFiles = ""
Foreach ($entry in dir $langDir) {
    If ($entry.name.toLower().endsWith(".tif") -and $entry.name.startsWith($lang)) {
        echo "** Processing image: $entry"
        $nameWoExt = [IO.Path]::Combine($entry.DirectoryName, $entry.BaseName)
        $al.Add($nameWoExt)

        # Bootstrapping a new character set
        $trainCmd = ".\tesseract {0}.tif {0} -l {1} batch.nochop makebox" -f $nameWoExt, $lang
        # Comment out the next line once the box files have been edited, to
        # prevent them from being overwritten in repeated runs.
        Invoke-Expression $trainCmd
        $boxFiles += $nameWoExt + ".box "
    }
}
echo "** Box files should be edited before continuing. **"

echo "Generate .tr Files"
$trFiles = ""
Foreach ($entry in $al) {
    $trainCmd = ".\tesseract {0}.tif {0} nobatch box.train" -f $entry
    Invoke-Expression $trainCmd
    $trFiles += $entry + ".tr "
}

echo "Compute the Character Set"
Invoke-Expression ".\unicharset_extractor -D $langDir $boxFiles"

move-item -force -path $langDir\unicharset -destination $langDir\$lang.unicharset

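echo "Clustering"
# mftraining needs $lang.font_properties in $langDir. The output files are
# renamed with the language prefix so that combine_tessdata can find them.
Invoke-Expression ".\mftraining -F $langDir\$lang.font_properties -U $langDir\$lang.unicharset $trFiles"
move-item -force -path inttemp -destination $langDir\$lang.inttemp
move-item -force -path pffmtable -destination $langDir\$lang.pffmtable
move-item -force -path Microfeat -destination $langDir\$lang.Microfeat

Invoke-Expression ".\cntraining $trFiles"
move-item -force -path normproto -destination $langDir\$lang.normproto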

echo "Dictionary Data"
Invoke-Expression ".\wordlist2dawg $langdir\$lang.frequent_words_list.txt $langdir\$lang.freq-dawg $langdir\$lang.unicharset"
Invoke-Expression ".\wordlist2dawg $langdir\$lang.words_list.txt $langdir\$lang.word-dawg $langdir\$lang.unicharset"

echo "The last file (unicharambigs) -- this is to be manually edited"
if (!(test-path $langdir\$lang.unicharambigs)) {
    new-item "$langdir\$lang.unicharambigs" -type file
    set-content -path $langdir\$lang.unicharambigs -value "v1"
}

echo "Putting it all together"
Invoke-Expression ".\combine_tessdata $langdir\$lang."

Sriranga(78yrsold)

Mar 27, 2011, 11:10:27 PM
to tesser...@googlegroups.com
Kindly tell me whether I have to download Windows PowerShell 2.0 from the Script Center downloads. Also, will this work for the tesseract-ocr r578 version, given issue No. 465?
With regards,
-sriranga(78)


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com.
To unsubscribe from this group, send email to tesseract-oc...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.


Sriranga(78yrsold)

Mar 27, 2011, 11:37:30 PM
to tesser...@googlegroups.com
Sorry, I tried to download from the "Download Windows PowerShell 2.0" link, but instead of downloading it goes back to http://support.microsoft.com/kb/968929. I tried again, but the download page did not open. I am confused about where I have to download it from. I am on WinXP (SP3), which is updated automatically by default.
With regards,
-sriranga(78)

Quan Nguyen

Mar 29, 2011, 12:52:17 AM
to tesseract-ocr
For WinXP, you'd want to click on the "Download the Windows Management
Framework Core for Windows XP and Windows Embedded package now" link.

Quan Nguyen

Mar 29, 2011, 1:02:14 AM
to tesseract-ocr
I made some corrections and included the latest updates to the process
as outlined in the TrainingTesseract3 wiki. This will correctly build
the [lang].traineddata file. Be sure to provide the necessary training
files in the same folder: [lang].[fontname].exp[num].tif,
[lang].font_properties, [lang].words_list.txt,
[lang].frequent_words_list.txt, etc.

<#

Automate Tesseract 3.01 language data pack generation process.

@author: Quan Nguyen
@date: 28 Mar 2011
$boxFiles = ""
Foreach ($entry in dir $langDir) {
    If ($entry.name.toLower().endsWith(".tif") -and $entry.name.startsWith($lang)) {
        echo "** Processing image: $entry"
        $nameWoExt = [IO.Path]::Combine($langDir, $entry.BaseName)
        $al.Add($nameWoExt)

        # Bootstrapping a new character set
        $trainCmd = ".\tesseract {0}.tif {0} -l {1} batch.nochop makebox" -f $nameWoExt, $lang
        # Comment out the next line once the box files have been edited, to
        # prevent them from being overwritten in repeated runs.
        Invoke-Expression $trainCmd
        $boxFiles += $nameWoExt + ".box "
    }
}
echo "** Box files should be edited before continuing. **"

echo "Generate .tr Files"
$trFiles = ""
Foreach ($entry in $al) {
    $trainCmd = ".\tesseract {0}.tif {0} nobatch box.train" -f $entry
    Invoke-Expression $trainCmd
    $trFiles += $entry + ".tr "
}

echo "Compute the Character Set"
Invoke-Expression ".\unicharset_extractor -D $langDir $boxFiles"
move-item -force -path $langDir\unicharset -destination $langDir\$lang.unicharset

echo "Clustering"
Invoke-Expression ".\mftraining -F $langDir\$lang.font_properties -U $langDir\$lang.unicharset $trFiles"
move-item -force -path inttemp -destination $langDir\$lang.inttemp
move-item -force -path pffmtable -destination $langDir\$lang.pffmtable
move-item -force -path Microfeat -destination $langDir\$lang.Microfeat

Invoke-Expression ".\cntraining $trFiles"
move-item -force -path normproto -destination $langDir\$lang.normproto

Mow

May 23, 2011, 7:20:07 AM
to tesseract-ocr
Hi!

I'm getting errors with your script. I'm on Win7 32-bit.
It fails when it executes line 60 of your script:

Invoke-Expression ".\mftraining -F $langDir\$lang.font_properties -U $langDir\$lang.unicharset $trFiles"

mftraining crashes. I'm using 2 .tiff files, with 3 boxes in each.

Here's the log:

////////////////////////////////// LOG //////////////////////////////////
=== Generating Tesseract language data for language: mow ===
** Your training images should be in "C:\Program Files\Tesseract-OCR\images" directory.
Make Box Files
** Processing image: mow.font1.exp1.tif
0
.\tesseract images\mow.font1.exp1.tif images\mow.font1.exp1 -l mow batch.nochop makebox
tesseract.exe : Tesseract Open Source OCR Engine with Leptonica
At line:1 char:12
+ .\tesseract <<<< images\mow.font1.exp1.tif images\mow.font1.exp1 -l mow batch.nochop makebox
+ CategoryInfo : NotSpecified: (Tesseract Open ... with Leptonica:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError

Number of found pages: 1.
Using substitute bounding box at (0,7)->(28,16)

** Processing image: mow.font1.exp2.tif
1
.\tesseract images\mow.font1.exp2.tif images\mow.font1.exp2 -l mow batch.nochop makebox
tesseract.exe : Tesseract Open Source OCR Engine with Leptonica
At line:1 char:12
+ .\tesseract <<<< images\mow.font1.exp2.tif images\mow.font1.exp2 -l mow batch.nochop makebox
+ CategoryInfo : NotSpecified: (Tesseract Open ... with Leptonica:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError

Number of found pages: 1.
Using substitute bounding box at (0,7)->(28,16)

** Processing image: mow.font1.exp3.tif
2
.\tesseract images\mow.font1.exp3.tif images\mow.font1.exp3 -l mow batch.nochop makebox
tesseract.exe : Tesseract Open Source OCR Engine with Leptonica
At line:1 char:12
+ .\tesseract <<<< images\mow.font1.exp3.tif images\mow.font1.exp3 -l mow batch.nochop makebox
+ CategoryInfo : NotSpecified: (Tesseract Open ... with Leptonica:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError

Number of found pages: 1.
Using substitute bounding box at (0,7)->(28,16)

** Box files should be edited before continuing. **
Generate .tr Files
tesseract.exe : Number of found pages: 1.
At line:1 char:12
+ .\tesseract <<<< images\mow.font1.exp1.tif images\mow.font1.exp1 nobatch box.train
+ CategoryInfo : NotSpecified: (Number of found pages: 1.:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError

0
tesseract.exe : Number of found pages: 1.
At line:1 char:12
+ .\tesseract <<<< images\mow.font1.exp2.tif images\mow.font1.exp2 nobatch box.train
+ CategoryInfo : NotSpecified: (Number of found pages: 1.:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError

1
tesseract.exe : Number of found pages: 1.
At line:1 char:12
+ .\tesseract <<<< images\mow.font1.exp3.tif images\mow.font1.exp3 nobatch box.train
+ CategoryInfo : NotSpecified: (Number of found pages: 1.:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError

2
Compute the Character Set
Extracting unicharset from images\mow.font1.exp1.box
Extracting unicharset from images\mow.font1.exp2.box
Extracting unicharset from images\mow.font1.exp3.box
Wrote unicharset file images/unicharset.
Clustering
Move-Item : Cannot find path 'C:\Program Files\Tesseract-OCR\pffmtable' because it does not exist.
At C:\Program Files\Tesseract-OCR\auto1.ps1:64 char:10
+ move-item <<<< -force -path pffmtable -destination $langDir\$lang.pffmtable
+ CategoryInfo : ObjectNotFound: (C:\Program Files\Tesseract-OCR\pffmtable:String) [Move-Item], ItemNotFoundException
+ FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.MoveItemCommand

Reading images\mow.font1.exp1.tr ...
Reading images\mow.font1.exp2.tr ...
Reading images\mow.font1.exp3.tr ...
Clustering ...

Writing normproto ...
Dictionary Data
Reading word list from 'images\mow.frequent_words_list.txt'
Reducing Trie to SquishedDawg
Writing squished DAWG to 'images\mow.freq-dawg'
Reading word list from 'images\mow.words_list.txt'
Reducing Trie to SquishedDawg
Writing squished DAWG to 'images\mow.word-dawg'
The last file (unicharambigs) -- this is to be manually edited
Putting it all together
Combining tessdata files
combine_tessdata.exe : TessdataManager combined tesseract data files.
At line:1 char:19
+ .\combine_tessdata <<<< images\mow.
+ CategoryInfo : NotSpecified: (TessdataManager...act data files.:String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError

Offset for type 0 is -1
Offset for type 1 is 84
Offset for type 2 is 145
Offset for type 3 is 148
Offset for type 4 is 110740
Offset for type 5 is 110760
Offset for type 6 is -1
Offset for type 7 is 111062
Offset for type 8 is -1
Offset for type 9 is 111072

Quan Nguyen

May 23, 2011, 9:00:04 PM
to tesseract-ocr
It looks like the script was looking for the pffmtable file created by
the mftraining command, which had probably failed. Did you have the
required mow.font_properties file in the directory?
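
A quick pre-flight check can confirm that the training inputs listed earlier in the thread ([lang].font_properties, [lang].words_list.txt, [lang].frequent_words_list.txt and the [lang].*.tif images) are in place before the script runs. The sketch below is only an illustration: Get-MissingTrainingFile is a hypothetical helper name, not part of train.ps1, and it only tests that the files exist, not that their contents are valid.

```powershell
# Hypothetical pre-flight helper (an assumption, not part of train.ps1).
# Returns the training inputs that are missing from $langDir.
function Get-MissingTrainingFile($lang, $langDir) {
    $missing = @()

    # Files that the training script reads by fixed name.
    foreach ($suffix in "font_properties", "words_list.txt", "frequent_words_list.txt") {
        $file = Join-Path $langDir "$lang.$suffix"
        if (!(Test-Path $file)) {
            $missing += $file
        }
    }

    # At least one training image named [lang].*.tif must be present.
    $imagePattern = Join-Path $langDir "$lang.*.tif"
    if (!(Test-Path $imagePattern)) {
        $missing += $imagePattern
    }

    return $missing
}

# Example use with the language code and folder from this thread.
$missing = Get-MissingTrainingFile "mow" "images"
if ($missing) {
    throw "Missing training files: $missing"
}
```

Stopping early this way avoids the follow-on Move-Item errors that appear when mftraining has produced no output. It would not catch a font_properties file that exists but lacks the entries mftraining expects.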

Mow

May 25, 2011, 5:26:33 AM
to tesseract-ocr
Hi!
Thanks for answering. mow.font_properties is in its directory; maybe it
isn't in the right format or something.
Its content is:

font1 1 0 0 1 0

The .tif files are named as follows:
mow.font1.exp1.tif, mow.font1.exp2.tif, mow.font1.exp3.tif.

Could that be the problem? Thanks!

Quan Nguyen

May 25, 2011, 8:12:08 PM
to tesseract-ocr
That's the problem -- you'd need an entry for every image file. The
following is excerpted from the TrainingTesseract3 wiki:

"When running mftraining, each .tr filename must match an entry in the
font_properties file, or mftraining will abort."

If they are the same font, you can put them in a single multi-page
TIFF image; otherwise, rename your files so each has a unique name --
but keep the same exp number -- and then add a corresponding entry in
the font_properties file.

Mow

May 26, 2011, 7:55:44 AM
to tesseract-ocr
Thank you very much, Quan!!

I've successfully generated my traineddata file. Some errors still
appear, but they seem normal.

The error was the one you mentioned. I'm now using multi-page TIFFs
to train Tesseract.

Thank you again for your help and script!!

Quan Nguyen

Oct 6, 2012, 8:38:59 PM
to tesser...@googlegroups.com
The script has been updated for Tesseract 3.02 training.

http://vietocr.svn.sourceforge.net/viewvc/vietocr/jTessBoxEditor/trunk/tools/

Sriranga(78yrs)

Oct 8, 2012, 10:20:35 PM
to tesser...@googlegroups.com
Rao,
I trust you are able to use the automated tdf generation yourself. I have not tested it since its first release on March 27, 2011, as I am not interested.
