what's the content of fixed-length-dawgs

Xiaohui Zhang

Oct 16, 2013, 4:48:09 AM

to tesser...@googlegroups.com

Dear all,

Are there any tips on how to use the fixed-length-dawgs file? I tried to use dawg2wordlist to extract some sample content from the provided chi_sim trained data, but without success: the command crashes while "Reading squished dawg".

Any suggestions about how to use this file?

Thanks very much.

cskau

Jan 7, 2014, 8:39:19 AM

to tesser...@googlegroups.com

I was pondering the same thing this evening. Since there seems to be precious little information out there, allow me to revive this three-month-old thread with a few of my findings.

I too got a crash when I tried extracting the fixed-length-dawgs, and dawg2wordlist doesn't seem to offer any special flags for handling this special composite dawg.

However, wordlist2dawg does have a special mode:

wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset

and says about the option:

-l <short> <long> Produce a file with several dawgs in it, one each for words of length <short>, <short+1>,… <long>

While one could surely just look at the source to figure out the details, I figured the "dawgs" file format is simply a bunch of "dawg"s cat'ed together.

To verify this theory I compared a regular dawg and the fixed-length-dawgs in a hex editor.

The regular dawg appears to use the magic number '2A001D0E', which, suspiciously, also turns up several times in the fixed-length-dawgs file.

An educated guess tells me the dawgs format is simply:

[4 bytes : number of dawgs] + ([4 bytes : length of words in dawg] + [DAWG ...])*

This makes it very easy to manually extract the individual dawgs, and one could even naively split the file on the headers:

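# Needs an awk that accepts a multi-byte RS (gawk does); LC_ALL=C makes it treat the input as raw bytes.
LC_ALL=C awk 'BEGIN {RS="\x2A\x00\x1D\x0E"; FILENUM=-1} {FILENUM++; if (FILENUM == 0)
  {next}; FILENAME=".fixed-length-dawg-"FILENUM; printf "%s",RS$0 > FILENAME;}' .fixed-length-dawgs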

By using the above snippet I successfully managed to "extract" 6 dawgs of various length from the pre-built jpn.traineddata.

You can then run the standard dawg2wordlist and extract the wordlists from them.
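
A quick way to sanity-check the guessed layout before splitting is to compare the leading 4-byte count with the number of times the marker occurs in the file. The sketch below is only an illustration: the function names first_u32 and count_marker are made up here, the little-endian byte order is an assumption, and the layout itself is still the educated guess described above rather than something confirmed from the Tesseract source.

```sh
#!/bin/sh
# Hypothetical helpers (not part of Tesseract) for checking the guessed layout.

# first_u32 FILE: print the first 4 bytes of FILE as an unsigned integer,
# ASSUMING little-endian byte order.
first_u32() {
  # od prints the four bytes as decimal numbers; reuse them as $1..$4.
  set -- $(od -An -v -tu1 -N4 "$1")
  echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}

# count_marker FILE "2a 00 1d 0e": count byte-aligned occurrences of the
# marker, given as lowercase hex bytes separated by single spaces.
count_marker() {
  od -An -v -tx1 "$1" | tr -s ' \n' ' ' | grep -oF -- "$2" | wc -l | tr -d ' '
}
```

For the file from this post one would run first_u32 .fixed-length-dawgs and count_marker .fixed-length-dawgs "2a 00 1d 0e". If the guessed layout is right, the two numbers should agree, although the marker bytes could in principle also occur by chance inside a dawg's data.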

On a separate note it is still not clear to me what the exact purpose of these sub dawgs is.

The jpn.traineddata appears to contain a .freq-dawg and the .fixed-length-dawgs but no .word-dawg.

Why it is helpful to split the dictionary into many smaller dictionaries based on word length, I cannot guess.

I hope this will be helpful to someone out there.

Nade Sritanyaratana

May 19, 2015, 2:25:38 AM

to tesser...@googlegroups.com

cskau, thank you for posting this! I would have gotten stuck without it.

The awk command you provided seems to work great on jpn.traineddata. I tried the same awk command on chi_sim.traineddata, but unfortunately did not have the same luck.

Following your suggestion, I used a hex editor to view the dawgs file and a dawg file, both from chi_sim.traineddata. I see that the "magic number" was for some reason slightly different. I noticed instead the magic hexadecimal number "2A00A313".

Fast forward a bit -- the following command worked for me:

awk 'BEGIN {RS="\x2A\x00\xA3\x13"; FILENUM=-1} {FILENUM++; if (FILENUM == 0) {next}; FILENAME="chi_sim.fixed-length-dawg-"FILENUM; printf "%s",RS$0 > FILENAME;}' chi_sim.fixed-length-dawgs
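
Since the marker evidently differs between language packs, it may be easier to read it from a file than to hunt for it in a hex editor. The sketch below prints the first four bytes of a file as an awk-style escape string. The function name dawg_marker is made up for illustration, and it assumes, as observed in this thread, that each sub-dawg begins with the same four bytes as a regular dawg unpacked from the same traineddata.

```sh
#!/bin/sh
# Hypothetical helper (not part of Tesseract): print the first four bytes of
# FILE in the "\xHH" form used in the RS="..." assignment of the awk command.
dawg_marker() {
  # Dump 4 bytes as hex, drop the separators, and upper-case the digits.
  hex=$(od -An -v -tx1 -N4 "$1" | tr -d ' \n' | tr 'a-f' 'A-F')
  # Prefix every pair of hex digits with \x.
  printf '%s\n' "$hex" | sed 's/../\\x&/g'
}
```

Run against a regular dawg file, the printed string can be pasted into the RS="..." part of the awk command. For a file starting with the bytes 2A 00 A3 13 it prints \x2A\x00\xA3\x13.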

Detailing my steps for others:

1. Download chi_sim.traineddata from Tesseract's downloads page, untar it, and cd to the directory containing the traineddata file.
2. combine_tessdata -u chi_sim.traineddata chi_sim.
3. Execute the awk command shown above.
4. dawg2wordlist chi_sim.unicharset chi_sim.fixed-length-dawg-1 fixed-length-1_wordlist
5. Repeat step 4 for chi_sim.fixed-length-dawg-2, chi_sim.fixed-length-dawg-3.
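
The last two steps can be wrapped in a small loop. This is only a sketch: the function name dump_wordlists and the DAWG2WORDLIST override variable are made up for illustration, and the dawg2wordlist argument order (unicharset, dawg, output wordlist) is taken from the command in the steps above.

```sh
#!/bin/sh
# Hypothetical helper: run dawg2wordlist over every extracted sub-dawg.
# Usage: dump_wordlists chi_sim.unicharset chi_sim.fixed-length-dawg-
dump_wordlists() {
  unicharset=$1
  prefix=$2
  # The tool name can be overridden, e.g. with a stub for a dry run.
  tool=${DAWG2WORDLIST:-dawg2wordlist}
  for dawg in "$prefix"*; do
    # If nothing matches, the glob stays literal; skip it.
    [ -f "$dawg" ] || continue
    n=${dawg#"$prefix"}   # the part after the prefix, e.g. "1"
    "$tool" "$unicharset" "$dawg" "fixed-length-${n}_wordlist" ||
      echo "dawg2wordlist failed on $dawg" >&2
  done
}
```

Called as dump_wordlists chi_sim.unicharset chi_sim.fixed-length-dawg-, it writes fixed-length-1_wordlist, fixed-length-2_wordlist, and so on, one per extracted file.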

Cheers,

Nade

wfxi...@gmail.com

Jul 9, 2015, 5:32:14 AM

to tesser...@googlegroups.com

Hi, Nade, thanks for your post.

I've tried your method on chi_sim but got 17 empty sub dawgs. However, my fixed-length.dawg is around 600 KB... BTW, do you have any idea what this file is for? Does it help improve the accuracy of Chinese recognition?

-Han

On Tuesday, May 19, 2015 at 2:25:38 PM UTC+8, Nade Sritanyaratana wrote:

ShreeDevi Kumar

Jul 9, 2015, 10:58:25 AM

to tesser...@googlegroups.com

Have you tried with the new traineddata files at

https://github.com/tesseract-ocr/tessdata

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


ShreeDevi Kumar

Jul 9, 2015, 11:00:20 AM

to tesser...@googlegroups.com

Also see the language training data available at

https://github.com/tesseract-ocr/langdata

ShreeDevi

Nade Sritanyaratana

Jul 23, 2015, 11:59:28 AM

to tesseract-ocr, wfxi...@gmail.com

Hello Han,

Sorry about the late response on my end. Did ShreeDevi's comments help with your questions?

Regarding fixed-length.dawg -- this is just one of the dawg files that are typically used for wordlist2dawg:

https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Dictionary_Data_(Optional)

There is some information at the link above. My understanding is that this helps for languages that have fixed-length characters, such as Chinese. I am not sure if this is the answer you were looking for, though. Feel free to clarify your question in case others have a better idea of how to answer.

- Nade

Tom Morris

Jul 24, 2015, 2:54:03 PM

to tesseract-ocr, wfxi...@gmail.com, nad...@gmail.com

On Thursday, July 23, 2015 at 11:59:28 AM UTC-4, Nade Sritanyaratana wrote:

Regarding fixed-length.dawg -- this is just one of the dawg files that are typically used for wordlist2dawg:
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Dictionary_Data_(Optional)

That link is obsolete. The current link is:

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#dictionary-data-optional

Tom
