Tesseract v3.03 and the Norwegian language


Ludvig F Aarstad

Jan 2, 2017, 9:42:24 AM
to tesseract-ocr
Greetings and salutations fellow OCR'ers ;).
I have been playing around with various PowerShell modules for reading text from an image, but I have landed on using Tesseract directly. It all works fine, and it reads like a dream :). However, it seems to have problems with at least one of the Norwegian characters: the scanned image has the letter Æ, while Tesseract reads it as AE.
I have tried looking into how to train it, but I haven't figured it out yet.

I am grateful for any assistance.

Ludvig F. Aarstad

Tom Morris

Jan 2, 2017, 6:10:35 PM
to tesseract-ocr
First, the latest version is 3.04 (although there's also a tag for 3.05).
Second, there will soon (hopefully) be a release for 4.00 which will make 3.x obsolete.

Having said that, it looks like the root cause of your problem is that Tesseract doesn't know Æ is a possible letter for Norwegian. The training text and the character frequencies have lots of occurrences of the lowercase letter, but none of the uppercase. See these three files:


That word list has 360,000 different words without a single one of them containing the character. Just how rare is it? Is it something that would only ever occur in the middle of a word, so you'd need to have some all-caps text to be able to find it?

Note that if 4.0 was trained on the same material as 3.x, it may have the same problem.
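
The word-list check described above can be sketched in a few lines of Python. This is only an illustration: the file name `nor.wordlist` and the helper `count_words_containing` are assumptions made for the example, and the word list is assumed to hold one UTF-8 word per line.

```python
from pathlib import Path
import tempfile


def count_words_containing(wordlist_path, char):
    """Count the words (one per line) that contain `char`."""
    words = Path(wordlist_path).read_text(encoding="utf-8").splitlines()
    return sum(1 for word in words if char in word)


if __name__ == "__main__":
    # Tiny stand-in word list (hypothetical content); point the function
    # at the real file to check the actual training data.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "nor.wordlist"
        sample.write_text("ære\nlære\nhus\n", encoding="utf-8")
        print("lowercase:", count_words_containing(sample, "æ"))
        print("uppercase:", count_words_containing(sample, "Æ"))
```

The comparison is deliberately case-sensitive, since the question here is whether the uppercase form appears at all.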

Tom

Ludvig F Aarstad

Jan 3, 2017, 2:14:12 AM
to tesseract-ocr
Hm, in Norwegian it isn't that rare. Or at least shouldn't be ;). Æ is the uppercase version of æ, and it would never occur in the middle of a word.
 
I find it strange that it has been left out altogether. What must I do to get it in there?

Ludvig F Aarstad

Jan 4, 2017, 10:27:04 AM
to tesseract-ocr
If someone feels up to it, any chance of dumbing down the procedure for adding a missing letter to the Norwegian language? I am happy to do the legwork, I just need to understand the concept, and I don't quite understand it when reading the guides.
An easy list containing the steps would do just fine.

Something like:
1. Create an image of the letter to add
2. Update wordlist
3. etc etc
4. build something
5. upload to github

Or am I simply totally off the track?

ShreeDevi Kumar

Jan 4, 2017, 10:39:01 PM
to tesser...@googlegroups.com, Ray Smith
Ray is planning to retrain the languages for the new 4.0.0 version sometime in January. So it would be helpful if you could open an issue on https://github.com/tesseract-ocr/langdata/issues with this information.

Also, if you can provide a sample representative Norwegian text including Æ, I will try the finetune training procedure outlined by Ray in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/27ab99d7-eab0-4b9f-9086-2ccb16292ac9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ludvig F Aarstad

Jan 5, 2017, 4:24:43 AM
to tesseract-ocr, thera...@gmail.com
I can come up with several samples, if that helps.
I also realized that the occurrence of Æ in the beginning of a sentence is quite rare. It will in most cases only be for names of people (surnames mostly) and names of places and streets in addition to some specific Norwegian words that can occur in the beginning of a sentence and thus require the capital Æ.

Some samples (English counterpart added for reference):
Ærfuglveien 44 er adressen jeg bor på - Ærfuglveien 44 is the address where I live.
Min adresse er Ærfuglgaten 73. - My address is Ærfuglgaten 73.
Ærlighet varer lengst. - Honesty lasts the longest.
Ærfuglen er den største andearten i vårt land. - The eider is the largest duck species in our country.
Ærekrenkelse er en handling som består i å krenke en annens æresfølelse, eller opptre på en måte som er egnet til å skade en annens gode navn og rykte eller til å utsette ham for hat, ringeakt eller tap av den for hans stilling eller næring fornødne tillit. - Defamation is an act that consists of violating another person's sense of honor, or behaving in a way that is likely to harm another's good name and reputation, or to expose him to hatred, contempt, or loss of the confidence necessary for his position or business.
Æsene lå i kamp med en annen gudeslekt, vanene. - The Æsir were at war with another race of gods, the Vanir.
Ærgjerrighet har vært viktig for mange av oss. Da vi var småjenter, skjønte vi at det er viktig å arbeide hardt og bli til noe. - Ambition has been important to many of us. When we were little girls, we realized that it is important to work hard and become something.
Det var Æsene som var snille. - It was the Æsir who were the kind ones.

Will this suffice?




ShreeDevi Kumar

Jan 5, 2017, 7:05:51 AM
to tesser...@googlegroups.com
I will give it a try and let you know.


Ludvig F Aarstad

Jan 5, 2017, 10:29:49 AM
to tesseract-ocr
Fantastic, thanks:).

ShreeDevi Kumar

Jan 6, 2017, 1:22:02 AM
to tesser...@googlegroups.com, Ray Smith
Tried 'Finetune' - that does not help with adding a character.

Trying 'Add a layer' now.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Peter

Jan 6, 2017, 6:31:23 AM
to tesseract-ocr


On Thursday, January 5, 2017 at 04:39:01 UTC+1, shree wrote:
Ray is planning to retrain the languages for the new 4.0.0 version sometime in January. So it would be helpful if you could open an issue on https://github.com/tesseract-ocr/langdata/issues with this information.

Is it possible to contribute training data for this effort? I realise Swedish will not be at the top of the list, but I think it would be easy to involve some of the research community here in contributing training data if it could improve the language model.

/Peter 

ShreeDevi Kumar

Jan 6, 2017, 7:30:26 AM
to tesser...@googlegroups.com
@Peter, Have you tried the 4.0.0alpha version yet?

@Ludvig F. Aarstad - 'Add a layer' training worked for adding 'Æ' - I will upload the new traineddata so that you can test. You will need the 4.0.0alpha version for testing.

Here are a couple of the training TIFFs and the OCRed text.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

nor.Arial_Bold.exp0.txt
nor.Arial_Bold_Italic.exp0.txt
nor.Arial_Bold.exp0.tif
nor.Arial_Bold_Italic.exp0.tif

ShreeDevi Kumar

Jan 6, 2017, 7:50:38 AM
to tesser...@googlegroups.com
I have uploaded modified nor.traineddata at


See the attached log and info file for the commands used in training. It took about 9 hours on my PC for only about 1700 iterations, and then my PC froze, so I rebooted and created the traineddata from norlayer0.853_1615.lstm, i.e. 0.853% character error rate at iteration number 1615.
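
As a reading aid, the file name above packs the error rate and the iteration number together. The sketch below splits such a name into its parts. The pattern is inferred only from the single name quoted in this message, and `parse_checkpoint_name` is a hypothetical helper, not part of Tesseract.

```python
import re

# Pattern inferred from the one name in this thread (an assumption):
# <prefix><character error rate in percent>_<iteration>.lstm
_NAME = re.compile(
    r"^(?P<prefix>.*?)(?P<cer>\d+\.\d+)_(?P<iteration>\d+)\.lstm$"
)


def parse_checkpoint_name(name):
    """Return (prefix, error rate in %, iteration), or None if no match."""
    match = _NAME.match(name)
    if match is None:
        return None
    return match["prefix"], float(match["cer"]), int(match["iteration"])


if __name__ == "__main__":
    print(parse_checkpoint_name("norlayer0.853_1615.lstm"))
```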


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

nor-log-info.txt

Ludvig F Aarstad

Jan 9, 2017, 2:19:54 AM
to tesseract-ocr
Thanks Shree :D. Really appreciate it. Will this work with v3.03 too? I am basing my code on this: https://github.com/jourdant/powershell-paperless and there is a script to initialize the environment that is getting the tesseract files from here: https://nuget.org/api/v2/package/tesseract-ocr. Would you be able to point me in the right direction on how to move this from 3.03 to the 4.0alpha?




ShreeDevi Kumar

Jan 9, 2017, 2:34:18 AM
to tesser...@googlegroups.com

Sorry, I am not familiar with powershell and nuget.

If you are on Windows, you can try the experimental 4.0.0alpha binaries for gImageReader, a GUI front-end to Tesseract. You can OCR a PDF directly or load multiple images at the same time.

- excuse the brevity, sent from mobile



Ludvig F Aarstad

Jan 9, 2017, 2:48:23 AM
to tesseract-ocr
No worries, I will play around and see what I can get working. For now I am using a simple replace in my script to handle the Æ.
How would I go about compiling Tesseract 4.0 alpha using git and cmake? The wiki says the 4.0 alpha source code is available in the master branch of the repository, but I have yet to find it... The compiling part seems straightforward enough, but I need the source ;).

Tried installing the gimagereader hoping that it would give me the dll for tesseract 4.0, but no.

ShreeDevi Kumar

Jan 9, 2017, 3:14:46 AM
to tesser...@googlegroups.com

Actually, postprocessing with a replace for AE will be the best bet, as 4.0 is slower than the legacy Tesseract engine for Latin-based scripts.
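
A minimal sketch of that kind of postprocessing, written in Python for illustration (the original script in this thread is PowerShell). It assumes the OCR output shows the missing letter as a word-initial "AE" followed by a lowercase letter. The function name `restore_ae_ligature` and the exact rule are assumptions; words that legitimately start with "AE" would need an exception list.

```python
import re

# Word-initial "AE" followed by a lowercase letter, e.g. "AErlighet".
# All-caps tokens such as "AE" or "ISAE" are left untouched.
_AE_AT_WORD_START = re.compile(r"\bAE(?=[a-zæøå])")


def restore_ae_ligature(text):
    """Replace word-initial 'AE' before a lowercase letter with 'Æ'."""
    return _AE_AT_WORD_START.sub("Æ", text)


if __name__ == "__main__":
    print(restore_ae_ligature("Min adresse er AErfuglgaten 73."))
```

Restricting the rule to the start of a word keeps it from touching "AE" inside abbreviations, at the cost of missing all-caps text.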

You can experiment with 4.0.0alpha.

You will also need to compile the latest version of Leptonica before that.

Sources are at:

There is no separate src directory for tesseract. 

I used git clone to get the master branch and then use pull origin to update it. You can also download a zip of the current master.



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Ludvig F Aarstad

Jan 9, 2017, 6:05:11 PM
to tesseract-ocr
I think I might stick with the postprocessing for now, too many oddities I need to learn to be able to compile it ;). Still, I think this project is awesome and I might take it up a notch and try the same thing I am doing now, just using .NET code :)

Des Bw

Sep 16, 2023, 12:15:45 AM
to tesseract-ocr
I have exactly the same problem for Amharic. I find three characters missing, and they are screwing up the OCR result.
Dear Shree, can you help me please?
