Hm, in Norwegian it isn't that rare. Or at least shouldn't be ;). Æ is the uppercase version of æ, and it would never occur in the middle of a word.
I find it strange that it has been left out altogether. What must I do to get it in there?
Something like:
1. Create an image of the letter to add
2. Update wordlist
3. etc etc
4. build something
5. upload to github
Or am I simply totally off the track?
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/27ab99d7-eab0-4b9f-9086-2ccb16292ac9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Ray is planning to retrain the languages for the new 4.0.0 version sometime in January, so it would be helpful if you could open an issue at https://github.com/tesseract-ocr/langdata/issues with this information.
Also, if you can provide a representative sample of Norwegian text that includes Æ, I will try the fine-tuning procedure outlined by Ray in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
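A quick way to check that a text sample really is representative for the request above (Norwegian text including Æ) is to count the letters of interest before sending it. This is only a minimal sketch: the function name `letter_coverage`, the letter set, and the sample sentence are made up for illustration and are not part of any Tesseract tooling.

```python
import unicodedata
from collections import Counter


def letter_coverage(text, letters="ÆØÅæøå"):
    """Count how often each letter of interest occurs in a text sample.

    The default letter set is an assumption (the Norwegian letters outside
    a-z); pass a different string to check other characters.
    """
    # Normalise to NFC so that e.g. "a" + combining ring is counted as "å".
    text = unicodedata.normalize("NFC", text)
    letters = unicodedata.normalize("NFC", letters)
    counts = Counter(text)
    return {ch: counts.get(ch, 0) for ch in letters}


# Hypothetical sample text, for illustration only.
sample = "Ærlig talt, Æsops fabler er ære verdt. Øl og åtte ål."
coverage = letter_coverage(sample)

# Letters that never occur in the sample and would need more text.
missing = [ch for ch, n in coverage.items() if n == 0]

for ch, n in coverage.items():
    print(f"{ch}: {n}")
print("missing:", missing)
```

Any letter reported as missing would presumably need additional sentences in the sample before it is useful as training text.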
On Wed, Jan 4, 2017 at 8:57 PM, Ludvig F Aarstad <lud...@aarstad.org> wrote:
If someone feels up to it, is there any chance of dumbing down the procedure for adding a missing letter to the Norwegian language? I am happy to do the legwork; I just need to understand the concept, and I don't quite follow it when reading the guides.
An easy list containing the steps would do just fine.
Something like:
1. Create an image of the letter to add
2. Update wordlist
3. etc etc
4. build something
5. upload to github
Or am I simply totally off the track?
I have uploaded the modified nor.traineddata at ... See the attached log and info file for the commands used in training.
Training took about 9 hours on my PC and only got to about 1700 iterations before the PC froze, so I rebooted and created the traineddata from norlayer0.853_1615.lstm, i.e. a 0.853% character error rate at iteration 1615.
ShreeDevi
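The file name norlayer0.853_1615.lstm mentioned above carries the error rate and the iteration number, so several saved files can be compared without opening them. The sketch below assumes the pattern `<prefix><error>_<iteration>.lstm`, which is inferred only from that one file name; the regular expression and the helper `parse_checkpoint_name` are hypothetical and not part of Tesseract.

```python
import re

# Assumed pattern: <prefix><error rate>_<iteration>.lstm
# (inferred from the single file name quoted in this thread).
CHECKPOINT_RE = re.compile(
    r"^(?P<prefix>.+?)(?P<error>\d+\.\d+)_(?P<iteration>\d+)\.lstm$"
)


def parse_checkpoint_name(name):
    """Split a file name into (prefix, error rate in percent, iteration)."""
    match = CHECKPOINT_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return (
        match.group("prefix"),
        float(match.group("error")),
        int(match.group("iteration")),
    )


prefix, error_rate, iteration = parse_checkpoint_name("norlayer0.853_1615.lstm")
print(f"{prefix}: {error_rate}% character error rate at iteration {iteration}")
```

With a list of such names, picking the lowest error rate would be a matter of taking `min(..., key=lambda name: parse_checkpoint_name(name)[1])`.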
On Fri, Jan 6, 2017 at 5:59 PM, ShreeDevi Kumar <shree...@gmail.com> wrote:
@Peter, have you tried the 4.0.0alpha version yet?
@Ludvig F. Aarstad - The "add a layer" training worked for adding 'Æ'. I will upload the new traineddata so that you can test it. You will need the 4.0.0alpha version for testing.
Here are a couple of the training tifs and the OCRed text.
ShreeDevi
On Fri, Jan 6, 2017 at 5:01 PM, Peter <pe...@peterkrantz.se> wrote:
On Thursday, 5 January 2017 at 04:39:01 UTC+1, shree wrote:
Ray is planning to retrain the languages for the new 4.0.0 version sometime in January. So it would be helpful if you could open an issue on https://github.com/tesseract-ocr/langdata/issues with this information.
Is it possible to contribute training data for this effort? I realise Swedish will not be at the top of the list, but I think it would be easy to involve some of the research community here in contributing training data if it could improve the language model.
/Peter
Sorry, I am not familiar with powershell and nuget.
If you are on Windows, you can try the experimental 4.0.0alpha binaries for gImageReader, a GUI front end to Tesseract. You can OCR a PDF directly or load multiple images at the same time.
- excuse the brevity, sent from mobile