--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b3b86804-5d86-4fac-a780-88a2ef4f2ba2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
combine_lang_model
based on the wordlist (langdata). I don't need it at the moment, but I think it's a good idea to clear that up in case I ever need to do some training from scratch, although I know that's a rare case.
On Wed, Sep 5, 2018 at 1:55 PM, <kaminski...@gmail.com> wrote:
Hi,

(I might butcher English grammar; you have been warned!)

For some time I have been trying to teach tesseract to read MRZ codes. Unfortunately it's not going very well. I'm using the latest version of tesseract (4.0), so I'm trying to train it with the LSTM method. I've managed to pull it off and got some custom traineddata files, but the results of using them are... let's say slightly unsatisfying. As a matter of fact, they are not even remotely close to the eng traineddata. I know that there was an mrz traineddata in the previous version of tesseract.

I'm out of ideas for how to improve accuracy, so I need your help, guys.

At first I thought I could use images, .txt files containing the already-read data, and font data to somehow make box files (basically, you have an image and a .txt containing everything read from the image). I was disappointed when I realized that without manual correction of the boxes tesseract won't know how to apply them correctly. Of course I need an automated method to apply boxes (I can't use any GUI or anything like that).

At the moment I'm only using .txt files, and these are the steps I'm doing (it's also worth mentioning that I'm trying to train from scratch):

- Using the .txt files and a font (OcrB) to create .tiff and box files with text2image
- Creating a unicharset from all box files
- (Optional, but for the sake of completeness) applying unicharset properties
- Getting a traineddata from the unicharset and langdata, using a custom language as a parameter
- Creating an .lstmf file by running tesseract on each .tiff with lstm.train
- Creating the list of files to train on
- Running LSTM training with net spec [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111] and learning rate 20e-4
- At the end, using the last checkpoint to create the traineddata for use

Currently the initial .txt files are randomly generated by my program in the form of MRZ codes (samples included). I also tried generating files with a mixed alphabet to get more variety of characters.

I was using about 1000 samples for training, and the result doesn't differ from using 100 samples. Also, I disabled the dictionary in the OCR process to prevent tesseract from treating the whole MRZ code as a word.

I might not understand some things despite reading a lot about this topic, but I'm pretty sure that I'm doing the training process correctly. Do you have any tips on how to improve the training process? Please point out even the dumbest things I could have forgotten.
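For anyone who wants to reproduce the "randomly generated MRZ text" step, here is a minimal Python sketch of one way to do it. It is not the poster's program. It assumes passport-style (TD3) two-line codes of 44 characters and uses the ICAO 9303 check-digit rule (weights 7, 3, 1; digits as themselves, A-Z as 10-35, '<' as 0). The function names and the random field choices are made up for illustration.

```python
import random
import string

# ICAO 9303 check-digit weights, repeated across the field.
WEIGHTS = (7, 3, 1)


def char_value(ch: str) -> int:
    """Numeric value of an MRZ character: 0-9, A=10..Z=35, '<'=0."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"not an MRZ character: {ch!r}")


def check_digit(field: str) -> str:
    """Weighted sum of the character values, modulo 10."""
    total = sum(char_value(c) * WEIGHTS[i % 3] for i, c in enumerate(field))
    return str(total % 10)


def td3_line2(doc_no, nationality, birth, sex, expiry, personal):
    """Assemble the second TD3 line, including all five check digits."""
    doc = doc_no + check_digit(doc_no)
    dob = birth + check_digit(birth)
    tail = expiry + check_digit(expiry) + personal + check_digit(personal)
    # The composite check digit skips the nationality and sex fields.
    composite = check_digit(doc + dob + tail)
    return doc + nationality + dob + sex + tail + composite


def random_td3(rng: random.Random) -> tuple[str, str]:
    """Return both lines of a random (not real) passport-style MRZ."""
    letters = string.ascii_uppercase
    country = "".join(rng.choices(letters, k=3))
    surname = "".join(rng.choices(letters, k=rng.randint(3, 12)))
    given = "".join(rng.choices(letters, k=rng.randint(3, 10)))
    line1 = f"P<{country}{surname}<<{given}".ljust(44, "<")

    def date() -> str:
        # YYMMDD; days are capped at 28 so every date is valid.
        return f"{rng.randint(0, 99):02d}{rng.randint(1, 12):02d}{rng.randint(1, 28):02d}"

    doc_no = "".join(rng.choices(letters + string.digits, k=9))
    line2 = td3_line2(doc_no, country, date(), rng.choice("MF<"), date(), "<" * 14)
    return line1, line2


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print("\n".join(random_td3(rng)))
```

Writing one code per text file (or a few dozen per file) gives text that can be fed to text2image. Since every letter and digit is drawn uniformly, each character should show up in the training text, but that alone says nothing about whether the trained model will be accurate.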
--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
see https://github.com/Shreeshrii/tessdata_ocrb
Retrained to add missing X
using 3 fonts at 3 exposures and a larger training text compared to previous version.
Both float/best and integer/fast versions are provided.
-l ocrb
-l ocrb-int
If you can provide another 40-50 lines of training data (text file), I will rerun the training.
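As a rough illustration of how the two models above might be called from a script, here is a small Python sketch that builds the tesseract command line. It assumes the traineddata files from the repository have been copied into a local directory (the "tessdata_ocrb" default below is a placeholder), that page segmentation mode 6 (single uniform block of text) suits a cropped MRZ image, and that the load_system_dawg / load_freq_dawg variables are the way the dictionary is switched off. The helper name is made up.

```python
import subprocess


def build_tesseract_command(image, model="ocrb", tessdata_dir="tessdata_ocrb",
                            disable_dictionary=True):
    """Build the argv list for one tesseract run that prints text to stdout."""
    cmd = [
        "tesseract", image, "stdout",
        "--tessdata-dir", tessdata_dir,  # placeholder directory, adjust as needed
        "-l", model,                     # "ocrb" (float/best) or "ocrb-int" (integer/fast)
        "--psm", "6",                    # assumption: MRZ cropped to one text block
    ]
    if disable_dictionary:
        # Assumption: these config variables turn off the word lists.
        cmd += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return cmd


def read_mrz(image, **kwargs):
    """Run tesseract and return its text output (needs tesseract installed)."""
    result = subprocess.run(build_tesseract_command(image, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling read_mrz("mrz.png", model="ocrb-int") would use the integer/fast model; the file name is only an example.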