--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/7bffab95-3e6b-4165-929e-a152f1799703%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
For LSTM training, box files need an additional line for each text line, containing the tab character, to indicate a new line.

If you have existing box/tiff pairs, you can use a box editor (such as jTessBoxEditor) to insert a box at the end of each line and put a tab character in it.
> On the toolbar, the Character textbox has a built-in conversion function. If you enter U+0009 and hit the Enter key or click on the adjacent Tool icon, the escape sequences will be converted to Unicode.

You can also enter the tab character via Alt+09 on the numpad keys on Windows, or add a dummy sequence such as @@@ and then replace it with the tab character in a text editor.
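The dummy-sequence route can also be scripted instead of done by hand in a text editor. This is a minimal sketch, assuming the placeholder @@@ sits at the start of a box line and is followed by a space and the box coordinates. The function names are hypothetical and are not part of Tesseract or jTessBoxEditor.

```shell
# Hypothetical helper: print the box file given as $1 with each leading
# "@@@" placeholder replaced by a literal tab character (U+0009).
replace_placeholder() {
    tab=$(printf '\t')
    sed "s/^@@@ /${tab} /" "$1"
}

# Hypothetical check: count the lines that start with a tab, i.e. the
# end-of-line markers. The count should match the number of text lines.
count_line_markers() {
    tab=$(printf '\t')
    grep -c "^${tab}" "$1"
}
```

For example, `replace_placeholder in.box > out.box` followed by `count_line_markers out.box` (file names are placeholders) gives a quick check that every text line received its marker.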
See the attached files for a sample.

Then modify tesstrain.sh to copy the box/tiff pairs to the training directory before starting training:

    mkdir -p ${TRAINING_DIR}
    tlog "\n=== Starting training for language '${LANG_CODE}'"
    cp ./*.box "${TRAINING_DIR}/"
    cp ./*.tif "${TRAINING_DIR}/"
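Because the cp lines copy whatever is present, it may help to confirm first that every box file has a matching image. This is a sketch, assuming each pair shares a base name (foo.box with foo.tif). check_pairs is a hypothetical helper, not part of tesstrain.sh.

```shell
# Hypothetical helper: report box files in directory $1 (default: current
# directory) that have no matching .tif. Returns non-zero if any are missing.
check_pairs() {
    dir=${1:-.}
    missing=0
    for box in "$dir"/*.box; do
        [ -e "$box" ] || continue      # the glob matched nothing
        tif="${box%.box}.tif"
        if [ ! -e "$tif" ]; then
            echo "missing image for: $box" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_pairs . || exit 1` just before the cp lines would stop the script early when a pair is incomplete.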
On Tue, Feb 7, 2017 at 8:27 PM, Kay-Michael Würzner <wuer...@gmail.com> wrote:
+1 for this question. The training documentation for Tesseract 4.0 currently covers only training with font files (synthetic materials). What is missing is information on training with real data (i.e. manually aligned ground truth).

Any hints on that matter are greatly appreciated.

Cheers,
Kay
On Wednesday, January 18, 2017 at 12:31:54 AM UTC+1, chen...@huawei.com wrote:
I have a bunch of images containing English words. I would like to generate training data from these images and do the training. How should I do that?
Thanks a lot.