
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/76ed2f78-e10f-4b9f-8d61-30f4b0f333dbn%40googlegroups.com.
At iteration 2689/30000/30013, mean rms=0.244%, delta=0.426%, BCER train=1.425%, BWER train=3.900%, skip ratio=0.000%, New worst BCER = 1.425 wrote checkpoint.
Finished! Selected model with minimal training error rate (BCER) = 0.846
On 15 Oct 2023, at 22:20, Zdenko Podobny <zde...@gmail.com> wrote:
Seems like you should put this question to the author of the language data "ARYuanB5-MD"...
Zdenko
On Sun, 15 Oct 2023 at 15:44, 'Danny Wilson' via tesseract-ocr <tesser...@googlegroups.com> wrote:
Running tesseract on a single Chinese character "對" outputs the character, but also the text "xlz".
Command line:
tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c preserve_interword_spaces=1
The output is two lines:
xlz
對
It used to output "sMz" but after retraining several times with the specific font in use, it now outputs "xlz".
Why?
I've attached the image file in question...
<sub0089w.png>
(Searching the source code, the file universalambigs.h has a line " xlZ le 1", which is similar to, but not an exact match for, the errant text I'm finding.)
Thank you.
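As a stopgap while the cause is unknown, the stray Latin line could be filtered out after OCR. The sketch below is only an illustration: the helper names `has_cjk` and `drop_non_cjk_lines` are made up for this example, and it assumes every valid subtitle line contains at least one Chinese ideograph, which would wrongly discard any legitimate Latin-only line.

```python
def has_cjk(text):
    """Return True if text contains at least one CJK ideograph.

    Only the CJK Unified Ideographs block (U+4E00..U+9FFF) and
    Extension A (U+3400..U+4DBF) are checked; rarer extensions are ignored.
    """
    return any(
        "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"
        for ch in text
    )


def drop_non_cjk_lines(ocr_text):
    """Keep only the OCR output lines that contain a CJK ideograph.

    Assumption: lines without any ideograph (such as "xlz") are noise.
    """
    kept = [line for line in ocr_text.splitlines() if has_cjk(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    # The two-line output reported above.
    print(drop_non_cjk_lines("xlz\n對\n"))
```

This only hides the symptom; it does not explain why the model emits the extra text.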
This raises a new issue: the input data (TV subtitles) are a mixture of one-line and two-line text blocks. And a one-line text block might be a single character, as in this case.
So the ideal page segmentation mode should be 6, no? But looking at the debug output, Tesseract thinks there are two characters in the input image...
For your reference, closed captions used in the US, Canada, and Korea are text-based. DVB subtitles, used in the rest of the world, are bitmap pictures.
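One possible way to handle the mix of single characters, single lines, and two-line blocks is to pick the page segmentation mode per image instead of always using 6. The sketch below is a guess at such a heuristic, not something confirmed in this thread: `choose_psm` and `build_command` are hypothetical helpers, and it assumes each subtitle bitmap is tightly cropped and that the approximate glyph height in pixels is known. The `--psm` values themselves are Tesseract's documented ones (6 = single uniform block, 7 = single text line, 10 = single character).

```python
def choose_psm(width, height, glyph_height):
    """Pick a Tesseract --psm value from the bitmap size.

    Hypothetical heuristic: assumes a tightly cropped image and
    roughly square CJK glyphs of about glyph_height pixels.
    """
    if glyph_height <= 0:
        raise ValueError("glyph_height must be positive")
    lines = height / glyph_height  # approximate number of text lines
    chars = width / glyph_height   # approximate glyphs per line
    if lines >= 1.5:
        return 6   # single uniform block of text (two-line subtitle)
    if chars < 1.5:
        return 10  # single character
    return 7       # single text line


def build_command(image, out_base, lang, psm, dpi=72):
    """Build the tesseract argument list used in this thread, with a chosen psm."""
    return [
        "tesseract", image, out_base,
        "-l", lang,
        "--dpi", str(dpi),
        "--psm", str(psm),
        "-c", "preserve_interword_spaces=1",
    ]


if __name__ == "__main__":
    # Example sizes are invented for illustration only.
    psm = choose_psm(width=40, height=40, glyph_height=36)
    print(" ".join(build_command("sub0089w.png", "debugOut", "ARYuanB5-MD", psm)))
```

The 1.5 thresholds are arbitrary and would need tuning against real subtitle bitmaps; whether mode 10 actually removes the extra "xlz" output with this model is untested.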