OCR Output contains "xlz"

Danny Wilson

Oct 15, 2023, 9:44:08 AM

to tesseract-ocr

Running tesseract on a single Chinese character "對" outputs the character, but also the text "xlz".

Command line:

tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c preserve_interword_spaces=1

The output is two lines:

xlz

對

It used to output "sMz" but after retraining several times with the specific font in use, it now outputs "xlz".

Why?

I've attached the image file in question...

(Searching the source code, I found that the file universalambigs.h has a line " xlZ le 1", which is similar, but not identical, to the errant text I'm seeing.)

Thank you.

Zdenko Podobny

Oct 15, 2023, 10:20:47 AM

to tesser...@googlegroups.com

Seems like you should put this question to the author of the language data "ARYuanB5-MD"...

Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/76ed2f78-e10f-4b9f-8d61-30f4b0f333dbn%40googlegroups.com.

Danny Wilson

Oct 15, 2023, 9:08:32 PM

to tesser...@googlegroups.com

I guess I am the author... ARYuanB5-MD is the font.

For further background, the stock tessdata_best/chi_tra.traineddata did not do a good job at all on the text I'm trying to recognize.

So I retrained:

- copied the existing Chinese wordlist and added additional characters and sentences (47,000 lines in total)

- rendered ground truth images (with the special font) and box files

- used lang data from "chi_tra" (config, unicharset, Han.xx, Latin.xx, radical-stroke etc)

- ran lstmtraining with 30,000 iterations

lstmtraining completed with BCER of 0.846:

At iteration 2689/30000/30013, mean rms=0.244%, delta=0.426%, BCER train=1.425%, BWER train=3.900%, skip ratio=0.000%, New worst BCER = 1.425 wrote checkpoint.
Finished! Selected model with minimal training error rate (BCER) = 0.846

Then I copied the output ARYuanB5-MD.traineddata to the tessdata directory.

With that traineddata, OCR is very good on the input text... except for the "對" character, which outputs the extra "xlz".

Neither the ground-truth nor the wordlist has "xlz" anywhere in it.

Any suggestions on how to track this down?

Thanks

Danny Wilson

Oct 16, 2023, 3:34:39 AM

to tesser...@googlegroups.com

After running tesseract with various debug switches activated, I've found that it thinks there are two characters in the image and tries OCR on each of them.

Changing the page segmentation mode changes the output:
PSM 6 (single uniform block of text): outputs garbage plus the correct character
PSM 7 (single text line): works correctly
PSM 8 (single word): works correctly
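
A comparison like this can be scripted. The following is a minimal sketch, assuming the `tesseract` binary is on the PATH; `build_command` and `compare_psms` are hypothetical helper names for illustration, not part of Tesseract. It relies on tesseract printing to the terminal when the output base is given as `stdout`.

```python
import subprocess


def build_command(image, lang, psm, dpi=72):
    """Return the tesseract argument list for one page segmentation mode.

    Passing "stdout" as the output base makes tesseract print the
    recognized text instead of writing a .txt file.
    """
    return ["tesseract", image, "stdout", "-l", lang,
            "--dpi", str(dpi), "--psm", str(psm)]


def compare_psms(image, lang, psms=(6, 7, 8)):
    """Run tesseract once per PSM and collect the recognized text."""
    results = {}
    for psm in psms:
        proc = subprocess.run(build_command(image, lang, psm),
                              capture_output=True, text=True, check=False)
        results[psm] = proc.stdout.strip()
    return results
```

Calling `compare_psms("sub0089w.png", "ARYuanB5-MD")` would return a dict keyed by PSM number, which makes it easy to see which modes emit the extra text.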

The debug output is below.

This raises a new issue: the input data (TV subtitles) are a mixture of 1 or 2 line text blocks. And a 1-line text block might be a single character in this case.

So the ideal page segmentation mode should be 6, no? But looking at the debug output, it thinks there are two characters in the input image...

That doesn't sound like a training issue but rather some problem with how it identifies glyphs in the input image...

OCR Error.png

Danny Wilson

Oct 16, 2023, 3:39:37 AM

to tesser...@googlegroups.com

The command line did not get included in my last mail. Sending again now.

$ tesseract sub0089w.png debugOut -l ARYuanB5-MD --dpi 72 --psm 6 -c classify_debug_level=1

Processing word with lang ARYuanB5-MD at:Bounding box=(3,45)->(33,56)
Trying word using lang ARYuanB5-MD, oem 1
Best choice: accepted=0, adaptable=0, done=1 : Lang result : Ll : R=10.3645, C=-11.8365, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
str L l
1 new words better than 0 old words: r: 10.3645 v 0 c: -11.8365 v 0 valid dict: 0 v 0

Processing word with lang ARYuanB5-MD at:Bounding box=(3,3)->(56,58)
Trying word using lang ARYuanB5-MD, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result : 對 : R=3.09071, C=-1.8713, F=1, Perm=2, xht=[0,3.40282e+38], ambig=0
str 對
state: 1
1 new words better than 0 old words: r: 3.09071 v 0 c: -1.8713 v 0 valid dict: 0 v 0

$ cat debugOut.txt
Ll
對
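
In the log above, the first bounding box lies inside the second one, so the same region is apparently being processed twice. As a rough sketch for spotting this automatically, the snippet below parses the `Bounding box=` lines and reports nested pairs. It assumes the two coordinate pairs are opposite corners of each box; `parse_boxes` and `contains` are hypothetical helper names.

```python
import re

# Matches lines such as "... at:Bounding box=(3,45)->(33,56)".
BOX_RE = re.compile(r"Bounding box=\((\d+),(\d+)\)->\((\d+),(\d+)\)")


def parse_boxes(log_text):
    """Return (x0, y0, x1, y1) tuples in order of appearance."""
    return [tuple(int(v) for v in m.groups())
            for m in BOX_RE.finditer(log_text)]


def contains(outer, inner):
    """True if `inner` lies entirely within `outer` (edges may touch)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


# The two "Processing word" lines copied from the debug output.
log = """
Processing word with lang ARYuanB5-MD at:Bounding box=(3,45)->(33,56)
Processing word with lang ARYuanB5-MD at:Bounding box=(3,3)->(56,58)
"""

boxes = parse_boxes(log)
# Every (outer, inner) pair where one box fully encloses another.
nested = [(a, b) for a in boxes for b in boxes
          if a != b and contains(a, b)]
print(nested)
```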

Tom Morris

Oct 16, 2023, 4:29:59 PM

to tesseract-ocr

On Monday, October 16, 2023 at 3:34:39 AM UTC-4 Danny wrote:

This raises a new issue: the input data (TV subtitles) are a mixture of 1 or 2 line text blocks. And a 1-line text block might be a single character in this case.

So the ideal page segmentation mode should be 6, no? But looking at the debug output, it thinks there are two characters in the input image...

It's not terribly surprising that "page" segmentation gets confused by a single character, although I'm a little surprised that it came up with overlapping bounding boxes.

Since the TV image capture is presumably fixed resolution and it sounds like you've only got a single font to deal with, it seems like you can tell based on the image bounds whether you've got a single line (PSM 7) or more than one line (PSM 6).

It's been a long time since I looked at it, but closed captioning is usually encoded in the signal digitally in a side band channel, which would be a much simpler way to extract it.

Tom

Danny Wilson

Oct 16, 2023, 8:46:50 PM

to tesser...@googlegroups.com

Hi Tom,

I was hoping not to introduce heuristics before scanning the images, but it sounds like the page segmentation in tesseract is not smart enough.

So from what you say, if the input image is:

a) "square-ish" : PSM 10 Single Character

b) approx. 1x the character height in the given font: PSM 7 Single Line

c) approx. Nx character height: PSM 6 Uniform Block
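
A heuristic along these lines might look like the sketch below. It is only an illustration under stated assumptions: `choose_psm` is a hypothetical helper, `char_height` is the expected glyph height in pixels for the subtitle font, and `tolerance` is an arbitrary slack factor for padding around the text. It uses PSM 7 for the single-line case, matching the single text line mode tested earlier in the thread.

```python
def choose_psm(width, height, char_height, tolerance=0.35):
    """Pick a tesseract --psm value from the image dimensions.

    width, height: image size in pixels.
    char_height:   expected glyph height in pixels (assumed known).
    tolerance:     hypothetical slack factor; tune for real data.
    """
    if char_height <= 0:
        raise ValueError("char_height must be positive")
    lines = height / char_height
    if lines <= 1 + tolerance:
        # About one line tall: a roughly square image is taken to be
        # a single character, anything wider a single text line.
        if width <= height * (1 + tolerance):
            return 10  # treat the image as a single character
        return 7  # treat the image as a single text line
    return 6  # assume a single uniform block of text
```

For example, with an assumed glyph height of 55 pixels, a 60x61 image maps to PSM 10, a 400x61 image to PSM 7, and a 400x120 image to PSM 6. These sizes are illustrative, not measured from the actual subtitles.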

For your reference, closed captions used in US, Canada, and Korea are text based. DVB Subtitles, used in the rest of the world, are bit map pictures.

Danny

Tom Morris

Oct 26, 2023, 10:39:11 AM

to tesseract-ocr

On Monday, October 16, 2023 at 8:46:50 PM UTC-4 Danny wrote:

For your reference, closed captions used in US, Canada, and Korea are text based. DVB Subtitles, used in the rest of the world, are bit map pictures.

Good to know. I guess that's what happens when the standards bodies optimize for BOM cost (character generator ROMs) vs system complexity. :(

Tom
