No output when Chinese Traditional followed by dots or ellipsis


Danny

Jul 30, 2024, 8:23:38 AM
to tesseract-ocr
I have a problem where tesseract produces no output (zero byte output file) when presented with Chinese characters followed by either an ellipsis or three periods.

bad_sub_243.png

If I crop the image in photoshop to remove the dots, the three Chinese characters are recognized perfectly. Feeding the image above, or feeding just the three dots, produces no output.

I've just recompiled with the latest Git version (see below). I've also re-trained the chi_tra model several times and added many words with the three dots to the wordlist. The result is the same with both.

Any suggestions?

Command
tesseract bad_sub_243.png  output -l tqChiTra --loglevel TRACE   -c edges_debug=1   -c ambigs_debug_level=10   -c classify_debug_level=10   -c dawg_debug_level=3   -c wordrec_debug_blamer=1   -c tessedit_dump_choices=1   -c tessedit_debug_block_rejection=1   -c textord_noise_debug=1   -c applybox_debug=10

Messages
Warning: Parameter not found: language_model_ngram_on
Warning: Parameter not found: segsearch_max_char_wh_ratio
Warning: Parameter not found: language_model_ngram_space_delimited_language
Warning: Parameter not found: language_model_use_sigmoidal_certainty
Warning: Parameter not found: language_model_ngram_nonmatch_score
Warning: Parameter not found: classify_integer_matcher_multiplier
Warning: Parameter not found: assume_fixed_pitch_char_segment
Warning: Parameter not found: allow_blob_division
Warning: Parameter not found: segsearch_max_char_wh_ratio
Warning: Parameter not found: language_model_ngram_space_delimited_language
Warning: Parameter not found: language_model_use_sigmoidal_certainty
Warning: Parameter not found: language_model_ngram_nonmatch_score
Warning: Parameter not found: classify_integer_matcher_multiplier
Warning: Parameter not found: assume_fixed_pitch_char_segment
Warning: Parameter not found: allow_blob_division
Estimating resolution as 675
Row ending at (221,23.6372): R=9999, dc=3, nc=0, REJECTED
cleanup_blocks: # rows = 0 / 1
cleanup_blocks: # blocks = 0 / 1
Estimating resolution as 675
Row ending at (221,23.6372): R=9999, dc=3, nc=0, REJECTED
cleanup_blocks: # rows = 0 / 1
cleanup_blocks: # blocks = 0 / 1

Version
# tesseract --version
tesseract 5.4.1-11-g46b9
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 1.0.0
 Found AVX
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6 liblz4/1.8.1
 Found libcurl/7.61.1 OpenSSL/1.1.1c zlib/1.2.11 brotli/1.0.6 libidn2/2.2.0 libpsl/0.20.2 (+libidn2/2.0.5) libssh/0.9.0/openssl/zlib nghttp2/1.33.0

Danny

Aug 2, 2024, 5:13:23 AM
to tesseract-ocr
Can anyone suggest some debug settings I can activate to try to track down why I'm getting no output?
Thanks
Danny

Ger Hobbelt

Aug 5, 2024, 3:44:40 AM
to tesseract-ocr
Have you tried running this through a multi-model, multi-language Tesseract invocation, e.g. -l chi_tra+eng?

The idea behind this question: using dots as periods (and runs of dots serving as an ellipsis) is something that's mostly particular to European languages, while Chinese writing uses other means to signal the end of a sentence (sometimes you see a small circle serving as the period, for example).

While (1) we don't know the training details for the models Ray Smith produced at Google and subsequently published, my bet is that the period 'dot' and the … ellipsis symbol did not feature heavily in the Chinese training set (possibly not at all, though that can be checked by inspecting the charset defined as part of the training data file). And (2) yes, I see that Tesseract's internal preprocessing stages (binarization, noise reduction, ...) have a hard time dealing with noisy images that look 'clean' to the human eye: JPEG inputs and the like, e.g. camera and video screen grabs, which nearly always have travelled through some hidden MPEG/JPEG/similar lossy compression stage. My own tests indicate that the preprocessor may have detected the ellipsis and included it as part of the line image, but it MAY be that the subsequent OCR recognition stage dropped those characters because their ratings turned out too low.
Meanwhile, English, Latin, etc. have a much better chance of observing periods and ranking them as highly probable 'period' characters, since those symbols must have featured more heavily in their training sets by necessity. So it may be useful to run English, Latin, or a similar European-language model as a secondary language, to give Tesseract some higher rankings for those dot pixel lines to work with...
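For what it's worth, a minimal sketch of such a combined run, assuming the stock chi_tra and eng traineddata files are installed (substitute your custom model name as needed):

tesseract bad_sub_243.png output -l chi_tra+eng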





(More on Tesseract and image noise + text bounding boxes in the next couple of weeks, but I'm trying to organize that research as it kind of exploded in my face: instead of one issue, it is several, and none of them is easy to fix or work around.)



Danny

Aug 5, 2024, 6:03:46 AM
to tesseract-ocr
Hi Ger,
Thanks for the reply.

I haven't tried eng+chi yet, good suggestion.  

We did, in fact, do our own training for Traditional Chinese based on the existing traineddata.  We added about 1000 lines to the 48000 line chi_tra.wordlist and generated ground truth with two specific fonts. 

The unicharset file does not have the ellipsis, but it does have '.'. We'll add the ellipsis in the next training run.
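As a side note, a quick way to inspect what the shipped chi_tra model itself knows (a sketch, assuming the usual combine_tessdata component naming): unpack the traineddata with -u and grep the LSTM unicharset for the ellipsis character.

combine_tessdata -u chi_tra.traineddata chi_tra.
grep -n '…' chi_tra.lstm-unicharset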

But other errors suggest it is not a training problem but rather a problem with some earlier part of the image processing or feature identification.
For example, the image below converts to "EL" (yes, Latin "E" and "L").
sub_33.png
Because the training/OCR machine is "headless", I rewrote some of the ScrollView code to let the viewer run on a different machine. I don't know what the windows mean yet, but the output of the "Convolve" window is a big hint:

Screenshot 2024-08-05 at 17.53.03.png
This suggests that something (the block extractor?) is not finding the bounds of the character properly. This somewhat matches the output of an online OCR tool, https://www.i2ocr.com/free-online-chinese-traditional-ocr, which outputs "PAYy".
That online tool draws a bounding box in green, and it too includes only the bottom part of the character! (Perhaps it is using Tesseract under the hood.)

Screenshot 2024-08-05 at 17.56.12.png
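A cheap way to see exactly which bitmap Tesseract hands to the recognizer is the tessedit_write_images parameter, which dumps the pre-processed page as tessinput.tif in the working directory (worth double-checking the parameter name on your build with tesseract --print-parameters):

tesseract sub_33.png output -l chi_tra -c tessedit_write_images=1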

Our input images are consistent and "well formed": DVB subtitles extracted from various global TV operators. So all of the images would have been rendered directly from an original text file earlier in the broadcast chain.

Danny

Aug 5, 2024, 6:09:44 AM
to tesseract-ocr
I've tried adding "eng" for both language models. Same result: a zero-byte output file. Turning on a ton of random options, I got the output below. I don't yet know enough about how Tesseract works to understand what the message implies...

Textord::clean_noise_from_row - Row ending at (221,23.6372): R=9999, dc=3, nc=0, REJECTED

cleanup_blocks: # rows = 0 / 1
cleanup_blocks: # blocks = 0 / 1
Estimating resolution as 675
Textord::clean_noise_from_row - Row ending at (221,23.6372): R=9999, dc=3, nc=0, REJECTED

cleanup_blocks: # rows = 0 / 1
cleanup_blocks: # blocks = 0 / 1
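Reading that log: Textord::clean_noise_from_row marking the only detected row as REJECTED, followed by cleanup_blocks reporting 0 of 1 rows surviving, suggests the layout analysis noise filter is throwing the whole line away before the recognizer ever sees it. One possible diagnostic (parameter names as they appear in the textord noise-filter code; confirm they exist on your build with tesseract --print-parameters) is to turn that rejection off and see whether output reappears:

tesseract bad_sub_243.png output -l chi_tra+eng -c textord_noise_rejrows=0 -c textord_noise_rejwords=0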

Ger Hobbelt

Aug 5, 2024, 7:43:59 AM
to tesseract-ocr
Heh, re the faulty word/character boxes and redirected scrollview output images (the blue/yellow one): looks like we're duplicating work, as I have a patched tesseract repo that replaces scrollview with image+html log file output (work in progress).
Didn't see your work on GitHub: do you keep it off grid/off public?


Danny

Aug 5, 2024, 9:05:39 AM
to tesseract-ocr
I haven't been working on this for very long. My objective is quite narrow, to convert subtitles to text, and I thought stock Tesseract would be a quick solution. I was wrong.

I did the scrollView redirection only today because I was at a dead end and needed to visualize what was happening after suspecting the box calculation/assignment wasn't working.  Creating HTML would be way better!

The software is, umm, quite a challenging beast and needs a massive cleanup. 
To support some odd fonts with no Linux equivalents (I had to buy them from a font foundry in Taiwan), I replaced text2image.
So in all I've made a few big changes:
- replaced text2image with a new program on macOS that can use the specially licensed fonts and creates the image, box file, and .gt.txt files (the stock text2image usage it replaces is sketched below)
- replaced the Makefile (!) used for training with a readable, maintainable program that runs the training
- added new text and retrained chi_tra.traineddata
- created a "reviewer" which publishes data to a website that shows the original image, the output text and lets the user make corrections (to generate additional training data)

PastedGraphic-1.png
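For reference, this is roughly the stock text2image invocation that the macOS replacement covers (a sketch only; the font name, fonts directory, and text file are placeholders, and stock text2image emits the .tif/.box pair rather than a .gt.txt):

text2image --text=training_lines.txt --outputbase=chi_line_0001 --font="Some Taiwan Font" --fonts_dir=/path/to/fonts --max_pages=1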

I'm focusing on Chinese right now but will ultimately test a lot of other languages. Given the experience so far, I foresee problems with some extended-Latin languages; in particular, I worry about Hungarian and Vietnamese, with their many diacritics.

I haven't published any of this work so far. I'm not averse to doing so, but I'm a bit retentive about code style (I just can't look at the 'blob of barf' style in much of the code); I figured people would be upset/disturbed by changes to the style...

Tom Morris

Aug 5, 2024, 2:09:32 PM
to tesseract-ocr
On Tuesday, July 30, 2024 at 8:23:38 AM UTC-4 Danny wrote:
I have a problem where tesseract produces no output (zero byte output file) when presented with Chinese characters followed by either an ellipsis or three periods.


Command
tesseract bad_sub_243.png  output -l tqChiTra --loglevel TRACE   -c edges_debug=1   -c ambigs_debug_level=10   -c classify_debug_level=10   -c dawg_debug_level=3   -c wordrec_debug_blamer=1   -c tessedit_dump_choices=1   -c tessedit_debug_block_rejection=1   -c textord_noise_debug=1   -c applybox_debug=10

What page segmentation mode are you using? If you're using the default of full automatic page segmentation (designed for pages of uniform text), it's unlikely to work very well for closed captioning texts (a detail not mentioned here, but included later in the thread).

My test with the standard traditional Chinese model from tessdata gave this result:

tesseract image.png - -l chi_tra --psm 13
我 是 說 …

I don't read Chinese, so there may be some subtle differences in the characters, but they look pretty close to my eye.

Tom
 

Danny

Aug 5, 2024, 8:15:27 PM
to tesseract-ocr
Hi Tom,

Thanks for the suggestion!

We've been using PSM 6 (Assume a single uniform block of text) and, for that input image, it outputs nothing for both the stock chi_tra.traineddata and our in-house trained data file.

However... I just tried PSM 13 ("Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific") and do get some output!

With chi_tra: same as you got: 我 是 說 …
With in house model: 我是說...」

The characters themselves are correct.  The stock chi_tra model puts an extra space after each character.  I recall reading a bug report about that somewhere.

The spacing with our model is better, but it adds an extraneous closing square-quote. Another difference (not so significant) is that the stock model outputs an ellipsis character while the in-house model outputs three periods.

However, once in a while the subtitle image has two lines of text, which is why we chose PSM 6.  

multiline_sub_16.png

I tried the image above with PSM 13 and unfortunately it failed with both the stock chi_tra and our in-house model: m 論
Using PSM 6 works (but again chi_tra adds the extra spaces; our in-house model is better).

So, I'm thinking the issue is with the preprocessing, segmentation, and glyph identification more than the model itself.  
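In the meantime, a crude stopgap sketch for the mixed cases (file names are only examples; it leans on the zero-byte .txt symptom described above): run PSM 6 first and fall back to PSM 13 when nothing comes out.

img=bad_sub_243.png
tesseract "$img" out -l chi_tra --psm 6
if [ ! -s out.txt ]; then
  tesseract "$img" out -l chi_tra --psm 13
fi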

Tom Morris

Aug 7, 2024, 1:27:59 PM
to tesseract-ocr
On Monday, August 5, 2024 at 8:15:27 PM UTC-4 Danny wrote:

So, I'm thinking the issue is with the preprocessing, segmentation, and glyph identification more than the model itself.  

I agree with that, and I suspect you can do a better job of line segmentation than Tesseract can, since you have more information available to you about font size, context, etc.
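As a rough illustration of that idea (assuming ImageMagick is available and that a two-line subtitle splits at the vertical midpoint, which only holds for this kind of fixed subtitle layout), the image can be pre-split into line strips and each strip fed to Tesseract as a raw line:

convert multiline_sub_16.png -crop 100%x50% +repage line_%d.png
for f in line_*.png; do tesseract "$f" "${f%.png}" -l chi_tra --psm 13; done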

Tom

Danny

Aug 10, 2024, 2:44:26 AM
to tesseract-ocr
Yeah, that could be true. But I'm still trying to figure out where in the code to put any new segmentation and glyph identification.

BTW, to generate additional training data, I wrote a program on the Mac to scrape text from the subtitle images. The resulting OCR output from Apple's Vision framework is leagues ahead of Tesseract's. Too bad it is not open source and doesn't run on Linux.
