Recognizing blurred dots as CJK characters

Jeetendra Ahuja

Nov 25, 2019, 9:43:43 AM

to tesseract-ocr

Before processing a document, we want to reject those that are CJK, so I've used Tesseract for this. It does a pretty good job, but sometimes, when the document quality is low, most of the dots on the "Table of Contents" page are recognized as CJK characters. I am planning to create my own training data, but wanted to get advice from experts first.

Config:

Tesseract 4.0
instance.setLanguage("chi_simB+chi_traB+korB+jpnB+engB");
instance.setOcrEngineMode(1);

Image is zoomed to 600% in Adobe PDF reader.

Please let me know.

image001.png
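
For illustration, the rejection step described above could be sketched as a post-check on the recognized text: count what fraction of the letters belong to CJK scripts and reject only above a threshold, so that a few dot leaders misread as CJK characters do not flag a mostly Latin page. This is a minimal sketch, not the poster's actual code; the class name `CjkRatio`, the method names, and the 0.30 threshold are all assumptions, and it does not fix the misrecognition itself.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper: estimates how much of an OCR result is CJK text.
final class CjkRatio {

    /** True when the code point belongs to a script used for Chinese, Japanese, or Korean. */
    static boolean isCjk(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
                || script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA
                || script == UnicodeScript.HANGUL;
    }

    /** Fraction of letters (digits, punctuation, and spaces are ignored) that are CJK. */
    static double cjkRatio(String text) {
        long letters = text.codePoints().filter(Character::isLetter).count();
        if (letters == 0) {
            return 0.0; // nothing to judge, treat as non-CJK
        }
        long cjk = text.codePoints()
                .filter(cp -> Character.isLetter(cp) && isCjk(cp))
                .count();
        return (double) cjk / letters;
    }

    /** The threshold is an assumed tuning knob, to be chosen from real samples. */
    static boolean looksCjk(String text, double threshold) {
        return cjkRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        String tocLine = "Chapter 1 ..... 中 5"; // one dot leader misread as a Han character
        System.out.println(cjkRatio(tocLine));
        System.out.println(looksCjk(tocLine, 0.30));
    }
}
```

The input would be whatever string the OCR call returns for the page. A page whose dot leaders are almost all misread could still exceed the threshold, so this only reduces false rejections.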

Shree Devi Kumar

Nov 25, 2019, 9:48:08 AM

to tesseract-ocr

Have you tried `osd` (orientation and script detection)?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/95138faa-307f-4417-b72c-648ab84993d9%40googlegroups.com.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Jeetendra Ahuja

Nov 25, 2019, 11:15:20 AM

to tesseract-ocr

No, not yet. I will try it. Thanks.

Shree Devi Kumar

Nov 25, 2019, 11:36:48 AM

to tesseract-ocr

Also try with 300 dpi
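
When the input is a PDF, the resolution is set at the point where each page is rendered to an image. PDF page sizes are measured in points (1/72 inch), so the pixel size at a given DPI is simple arithmetic. The sketch below shows only that arithmetic; the class name `RenderSize` and the US Letter page size are assumptions for illustration, and the rendering library itself is not shown.

```java
// Hypothetical helper: converts PDF page dimensions (points) to pixels at a target DPI.
final class RenderSize {

    static final double POINTS_PER_INCH = 72.0; // default PDF user unit

    /** Scale factor to pass to a renderer that works in "points times scale". */
    static double scaleFor(int dpi) {
        return dpi / POINTS_PER_INCH;
    }

    /** Pixel length of an edge given in points, rounded to the nearest pixel. */
    static int pixels(double points, int dpi) {
        return (int) Math.round(points * scaleFor(dpi));
    }

    public static void main(String[] args) {
        double widthPt = 612.0;  // US Letter width in points (assumed page size)
        double heightPt = 792.0; // US Letter height in points
        int dpi = 300;
        System.out.println(pixels(widthPt, dpi) + " x " + pixels(heightPt, dpi)); // 2550 x 3300
    }
}
```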

Jeetendra Ahuja

Nov 25, 2019, 11:42:29 AM

to tesseract-ocr

I tried 400 DPI and set the page segmentation mode to 1 (AUTO_OSD).

No improvement; the problem is that the PDF itself is of low quality.

shree

Nov 26, 2019, 4:06:23 AM

to tesseract-ocr

tesseract image001.png - --psm 0

Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 625
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 5.30
Script: Latin
Script confidence: 3.64
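
To use a result like this for the original goal of rejecting CJK documents, the `Script` and `Script confidence` lines can be parsed and compared against a list of script names. The following is a minimal sketch under stated assumptions: the class name `OsdResult`, the set of CJK script names, and the confidence cutoff are hypothetical and should be checked against what the installed OSD data actually reports. It takes the OSD text as a plain string and does not call Tesseract itself.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder for the two script-related fields of `--psm 0` output.
final class OsdResult {

    // Assumed names for CJK scripts; verify against real OSD output before relying on them.
    private static final Set<String> CJK_SCRIPTS = Set.of("Han", "Hangul", "Japanese", "Korean");

    private static final Pattern SCRIPT =
            Pattern.compile("(?m)^Script:\\s*(\\S+)");
    private static final Pattern CONFIDENCE =
            Pattern.compile("(?m)^Script confidence:\\s*([0-9]+(?:\\.[0-9]+)?)");

    final String script;
    final double scriptConfidence;

    OsdResult(String script, double scriptConfidence) {
        this.script = script;
        this.scriptConfidence = scriptConfidence;
    }

    /** Extracts the script name and its confidence; warnings and other lines are ignored. */
    static OsdResult parse(String osdOutput) {
        Matcher s = SCRIPT.matcher(osdOutput);
        Matcher c = CONFIDENCE.matcher(osdOutput);
        if (!s.find() || !c.find()) {
            throw new IllegalArgumentException("No script information in OSD output");
        }
        return new OsdResult(s.group(1), Double.parseDouble(c.group(1)));
    }

    /** True only when a CJK script was detected with at least the given confidence. */
    boolean isConfidentCjk(double minConfidence) {
        return CJK_SCRIPTS.contains(script) && scriptConfidence >= minConfidence;
    }

    public static void main(String[] args) {
        String output = String.join("\n",
                "Warning: Invalid resolution 0 dpi. Using 70 instead.",
                "Orientation in degrees: 0",
                "Script: Latin",
                "Script confidence: 3.64");
        OsdResult result = parse(output);
        // The 1.0 cutoff is an arbitrary placeholder, not a recommended value.
        System.out.println(result.script + " -> reject as CJK: " + result.isConfidentCjk(1.0));
    }
}
```

For the sample above, the detected script is Latin, so the page would not be rejected.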
