Recognizing blurred dots as CJK characters


Jeetendra Ahuja

Nov 25, 2019, 9:43:43 AM
to tesseract-ocr
Before processing a document, we want to reject those that are CJK, so I've used Tesseract for this. It does a pretty good job, but sometimes, when the document quality is low, most of the dots on the "Table of Contents" page are recognized as CJK characters. I am planning to create my own training data but wanted to get advice from experts first.

Config:
  • Tesseract 4.0
  • instance.setLanguage("chi_simB+chi_traB+korB+jpnB+engB");
  • instance.setOcrEngineMode(1);

Image is zoomed to 600% in Adobe PDF reader.

Please let me know.

image001.png

Shree Devi Kumar

Nov 25, 2019, 9:48:08 AM
to tesseract-ocr
Have you tried `osd` (orientation and script detection)?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/95138faa-307f-4417-b72c-648ab84993d9%40googlegroups.com.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Jeetendra Ahuja

Nov 25, 2019, 11:15:20 AM
to tesseract-ocr
No, not yet. I will try it. Thanks.



Shree Devi Kumar

Nov 25, 2019, 11:36:48 AM
to tesseract-ocr
Also try with 300 dpi.


Jeetendra Ahuja

Nov 25, 2019, 11:42:29 AM
to tesseract-ocr
I tried 400 DPI and set the page segmentation mode to 1 (AUTO_OSD).
No improvement; the problem is that the PDF itself is of low quality.



shree

Nov 26, 2019, 4:06:23 AM
to tesseract-ocr
 tesseract image001.png - --psm 0

Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 625
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 5.30
Script: Latin
Script confidence: 3.64
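
If it helps, the OSD output above could be turned into an accept/reject decision with something like the following Java sketch. It only parses the text printed by `--psm 0` and does not call Tesseract itself. The class name `OsdScriptCheck`, the `minConfidence` threshold, and the list of script names treated as CJK are assumptions made for illustration, not something Tesseract defines, so check them against what your own `osd.traineddata` reports.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decides whether a page should be rejected as CJK,
// based only on the text that `tesseract <image> - --psm 0` prints.
final class OsdScriptCheck {

    // ASSUMPTION: script names that count as CJK. Verify against your OSD output.
    private static final Set<String> CJK_SCRIPTS =
            Set.of("han", "hangul", "hiragana", "katakana", "japanese", "korean");

    // Matches the "Script: <name>" line (not the "Script confidence:" line).
    private static final Pattern SCRIPT =
            Pattern.compile("(?m)^Script:[ \\t]*(\\S+)");
    // Matches the numeric value on the "Script confidence:" line.
    private static final Pattern CONFIDENCE =
            Pattern.compile("(?m)^Script confidence:[ \\t]*([0-9]+(?:\\.[0-9]+)?)");

    // Returns true only when OSD names a CJK script with enough confidence.
    // Missing or unparsable fields are treated as "do not reject".
    static boolean shouldRejectAsCjk(String osdOutput, double minConfidence) {
        Matcher script = SCRIPT.matcher(osdOutput);
        Matcher confidence = CONFIDENCE.matcher(osdOutput);
        if (!script.find() || !confidence.find()) {
            return false;
        }
        boolean cjk = CJK_SCRIPTS.contains(script.group(1).toLowerCase(Locale.ROOT));
        return cjk && Double.parseDouble(confidence.group(1)) >= minConfidence;
    }

    public static void main(String[] args) {
        // The OSD lines from the run above.
        String osdOutput = String.join("\n",
                "Page number: 0",
                "Orientation in degrees: 0",
                "Rotate: 0",
                "Orientation confidence: 5.30",
                "Script: Latin",
                "Script confidence: 3.64");
        // The threshold 1.0 is an arbitrary placeholder; tune it on real documents.
        System.out.println("reject=" + shouldRejectAsCjk(osdOutput, 1.0));
    }
}
```

For the output above the script is Latin, so this page would not be rejected. Because OSD looks at the page as a whole, a few blurred dot leaders should have less influence on the result than they do when every glyph is run through the CJK recognition models.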