In Spanish language, character ‘o’ is recognized incorrectly as some round symbol

67 views
Skip to first unread message

Subrato Namata

unread,
Sep 22, 2017, 3:24:08 AM9/22/17
to tesseract-ocr

Environment

Windows Setup: tesseract-ocr-setup-4.0.0-alpha.20170804.exe
Spanish Trained Data: https://github.com/tesseract-ocr/tessdata/raw/4.00/spa.traineddata
Command Used to OCR:
tesseract.exe ImageDoc.png output --oem 1 -l spa
Where ImageDoc.png is a Spanish Scanned Document
output is the text file output of OCRed text

  • Tesseract Version: 4.0
  • Platform: Windows version 64 Bit

Current Behavior:

In Spanish, character ‘o’ is recognized incorrectly as some round symbol. Attached input file is ImageDoc.png and Error screenshot

spanish
imagedoc





Expected Behavior:

Character ‘o’ should be recognized correctly.

Quan Nguyen

unread,
Sep 22, 2017, 2:32:51 PM9/22/17
to tesseract-ocr

Subrato Namata

unread,
Sep 23, 2017, 1:38:46 PM9/23/17
to tesseract-ocr
Thanks Quan Nguyen. My initial results show that the issue is gone. Let me try with few more samples.
Additionally, are these the best trained data of tesseract available for all the other languages and we must be using these only ?

Quan Nguyen

unread,
Sep 24, 2017, 10:56:29 AM9/24/17
to tesseract-ocr
It depends on your needs. There are also fast traineddata:


It looks that many languages are represented.

Quan Nguyen

unread,
Sep 26, 2017, 9:38:54 AM9/26/17
to tesseract-ocr
Reply all
Reply to author
Forward
0 new messages