Groups

In Spanish language, character ‘o’ is recognized incorrectly as some round symbol

67 views

Skip to first unread message

Subrato Namata

unread,

Sep 22, 2017, 3:24:08 AM9/22/17

to tesseract-ocr

Environment

Windows Setup: tesseract-ocr-setup-4.0.0-alpha.20170804.exe
Spanish Trained Data: https://github.com/tesseract-ocr/tessdata/raw/4.00/spa.traineddata
Command Used to OCR:
tesseract.exe ImageDoc.png output --oem 1 -l spa
Where ImageDoc.png is a Spanish Scanned Document
output is the text file output of OCRed text

Tesseract Version: 4.0
Platform: Windows version 64 Bit

Current Behavior:

In Spanish, character ‘o’ is recognized incorrectly as some round symbol. Attached input file is ImageDoc.png and Error screenshot

Expected Behavior:

Character ‘o’ should be recognized correctly.

Quan Nguyen

unread,

Sep 22, 2017, 2:32:51 PM9/22/17

to tesseract-ocr

Try best traineddata:

https://github.com/tesseract-ocr/tessdata_best

Subrato Namata

unread,

Sep 23, 2017, 1:38:46 PM9/23/17

to tesseract-ocr

Thanks Quan Nguyen. My initial results show that the issue is gone. Let me try with few more samples.

Additionally, are these the best trained data of tesseract available for all the other languages and we must be using these only ?

Quan Nguyen

unread,

Sep 24, 2017, 10:56:29 AM9/24/17

to tesseract-ocr

It depends on your needs. There are also fast traineddata:

https://github.com/tesseract-ocr/tessdata_fast

It looks that many languages are represented.

Quan Nguyen

unread,

Sep 26, 2017, 9:38:54 AM9/26/17

to tesseract-ocr

The Wiki page offers more info:

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu