Difficulties attempting OCR of a grid of single letters


Sabbasofa

Aug 20, 2022, 7:07:27 AM
to tesseract-ocr
Hey all,

I'm trying to extract the letters for a word search. 

Here is my input image: https://i.imgur.com/7zEEx1b.jpg

and this is my best try so far:

tesseract input.jpg out --psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ

this is the output it generated:

FATSITNEDNACIUREMATITLGTOAS
JTUNSMARROZXOIFRRIHWKTILZZC
SRNENDEPBAYWSEIJNSMDTITITISR
OEAAIRIARACHMVMIJIODTFUPSAIJE
UMLVFUAEOVMLUPYJKGAIJQHTIREW
TELRGIIWNPLPMEGGYVYAXIZSTTSAH
HDIGNNEBEIJDTIZEIFGRIMGAOODO
ARHEBTGWYIKNVURIMZOETIVMR
FQOAPODEDTIRTVEEMTILIJASOTFN
RRGOSWRDETINIWHTITETTIOGEREA
IEDNSNAULRREOATABEGQPNRTIN
CLNRAATTEATBRHSTNIGTIBTT
ALDOKTSNOCBSSOAERTFXATILIOOE
VIUVCUAPHPNINUNGPXXUOBL
YMMEAOBHAEDAEZATNEZETVEODO
OBARIJNTIUPREETILIKGAREEAZWSIBP
GAOTSOREDTTFPSBAKIKTSIBAENE
RSJEWEOLSAFQUUPZTAUIRIBZTILNII
QIYITYKEKHUMMPDETILOEUEIZBM
TJDYLSUHIPUCNUBDDVITLIKRST
JTKFEBIKNKMMOITFUMUEVDBUPGD
FADIUZALBUFIJLIFMMOLTWNZU
OOODMONMOPSREPMUIJDUOLTCTCP


It's not bad, but it isn't good enough.

Any idea what I could try? Thank you.


Nikhil Fande

Aug 20, 2022, 7:36:01 AM
to tesser...@googlegroups.com
Hi,

If some of the letters are not detected, you can run Tesseract on the image again after whitening out the letters that were already detected, using their coordinates.

For sparse text images like this one, PSM 11 usually works well.
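
A rough sketch of the whitening step, in case it helps. It assumes the letter coordinates are available in Tesseract's box-file layout (one line per character: character, left, bottom, right, top, page, with the origin at the bottom-left corner) and that the image has been loaded as a grayscale list of pixel rows. The function names parse_box_lines and whiten_boxes are made up for illustration and are not part of Tesseract.

```python
def parse_box_lines(box_text):
    """Parse box-file lines of the form '<char> <left> <bottom> <right> <top> <page>'."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        char = parts[0]
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append((char, left, bottom, right, top))
    return boxes


def whiten_boxes(pixels, boxes, white=255):
    """Return a copy of a grayscale image (list of rows) with every box filled white.

    Box coordinates use a bottom-left origin, so the y values are flipped
    to index rows counted from the top of the image.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    out = [row[:] for row in pixels]  # leave the input image untouched
    for _char, left, bottom, right, top in boxes:
        row_start = max(0, height - top)
        row_end = min(height, height - bottom)
        col_start = max(0, left)
        col_end = min(width, right)
        for y in range(row_start, row_end):
            for x in range(col_start, col_end):
                out[y][x] = white
    return out
```

The whitened image can then be saved and passed to Tesseract a second time, so that only the letters missed in the first pass remain.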

Regards,
Nikhil


Sabbasofa

Aug 20, 2022, 8:32:26 AM
to tesseract-ocr
It is the opposite, actually: the letters are detected, but more than once. For example, in the third row the original image shows "I I I", but Tesseract reads it as "TITITI".

I tried every PSM, and 6 gave the best results. PSM 11 doesn't understand that there are 23 rows and fails to recognize about 50% of the letters.

Example output by psm 11:

FA

TS

N

AC

REM

LGT

O

NSM

...and so on. 
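
To compare PSM settings less by eye, a small check like the one below could flag rows that probably contain duplicated or dropped letters. It is only a sketch: it assumes every row of the word search has the same number of letters, and when no column count is given it guesses the grid width from the most common row length, which may be wrong. The name check_grid and its parameters are hypothetical.

```python
from collections import Counter


def check_grid(ocr_text, expected_rows, expected_cols=None):
    """Report whether OCR output has the expected row count and which rows look off."""
    rows = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    if not rows:
        return {"row_count_ok": expected_rows == 0, "expected_cols": expected_cols, "suspects": []}
    if expected_cols is None:
        # Assumption: the most common row length is the true grid width.
        expected_cols = Counter(len(r) for r in rows).most_common(1)[0][0]
    # Each suspect is (1-based row number, extra letters); negative means letters are missing.
    suspects = [
        (i + 1, len(r) - expected_cols)
        for i, r in enumerate(rows)
        if len(r) != expected_cols
    ]
    return {
        "row_count_ok": len(rows) == expected_rows,
        "expected_cols": expected_cols,
        "suspects": suspects,
    }
```

With the command from the first post, Tesseract writes its result to out.txt, so the check would be called as check_grid(open("out.txt").read(), expected_rows=23). A positive number for a row suggests extra letters such as the "TITITI" case, while a failed row count points to the kind of fragmentation seen with PSM 11.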