OCR of C code


Stuart

Sep 11, 2013, 8:24:41 PM
to tesser...@googlegroups.com
Hi,

I'm trying to convert some old C code I only have printouts of back to source. I expected to have to do a little editing, but Tesseract is having serious problems.

I scanned the pages at 800 DPI and they look clean. I also tried some of the ImageMagick scripts to clean them up; the result looks a bit cleaner on screen but did not help the OCR accuracy.

Searches on this topic yield loads of references on how to link the Tesseract libraries into your own C program, but nothing about actually OCRing C code.

I have tried adding user words for things like fprintf etc. and common variable names in the code, but it does not help (although I'm not entirely convinced I did it right).
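
For reference, a minimal sketch of one way to supply user words. It assumes the Tesseract 3.x convention of a word list stored as `<lang>.<suffix>` in the tessdata directory plus a config file that sets `user_words_suffix`. The word list, the config name `cwords`, and the directory layout are illustrative assumptions, not details from this thread.

```python
from pathlib import Path

# Illustrative identifiers only; use names taken from the actual listing.
C_WORDS = ["fprintf", "printf", "malloc", "sizeof", "struct", "typedef"]


def write_user_words(tessdata_dir, lang="eng", suffix="user-words", words=C_WORDS):
    """Write <lang>.<suffix> (one word per line) and a config naming the suffix."""
    tessdata = Path(tessdata_dir)
    configs = tessdata / "configs"
    configs.mkdir(parents=True, exist_ok=True)

    # Word list: one entry per line, duplicates removed.
    words_file = tessdata / f"{lang}.{suffix}"
    words_file.write_text("\n".join(sorted(set(words))) + "\n")

    # "cwords" is a hypothetical config name chosen for this sketch.
    config_file = configs / "cwords"
    config_file.write_text(f"user_words_suffix {suffix}\n")
    return words_file, config_file
```

The config name would then go at the end of the command line, for example `tesseract page.tif out -l eng cwords`. Whether the words are actually picked up is worth checking against the documentation for the installed version.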

Does anyone have any advice?

Should it work OK? Maybe it's the proportionally spaced Times Roman font the code is printed in that's causing the problems.

Thanks,

Stuart

Sven Pedersen

Sep 12, 2013, 7:08:25 AM
to tesser...@googlegroups.com
Try downscaling to about 300-400 dpi. Check the documentation for ideal character height. I think such high resolution would be out of range.
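
A small sketch of the arithmetic behind that suggestion: how much an 800 DPI scan has to be shrunk to match a 300 or 400 DPI scan. The US Letter page size is an assumption made only for the example.

```python
def downscale_percent(scan_dpi, target_dpi):
    """Resize percentage that gives the pixel size of a target_dpi scan."""
    if scan_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    return 100.0 * target_dpi / scan_dpi


def scaled_size(width_px, height_px, scan_dpi, target_dpi):
    """Pixel dimensions after resampling from scan_dpi to target_dpi."""
    factor = target_dpi / scan_dpi
    return round(width_px * factor), round(height_px * factor)


if __name__ == "__main__":
    # Assumed page: 8.5 x 11 inches scanned at 800 DPI = 6800 x 8800 pixels.
    for target in (300, 400):
        print(target, downscale_percent(800, target),
              scaled_size(6800, 8800, 800, target))
```

So an 800 DPI scan needs a resize to 37.5% for 300 DPI or 50% for 400 DPI, which could be passed to something like ImageMagick's `-resize 37.5%`.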
Sven
--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
 
---
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


--
“All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”

Robert Komar

Sep 12, 2013, 2:23:26 PM
to tesser...@googlegroups.com
I suspect the problem is more with the dictionary checking
phase than the character recognition. Since most of the
C code wouldn't show up as valid entries in the default
dictionary, it would end up being 'corrected' by
tesseract. I'm not sure if you can disable that phase,
but I think it would be worth looking into.
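
One thing to look into, as a sketch only: Tesseract 3.x has the parameters `load_system_dawg` and `load_freq_dawg`, which control whether the main and frequent-word dictionaries are loaded. The helper below builds a config file body that turns both off and the matching command line. The config name `nodict` is hypothetical, and whether this fully removes the 'correction' behaviour depends on the version.

```python
def no_dictionary_config():
    """Config text asking Tesseract not to load its built-in word lists."""
    settings = {
        "load_system_dawg": "F",  # main dictionary
        "load_freq_dawg": "F",    # frequent-words dictionary
    }
    return "".join(f"{name} {value}\n" for name, value in settings.items())


def tesseract_command(image, output_base, lang="eng", config="nodict"):
    """Argument list for a run that applies the given config file."""
    # "nodict" is a hypothetical config name; it must exist in tessdata/configs
    # (or be given as a path) with the text from no_dictionary_config().
    return ["tesseract", image, output_base, "-l", lang, config]


if __name__ == "__main__":
    print(no_dictionary_config(), end="")
    print(" ".join(tesseract_command("page_0003.tif", "page_0003")))
```

These are loaded at initialisation, so putting them in a config file is the safer route compared with setting them after start-up.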

Since the font is proportionally spaced, perhaps you could
automatically subdivide each image into character cells
and try to OCR each character separately. I don't know
if it would work, but it might be worth a try.

Rob Komar

Stuart

Sep 12, 2013, 9:06:08 PM
to tesser...@googlegroups.com
I experimented this evening and am now pretty sure my user word list is being used. I also reduced the scanning resolution and ran tests at 300 and 400 dpi; there is a little improvement. Looking properly at the page I'm working on, the double-spaced text OCRs correctly, so Tesseract works fine there. However, a lot of the letters in the proportional font are touching each other, and I think that's what's causing the problems.

Automatically subdividing each image into character cells and OCRing each character separately seems like the only way out of this. I am experimenting with makebox to define the boxes first.

Any better ideas?

Thanks,

Stuart

Robert Komar

Sep 12, 2013, 10:18:44 PM
to tesser...@googlegroups.com
On Thu, 12 Sep 2013, Stuart wrote:

> Automatically subdividing each image into character cells
> and OCRing each character separately seems like the only
> way out of this. I am experimenting with makebox to define
> the boxes first.

Argh! When I read "proportional font" I thought
"monospace font", assuming that that was what the code
had been printed in. That was why I suggested creating
the character cells, because it would be easy then.
I'm not sure it's worth trying to figure out where
the bounds of each character are, in your case.
Sorry for reading the problem incorrectly.
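
To show why the cell approach is easy only for a monospace font, here is a minimal sketch that generates the cell boxes for a fixed-pitch grid. The margins, cell size, and grid dimensions are made-up numbers; with a proportional font there is no constant `cell_w`, so this does not apply.

```python
def character_cells(left, top, cell_w, cell_h, columns, rows):
    """Yield (row, col, x0, y0, x1, y1) pixel boxes for a fixed-pitch text grid."""
    for row in range(rows):
        for col in range(columns):
            x0 = left + col * cell_w
            y0 = top + row * cell_h
            yield row, col, x0, y0, x0 + cell_w, y0 + cell_h


if __name__ == "__main__":
    # Hypothetical layout: 80 columns by 2 rows, cells of 20 x 40 pixels,
    # with the text block starting at pixel (100, 50).
    for cell in character_cells(100, 50, 20, 40, 80, 2):
        print(cell)
```

Each box could then be cropped out and recognised on its own, for example with Tesseract's single-character page segmentation mode.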

Rob

Stuart

Sep 13, 2013, 8:46:37 PM
to tesser...@googlegroups.com
Hi,

I played with textcleaner:

http://www.fmwconcepts.com/imagemagick/textcleaner/

I used these options:

textcleaner -g -e stretch -f 25 -o 10 -u -s 1 -T -p 10 -t 80 page_0003.jpg page_0003_clean.jpg

The "-t 80" option:

-t .... threshold ....... text smoothing threshold; 0<=threshold<=100;
......................... nominal value is about 50; default is no smoothing

thins the lines enough to make a difference to the run-together characters for Tesseract.

I played with several settings from 50 to 100 and 80 was the best for me.
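
A sketch of how that sweep could be scripted. The textcleaner options are the ones quoted above; the step of 10, the output file naming, and the plain `tesseract` call are assumptions for the example.

```python
def textcleaner_command(src, dst, threshold):
    """textcleaner invocation from this post, with the -t value as a parameter."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    return ["textcleaner", "-g", "-e", "stretch", "-f", "25", "-o", "10",
            "-u", "-s", "1", "-T", "-p", "10", "-t", str(threshold), src, dst]


def sweep(src, thresholds=range(50, 101, 10)):
    """Yield (clean_cmd, ocr_cmd) pairs, one per smoothing threshold."""
    stem = src.rsplit(".", 1)[0]
    for t in thresholds:
        cleaned = f"{stem}_t{t}.jpg"  # hypothetical naming scheme
        yield (textcleaner_command(src, cleaned, t),
               ["tesseract", cleaned, f"{stem}_t{t}"])


if __name__ == "__main__":
    for clean_cmd, ocr_cmd in sweep("page_0003.jpg"):
        print(" ".join(clean_cmd))
        print(" ".join(ocr_cmd))
```

Each pair can be handed to `subprocess.run(cmd, check=True)` and the resulting text files compared by eye or against a hand-corrected page.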

It's still only about 75% accurate, way below what Tesseract manages on the normal text I have, but it's going to work out.

Thanks for the help,

Stuart

Robert Komar

Sep 13, 2013, 8:59:24 PM
to tesser...@googlegroups.com
Hi Stuart,
if the characters that touch do so consistently, then
maybe you can train your own "language", including
in it the pairs of characters that usually connect.
I'm pretty sure that Google already does this for
cases like "fi" and "fl". You can then tell tesseract
to use both "english" and your new "language" when
doing OCR. I've never trained it myself, and usually
consider it to be a waste of time for English, but
in this case, it may be worth trying if correcting
by hand is going to take a really long time.
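
As a sketch of the last step: Tesseract accepts several languages joined with `+` in the `-l` option. The language name `ccode` below is hypothetical and stands for whatever traineddata file the training would produce.

```python
def ocr_command(image, output_base, languages=("eng", "ccode")):
    """Build a tesseract call that uses several languages at once."""
    if not languages:
        raise ValueError("at least one language is required")
    # "ccode" is a hypothetical custom language; its ccode.traineddata
    # would have to be installed in the tessdata directory.
    return ["tesseract", image, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    print(" ".join(ocr_command("page_0003_clean.jpg", "page_0003")))
```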

Cheers,
Rob

Stuart Hodges

Sep 13, 2013, 9:10:45 PM
to tesser...@googlegroups.com
Thanks for the idea.

I have scanned the first 40 pages and have a little script running cleaning and OCRing.

I will see how much of a pain it is tomorrow... I can't face it this evening.

Thanks,

Stuart


