Tesseract on python code

J S

Nov 22, 2021, 12:55:00 AM
to tesseract-ocr
Hi all,
I am trying to OCR some code written in Python. I've read the Tesseract docs many times and applied 3 preprocessing scripts with ImageMagick. The resulting image is attached.
I then send it to Tesseract with ```--psm 4```, which seems to be the most suitable segmentation mode for what I am trying to do. The result is quite OK, but the indentation is lost, and I think it could still be improved.

I would be glad to have some advice on improving the result. Thanks a lot

Best, 
IMG_20211108_141234.py
IMG_20211108_141234.jpg

Zdenko Podobny

Nov 22, 2021, 6:42:23 AM
to tesser...@googlegroups.com
OCR of source code with tesseract is problematic:
  • tesseract is not designed to preserve spaces/indentation - you have to reconstruct it yourself (e.g. by parsing the hOCR output)
  • tesseract is geared more towards "real" text, while source code is more symbolic, with a lot of extra characters, case sensitivity, etc. So I am quite sure you will need to correct the tesseract output manually.
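
A rough sketch of the hOCR idea, in case it helps. This is not something tesseract does for you, only one possible way to do it yourself: read the word bounding boxes from the hOCR output, estimate an average character width, and turn each line's left offset into leading spaces. The names `HocrWords` and `reindent` are made up for this example, and it assumes a roughly monospaced font and an unskewed image.

```python
import re
from html.parser import HTMLParser
from statistics import median

# hOCR stores geometry in the title attribute, e.g. "bbox 36 92 96 116; x_wconf 96"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect [x0, x1, text] entries for each word, grouped by hOCR line."""

    def __init__(self):
        super().__init__()
        self.lines = []          # one list of words per ocr_line element
        self._in_word = False    # True while inside an ocrx_word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])
        elif "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                if not self.lines:
                    self.lines.append([])
                x0, _, x1, _ = map(int, match.groups())
                self.lines[-1].append([x0, x1, ""])
                self._in_word = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_word = False

    def handle_data(self, data):
        if self._in_word:
            self.lines[-1][-1][2] += data


def reindent(hocr: str) -> str:
    """Rebuild text lines with leading spaces estimated from bbox x offsets."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()

    # Drop empty words and empty lines.
    lines = []
    for raw in parser.lines:
        words = [(x0, x1, text.strip()) for x0, x1, text in raw if text.strip()]
        if words:
            lines.append(words)
    if not lines:
        return ""

    # Median pixel width per character, taken over all words.
    char_w = median((x1 - x0) / len(t) for ln in lines for x0, x1, t in ln) or 1.0
    left = min(ln[0][0] for ln in lines)   # leftmost line start = column 0

    out = []
    for ln in lines:
        indent = round((ln[0][0] - left) / char_w)
        out.append(" " * indent + " ".join(t for _, _, t in ln))
    return "\n".join(out)
```

The hOCR file itself can be produced with something like `tesseract IMG_20211108_141234.jpg out --psm 4 hocr`, which writes `out.hocr`; pass its contents to `reindent`. The estimated indentation will be noisy, so snapping it to multiples of 4 may be worth trying, and blank lines in the original code are not recovered.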

Zdenko


On Mon, Nov 22, 2021 at 6:54, J S <jszal...@gmail.com> wrote:

J S

Nov 22, 2021, 4:06:15 PM
to tesseract-ocr

Thanks a lot, Zdenko. I am disappointed, but that's life :-(