Using Tesseract 5.5.0 to recognize source code, but need a way to maintain original indentation.

Jay S

Apr 22, 2025, 11:34:41 PM
to tesseract-ocr
I'm using PowerShell 7 right now to automate Tesseract.

This is my input image:

Code.png

I'm able to accurately get the code recognized using this:

& tesseract.exe "D:\Dev\OCR\Images\Code.png" stdout --psm 3 --oem 1 -l eng

The above command outputs:

Code_2ABLEus0Tk.png
Which is great. But I have no idea how to actually reconstruct the original indentation of the code from the input image.

Questions:
1. Is there a known and simple process that I can follow to reconstruct the indentation?
2. -c preserve_interword_spaces=1 doesn't seem to do anything.
3. Would tsv or hocr output be applicable here? And if so, which format would be the best for this task?
4. It seems that hocr is generally HTML with bounding box information... is there a way to convert this to the original indentation from the image somehow?

Has anyone here arrived at a workable solution for extracting code from an image and keeping its alignment?

There is an AI-powered app called Pieces that seems to do this perfectly (https://pieces.app/).
I dug into their source code and found references to tesseract, so I think they are using it under the hood for OCR. But I have no clue how they are reconstructing the indentation.

Any help or direction would be greatly appreciated.

TheComplete BookOfMormon

Apr 23, 2025, 5:50:31 AM
to tesser...@googlegroups.com
I expect you will need to use the bounding boxes, which the tsv output gives you:

tesseract.exe "input.jpg" stdout -l eng --psm 6 -c tessedit_char_whitelist=" abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789[]{}()=+-*/\|&$%^#@!~`';:,<>.?_" tsv

This will output:
image.png

The level column identifies the element type (page, block, paragraph, line, word). Level 5 rows are words and carry the recognized text, so we select all level 5 rows and group them by line number.
Finally, we deduce the indent size from the first "left" value that is > 0, and then add indents to each line accordingly.
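
For anyone who wants to try the idea outside PowerShell, here is a minimal Python sketch of the steps described above. It is not the attached script, only an illustration under some assumptions: the function name reindent_from_tsv, the 4-space indent string, and the min_step_px tolerance (to ignore pixel jitter) are all made up for this example. It groups words by page, block, paragraph, and line number rather than by line number alone, since line numbers can restart in each paragraph, and it takes the smallest offset from the leftmost line as one indent step.

```python
import csv
import io
from collections import defaultdict


def reindent_from_tsv(tsv_text, indent="    ", min_step_px=5):
    """Rebuild indented text from Tesseract TSV output (word rows are level 5)."""
    # QUOTE_NONE keeps quote characters in recognized code from confusing the parser.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    lines = defaultdict(list)
    for row in reader:
        text = row.get("text")
        if row["level"] != "5" or text is None or not text.strip():
            continue  # skip page/block/paragraph/line rows and empty words
        key = tuple(int(row[k]) for k in
                    ("page_num", "block_num", "par_num", "line_num"))
        lines[key].append((int(row["left"]), text))
    if not lines:
        return ""

    # Order lines by their key, and words within a line from left to right.
    ordered = [sorted(words) for _, words in sorted(lines.items())]
    lefts = [words[0][0] for words in ordered]
    base = min(lefts)  # leftmost line start is treated as indent level 0

    # Assumption: the smallest offset above the tolerance is one indent step.
    steps = [left - base for left in lefts if left - base >= min_step_px]
    unit = min(steps) if steps else None

    out = []
    for words, left in zip(ordered, lefts):
        level = round((left - base) / unit) if unit else 0
        out.append(indent * level + " ".join(text for _, text in words))
    return "\n".join(out)
```

To use it, capture the output of the tsv command above as a string and pass it in. Spacing inside a line is collapsed to single spaces here, so column-aligned comments would still need the word "left" values to be mapped to character columns.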

ps1 attached as txt file

tess.txt