Segmenting symbols and words, with their locations and line numbers, from a text image containing C++ code

Pramit Mazumdar

Jul 6, 2015, 6:52:14 AM
to tesser...@googlegroups.com

I have a C++ program stored as a text image. I need to extract the symbols, alphanumerics, and tokens from that image, along with their dimensions. Here "dimensions" means the start row, end row, start column, and end column, in pixels, within the text image. For example, suppose a text image (in *.png format) contains the following C++ code:

#include<iostream>
using namespace std; 

I have to write MATLAB code that reads the above image and generates the following dataset:

+-------------+-----------+-----------+---------+--------------+------------+
| Line Number |   Item    | Start_Row | End_Row | Start_Column | End_Column |
+-------------+-----------+-----------+---------+--------------+------------+
|           1 | #         | ---       | ---     | ---          | ---        |
|           1 | include   | ---       | ---     | ---          | ---        |
|           1 | <         | ---       | ---     | ---          | ---        |
|           1 | iostream  | ---       | ---     | ---          | ---        |
|           1 | >         | ---       | ---     | ---          | ---        |
|           2 | using     | ---       | ---     | ---          | ---        |
|           2 | namespace | ---       | ---     | ---          | ---        |
|           2 | std       | ---       | ---     | ---          | ---        |
|           2 | ;         | ---       | ---     | ---          | ---        |
+-------------+-----------+-----------+---------+--------------+------------+

I think the task can be divided into three parts: first, segmenting the words in the text image; second, identifying their coordinates; third, counting the line numbers.
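As a rough illustration of the third part (line counting), here is a minimal Python sketch rather than MATLAB. It assumes the bounding boxes are already available as dictionaries with `start_row` and `end_row` keys measured in pixels from the top of the image; that layout and the function name `assign_line_numbers` are hypothetical, chosen only for this example. Boxes whose row ranges overlap are treated as belonging to the same text line.

```python
def assign_line_numbers(boxes):
    """Give each box a 1-based text line number.

    `boxes` is a list of dicts with 'start_row' and 'end_row' keys
    (pixel rows, origin at the top of the image; assumed layout).
    """
    numbered = []
    line_no = 0
    line_end = -1  # last pixel row covered by the current text line
    for box in sorted(boxes, key=lambda b: b["start_row"]):
        if box["start_row"] > line_end:
            # No vertical overlap with the current line: start a new one.
            line_no += 1
            line_end = box["end_row"]
        else:
            # Overlaps the current line: extend its vertical extent.
            line_end = max(line_end, box["end_row"])
        numbered.append((line_no, box))
    return numbered
```

The result comes back in top-to-bottom order, so it may need re-sorting by line number and start column. Very small marks (for example a lone quote and a lone full stop on the same line) might not overlap vertically and would then be split incorrectly.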

For the first two parts I have used Tesseract-OCR, which identifies words as well as their respective coordinates. Below is how I am extracting the words and their coordinates. [I manually converted the image from PNG to TIF format, as described in the Tesseract manual.]

<Path to Tesseract-OCR folder>\tesseract.exe "image.tif" output \*extracts words*\
<Path to Tesseract-OCR folder>\tesseract.exe "image.tif" output makebox \*extracts word dimensions*\

As output, I get the extracted words in a text file named output.txt. However, the makebox command finds the coordinates of every single character in the image, whereas I need the coordinates of each word (in this case, symbols and tokens separately).
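One possible way to get from character boxes to word boxes is to merge neighbouring characters whose horizontal gap is small. The Python sketch below assumes the usual Tesseract box-file layout of `symbol left bottom right top page` per line, with the origin at the bottom-left of the image, so rows are flipped using the image height. The function names, the dictionary keys, and the `max_gap` threshold are assumptions made for this illustration, not anything Tesseract provides.

```python
def parse_box_file(text, image_height):
    """Parse box-file text into per-character dicts with a top-left origin."""
    chars = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        chars.append({
            "item": parts[0],
            "start_row": image_height - top,    # flip the y axis
            "end_row": image_height - bottom,
            "start_col": left,
            "end_col": right,
        })
    return chars


def merge_into_words(chars, max_gap):
    """Merge consecutive characters that sit on the same line and are
    at most `max_gap` pixels apart horizontally."""
    words = []
    for ch in chars:
        prev = words[-1] if words else None
        same_line = (prev is not None
                     and ch["start_row"] <= prev["end_row"]
                     and ch["end_row"] >= prev["start_row"])
        if same_line and abs(ch["start_col"] - prev["end_col"]) <= max_gap:
            prev["item"] += ch["item"]
            prev["start_row"] = min(prev["start_row"], ch["start_row"])
            prev["end_row"] = max(prev["end_row"], ch["end_row"])
            prev["start_col"] = min(prev["start_col"], ch["start_col"])
            prev["end_col"] = max(prev["end_col"], ch["end_col"])
        else:
            words.append(dict(ch))  # copy so the input is not modified
    return words
```

A suitable `max_gap` depends on the font size in the image and would have to be tuned. The row values may also need a one-pixel adjustment for 1-based indexing in MATLAB.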

So my question is: how can I generate a text file that shows the coordinates of each symbol, alphanumeric, and token separately, instead of each character?

Is there an option in Tesseract that extracts each word and its coordinates directly from the image file, instead of each character? I wonder whether I would need a lexical analyzer for this. If so, how could I use it along with Tesseract?
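To make the lexical-analyzer idea concrete, here is a minimal Python sketch of a very small tokenizer applied to per-character boxes. It assumes one box per character, already sorted in reading order, stored as dictionaries with `item`, `start_row`, `end_row`, `start_col`, and `end_col` keys (a hypothetical layout). The regular expression is a simplification that keeps runs of letters, digits, underscores, and dots together and treats every other symbol as its own token. It is not a full C++ lexer.

```python
import re

# Word-like runs (letters, digits, '_', '.') or a single punctuation symbol.
TOKEN_RE = re.compile(r"\w[\w.]*|[^\w\s]")


def split_into_tokens(chars):
    """Split a run of character boxes into tokens with merged boxes.

    Assumes each entry in `chars` holds exactly one character in 'item',
    so string positions line up with list positions.
    """
    text = "".join(c["item"] for c in chars)
    tokens = []
    for match in TOKEN_RE.finditer(text):
        span = chars[match.start():match.end()]
        tokens.append({
            "item": match.group(),
            "start_row": min(c["start_row"] for c in span),
            "end_row": max(c["end_row"] for c in span),
            "start_col": min(c["start_col"] for c in span),
            "end_col": max(c["end_col"] for c in span),
        })
    return tokens
```

Applied to the character boxes of `#include<iostream>`, this yields the items `#`, `include`, `<`, `iostream`, and `>`, each with the bounding box of its characters.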

This is how I have approached the problem. If there is a simpler way to accomplish the goal, please share it with me. Thank you.
