Bitmap subtitles are not detected properly


Anshul Maheshwari

Aug 12, 2015, 10:53:47 AM
to tesseract-ocr
Hi,

I am trying to recognize the characters in the attached images, but Tesseract fails to detect them.

Tesseract works fine when the image contains a lot of text, but it fails when there is very little, for example only 2 or 3 words.

Detection works for file_5.png but fails for file_0.png and file_6.png.

Thanks
Anshul Maheshwari
file_0.png
file_6.png
file_5.png

Anshul Maheshwari

Aug 12, 2015, 11:22:25 AM
to tesseract-ocr
I checked again after cropping the image, and the text is now recognized correctly.
I gave the attached image to Tesseract.

I cropped the image manually in GIMP just now, so I now have to think of an algorithm that can do this automatically.

-Anshul
file_6.png

Anshul Maheshwari

Aug 12, 2015, 11:26:19 AM
to tesseract-ocr
Does Tesseract need some empty space before the point where a word starts? I was thinking of scanning the image horizontally and, wherever the data does not change, removing those vertical lines (columns).
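
A minimal sketch of that column-scanning idea, in plain Python. It assumes the bitmap has already been reduced to rows of 0/1 values (1 = text pixel); `trim_columns` and its `margin` parameter are hypothetical names for illustration, not part of Tesseract or Leptonica. It deliberately keeps a few background columns on each side, on the assumption that text cropped flush to the glyphs may be harder to recognize.

```python
def trim_columns(rows, margin=2):
    """Crop leading and trailing all-background columns, keeping `margin` columns."""
    if not rows or not rows[0]:
        return rows
    width = len(rows[0])
    # Indices of columns that contain at least one foreground pixel.
    used = [x for x in range(width) if any(row[x] for row in rows)]
    if not used:
        return rows  # blank image: nothing to crop against
    left = max(used[0] - margin, 0)
    right = min(used[-1] + margin + 1, width)
    return [row[left:right] for row in rows]


bitmap = [
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
]
cropped = trim_columns(bitmap, margin=1)
print(cropped)  # [[0, 1, 1, 0, 0], [0, 1, 0, 1, 0]]
```

The same scan applied to rows instead of columns would trim empty space above and below the text.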

Thanks

Anshul

Anshul Maheshwari

Aug 12, 2015, 11:32:04 AM
to tesseract-ocr
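Or maybe I have not mapped the CLUTs of the image correctly to the alpha channel:

    /* Convert the palette-indexed bitmap into a 32-bit RGBA Leptonica image. */
    for (i = 0; i < h; i++)
    {
        /* Start of row i; wpl is the number of 32-bit words per line. */
        ppixel = data + i * wpl;
        for (j = 0; j < w; j++)
        {
            /* Look up the palette (CLUT) entry for this pixel. */
            index = indata[i * w + j];
            composeRGBPixel(palette[index].red, palette[index].green, palette[index].blue, ppixel);
            /* Store the per-entry alpha in the pixel's alpha byte. */
            SET_DATA_BYTE(ppixel, L_ALPHA_CHANNEL, alpha[index]);
            ppixel++;
        }
    }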
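
One way to check whether the CLUT/alpha mapping is the problem is to flatten the indexed bitmap onto a solid background and look at the result before handing it to the OCR step. The sketch below does this in plain Python. It assumes `palette` is a list of `(r, g, b)` tuples and `alpha` a parallel list of 0..255 values (the C code above uses struct fields instead); `flatten_indexed` and `background` are hypothetical names, not Leptonica functions.

```python
def flatten_indexed(indata, w, h, palette, alpha, background=(255, 255, 255)):
    """Map palette indices to RGB, blending each pixel over `background` by its alpha."""
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            index = indata[i * w + j]
            a = alpha[index]  # 0 = fully transparent, 255 = fully opaque
            row.append(tuple(
                (c * a + b * (255 - a)) // 255
                for c, b in zip(palette[index], background)
            ))
        out.append(row)
    return out


palette = [(0, 0, 0), (255, 255, 255)]  # entry 0: black, entry 1: white
alpha = [0, 255]                        # entry 0 transparent, entry 1 opaque
pixels = flatten_indexed([0, 1, 1, 0], 2, 2, palette, alpha, background=(0, 0, 0))
print(pixels[0])  # [(0, 0, 0), (255, 255, 255)]
```

If the flattened image shows clear text on a uniform background, the palette mapping is probably fine and the issue lies elsewhere, such as page segmentation.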


zdenko podobny

Aug 12, 2015, 11:52:35 AM
to tesser...@googlegroups.com
Quick reply ;-): have a look at TessBaseAPIGetComponentImages. There is a Python example [1] for the C API, so you should be able to follow it if you are familiar with the Tesseract C API.
Just change
    tesseract.TessBaseAPISetPageSegMode(api, PSM_AUTO_OSD)
to
    tesseract.TessBaseAPISetPageSegMode(api, PSM_SINGLE_LINE)

and adjust the filename and TESSDATA_PREFIX...

I got output like this:

image width: 720
image height: 36
Found 1 textline image components.
Box[0]: x=78, y=0, w=148, h=36, confidence: 89, text: downpours


Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a5e06019-3417-4fec-b5af-390239596fc4%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Anshul Maheshwari

Aug 12, 2015, 12:10:15 PM
to tesseract-ocr
But I am not sure there will always be only one line in a bitmap subtitle; there could be as many as 4 lines. It looks like fixing this case will break
other subtitle cases.

I am not sure whether your solution will also work on multi-line input.
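
For the multi-line case, one possible approach is to split the bitmap into horizontal bands first and then treat each band as a single line. The sketch below is only an illustration in plain Python: it assumes the image is already a list of rows of 0/1 values (1 = text pixel) and that the lines are separated by at least one completely empty row; `split_lines` is a hypothetical helper, not a Tesseract function.

```python
def split_lines(rows):
    """Return (top, bottom) row ranges, bottom exclusive, for each run of non-empty rows."""
    bands = []
    start = None
    for y, row in enumerate(rows):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                  # a new text band begins
        elif not has_ink and start is not None:
            bands.append((start, y))   # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(rows)))  # band runs to the bottom edge
    return bands


bitmap = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
print(split_lines(bitmap))  # [(1, 3), (4, 5)]
```

Each returned band could then be cropped out and recognized with PSM_SINGLE_LINE, so the single-line setting would not need to change for subtitles with more lines.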

-Anshul

zdenko podobny

Aug 12, 2015, 12:34:28 PM
to tesser...@googlegroups.com
Well you never know until you try ;-) 

I did not intend to provide you with a final solution; I just wanted to point out that there is the GetComponentImages function, which could at least help you get bounding boxes for the text areas.
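
As a rough illustration of how such a bounding box could be used for cropping, the sketch below grows a box by a margin and clamps it to the image. It is plain Python with no Tesseract calls; `pad_box` and the 10-pixel default margin are assumptions made for the example, and the numbers are the box reported earlier in this thread (x=78, y=0, w=148, h=36 in a 720x36 image).

```python
def pad_box(x, y, w, h, img_w, img_h, margin=10):
    """Grow a box by `margin` pixels per side, clamped to the image.

    Returns (left, top, right, bottom) with right and bottom exclusive.
    """
    left = max(x - margin, 0)
    top = max(y - margin, 0)
    right = min(x + w + margin, img_w)
    bottom = min(y + h + margin, img_h)
    return left, top, right, bottom


crop = pad_box(78, 0, 148, 36, img_w=720, img_h=36)
print(crop)  # (68, 0, 236, 36)
```

Here the box already spans the full image height, so only the left and right edges move.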

There are also other tools for detecting text (e.g. OpenCV [1]); you can do layout detection outside Tesseract and then use Tesseract just for OCR...


Zdenko
