Detection on complex images


Paolo Giannoccaro

Oct 13, 2017, 9:54:39 AM
to tesseract-ocr
Hi,
I need to detect a fixed set of words in the attached image. Not all of them are in a standard English dictionary (some are acronyms, for example).

I tried running detection on the full image and on sub-images split from it, but the detection quality is low.

I use Tess4J, and the most important parts of my code are:

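// Initialize the engine and restrict recognition to a character whitelist.
// VAR_CHAR_WHITELIST and WHITELIST_DEFAULT are constants not shown here.
ITesseract instance = new Tesseract();
instance.setTessVariable(VAR_CHAR_WHITELIST, WHITELIST_DEFAULT);

// Detect: request results at word level.
int pageIteratorLevel = TessPageIteratorLevel.RIL_WORD;
List<Word> result = instance.getWords(image, pageIteratorLevel);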

Any help?
Thanks a lot.
43007108190000_sample.tif

Dmitri Silaev

Oct 14, 2017, 4:29:29 PM
to tesser...@googlegroups.com
What are you unhappy with: detection rate or recognition accuracy? All in all, there's a ton of reasons why Tess can work poorly here. Some kind of preprocessing is definitely needed. What kind? It depends.

I personally would say that I need to know:
- 5-10 concrete examples of words you are going to look for,
- their bounding boxes within your sample image.

Once I have those, I might be able to help.

Best regards,
Dmitri Silaev
www.CustomOCR.com





--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/90295194-26a9-4f31-bd9d-63d61d7bd592%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

zbgns

Oct 16, 2017, 7:13:39 AM
to tesseract-ocr

I understand that the aim is to obtain a searchable file so that you can find where specific words occur in the document. I would try creating a searchable PDF and then using “find” in a PDF reader.

However, I identified two main problems with the file you attached.

First, the image is too large for Tesseract to process. This may be a limit set by the PDF specification: the image is 128 inches high, whereas the limit is probably 45 inches. So the image needs to be cut into 3 pieces before Tesseract can turn it into a PDF.
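The cutting step could look roughly like the sketch below. It only computes the strip rectangles, assuming plain horizontal strips with a small overlap so that a text line cut by one boundary appears whole in the neighbouring strip. `StripSplitter`, the strip height, and the overlap are illustrative choices, not part of Tesseract or Tess4J.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Tesseract or Tess4J.
final class StripSplitter {

    /**
     * Computes horizontal strips that together cover a width x height image.
     * Consecutive strips share `overlap` rows; the last strip may be shorter.
     */
    static List<Rectangle> strips(int width, int height, int stripHeight, int overlap) {
        if (width <= 0 || height <= 0 || stripHeight <= 0
                || overlap < 0 || overlap >= stripHeight) {
            throw new IllegalArgumentException(
                    "need positive sizes and 0 <= overlap < stripHeight");
        }
        List<Rectangle> result = new ArrayList<>();
        int step = stripHeight - overlap; // distance between consecutive strip tops
        for (int top = 0; top < height; top += step) {
            int h = Math.min(stripHeight, height - top);
            result.add(new Rectangle(0, top, width, h));
            if (top + h >= height) {
                break; // bottom reached; avoid a redundant, fully covered strip
            }
        }
        return result;
    }
}
```

Each rectangle can then be cropped with `BufferedImage.getSubimage(x, y, w, h)` before OCR. Any word position found inside a strip needs the strip's `y` added back to map it to the full image.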


You may open the file with gImageReader and run OCR on the parts containing letters by using rectangle selections. I tried it (using the Tesseract 4.00 alpha engine) and it does output text, but the quality is not satisfactory. This is the second issue: the image quality is not sufficient for effective recognition (the shapes of some letters are hardly readable), and I don’t think it can be improved in any easy way.

Art Rhyno.

Oct 16, 2017, 8:02:02 AM
to tesser...@googlegroups.com

The height of the sample is definitely challenging. If I use a portion of it, Olena might be able to do a viable job of picking out the text [1]. I am not even sure it’s a proper font, though, so it might make more sense to use something like template matching rather than OCR. There seem to be lots of instances where the characters touch or overlap each other.
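To make the template matching idea concrete, here is a minimal brute-force sketch using normalized cross-correlation on grayscale arrays. `TemplateMatcher` and its methods are hypothetical names, and the inputs are assumed to be non-empty rectangular `int[][]` arrays of pixel intensities. It shows the technique only; it is far too slow for an image of this size.

```java
// Hypothetical helper for illustration: brute-force template matching
// by normalized cross-correlation (NCC) on grayscale intensity arrays.
final class TemplateMatcher {

    /** NCC score in [-1, 1] for the image window whose top-left corner is (top, left). */
    static double score(int[][] image, int[][] tmpl, int top, int left) {
        int th = tmpl.length, tw = tmpl[0].length;
        double n = th * tw;
        double sumI = 0, sumT = 0;
        for (int r = 0; r < th; r++) {
            for (int c = 0; c < tw; c++) {
                sumI += image[top + r][left + c];
                sumT += tmpl[r][c];
            }
        }
        double meanI = sumI / n, meanT = sumT / n;
        double cross = 0, varI = 0, varT = 0;
        for (int r = 0; r < th; r++) {
            for (int c = 0; c < tw; c++) {
                double di = image[top + r][left + c] - meanI;
                double dt = tmpl[r][c] - meanT;
                cross += di * dt;
                varI += di * di;
                varT += dt * dt;
            }
        }
        double denom = Math.sqrt(varI * varT);
        // A flat window or flat template carries no usable correlation.
        return denom == 0 ? 0 : cross / denom;
    }

    /** Returns {row, col} of the best-scoring window, or {-1, -1} if the template does not fit. */
    static int[] bestMatch(int[][] image, int[][] tmpl) {
        int[] best = {-1, -1};
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int top = 0; top + tmpl.length <= image.length; top++) {
            for (int left = 0; left + tmpl[0].length <= image[0].length; left++) {
                double s = score(image, tmpl, top, left);
                if (s > bestScore) {
                    bestScore = s;
                    best = new int[] {top, left};
                }
            }
        }
        return best;
    }
}
```

In practice one would keep every window above a score threshold rather than only the best one, since each word occurs several times. OpenCV's `matchTemplate` with `TM_CCOEFF_NORMED` computes the same kind of score much faster.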

 

art

---

1. https://drive.google.com/file/d/0B-PK1n92dlzwWmRReVYzdVdBU2M/view?usp=sharing


Paolo Giannoccaro

Oct 16, 2017, 2:01:21 PM
to tesseract-ocr
Thanks, Art, for your contribution.
The words that I have to extract from the attached sample are: ost, stain, stn, resd, o stn (they occur several times; in total there are 20 occurrences).
I am currently using OpenCV to preprocess the image and get a rough detection of the rectangles that contain text. Then I run Tesseract OCR on each rectangle. So far I am able to get 10 of the 20 words.

Of course, if I already had bounding boxes for each word, I would have already solved the problem.
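Since the target vocabulary is fixed and small, one possible way to recover words that the OCR reads almost correctly is to snap each OCR result to the closest vocabulary entry by edit distance. The sketch below is only an illustration of that idea: `VocabularyMatcher` and the `maxDistance` cutoff are assumptions, not something Tess4J provides.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration: maps noisy OCR text to a fixed vocabulary.
final class VocabularyMatcher {

    /** Levenshtein distance: minimum number of single-character insertions, deletions, substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the vocabulary word closest to the OCR text, if it is within maxDistance edits. */
    static Optional<String> closest(String ocrText, List<String> vocabulary, int maxDistance) {
        String cleaned = ocrText.trim().toLowerCase(Locale.ROOT);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String word : vocabulary) {
            int d = editDistance(cleaned, word);
            if (d < bestDistance) {
                bestDistance = d;
                best = word;
            }
        }
        return bestDistance <= maxDistance ? Optional.ofNullable(best) : Optional.empty();
    }
}
```

With words as short as "stn" and "ost", a cutoff of 1 is probably the most that can be tolerated before unrelated text starts matching, so the threshold would need checking against real output.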



Dmitri Silaev

Oct 16, 2017, 2:35:12 PM
to tesser...@googlegroups.com
I asked for a few bounding boxes so that we can all locate the required words inside the image. Depending on what they are, various methods may or may not work. Your image is 135 megapixels in size. You should give as much information as possible to make life easier for people who are willing to help, shouldn't you?




Paolo Giannoccaro

Oct 16, 2017, 3:27:54 PM
to tesseract-ocr
Yes, I also think that the only way is to look for the text portions of the image. For this reason I am using OpenCV to detect text zones and then applying Tesseract for text extraction. The problem is that I can find just 1/3 of the total number of words.

Paolo Giannoccaro

Oct 16, 2017, 6:45:59 PM
to tesseract-ocr
Sorry, I posted the wrong data.
These are the correct word positions inside the image:

43007108190000_sample.tif,stain,304,4643,389,4679
43007108190000_sample.tif,stain,555,4685,634,4717
43007108190000_sample.tif,ost,1037,17303,1135,17341
43007108190000_sample.tif,o stn,910,24353,1049,24395
43007108190000_sample.tif,stn,960,30230,1066,30280
43007108190000_sample.tif,stn,997,31693,1095,31731
43007108190000_sample.tif,resd,749,33140,872,33187
43007108190000_sample.tif,resd,756,33543,873,33585
43007108190000_sample.tif,resd,778,33625,894,33666
43007108190000_sample.tif,resd,774,35233,894,35281
43007108190000_sample.tif,resd,881,38096,1004,38134
43007108190000_sample.tif,stn,1115,39344,1209,39384
43007108190000_sample.tif,resd,1066,39674,1189,39710
43007108190000_sample.tif,resd,883,39751,1001,39791
43007108190000_sample.tif,stn,765,40758,856,40797
43007108190000_sample.tif,stn,765,41079,852,41112
43007108190000_sample.tif,resd,977,42652,1093,42698
43007108190000_sample.tif,resd,885,42976,1011,43024
43007108190000_sample.tif,resd,908,43544,1024,43588
43007108190000_sample.tif,resd,1028,43665,1151,43711

Each row has the image name, the word, and the rectangle coordinates.
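For anyone who wants to load these rows, here is a small parsing sketch. It assumes the four numbers are left, top, right, bottom in pixels, which fits the values above but is not stated in the post. `WordBox` is a hypothetical class, and the `iou` helper is just one common way to decide whether a detected rectangle hits a listed word.

```java
import java.awt.Rectangle;

// Hypothetical record of one row; the field order follows the rows above.
final class WordBox {
    final String image;
    final String word;
    final Rectangle box;

    private WordBox(String image, String word, Rectangle box) {
        this.image = image;
        this.word = word;
        this.box = box;
    }

    /** Parses "image,word,x1,y1,x2,y2" (assumed: left, top, right, bottom). */
    static WordBox parse(String row) {
        String[] f = row.trim().split(",");
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 comma-separated fields: " + row);
        }
        int x1 = Integer.parseInt(f[2].trim());
        int y1 = Integer.parseInt(f[3].trim());
        int x2 = Integer.parseInt(f[4].trim());
        int y2 = Integer.parseInt(f[5].trim());
        return new WordBox(f[0].trim(), f[1].trim(), new Rectangle(x1, y1, x2 - x1, y2 - y1));
    }

    /** Intersection over union of two rectangles: 0 when disjoint, 1 when identical. */
    static double iou(Rectangle a, Rectangle b) {
        Rectangle inter = a.intersection(b);
        if (inter.isEmpty()) {
            return 0.0;
        }
        double interArea = (double) inter.width * inter.height;
        double unionArea = (double) a.width * a.height + (double) b.width * b.height - interArea;
        return interArea / unionArea;
    }
}
```

Counting the listed boxes that some detected rectangle overlaps with an IoU above a chosen threshold gives a detection rate that can be tracked while the preprocessing is tuned.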

Thanks

Tom Morris

Oct 17, 2017, 7:00:43 PM
to tesseract-ocr
I don't suppose this has anything to do with the Top Coder Mud Logger OCR contest, does it?

How will our team divide its winnings?

Tom

Dmitri Silaev

Oct 18, 2017, 1:38:02 PM
to tesser...@googlegroups.com
Wow, we are being taken advantage of. Smart move, Paolo, but not fair. Heck, I almost started writing the answer.





Paolo Giannoccaro

Oct 18, 2017, 6:33:45 PM
to tesseract-ocr
Yes, it does, but posting a question and (maybe) getting some answers doesn't mean having a solution, nor does it make a team :)

Paolo Giannoccaro

Oct 18, 2017, 6:39:34 PM
to tesseract-ocr
Why not fair? Getting technical advice from any kind of forum is just ordinary work (think of Stack Overflow: is it unfair to find an idea or a piece of code there?). Developing a full solution is a different thing, and that is what I will try to do.

Thanks for your time.



Dmitri Silaev

Oct 19, 2017, 5:50:07 PM
to tesser...@googlegroups.com
Oh, come on, please no more speechmaking.

Contest hosts know that it is hard to get the words out of the images, and you know it too. And you know that there's not much of a "full solution" left to build once you've managed to get the OCR to work right. That's the point of the problem. That way, you just ask people to solve the problem, to get your job fully done. You're going to get paid, people won't get anything.

That's. Not. Fair. Period.





Paolo Giannoccaro

Oct 20, 2017, 5:13:33 AM
to tesseract-ocr
You say "That way, you just ask people to solve the problem". Maybe I am confused, but this is a forum. Someone asks for something, someone answers (if HE WANTS TO and if HE IS ABLE TO).
You say "to get your job fully done". You don't have the slightest idea of what the overall job is, and an answer on a public forum is never, never, never a "job fully done" (in my experience; maybe you have a different feeling, it depends on the complexity you are able to manage).

You say "You're going to get paid". Wow, I see the touch of a genius. So all the other people asking in this forum from all parts of the world are just asking for fun. No one works with OCR and Tesseract, and you just answer as a pleasant free-time hobby.

You say "people won't get anything". To get something you have to produce something. People able to produce always get something. That said, if your answers require payment, you can just say so, or not answer, or (be careful, I am about to give you a present) you could write half an answer and ask for money for the other half.

Now I have just given you a business idea, while you have given me nothing.