--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/90295194-26a9-4f31-bd9d-63d61d7bd592%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
I understand that the aim is to obtain a searchable file so that you can find where specific words occur in the document. I would do this by creating a searchable PDF and then using “find” in a PDF reader.
However, I identified two main problems with the file you attached.
First, the image is too large for tesseract to process. This may be a limit set by the PDF specification: the image is 128 inches high, whereas the limit is probably 45 inches. So the image needs to be cut into 3 pieces before tesseract can turn it into a PDF.
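As a rough illustration of the cutting step described above, here is a minimal sketch that divides a tall image into horizontal strips no taller than a given limit. The class name `TallImageSplitter`, its method names, and the optional overlap between strips are assumptions made for this example; they are not part of tesseract or Tess4J. Only standard Java classes are used.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of tesseract or Tess4J.
class TallImageSplitter {

    /**
     * Returns {top, bottom} row ranges (bottom exclusive) that together cover
     * rows 0..height. Each range is at most maxStripHeight rows tall and
     * consecutive ranges share `overlap` rows, so a word that straddles a cut
     * still appears whole in at least one strip if it is shorter than the overlap.
     */
    static List<int[]> stripBounds(int height, int maxStripHeight, int overlap) {
        if (height <= 0 || maxStripHeight <= 0 || overlap < 0 || overlap >= maxStripHeight) {
            throw new IllegalArgumentException("invalid strip parameters");
        }
        List<int[]> bounds = new ArrayList<>();
        int step = maxStripHeight - overlap;
        int top = 0;
        while (true) {
            int bottom = Math.min(top + maxStripHeight, height);
            bounds.add(new int[] {top, bottom});
            if (bottom == height) {
                break;
            }
            top += step;
        }
        return bounds;
    }

    /** Cuts the image into full-width strips using the ranges from stripBounds. */
    static List<BufferedImage> split(BufferedImage image, int maxStripHeight, int overlap) {
        List<BufferedImage> strips = new ArrayList<>();
        for (int[] b : stripBounds(image.getHeight(), maxStripHeight, overlap)) {
            // getSubimage returns a view that shares pixel data with the source image.
            strips.add(image.getSubimage(0, b[0], image.getWidth(), b[1] - b[0]));
        }
        return strips;
    }
}
```

With a height of 128 units, a limit of 45 units, and no overlap, `stripBounds` yields three ranges, which matches the 3 pieces mentioned above. Positions found in a strip need the strip's top offset added back to map them onto the full image.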
You could open the file with gImageReader and run OCR on the parts containing letters by using rectangle selection(s). I tried it (using the tesseract 4.00 alpha engine) and it produces text output, but the quality is not satisfactory. This is the second issue: the image quality is not sufficient for effective recognition (the shapes of some letters are hardly readable), and I don’t think it can be improved in any easy way.
The height of the sample is definitely challenging. If I use a portion of it, Olena might be able to do a viable job of picking out the text [1]. I am not even sure it’s a proper font, though; it might make more sense to use something like template matching rather than OCR. There seem to be lots of instances where the characters touch or overlap each other.
art
---
1. https://drive.google.com/file/d/0B-PK1n92dlzwWmRReVYzdVdBU2M/view?usp=sharing
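To make the template-matching suggestion above a little more concrete, here is a minimal sketch of the brute-force variant on binarized images: slide a small template over the image and keep the position with the fewest differing pixels. The class `TemplateMatcher`, the `Match` result type, and the use of `boolean[][]` grids (true meaning ink) are assumptions for this example. A real pipeline would need binarization and probably some tolerance for touching characters, neither of which is shown here.

```java
// Hypothetical helper illustrating brute-force template matching on binary images.
class TemplateMatcher {

    /** Top-left position of a candidate and how many pixels differ from the template. */
    static final class Match {
        final int x;
        final int y;
        final int mismatches;

        Match(int x, int y, int mismatches) {
            this.x = x;
            this.y = y;
            this.mismatches = mismatches;
        }
    }

    /**
     * Tries every placement of the template inside the image and returns the
     * placement with the fewest mismatching pixels (the first one on ties).
     * Both grids are indexed as [row][column] and must be rectangular.
     */
    static Match bestMatch(boolean[][] image, boolean[][] template) {
        int imageHeight = image.length;
        int imageWidth = image[0].length;
        int templateHeight = template.length;
        int templateWidth = template[0].length;
        if (templateHeight > imageHeight || templateWidth > imageWidth) {
            throw new IllegalArgumentException("template is larger than the image");
        }
        Match best = null;
        for (int y = 0; y + templateHeight <= imageHeight; y++) {
            for (int x = 0; x + templateWidth <= imageWidth; x++) {
                int mismatches = 0;
                for (int ty = 0; ty < templateHeight; ty++) {
                    for (int tx = 0; tx < templateWidth; tx++) {
                        if (image[y + ty][x + tx] != template[ty][tx]) {
                            mismatches++;
                        }
                    }
                }
                if (best == null || mismatches < best.mismatches) {
                    best = new Match(x, y, mismatches);
                }
            }
        }
        return best;
    }
}
```

A low mismatch count relative to the template area suggests a hit; where to set that threshold depends on the scan quality and is left open here.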
What are you unhappy with: detection rate or recognition accuracy? All in all, there are a ton of reasons why Tess can work poorly here. Some kind of preprocessing is definitely needed. What kind? It depends. Personally, I would need to know:
- 5-10 concrete examples of words you are going to look for,
- their bounding boxes within your sample image.
Once I have those, I might be able to help.
On Fri, Oct 13, 2017 at 9:05 AM, Paolo Giannoccaro <pa.gian...@gmail.com> wrote:
Hi,
I need to detect a fixed set of words in the attached image. Not all of them are part of the canonical English dictionary (for example, some could be acronyms).
I tried detection on the full image and on split sub-images, but the quality of detection is low.
I use Tess4J, and the most important parts of my code are:
    // initialize
    ITesseract instance = new Tesseract();
    instance.setTessVariable(VAR_CHAR_WHITELIST, WHITELIST_DEFAULT);
    // detect
    int pageIteratorLevel = TessPageIteratorLevel.RIL_WORD;
    List<Word> result = instance.getWords(image, pageIteratorLevel);
Any help?
Thanks a lot
| 43007108190000_sample.tif,stain,304,4643,389,4679 |
| 43007108190000_sample.tif,stain,555,4685,634,4717 |
| 43007108190000_sample.tif,ost,1037,17303,1135,17341 |
| 43007108190000_sample.tif,o stn,910,24353,1049,24395 |
| 43007108190000_sample.tif,stn,960,30230,1066,30280 |
| 43007108190000_sample.tif,stn,997,31693,1095,31731 |
| 43007108190000_sample.tif,resd,749,33140,872,33187 |
| 43007108190000_sample.tif,resd,756,33543,873,33585 |
| 43007108190000_sample.tif,resd,778,33625,894,33666 |
| 43007108190000_sample.tif,resd,774,35233,894,35281 |
| 43007108190000_sample.tif,resd,881,38096,1004,38134 |
| 43007108190000_sample.tif,stn,1115,39344,1209,39384 |
| 43007108190000_sample.tif,resd,1066,39674,1189,39710 |
| 43007108190000_sample.tif,resd,883,39751,1001,39791 |
| 43007108190000_sample.tif,stn,765,40758,856,40797 |
| 43007108190000_sample.tif,stn,765,41079,852,41112 |
| 43007108190000_sample.tif,resd,977,42652,1093,42698 |
| 43007108190000_sample.tif,resd,885,42976,1011,43024 |
| 43007108190000_sample.tif,resd,908,43544,1024,43588 |
| 43007108190000_sample.tif,resd,1028,43665,1151,43711 |
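For anyone who wants to work with the rows above programmatically, here is a minimal parsing sketch. It assumes each row is `file,label,left,top,right,bottom` in pixel coordinates; that column order is a guess based on the values and is not stated anywhere in the thread. The class name `WordBox` is likewise made up for this example.

```java
// Hypothetical record for one row of the listing above.
// Assumed column order: file, label, left, top, right, bottom (pixels).
class WordBox {
    final String file;
    final String label;
    final int left;
    final int top;
    final int right;
    final int bottom;

    WordBox(String file, String label, int left, int top, int right, int bottom) {
        this.file = file;
        this.label = label;
        this.left = left;
        this.top = top;
        this.right = right;
        this.bottom = bottom;
    }

    /** Parses one row, tolerating the surrounding "|" characters and extra spaces. */
    static WordBox parse(String row) {
        String cleaned = row.replace("|", "").trim();
        String[] fields = cleaned.split(",");
        if (fields.length != 6) {
            throw new IllegalArgumentException("expected 6 comma-separated fields: " + row);
        }
        return new WordBox(
                fields[0].trim(),
                fields[1].trim(), // labels such as "o stn" keep their inner space
                Integer.parseInt(fields[2].trim()),
                Integer.parseInt(fields[3].trim()),
                Integer.parseInt(fields[4].trim()),
                Integer.parseInt(fields[5].trim()));
    }

    int width() {
        return right - left;
    }

    int height() {
        return bottom - top;
    }
}
```

Under that assumed column order, the boxes are roughly 85 to 140 pixels wide and 30 to 50 pixels tall, which gives a sense of how small the target words are relative to the full image.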
Wow, we are being taken advantage of. Smart move Paolo but not fair. Heck, I almost started writing the answer.
On Tue, Oct 17, 2017 at 7:00 PM, Tom Morris <tfmo...@gmail.com> wrote:
I don't suppose this has anything to do with the Top Coder Mud Logger OCR contest, does it?
How will our team divide its winnings?
Tom