Tesseract OCR is not able to read the exact data from some images. Please reply as soon as possible.


YOGESH KUMBHARE

May 28, 2020, 1:50:14 AM
to tesseract-ocr
Hi Team,

I am planning to use the Tesseract OCR engine as the library for extracting data from images,
but for some images the data is not extracted in the proper format. What is the solution for that?
How can I resolve it?
Can anyone please help me with these images? What should I do, and is any configuration needed in the Tesseract OCR library?

Please let me know as soon as possible.

Sample code:

import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class Test {

    public static void main(String[] args) {
        // Path of the image file
        File imageFile = new File("Sample1_3.png");
        System.out.println(imageFile.canRead());

        ITesseract instance = new Tesseract(); // JNA Interface Mapping
        instance.setDatapath("tessdata");      // folder containing eng.traineddata
        instance.setLanguage("eng");
        instance.setTessVariable("user_defined_dpi", "300");

        try {
            String text = instance.doOCR(imageFile);
            System.out.println(text); // print the recognized text
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}

Attachment: Sample1_3.png

Piyush Chandra

May 28, 2020, 2:22:16 AM
to tesseract-ocr
1. You need to work on preprocessing the images.

2. The first image I tried required a 180-degree rotation.

tesseract Sample1_3.png sam1 -l osd --psm 0 

Result:
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 0.96
Script: Latin
Script confidence: 11.67
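
The report above is plain "key: value" text, so it can be read back in Java without extra libraries. A minimal sketch, assuming the report keeps the field names shown above; the class name `OsdReport` and the caller-supplied confidence threshold are illustrative and not part of Tesseract or Tess4J:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses the "key: value" report produced with --psm 0. */
final class OsdReport {
    private final Map<String, String> fields = new LinkedHashMap<>();

    static OsdReport parse(String osdText) {
        OsdReport report = new OsdReport();
        for (String line : osdText.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // skip blank or unexpected lines
            }
            // Key is everything before the first colon, value is the rest.
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            report.fields.put(key, value);
        }
        return report;
    }

    /** Degrees listed under "Rotate". */
    int rotate() {
        return Integer.parseInt(fields.get("Rotate"));
    }

    double orientationConfidence() {
        return Double.parseDouble(fields.get("Orientation confidence"));
    }

    String script() {
        return fields.get("Script");
    }

    /** Assumption: the threshold is the caller's choice; Tesseract does not define one. */
    boolean isOrientationReliable(double minConfidence) {
        return orientationConfidence() >= minConfidence;
    }
}
```

For the report above, `rotate()` gives 0 and `orientationConfidence()` gives 0.96. Whether that confidence is high enough to trust is a judgment call, which is why the threshold is left to the caller.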

3. After rotation, I tried OCR with --psm 6 (more about PSM: https://github.com/tesseract-ocr/tesseract/issues/434#issuecomment-561010796):
tesseract Sample1_3.png sam1 -l eng --psm 6

Result:
We are Jorwarding the mesons and syllabus Sor P.G.Diplome in
Yoga (E & T.M). The Subjects that You have to Study ang the number of
lessons/units i each Subject are Mentioneg in the SYllabus Please
Compare the lessons / Units with Syllabus. In case You fing any
discrepancy Please tYorm the Directoy by name ™Mediately,
W'Sh You all succes
DIRECTOR

Try to fine-tune Tesseract for the font you are using: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00

Lakshay Saini

May 28, 2020, 2:23:38 AM
to tesseract-ocr
Hi Yogesh,

First and foremost, the image samples that you are using are not good enough to extract all the data.

Coming to your problem statement: are you modifying the images before providing them to Tesseract?

If not:

1. You need to rotate the images before providing them to the engine. You can get the rotation information by using PSM mode "0". It creates an ".osd" file that contains the rotation information.
2. Deskew the image to align the text angle properly.
3. Convert the image to grayscale and adjust the contrast so that the text is more visible. (Optional: you can invert the colors too, e.g. convert the text to white and the background to black.)
4. Increase the DPI of the image, as this can increase the accuracy of the detected text.

Note: Higher image quality also increases the accuracy of the detected text.
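
Steps 3 and 4 can be tried in plain Java before reaching for a larger imaging library. The following is a rough sketch using only `java.awt`; the class and method names are hypothetical, the luma weights are the common 0.299/0.587/0.114 ones, and upscaling only stands in for "increasing the DPI": it adds pixels by interpolation and cannot restore detail that was never captured. Deskewing (step 2) is not covered here.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

/** Hypothetical helper; the names are illustrative and not part of Tesseract or Tess4J. */
final class OcrPreprocess {

    /** Step 3a: converts any image to 8-bit grayscale using common luma weights. */
    static BufferedImage toGray(BufferedImage src) {
        int w = src.getWidth(), h = src.getHeight();
        BufferedImage gray = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster out = gray.getRaster();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                out.setSample(x, y, 0, (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b));
            }
        }
        return gray;
    }

    /** Step 3b: stretches the darkest..brightest range of a grayscale image to 0..255. */
    static BufferedImage stretchContrast(BufferedImage gray, boolean invert) {
        int w = gray.getWidth(), h = gray.getHeight();
        WritableRaster in = gray.getRaster();
        int min = 255, max = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int v = in.getSample(x, y, 0);
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
        }
        int range = Math.max(1, max - min); // avoid division by zero on flat images
        BufferedImage result = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster out = result.getRaster();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int v = (in.getSample(x, y, 0) - min) * 255 / range;
                out.setSample(x, y, 0, invert ? 255 - v : v);
            }
        }
        return result;
    }

    /** Step 4 stand-in: enlarges a grayscale image by `factor` with bicubic interpolation. */
    static BufferedImage upscale(BufferedImage gray, double factor) {
        int w = (int) Math.round(gray.getWidth() * factor);
        int h = (int) Math.round(gray.getHeight() * factor);
        BufferedImage result = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = result.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(gray, 0, 0, w, h, null);
        g.dispose();
        return result;
    }
}
```

A typical chain would be `upscale(stretchContrast(toGray(image), false), 2.0)`, with the input loaded through `javax.imageio.ImageIO.read`. The factor 2.0 is only a starting point to experiment with.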

Furthermore, you can read about the Page Segmentation Modes (PSM) and OCR Engine Modes (OEM) in the official documentation. They can help you a lot too.

Also, if you can, test Google Cloud Vision too. Its accuracy is much higher than Tesseract's. It is a paid API, but you can create a free account and OCR up to 1000 pages each month free of charge. After that you will be charged, but it's affordable. Upon signing up for a free account you also get 300 dollars of credit for a year from Google.

Regards
Lakshay Saini

YOGESH KUMBHARE

May 28, 2020, 2:37:53 AM
to tesseract-ocr

Hi Piyush,

Thanks for the immediate response.

Can you please send me sample Java code to rotate the image?
That would really help me. I tried to find it but was not able to.

Please send me Java sample code for preprocessing the images.

Thanks,
Yogesh

Piyush Chandra

May 28, 2020, 2:54:20 AM
to tesseract-ocr
Hi Yogesh,

I am not a Java developer, but you can use OpenCV for preprocessing the images. The steps mentioned by Lakshay (#3) can be achieved with OpenCV, which is available for Java.

Rotating the image can also be achieved using OpenCV. There are loads of examples available online (Google it!).
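
For the specific case of upright versus sideways or upside-down pages, where the correction is a multiple of 90 degrees, the rotation itself can be sketched without OpenCV. The example below uses plain `java.awt.image.BufferedImage` instead of OpenCV; the class name `QuarterTurn` is hypothetical, and arbitrary angles (deskewing) are out of scope:

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: lossless rotation by multiples of 90 degrees. */
final class QuarterTurn {

    /** Rotates clockwise by 0, 90, 180 or 270 degrees; other angles are rejected. */
    static BufferedImage rotateClockwise(BufferedImage src, int degrees) {
        int angle = Math.floorMod(degrees, 360);
        if (angle % 90 != 0) {
            throw new IllegalArgumentException("Only multiples of 90 are supported: " + degrees);
        }
        int w = src.getWidth(), h = src.getHeight();
        boolean swap = angle == 90 || angle == 270;
        // TYPE_INT_RGB drops transparency, which is usually fine for scanned pages.
        BufferedImage out = new BufferedImage(swap ? h : w, swap ? w : h,
                BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                switch (angle) {
                    case 90:  out.setRGB(h - 1 - y, x, rgb); break;         // top row becomes right column
                    case 180: out.setRGB(w - 1 - x, h - 1 - y, rgb); break; // upside down
                    case 270: out.setRGB(y, w - 1 - x, rgb); break;         // top row becomes left column
                    default:  out.setRGB(x, y, rgb);                        // 0 degrees: plain copy
                }
            }
        }
        return out;
    }
}
```

Copying pixel by pixel avoids interpolation, so the text is not blurred by the rotation.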

Lakshay Saini

May 28, 2020, 2:59:48 AM
to tesseract-ocr
Hi Yogesh,

The same is the case with me: I'm a Python developer. What Piyush is saying is right. For your reference:


This is the guide to rotating the image. You can explore OpenCV for Java on Google. You can find a solution there for everything we have mentioned earlier. OpenCV is the major library used to modify images.

Regards
Lakshay Saini

YOGESH KUMBHARE

May 28, 2020, 4:30:35 AM
to tesseract-ocr
Hi Lakshay,

Thanks for the response.
I explored the Java examples for OpenCV, but it seems it does not detect the angle automatically and rotate the image.

My requirement is that the user can upload the image in any orientation, and we don't know which in advance, which makes this quite difficult.
In the OpenCV library we have to specify the angle. Do you have any idea how to detect it automatically and, based on that, correct the image orientation?

Thanks. Please reply.

Lakshay Saini

May 28, 2020, 4:55:07 AM
to tesseract-ocr
Hi Yogesh,

As I mentioned earlier, you need to get the orientation information by using PSM 0. You will get the orientation angle from there, and you can then use it to rotate the image back to 0 degrees.
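
From Java, one way to do this is to call the command-line tool and read the ".osd" file it writes. This is only a sketch under several assumptions: the `tesseract` executable is on the PATH, `osd.traineddata` is installed, the command uses the same arguments as the one quoted earlier in this thread, and the class name `OsdRunner` is made up for illustration. My understanding is that the "Rotate" value is the correction to apply, but the direction is worth verifying on an image whose orientation you already know.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: runs Tesseract in OSD-only mode and reads the Rotate value. */
final class OsdRunner {
    private static final Pattern ROTATE = Pattern.compile("(?m)^Rotate:\\s*(\\d+)\\s*$");

    /** Builds: tesseract <image> <outputBase> -l osd --psm 0 */
    static List<String> osdCommand(String imagePath, String outputBase) {
        return Arrays.asList("tesseract", imagePath, outputBase, "-l", "osd", "--psm", "0");
    }

    /** Extracts the number on the "Rotate:" line of an OSD report. */
    static int rotateFrom(String osdText) {
        Matcher m = ROTATE.matcher(osdText);
        if (!m.find()) {
            throw new IllegalArgumentException("No Rotate line found in OSD output");
        }
        return Integer.parseInt(m.group(1));
    }

    /** Runs the command, then parses <outputBase>.osd. Assumes tesseract is on the PATH. */
    static int detectRotate(String imagePath, String outputBase)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(osdCommand(imagePath, outputBase))
                .inheritIO()
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        Path osdFile = Paths.get(outputBase + ".osd");
        return rotateFrom(new String(Files.readAllBytes(osdFile), StandardCharsets.UTF_8));
    }
}
```

A call such as `OsdRunner.detectRotate("Sample1_3.png", "sam1")` would then give the angle to pass to whatever rotation routine you use, before running the normal OCR pass on the corrected image.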

Regards
Lakshay
