improve image so i can better OCR

eliav schmulewitz
Jun 7, 2017, 2:19:20 AM
to tesseract-ocr

Hi

I posted this on Stack Overflow but got no response...


I am trying to read subtitles from an image taken from the news, using tesseract in Python.
For some reason I get better results when I save the image with plt and have tesseract read it from that file.

  1. Why is that?
  2. How can I refine my results using cv2?
import urllib3
import numpy as np
import pytesseract
import matplotlib.pyplot as plt
from PIL import Image


def downloadFile():
    # Download the saved video frame (a NumPy array) and write it to disk.
    url = 'https://drive.google.com/uc?export=download&id=0B7t_yZLolnbiaVpicnEwbDRjTmc'
    http = urllib3.PoolManager()
    r = http.request('GET', url)
    # The with block closes the file before np.load reads it.
    with open('testing.npy', 'wb') as f:
        f.write(r.data)


downloadFile()
frame = np.load('testing.npy')
# Crop the subtitle strip: rows 170-209, columns 8-194.
new_frame = frame[170:210, 8:195]
plt.imshow(new_frame)
plt.axis('off')
plt.savefig('plt.png')
# OCR the raw crop directly, then OCR the image that matplotlib rendered.
print('from array: ' + pytesseract.image_to_string(Image.fromarray(new_frame), lang='eng'))
print('from plt: ' + pytesseract.image_to_string(Image.open('plt.png'), lang='eng'))
Thank you!