pyocr to use multi-language


JImCurry

Nov 21, 2016, 3:29:23 AM

to Paperwork

My question is how to use pyocr with multiple languages.

I built my own training data and want to combine it with Google's trained data (chi_tra.traineddata + new.traineddata). How can I do that?

I know that

tesseract input.tif output -l eng+newfont

works on the command line, but how do I do the same with pyocr?

Below is my code.

--------------------------------------------------------

from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io

tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()[1]  # [1] is the Google traineddata

req_image = []
final_text = []

# Render the PDF at 300 dpi and convert it to JPEG
image_pdf = Image(filename="xxx.pdf", resolution=300)
image_jpeg = image_pdf.convert('jpeg')

# Collect each page as a JPEG blob
for img in image_jpeg.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('jpeg'))

# Run OCR on every page
for img in req_image:
    txt = tool.image_to_string(
        PI.open(io.BytesIO(img)),
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
    # final_text.append(txt)
    print(txt)

Jerome Flesch

Nov 21, 2016, 11:54:07 AM

to paperw...@googlegroups.com, a0953...@gmail.com

Hello,

You can explicitly use the 'tesseract' module, in which case
specifying something like 'eng+newfont' as the language should work:

```
from PIL import Image
import pyocr.builders
from pyocr import tesseract

txt = tesseract.image_to_string(
    Image.open("whatever.jpg"),
    lang="eng+newfont",  # several languages joined with '+'
    builder=pyocr.builders.TextBuilder()
)
```

This is a bit of a hack as I cannot guarantee it would work with other
OCR tools, not even libtesseract, but it should do the trick.
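
As a minimal sketch of the same idea, the combined language string can be built from a list of codes and checked against the installed languages before it is passed as `lang`. The helper name `combine_languages` and the `installed` list are hypothetical, chosen only for illustration; in real use the list could presumably come from `tesseract.get_available_languages()`.

```python
def combine_languages(requested, available):
    """Join Tesseract language codes with '+' after checking each one is installed."""
    missing = [code for code in requested if code not in available]
    if missing:
        raise ValueError("no traineddata installed for: " + ", ".join(missing))
    return "+".join(requested)


# Hypothetical list of installed languages, hard-coded so the sketch runs
# without Tesseract being present.
installed = ["chi_tra", "eng", "newfont"]

lang = combine_languages(["chi_tra", "newfont"], installed)
print(lang)
```

The resulting string can then be used as the `lang` argument in the call above. Checking first turns a missing traineddata file into an early, readable error rather than a failure inside the OCR call.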

