Using pyocr with multiple languages


JImCurry

Nov 21, 2016, 3:29:23 AM
to Paperwork
My question is: how can I use pyocr with multiple languages?

I built my own training data set and want to combine it with Google's trained data (chi_tra.traineddata + new.traineddata).

How can I do that?


I know that
tesseract input.tif output -l eng+newfont
works on the command line.


But how do I do the same with pyocr?
My code is below.
--------------------------------------------------------
from wand.image import Image
from PIL import Image as PI
import pyocr
import pyocr.builders
import io


tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()[1]   # index 1 is the Google trained data

req_image = []
final_text = []

image_pdf = Image(filename="xxx.pdf", resolution=300)
image_jpeg = image_pdf.convert('jpeg')


for img in image_jpeg.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('jpeg'))

for img in req_image:
    txt = tool.image_to_string(
        PI.open(io.BytesIO(img)),
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
    #final_text.append(txt)
    print(txt)



 

Jerome Flesch

Nov 21, 2016, 11:54:07 AM
to paperw...@googlegroups.com, a0953...@gmail.com
Hello,


You can explicitly use the 'tesseract' module, in which case
specifying something like 'eng+newfont' as the language should work:

```
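import pyocr.builders
from PIL import Image
from pyocr import tesseract

# "eng+newfont" asks Tesseract to use both traineddata files
txt = tesseract.image_to_string(
    Image.open("whatever.jpg"),
    lang="eng+newfont",
    builder=pyocr.builders.TextBuilder(),
)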
```

This is a bit of a hack, as I cannot guarantee it will work with other
OCR tools, not even libtesseract, but it should do the trick.
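
For what it's worth, here is a rough sketch of how the combined language string could be built and checked before calling the 'tesseract' module. The helper names `combine_languages` and `ocr_with_languages` are made up for this example and are not part of pyocr; the sketch also assumes that `tesseract.get_available_languages()` lists your custom traineddata (e.g. 'newfont') once it is installed in the tessdata directory.

```python
def combine_languages(wanted, available):
    """Join language codes with '+', the syntax Tesseract uses for multiple languages.

    Hypothetical helper: raises ValueError if a requested code is not available.
    """
    missing = [code for code in wanted if code not in available]
    if missing:
        raise ValueError("traineddata not installed for: " + ", ".join(missing))
    return "+".join(wanted)


def ocr_with_languages(image_path, wanted):
    """Run OCR on one image file with several languages (hypothetical helper)."""
    # Imported here so combine_languages() can be used without pyocr installed.
    import pyocr.builders
    from PIL import Image
    from pyocr import tesseract

    lang = combine_languages(wanted, tesseract.get_available_languages())
    return tesseract.image_to_string(
        Image.open(image_path),
        lang=lang,
        builder=pyocr.builders.TextBuilder(),
    )
```

With this, something like `ocr_with_languages("whatever.jpg", ["chi_tra", "newfont"])` would pass `lang="chi_tra+newfont"` to Tesseract, and a misspelled or missing language fails early with a readable error instead of somewhere inside the OCR call.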