[How-to] OpenRefine and GoogleTranslate


hpiedcoq

Apr 17, 2018, 9:21:40 AM
to OpenRefine
Hi,

Here is a short tutorial on using OpenRefine with the Google Translate API.

My use case: a directory with 700 PNG screenshots of Arabic text. I need a spreadsheet with the translation of the text in one column, where each row corresponds to one PNG file.

Prerequisites:
  • OpenRefine, with Jython installed. Please follow Thad's brilliant tutorial here.

Install the google-api-python-client package:


sudo jython pip install --upgrade google-api-python-client

Source : This page


  • A Google Translate API key. You can test the API by creating an account on the developer platform.



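I use tesseract to create a txt file for each png file (bash script for Ubuntu):

#!/bin/bash
# Replace with a glob matching your screenshots, e.g. /path/to/dir/*.png
Files=path_to_png_files
for f in $Files
do
    echo "Processing $f file..."
    # tesseract appends .txt to the output base, so this writes $f.txt
    tesseract "$f" "$f" -l ara
done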
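
For anyone who would rather drive the OCR step from Python than from bash, the loop above could be sketched roughly like this. It assumes the tesseract binary is on the PATH with the Arabic language data installed. The names build_tesseract_command, ocr_directory and png_dir are made up for this illustration and are not part of the original script.

```python
import glob
import os
import subprocess


def build_tesseract_command(png_path, lang="ara"):
    # tesseract appends ".txt" to the output base, so using the png path
    # itself as the base produces "<name>.png.txt" next to the image.
    return ["tesseract", png_path, png_path, "-l", lang]


def ocr_directory(png_dir, lang="ara", run=subprocess.call):
    # png_dir is a hypothetical directory holding the screenshots.
    # `run` is injectable so the loop can be dry-run without tesseract.
    outputs = []
    for png_path in sorted(glob.glob(os.path.join(png_dir, "*.png"))):
        print("Processing %s file..." % png_path)
        run(build_tesseract_command(png_path, lang))
        outputs.append(png_path + ".txt")
    return outputs
```

Passing run=print gives a dry run that only shows the commands, which is a cheap check before launching 700 OCR jobs.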

Then I create a csv file listing all these txt files with their complete paths.

ls -1 "$PWD"/*.txt > listing.csv
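
The same listing can be produced from Python, which may be handy on systems without a bash shell. This is only a sketch: write_listing and its txt_dir argument are hypothetical names, and it writes one absolute path per line with no header, which is what the one-column import expects.

```python
import glob
import os


def write_listing(txt_dir, listing_path="listing.csv"):
    # Collect the absolute path of every .txt file in txt_dir, sorted by name.
    pattern = os.path.join(txt_dir, "*.txt")
    paths = sorted(os.path.abspath(p) for p in glob.glob(pattern))
    # One path per line, no header and no separators.
    with open(listing_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```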


1- The Refine job

Import this csv file into OpenRefine, with no separators. You should obtain a one-column project containing the local paths to your txt files.

[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column Column 1 using expression grel:'file://'+value",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "Column 1",
    "expression": "grel:'file://'+value",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-addition-by-fetching-urls",
    "description": "Create column content at index 1 by fetching URLs based on column Column 1 using expression grel:value",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "content",
    "columnInsertIndex": 1,
    "baseColumnName": "Column 1",
    "urlExpression": "grel:value",
    "onError": "store-error",
    "delay": 200,
    "cacheResponses": true
  }
]
This Refine script (apply it from the Undo / Redo tab with Apply...) prefixes each path with file:// and then creates a second column, content, holding the text fetched from your local txt files.
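
To make the two operations concrete, here is a rough standalone equivalent in Python 3 (not Jython, and not something OpenRefine runs). The helper names to_file_url and fetch_content are made up for this sketch, and it assumes the txt files are UTF-8 encoded.

```python
from urllib.request import urlopen


def to_file_url(path):
    # Same string concatenation as the GREL expression 'file://' + value.
    return "file://" + path


def fetch_content(url):
    # Rough stand-in for "Add column by fetching URLs" on a local file URL.
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```

An absolute path such as /home/user/shot.png.txt becomes file:///home/user/shot.png.txt, which is why the listing needs complete paths.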


2- Jythonizing GoogleTranslate

I strongly recommend testing this script first on a selection of 2 or 3 rows, as every re-evaluation of the function consumes API calls.


Create a new column based on the content column, and use this Jython script.

# coding: utf8
import sys
# Replace with the path to your own Jython site-packages
sys.path.append(r'/opt/jython2.7.0/Lib/site-packages/')
import json
from apiclient.discovery import build

key = 'YOUR_KEY_HERE'
target_language = 'fr'

# Build a client for the Translate v2 API
service = build('translate', 'v2', developerKey=key)
collection = service.translations()
request = collection.list(q=value, source='ar', target=target_language)
response = request.execute()
response_json = json.dumps(response)  # raw response as a JSON string, handy for debugging
translated = response['translations'][0]['translatedText']
# ASCII-only variant: drops every non-ASCII character
ascii_translation = translated.encode('utf-8').decode('ascii', 'ignore')
utf_translation = translated.encode('utf-8')
return utf_translation


target_language is the output language.
source is the original language.
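
The script computes both an ASCII and a UTF-8 version of the translation, and the difference is easy to miss. The sketch below isolates that last step so it can be tried without spending API calls. The sample dictionary is a made-up response shaped like the one the script indexes into, and extract_translation is a hypothetical helper name. Under Python 3 the UTF-8 branch returns bytes, whereas under Jython 2.7 it is a plain str.

```python
def extract_translation(response, ascii_only=False):
    # response is a dict like the one returned by request.execute().
    text = response["translations"][0]["translatedText"]
    encoded = text.encode("utf-8")
    if ascii_only:
        # Drops every non-ASCII byte, so accented letters disappear.
        return encoded.decode("ascii", "ignore")
    return encoded


# Hypothetical sample response, for illustration only.
sample = {"translations": [{"translatedText": u"Bonjour, ça va"}]}
```

With a French target the ASCII variant silently loses accents, which is presumably why the script returns the UTF-8 one.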

You should get a parseable/readable/searchable translation of your text. \o/

Thad Guidry

Apr 17, 2018, 10:28:39 AM
to openr...@googlegroups.com
Nice going !
Glad my tutorial helped you out !

-Thad
