Importing UTF-8 Data

Dan Bravender

Apr 10, 2008, 10:13:49 AM
to Google App Engine
I'm looking at a way around this, but for the time being UTF-8 isn't
working with bulk_client.py:

Traceback (most recent call last):
  File "/Users/dbravender/Desktop/google_appengine/google/appengine/ext/webapp/__init__.py", line 486, in __call__
    handler.post(*groups)
  File "/Users/dbravender/Desktop/google_appengine/google/appengine/ext/bulkload/__init__.py", line 287, in post
    self.request.get(constants.CSV_PARAM))
  File "/Users/dbravender/Desktop/google_appengine/google/appengine/ext/bulkload/__init__.py", line 357, in Load
    for columns in reader:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
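
As far as I can tell, this is the Python 2 csv module being byte-oriented: the bulk loader wraps the uploaded text (which arrives as a unicode string) in StringIO, and iterating the reader triggers an implicit ASCII encode. A stripped-down reproduction outside App Engine:

import StringIO
import csv

data = u'\uc548\ub155,\u4f60\u597d\n'          # one CSV row of Korean and Chinese text
reader = csv.reader(StringIO.StringIO(data))   # same pattern the bulk loader uses
for row in reader:                             # blows up with the UnicodeEncodeError above
    print row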

The csv file that I'm trying to upload has Korean and Chinese
characters. I found some code for reading a UTF-8 csv file on
python.org:

import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
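
If I'm reading the recipe right, it wants an iterable of unicode lines, so driving it looks something like this ('data.csv' is just a placeholder name, and I'm assuming the file itself is saved as UTF-8):

import codecs

f = codecs.open('data.csv', 'r', encoding='utf-8')    # yields unicode lines
for row in unicode_csv_reader(f, skipinitialspace=True):
    print u', '.join(row)                             # every cell comes back as unicode
f.close()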

But I'm still baffled about UTF-8 in Python. Any help would be
appreciated!

Dan

gearhead

Apr 11, 2008, 1:17:51 AM
to Google App Engine
Dan,

Yeah, this is a common problem; see this article on solving it:
http://www.amk.ca/python/howto/unicode

In short, you need to specify 'ignore', 'strict', etc. in the errors
parameter, e.g.,

val = unicode('\x80abc', errors='ignore')

I believe the default is 'strict'.
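
A quick illustration (Python 2; when no encoding is given, unicode() falls back to ASCII):

s = '\x80abc'                         # byte string with one non-ASCII byte
# unicode(s)                          # UnicodeDecodeError -- errors defaults to 'strict'
print unicode(s, errors='ignore')     # u'abc'        -- the bad byte is dropped
print unicode(s, errors='replace')    # u'\ufffdabc'  -- replaced with U+FFFD
print unicode(s, 'utf-8', 'ignore')   # decode as UTF-8 instead, skipping invalid bytes

Keep in mind that 'ignore' throws characters away, so for Korean/Chinese data you usually want to decode with the right codec ('utf-8') rather than just suppress the error.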

Brian

Dan Bravender

Apr 11, 2008, 5:42:56 AM
to Google App Engine
Brian,

Thanks a ton. I almost slammed my head through my computer last night.
^^

댄

Dan Bravender

Apr 11, 2008, 10:19:44 AM
to Google App Engine
I've finally gotten UTF-8 data to import. The idea is to hand csv.reader UTF-8-encoded bytes, then decode each cell back to unicode before its converter runs. Here's the diff:

--- google_appengine/google/appengine/ext/bulkload/__init__.py	2008-04-03 09:05:25.000000000 +0900
+++ google_appengine-fixed/google/appengine/ext/bulkload/__init__.py	2008-04-11 23:10:43.000000000 +0900
@@ -225,7 +225,7 @@

entity = datastore.Entity(self.__kind)
for (name, converter), val in zip(self.__properties, values):
- entity[name] = converter(val)
+ entity[name] = converter(val.decode('utf-8'))

entities = self.HandleEntity(entity)

@@ -349,7 +349,7 @@
output.append('Error: no Loader defined for kind %s.' % kind)
return (httplib.BAD_REQUEST, ''.join(output))

- buffer = StringIO.StringIO(data)
+ buffer = StringIO.StringIO(data.encode('utf-8'))
reader = csv.reader(buffer, skipinitialspace=True)
entities =


Dan Bravender

Apr 11, 2008, 11:30:55 AM
to Google App Engine
I hope this change makes it into App Engine release 2. I suppose I
could work around it by building my own bulk load mechanism into my
app for now, since the same error still comes up on the deployed
runtime, where the local patch doesn't apply:

Traceback (most recent call last):
  File "/base/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 486, in __call__
    handler.post(*groups)
  File "/base/python_lib/versions/1/google/appengine/ext/bulkload/__init__.py", line 287, in post
    self.request.get(constants.CSV_PARAM))
  File "/base/python_lib/versions/1/google/appengine/ext/bulkload/__init__.py", line 357, in Load
    for columns in reader:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
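
In case anyone needs the same workaround before a fixed SDK ships, here's a rough sketch of what an in-app loader could look like. The Phrase model and its fields are made up for illustration; a real handler would use whatever kind the app defines and take the CSV from a 'csv' form field:

import csv
import StringIO

from google.appengine.ext import db
from google.appengine.ext import webapp


class Phrase(db.Model):                  # hypothetical kind, purely for illustration
    korean = db.StringProperty()
    english = db.StringProperty()


class CsvUploadHandler(webapp.RequestHandler):
    def post(self):
        data = self.request.get('csv')                    # webapp returns this as unicode
        buffer = StringIO.StringIO(data.encode('utf-8'))  # csv wants bytes, so encode first
        for row in csv.reader(buffer, skipinitialspace=True):
            korean, english = [cell.decode('utf-8') for cell in row]  # back to unicode per cell
            Phrase(korean=korean, english=english).put()
        self.response.out.write('done')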
