How to write UTF-8 characters to JSON output instead of their escape codes?


Kazimir

Oct 27, 2011, 4:03:46 AM
to scrapy-users
Hello.

Assume we have the following code in a spider:
item['name'] = hxs.select('blabla').extract()[0]
# now name = u'amnist\xeda'

and then we start it:
scrapy crawl spider --set FEED_URI=out.json --set FEED_FORMAT=json

Now out.json looks like:
[ {"name": u"amnist\xeda"}]

Is there a way to make it look like:
[ {"name": "amnistía"} ]




lookfwd

Oct 28, 2011, 4:27:16 AM
to scrapy-users
Maybe converting from:

item['name'] = hxs.select('blabla').extract()[0]

to:

item['name'] = hxs.select('blabla').extract()[0].encode( "utf-8" )

would work.

Scotty Allen

Oct 28, 2011, 4:16:20 PM
to scrapy...@googlegroups.com
I'm hitting this as well.

It looks to me like this is actually an issue with the json exporter,
rather than the spider. Under the covers, the scrapy json exporter is
using json.JSONEncoder, whose constructor takes an argument called
ensure_ascii. From
http://docs.python.org/library/json.html#json.JSONEncoder :

"If ensure_ascii is True (the default), the output is guaranteed to be
str objects with all incoming unicode characters escaped. If
ensure_ascii is False, the output will be a unicode object."
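
The effect of that flag can be seen in isolation with a minimal sketch. It assumes Python 3 (the thread itself predates it and uses Python 2 unicode literals), and the item dict is just the example value from the original question:

```python
import json

# Same text as u'amnist\xeda' from the original question.
item = {"name": "amnist\u00eda"}

# Default encoder: non-ASCII characters are written as \uXXXX escapes.
escaped = json.JSONEncoder().encode(item)

# ensure_ascii=False: non-ASCII characters are kept as-is in the output.
readable = json.JSONEncoder(ensure_ascii=False).encode(item)

print(escaped)
print(readable)
```

Both strings are valid JSON for the same data; only the second keeps the accented character readable. When writing the second form to a file, the file needs a UTF-8 encoding.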

Scrapy is not setting this parameter, so it's defaulting to escaping
unicode. It looks like most of the underlying code supports passing
in arguments to the JSONEncoder, all the way up to
FeedExporter._get_exporter in scrapy/contrib/feedexport.py. However,
that method is only called from FeedExporter.open_spider, which
doesn't pass in any arguments other than the temporary file.

It looks like a quick workaround would be to implement your own feed
exporter to replace JsonItemExporter or JsonLinesItemExporter, and
override the builtin one using the FEED_EXPORTERS setting.

A longer-term solution would be to provide a way to pass in feed
exporter parameters via the settings file.

-Scotty


Scotty Allen

Oct 28, 2011, 4:27:27 PM
to scrapy...@googlegroups.com
I got this working. Here's the replacement for JsonLinesItemExporter:

import json
from scrapy.contrib.exporter import BaseItemExporter

class UnicodeJsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(ensure_ascii=False, **kwargs)

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict) + '\n')
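
For the exporter to take effect it also has to be registered through the FEED_EXPORTERS setting mentioned earlier. A rough sketch of that settings entry follows; the module path myproject.exporters is a hypothetical location for the class above, so adjust it to wherever the class actually lives:

```python
# settings.py (sketch)
# 'myproject.exporters' is a hypothetical module path, not something
# from this thread; point it at the module that defines the class.
FEED_EXPORTERS = {
    'jsonlines': 'myproject.exporters.UnicodeJsonLinesItemExporter',
}
```

With this mapping only the jsonlines format is overridden, so the crawl would need to be started with FEED_FORMAT=jsonlines rather than FEED_FORMAT=json.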

-Scotty

Максим Горковский

Oct 30, 2011, 11:08:44 AM
to scrapy...@googlegroups.com
I did as you said and nothing changed; strings in the output file are still escaped:
{"data": ["TOMER URWICZ", "Tanto silencio les llama la atenci\u00f3n."]}

Maybe you were having a different problem?
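
As a side note, the escaped line above is still valid JSON and decodes to the same text. A small sketch, assuming Python 3, that reads the line back and re-serialises it without escapes:

```python
import json

# The escaped output line quoted above, kept verbatim via a raw string.
escaped_line = r'{"data": ["TOMER URWICZ", "Tanto silencio les llama la atenci\u00f3n."]}'

# Decoding turns the \u00f3 escape back into the actual character.
decoded = json.loads(escaped_line)

# Re-serialising with ensure_ascii=False keeps the character unescaped.
readable = json.dumps(decoded, ensure_ascii=False)

print(decoded["data"][1])
print(readable)
```

So the escapes only affect how the file looks in a text editor, not the data a JSON parser gets out of it.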




--
Best regards,
Максим Горковский

Scotty Allen

Oct 30, 2011, 5:12:33 PM
to scrapy...@googlegroups.com
Did you override the built-in one via FEED_EXPORTERS? Are you using the jsonlines format?

-Scotty

