Best approach to parse JSON responses in Scrapy.


Bruno Lima

Apr 8, 2013, 2:44:41 PM
to scrapy...@googlegroups.com
Hello everybody,

I need to scrape 2 websites that return a huge JSON response (1.2 MB) and use the JSON + JavaScript to build the HTML. I want to know: what is the best approach?

1 - Parse the JSON in the spider itself and create the items.
     |-> In this case, which library do you guys recommend?
     |-> Is this scalable? It will be restricted to that domain's requests.

2 - Save the JSON itself to an item and build pipelines to create the items.

3 - Save the JSON to a NoSQL store or a queue and use another script to create the items.

Thank you all.



Bruno Seabra Mendonça Lima
--------------
http://about.me/bruno.seabra

Rolando Espinoza La Fuente

Apr 8, 2013, 4:05:20 PM
to scrapy...@googlegroups.com
On Mon, Apr 8, 2013 at 2:44 PM, Bruno Lima <bsli...@gmail.com> wrote:
Hello everybody,

I need to scrape 2 websites that return a huge JSON response (1.2 MB) and use the JSON + JavaScript to build the HTML. I want to know: what is the best approach?


Doing CPU-intensive processing within Scrapy will not scale, because processing each huge file
will take a few seconds, and the engine (a single-threaded event loop) will not be able to do anything else in the meantime.

If you are doing this processing for very few websites it will be OK. Otherwise, moving the processing to a pool of workers is a good choice.

Hope this helps. Regards,

Rolando

Bruno Lima

Apr 8, 2013, 4:31:49 PM
to scrapy...@googlegroups.com
I think I'll use Scrapy to fetch the JSON documents and RabbitMQ or Redis to distribute them among workers.
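
A minimal sketch of that split, assuming Redis and the redis-py client (the spider name, queue key, and URL are placeholders, not from this thread):

import redis
import scrapy


class JsonFetchSpider(scrapy.Spider):
    """Fetch raw JSON and hand it off to a Redis list for out-of-process parsing."""
    name = "json_fetch"
    start_urls = ["http://example.com/big.json"]  # placeholder URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.queue = redis.Redis()  # assumes a local Redis server

    def parse(self, response):
        # Push only the raw body; the CPU-heavy JSON parsing happens in
        # separate worker processes that pop from this list.
        self.queue.rpush("json:pending", response.body)

Workers can then BLPOP from "json:pending" and do the slow parsing without ever blocking the Scrapy engine.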

Bruno Seabra Mendonça Lima
--------------
http://about.me/bruno.seabra



Shane Evans

Apr 8, 2013, 6:07:34 PM
to scrapy...@googlegroups.com
Yes, many people use Scrapy for scraping APIs returning JSON. It's a good option and should work fine.

Bruno's question specifically mentions parsing large JSON documents, and the same advice applies to HTML, XML, etc.



On 8 April 2013 22:12, Andres Douglas <andres....@gmail.com> wrote:
I actually have a similar use case. I'd like to query a bunch of REST APIs returning JSON, and was wondering if I could leverage Scrapy to parse the JSON responses. I would have to do some transformations to the field names and structures to normalize them into a common format, and then store them in Django models. Has anyone used Scrapy for this purpose?

Andres Douglas

Apr 8, 2013, 6:46:52 PM
to scrapy...@googlegroups.com
Yup, thanks. Moved to its own question specific to REST APIs - https://groups.google.com/forum/?fromgroups=#!topic/scrapy-users/gSc2rZN6js0

Bruno Lima

Apr 9, 2013, 7:14:21 AM
to scrapy...@googlegroups.com
Shane,
Taking Andres' question as an example, where is the best place to parse and transform the JSON?
In the spider's parse method itself, or in pipelines?

Bruno Seabra Mendonça Lima
--------------
http://about.me/bruno.seabra


Shane Evans

Apr 9, 2013, 8:07:24 AM
to scrapy...@googlegroups.com
The spider itself. I don't see why parsing JSON is all that different to parsing HTML. The spider takes the raw response data and outputs parsed items and requests.
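
A minimal sketch of that pattern (the endpoint, field names, and "next" pagination key are made up for illustration; response.text is the modern Scrapy attribute, older versions spelled it response.body_as_unicode()):

import json

import scrapy


class ApiSpider(scrapy.Spider):
    name = "api"
    start_urls = ["http://example.com/api/items.json"]  # placeholder endpoint

    def parse(self, response):
        data = json.loads(response.text)
        # Output parsed items...
        for entry in data.get("items", []):
            yield {"name": entry.get("name"), "price": entry.get("price")}
        # ...and follow-up requests, exactly as a spider would for HTML.
        next_url = data.get("next")
        if next_url:
            yield scrapy.Request(next_url, callback=self.parse)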

Bruno Lima

Apr 9, 2013, 8:15:12 AM
to scrapy...@googlegroups.com
I agree with you, but from the docs and examples it seems like Scrapy only handles HTML/XML,
including the item loaders, which are XPath-driven.


Bruno Seabra Mendonça Lima
--------------
http://about.me/bruno.seabra


Andres Douglas

Apr 17, 2013, 8:45:19 PM
to scrapy...@googlegroups.com
Bruno, any luck with this?

郭冬冬

Apr 17, 2013, 9:05:39 PM
to scrapy...@googlegroups.com
You can parse the JSON response in spider.parse with simplejson or something like that,
instead of using HtmlXPathSelector, and then return the parsed Item.

On Tuesday, April 9, 2013 at 8:15:12 PM UTC+8, Bruno Lima wrote:

Andres Douglas

Apr 17, 2013, 9:09:42 PM
to scrapy...@googlegroups.com

Thanks for the answer. I was looking for something a bit more solid, where you could specify the paths of the elements to be extracted with something like XPath. The idea would be to create a spider that is specific to JSON and easy to extend for any schema.





郭冬冬

Apr 17, 2013, 9:18:43 PM
to scrapy...@googlegroups.com
Great idea!
I searched Google :) and found this; try it: http://goessner.net/articles/JsonPath/.
The Python package is here: https://pypi.python.org/pypi/jsonpath/.
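
Usage is roughly this (a sketch against a made-up document; as I understand that package, jsonpath() returns a list of matches, or False when nothing matches):

from jsonpath import jsonpath  # pip install jsonpath

data = {"store": {"book": [{"title": "A", "price": 8.95},
                           {"title": "B", "price": 12.99}]}}

titles = jsonpath(data, "$.store.book[*].title")
print(titles)  # ['A', 'B']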

On Thursday, April 18, 2013 at 9:09:42 AM UTC+8, Andres Douglas wrote:

Bruno Lima

Apr 18, 2013, 8:14:58 AM
to scrapy...@googlegroups.com
Hi everyone, I tried parsing the JSON with simplejson but the performance was quite poor, so I googled a bit and found UltraJSON (https://github.com/esnme/ultrajson), which is blazing fast.
So now I use UltraJSON in the spider's parse method to build the items, plus a single pipeline to save them into MongoDB, and multiple background processes that retrieve the documents from MongoDB and make the necessary adjustments to save them into MySQL.
It's working like a charm =D
UltraJSON and MongoDB are really, really fast.
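
For reference, the MongoDB half of that setup can be a very small pipeline. This is only a sketch with illustrative database and collection names (insert_one is the pymongo 3+ spelling); in the spider, ujson.loads() drops in for json.loads():

import pymongo


class MongoPipeline:
    """Save each scraped item into MongoDB; background jobs move it on to MySQL."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient()  # assumes a local mongod
        self.items = self.client["scraping"]["items"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.items.insert_one(dict(item))
        return item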


Bruno Seabra Mendonça Lima
--------------
http://about.me/bruno.seabra

