GSoC: HTTP API & Visual Scrapy Browser Plugin


Ruben Vereecken

Feb 17, 2014, 5:46:37 AM
to scrapy...@googlegroups.com
Hi,

I've been hanging around the IRC channel at Freenode for over a week now and thought I'd introduce myself as a GSoC participant as well.
It was never my plan to work on Scrapy for GSoC (this being your first year participating), but since I'm currently using Scrapy anyway, I'd like to take the chance to contribute at the same time.

About me personally: I'll have my undergraduate degree by the summer, as a student at the University of Antwerp, Belgium. My greatest passions have always been the various web technologies out there and, more recently, Artificial Intelligence. I'm working on getting a blog up, but due to a terrible host I won't disclose the URL for the moment.

The ideas page lists an intermediate task about an HTTP API for Scrapy spiders, a task that probably fits me best. The mentor listed is Shane Evans, which is just my bad luck as he seems to be a busy guy. I've got some ideas around this project, so if anyone would be willing to (informally) talk them over, it would be greatly appreciated.

As soon as I started using Scrapy I had this short but vivid dream of anyone having access to Scrapy through a browser plugin that would interactively and visually construct spiders for the user, without the user ever having to touch any Python code. Later I thought I'd found exactly this idea on your ideas page, but I can't seem to find it again. This is a project I would love to work on even more than the aforementioned one, though I'm still investigating its feasibility. But let's be honest: it would be really cool if you could just select some text on a page as a certain 'thing', click the links for the crawler to follow, and have it crawl all of that for you. This is even more of a shout-out to any interested users or developers to discuss the subject with me, because this is what I'd love to focus on. Be it technical details or simply talking ideas, just give me a holla.

For the next couple of weeks you'll see me in #scrapy as Randomaniac and around the mailing lists. I'm planning to delve into the Scrapy code to familiarise myself with it and, hopefully, land a couple of patches where needed along the way.

Cheers!

Ruben


PS: If the mentor of the original browser plugin idea reads this, please get back to me so we can compare our visions on the matter.

Shane Evans

Feb 17, 2014, 10:16:33 AM
to scrapy...@googlegroups.com
> The ideas page lists an intermediate task about an HTTP API for Scrapy spiders, a task that probably fits me best. The mentor listed is Shane Evans, which is just my bad luck as he seems to be a busy guy. I've got some ideas around this project, so if anyone would be willing to (informally) talk them over, it would be greatly appreciated.
I got in touch with Ruben and we're discussing it. We plan to put our IRC details on the wiki, which should help.


> As soon as I started using Scrapy I had this short but vivid dream of anyone having access to Scrapy through a browser plugin that would interactively and visually construct spiders for the user, without the user ever having to touch any Python code. Later I thought I'd found exactly this idea on your ideas page, but I can't seem to find it again. This is a project I would love to work on even more than the aforementioned one, though I'm still investigating its feasibility. But let's be honest: it would be really cool if you could just select some text on a page as a certain 'thing', click the links for the crawler to follow, and have it crawl all of that for you. This is even more of a shout-out to any interested users or developers to discuss the subject with me, because this is what I'd love to focus on. Be it technical details or simply talking ideas, just give me a holla.

The idea that was there was a browser extension that would help with spider generation. The reason we pulled it was that it wasn't well developed enough in time for the proposal. We should (and probably will) put it back on the draft ideas in some form.

Did you see the Scrapinghub autoscraping tool?
We're currently working on an open source version of the UI (the back end is https://github.com/scrapy/slybot ), which will be available soon. The UI will generate slybot spiders, but we also had an idea for GSoC to generate Python code or files for parslepy (https://github.com/redapple/parslepy ). If this area is your biggest interest, we can try to come up with something interesting here for you.
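
To make the code-generation idea a bit more concrete, here is a rough sketch of the kind of spider such a tool might emit. Everything in it is illustrative (the selectors, field names and URLs are made up), and it assumes a recent Scrapy release for response.follow() and the .get()/.getall() shortcuts:

import scrapy


class GeneratedProductSpider(scrapy.Spider):
    """Hypothetical output of a visual spider-generation tool: the user
    clicked a title and a price on an example page, and the tool recorded
    CSS selectors for the fields and for the links to follow."""
    name = "generated_products"
    start_urls = ["http://example.com/catalog"]

    def parse(self, response):
        # Follow the links the user marked as "pages to crawl".
        for href in response.css("a.product::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        # Extract the fields the user annotated visually.
        yield {
            "title": response.css("h1.title::text").get(),
            "price": response.css("span.price::text").get(),
        }

Generating plain Python like this, instead of (or alongside) slybot templates, is what would let someone keep hand-editing the spider once the visual tool reaches its limits.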


> For the next couple of weeks you'll see me in #scrapy as Randomaniac and around the mailing lists. I'm planning to delve into the Scrapy code to familiarise myself with it and, hopefully, land a couple of patches where needed along the way.
Great.

Ruben Vereecken

Feb 18, 2014, 11:15:36 AM
to scrapy...@googlegroups.com
Thanks for the great answer; Scrapinghub looks really promising, by the way. Generating Parsley sounds interesting, but I feel you've basically got that covered with slybot and a UI on top of it.

I'm currently back to looking in the direction of an HTTP API, yet I feel the project as we discussed it before is a bit immature on its own. If anyone has ever needed an HTTP API for their Scrapy spiders that required more intricate functionality, please get back to me so we can discuss how such an API could be extended beyond communicating with a single, simple spider. In the meantime, I'll keep looking on my own.



On Monday, 17 February 2014 16:16:33 UTC+1, shane wrote:

Shane Evans

Feb 18, 2014, 4:13:31 PM
to scrapy...@googlegroups.com

> Thanks for the great answer; Scrapinghub looks really promising, by the way. Generating Parsley sounds interesting, but I feel you've basically got that covered with slybot and a UI on top of it.
Sure. I think there is a lot of interesting work here, but it's not well defined yet. There are many cases where slybot will not do exactly what you want, so I like the idea of generating Python at that point and continuing to code from there. It's also better than browser add-ons for working with XPaths, because it uses Scrapy.

> I'm currently back to looking in the direction of an HTTP API, yet I feel the project as we discussed it before is a bit immature on its own. If anyone has ever needed an HTTP API for their Scrapy spiders that required more intricate functionality, please get back to me so we can discuss how such an API could be extended beyond communicating with a single, simple spider. In the meantime, I'll keep looking on my own.

I agree, as it stands it's a bit light. I welcome suggestions, and I'll think about it some more too.

One addition I thought about: instead of wrapping a single spider, wrap a whole project and dispatch to any of its spiders, either based on a spider name passed in the request or via some domain -> spider mapping. This has come up before and would be useful.
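
A minimal sketch of how that dispatch could work, using Scrapy's spider loader (today's SpiderLoader API; the resolve_spider helper and its names are just illustrative):

from scrapy import Request
from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
loader = SpiderLoader.from_settings(settings)


def resolve_spider(spider_name=None, url=None):
    """Pick a spider for an incoming API call: dispatch by explicit name
    if one was passed, otherwise match the URL against each spider's
    allowed_domains."""
    if spider_name:
        return loader.load(spider_name)  # raises KeyError if unknown
    candidates = loader.find_by_request(Request(url))
    if not candidates:
        raise ValueError("no spider in this project handles %s" % url)
    return loader.load(candidates[0])

The name-based branch covers the "spider name passed" option, while find_by_request() gives the domain -> spider mapping essentially for free, since it checks allowed_domains.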




Ruben Vereecken

Feb 21, 2014, 10:23:22 AM
to scrapy...@googlegroups.com
Since we last exchanged ideas I've thought about this quite a lot, but sadly found it hard to come up with extra real-world uses for the project.
Here I'll try to collect the results of my solo brainstorming sessions and invite anyone and everyone to an open discussion on the subject.

I'm posting this to the scrapy-users list because every user, developer or not, is invited to read this and share their thoughts on my ramblings.

The basic idea is to control spiders through an HTTP API (I'll say REST API from now on; correct me on this if you like), somewhat similar to the existing WebService extension.
  • As Shane said before, it would be even more helpful if the same API gave access to all spiders encapsulated by one and the same project. Recapped: spiders are selected either by a name parameter or by domain. I like the former better, as it feels more natural to pick the spider in the same place where you tell it what to do, but both are possible.
  • One would be able to dispatch jobs to spiders, both by sending start URLs (cf. start_urls) and by sending Request objects (cf. start_requests); see the sketch after this list.
  • The user would have full control over the results of these crawls. The standard case would be for the spider to return Items (cf. parse). However, the user could also opt for a more interactive approach and intercept Responses as well, effectively bypassing the regular parse method.
  • The user can choose which pipelines items will go through. Pipelines are most often used for cleaning items and saving them in different formats. Since the user is remote, saving on the server might not make as much sense as when results are expected to appear locally; cleaning items, however, remains just as useful.
  • The API supports authentication. I should look into this more properly, but I would like at least support for API keys. Generally these are strings that a user supplies to gain access to the API. These keys could have rules tied to them, like rate limits, a maximum number of uses, expiry dates, ...
  • More vague brainstorm material, closer to the currently existing WebService: the user can manipulate CrawlSpider Rules, i.e. get, add and delete them.
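
To illustrate the kind of interaction I have in mind, here is a rough client-side sketch. The host, endpoint paths, JSON fields and the X-Api-Key header are all hypothetical placeholders for discussion, not an existing interface:

import requests

API = "http://localhost:6080"              # made-up host/port for the service
HEADERS = {"X-Api-Key": "my-secret-key"}   # illustrative API-key authentication

# Dispatch a job: pick a spider by name and hand it some start URLs.
job = requests.post(
    API + "/projects/myproject/crawl",
    json={"spider": "products", "start_urls": ["http://example.com/catalog"]},
    headers=HEADERS,
).json()

# Fetch the scraped items once the job has run.
items = requests.get(
    API + "/jobs/%s/items" % job["id"],
    headers=HEADERS,
).json()

for item in items:
    print(item)

An interactive mode could expose an extra endpoint (or a query parameter) that returns the raw responses instead of parsed items, which is the "bypass parse" case from the list above.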
The API is useful for those who want remote access to their spiders, be it for testing, practical or demo purposes. A cool addition would therefore be a small, clean and self-explanatory web interface. It should allow viewing "raw" requests and responses as they are sent to and received from the API, but also clean representations of these messages. This could be really basic, with just a visual tree-like representation of such objects, or really advanced, allowing a user to define widgets for how to represent each field. Again, this is just a brainstorm and depends completely on where the emphasis of the project lies.

I'll close this monologue by considering the already existing projects around Scrapy that provide similar functionality: Scrapyd and WebService, at least one of which most users have already glanced at.
Scrapyd allows starting any spider, and that's basically its greatest strength. WebService, on the other hand, is automatically enabled for one spider and allows monitoring and controlling that spider, though mostly the former. This project sits somewhere in between: it should preferably give access to multiple spiders (inside one project) at the same time, while putting the emphasis on interactive control.



Shane Evans

Feb 23, 2014, 1:11:34 PM
to scrapy...@googlegroups.com
I think you've added a lot of useful detail to the project, and the new ideas look good to me. I can think of past projects where I could have used them, and that's always a good sign :)

