Is this a valid use case?

Jim Priest

Oct 31, 2015, 9:50:40 PM
to scrapy-users
I'm just getting started with Scrapy and trying to figure out if we could use it for a project at work...

We have a large site and have a lot of content in Akamai for failover.

Our problem, however, is that the failover content gets stale over time.

I was thinking I could spider our site, pull down a page, and do a quick comparison with the same page in failover.

Same size? Same content? Etc. 

If yes, continue to the next file; if not, refresh the failover content.
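
Roughly what I have in mind for the comparison step, as an untested sketch (the hostnames and the example path are just placeholders):

    import hashlib
    import requests

    LIVE_HOST = "https://www.example.com"           # production site (placeholder)
    FAILOVER_HOST = "https://failover.example.com"  # Akamai failover copy (placeholder)

    def fingerprint(url):
        """Return (status, size, sha256 of body) for a URL."""
        resp = requests.get(url, timeout=30)
        body = resp.content
        return resp.status_code, len(body), hashlib.sha256(body).hexdigest()

    def is_stale(path):
        """True if the failover copy of `path` differs from the live page."""
        return fingerprint(LIVE_HOST + path) != fingerprint(FAILOVER_HOST + path)

    if is_stale("/some/page.html"):
        print("refresh the failover copy")  # push fresh content to failover here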

Scrapy seems like a good starting point, and between extensions, signals, and middleware it looks like I could build the workflow I'm after, unless I'm missing something.

Does anyone have feedback or opinions on whether this would work? :)

Thanks!
Jim

Jakob de Maeyer

Nov 2, 2015, 7:28:39 AM
to scrapy-users
Hey Jim,

Scrapy is great at two things:
1. downloading web pages, and
2. extracting structured data from them.

In your case, you should already have access to the raw files (via FTP, etc.), as well as to the data in a structured format. It would be possible to do what you're aiming at with Scrapy, but it doesn't seem to be the most elegant solution. What speaks against setting up an rsync cronjob or similar to keep the failover in sync?


Cheers,
-Jakob

Jim Priest

Nov 2, 2015, 4:16:56 PM
to scrapy-users
I should have provided a bit more info on our use case :)

We have a lot of dynamic content in Drupal, blogs, etc. The failover content is static versions of this dynamic content. Currently this is done via a rather clunky Akamai tool, which we're hoping to replace.

Another goal is to update this content more immediately: i.e., someone updates a Drupal page, it is immediately spidered (via an API call or something), and that content is then saved to failover.

I could probably cobble something together with wget or some other tool, but I'm trying not to reinvent the wheel here as much as possible.

Thanks!
Jim

Travis Leleu

Nov 2, 2015, 4:55:27 PM
to scrapy-users
Jim, I'd probably add a hook to the on_save event in your blogs that pushes the URL into a queue.  Have a simple script that saves the content to your static failover.  No need for a spider/crawler when you just want to grab one page's content on an event trigger.

Perhaps I'm not understanding why you'd need something as heavy as Scrapy; you could write a 30-line Python program to monitor the queue, requests.get() the page, and save it to the static location.
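
Something along these lines, as a rough sketch (the Redis queue, queue name, and failover path are just assumptions to make it concrete):

    import pathlib
    from urllib.parse import urlparse

    import redis     # assumes a Redis list is used as the queue
    import requests

    QUEUE_KEY = "pages-to-refresh"                     # hypothetical queue name
    FAILOVER_ROOT = pathlib.Path("/var/www/failover")  # hypothetical static root

    r = redis.Redis()

    def save_static_copy(url):
        """Fetch a URL and write its body under the failover document root."""
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        path = urlparse(url).path
        if not path or path.endswith("/"):
            path += "index.html"
        target = FAILOVER_ROOT / path.lstrip("/")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(resp.content)

    while True:
        # BLPOP blocks until the on_save hook pushes a URL onto the queue.
        _, url = r.blpop(QUEUE_KEY)
        save_static_copy(url.decode())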


Jim Priest

Nov 2, 2015, 5:07:11 PM
to scrapy...@googlegroups.com
We would like to implement something like that moving forward.

In the meantime we have a lot of currently cached pages we'd like to check (these may never get updated, so they would never hit the on_save hook), and we also have a lot of static resources to check that have no 'save now' hook available.

Ideally we'd have something that runs on a schedule for a broad update (once a week?), plus hooks implemented where we can, and that would cover everything else.

Jim


Jakob de Maeyer

Nov 3, 2015, 8:06:25 AM
to scrapy...@googlegroups.com
Hey Jim,

It still seems unintuitive that you need to go through HTTP requests when you have full access to everything. Have you looked at Drupal's static generator?

However, if making an HTTP request is your only (simple) way of generating the page that you want in the failover, Scrapy might indeed be an option. If you know (i.e. can generate a list of) all your URLs, you can simply put them in a spider's `start_urls` or `start_requests()`, and I would prefer that over the requests library because you get Scrapy's throttling, error handling, etc. If the URLs are unknown, you can make use of CrawlSpider and spider rules.
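
As a rough, untested sketch (spider name, domain, and output directory are placeholders):

    import pathlib
    from urllib.parse import urlparse

    import scrapy

    FAILOVER_ROOT = pathlib.Path("/var/www/failover")  # where the static copies go

    class FailoverSpider(scrapy.Spider):
        name = "failover"
        # If you can generate the full list of URLs, put it here
        # (or yield Requests from start_requests() instead).
        start_urls = [
            "https://www.example.com/",
            "https://www.example.com/blog/",
        ]

        def parse(self, response):
            # Save the raw response body under the failover document root.
            path = urlparse(response.url).path
            if not path or path.endswith("/"):
                path += "index.html"
            target = FAILOVER_ROOT / path.lstrip("/")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(response.body)

You can run something like that with `scrapy runspider`, and if the URLs aren't known up front, the same callback works from a CrawlSpider rule instead.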


Cheers,
-Jakob

Jim Priest

Nov 3, 2015, 9:26:21 AM
to scrapy...@googlegroups.com
Thanks for the feedback! I'm going to go through the Scrapy tutorial today and then see if I can hack up a quick proof of concept to see if this will work. While we have control over 'most' things, it's the edge cases we don't have control over, which is why we're exploring this approach.

Jim
