Quickest way to scrape the page content out of a smallish (qty 150) set of Google search results?


wade

Jun 23, 2010, 5:58:26 PM
to pnwcode4lib
Looking for a kick start idea or a simple solution for a particular
task, and thought of this pnwcode4lib group as a good resource.

I want to search our local web site for a particular text string, and
then scrape the content of the pages referenced by the search results
into a single, or multiple, text file(s).

More specifically, I want to search our local site for "faculty cv"
and/or "faculty resume", etc., grab any CV data posted (of which
there appears to be a lot), and massage that data into a rough starting
point for a bibliography of faculty publications.
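
For anyone following along, the "scrape the page content into a text file" step might look roughly like the sketch below. It is only a minimal illustration in Python using the standard library, not a finished tool: the names `TextExtractor`, `html_to_text`, and `save_pages` are hypothetical, the list of URLs is assumed to come from whatever search step ends up being used, and it assumes the pages are HTML rather than PDF or Word files.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def save_pages(urls, out_path):
    """Fetch each URL (assumed to be HTML) and append its text to one file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for url in urls:
            with urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
            out.write("==== " + url + " ====\n")
            out.write(html_to_text(html) + "\n\n")
```

Calling `save_pages(list_of_urls, "cv_dump.txt")` would then give one text file with a marker line per page, which could be a starting point for pulling out citations by hand or with further parsing.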

I've done this kind of thing on individual pages via PHP.

But it almost seems like there would be a ready-made Google API-based
tool for doing something like this.

Looking at the Google API now.

Meanwhile, if you have any suggestions, I'd love to hear them.

Wade Guidry
University of Puget Sound

Kyle Banerjee

Jun 25, 2010, 7:20:05 AM
to pnwco...@googlegroups.com
Are you focusing on pages within a particular area of the UPS domain, or do you also want to get CVs from personal websites and other areas? If the latter, the Google method will probably work best. Having said that, the API doesn't give you any data access, so getting significant amounts of info out will be hard, and you never know what you'll find or miss with Google.

If you could get local website admins to provide a list of pages on the servers, parsing the filenames for the right expressions might direct you straight to what you need. I would avoid crawling local sites (that is, just searching for these pages) without the cooperation of local admins, since that kind of activity could be misinterpreted.
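
As a rough illustration of the filename-parsing idea, here is a minimal Python sketch. The pattern, the function name `likely_cv_paths`, and the sample paths are all made up for the example; the real expressions would depend on how people on your servers actually name their files.

```python
import re

# Hypothetical pattern: match "cv", "vita", "vitae", or "resume" only when it
# is not embedded in a longer word (so "vitality" does not match).
CV_PATTERN = re.compile(
    r"(^|[^A-Za-z])(cv|vita|vitae|resume)([^A-Za-z]|$)",
    re.IGNORECASE,
)


def likely_cv_paths(paths):
    """Return the paths whose names suggest a CV or resume."""
    return [p for p in paths if CV_PATTERN.search(p)]


# Made-up sample of what an admin-supplied page list might contain.
sample_paths = [
    "/faculty/jsmith/cv.pdf",
    "/faculty/jdoe/Resume_2009.html",
    "/admissions/vitality.html",
    "/about/index.html",
    "/bio/curriculum-vitae.htm",
]

matches = likely_cv_paths(sample_paths)
```

A filter like this will miss CVs stored under unrelated filenames, so it is a way to narrow the list rather than a complete answer.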

kyle


--
You received this message because you are subscribed to the Google Groups "pnwcode4lib" group.
To post to this group, send email to pnwco...@googlegroups.com.
To unsubscribe from this group, send email to pnwcode4lib...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/pnwcode4lib?hl=en.




--
----------------------------------------------------------
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
bane...@uoregon.edu / 503.999.9787

wade

Jun 28, 2010, 11:25:09 AM
to pnwcode4lib
Thanks, Kyle. I was sort of coming to these same conclusions.