From there it should be possible to grab the epub from the data contained in the Atom feed. But this requires a query to dip into the data. I was wondering if any of the data mungers on get-theinfo have found a good way for getting actual access to the ~1 million books that Google are making available?
Some people have been uploading these to the Internet Archive, but there's no easy way to get them from Google in bulk -- they block IPs that hit it too frequently and put up captchas after a couple downloads.
On Mon, Aug 30, 2010 at 3:11 PM, Ed Summers <e...@pobox.com> wrote:
> I imagine y'all have seen the various press releases about Google releasing a million public domain books as epubs, e.g.
> From there it should be possible to grab the epub from the data contained in the Atom feed. But this requires a query to dip into the data. I was wondering if any of the data mungers on get-theinfo have found a good way for getting actual access to the ~1 million books that Google are making available?
On Mon, Aug 30, 2010 at 2:13 PM, Aaron Swartz <m...@aaronsw.com> wrote:
> Some people have been uploading these to the Internet Archive, but there's no easy way to get them from Google in bulk -- they block IPs that hit it too frequently and put up captchas after a couple downloads.
Has Google advanced any rationale for not allowing download of works that are clearly out of copyright?
If not -- sweat of the brow is not legal defense, so how about using a decentralized system to download from a large number of IPs?
>> Some people have been uploading these to the Internet Archive, but there's no easy way to get them from Google in bulk -- they block IPs that hit it too frequently and put up captchas after a couple downloads.
> Has Google advanced any rationale for not allowing download of works that are clearly out of copyright?
> If not -- sweat of the brow is not legal defense, so how about using a decentralized system to download from a large number of IPs?
People interested in donating IPs to such a project should email me off-list.
On Mon, Aug 30, 2010 at 10:12 PM, Jeremy Dunck <jdu...@gmail.com> wrote:
> On Mon, Aug 30, 2010 at 2:13 PM, Aaron Swartz <m...@aaronsw.com> wrote:
>> Some people have been uploading these to the Internet Archive, but there's no easy way to get them from Google in bulk -- they block IPs that hit it too frequently and put up captchas after a couple downloads.
> Has Google advanced any rationale for not allowing download of works that are clearly out of copyright?
> If not -- sweat of the brow is not legal defense, so how about using a decentralized system to download from a large number of IPs?
Could this be done with a simple 2 step html/json Web page for sympathisers to use?
1. download a file or a few
2. upload it somewhere more communal...
The threshold for "too frequently" appears to be incredibly low for books.google.com. I just got blocked after a half dozen searches using variants of a single pair of terms resulting in ~65 hits for books that I *manually* added to a Google Books bookshelf one by one. That's without even downloading any of the books! If they are being that aggressive about blocking, it's going to take a massive number of IPs to do anything useful.
BTW, if anyone has their eye on the little "Export as XML" feature for Google Books bookshelves, be forewarned that it includes a minuscule amount of information. It just has the title, author, and Google ID, no publication info or anything else to help disambiguate or identify the volume.
On Mon, Aug 30, 2010 at 3:13 PM, Aaron Swartz <m...@aaronsw.com> wrote:
> Some people have been uploading these to the Internet Archive, but there's no easy way to get them from Google in bulk -- they block IPs that hit it too frequently and put up captchas after a couple downloads.
> On Mon, Aug 30, 2010 at 3:11 PM, Ed Summers <e...@pobox.com> wrote:
>> I imagine y'all have seen the various press releases about Google releasing a million public domain books as epubs, e.g.
>> From there it should be possible to grab the epub from the data contained in the Atom feed. But this requires a query to dip into the data. I was wondering if any of the data mungers on get-theinfo have found a good way for getting actual access to the ~1 million books that Google are making available?
On Mon, Aug 30, 2010 at 4:56 PM, Tom Morris <tfmor...@gmail.com> wrote:
> BTW, if anyone has their eye on the little "Export as XML" feature for Google Books bookshelves, be forewarned that it includes a minuscule amount of information. It just has the title, author, and Google ID, no publication info or anything else to help disambiguate or identify the volume.
I'm not sure if you've noticed it, but the Google Books API includes some useful DC metadata like:
<dc:creator>Richard Ambrosini</dc:creator>
<dc:creator>Richard Dury</dc:creator>
<dc:date>2006</dc:date>
<dc:description>As the editors point out in their Introduction, Stevenson reinvented the “personal essay” and the “walking tour essay,” in texts of ironic stylistic ...</dc:description>
<dc:format>377 pages</dc:format>
<dc:format>book</dc:format>
<dc:identifier>z2Yf1FX02EkC</dc:identifier>
<dc:identifier>ISBN:0299212246</dc:identifier>
<dc:identifier>ISBN:9780299212247</dc:identifier>
<dc:publisher>Univ of Wisconsin Pr</dc:publisher>
<dc:subject>Literary Criticism</dc:subject>
<dc:title>Robert Louis Stevenson</dc:title>
<dc:title>writer of boundaries</dc:title>
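For anyone who wants to pull those fields out programmatically, here is a minimal sketch using only the Python standard library. It parses a hand-made sample entry built from the elements quoted above, not a live API response. The namespace URI bound to the dc prefix, the enclosing Atom entry element, and the helper names dc_fields and isbns are my assumptions for illustration; check them against a real feed before relying on this.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: the feed binds the "dc" prefix to this URI. Verify against a
# real response; only the prefix, not the URI, appears in the thread.
DC_NS = "http://purl.org/dc/terms"
ATOM_NS = "http://www.w3.org/2005/Atom"

# Hand-made sample: the dc elements quoted in the thread, wrapped in an
# Atom <entry> so the snippet parses on its own.
SAMPLE_ENTRY = f"""<entry xmlns="{ATOM_NS}" xmlns:dc="{DC_NS}">
  <dc:creator>Richard Ambrosini</dc:creator>
  <dc:creator>Richard Dury</dc:creator>
  <dc:date>2006</dc:date>
  <dc:format>377 pages</dc:format>
  <dc:format>book</dc:format>
  <dc:identifier>z2Yf1FX02EkC</dc:identifier>
  <dc:identifier>ISBN:0299212246</dc:identifier>
  <dc:identifier>ISBN:9780299212247</dc:identifier>
  <dc:publisher>Univ of Wisconsin Pr</dc:publisher>
  <dc:subject>Literary Criticism</dc:subject>
  <dc:title>Robert Louis Stevenson</dc:title>
  <dc:title>writer of boundaries</dc:title>
</entry>"""


def dc_fields(entry_xml):
    """Return {local element name: [values]} for dc children of an entry."""
    root = ET.fromstring(entry_xml)
    prefix = "{" + DC_NS + "}"
    fields = defaultdict(list)
    for child in root:
        # ElementTree spells namespaced tags as "{uri}localname".
        if child.tag.startswith(prefix):
            name = child.tag[len(prefix):]
            fields[name].append((child.text or "").strip())
    return dict(fields)


def isbns(fields):
    """Pick out identifiers carrying the 'ISBN:' prefix seen in the sample."""
    marker = "ISBN:"
    return [v[len(marker):] for v in fields.get("identifier", [])
            if v.startswith(marker)]


if __name__ == "__main__":
    fields = dc_fields(SAMPLE_ENTRY)
    print(fields["title"])
    print(isbns(fields))
```

Repeated elements such as dc:creator and dc:title are kept as lists, since the sample shows a record can carry more than one of each.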
I like the idea of some coordinated effort to get this public domain content replicated somehow. There are already 902,788 Google Book titles on Internet Archive, which is a damn fine start:
On Mon, Aug 30, 2010 at 5:02 PM, Ed Summers <e...@pobox.com> wrote:
> On Mon, Aug 30, 2010 at 4:56 PM, Tom Morris <tfmor...@gmail.com> wrote:
>> BTW, if anyone has their eye on the little "Export as XML" feature for Google Books bookshelves, be forewarned that it includes a minuscule amount of information. It just has the title, author, and Google ID, no publication info or anything else to help disambiguate or identify the volume.
> I'm not sure if you've noticed it, but the Google Books API includes some useful DC metadata like:
> <dc:creator>Richard Ambrosini</dc:creator>
> <dc:creator>Richard Dury</dc:creator>
> <dc:date>2006</dc:date>
> <dc:description>As the editors point out in their Introduction, Stevenson reinvented the “personal essay” and the “walking tour essay,” in texts of ironic stylistic ...</dc:description>
> <dc:format>377 pages</dc:format>
> <dc:format>book</dc:format>
> <dc:identifier>z2Yf1FX02EkC</dc:identifier>
> <dc:identifier>ISBN:0299212246</dc:identifier>
> <dc:identifier>ISBN:9780299212247</dc:identifier>
> <dc:publisher>Univ of Wisconsin Pr</dc:publisher>
> <dc:subject>Literary Criticism</dc:subject>
> <dc:title>Robert Louis Stevenson</dc:title>
> <dc:title>writer of boundaries</dc:title>
> I like the idea of some coordinated effort to get this public domain content replicated somehow. There are already 902,788 Google Book titles on Internet Archive, which is a damn fine start:
The Google Book Search API Terms of Service say "The Google Book Search APIs are limited to allowing you to display Google Book Search Content on your site, and are not intended to provide you with the ability to access other Google services or data." and also include the rather strange "2.9 Your implementation of the Google Book Search APIs must be made freely accessible to users." which I can't even parse well enough to figure out what it means.
Perhaps the Internet Archive has been granted an exception, but my reading is that anyone who wanted to help would probably need an exception too.
On Tue, Aug 31, 2010 at 2:07 PM, Tom Morris <tfmor...@gmail.com> wrote:
> "2.9 Your implementation of the Google Book Search APIs must be made freely accessible to users." which I can't even parse well enough to figure out what it means.
I think it must mean that if your application uses the Google Book Search API, you cannot charge for it. It has to be free.
The other interpretation could be that it has to be available on the public internet, rather than behind a password and/or on a private network only.
The core concept seems to be that if you are getting this stuff for free, don't hoard or abuse it.