google books

Ed Summers

Aug 30, 2010, 3:11:34 PM

to get-t...@googlegroups.com

I imagine y'all have seen the various press releases about Google
releasing a million public domain books as epubs, e.g.

http://booksearch.blogspot.com/2009/08/download-over-million-public-domain.html

If you've seen Google's Book Search before, it looks possible to
construct a query for pre-1923 books:

curl 'http://books.google.com/books/feeds/volumes?tbs=cd_max:Jan%2001_2%201923&q=Stevenson' \
  | xmllint --format - | less
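
For anyone who would rather script the query than hand-encode it, here is a
minimal sketch that rebuilds the same URL with Python's standard library. The
function name volumes_url is made up for illustration, and the tbs value is
copied verbatim from the curl line above; nothing here is taken from Google's
documentation.

```python
from urllib.parse import urlencode, quote

# Feed endpoint taken from the curl example in this message.
BASE = "http://books.google.com/books/feeds/volumes"


def volumes_url(query, tbs="cd_max:Jan 01_2 1923"):
    """Build the feed URL for a search term (hypothetical helper).

    The default tbs string is the decoded form of the value in the
    curl example; its exact semantics are not documented here.
    """
    params = {"tbs": tbs, "q": query}
    # quote (rather than quote_plus) turns spaces into %20, and
    # safe=":" keeps the colon literal, matching the original URL.
    return BASE + "?" + urlencode(params, quote_via=quote, safe=":")


if __name__ == "__main__":
    print(volumes_url("Stevenson"))
```

This only builds the string; it makes no request.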

From there it should be possible to grab the epub from the data
contained in the Atom feed. But this requires a query to dip into the
data. I was wondering if any of the data mungers on get-theinfo have
found a good way for getting actual access to the ~1 million books
that Google are making available?
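
As a rough sketch of the "grab the epub from the Atom feed" step: the snippet
below walks the entries of an Atom document and collects any link whose type is
application/epub+zip. The sample feed, its example.org URLs, and the idea that
downloads appear as links with that media type are all assumptions made for
illustration; the real feed may mark epubs differently.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample feed, NOT real Google Books output.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example Volume One</title>
    <link rel="alternate" type="text/html" href="http://example.org/v1"/>
    <link rel="enclosure" type="application/epub+zip"
          href="http://example.org/v1.epub"/>
  </entry>
  <entry>
    <title>Example Volume Two</title>
    <link rel="alternate" type="text/html" href="http://example.org/v2"/>
  </entry>
</feed>"""


def epub_links(feed_xml):
    """Return (title, href) pairs for entries that carry an epub link."""
    root = ET.fromstring(feed_xml)
    found = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        for link in entry.findall(ATOM + "link"):
            # Assumption: epub downloads are identified by media type.
            if link.get("type") == "application/epub+zip":
                found.append((title, link.get("href")))
    return found


if __name__ == "__main__":
    for title, href in epub_links(SAMPLE_FEED):
        print(title, href)
```

It works on a string already in hand, so it can be tried against saved output
from the curl command without making further requests.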

//Ed

Aaron Swartz

Aug 30, 2010, 3:13:29 PM

to get-t...@googlegroups.com

Some people have been uploading these to the Internet Archive, but
there's no easy way to get them from Google in bulk -- they block IPs
that hit it too frequently and put up captchas after a couple
downloads.

> --
> [from the http://groups.google.com/group/get-theinfo mailing list]
>

Jeremy Dunck

Aug 30, 2010, 4:12:12 PM

to get-t...@googlegroups.com

On Mon, Aug 30, 2010 at 2:13 PM, Aaron Swartz <m...@aaronsw.com> wrote:
> Some people have been uploading these to the Internet Archive, but
> there's no easy way to get them from Google in bulk -- they block IPs
> that hit it too frequently and put up captchas after a couple
> downloads.

Has Google advanced any rationale for not allowing download of works
that are clearly out of copyright?

If not -- sweat of the brow is not a legal defense, so how about using a
decentralized system to download from a large number of IPs?

Aaron Swartz

Aug 30, 2010, 4:14:19 PM

to get-t...@googlegroups.com

>> Some people have been uploading these to the Internet Archive, but
>> there's no easy way to get them from Google in bulk -- they block IPs
>> that hit it too frequently and put up captchas after a couple
>> downloads.
>
> Has Google advanced any rationale for not allowing download of works
> that are clearly out of copyright?
>
> If not -- sweat of the brow is not a legal defense, so how about using a
> decentralized system to download from a large number of IPs?

People interested in donating IPs to such a project should email me off-list.

Dan Brickley

Aug 30, 2010, 4:17:18 PM

to get-t...@googlegroups.com

Could this be done with a simple 2 step html/json Web page for
sympathisers to use?

1. download a file or a few
2. upload it somewhere more communal...

Dan

Tom Morris

Aug 30, 2010, 4:56:21 PM

to get-t...@googlegroups.com

The threshold for "too frequently" appears to be incredibly low for
books.google.com. I just got blocked after a half dozen searches
using variants of a single pair of terms resulting in ~65 hits for
books that I *manually* added to a Google Books bookshelf one by one.
That's without even downloading any of the books! If they are being
that aggressive about blocking, it's going to take a massive number of
IPs to do anything useful.

BTW, if anyone has their eye on the little "Export as XML" feature for
Google Books bookshelves, be forewarned that it includes a minuscule
amount of information. It just has the title, author, and Google ID,
no publication info or anything else to help disambiguate or identify
the volume.

Tom

Ed Summers

Aug 30, 2010, 5:02:25 PM

to get-t...@googlegroups.com

On Mon, Aug 30, 2010 at 4:56 PM, Tom Morris <tfmo...@gmail.com> wrote:
> BTW, if anyone has their eye on the little "Export as XML" feature for
> Google Books bookshelves, be forewarned that it includes a minuscule
> amount of information. It just has the title, author, and Google ID,
> no publication info or anything else to help disambiguate or identify
> the volume.

I'm not sure if you've noticed it, but the Google Books API includes
some useful DC metadata like:

<dc:creator>Richard Ambrosini</dc:creator>
<dc:creator>Richard Dury</dc:creator>
<dc:date>2006</dc:date>
<dc:description>As the editors point out in their Introduction,
Stevenson reinvented the “personal essay” and the “walking tour
essay,” in texts of ironic stylistic ...</dc:description>
<dc:format>377 pages</dc:format>
<dc:format>book</dc:format>
<dc:identifier>z2Yf1FX02EkC</dc:identifier>
<dc:identifier>ISBN:0299212246</dc:identifier>
<dc:identifier>ISBN:9780299212247</dc:identifier>
<dc:publisher>Univ of Wisconsin Pr</dc:publisher>
<dc:subject>Literary Criticism</dc:subject>
<dc:title>Robert Louis Stevenson</dc:title>
<dc:title>writer of boundaries</dc:title>
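
To show how those repeated DC elements might be pulled into something usable
for disambiguation, here is a small sketch. The sample entry reuses a few of
the elements above; the wrapping entry element, the dc namespace URI, and the
helper names dc_fields and isbns are assumptions for illustration, so check the
namespace against the actual feed before relying on it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: the feed binds the dc prefix to this URI.
DC_NS = "http://purl.org/dc/terms"

# Sample built from the elements quoted above; the <entry> wrapper
# and namespace declarations are added here so it parses on its own.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:dc="http://purl.org/dc/terms">
  <dc:creator>Richard Ambrosini</dc:creator>
  <dc:creator>Richard Dury</dc:creator>
  <dc:date>2006</dc:date>
  <dc:identifier>z2Yf1FX02EkC</dc:identifier>
  <dc:identifier>ISBN:0299212246</dc:identifier>
  <dc:identifier>ISBN:9780299212247</dc:identifier>
  <dc:title>Robert Louis Stevenson</dc:title>
  <dc:title>writer of boundaries</dc:title>
</entry>"""


def dc_fields(entry_xml, dc_ns=DC_NS):
    """Group DC child elements by local name; repeated names become lists."""
    prefix = "{" + dc_ns + "}"
    fields = defaultdict(list)
    for child in ET.fromstring(entry_xml):
        if child.tag.startswith(prefix):
            name = child.tag[len(prefix):]
            fields[name].append((child.text or "").strip())
    return dict(fields)


def isbns(fields):
    """Pick out identifiers that carry the ISBN: prefix."""
    return [v[len("ISBN:"):] for v in fields.get("identifier", [])
            if v.startswith("ISBN:")]


if __name__ == "__main__":
    record = dc_fields(SAMPLE_ENTRY)
    print(record["creator"], record["date"], isbns(record))
```

Keeping every field as a list avoids silently dropping the second creator,
title, or identifier.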

I like the idea of some coordinated effort to get this public domain
content replicated somehow. There are already 902,788 Google Book
titles on Internet Archive, which is a damn fine start:

http://www.archive.org/details/googlebooks

//Ed

Tom Morris

Aug 31, 2010, 2:07:35 PM

to get-t...@googlegroups.com

The Google Book Search API Terms of Service say "The Google Book
Search APIs are limited to allowing you to display Google Book Search
Content on your site, and are not intended to provide you with the
ability to access other Google services or data." and also include the
rather strange "2.9 Your implementation of the Google Book Search APIs
must be made freely accessible to users." which I can't even parse
well enough to figure out what it means.

Perhaps the Internet Archive has been granted an exception, but my
reading is that anyone who wanted to help would probably need an
exception too.

Tom

Alexandre Rafalovitch

Sep 29, 2010, 9:40:47 AM

to get-t...@googlegroups.com

On Tue, Aug 31, 2010 at 2:07 PM, Tom Morris <tfmo...@gmail.com> wrote:
> "2.9 Your implementation of the Google Book Search APIs
> must be made freely accessible to users." which I can't even parse
> well enough to figure out what it means.

I think it must mean that if your application is using the Google Book
Search API, you cannot charge for it. It has to be free.

The other interpretation could be that it has to be available on the
public internet, rather than behind a password and/or on a private
network only.

The core concept seems to be that if you are getting this stuff for
free, don't hoard or abuse it.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
- I think age is a very high price to pay for maturity (Tom Stoppard)
