Torrent and Python Script: Download the Whole Library at Once


Jacob Press

Apr 26, 2021, 7:22:03 PM
to Standard Ebooks
Here's a quick and dirty fetch script. It collects the URLs of the 'Recommended compatible epub' download for every book in the collection. If you have very basic knowledge of Python 3 scripting, you can tweak it.

For bandwidth reasons, the torrent linked below is the more considerate option, at least until it becomes wildly out of date.

I don't think it would be hard to run this every six months and compile a torrent of the files. If people want newer files, they can browse manually and download the newer ones, or run the script. A more advanced script could compare the files on the server with the local files and download only the ones that changed; I believe most download managers these days (aria2? JDownloader? wget?) can do that. A sketch of the idea follows below.
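
To sketch that compare-before-downloading idea in Python: an HTTP conditional GET with an If-Modified-Since header skips any file the server says hasn't changed since our local copy. Whether the Standard Ebooks server honors conditional requests is an assumption I haven't verified.

```
import os
import email.utils
import requests

def download_if_changed(url, path):
    headers = {}
    if os.path.exists(path):
        # Ask the server to skip the file if it hasn't changed
        # since our local copy was written.
        mtime = os.path.getmtime(path)
        headers['If-Modified-Since'] = email.utils.formatdate(mtime, usegmt=True)
    r = requests.get(url, headers=headers)
    if r.status_code == 304:
        return False  # server says our copy is current
    r.raise_for_status()
    with open(path, 'wb') as f:
        f.write(r.content)
    return True
```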

With torrents, there's always a tradeoff between having an updated listing and having a healthy swarm of seeders. Updating the torrent with every new/updated book would kill the seed base.

Continued from previous conversation: https://groups.google.com/g/standardebooks/c/KrXyzZVZig4/m/VKKCp0B5BAAJ

Torrent magnet link:

magnet:?xt=urn:btih:07ce25a12d5801461b488e3b04db6845c5f279b3&dn=Standard%20Ebooks%20-%20Recommended%20compatible%20epub%20-%202021-04-25&tr=udp%3A%2F%2Ftracker.internetwarriors.net%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A6969%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce&tr=udp%3A%2F%2Fretracker.lanta-net.ru%3A2710%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Fwww.torrent.eu.org%3A451%2Fannounce

I also wrote a scraper for this, but I'm omitting it here because there are better programs out there. Here's a guide to aria2, for example:

https://gist.github.com/amrza/f0534f0015c4e76a826baef8199ba6c3

Finally, the script:

```
import requests

# Base URL for the site; the hrefs in the feed are site-relative,
# so no '/' at the end here.
BASE_URL = 'https://standardebooks.org'

def get_xml():
    # Fetch the OPDS 'all' feed, which lists every book in the catalog.
    r = requests.get(BASE_URL + '/opds/all')
    r.raise_for_status()
    return r.text

def is_epub_link(line):
    # Keep only the lines that mention the flavor of epub we want.
    return 'Recommended compatible epub' in line

def parse_xml(text):
    # Quick and dirty: split on newlines rather than using a real XML parser.
    filtered = filter(is_epub_link, text.split('\n'))
    urls = []
    for line in filtered:
        # Get only the URL, without regex.
        url = line.split('<link href="')[1]
        url = url.split('"')[0]
        urls.append(BASE_URL + url)
    return urls

text = get_xml()
urls = parse_xml(text)
with open('urls.txt', 'w') as g:
    g.write('\n'.join(urls))
print("Now download the URLs with your own program")
```
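
And if you'd rather not reach for a download manager at all, here's a minimal downloader sketch to pair with the script. The filename derivation and the one-second pause are my own choices for being considerate with bandwidth, not anything official.

```
import os
import time
import requests

with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    # Use the last path segment as the local filename.
    filename = url.rsplit('/', 1)[-1]
    if os.path.exists(filename):
        continue  # already fetched on an earlier run
    r = requests.get(url)
    r.raise_for_status()
    with open(filename, 'wb') as out:
        out.write(r.content)
    time.sleep(1)  # be gentle on the server
```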

Alex Cabal

Apr 26, 2021, 7:32:56 PM
to standar...@googlegroups.com
Yes, but the ultimate problem, which I pointed out earlier, is that six months is far too long an interval. Not only are new ebooks released frequently, but the existing corpus is also updated frequently. It's not helpful for anyone, the readers or us, to have very outdated ebooks served via official channels. We want our readers to get the latest ebook whenever possible.

The OPDS "all" feed basically solves this problem, for people who can figure out how to parse and script XML. Not everyone can do that, but then again not everyone needs to download an entire corpus of ebooks all at once. That's already more reading than can be done in a decade!
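
(For those in the can-script-XML camp, a minimal sketch of parsing the feed with a real XML parser instead of line splitting. The Atom namespace is standard; that the epub flavor appears in each link's title attribute is an assumption based on the script earlier in the thread.)

```
import xml.etree.ElementTree as ET
import requests

ATOM = '{http://www.w3.org/2005/Atom}'
BASE = 'https://standardebooks.org'

feed = ET.fromstring(requests.get(BASE + '/opds/all').content)
urls = [
    BASE + link.get('href')
    for link in feed.iter(ATOM + 'link')
    # Assumption: the flavor name lives in the link's title attribute.
    if 'Recommended compatible epub' in (link.get('title') or '')
]
print(len(urls), 'URLs found')
```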

Jacob Press

Apr 26, 2021, 8:23:32 PM
to Standard Ebooks
It's the kind of thing that can be automated. One month? Two weeks? Daily? You said to be considerate with bandwidth, so I was. If your infrastructure can support seeding torrents, then you can generate a new torrent every second. I get that this project is focused on quality, but I'm highly skeptical that a work of classic literature is useless because one em dash or typo evaded notice. If I were in charge of this, I would probably automate an update every two weeks or every month, since the files are so small that a huge seeder base isn't needed.

I do see that the entire collection has been updated in the last month. In the two stories I looked at, I didn't see any changes to the content in the last year, only to metadata and layout. It appears people survived on the old metadata just fine, though?

I see value in the collection as a collection, not just a set of individual works. Correct, nobody can read it all. But for an educator or librarian, it could be a fun factoid or download link. There are many places online where you can download one-off ebooks, but Standard Ebooks is dedicated to a higher bar of quality than most public domain repositories. There is a curatorial aspect to the set as well - both with explicit rules (no modern religious texts, no cookbooks) and implicit selection (every volunteer chooses one book over another, for whatever reason). If you hope to attract additional volunteers, you could evangelize the project more, and to people who don't think like you. Presumably your existing approach is doing a great job of attracting fans of the one-off model.

Alex Cabal

Apr 26, 2021, 8:31:32 PM
to standar...@googlegroups.com
On 4/26/21 7:23 PM, Jacob Press wrote:
> I see value in the collection as a collection, not just a set of
> individual works. Correct, nobody can read it all. But for an educator
> or librarian, it could be a fun factoid or download link.

Educators and librarians use OPDS feeds; that's one of the major audiences OPDS was designed for. I don't think librarians are too interested in distributing vast zip files/torrents of hundreds or thousands of ebooks, otherwise we'd see torrents of library catalogs far more often than we do in reality. Instead, librarians take a curatorial approach, like we do.

I appreciate your effort; however, I remain unconvinced that a torrent of the corpus would be useful for the general reader. It would freeze the corpus at a point in time that is likely months old, which is not what we want as a project, and it would be yet another maintenance burden for us, because I for sure don't have time to maintain it, and like all OSS projects, interested parties leave just as often as they arrive.

The problem of rude scrapers is much more easily solved at the level of our servers, and those who are both technical enough to use BitTorrent and interested enough to want the whole corpus all at once can scrape the OPDS feed.