Recovering OLS scans and archiving scans of past projects


Alex Cabal

Sep 21, 2021, 5:29:45 PM
to standar...@googlegroups.com
Now that PGDP has shut down the Open Library System without warning,
we've been left without reference scans for the ebooks we used OLS for.
This amounts to 65 scans from 17 ebooks, many of them compilations of
pulp sci fi.

This raises two important points:

1. We must now go back to books that used OLS scans and see if we can
dig up the equivalent scans from archive.org. In the past few years,
more and more pulp sci fi magazines have been appearing in archive.org
so it may be possible to track down a significant percentage of these.
Then we can update the ebook metadata, and proceed to...

2. I think it would make sense to keep copies of scans for our own
purposes, in case something like this happens again in the future: for
example, if archive.org or hathitrust.org receives a bogus DMCA complaint
and we lose access to scans, or a scan URL changes. Most (all?) of these
services allow PDF downloads.

Can anyone help with item 1? An easy way to do that would be to use the
`sync-ebooks` script in the web repo to download the entire corpus, then
search for `/ols/` in each ebook's content.opf. Then go to archive.org
and run a full-text search for a distinctive string from the story/book.
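
A minimal sketch of that search step, assuming the corpus has already been downloaded into one directory with each ebook repo keeping its metadata in a file named content.opf somewhere beneath it. The function name `find_ols_sources` is made up for illustration; it is not part of any SE tool.

```python
from pathlib import Path


def find_ols_sources(corpus_root):
    """Return {relative path of content.opf: [lines that mention /ols/]}."""
    root = Path(corpus_root)
    hits = {}
    # Walk every content.opf below the corpus directory, in a stable order.
    for opf in sorted(root.rglob("content.opf")):
        lines = [
            line.strip()
            for line in opf.read_text(encoding="utf-8").splitlines()
            if "/ols/" in line
        ]
        if lines:
            hits[str(opf.relative_to(root))] = lines
    return hits


# Example use (the path is a placeholder):
# for path, lines in find_ols_sources("/path/to/corpus").items():
#     print(path, len(lines))
```

The result gives one entry per affected ebook, which is a convenient starting point for a tracking spreadsheet.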

For item 2, I don't think we should store the PDFs in the ebook repos
themselves, because they can be big and Git isn't good at tracking large
files, and we're keeping them in case of emergency; day-to-day work can
still occur at the scan URLs. For now we can simply keep the PDFs
somewhere else on the SE server.

We would have to go back through the corpus and download scans for our
completed ebooks. Can anyone help with that? That is probably something
that is mostly scriptable.

B Keith

Sep 21, 2021, 5:45:29 PM
to Standard Ebooks
I do have the scans for OLS Airlords of Han if you want them, but none of the others that I have worked on. I will take a look at my projects at least and see if I can find equivalents.

Bruce
_________

Gaudeamus igitur iuvenes dum sumus
> --
> You received this message because you are subscribed to the Google Groups "Standard Ebooks" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to standardebook...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/standardebooks/2f2e7cad-c77f-d908-e9a1-46528e5015ad%40standardebooks.org.

Vince

Sep 21, 2021, 5:49:43 PM
to standar...@googlegroups.com
I have a copy of the corpus, I can start on #1.

I regularly download PDFs from Archive (and Google when I have to use them) for proofing, but HathiTrust usually only allows a page at a time without a login, and I’ve never investigated what it takes to get one.

Alex Cabal

Sep 21, 2021, 6:02:53 PM
to standar...@googlegroups.com
It might just take a free account?

Vince

Sep 21, 2021, 6:18:54 PM
to standar...@googlegroups.com
No, since I brought it up, I decided to go check it out. There is allegedly more you can do with a free account (which I created), but to download full PDFs, you have to have a partner institution account. (For HathiTrust exclusives, obviously; if their scans are from Google, you can go to Google and get one there.)

Alex Cabal

Sep 21, 2021, 7:06:45 PM
to standar...@googlegroups.com
Some books appear to have whole downloads even while not logged in, like
https://babel.hathitrust.org/cgi/pt?id=uc2.ark:/13960/t5m902w1t&view=1up&seq=7&skin=2021

For books that they took from Google, looks like there's a direct link
to the Google Books page under the 'get this item' header

So far, a big percentage of what I've looked at has a Google Books
equivalent.

So, we could at least get a sizeable portion of them as PDFs, and mark
the rest for further meditation

Vince

Sep 21, 2021, 7:33:42 PM
to standar...@googlegroups.com
Right, there just seems to be no rhyme or reason (that I’ve been able to figure out, anyway) as to when a full PDF is available and when it’s page only.
Yes, sorry, that’s what I meant by “you can go to Google and get one there,” as in “go there from the link”.

Alex Cabal

Sep 21, 2021, 9:43:40 PM
to standar...@googlegroups.com
So, can anyone take charge of working on this? I just don't have the time.

There's not a huge rush, so it's something that can be done over time.
The first step would be creating a spreadsheet, which would be fairly
easy, and then we can start working on collecting the PDFs.

Vince

Sep 21, 2021, 9:55:06 PM
to Standard Ebooks

Here’s what I’ve found so far; it’s all I have time for today. Poul Anderson doesn’t have which story goes with which scan, so I’m going to leave that for last. The links are to the page, so you can look at them if you want/need to.


Voodoo Planet—nothing found anywhere
Stand by for Mars—nothing
Brood of the Witch Queen—nothing
The Dark Other—nothing
The Night Land—nothing

Fritz Leiber: for the three I can’t find, I noted where they were originally published, in case anyone else can come up with them

The Night of the Long Knives—Fantastic Fiction 1960 maybe?
The 64-Square Madhouse: https://archive.org/details/1962-05_IF/page/n63/mode/2up (I love that they started the story on page 64.)
X Marks the Pedwalk—4/1963 Worlds of Tomorrow
A Hitch in Space—8/1963 Worlds of Tomorrow (Archive has a Philip K. Dick story from this issue, but nothing else.)

maticstric

Sep 21, 2021, 11:39:55 PM
to Standard Ebooks
Here are the other short stories:

X Marks the Pedwalk: https://archive.org/details/Worlds_of_Tomorrow_v01n01_1963-04_LennyS-EXciter/page/n55/mode/2up
A Hitch in Space: https://archive.org/details/Worlds_of_Tomorrow_v01n03_1963-08_dtsg0318.Anon/page/n77/mode/2up

Also about Poul Anderson: when I added the new story, nothing was in publication order, so I redid the ToC and also reorganized the scans into publication order. However, some of the scans were missing and I didn't add them. If I have some time tomorrow, I might.

maticstric

Sep 22, 2021, 1:25:57 PM
to Standard Ebooks
I made a spreadsheet to keep track of this: https://docs.google.com/spreadsheets/d/1Ma0YHguM03xv826Gg82GRUomP3GCtB7OYeBF19qrpl0/edit?usp=sharing. Feel free to edit anything; I'm not sure what the best way to format it is.

I'll start the Poul Anderson stories right now.

Vince

Sep 22, 2021, 1:35:50 PM
to standar...@googlegroups.com
There are two pieces here:
  1. Finding new scans where we’ve used OLS scans, and updating the metadata with the new scan links, and
  2. Once that’s done, downloading all the scans to the SE server.

I’m working on #1. If you want to help with Anderson’s stories, I’m all for it. But I’ll handle the rest, and put together the PRs when I’m finished.

The big need is #2.
  • For each scan, identifying the URL for the PDF version of that scan (is that determinable from the scan URL, perhaps for each provider, or does it have to be identified manually?). That’s what I would suggest for the spreadsheet: each book in the corpus, all of the scan URLs for that book, and the corresponding PDF URL.
  • I don’t think we want to bother with downloading the PDFs locally; we’d just have to turn around and upload them to the SE server. Once the spreadsheet is complete, we can just take all the PDF URLs and run a script on the SE server itself to download them directly.
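
The server-side script described in the second bullet could look roughly like this. It is only a sketch under stated assumptions: the spreadsheet is exported as CSV, and the column names `ebook`, `scan_url`, and `pdf_url` are hypothetical, as are the function names. Planning is kept separate from downloading so the plan can be reviewed before anything is fetched.

```python
import csv
import io
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def plan_downloads(csv_text, archive_root):
    """Return a list of (pdf_url, destination path) for rows that have a PDF URL."""
    plan = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        pdf_url = (row.get("pdf_url") or "").strip()
        if not pdf_url:
            continue  # PDF URL not identified yet; leave for manual follow-up
        filename = Path(urlparse(pdf_url).path).name
        plan.append((pdf_url, Path(archive_root) / row["ebook"].strip() / filename))
    return plan


def download_all(plan):
    """Fetch every planned file, skipping anything already on disk."""
    for url, dest in plan:
        dest.parent.mkdir(parents=True, exist_ok=True)
        if not dest.exists():
            urlretrieve(url, dest)
```

Running `plan_downloads` on its own first gives a cheap dry run: rows with an empty `pdf_url` simply drop out and can be chased up by hand.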

That is an unofficial take, of course; this is Alex’s party.

maticstric

Sep 22, 2021, 1:45:02 PM
to Standard Ebooks
Sure, I can create a separate spreadsheet for the PDF URLs and try to figure out if I can automate it somehow from the scan URL. Hopefully at least the archive.org ones.

I just added all the archive.org scans for Anderson that were already in the content.opf. All the others are either pgdp, comicbooksplus, luminist, or just missing. You can work on replacing those if you want. If you can't find any, let me know and I can give it a go.

Vince

Sep 22, 2021, 2:33:20 PM
to standar...@googlegroups.com
Thanks! And great work, btw, on finding those other three Leiber stories. I searched Archive numerous ways, but apparently my search skills don’t make the grade. :)

B Keith

Sep 22, 2021, 3:46:13 PM
to Standard Ebooks
Hey Vince,

I am digging up what I can for the Mack Reynolds collection. So far I am batting 90%. Did you want links for the page number or just the magazine itself?

Bruce
_________

Gaudeamus igitur iuvenes dum sumus

Vince

Sep 22, 2021, 3:52:45 PM
to standar...@googlegroups.com
Just the mag itself is fine, that’s all that’s going in the metadata; I only included the page #’s in my initial email in case Alex wanted to look at one or more of them to make sure they were OK.

B Keith

Sep 22, 2021, 5:13:31 PM
to Standard Ebooks
This is a tab-separated file. Let me know if you want something else :-)  Not much in 1961-62 unfortunately

Title Author Year Date Magazine New url Old url
Dogfight—1973 Mack Reynolds 1953 jul-aug Imagination https://archive.org/details/Imagination_v04n06_1953-07_cape1736/ https://www.pgdp.org/ols/tools/biblio.php?id=projectID4aa2556c5cbaf
Happy Ending Mack Reynolds and Fredric Brown 1957 sept Fantastic Universe https://archive.org/details/Fantastic_Universe_v08n03_Sep_1957_Wilddog-DCP https://www.pgdp.org/ols/tools/biblio.php?id=projectID49c82d6189993
Unborn Tomorrow Mack Reynolds 1959 june Astounding https://archive.org/details/Astounding_v63n04_1959-06_EXciter-LennyS https://www.pgdp.org/ols/tools/biblio.php?id=projectID46917350e5c63
Summit Mack Reynolds 1960 feb Astounding https://archive.org/details/sim_analog-science-fiction-fact_1960-02_64_6 https://www.pgdp.org/ols/tools/biblio.php?id=projectID4682eef623947
Revolution Mack Reynolds 1960 may Astounding https://archive.org/details/sim_analog-science-fiction-fact_1960-05_65_3 https://www.pgdp.org/ols/tools/biblio.php?id=projectID469171f14fd3f
Adaptation Mack Reynolds 1960 aug Astounding https://archive.org/details/sim_analog-science-fiction-fact_1960-08_65_6 https://www.pgdp.org/ols/tools/biblio.php?id=projectID46916599916da
Combat Mack Reynolds 1960 oct Analog https://archive.org/details/sim_analog-science-fiction-fact_1960-10_66_2 https://www.pgdp.org/ols/tools/biblio.php?id=projectID4914cf86f4221
I'm a Stranger Here Myself Mack Reynolds 1960 dec Amazing https://www.pgdp.org/ols/tools/biblio.php?id=projectID48c445cc111b3
Gun for Hire Mack Reynolds 1960 dec Analog https://archive.org/details/sim_analog-science-fiction-fact_1960-12_66_4 https://www.pgdp.org/ols/tools/biblio.php?id=projectID46d4c9827eadb
Freedom Mack Reynolds 1961 feb Analog https://archive.org/details/sim_analog-science-fiction-fact_1961-06_17_6 https://www.pgdp.org/ols/tools/biblio.php?id=projectID4914d22f3bd3e
Ultima Thule Mack Reynolds 1961 mar Analog https://archive.org/details/sim_analog-science-fiction-fact_1961-03_67_1 https://www.pgdp.org/ols/tools/biblio.php?id=projectID4914ddd4f0ab1
Status Quo Mack Reynolds 1961 aug Analog https://archive.org/details/sim_analog-science-fiction-fact_1961-08_67_6 https://www.pgdp.org/ols/tools/biblio.php?id=projectID4914e24d73ccb
Mercenary Mack Reynolds 1962 apr Analog https://www.pgdp.org/ols/tools/biblio.php?id=projectID46e33170e5708
Subversive Mack Reynolds 1962 dec Analog https://www.pgdp.org/ols/tools/biblio.php?id=projectID4671dcb9cec18
The Common Man Mack Reynolds 1963 jan Analog https://www.pgdp.org/ols/tools/biblio.php?id=projectID4671d4f5aee30
Frigid Fracas Mack Reynolds 1963 mar-apr Analog https://www.pgdp.org/ols/tools/biblio.php?id=projectID49443df3561e2
Spaceman on a Spree Mack Reynolds 1963 jun Worlds of Tomorrow https://archive.org/details/Worlds_of_Tomorrow_v01n02_1963-06_LennyS-Exciter https://www.pgdp.org/ols/tools/biblio.php?id=projectID57cb212a1c5ba



Gaudeamus igitur iuvenes dum sumus

Vince

Sep 22, 2021, 5:21:36 PM
to standar...@googlegroups.com
Excellent, thanks, Bruce!


Message has been deleted
Message has been deleted

Vince

Sep 23, 2021, 1:34:09 AM
to standar...@googlegroups.com
I have to take back The Night Land; the scans I found were for an abridged edition. I haven’t found anything else.

I submitted PRs for Jacob’s Room and the Leiber and Reynolds short story collections.

Alex Cabal

Sep 23, 2021, 2:43:38 PM
to standar...@googlegroups.com
Great, thanks. Though we'll also need to change the imprint/colophon to
mention IA and not OLS. Can you do that too?


Vince

Sep 23, 2021, 2:44:51 PM
to standar...@googlegroups.com
Grrr, sorry, forgot about that. Yes, definitely.

Vince

Sep 23, 2021, 3:08:30 PM
to standar...@googlegroups.com
OK, I believe that is done. I just added them to the existing commit and force pushed all three. I ran lint on all of them and it was clean.

Alex Cabal

Sep 23, 2021, 3:21:58 PM
to standar...@googlegroups.com
OK great, thanks! That knocked out a significant portion of the OLS
links. The last big one is Poul Anderson and then the rest are probably
going to be harder/impossible.

Here is one of them though, for H. Beam Piper's Four-Day Planet:
https://archive.org/details/fourdayplanetlon00pipe/page/n3/mode/2up

Alex Cabal

Sep 23, 2021, 3:27:03 PM
to standar...@googlegroups.com
Vince, if you want to start work on item #2 (archiving PDFs of all
existing projects) sometime, I can give you access to the SE server so
you can upload them to a folder for now. If you want to do that then
send me your public SSH key and the username you'd like on the server.


maticstric

Sep 23, 2021, 3:32:49 PM
to Standard Ebooks
I actually just made a spreadsheet with all the books and their scan URLs. I don't have the time to do the PDF URLs right now but I might in the upcoming week(s). Here's the link if anyone else wants to get started. Separating the <dc:source> elements between the transcription and the scans required quite a bit of manual work. And there are a couple of links which I didn't understand and highlighted in red.

Vince

Sep 23, 2021, 3:42:27 PM
to standar...@googlegroups.com
Excellent, great work, thanks!

The Grant links are to some maps that were included in the book. They are valid.
The Tolstoy link to mobirereads is to an epub that has the text of a story. It doesn’t belong in the scans, as it isn’t one. It’s a transcription, if anything.

If someone or preferably multiple someones would like to pitch in and start updating this with the PDF URLs, all help would be greatly appreciated!

Vince

Sep 23, 2021, 3:47:10 PM
to standar...@googlegroups.com
It’s possible that each provider is consistent in their PDF URL naming, such that the PDF URL can be derived from the scan URL. I have not investigated that, so that’s something else someone could help with. If they’re not consistent, then we’re going to have to manually copy the PDF link for every one of these scans, which will obviously take a while, and is why we could use multiple someones’ help with it.

Alex Cabal

Sep 23, 2021, 4:46:23 PM
to standar...@googlegroups.com
At least for archive.org it could be scriptable. Download the HTML of
the book's page, then see if there's an xpath match:

//a[normalize-space(text()) = 'PDF']

If so then you have a link to the PDF and the script can continue to
download it.

Something similar can probably be done with Google Books
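
A rough standard-library stand-in for that XPath match, in case it helps: it scans HTML for `<a>` elements whose trimmed text is exactly "PDF" and resolves their `href` against the page URL. Whether the real archive.org page exposes such a link in its static HTML is an assumption here, and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkFinder(HTMLParser):
    """Collect href values of <a> elements whose trimmed text is exactly 'PDF'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace, roughly what normalize-space() does.
            if " ".join("".join(self._text).split()) == "PDF":
                self.links.append(self._href)
            self._href = None


def find_pdf_links(html, base_url):
    """Return absolute URLs of all links labelled 'PDF' in the given HTML."""
    finder = PdfLinkFinder()
    finder.feed(html)
    return [urljoin(base_url, href) for href in finder.links]
```

An empty result would mean the item has no matching link and needs to be handled by hand.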


Anthony J. Bentley

Sep 24, 2021, 2:47:18 AM
to standar...@googlegroups.com
Hi,

IA provides a command-line tool suitable for scripting downloads: https://archive.org/services/docs/api/internetarchive/cli.html

It’s also worth noting that IA provides torrents, and people could consider seeding.
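
As a sketch of how that tool might be driven from a script: the helper below only builds the command line, so it can be inspected or logged before anything runs. It assumes the `ia` executable is installed and on the PATH; the flag names used (`--glob`, `--destdir`, `--dry-run`) should be checked against `ia download --help` for the installed version, and the function names are hypothetical.

```python
import subprocess


def ia_download_command(identifier, destdir, pattern="*.pdf", dry_run=False):
    """Build an `ia download` command that fetches only files matching pattern."""
    cmd = ["ia", "download", identifier, f"--glob={pattern}", f"--destdir={destdir}"]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def download_item(identifier, destdir):
    """Run the download; raises CalledProcessError if `ia` exits non-zero."""
    subprocess.run(ia_download_command(identifier, destdir), check=True)


# Example use, with an item identifier taken from a scan URL:
# download_item("Worlds_of_Tomorrow_v01n01_1963-04_LennyS-EXciter", "scans")
```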

--
Anthony J. Bentley


Alex Cabal

Sep 24, 2021, 11:23:17 AM
to standar...@googlegroups.com
Great find, I think this is going to be the way to go to script this!

Vince Rice

Sep 24, 2021, 11:48:30 AM
to standar...@googlegroups.com
They both came through at the same time. The first one was probably waiting to be sent and went out when you sent the second one.

If they have to have an account and can't be downloaded, I don't think they're going to work. We need publicly available scans, and, now, ones we can download ourselves.

I'll check on Luminist for them.

> On Sep 24, 2021, at 9:31 AM, maticstric <matic...@gmail.com> wrote:
>
> I thought I already sent this message, so sorry if there's a duplicate, but I don't see it on my end:
>
> I found the other four Analog stories. They are from the same uploader as the others Bruce found, but for some reason they have to be borrowed. All it takes is a free account. Sadly you can't download them.
>
> Mercenary: https://archive.org/details/sim_analog-science-fiction-fact_1962-04_69_2
> Subversive: https://archive.org/details/sim_analog-science-fiction-fact_1962-12_70_4
> The Common Man: https://archive.org/details/sim_analog-science-fiction-fact_1963-01_70_5
> Frigid Fracas: https://archive.org/details/sim_analog-science-fiction-fact_1963-03_71_7 & https://archive.org/details/sim_analog-science-fiction-fact_1963-04_71_8
>
> However, you can find all of these, with downloads, including "I'm a Stranger Here Myself", at Luminist (these ones are in color too unlike the Archive ones). It's a great resource for pulp scifi/fantasy scans.
>
> http://www.luminist.org/archives/SF/AS.htm
> http://www.luminist.org/archives/SF/AN.htm

Alex Cabal

Sep 24, 2021, 11:51:56 AM
to standar...@googlegroups.com
I think we can use IA *links* even if one must have an account to
"borrow" the book; all that means is that IA has not been convinced of
the copyright status of the book, it doesn't reflect the *actual*
copyright status of the book. I would certainly prefer IA links over a
hobbyist website because we can be reasonably sure that IA will still be
here in 10 years.

Now as far as downloading a PDF for our own archives goes, obviously in
these cases we have to download the Luminist copy, which is fine. But
even if we have the Luminist PDF, we should still link to IA in the
ebook metadata.

Vince

Sep 24, 2021, 1:27:44 PM
to Standard Ebooks
Wow, yes, that command-line tool is fantastic. Thanks for the heads up!

Vince

Sep 24, 2021, 4:38:15 PM
to Standard Ebooks
Alex, when we get there, how do you want to arrange and name the scan files on the server? I assume you want them in a separate directory structure.

If the directory for the scans was scanpdfs, do you want them all just dumped in that directory, or do you want separate directories for each book, e.g.
an a-a-milne_the-red-house-mystery subdirectory with all the scans for that book in it, and
an a-merritt_the-moon-pool subdirectory with all the scans for that book in it,
and so forth?

Do you want the scans to be named as Archive/Google/Hathi have them, e.g. redhousemystery00milngoog.pdf, The_Cream_of_the_Jest.pdf, etc.?

Vince

Sep 25, 2021, 4:10:55 PM
to Standard Ebooks
I’ve submitted PRs for all that I and others here have found to replace the OLS scans. I have not been able to find anything for these:
Andre Norton, Voodoo Planet
Carey Rockwell, Stand by for Mars
H. Beam Piper, The Cosmic Computer
John Campbell, The Black Star Passes
Jules Verne, The Mysterious Island
Sax Rohmer, Brood of the Witch Queen
Stanley Weinbaum, The Dark Other
William Hope Hodgson, The Night Land

I did find one for The Night Land, but it’s an abridged edition. There are plenty of scans for The Mysterious Island, but they’re all the Kingston translation; there are none (that I can find) for the Stephen White translation we’re using. (Searching for “the mysterious island verne stephen white” on archive.org does bring up our edition, so there’s that.)

I’ll start working on the downloads next, but that is going to take a while, and I’m going to take a break to get some other work done.

Alex Cabal

Sep 25, 2021, 5:37:31 PM
to standar...@googlegroups.com
Excellent, thanks Vince and Matic. That's honestly a pretty good outcome
I think, I was expecting more to be missing!

Next project is to get those PDFs for our archives. I think scripting
archive.org will be the natural starting point as that would be the easiest.


Vince

Sep 25, 2021, 5:41:06 PM
to Standard Ebooks
And Bruce. And more would be missing if it was just up to me; I still don’t know how Matic found some of those Analog’s, I searched ten ways from Sunday on IA and never did find them. Kudos to him!

Yes, that’s what I’ll start with next week.

maticstric

Sep 25, 2021, 9:30:49 PM
to Standard Ebooks
Haha, thanks! I've been looking through IA for pulp magazine scans for quite a while now, both for SE and personal stuff. The main thing to note is that IA pulp scans are a complete mess: nothing is organized and everything is hard to find. If you go to the ones Bruce found (like this one https://archive.org/details/sim_analog-science-fiction-fact_1960-12_66_4) and scroll down you'll see there's a "next issue" link. So I just clicked it a bunch till I got the ones I wanted lol.

It also says they're in the collection "pub_analog-science-fiction-fact" but when you click it they're not there... I don't know why. But still, collections can also be super helpful. It's how I found the Amazing Stories one for Leiber.

Vince

Sep 27, 2021, 3:20:50 PM
to Standard Ebooks
Anyone know the difference between the various formats IA offers? I’ve posted a question on their Text Archive forum as well.

They have a half-dozen or more different PDF formats. When it exists, I believe we want “Text PDF” (it’s also often the only PDF format present for an item). But in many cases the item only has “Image Container PDF” and “Additional Text PDF,” and I haven’t been able to find anything (on IA or via web search) on what the difference is. And what the difference between “Text PDF” and “Additional Text PDF” is.
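
Until the difference is clear, a download script could at least encode a preference order and report which format it ended up with. The sketch below assumes each item's file listing is available as a list of records with `name` and `format` fields (which is how IA's metadata JSON appears to describe files). The order itself, with "Text PDF" first, is only the guess stated above, and the filenames in the example are invented.

```python
# Assumed preference order; adjust once the formats are better understood.
PREFERRED_PDF_FORMATS = ["Text PDF", "Additional Text PDF", "Image Container PDF"]


def pick_pdf(files):
    """Return (format, filename) of the most preferred PDF, or None if there is none."""
    by_format = {}
    for entry in files:
        # Keep the first file seen for each format label.
        by_format.setdefault(entry.get("format"), entry.get("name"))
    for fmt in PREFERRED_PDF_FORMATS:
        if fmt in by_format:
            return fmt, by_format[fmt]
    return None


# Example with invented filenames:
# pick_pdf([{"name": "sample-item_images.pdf", "format": "Image Container PDF"}])
```

Returning the format label alongside the filename makes it easy to list afterwards which books only had a fallback format.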


Alex Cabal

Sep 27, 2021, 3:47:19 PM
to standar...@googlegroups.com
AFAIK "text PDF" means it is images and accompanying OCR'd plain text
behind them, which is what provides the PDF with text search and
selection capabilities for images of page scans. I assume "image
container PDF" is just the images without the text.

Vince

Sep 27, 2021, 3:56:41 PM
to Standard Ebooks
That’s my guess as well, but what about “Additional Text PDF”? It’s clearly not the same as “Text PDF,” but I can’t find anything that indicates what it is.

Vince

Sep 27, 2021, 4:00:52 PM
to Standard Ebooks
Alex, in case this slipped past you (sorry, I know you’re crazy busy)…

Alex Cabal

Sep 27, 2021, 4:04:00 PM
to standar...@googlegroups.com
I thought about this a little more and let's do it this way:

One dir per book, named after the SE identifier

Within each dir will be the scans. Filenames will be:
{ia,gb,ht}-{pdf_identifier}.pdf where the first var is the abbreviation
of the source (I think we'll only have those 3 ones basically) and the
2nd var is whatever the source's book ID is. This can usually be figured
out from the source URL.
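
A sketch of that naming scheme, assuming the three URL shapes seen in this thread (archive.org `/details/<id>`, Google Books `?id=<id>` or `/books/edition/<name>/<id>`, and HathiTrust `?id=<id>`). The function names are hypothetical. Replacing characters that are awkward in filenames (such as the `:` and `/` in HathiTrust IDs) with `_` is an extra assumption, not something decided above.

```python
import re
from pathlib import Path
from urllib.parse import parse_qs, urlparse


def scan_filename(scan_url):
    """Return '{ia,gb,ht}-{id}.pdf' for a scan URL, or raise ValueError."""
    parts = urlparse(scan_url)
    host = parts.netloc.lower()
    segments = [s for s in parts.path.split("/") if s]
    query = parse_qs(parts.query)
    if host.endswith("archive.org") and len(segments) >= 2 and segments[0] == "details":
        source, ident = "ia", segments[1]
    elif host.endswith("hathitrust.org") and "id" in query:
        source, ident = "ht", query["id"][0]
    elif host.endswith("google.com") and "id" in query:
        source, ident = "gb", query["id"][0]
    elif host.endswith("google.com") and segments[:2] == ["books", "edition"] and len(segments) >= 4:
        source, ident = "gb", segments[3]
    else:
        raise ValueError(f"unrecognized scan URL: {scan_url!r}")
    # Assumption: anything outside a conservative character set becomes "_".
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", ident)
    return f"{source}-{safe}.pdf"


def scan_path(archive_root, se_identifier, scan_url):
    """One directory per book, with the scan files inside it."""
    return Path(archive_root) / se_identifier / scan_filename(scan_url)
```

URLs from other sources raise `ValueError`, so they surface in a log rather than being silently misnamed.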

Vince

Sep 27, 2021, 4:36:49 PM
to Standard Ebooks
All right, very good, thanks. That will cover 98% of them or so. We do have a few scans from other places (Grant maps from loc.gov.id, a couple of O. Henry stories from the Texas History web site, and a handful of others); we can worry about those when we get the rest taken care of.

We have two different kinds of Google id’s in our metadata, i.e. “classic” interface URL vs the new-style URL.
Classic is “https://books.google.com/books?id=<blah>”. This is the majority of our Google scan URL’s.
New is “https://www.google.com/books/edition/<book name>/<id>”, e.g. …/books/The_Romany_Rye/9eNcwZld6I4C.

I don't know whether Google is going to support both styles indefinitely, or if they have a plan to phase out classic? If they do phase out classic, the id alone will not get us back to a new-style URL.

I tested the Romany Rye URL above, converting the new-style to an old-style, i.e. books.google.com/books?id=9eNcwZld6I4C, and it worked. So it appears we can build a classic-style URL from just the id. But we can’t go the other way. So, two questions:
  1. Do we care about standardizing our Google scan URL's in the metadata, either all classic or all new? If new, I’ll have to do some work to get the new-style URL’s for all of our classic ones. (Around 95, give or take.) 
  2. Depending on the answer to #1 (or maybe not), do we still just want to name the PDF with the id, or do we also want to put the book name in, e.g. gb-The_Romany_Rye-9eNcwZld6I4C.pdf for the above example?
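
The new-style-to-classic conversion tested by hand above could be scripted along these lines. This is a minimal sketch: the function names are hypothetical, and it assumes the two URL shapes are exactly the ones described in this message.

```python
from urllib.parse import parse_qs, urlparse


def google_books_id(url: str) -> str:
    """Extract the id from a classic or new-style Google Books URL."""
    # Accept scheme-less input such as 'books.google.com/books?id=...'.
    parts = urlparse(url if "://" in url else "https://" + url)
    query = parse_qs(parts.query)
    if "id" in query:  # classic: /books?id=<id>
        return query["id"][0]
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) >= 4 and segments[:2] == ["books", "edition"]:
        return segments[3]  # new style: /books/edition/<book name>/<id>
    raise ValueError(f"no Google Books id found in {url!r}")


def to_classic_url(url: str) -> str:
    """Build the classic-interface URL from either URL style."""
    return f"https://books.google.com/books?id={google_books_id(url)}"
```

Because only the id is needed to build the classic URL, this direction requires no network access.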

Alex Cabal

Sep 27, 2021, 6:01:26 PM
to standar...@googlegroups.com
I asked Google Books and they said that the ID is the same; the URL
change is merely cosmetic.

I don't imagine they'll keep the old interface around forever, but it's
not a big priority to update the corpus with the new URLs. But if you
want to do it while you're doing this, that would be fine too. That
might also be scriptable, by getting the HTML of the old interface and
parsing the link to the new interface, or noting the URL of the 302
redirect if there is one.
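
Both approaches could be sketched as below. Whether the classic page actually links to the new interface, or answers with a redirect at all, is not confirmed in this thread, so the regular expression and both function names are assumptions for illustration only.

```python
import re
import urllib.error
import urllib.request
from typing import Optional

# Assumed link shape: /books/edition/<book name>/<id>
EDITION_LINK = re.compile(r"/books/edition/[^/?\s\"'<>]+/([\w-]+)")


def edition_link_in_html(html: str) -> Optional[str]:
    """Return the first '/books/edition/<name>/<id>' path in html, if any."""
    match = EDITION_LINK.search(html)
    return match.group(0) if match else None


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None stops urllib from following the redirect;
        # it then raises HTTPError carrying the 3xx response.
        return None


def redirect_target(url: str) -> Optional[str]:
    """Return the Location header if url answers with a 3xx, else None."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=30):
            return None  # the page was served directly, no redirect
    except urllib.error.HTTPError as err:
        if 300 <= err.code < 400:
            return err.headers.get("Location")
        raise
```

`edition_link_in_html` works on HTML that has already been fetched, so it can be tried on a saved copy of a classic page before running anything against the live site.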

Vince

Sep 27, 2021, 6:26:55 PM
to Standard Ebooks
Right, the id is the same, but the new-style URL also contains the book name and the classic doesn’t. Thus if the classic went away, we couldn’t find the book with just the id.

Given that, and that I now have all this information in one place, it makes the most sense to me to store the new URL’s; that lets us get to either classic or new. And thus to also put the book name in the PDF file name.
And if we’re going to standardize on new, I’ll (after all this) look at updating lint to enforce the new style URL as well.

I’ll do the IA’s first and then start looking at Google; I should be able to figure out how to get the new URL in an automated fashion (crosses fingers).

Alex Cabal

Sep 27, 2021, 6:30:28 PM
to standar...@googlegroups.com
No, the text in the new URL is just filler. You can replace it with
anything and as long as the book ID is there, it works
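
Taken at face value, this means a new-style URL can be built from the ID alone. A minimal sketch follows; the function name and the `_` placeholder are arbitrary choices made here, not something Google documents.

```python
from urllib.parse import quote


def new_style_url(book_id: str, filler: str = "_") -> str:
    """Build a new-style Google Books URL from the ID and any filler text."""
    # The filler segment is assumed to be ignored, per the message above.
    return (
        "https://www.google.com/books/edition/"
        f"{quote(filler, safe='')}/{quote(book_id, safe='')}"
    )
```

For example, `new_style_url("9eNcwZld6I4C", "The_Romany_Rye")` rebuilds the Romany Rye URL discussed earlier in the thread.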

Vince

Sep 27, 2021, 6:59:14 PM
to Standard Ebooks
Oh, wow. That’s … dumb. Why would they do that? (<rhetorical/>)

OK, then we definitely only need the id in the PDF file name, so I’m good for that part as is.
Whether we standardize on old or new for the metadata is obviously up to you. If we’re going to, now’s a good time, since I have all the URL’s in one place and can easily(?) get from one to the other. But NBD if we don’t.

Alex Cabal

Sep 27, 2021, 7:00:00 PM
to standar...@googlegroups.com
Sure, go for it if you are up to it!

maticstric

Sep 28, 2021, 5:33:20 PM
to Standard Ebooks
Since you guys were talking about the scans not from Archive, Google, and Hathi, I made a new sheet in the same Google Sheet listing all the scans that aren't from those three websites: https://docs.google.com/spreadsheets/d/1uXGPpCbskDQ6uzYwJyt5YpfajnYa1B2D6RaIkI4P-Q8/edit#gid=546768324

There's a total of 18 books. Also, there's actually still 11 books with OLS scans so I'll add them to Vince's list here (additions are in bold):

Andre Norton, Voodoo Planet
Carey Rockwell, Stand by for Mars
H. Beam Piper, The Cosmic Computer
John Campbell, The Black Star Passes
Jules Verne, From the Earth to the Moon
Jules Verne, The Mysterious Island
Jules Verne, Topsy-Turvy
P. G. Wodehouse, Short Fiction (three OLS links)
Sax Rohmer, Brood of the Witch Queen
Stanley Weinbaum, The Dark Other
William Hope Hodgson, The Night Land (two OLS links)

The Wodehouse ones are going to be hard to find because we don't even know which stories/collections those were. The Verne two might be doable.
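
A list like the one above can be regenerated from a local copy of the corpus by looking for `/ols/` in each `content.opf`, as suggested at the start of this thread. This is a rough sketch: the function names are hypothetical, and it assumes the scan links appear as plain `http(s)` URLs in the file.

```python
import re
from pathlib import Path
from typing import Dict, List

URL = re.compile(r"https?://[^\s<>\"']+")


def ols_links(opf_text: str) -> List[str]:
    """Return every URL in the text whose path contains '/ols/'."""
    return [url for url in URL.findall(opf_text) if "/ols/" in url]


def find_ols_links(corpus_root: str) -> Dict[str, List[str]]:
    """Map each content.opf under corpus_root to the OLS links it contains."""
    results: Dict[str, List[str]] = {}
    for opf in sorted(Path(corpus_root).rglob("content.opf")):
        links = ols_links(opf.read_text(encoding="utf-8"))
        if links:
            results[str(opf)] = links
    return results
```

Files without OLS links are left out of the result, so the returned dictionary is the to-do list.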

Alex Cabal

Sep 28, 2021, 5:35:07 PM
to standar...@googlegroups.com
If some of these offer PDF download options, should we upload them to
archive.org as well?

maticstric

Sep 28, 2021, 6:01:39 PM
to Standard Ebooks
Sure, I can give it a go. I already found that the Nostromo scans were taken from archive.org so I'll make a pull request for that one right now.

I'll update the spreadsheet as I find PDFs, upload them to archive.org, and then make the pull requests.

Alex Cabal

Sep 28, 2021, 6:04:00 PM
to standar...@googlegroups.com
Awesome, thanks!

B Keith

Sep 29, 2021, 10:16:58 AM
to Standard Ebooks
The Wodehouse ones come from the Miscellany put together by Gutenberg (https://www.gutenberg.org/ebooks/8190). I’ve looked for them in other sources but have had no luck so far. You seem to be better at it than me so go ahead:

Disentangling Old Duggie
When Papa Swore In Hindustani
Tom, Dick, And Harry


_________

Guadeamus igitur iuvenes dum sumus

maticstric

Oct 1, 2021, 4:25:02 PM
to Standard Ebooks
All of the scans are now from archive.org, Google Books, or Hathi except Robbery Under Arms and Personal Memoirs of Ulysses S. Grant. I posted my questions about those in their own threads. The only ones left now are the PGDP ones which we haven't found.

Alex Cabal

Oct 1, 2021, 4:27:45 PM
to standar...@googlegroups.com
Excellent work, thanks! This is going to go a long way towards making
sure we don't lose these scans in the future. I think I will make it
official policy to only accept scans from either GB, HT, or IA with the
option of uploading scans from other sources to IA.

Matt Chan

Oct 1, 2021, 4:41:54 PM
to standar...@googlegroups.com
Would it be worth it to think about contingency plans if more of these databases get taken down? But I suppose trying to host the scans ourselves is too much... Maybe I'm just paranoid.

Alex Cabal

Oct 1, 2021, 4:43:40 PM
to standar...@googlegroups.com
Yes, that's phase 2 which Vince is going to work on: keeping an archive
of the scan PDFs on our server in case of emergency.

> <https://groups.google.com/d/msgid/standardebooks/02153f3c-f098-4c0f-0af0-977042faba81%40standardebooks.org>.
>
> --
> You received this message because you are subscribed to the Google
> Groups "Standard Ebooks" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to standardebook...@googlegroups.com
> <mailto:standardebook...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/standardebooks/CAB6ohTcfPm5qOnkJszYp0dgrWZ6gAgZ%3DBJeP5z2VRp7U4R1SSA%40mail.gmail.com
> <https://groups.google.com/d/msgid/standardebooks/CAB6ohTcfPm5qOnkJszYp0dgrWZ6gAgZ%3DBJeP5z2VRp7U4R1SSA%40mail.gmail.com?utm_medium=email&utm_source=footer>.

Vince

unread,
Oct 1, 2021, 4:52:28 PM10/1/21
to Standard Ebooks
Not at all, that’s phase 2. I hope to get at least to the IA’s this weekend. I don’t think Alex plans on making them publicly available, but we’re going to have all of them on our server at least.

Matt Chan

unread,
Oct 1, 2021, 6:11:44 PM10/1/21
to standar...@googlegroups.com
If you need a hand Vince, let me know!



Vince

unread,
Oct 2, 2021, 6:52:33 PM10/2/21
to Standard Ebooks
Matic, would you mind giving me edit access to the spreadsheet? I need to update it with all the information I’ve gathered. Thanks!

Vince

unread,
Oct 2, 2021, 7:21:47 PM10/2/21
to Standard Ebooks
While we’re waiting…

The IA scans are in the process of downloading to the server. They’ll probably take the rest of the day and possibly night.

The Google scans are turning out to be problematic. I have the URL’s for each of them, but Google has barriers in place to (apparently) prevent automatic downloads. When I fetched the page with curl and grepped out the PDF URL, it included a signature (?sig=<blah>). I was hoping that would skip the captcha, but it doesn’t; trying to curl that URL just returns another page, not the actual PDF. And putting the link into a spreadsheet goes to the captcha page as well.

So, unless someone has an idea for circumventing, it appears we might have to download the Google scans (just under 100 files) one at a time. I don’t have time to do that all myself, but, if we divide it up, we’ll then have to figure out how to get them to the server.

Alex, ideas and/or thoughts?

I haven’t gotten to the Hathi ones yet; there are 180+ of those, and in most cases Hathi doesn't allow PDF downloads except from member institutions. I’ll try to figure out programmatically which ones can and can’t be downloaded, hopefully tomorrow.

maticstric

unread,
Oct 2, 2021, 8:08:51 PM10/2/21
to Standard Ebooks
I added you. You should have edit access now.

Matt Chan

unread,
Oct 2, 2021, 9:40:37 PM10/2/21
to standar...@googlegroups.com
It would require getting an API key, but looking at https://developers.google.com/books/docs/v1/using#download-format, automatic download seems to be possible.


Vince Rice

unread,
Oct 2, 2021, 11:05:51 PM10/2/21
to standar...@googlegroups.com
Well, it might, to someone who knew what to do with the API. :)


Alex Cabal

unread,
Oct 3, 2021, 11:39:06 AM10/3/21
to standar...@googlegroups.com
You would use curl on the command line; it would be very easy. This is probably the way to go instead of doing it by hand hundreds of times!

Matt Chan

unread,
Oct 3, 2021, 11:49:19 AM10/3/21
to standar...@googlegroups.com
Vince, if you send me the list of books needed from GB I can go ahead and pull them for you. 


maticstric

unread,
Oct 3, 2021, 11:52:53 AM10/3/21
to Standard Ebooks

Just remove all the archive, hathi, and pgdp ones and only google books should be left.
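
For anyone who would rather filter the list with a script than by hand, here is a minimal sketch of that step. It assumes the scan URLs are available as plain strings; the host-to-code mapping (including both pgdp.net and pgdp.org) and the function names are illustrative assumptions, not part of any SE tool.

```python
from urllib.parse import urlparse

# Assumed mapping from host suffix to the short codes used in this thread.
# The PGDP hosts are a guess; adjust them to whatever appears in the metadata.
HOSTS = {
    "archive.org": "IA",
    "hathitrust.org": "HT",
    "pgdp.net": "PGDP",
    "pgdp.org": "PGDP",
    "google.com": "GB",
}


def scan_org(url):
    """Return a short code for the service that hosts a scan URL."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, org in HOSTS.items():
        # Match the bare domain or any subdomain of it.
        if host == suffix or host.endswith("." + suffix):
            return org
    return "OTHER"


def google_only(urls):
    """Keep only the URLs hosted on Google Books."""
    return [u for u in urls if scan_org(u) == "GB"]
```

Calling `google_only()` on the full list of source URLs should leave just the Google Books entries; anything reported as `"OTHER"` by `scan_org()` probably deserves a manual look.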

Matt Chan

unread,
Oct 3, 2021, 2:28:18 PM10/3/21
to standar...@googlegroups.com
It looks like the API only supports downloading the EPUB, not the PDF scans. I'll keep looking to see if there's a way.

Vince

unread,
Oct 3, 2021, 2:52:35 PM10/3/21
to Standard Ebooks
I have updated information, and I’m in the process of adding a bunch of Google links for Hathi. Hold off for a bit.

Vince

unread,
Oct 3, 2021, 2:55:50 PM10/3/21
to Standard Ebooks
I have not found curl to work for the PDF’s for Google; I’ll be glad for Matt or someone to show me the error of my ways. Nor do I have any plans to do it by hand hundreds of times. :)

Vince

unread,
Oct 3, 2021, 4:50:36 PM10/3/21
to Standard Ebooks
Here’s the status of downloading the scans. The TLDR version is that the IA scans have all been downloaded to the server, I have all the URL’s for downloading the Google and Hathi scans (that can be downloaded), but I have not been able to figure out a way to programmatically download those scans, and I don’t have the time to devote to downloading them one at a time, so someone else is going to have to take it from here.

The longer version…

I have added a new sheet called “Updated PDF URLs” to Matic’s Google spreadsheet.
That sheet has a separate row for each scan (almost, see below).
  • The “org” column indicates which of the big three (IA, GB, or HT) the scan resides on.
  • The “id” column contains just the id for the IA and GB scans.
  • The “newurl” column contains the new style Google URL for all of the Google scans. For the HT scans, this column contains the Google URL to the Google Books version of the HT scan, if the Google version is viewable/downloadable.
  • The “pdf” column contains different things depending on the org. For IA, it’s the format to use for the download. For Google, it’s the direct URL to download the PDF. (It doesn’t work programmatically, and it can be built from the id; I just included it for time-saving purposes when I thought it could be programmed.) For Hathi, it contains the direct Hathi URL to the scan (not the record) for those that are downloadable on Hathi.
  • The “notes” column contains notes where needed.
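
If the sheet is exported as CSV, the layout described above can be processed with a few lines of Python. This is only a sketch under assumptions: the column names (org, id, newurl, pdf, notes) come from the description above, while the CSV export itself, the sample rows, and the helper names are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def rows_by_org(csv_text):
    """Group spreadsheet rows by their "org" column (IA, GB, or HT)."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["org"].strip().upper()].append(row)
    return dict(groups)


def needs_volume_check(row):
    """True if the notes flag a record whose exact volume is still unknown."""
    return "multiple volumes" in (row.get("notes") or "").lower()


# Made-up sample rows; the ids and URL are placeholders, not real scans.
SAMPLE = """org,id,newurl,pdf,notes
IA,example-ia-id,,pdf,
GB,EXAMPLEGBID,https://www.google.com/books/edition/_/EXAMPLEGBID,,
HT,,,,Multiple volumes
"""

groups = rows_by_org(SAMPLE)
for org in sorted(groups):
    flagged = sum(needs_volume_check(r) for r in groups[org])
    print(org, len(groups[org]), "rows,", flagged, "needing a volume check")
```

Grouping by org mirrors the filtered views on the sheet, and the notes check gives a quick list of rows that someone still has to resolve by hand.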

Internet Archive—The IA scans have all been downloaded to the SE server (in my “scans” directory, Alex). There are four “borrow” links that do not have PDFs and therefore those directories are empty.

Google—I have not been able to figure out how to get the Google scans programmatically, even with the URL’s. They display a captcha when using the URL directly in a browser, and you just get a page back when using curl. Someone smarter than me will have to figure that out. If you do figure it out, then I will be happy to do the actual downloading to the server.

Hathi—Hathi in general doesn’t allow downloads of its files except from partner (university) institutions. The exceptions are files marked purely “Public Domain”; these can be downloaded in full, but the pdf is generated “on the fly,” so there’s not a URL to go to to download the file. Clicking the button on the page causes the PDF to be generated and downloaded. Generating takes a while, almost as long as the download on the one I tried. For these, as noted above, the “pdf” column contains the direct URL to the scan page where the download button can be clicked. As with the Google entries, I see no way to automate that, since there’s no URL to go to to get the file. Worse than the Google entries, the time to download also includes generation, so these are all going to take a relatively long time.

For others, if the file is marked “Google-digitized” on Hathi, there’s a “Find at Google Books” link under the “Get This Item” menu. Since it takes several clicks to get there (once on the record to get to the actual scan page, once on the menu, once on Find at Google Books), and since at least one of those sometimes had a redirect on it, I had to collect all of them manually. Although it will always find an entry at Google, that file at Google does not always have a preview, i.e. it’s not always viewable/downloadable. If it is viewable, the newurl column contains the Google URL for the book. From there, downloading it will require whatever method is worked out for the other Google downloads. Note that Alex will have to decide whether to name the files “ht-” since they were originally Hathi, or “gb-” since we did the actual download from Google.

In a number of instances, the Hathi record is for a magazine or other publication that has many volumes, and it's not clear what volumes are being used on the book in question. These are marked “Multiple volumes” in the notes column, and someone is going to have to figure out which specific scan is being used. Where the record contained multiple volumes and all volumes were obviously part of the SE publication, e.g. Don Quixote, etc., I provided the Google URL’s for all of them (if they were viewable) in the newurl column. This is the only instance where a row contains multiple scans.

(Because of that, I’ll note again that we should stop requiring a record ID for Hathi. Use the record if the record helps, use the direct URL otherwise. If there’s only one volume, it doesn’t matter whether it’s the record or the scan itself. If there are multiple volumes for the book being produced, the record saves us some work. But if it’s multiple volumes for other reasons, e.g. magazines for short story collections, or serialized novels, or whatever, then we lose all information about what scan was actually being used, and some of the records have dozens of entries on them. And we have far more instances of the latter situation than the former. We’re crippling our metadata for no good reason.)

For King Lear, the scan URL in the SE metadata is for a different book (i.e. not King Lear), so someone will have to get the correct URL before we can do anything with it.

As noted above, if someone can figure out how to programmatically download from Google with the id’s, then let me know and I’ll be glad to handle doing the downloads on the server.

Vince

unread,
Oct 3, 2021, 4:52:22 PM10/3/21
to Standard Ebooks
Sorry, forgot—I created GB and HT filtered views on that sheet as well. I didn’t create an IA view since they’re already done.

Matt Chan

unread,
Oct 3, 2021, 4:59:28 PM10/3/21
to standar...@googlegroups.com
On the old URL page:
^ This link does not have a pdf that can be downloaded
william-shakespeare_a-midsummer-nights-dream https://books.google.com/books?id=y8weAAAAMAAJ
^ This link is incorrect; links to some other book instead. Perhaps this link would work instead? https://www.google.com/books/edition/A_Midsummer_night_s_Dream/1s09AAAAYAAJ?hl=en&gbpv=0

I figure it'll take me at least as long (if not longer) to write a script to download all the books from GB as to just put Netflix on and click through all of them, but here they are:

These are just the GB ones, I can do the Hathi ones next


maticstric

unread,
Oct 3, 2021, 7:19:42 PM10/3/21
to Standard Ebooks
My bad on the william-shakespeare_a-midsummer-nights-dream scan. I probably dragged my mouse from william-hazlitt_table-talk to it and overwrote it. I'll update the spreadsheet (both old and new) with the correct scan link which is: https://catalog.hathitrust.org/Record/004135080

The fritz-leiber_short-fiction one is the actual scan in the <dc:source> though, but it's obviously wrong since it's from a 2009 publication. I found an archive.org scan of it (https://archive.org/details/1962-09_IF) so I can replace that one as well. Vince, could you download it and add it to the server?

Matt Chan

unread,
Oct 3, 2021, 7:43:55 PM10/3/21
to standar...@googlegroups.com
Okay, I've downloaded all the books that I can; there are about 10 or so that aren't on GB, and I need a university account to get the full PDF. My sister works at an HT-affiliated university, so I can get those through her. I've zipped the books and am uploading them to my GDrive. There are about 6GB worth of book PDFs so it'll take a while; I'll update when they are done. In the meantime, here is some of the stuff I've come across on the spreadsheet:




HT Books, not actually available on GB (no full PDF available on HT either, only page by page):
The Secret Adversary
Andre Norton
Herland
John Keats's Poem (there are some on GB, but not the version on Hathi)
Moonfleet
Islands of Space
A Confession and What I Believe
m-r-james_short-fiction https://catalog.hathitrust.org/Record/102295859
The Eight Strokes of the Clock
Love Among the Chickens
Psmith, Journalist
Gladiator
Ara Vos Prec (part of the T. S. Eliot collection)

Links need updating:
HT/GB links to Shakespeare:


Need Clarification:
The three links for: fyodor-sologub_short-fiction_john-cournos_stephen-graham_rosa-savory-graham_p-selver
This one link for h-g-wells_short-fiction https://catalog.hathitrust.org/Record/000642318
These two links for ivan-bunin_short-fiction_s-s-koteliansky_d-h-lawrence_leonard-woolf_bernard-guilbert-guerney_the-russian-review_natalie-a-duddington https://catalog.hathitrust.org/Record/000501549 https://catalog.hathitrust.org/Record/007908346
These three links for leo-tolstoy_short-fiction_louise-maude_aylmer-maude_nathan-haskell-dole_constance-garnett_j-d-duff_leo-weiner_r-s-townsend_hagberg-wright_benjamin-tucker_everymans-library_vladimir-chertkov_isabella-fyvie-mayo https://catalog.hathitrust.org/Record/011726151 https://catalog.hathitrust.org/Record/008634081 https://catalog.hathitrust.org/Record/008973985
This one link for leonid-andreyev_short-fiction_herman-bernstein_alexandra-linden_l-a-magnus_k-walter_w-h-lowe_the-russian-review_archibald-j-wolfe_john-cournos_r-s-townsend_maurice-magnus https://catalog.hathitrust.org/Record/007908346
This one link for thomas-paine_the-american-crisis https://catalog.hathitrust.org/Record/009832797
This one link for vladimir-korolenko_short-fiction_aline-delano_sergius-stepniak_william-westall_thomas-seltzer_the-russian-review_marian-fell_clarence-manning https://catalog.hathitrust.org/Record/000676865
These are for magazine scans, not sure if I needed to grab all of them or just a selected few

philip-francis-nowlan_the-airlords-of-han https://catalog.hathitrust.org/Record/102007014
These links link to the wrong book



maticstric

unread,
Oct 3, 2021, 7:51:48 PM10/3/21
to Standard Ebooks
Airlords of Han was probably my mistake again... Already fixed to https://archive.org/details/Amazing_Stories_v03n12_1929-03_ATLPM-Urf in the spreadsheets. Vince can you download this one as well?

King Lear is as it is in the <dc:source> so I won't mess with it for now.

B Keith

unread,
Oct 3, 2021, 8:58:33 PM10/3/21
to Standard Ebooks
So I’ve fallen behind on just what you guys are doing, but I am not sure why you are changing the links to Google Books. Google Books is notoriously bad for access for those of us outside the US. For example, I originally did Major Barbara and now I won't be able to access the linked scans.

Is this really the best idea?

_________

Guadeamus igitur iuvenes dum sumus

Alex Cabal

unread,
Oct 3, 2021, 9:08:08 PM10/3/21
to standar...@googlegroups.com
Yeah that's a good point, I didn't think of that.

For now, let's continue how we're doing it for the scope of this
project. If a producer needs the scans for something and they can't get
access, the entire point of this project is that we can just send them a
PDF.

For future ebooks, we can say to prefer HT over GB in the ebook source
for the geoblocking reason, and we can download an archival PDF from GB
if the identical scans are available there.


Matt Chan

unread,
Oct 3, 2021, 9:22:31 PM10/3/21
to standar...@googlegroups.com
The Google Books links aren't really "replacing" the scans for e.g. Major Barbara, but rather they are what the Hathi page points to, since otherwise you'd need to be affiliated with a partnered university to obtain the Hathi full scans. In any case, here are all the PDFs I've scraped from Hathi/Google

Hathi pdf: https://drive.google.com/file/d/1itgoSKJj-euiS5pnHCAWfJLuSFf1kNHD/view?usp=sharing

Weijia Cheng

unread,
Oct 3, 2021, 10:18:58 PM10/3/21
to Standard Ebooks

Matt Chan

unread,
Oct 3, 2021, 10:33:42 PM10/3/21
to standar...@googlegroups.com

Vince

unread,
Oct 3, 2021, 11:54:55 PM10/3/21
to Standard Ebooks
We shouldn’t be changing the links (I certainly wasn’t). The instances where we’re using Google Books for the scans of HT books are purely to download the scans for our server, not to change where the scans point to in our metadata. In other words, if the SE metadata pointed to HT, it should continue to point to HT.

Vince

unread,
Oct 4, 2021, 12:41:37 AM10/4/21
to Standard Ebooks
Right, as I noted in my email, the ht books in the updated sheet that don’t have newurl’s aren’t downloadable on either HT or Google.
The links don’t need to be updated. We’re not changing the links in the metadata, we’re just using Google to download the scans for our server.
As also noted, the ones in the updated sheet that have “Multiple volumes” in the notes are ones where we can’t tell what volume from the record(s) is being used. Someone’s going to have to figure those out, preferably the producer of each book. (I own the Thoreau, I just didn’t have time to get to it today.)

On the Shakespeares, I noted all three scans on the first Shakespeare entry; the rest of the Shakespeares (except King Lear) all use the same scans. I didn’t duplicate them so we wouldn’t download a bunch of duplicates to the server.

The updated sheet is the “golden” sheet. It should be used going forward, and if any updates are needed, they should be made to it.

Also, remember that the Google/Hathi files need to be renamed with “ht-” and “gb-” prefixes on the filenames. (As I mentioned earlier, Alex will have to make a call on which prefix we use for the Hathi scans that we actually downloaded from Google.)
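
As a rough illustration of the renaming step, the sketch below computes the prefixed filename from the org code on the spreadsheet. The helper name and the choice to leave IA files untouched are assumptions; which code to pass for Hathi scans that were downloaded from Google is still the open decision mentioned above, so the caller has to supply it.

```python
from pathlib import PurePosixPath

# Prefixes named in this thread; other orgs (e.g. IA) are left unprefixed here.
PREFIXES = {"GB": "gb-", "HT": "ht-"}


def prefixed_name(filename, org):
    """Return the file's base name with the org prefix, without doubling it."""
    prefix = PREFIXES.get(org.upper(), "")
    name = PurePosixPath(filename).name
    if prefix and not name.startswith(prefix):
        return prefix + name
    return name
```

Because the function only computes the new name, it can be checked against the whole file list before anything on the server is actually renamed.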



Vince

unread,
Oct 4, 2021, 12:45:16 AM10/4/21
to Standard Ebooks
I added to the server:
Amazing_Stories_v03n12_1929-03_ATLPM-Urf.pdf to philip-francis-nowlan_the-airlords-of-han
1962_09_IF.pdf to fritz-leiber_short-fiction

I think that was everything; if I missed one, let me know.

Vince

unread,
Oct 4, 2021, 12:53:06 AM10/4/21
to Standard Ebooks
Thanks to Matic, Matt, Weijia for all your help!

Matt Chan

unread,
Oct 4, 2021, 8:50:10 AM10/4/21
to standar...@googlegroups.com
So the links I said "need to be updated" included links to Google Books where the scans aren't actually available, so they can be updated with the links I provided instead (under NewURL), if desired. The same goes for some of the Hathi links: the Hathi links themselves don't need to be updated, but the Google Books links on the Hathi record pages don't actually provide full scans, probably because of some confusion between Hathi and Google. Those Google links can be placed in "NewURL" as well, I think. Some of the Hathi records on the spreadsheet don't have a link under newurl, but I was able to find Google Books links that have full scans (sometimes these were the same ones provided on the Hathi record pages, sometimes not).

In the two zip files, the files in google_books.zip are all from records on the spreadsheet marked "GB", and the files in hathi_books.zip are all from records marked "HT", even though some of them were actually downloaded from the GB pages linked on the Hathi record pages (or, in some cases, from GB links I had to look for myself, since the GB link on Hathi doesn't actually give full scan access).

Hope that makes sense!

Also: Should I go ahead and try to obtain copies of Hathi full scans that need University-partner access?


Alex Cabal

unread,
Oct 4, 2021, 11:08:24 AM10/4/21
to standar...@googlegroups.com
Actually can you download the duplicate Shakespeares? If we forget and
go in looking for those scans later it would be confusing not to find
them, and maybe not obvious where to look.

On 10/3/21 11:41 PM, Vince wrote:
>> On Oct 3, 2021, at 6:43 PM, Matt Chan <thew...@gmail.com
>> <mailto:thew...@gmail.com>> wrote:
>>
>> Okay I've downloaded all the books that I can; there's about 10 or so
>> that aren't on GB, and I need a University account to get the full
>> PDF. My sister works at an HT-affiliated university, I can get those
>> through her. I've zipped the books and are uploading them to my
>> GDrive. There are like 6GB worth of book pdfs so it'll take awhile,
>> I'll update when they are done. At the meantime, here are some of the
>> stuff I've come across on the spreadsheet:
>>
>> *HT Books, not actually available on GB (no full PDF available HT
>> either, only page by page):*
>> The Secret Adversary
>> Andre Norton
>> Herland
>> John Keat's Poem (there are some on GB, but not the version on Hathi)
>> Moonfleet
>> Island of Space
>> A confession and What I believe
>> m-r-james_short-fiction
>> https://catalog.hathitrust.org/Record/102295859
>> <https://catalog.hathitrust.org/Record/102295859>
>> The eight strokes of the clock
>> Love Among the Chickens
>> PSmith, Journalist
>> Gladiator
>> Ara Vus Prec (Part of TS Elliot collection)
>>
>> *Links need updating:*
>> Doctor
>> Thorne:https://www.google.com/books/edition/Doctor_Thorne/kWkRAAAAYAAJ?hl=en&gbpv=0
>> Saga:https://www.google.com/books/edition/The_Forsyte_Saga/r-VGAAAAYAAJ <https://www.google.com/books/edition/The_Forsyte_Saga/r-VGAAAAYAAJ>
>> The Marvelous Land of
>> Oz:https://catalog.hathitrust.org/Record/009927232
>> <https://catalog.hathitrust.org/Record/009927232>
>> The gods of
>> pagana:https://www.google.com/books/edition/The_Gods_of_Pegāna/rtI_AQAAMAAJ
>> <https://www.google.com/books/edition/The_Gods_of_Peg%C4%81na/rtI_AQAAMAAJ>
>> <https://www.google.com/books/edition/Edward_the_Third/S3VlAAAAMAAJ?hl=en>
>>
>>
>> *Need Clarification:*
>> The three links for:
>> fyodor-sologub_short-fiction_john-cournos_stephen-graham_rosa-savory-graham_p-selver
>> This one link for h-g-wells_short-fiction
>> https://catalog.hathitrust.org/Record/000642318
>> <https://catalog.hathitrust.org/Record/000642318>
>> These two links for henry-david-thoreau_essays
>> https://catalog.hathitrust.org/Record/000597656
>> <https://catalog.hathitrust.org/Record/000597656>https://catalog.hathitrust.org/Record/000597656
>> <https://catalog.hathitrust.org/Record/000597656>
>> These two links for
>> ivan-bunin_short-fiction_s-s-koteliansky_d-h-lawrence_leonard-woolf_bernard-guilbert-guerney_the-russian-review_natalie-a-duddington
>> https://catalog.hathitrust.org/Record/000501549
>> <https://catalog.hathitrust.org/Record/000501549>https://catalog.hathitrust.org/Record/007908346
>> <https://catalog.hathitrust.org/Record/008697500>https://catalog.hathitrust.org/Record/009563304
>> <https://catalog.hathitrust.org/Record/009563304>
>> These three links for
>> leo-tolstoy_short-fiction_louise-maude_aylmer-maude_nathan-haskell-dole_constance-garnett_j-d-duff_leo-weiner_r-s-townsend_hagberg-wright_benjamin-tucker_everymans-library_vladimir-chertkov_isabella-fyvie-mayo
>> https://catalog.hathitrust.org/Record/011726151
>> <https://catalog.hathitrust.org/Record/011726151>https://catalog.hathitrust.org/Record/008634081
>> <https://catalog.hathitrust.org/Record/008634081>https://catalog.hathitrust.org/Record/008973985
>> <https://catalog.hathitrust.org/Record/008973985>
>> This one link for
>> leonid-andreyev_short-fiction_herman-bernstein_alexandra-linden_l-a-magnus_k-walter_w-h-lowe_the-russian-review_archibald-j-wolfe_john-cournos_r-s-townsend_maurice-magnus
>> https://catalog.hathitrust.org/Record/007908346
>> <https://catalog.hathitrust.org/Record/007908346>
>> This one link for thomas-paine_the-american-crisis
>> https://catalog.hathitrust.org/Record/009832797
>> <https://catalog.hathitrust.org/Record/009832797>
>> This one link for
>> vladimir-korolenko_short-fiction_aline-delano_sergius-stepniak_william-westall_thomas-seltzer_the-russian-review_marian-fell_clarence-manning
>> https://catalog.hathitrust.org/Record/000676865
>> <https://catalog.hathitrust.org/Record/000676865>
>> /_These are for magazine scans, not sure if I needed to grab all of
>> them or just a selected few_/
>>
>> king learhttps://catalog.hathitrust.org/Record/004390776
>> <https://catalog.hathitrust.org/Record/102007014>
>> /_These links link to the wrong book_/

Alex Cabal

unread,
Oct 4, 2021, 11:11:51 AM10/4/21
to standar...@googlegroups.com
On 10/4/21 7:49 AM, Matt Chan wrote:
> Also: Should I go ahead and try to obtain copies of Hathi full scans
> that need University-partner access?

Yes, go ahead, thanks!

Vince Rice

unread,
Oct 4, 2021, 11:36:33 AM10/4/21
to standar...@googlegroups.com
This is up to Alex, but I don’t think we should change links anywhere, and we probably shouldn’t be using different scans, either.

Different scans are often from different editions, sometimes from different translations, etc. So different scans will not necessarily match the production. Storing scans that don’t match the ones used for the production is therefore misleading. And we shouldn’t change the links without first confirming that they DO match the production, for the same reason.


Alex Cabal

Oct 4, 2021, 11:37:33 AM
to standar...@googlegroups.com
Yes, agreed

Matt Chan

Oct 4, 2021, 11:44:02 AM
to standar...@googlegroups.com
I think it would be less confusing if we copied and renamed the Shakespeare collection PDFs (V1 - V3) for each of the individual play productions?


Matt Chan

Oct 4, 2021, 11:44:59 AM
to standar...@googlegroups.com
I see what you're saying, but some of the Google Books links on that spreadsheet do not contain full scans, just an FYI.

Matt Chan

Oct 4, 2021, 11:46:52 AM
to standar...@googlegroups.com
For the ones where I was able to find full scans on GB, which may or may not match the ones on Hathi, I guess the recourse would be to go through my University contact to obtain the scans? This would take longer, as a significant number of the ones on Hathi are like that.

I can still get those, I think, given time. It'll probably take a week or so, if that's okay, Alex?

Alex Cabal

Oct 4, 2021, 1:24:09 PM
to standar...@googlegroups.com
Sure, there's no rush

Vince

Oct 4, 2021, 2:18:16 PM
to Standard Ebooks
We don’t have to download them multiple times, or even upload them multiple times—once we get them to the server we can just copy them to the appropriate directories.

It’s Alex’s call, but I would vote no. It would be more confusing, IMO, if our scans had a different name than the scans indicated in the metadata. We don’t name the magazines we download according to the short stories we’re using in them, and so forth. We’re keeping the name in all cases according to how they’re named at the source; we’re just adding the prefix so we know what the source was. And these scans are only meant as insurance; for regular purposes, everyone is going to continue to use the source scans.
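
A minimal sketch of that copy step, in case it helps whoever scripts it. Everything named here is an assumption for illustration: the `archive_scan` helper, the `<source>_<original name>` prefix format, and the one-directory-per-ebook layout are hypothetical, not how the SE server is actually organized.

```python
import shutil
from pathlib import Path


def archive_scan(scan: Path, source: str, archive_root: Path, ebooks: list[str]) -> list[Path]:
    """Copy one already-downloaded scan into each ebook's archive directory.

    The file keeps the name it had at the source; only a source prefix
    (hypothetical format: "<source>_<original name>") is added.
    """
    copies = []
    for ebook in ebooks:
        # Hypothetical layout: one directory per ebook under archive_root.
        target_dir = archive_root / ebook
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / f"{source}_{scan.name}"
        # copy2 copies the file contents and tries to preserve timestamps.
        shutil.copy2(scan, target)
        copies.append(target)
    return copies
```

Called once per downloaded file with the list of ebooks that share it, this leaves a single download copied into several directories under the same prefixed name, so the stored filename still lines up with the source named in the metadata.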

Vince

Oct 4, 2021, 2:27:45 PM
to Standard Ebooks
Right, but those are just my mistakes. The original scans are from Hathi; I just used the “Find on Google” button, and in some cases didn’t notice that, although the scan was on Google, a preview/download wasn’t available.

The Hathi scans are full scans, and they’re the ones that were used for the productions. If we can’t download them, we can’t download them, but the links should stay as they are.


On Oct 4, 2021, at 10:44 AM, Matt Chan <thew...@gmail.com> wrote:

I see what you're saying, but some of the Google Books links on that spreadsheet do not contain full scans, just an FYI.

Matt Chan

Oct 4, 2021, 2:34:30 PM
to standar...@googlegroups.com
Assuming that most folks don't have access to the University-affiliated-only HT full scans, they won't be able to get a full scan through those links anyway, right?
