A general question for books that are not available on Project Gutenberg


Chelsea McGill

8:22 AM
to Standard Ebooks
I have put together a list of books that I would like to work on in the future, but I have noticed that very few are available from Project Gutenberg. How would the ebook creation workflow change if a book is only available from archive.org or other sources?

An example is: The assemblies of al Harîri

https://catalog.hathitrust.org/Record/001668743

https://archive.org/details/in.ernet.dli.2015.70309/page/n5/mode/2up

https://books.google.co.in/books?id=P-oOAAAAQAAJ&pg=PP1#v=onepage&q&f=false

https://archive.org/details/assembliesofalha015555mbp


Or 

Vāsavadattā by Subandhu (early 7th century)

https://www.loc.gov/resource/gdcmassbookdig.vasavadattasansk00suba/?st=pdf

https://www.rarebooksocietyofindia.org/book_archive/196174216674_10154406233921675.pdf

https://www.rarebooksocietyofindia.org/postDetail.php?id=196174216674_10154406233921675


This is just for my future knowledge, so I can plan for the amount of work these would involve.


Best, 

Chelsea

Alex Cabal

10:01 AM
to standar...@googlegroups.com
You can find transcriptions anywhere, as long as you can demonstrate
that the transcriptions are of an edition of the book published in the
US-PD era. PG is just extremely convenient because they already take
great care to ensure that, so we can assume all of their content is
cleared for US-PD.

However, if you can't find transcriptions online, only page scan images,
then transcribing a book is a very different task from what we do at
SE. SE doesn't specialize in transcription (although people have
transcribed for SE in the past). To transcribe a book, I suggest
visiting Distributed Proofreaders, pgdp.net. They specialize in
transcription and will help you much more than we could. Transcription
is a deceptively difficult and error-prone process. Hendrik Kaiber on
this list does a lot of work for PGDP.

On 6/24/26 7:22 AM, Chelsea McGill wrote:
> I have put together a list of books that I would like to work on in the
> future, but I have noticed that very few are available from Project
> Gutenberg. How would the ebook creation flow work if it is available
> from archive.org or other sources?

Hendrik Kaiber

10:21 AM
to Standard Ebooks
Hello. I think I can give my opinion.

The most direct way to work on a project without a transcription is to create one yourself. As Alex said, it is quite difficult to produce a good transcription, and even experienced transcribers will likely produce a text with several errors. I don't recommend this for anything larger than a short story or poem.

Wikisource has no barrier to entry and can produce texts quickly, but its texts are usually also filled with small errors. I think PGDP is generally the best way to get an accurate, high-quality transcription (with the bonus that the HTML is much cleaner). It can take a long time, however, so it is something to plan for well in advance.

I have the ability to create projects at PGDP, and I am willing to create projects for people here, provided the books are public domain (both of yours are) and the images are of usable quality (the ones you linked should be, barring any missing or bad pages). You can contact me via email if you want (I think it's best not to do it here to avoid cluttering the list). Even if the books aren't produced here, they would still be available there.

If transcription is something you are interested in as well, volunteering at PGDP would be appreciated. The work there is spread among volunteers, so no single person carries an excessive load.

—Hendrik

Chelsea McGill

10:57 AM
to standar...@googlegroups.com
Thanks, everyone, for your responses. If the text has a transcription on, say, Wikisource, how would I get it into the correct format to start the work? Would I need to create the different files by hand, copying and pasting the text?

Best,
Chelsea

You received this message because you are subscribed to a topic in the Google Groups "Standard Ebooks" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/standardebooks/5czn9tvPWsw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to standardebook...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/standardebooks/d2ce240f-527e-4731-a63a-5082d58cbba4n%40googlegroups.com.

Alex Cabal

10:58 AM
to standar...@googlegroups.com
Probably yes. Wikisource formats vary greatly. Do an initial import with
the base text in however many files makes sense, and commit that as the
first commit. Then you can manipulate that into the SE format.
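
One possible way to do that initial split, as a minimal Python sketch. It
assumes the downloaded text is a single HTML file in which every chapter
starts with an `<h2>` heading, which will not hold for every Wikisource
book. The function names and the `part-N.html` file names are hypothetical,
not an SE convention.

```python
import re
from pathlib import Path


def split_on_h2(html: str) -> list[str]:
    """Split HTML into chunks that each begin at an <h2> tag.

    Anything before the first <h2> (title page, front matter) becomes
    its own chunk. Assumption: chapters are marked with <h2>.
    """
    # Zero-width lookahead keeps the <h2> tag at the start of each chunk.
    parts = re.split(r"(?i)(?=<h2\b)", html)
    return [part for part in parts if part.strip()]


def write_chunks(chunks: list[str], out_dir: Path) -> list[Path]:
    """Write each chunk to out_dir as part-1.html, part-2.html, ...

    The file names are placeholders for the raw import only.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chunk in enumerate(chunks, start=1):
        path = out_dir / f"part-{number}.html"
        path.write_text(chunk, encoding="utf-8")
        paths.append(path)
    return paths
```

The resulting files can then be added and committed with `git add` and
`git commit` as the first commit, before any cleanup toward the SE format.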

On 6/24/26 9:56 AM, Chelsea McGill wrote:
> Thanks everyone for your response. If the text has a transcription on
> say Wikisource, how would I get it into the correct format to start the
> work? Would I need to create the different files by hand, copy and
> pasting the text?

Weijia Cheng

11:26 AM
to Standard Ebooks
In my experience, the best way to work with texts from Wikisource is to download the book as an HTML file ("Download" -> "Looking for a different format?"). I then like to use pandoc to convert the HTML to Markdown (this simplifies the Wikisource formatting, which can get unnecessarily complicated), clean up the Markdown, and then convert it back to HTML with pandoc. There is an example of this workflow in There Is Confusion.
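
A minimal sketch of that round trip, written as a small Python wrapper around the pandoc command line. It assumes pandoc is installed and on the PATH; the helper names and the file names in the usage note are placeholders, and the exact options used for There Is Confusion may well differ.

```python
import subprocess
from pathlib import Path


def pandoc_command(src: Path, dst: Path, from_fmt: str, to_fmt: str) -> list[str]:
    """Build a pandoc invocation that converts src to dst."""
    return [
        "pandoc",
        "--from", from_fmt,
        "--to", to_fmt,
        "--output", str(dst),
        str(src),
    ]


def html_to_markdown(src: Path, dst: Path) -> None:
    """Convert the Wikisource HTML download to Markdown for cleanup."""
    subprocess.run(pandoc_command(src, dst, "html", "markdown"), check=True)


def markdown_to_html(src: Path, dst: Path) -> None:
    """Convert the cleaned-up Markdown back to HTML."""
    subprocess.run(pandoc_command(src, dst, "markdown", "html"), check=True)
```

Usage would look something like `html_to_markdown(Path("book.html"), Path("book.md"))`, then editing `book.md` by hand, then `markdown_to_html(Path("book.md"), Path("book-clean.html"))`.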