Book repo file naming

Seth Woodworth

Mar 24, 2015, 10:23:53 AM
to gitenber...@googlegroups.com
Sam Wilson has been offering pull requests on Thyrza by George Gissing. It has been very helpful to get feedback on our very nascent documentation (see Tiago's thread about the wiki).

Questions have come up regarding file naming, both past and future.  Here is a quick review (that I ought to put in the new documentation repo):

(<book_id> in the following examples stands for the numerical book ID that Project Gutenberg gives to each work. It appears at the end of the GitHub repo name.)
 
Project Gutenberg file layout:
+ '<book_id>.txt' - the canonical source file for a book
+ '<book_id>_h/' - (optional) a directory containing the HTML version of the book
+ '<book_id>_8.txt' - (optional) an ascii-only edition of the book file
+ 'pg<book_id>.rdf' - an RDF metadata file from PG (added to the book folder by us, created by PG)
+ 'old/' - (optional) PG old editions of released books

GITenberg files created on upload:
+ 'README.rst' - a Readme file for each repo with a simple intro
+ 'LICENSE' - a copy of the PG license/footer
+ 'CONTRIBUTING.rst' - instructions on how to contribute to this GitHub repo

Files we will be adding:
+ '<book_id>.asciidoc' - a book file that has been converted to asciidoc
+ 'metadata.yml' - a yaml file serialization of available metadata for the book
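
To make the layout above concrete, here is a minimal Python sketch that sorts a file or directory name into one of the three groups listed. It is only an illustration: the function name, the role labels, and the book ID 1234 used in the usage note are made up for this example and are not part of any GITenberg tool.

```python
import re


def classify(name: str, book_id: str) -> str:
    """Return a rough role for a file or directory name in a book repo.

    The patterns mirror the layout listed above; the role labels are
    hypothetical and exist only for this sketch.
    """
    bid = re.escape(book_id)
    rules = [
        (rf"{bid}\.txt", "pg-source"),
        (rf"{bid}_h/?", "pg-html-dir"),
        (rf"{bid}_8\.txt", "pg-ascii"),
        (rf"pg{bid}\.rdf", "pg-rdf-metadata"),
        (r"old/?", "pg-old-editions"),
        (r"README\.rst|LICENSE|CONTRIBUTING\.rst", "gitenberg-upload"),
        (rf"{bid}\.asciidoc|metadata\.yml", "gitenberg-planned"),
    ]
    for pattern, role in rules:
        # fullmatch avoids accepting names that merely start with a known pattern
        if re.fullmatch(pattern, name):
            return role
    return "unknown"
```

For example, with a hypothetical book ID of 1234, classify("1234_h/", "1234") falls in the Project Gutenberg group and classify("metadata.yml", "1234") in the planned GITenberg group; anything else is reported as "unknown".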


The questions that Sam (and later @rdhyee) have raised are:
+ When creating the .asciidoc file, do we delete the original .txt file?
+ When creating Unicode-encoded txt files, do we keep the intermediate file?
+ Should we keep PG's 'old/' directory of book editions? They are now stored in git history and can be recalled.
+ If we have finished the asciidoc conversion of a book, should we keep the '<book_id>_h' folder of HTML?

Thoughts or opinions on these questions (or on others related to file names and structure)?

--Seth

Eric Hellman

Mar 24, 2015, 10:53:12 AM
to gitenber...@googlegroups.com
I would like us to consider the possibility of removing <book_id> from the repo file names.

Advantages
1. It makes the toolchain more reusable for non-PG projects, and for PG projects before an ID has been assigned.
2. Removing the need to pass a book_id to the toolchain simplifies its interfaces.

Disadvantages
1. It adds complexity to a reverse (GITenberg->Gutenberg) data flow.
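
A rough Python sketch of the trade-off, under stated assumptions: the generic stem "book" and both function names are hypothetical, since nothing in this thread settles on a replacement name. Going to generic names needs the ID only once, at import; going back to Gutenberg names needs it supplied again from somewhere else.

```python
import re

# Hypothetical placeholder stem; the thread does not choose a name.
GENERIC_STEM = "book"


def to_generic(name: str, book_id: str) -> str:
    """Swap a leading '<book_id>' (optionally prefixed with 'pg') for the stem."""
    bid = re.escape(book_id)
    # The lookahead keeps '12345.txt' from matching when book_id is '1234'.
    return re.sub(rf"^(pg)?{bid}(?=[._]|$)", rf"\g<1>{GENERIC_STEM}", name)


def to_gutenberg(name: str, book_id: str) -> str:
    """Reverse mapping: the book ID must come from outside the file name."""
    stem = re.escape(GENERIC_STEM)
    return re.sub(rf"^(pg)?{stem}(?=[._]|$)", rf"\g<1>{book_id}", name)
```

Names that do not start with the ID, such as README.rst or metadata.yml, pass through both functions unchanged. In the reverse direction the ID would presumably have to be read from metadata or from the repo name, which is where the extra complexity comes in.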


Eric


Tiago Saboga

Mar 24, 2015, 12:30:59 PM
to gitenber...@googlegroups.com
I still have no clue about the future data flow, and it seems to me a complex point. I understand it cannot be fully addressed before more work is done, but the sooner the problem is tackled, the better.

We do a first import of PG files, and then extract the text and convert it to asciidoc. How could we do such a conversion in a way that allows us to merge changes in the asciidoc version back into the PG format? I hope we can come up with something better than manually comparing them...

And what about keeping the original PG texts on a dedicated branch, so that their history is tracked separately from the GITenberg files? I have no previous experience with books, but this is how current Debian packaging practice works, and it allows a clearer separation of upstream sources and local patches. That way git itself would make it clear that a given version of the GITenberg asciidoc file comes from a specific PG version.

Tiago


