GITenberg status report


Eric Hellman

Mar 3, 2015, 10:56:56 AM
to gitenber...@googlegroups.com
GITenberg status report.
--------------------------------

Seth started GITenberg back in September of 2012. It was pretty much a one person effort. Through this mailing list, a few other people started thinking about what it could be. I discovered the project and joined up in March of 2014 when I was exploring similar ideas. The project got some good exposure on Hacker News last August.


### Knight Foundation Grant
When I heard about the Knight News Challenge for Libraries, I suggested to Seth that GITenberg might be a good fit. Together with Raymond Yee, Seth and I put together a proposal. We got help from Jenny Lee, Phoebe Espiritu, and Emily Nimsakont. 


There were 676 entrants in the News Challenge, and believe it or not, GITenberg was one of 22 entries to receive funding. We've been awarded a $35,000 "Prototype Grant", which will allow us to spend some real development time to start turning the idea into something that really works. More to the point, we have a deadline (in late June!) for demonstrating the GITenberg concept.

Now the work begins.

### Next Steps

Aside from 45,000+ repos on GitHub (a significant achievement by itself) GITenberg has so far been more concept than reality. If you tried to adopt a repo and submitted a pull request, you'll surely be aware that the GITenberg of today is more of a sketch than a working system. To make it a working system, we'll have to assemble a lot of cooperating components. Thankfully most of the components we need exist, and people are working on them. This became very clear at the Hack day sponsored by New York Public Library in January.

So I think it's important to make that sketch more explicit.

### Core Vision

The core vision is that for any text in Project Gutenberg, anyone will be able to fork a repo, commit a change, and GITenberg machinery triggered by the commit will derive ebook files and metadata products. The commit can be submitted as a pull request, and accepted PRs will get fed back into Project Gutenberg. We hope.
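
To make the "machinery triggered by the commit" idea a bit more concrete, here is a minimal Python sketch of the first decision such machinery might make: given the files touched by a commit, which products need rebuilding? Everything in it is an assumption for illustration. The function name `plan_build`, the file suffixes, and the product names are hypothetical and don't describe any existing GITenberg code.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from source-file suffix to the products that a commit
# touching such a file would need rebuilt. None of these names come from
# actual GITenberg code; they only illustrate the idea.
PRODUCTS_BY_SUFFIX = {
    ".html": ["epub", "mobi"],
    ".asciidoc": ["htmlbook", "epub", "mobi"],
    ".yaml": ["metadata"],
}


def plan_build(changed_files):
    """Return the sorted list of products to rebuild for one commit."""
    products = set()
    for name in changed_files:
        suffix = PurePosixPath(name).suffix.lower()
        # Files with an unknown suffix (a README, say) trigger nothing.
        products.update(PRODUCTS_BY_SUFFIX.get(suffix, []))
    return sorted(products)


if __name__ == "__main__":
    print(plan_build(["book.asciidoc", "README.md"]))
```

A real system would get the list of changed files from the commit itself; the sketch takes it as a plain list so that the decision logic stands alone.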

At this point, I should comment about Project Gutenberg. To fulfill its mission, Project Gutenberg has to be very conservative in its processes and operations. It doesn't have the resources to engage in speculative projects. So while Project Gutenberg is enabling the experimentation we're doing (and is happy that we're doing it), we expect that GITenberg will need to prove itself before feeding changes back into PG becomes a real thing.

One thing that Project Gutenberg has been thinking about for years is the source format for its texts. For a good while, that format was 7-bit ASCII text files, and there was a lot of resistance to migrating to anything more "modern". Now, the plain text you get from Project Gutenberg is UTF-8. Sort of. The HTML files are maintained separately, and are not uniform; there's a lot of hand-coding. Changing the source format to RST, XML or TEI has been discussed. The PG ebook files (MOBI and EPUB) are built using a script called ebookmaker, which digests the HTML files. The HTML files are thus the "source" files as far as the ebooks are concerned. It should be possible for us to duplicate this workflow in the GITenberg machinery.

On the metadata side, the situation is more obscure, and we're still working to understand it. There's a set of RDF files, and there are metadata records associated with each ebook folder.


### Book Formats

We've surveyed the components now available, and we feel that we can also improve on the existing workflow by migrating away from HTML as a source format. At this point, asciidoc appears to be the best fit for a format that can be a source format for the required product files, while at the same time fitting with the established PG text corpus and the Git-based version control. It looks like the best choice for ebook and web formats is the HTMLBook flavor of HTML5 (http://oreillymedia.github.io/HTMLBook/). There’s a converter for asciidoc that makes HTMLBook files (https://github.com/oreillymedia/asciidoctor-htmlbook), and there are CSS themes that support HTMLBook. We expect that alternate paths into HTMLBook can be developed (or already exist) for LaTeX and TEI source formats. Pandoc has already done quite a lot here.

Internet Archive seems like the best destination for GITenberg produced ebook files.

NYPL Labs has done some really nice work on generating covers for PG texts; we expect to integrate that work as well.

On the metadata side, we've started looking at YAML as an appropriate serialization for PG-associated metadata. Conversion to MARC and other formats should be straightforward in the backend.
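
As a rough sketch of why that conversion should be straightforward, the Python below maps a simple metadata record onto MARC-style fields. The record is written as a plain dict, standing in for what loading a YAML file might produce. The key names (`title`, `creator`) and the function `to_marc_like` are assumptions made up for this example, not a settled schema. The tags 100 (main entry, personal name) and 245 (title statement) are the standard MARC 21 ones, but the output is a simplified list of triples, not a real MARC record.

```python
# A hypothetical metadata record, standing in for the result of loading a
# YAML file. The key names are assumptions, not an agreed GITenberg schema.
record = {
    "title": "The Rime of the Ancient Mariner",
    "creator": "Coleridge, Samuel Taylor",
    "subjects": ["Poetry"],  # deliberately left unmapped in this sketch
}

# Keys on the left are hypothetical; tags on the right are the standard
# MARC 21 tags for main-entry personal name (100) and title statement (245).
MARC_TAGS = {"creator": "100", "title": "245"}


def to_marc_like(rec):
    """Return (tag, subfield code, value) triples, ordered by tag."""
    fields = [
        (MARC_TAGS[key], "a", value)
        for key, value in rec.items()
        if key in MARC_TAGS  # unmapped keys are skipped, not guessed at
    ]
    return sorted(fields)


if __name__ == "__main__":
    for tag, code, value in to_marc_like(record):
        print(f"{tag} ${code} {value}")
```

The point of the sketch is only that a flat key-to-tag table covers the simple cases; repeated fields, indicators, and additional subfields would need more than this.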

### Issues

GitHub itself has presented us with a set of challenges to address. The large number of repos in the GITenberg organization breaks some GitHub tools. For example, GitHub for Mac became unstable for me, and some 3rd party integrations would time out when we tried enabling them. We broke our GitHub Pages. So we need to understand this better; GitHub support has been very responsive. There's a separate organization, "gitenberg-dev" (https://github.com/gitenberg-dev), that we're using to let us easily work on code until we fully understand how to work with 50,000 repos. At this point, you probably don’t want to be a member of the GITenberg organization, but you might want to join gitenberg-dev, even if you’re not a developer.

The non-programmer usability of Github is another problem. We're going to set up a "github for poets" sandbox to see if this challenge can be addressed.

Despite the Knight grant, and the efforts of some committed volunteers, this is still a very small effort. GITenberg can't succeed without a lot of help, cooperation, and collaboration. I hope everyone on this list will help us nurture that success.

Here’s something each of us can do to get the ball rolling: Decide on a Gitenberg repo to contribute to. Star it in Github. Then add it to the list of active repos at https://github.com/gitenberg-dev/wiki/blob/master/activerepos.csv
(send a PR or create an issue https://github.com/gitenberg-dev/wiki/issues )

If you’re new to Github, instructions are at https://github.com/gitenberg-dev/wiki/blob/add_how_to/how_to.md

There's a huge amount that we don't know, and so much prior work we've yet to absorb, but we're really encouraged by all the expressions of support we've received. Thank you all!

Eric



Roger Sperberg

Mar 4, 2015, 12:10:55 PM
to gitenber...@googlegroups.com
This is great news.

I thought the NYPL cover-generating work was neat, but since then I've encountered tagxedo.com, which I think provides a less-abstract and more direct connection to content.

Here is a tagxedo-generated image for Rime of the Ancient Mariner:

The fonts were selected at random, and I picked one of the many shapes and color themes available, but this tag-cloud approach seems to me a more rewarding reflection of the content of a particular file than does geometric abstraction.

Seth Woodworth

Mar 4, 2015, 1:30:46 PM
to gitenber...@googlegroups.com
That looks nice! There are a number of different ways to auto-generate covers.  I'm inclined to accept whatever folks think look the best.

Ideally, we could get high-quality original cover artwork released into the public domain as SVG.  That might actually happen in some cases; I'm talking to a few groups that might be willing to provide some covers.




Roger Sperberg

Mar 4, 2015, 2:34:48 PM
to gitenber...@googlegroups.com
Of course, I suggest using generated art as a cover only in the cases when no cover art is available. 

With 45,000 texts, the number is bound to be quite large, so probably the more strategies the better.

Eric Hellman

Mar 4, 2015, 7:33:00 PM
to gitenber...@googlegroups.com
It's worth looking at the LibraryThing page of covers for the same work.

How would gitenberg decide?

One criterion could be - latest public domain/ CC0 cover.

But maybe we also need to figure out a test for generated or original (new) covers. Perhaps the criterion for accepting a PR with a new original cover would be a poll with some number of votes.

Deciding on a cover, or even a defined set of covers, is a really hard problem!

There is value in having a cover that people recognize as a gitenberg cover.

Many distributors might want to have a customized cover: the NYPL curated cover, or the Overdrive curated cover. Much to think about!


Tom Morris

Mar 24, 2015, 2:28:55 PM
to gitenber...@googlegroups.com
On Tue, Mar 3, 2015 at 10:56 AM, Eric Hellman <er...@hellman.net> wrote:
> GITenberg status report.
> --------------------------------
>
> Seth started GITenberg back in September of 2012. It was pretty much a one
> person effort. Through this mailing list, a few other people started
> thinking about what it could be. I discovered the project and joined up in
> March of 2014 when I was exploring similar ideas. The project got some good
> exposure on Hacker News last August.
>
>
> ### Knight Foundation Grant
> When I heard about the Knight News Challenge for Libraries, I suggested to
> Seth that GITenberg might be a good fit. Together with Raymond Yee, Seth and
> I put together a proposal. We got help from Jenny Lee, Phoebe Espiritu, and
> Emily Nimsakont.
>
> https://www.newschallenge.org/challenge/libraries/feedback/gitenberg-modern-maintenance-infrastructure-for-our-literary-heritage

Congratulations on the windfall! What entity received the grant? The
application says "Gitenberg Project." Is that a non-profit
incorporated somewhere or did some other entity receive the money?

> We've surveyed the components now available, and we feel that we can also
> improve on the existing workflow by migrating away from HTML as a source
> format. At this point, asciidoc appears to be the best fit for a format that
> can be a source format for the required product files, while at the same
> time fitting with the established PG text corpus and the Git-based version
> control.
...
> Internet Archive seems like the best destination for GITenberg produced
> ebook files.
...
> On the metadata side, we've started looking at YAML as an appropriate
> serialization for PG-associated metadata. conversion to MARC and other
> formats should be straightforward in the backend.

The grant application seems to incorporate decisions on a bunch of
topics that I didn't even realize were even on the table: asciidoc,
yaml, etc.

Have I just not been paying enough attention or were all these
discussions held in a different forum?

Tom

Seth Woodworth

Mar 24, 2015, 3:41:08 PM
to gitenber...@googlegroups.com
On Tue, Mar 24, 2015 at 2:28 PM, Tom Morris <tfmo...@gmail.com> wrote:


> Congratulations on the windfall!  What entity received the grant?  The
> application says "Gitenberg Project."  Is that a non-profit
> incorporated somewhere or did some other entity receive the money?

No entity yet.  The Knight Foundation granted the money to the Miami Foundation on our behalf, and we're able to submit invoices against the grant.  I think we are as yet undecided on whether the organization should be sheltered under an existing non-profit like Internet Archive or DPLA (if either were open to the idea).

> The grant application seems to incorporate decisions on a bunch of
> topics that I didn't even realize were even on the table: asciidoc,
> yaml, etc.

These are the unfortunate trade-offs between a rapid grant-writing process and the need to communicate with everyone as a group.  They are mainly decisions I've made, but they aren't set in stone.  I feel like I can make a strong case for asciidoc and I would be happy to discuss it at length.

Knight has funded us to produce a prototype over the next ~3 months.  For expediency, choices were made for the grant application.  I'd like to think we made the right decisions for the prototype goal, but I'm 100% willing to question and debate any decisions made during this prototype phase.

Eric Hellman

Mar 24, 2015, 3:42:24 PM
to gitenber...@googlegroups.com
On Mar 24, 2015, at 2:28 PM, Tom Morris <tfmo...@gmail.com> wrote:


> Congratulations on the windfall!

Thanks, Tom! But windfall isn't a word we would use to describe it, because there's a connotation of "random, lucky, unearned". But in the sense of "welcome, unexpected", definitely.

> What entity received the grant?  The
> application says "Gitenberg Project."  Is that a non-profit
> incorporated somewhere or did some other entity receive the money?

The fiscal sponsor for the project is the Miami Foundation. I just received word that a check had been sent to them.

Gitenberg is not a legal entity; we're investigating possibilities.


> > We've surveyed the components now available, and we feel that we can also
> > improve on the existing workflow by migrating away from HTML as a source
> > format. At this point, asciidoc appears to be the best fit for a format that
> > can be a source format for the required product files, while at the same
> > time fitting with the established PG text corpus and the Git-based version
> > control.
> ...
> > Internet Archive seems like the best destination for GITenberg produced
> > ebook files.
> ...
> > On the metadata side, we've started looking at YAML as an appropriate
> > serialization for PG-associated metadata. conversion to MARC and other
> > formats should be straightforward in the backend.
>
> The grant application seems to incorporate decisions on a bunch of
> topics that I didn't even realize were even on the table: asciidoc,
> yaml, etc.
>
> Have I just not been paying enough attention or were all these
> discussions held in a different forum?
>
> Tom

Tom,

I don't think you've missed much.

If you look at the proposal from October, none of these "decisions" are there, although Seth was just getting interested in asciidoc. I'd say the discussions have just started and are ongoing, and nothing is written in stone. At the time I wrote that first status report, I had known about yaml for approximately 1 day.

There's been some discussion in the issues sections of Gitenberg and gitenberg-dev, and we're trying to figure out how to make sure things get appropriately exposed.

To review the 3 issues you highlight:

Source format:
The tool chain will probably be supporting multiple source formats via build processors. The contenders, as I see it, are:
1. the existing Gutenberg html. Most of the code already exists.
2. asciidoc. A lot of the code already exists. This is receiving the most attention
3. latex. definitely has its uses
4. tei. ditto.
5. markdown. some code exists. I tried this on one repo. Meh.
6. htmlbook flavor of html5. Probably only practical via asciidoc.
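
One way to picture "multiple source formats via build processors" is a small registry that maps a source-format name to the function that builds from it. The Python below is only a sketch under that assumption: the decorator `processor`, the `build` function, and the two placeholder processors are hypothetical, and the processors just return a description of what they would do instead of running any real tool.

```python
# Hypothetical registry: source-format name -> build function.
PROCESSORS = {}


def processor(fmt):
    """Decorator that registers a build function for one source format."""
    def register(func):
        PROCESSORS[fmt] = func
        return func
    return register


@processor("html")
def build_from_html(path):
    # Placeholder: a real processor would invoke the existing HTML-based build.
    return f"would run the existing HTML-based build on {path}"


@processor("asciidoc")
def build_from_asciidoc(path):
    # Placeholder: a real processor would convert to HTMLBook first.
    return f"would convert {path} to HTMLBook, then build ebooks"


def build(fmt, path):
    """Dispatch to the processor registered for fmt, or fail loudly."""
    try:
        handler = PROCESSORS[fmt]
    except KeyError:
        raise ValueError(f"no build processor registered for {fmt!r}") from None
    return handler(path)


if __name__ == "__main__":
    print(build("asciidoc", "book.asciidoc"))
```

Adding latex, tei, or markdown would then mean registering one more function, without touching the dispatch code.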

File Destination.
The possibilities
1. Internet Archive.
3. Both

Metadata format
I've written 4 gists on it this week, summarizing my work from last week, and there are probably 2 more coming.

Eric




