Some suggestions towards a style guide

startling

Feb 27, 2012, 9:56:49 PM

to project-alexandria-books

First of all: hi, I lurk on HN and saw this and I'm pretty excited
about it.

So, I'd like to format a short-ish story, maybe [An Occurrence at Owl
Creek Bridge][] by Ambrose Bierce. How should the formatting look?
There was some discussion about this in the "Welcome to Project
Alexandria" thread, but I feel this warrants its own thread. rst seems
like a decent choice, but I have a few questions (and suggestions)
about how basic things should be handled.

Line breaks: I'd like to suggest that lines be wrapped to 80 or 90
characters. It makes reading and writing a thousand times easier, and
it can be easily undone by automated tools. Paragraph breaks would be
double-newlines, which seems consistent with how rst is usually
handled.
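
To make "easily undone by automated tools" concrete, here is a minimal
sketch in Python using only the standard library. The function names
are made up for illustration, and it assumes paragraphs are separated
by blank lines as suggested above. It would flatten verse or anything
else where the line breaks carry meaning, so treat it as a sketch and
not a finished tool.

```python
import textwrap


def unwrap(text: str) -> str:
    """Join hard-wrapped lines; paragraphs are separated by blank lines."""
    paragraphs = text.split("\n\n")
    # str.split() with no arguments collapses all whitespace, newlines included.
    # Caveat: a compound word hyphenated across a line break ("human-" /
    # "written") comes back with a space after the hyphen.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


def rewrap(text: str, width: int = 80) -> str:
    """Re-wrap each paragraph to at most `width` characters per line."""
    paragraphs = unwrap(text).split("\n\n")
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


sample = "one two\nthree\n\nfour\nfive"
print(unwrap(sample))
print(rewrap(sample, width=10))
```

Wrapping and then unwrapping should give back the same paragraphs, which
is the property that makes the wrap width a presentation detail and not
part of the text.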

Line numbers: I'd like to do away completely with line numbers and
page numbers in source texts. They're distracting, hard to format
around, and easily added by automated tools. These aren't ubiquitous
in Project Gutenberg texts, but they're terrible, so I feel I need to
mention them.

Metadata: the license information and metadata at the beginnings and
ends of Project Gutenberg texts have always seemed extraneous to me.
Could we have these things in separate files? A license.txt and a
metadata.txt seem like they would work. Maybe we should have a
readme.rst to cooperate with GitHub? It would be neat if we could add
little descriptions or reviews of the text somewhere, especially for
the lesser-known works that don't exist online except on Project
Gutenberg. I think metadata should include (at the very least) word
count, authors, editors, contributors, original language, text
language, translators, publishing date, source ISBN (if it's
available) and maybe a set of topic tags. I'd suggest YAML or JSON for
metadata, leaning heavily towards YAML as it's meant to be human-
written. JSON is nicer to write than XML, but it's easy to leave a
trailing comma or something that invalidates the whole file.
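
To illustrate the trailing-comma point, here is a small sketch using
Python's standard json module. The field names are only illustrative
guesses at what a metadata file might hold, not an agreed schema.

```python
import json

# Hypothetical metadata for one text; the keys are illustrative only.
valid = """
{
  "title": "An Occurrence at Owl Creek Bridge",
  "authors": ["Ambrose Bierce"],
  "text_language": "en"
}
"""

# Same content, with a single trailing comma after the last entry.
broken = """
{
  "title": "An Occurrence at Owl Creek Bridge",
  "authors": ["Ambrose Bierce"],
  "text_language": "en",
}
"""


def parses(text: str) -> bool:
    """Return True if `text` is valid JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(parses(valid), parses(broken))
```

The second document differs from the first by one character, yet the
strict parser rejects the whole file.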

Encoding: All tools should work with UTF-8 by default. UTF-8 texts
shouldn't have a BOM, because BOMs make editing and writing tools a
pain. We could use UTF-16 when double-width characters are needed, in
which case we would need a BOM; I suspect these would require a
different set of tools, anyway.
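
For the UTF-8 case, a tool could normalize incoming files along these
lines. This is a minimal sketch with made-up function names, using
only Python's standard library; it assumes the input really is UTF-8
and does nothing for UTF-16.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def read_text(data: bytes) -> str:
    """Decode UTF-8 bytes, tolerating an optional leading BOM."""
    # The "utf-8-sig" codec decodes UTF-8 and drops a leading BOM if present.
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + "Owl Creek".encode("utf-8")
print(strip_utf8_bom(with_bom) == b"Owl Creek")
print(read_text(with_bom) == read_text(b"Owl Creek"))
```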

It seems prudent not to worry about specifics, but to stick to a well-
defined standard for easy conversion later should we change our minds.
These are just a few suggestions I have, and places I could see being
a little thorny.

I hope this doesn't come off as too opinionated or bikesheddy. Thanks
for reading!

[An Occurrence at Owl Creek Bridge]: http://www.gutenberg.org/ebooks/375

Brian Troutwine

Feb 27, 2012, 10:42:18 PM

to prj-ale...@googlegroups.com

Greetings.

Please review https://github.com/blt/project-alexandria/tree/owl_creek_bierce
Travel down b/ and let me know what you think. I've set the directory
structure to be alphabetical, which works OK for now. This is
divergent from my outstanding pull-request to the main project, but I
think I prefer this far more than the flat-files in root approach.

On Mon, Feb 27, 2012 at 9:56 PM, startling <tdixo...@gmail.com> wrote:
> First of all: hi, I lurk on HN and saw this and I'm pretty excited
> about it.
>
> So, I'd like to format a short-ish story, maybe [An Occurrence at Owl
> Creek Bridge][] by Ambrose Bierce. How should the formatting look?
> There was some discussion about this in the Welcome to Project
> Alexandria but I feel this warrants its own thread. rst seems like a
> decent choice, but I have a few questions (and suggestions) about how
> basic things should be handled.
>
> Line breaks: I'd like to suggest that lines be wrapped to 80 or 90
> characters. It makes reading and writing a thousand times easier, and
> it can be easily undone by automated tools. Paragraph breaks would be
> double-newlines, which seems consistent with how rst is usually
> handled.

I chose 80: `fmt -w 80`

> Line numbers: I'd like to do away completely with line numbers and
> page numbers in source texts. They're distracting, hard to format
> around, and easily added by automated tools. These aren't ubiquitous
> in Project Gutenberg texts, but they're terrible, so I feel I need to
> mention them.

Agreed.

> Metadata: the license information and metadata at the beginnings and
> ends of Project Gutenberg texts have always seemed extraneous to me.
> Could we have these things in separate files? a license.txt and
> metadata.txt seem like they would work. Maybe we should have a
> readme.rst to cooperate with github? It would be neat if we could add
> little descriptions or reviews of the text somewhere, especially for
> the lesser-known works that don't exist online but for Project
> Gutenberg. I think metadata should include (at the very least) word
> count, authors, editors, contributors, original language, text
> language, translators, publishing date, source isbn (if it's
> available) and maybe a set of topic tags. I'd suggest YAML or JSON for
> metadata, leaning heavily towards YAML as it's meant to be human-
> written. JSON is nicer to write than XML, but it's easy to leave a
> trailing comma or something that invalidates the whole file.

I dropped the metadata into 'metadata.yml' and the license into
'license.txt'. I didn't fill out metadata.yml as much as you'd
intended.

> Encoding: All tools should work with utf-8 by default. Utf-8 texts
> shouldn't have BOM because they make editing and writing tools a pain.
> We could use utf-16 when double-width characters are needed, in which
> case we would need a BOM; I suspect these would require a different
> set of tools, anyway.
>
> It seems prudent not to worry about specifics, but to stick to a well-
> defined standard for easy conversion later should we change our minds.
> These are just a few suggestions I have, and places I could see being
> a little thorny.
>
> I hope this doesn't come off as too opinionated or bikesheddy. Thanks
> for reading!

Patches welcome if I missed your vision!

> [An Occurrence at Owl Creek Bridge]: http://www.gutenberg.org/ebooks/375
>

> --
> You have received this message from the Project Alexandria mailing list.
> The IRC channel is #ProjectAlexandria on irc.freenode.net.
> The project-central repository is here: https://github.com/felix-faber/project-alexandria

--
Brian L. Troutwine

Martin DeMello

Feb 27, 2012, 10:42:49 PM

to prj-ale...@googlegroups.com

I do not believe that line breaks should be a property of the source
text. We can have a linebreak formatter through which to produce
output, but that's a separate issue.

martin

Cameron Hill

Feb 27, 2012, 11:00:18 PM

to project-alexandria-books

I agree. The raw source of the text should be human readable but not
necessarily beautiful. Defined line widths are unnecessary styling (in
the sense that they have no inherent 'data' wrt the text itself) but
can always be applied (even customized) on output along with any
additional style and formatting that will make the texts 'beautiful'
to readers.

Brian Troutwine

Feb 27, 2012, 11:10:45 PM

to prj-ale...@googlegroups.com

In the case of ReST, the single line breaks will only be reflected in
the source, which I think _should_ be nicely ordered. What does and
does not translate into a public-ready format will be defined by the
markup format converter; I assumed these comments were for input
texts.

--
Brian L. Troutwine

Cameron Hill

Feb 27, 2012, 11:30:44 PM

to project-alexandria-books

I'm still catching up on ReST and wanted to share a few resources I've
come across in case anyone else wants to read up:

Intro to RST: http://docutils.sourceforge.net/docs/ref/rst/introduction.html
http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html

Online RST editor to play around with syntax: http://rst.ninjs.org/

Pandoc: Haskell library that has begun to tackle the problem of
converting between numerous formats (incl. Markdown, RST, HTML, LaTeX,
PDF, etc.) http://johnmacfarlane.net/pandoc/

It also appears that, similar to Markdown extensions, it is possible
to extend ReST through custom directives. We will come across many
pain points in deciding how to capture some types of formatting in our
input file (small-caps, for example, don't seem to be available in
either markdown or ReST out of the box). Even if some existing tools
or output formats don't know how to handle some of this data, I think
it's important that we try to capture it in the input so that tools
(if/when they can interpret more nuanced formatting) will be able to
output the most "complete" work possible. Having the ability to extend
our text format is important; however, we open up a can of worms when
we start heavily modifying the standard ReST format.

Another thread mentioned TEI (http://www.tei-c.org/index.xml), which I
haven't looked into yet, and while I'm wary of its XML foundations,
perhaps they have a more robust format we can work with.

startling

Feb 28, 2012, 3:52:19 AM

to project-alexandria-books

A link that I've found really useful is Sphinx's primer:
http://sphinx.pocoo.org/rest.html

On Feb 27, 10:30 pm, Cameron Hill <cameron.h...@gmail.com> wrote:
> I'm still catching up on ReST and wanted to share a few resources I've
> come across in case anyone else wants to read up:
>
> Intro to RST: http://docutils.sourceforge.net/docs/ref/rst/introduction.html
> http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html
>
> Online RST editor to play around with syntax: http://rst.ninjs.org/
>
> Pandoc: Haskell library that has begun to tackle the problem of
> converting between numerous formats (incl. Markdown, RST, HTML, LaTeX,
> PDF, etc.) http://johnmacfarlane.net/pandoc/

Travis Jensen

Feb 28, 2012, 12:41:50 PM

to prj-ale...@googlegroups.com

In some ways, I think about these source texts like source code. Yes, you can store JavaScript that doesn't have line endings, but you don't want to. To the extent possible, source should be valuable by itself.

But, I also recognize I may be taking my analogy too far here. The fact is there is no semantic difference between text with and without returns and little syntactic difference.

Would there be value in one or the other from the perspective of being able to apply meta-data about specific portions of the text?

tj

--
Travis Jensen

Read the Software Maven @ http://softwaremaven.innerbrane.com/
Read my LinkedIn profile @ http://www.linkedin.com/in/travisjensen
Read my Twitter mumblings @ http://twitter.com/SoftwareMaven
Send me email @ travis...@gmail.com

*What kind of guy calls himself the Software Maven???*

Allen Tan

Feb 28, 2012, 2:25:08 PM

to project-alexandria-books

I want to point out that there are times, as in poetry or in
Shakespeare plays, that line breaks at specific points are very
important. There should be a way of distinguishing between line
endings added for readability and line endings that are crucial for
the text.

Soft line breaks vs hard line breaks, maybe?

Allen

On Feb 28, 12:41 pm, Travis Jensen <travis.jen...@gmail.com> wrote:
> In some ways, I think about these source texts like source code. Yes, you
> can store JavaScript that doesn't have line endings, but you don't want to.
> To the extent possible, source should be valuable by itself.
>
> But, I also recognize I may be taking my analogy too far here. The fact is
> there is no semantic difference between text with and without returns and
> little syntactic difference.
>
> Would there be value in one or the other from the perspective of being able
> to apply meta-data about specific portions of the text?
>
> tj
>

> On Mon, Feb 27, 2012 at 8:42 PM, Martin DeMello <martindeme...@gmail.com> wrote:
>
> > I do not believe that line breaks should be a property of the source
> > text. We can have a linebreak formatter through which to produce
> > output, but that's a separate issue.
>
> > martin

startling

Mar 1, 2012, 3:30:58 PM

to project-alexandria-books

In those cases, you would use a double-linebreak, which rst makes into
an actual linebreak in its output.
