book markup


E L

Oct 17, 2012, 1:17:56 AM
to open...@googlegroups.com
Hi,

Another important point that will contribute a lot to cross-project collaboration is the
format in which the texts or books are kept. I know both Orayta and Open Siddur have really looked into
it, but here are my thoughts about what can be important:

- Separation of information and display
- Support for book versions, e.g. the tikuneu girsaut can live together in the same file
- Easy to parse
- Allows adding metadata, whether it's a pasuk number, a parasha, or a page in the gmara
- Supports metadata such as chapter, title, author, etc.
- Supports cross-book references, so for example I can reference a pasuk or a gmara passage. That can be useful
in a siddur, for example, where I can just import a lot of things from Tehilim or the Mishna

What do you think? Is there some format we can start working from? TEI?

Ely
 

Yaron Shahrabani

Oct 17, 2012, 11:54:15 AM
to E L, open...@googlegroups.com
What are the disadvantages of ODF?
Yaron Shahrabani
<Hebrew translator>




Efraim Feinstein

Oct 18, 2012, 1:29:29 PM
to open...@googlegroups.com
Hi,

On 10/17/2012 08:54 AM, Yaron Shahrabani wrote:
> What's the disadvantages of ODF?

ODT is a word processor format. It's quite complicated to work with on
its face and handles issues that word processors handle. It is not a
good format for presenting semantically-tagged information about a book
(although you can shoehorn it in, just as you can in HTML using
microformats).

TEI is verbose, but does handle a lot of the issues that we would need
to handle, including academic metadata. TEI also has a built-in
extension mechanism that does two things: (1) allows removal of tags
that you don't need (2) allows addition of tags in cases where you need
to tag a specific structure that isn't handled in the default tag set.

Open Siddur proposes a TEI extension that's useful for liturgy (and, by
extension, other Jewish books). The main issues that are necessary to
handle are:
- multiple concurrent XML hierarchies
- structures that are unique to liturgy
- linkages of texts internally and externally (annotations, comments, notes)
- transclusion
- conditional inclusion of texts (and how to specify the conditions!)

-Efraim

Aharon Varady

Oct 18, 2012, 3:57:48 PM
to open...@googlegroups.com
More information about the TEI extension Efraim developed for Jewish Liturgy is at:

http://wiki.jewishliturgy.org/JLPTEI

What we've been investigating is which online collaborative tools we can use to mark up texts with JLPTEI XML.

Ideally, we'd like markup within a web interface to be invisible -- something more of a function of highlighting text and selecting how it should be appropriately tagged from a toolbar and context menu.

Practically, Efraim has worked to convert existing markup languages to JLPTEI. Most of this work has focused on a markup language that J.B. Hare developed for his Internet Sacred Text Archive which he called STML (Sacred Text Markup Language). Efraim's documentation for manually tagging proofread, digital texts with STML is here: https://github.com/opensiddur/opensiddur/wiki/Open-Siddur-STML

STML provides something like a minimum set of markup for converting a siddur to JLPTEI.

We remain very interested in how we can employ a combination of MediaWiki templates and wiki tags to prepare texts within the collaborative transcription/proofreading interface of Wikisource, so that these texts can be directly converted to JLPTEI XML and added to our open source database of Jewish liturgy and liturgy-related work.

Aharon

Efraim Feinstein

Oct 18, 2012, 4:46:03 PM
to open...@googlegroups.com
On 10/18/2012 12:57 PM, Aharon Varady wrote:
> More information about the TEI extension Efraim developed for Jewish
> Liturgy is at:
>
> http://wiki.jewishliturgy.org/JLPTEI
>

Just so you know, that page is a bit out of date -- the format also
underwent a revision to make it simpler to create/digest.

The current docs are actually generated from the source code (I know, I
know...)

--
---
Efraim Feinstein
Lead Developer
Open Siddur Project
http://opensiddur.net
http://wiki.jewishliturgy.org

E L

Oct 19, 2012, 4:10:39 AM
to Efraim Feinstein, open...@googlegroups.com
Hi,
I started reading it, and IMHO this file format is scary.
I don't see any non-expert being able to understand what's going on there,
which reduces the number of people who will be able to program for this file format.

I have a lot of comments about different parts. For example, the reference to the bible uses non-Jewish names and a split into chapters. Do you really want to use it? The citation mechanism lets you cite whole psukim, but, for example,
in the gmara and Rashi it's really important to know which words they chose to quote. Also, Rashi and other mefarshim often
referred to a certain version of the gmara or mishna, and it will be very hard to study without that information.

The problem is that it's hard to comment on the whole thing. Maybe we should make a cross-project file format
that can be discussed in parts? Maybe people have different ideas, such as basing it on RelaxNG or EPUB formats.
I think we can start with bible + gmara with mefarshim, which will cover most cases.

Kudos for the amazing work on the file format!

Ely
 

Efraim Feinstein

Oct 19, 2012, 4:31:33 AM
to E L, open...@googlegroups.com
Hi,

On 10/19/2012 01:10 AM, E L wrote:
> Hi,
> I started reading it, IMHO this file format is scary.

Have you seen the ODT spec? :-)

Seriously, though -- comments accepted, but know that the spec on the
wiki is more complex than the current one.

> I don't see any non expert being able to understand what's going on there.
> Which reduces the amount of people that will be able to program for
> this file format.

It's actually quite a bit simpler than what's in that doc, and it's
relatively easy to digest simple documents. I'm attaching 3 example
documents from Tanach:
1. A prose text
2. A poetic text
3. A book

(The split into chapters is because the original source - the WLC - is
like that, and, while it's not ideal, it is somewhat standard.)

>
> I have a lot of comments about different parts, for example the
> reference to the bible uses non jewish names and splitting to chapters.

Actually, it doesn't. In the current incarnation, the names are
arbitrary -- they're filenames. A reference either takes the form
filename#id or filename#range(id1,id2). (The @cref portion has been
removed because it's more difficult to implement).
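
To illustrate the two reference forms described above (filename#id and filename#range(id1,id2)), here is a minimal Python sketch of a parser. The names `Reference` and `parse_reference` and the sample filenames are hypothetical; this is not the Open Siddur implementation, only one way such references could be split into their parts.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reference:
    """A parsed reference (hypothetical structure, for illustration only)."""
    filename: str
    start_id: str
    end_id: Optional[str] = None  # set only for range references


# Matches a fragment of the form range(id1,id2)
_RANGE = re.compile(r"^range\(\s*([^,()\s]+)\s*,\s*([^,()\s]+)\s*\)$")


def parse_reference(ref: str) -> Reference:
    """Split 'filename#id' or 'filename#range(id1,id2)' into its parts."""
    filename, sep, fragment = ref.partition("#")
    if not sep or not fragment:
        raise ValueError(f"missing #fragment in {ref!r}")
    match = _RANGE.match(fragment)
    if match:
        return Reference(filename, match.group(1), match.group(2))
    return Reference(filename, fragment)
```

For example, `parse_reference("genesis.xml#range(v1,v3)")` yields a `Reference` whose `start_id` is `v1` and whose `end_id` is `v3`, while `parse_reference("genesis.xml#v1")` leaves `end_id` as `None`.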

> Do you really want to use it? the citation allows you to do whole
> psukim, but for example
> in the gmara and rashi it's really important to know which words they
> chose to quote.

Actually, the references are completely generic. As long as there's a
unique id, there's the ability to reference it.

> The problem is that it's hard to comment about the whole thing. Maybe
> we should make a cross project file format
> that can be discussed in parts? Maybe people have different ideas such
> as basing on relaxNG or epub formats.

RelaxNG is a schema language. EPUB is essentially a formatting language
(a limited version of HTML for ebooks).

I would welcome discussing everything in parts (and I've been waiting
for someone to want to discuss this format issue for a while!). I am
certainly open to accepting an in-field de-facto standard over my
concoction, but it should support the features that will be needed.

I would strongly recommend reading some parts of the TEI Guidelines at
<http://www.tei-c.org>. A lot of the problems you'll encounter trying to
work on book formats are solved there.

Attachments: בראשית א.xml, תהלים א.xml, בראשית.xml

E L

Oct 19, 2012, 10:07:50 AM
to Efraim Feinstein, open...@googlegroups.com
Hi,

On Fri, Oct 19, 2012 at 10:31 AM, Efraim Feinstein <efraim.f...@gmail.com> wrote:
> Hi,
>
> On 10/19/2012 01:10 AM, E L wrote:
>> Hi,
>> I started reading it, IMHO this file format is scary.
>
> Have you seen the ODT spec? :-)

The problem with XML specs is that they merge together into one huge monster.

> Seriously, though -- comments accepted, but know that the spec on the wiki is more complex than the current one.

Do you have a link to the new version, then?

>> I don't see any non expert being able to understand what's going on there.
>> Which reduces the amount of people that will be able to program for this file format.
>
> It's actually quite a bit simpler than what's in that doc, and it's relatively easy to digest simple documents. I'm attaching 3 example documents from Tanach:
> 1. A prose text
> 2. A poetic text
> 3. A book
>
> (The split into chapters is because the original source - the WLC - is like that, and, while it's not ideal, it is somewhat standard.)

I see. Is there a way to do things like Ha'azinu and Shirat Hayam?

>> I have a lot of comments about different parts, for example the reference to the bible uses non jewish names and splitting to chapters.
>
> Actually, it doesn't. In the current incarnation, the names are arbitrary -- they're filenames. A reference either takes the form filename#id or filename#range(id1,id2). (The @cref portion has been removed because it's more difficult to implement).

I'm a great believer in simplicity :-) But I guess it's a good base to build upon.

>> Do you really want to use it? the citation allows you to do whole psukim, but for example
>> in the gmara and rashi it's really important to know which words they chose to quote.
>
> Actually, the references are completely generic. As long as there's a unique id, there's the ability to reference it.

To use something like that efficiently in a program, one will need to index it or even put it in a database.
We should consider both how the information is stored and edited and how it should be used from within programs.
For example, on cell phones it's important that the texts be shared.

>> The problem is that it's hard to comment about the whole thing. Maybe we should make a cross project file format
>> that can be discussed in parts? Maybe people have different ideas such as basing on relaxNG or epub formats.
>
> RelaxNG is a schema language. EPUB is essentially a formatting language (a limited version of HTML for ebooks).

They both support separation between information and presentation, and can use XML for extensions.
But I think what is important is to first set the design goals:

- Simple to use (by everyone, to the point where people can write their dvar tora in that format and use references and so on when showing it to others).
- Can be used in distributed editing projects.
- Jewish oriented. I know it's weird to set it as a goal, but if you are interested I can elaborate on why it's important.
- Supports multiple versions of the same book (e.g. you can reference a version, see differences in an easy way, etc.)
- Supports references and quotes efficiently.
- Supports search and indexing.

I also think that keeping things as they are now should not be a design goal. For example, I see no reason to split the gmara
by pages from 300 years ago when allowing people to reference them, or any other weird decision like that.

The bible IMHO should be split by Jewish books, and inside each book it should be split by parshiut ptochut (this is the way it was given from Sinai or by ruach hakodesh).

> I would welcome discussing everything in parts (and I've been waiting for someone to want to discuss this format issue for a while!). I am certainly open to accepting an in-field de-facto standard over my concoction, but it should support the features that will be needed.
>
> I would strongly recommend reading some parts of the TEI Guidelines at <http://www.tei-c.org>. A lot of the problems you'll encounter trying to work on book formats are solved there.

I'll look more into it.

> --
> ---
> Efraim Feinstein
> Lead Developer
> Open Siddur Project
> http://opensiddur.net
> http://wiki.jewishliturgy.org

Looking forward to a fruitful discussion.

Ely

Efraim Feinstein

Oct 19, 2012, 12:54:33 PM
to E L, open...@googlegroups.com
Hi,


On 10/19/2012 07:07 AM, E L wrote:

> I see, if there a way to do things like aazinu and shirat hayam?

Yes. For Ha'azinu and Shirat Hayam, each tei:seg (segment) is one divided part. You could give them different type attributes, but... this is a semantic format, not a presentation format. The formatting would generally have to be done in CSS, XSL-FO or whatever presentation tool is being used.
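
As a rough illustration of the point about tei:seg, the following Python sketch parses a tiny made-up fragment and lists its segments with their type attributes. The fragment, the placeholder text, and the type value "stich" are assumptions for the example and are not taken from the attached files; only the TEI namespace URI is standard.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment: one poetic line divided into two segments.
SAMPLE = """
<tei:l xmlns:tei="http://www.tei-c.org/ns/1.0">
  <tei:seg type="stich">first half of the poetic line</tei:seg>
  <tei:seg type="stich">second half of the poetic line</tei:seg>
</tei:l>
"""


def segments(xml_text):
    """Return (type, text) for every tei:seg, in document order."""
    root = ET.fromstring(xml_text.strip())
    return [(seg.get("type"), seg.text) for seg in root.iter(f"{{{TEI_NS}}}seg")]
```

The markup only records that the line has two parts; whether they are shown side by side, in columns, or stacked is left to the stylesheet that consumes the output of something like `segments(SAMPLE)`.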


 
>>> Do you really want to use it? the citation allows you to do whole psukim, but for example
>>> in the gmara and rashi it's really important to know which words they chose to quote.
>>
>> Actually, the references are completely generic. As long as there's a unique id, there's the ability to reference it.
>
> To be able to use something like that in a program in an efficient way one will need to index it or even database it.

The Open Siddur platform is an indexed XML database, but the format is theoretically usable/transformable outside the database.

*Any* cross-referenced format would have to be indexed for efficiency.

> We should consider both the storing/working on the information and how it should be use from within programs.

The format I'm proposing is an archival format. Display/presentation would be in HTML+CSS, ODT, PDF, etc. The advantage of XML as an archival format is that it's relatively easy to transform.

> For example in cell phone it's important that the texts will be shared.

Not sure what you mean by "shared" here. On cell phone apps, a lot of the processing tends to be done server-side. Another possibility is that the cell phone actually stores a spliced output, not the archival format.

>> RelaxNG is a schema language. EPUB is essentially a formatting language (a limited version of HTML for ebooks).
>
> They both support of separation between information and presentation. and can use XML for extensions.
> But I think what is important is to first set the design goals:
>
> - Simple to use (By everyone, to the point where people can write their dvar tora in that format and use references and stuff when showing it to others).

One of the things I learned early on is that you really sacrifice a lot of power if you require people to hand-code XML. The solution: humans use tools, the tools write XML.

> - Can be used in distributed editing projects

TEI-based formats are already used in many distributed editing projects (see Google).

> - Jewish oriented, I know it's weird to set it as a goal, but if you are interested I can elaborate on why it's important.

You probably should rethink this one. I don't think inventing a totally new format for particularism is a good idea. I think it should handle all the use cases.

> - Supports multiple versions of the same book (e.g. you can reference to a version, see differences in an easy way etc.)

The mechanism I propose for choices is using a combination of the tei:choice element and a new j:option element, with external references providing the conditions when that choice should be used.

Referencing an entire "version" is a reference to the top-level XML file that transcludes all the other XML in that version and specifies its conditionals. Referencing a small part is an id or range reference. Remember that, for any book, the versions you have extant are not all versions. A user might develop their own, and you want to maintain referential integrity (so they can use the same comments/annotations) even where totally new combinations of texts are selected.

> - Support references and quotes efficiently

Working in XML, I think the xml:id, XPointer and TEI pointer system is about the best you'll get. (The one thing my XML db is missing is an indexed backreference -- it shouldn't be too difficult to add it in; I'm working on a toy example now. If anyone's good at Java, there's a better way to do it as an eXist extension.)
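
To make the idea of an indexed backreference concrete, here is a small Python sketch, offered only as a toy under stated assumptions: documents are held in memory as ElementTree roots, pointers live in a whitespace-separated `target` attribute, and the function names `index_ids` and `index_backrefs` are invented for the example. It is not how the eXist-based database does it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def index_ids(root):
    """Forward index: xml:id value -> element."""
    return {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}


def index_backrefs(documents):
    """Backward index: pointer string -> list of (filename, xml:id of the referrer).

    `documents` maps a filename to its parsed root element.
    """
    backrefs = defaultdict(list)
    for filename, root in documents.items():
        for el in root.iter():
            for target in el.get("target", "").split():
                backrefs[target].append((filename, el.get(XML_ID)))
    return dict(backrefs)


# Hypothetical sample data: a text with word ids and a note pointing at one word.
TEXT = ET.fromstring('<text><w xml:id="w1">a</w><w xml:id="w2">b</w></text>')
NOTES = ET.fromstring('<notes><note xml:id="n1" target="text.xml#w2">c</note></notes>')
```

With the backward index built once, "which notes point at this word?" becomes a dictionary lookup, e.g. `index_backrefs({"notes.xml": NOTES})["text.xml#w2"]`, instead of a scan over every document.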


> - Supports search and indexing.

That's a database's job, not a format's.

> I also think that keeping what is now should not be a design goal. For example I see no reason to split the gmara
> by pages from 300 years ago when allowing people to reference to them or any other weird decision like that.

I agree completely here. It's also why I think concurrent hierarchies are a necessary feature.

> The bible IMHO should be split by jewish books and inside the book it should be split parshiut ptochut. (this is the way it was given from sinai or by roach hajkodesh).

I don't think this should be part of the spec. Who cares how you split a file? It's arbitrary. If you want a parasha, reference its paragraph!

PS You'll also find that there is still fundamental disagreement about the parasha division (which open, which closed). The non-Jewish chapter division may be arbitrary, but it's standard and well understood. But, as I said, this is not important to a file format.

Attachment: jlptei.doc.html

Dovi Jacobs

Oct 20, 2012, 2:11:01 PM
to Efraim Feinstein, E L, open...@googlegroups.com
Shavua tov, everyone. I just wanted to let you know that I posted an invitation in the "Miznon" (מזנון) of Wikisource (ויקיטקסט) so that people will know this forum exists and perhaps join as well!





E L

Oct 26, 2012, 6:21:57 AM
to Efraim Feinstein, open...@googlegroups.com
Hi,
I read a bit about TEI and the extra markup you suggested.
There are a lot of small things that I think we should add or do differently, but in general
I think it's a very good start. Much better than I was hoping for :)

It is a bit complicated, but besides the basic books the rest should be simpler to handle.
I think we could use one of those metadata wiki projects to integrate it with wikibooks.
I do think that we should also think about ways to use it efficiently, especially on mobile devices
such as tablets and phones. If applications could use the same database of books and the same
library for rendering the information, it could spur a lot more mobile applications.

Actually, it would be very interesting if Moshe from orayta and Uri from tfilun could give us some
insight into how they do it now.

Ely


Efraim Feinstein

Oct 26, 2012, 4:47:21 PM
to E L, open...@googlegroups.com
Hi,

On 10/26/2012 03:21 AM, E L wrote:
> Hi,
> I read a bit about TEI and the extra markup you suggested.
> There are lot of small things that I think we should add or do
> different but in general

Let's discuss specifics!

> It is a bit complicated, but beside the basic books the rest should be
> simpler to handle.
> I think we could use one of those metadata wiki project to integrate
> it with wikibooks.

I'm not sure which project you're talking about? Link?

> I do think that we should also think about ways to efficiently use it,
> especially on mobile devices
> such as tablets and phones. If applications could use the same
> database of books and the same
> library for rendering the information it could spore a lot more mobile
> applications.

The format I'm suggesting is an *archival* format, suitable for a
database. The advantage of using a standardized and well-specified
archival format is that the output format could then take advantage of
device-specific issues/features instead of cramming information about
every device's preferred output into the archive. Separation of concerns
is one of the fundamental principles of well-designed XML.

E L

Oct 27, 2012, 2:14:45 PM
to Efraim Feinstein, open...@googlegroups.com
On Fri, Oct 26, 2012 at 10:47 PM, Efraim Feinstein <efraim.f...@gmail.com> wrote:
> Hi,
>
> On 10/26/2012 03:21 AM, E L wrote:
>> Hi,
>> I read a bit about TEI and the extra markup you suggested.
>> There are lot of small things that I think we should add or do different but in general
>
> Let's discuss specifics!

Sure, but where do you start with such a big format? :)

>> It is a bit complicated, but beside the basic books the rest should be simpler to handle.
>> I think we could use one of those metadata wiki project to integrate it with wikibooks.
>
> I'm not sure which project you're talking about? Link?

I know of Semantic MediaWiki (http://semantic-mediawiki.org/). Dovi said there is a more specific discussion for wikibooks.
Maybe he'll want to tell us more about it?

>> I do think that we should also think about ways to efficiently use it, especially on mobile devices
>> such as tablets and phones. If applications could use the same database of books and the same
>> library for rendering the information it could spore a lot more mobile applications.
>
> The format I'm suggesting is an *archival* format, suitable for a database. The advantage of using a standardized and well-specified archival format is that the output format could then take advantage of device-specific issues/features instead of cramming information about every device's preferred output into the archive. Separation of concerns is one of the fundamental principles of well-designed XML.

I agree. Separation is very important. But since we are a very small community, we need to provide tools
for everything, from viewing to editing and anything else, or no one will ever use that format.
I'm not saying we don't need an archival format, just that we need more than that: we need to solve the
how-to-use-it problem.

I was really hoping to trigger some other people to join the discussion as well :/

> --
> ---
> Efraim Feinstein
> Lead Developer
> Open Siddur Project
> http://opensiddur.net
> http://wiki.jewishliturgy.org

Ely

Efraim Feinstein

Oct 28, 2012, 12:22:50 AM
to E L, open...@googlegroups.com
Hi,


On 10/27/2012 11:14 AM, E L wrote:
> On Fri, Oct 26, 2012 at 10:47 PM, Efraim Feinstein <efraim.f...@gmail.com> wrote:
>> Hi,
>>
>> On 10/26/2012 03:21 AM, E L wrote:
>>> Hi,
>>> I read a bit about TEI and the extra markup you suggested.
>>> There are lot of small things that I think we should add or do different but in general
>>
>> Let's discuss specifics!
>
> Sure, but where do you start with such a big format?:)

Start with the basics... (or start anywhere, it doesn't matter)

> I agree. Separation is very important. But since we are very small community we need to provide tools
> for everything from viewing editing and anything else or no one will ever use that format.

Which is why I think we should start with a more-or-less common archival format. That way, multidirectional conversions would be most feasible (you only have to write one transform to get in and one to get out: display/edit->archive, archive->display/edit).
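
The "one transform in, one transform out" argument can be sketched in Python as a conversion hub. Everything here is a toy assumption: the archive is modeled as a plain list of words, the four formats are trivial string encodings, and `ConversionHub` is a hypothetical name. The point is only the counting: with a shared archive each format needs two transforms, whereas direct conversion between every ordered pair of N formats needs N*(N-1).

```python
from typing import Callable, Dict, List

Archive = List[str]  # toy stand-in for the archival format: a list of words


class ConversionHub:
    """Routes every conversion through the shared archival form."""

    def __init__(self) -> None:
        self._to_archive: Dict[str, Callable[[str], Archive]] = {}
        self._from_archive: Dict[str, Callable[[Archive], str]] = {}

    def register(self, name, to_archive, from_archive) -> None:
        self._to_archive[name] = to_archive
        self._from_archive[name] = from_archive

    def convert(self, source: str, target: str, data: str) -> str:
        # source format -> archive -> target format
        return self._from_archive[target](self._to_archive[source](data))

    def transforms_written(self) -> int:
        return len(self._to_archive) + len(self._from_archive)


def pairwise_transforms(n_formats: int) -> int:
    """Transforms needed if every format converts directly to every other."""
    return n_formats * (n_formats - 1)


hub = ConversionHub()
hub.register("plain", str.split, " ".join)
hub.register("lines", str.splitlines, "\n".join)
hub.register("csv", lambda s: s.split(","), ",".join)
hub.register("tabs", lambda s: s.split("\t"), "\t".join)
```

With these four toy formats the hub holds 8 transforms, while `pairwise_transforms(4)` is 12, and the gap widens as formats are added.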

E L

Nov 2, 2012, 8:01:56 AM
to Efraim Feinstein, open...@googlegroups.com
Almost a week later, and my head hurts from reading the TEI manual ;)
Can we start in a simpler way? Can you show an example of a simple book?
Like, let's say, a chapter from the Rambam or something?

Ely


Efraim Feinstein

Nov 2, 2012, 12:15:37 PM
to E L, open...@googlegroups.com
On 11/02/2012 05:01 AM, E L wrote:
> almost a week after and my head hurts from reading the TEI manual;)
> Can we start in a more simple way? Can you show an example for a
> simple book?
> Like lets say a chapter from the rambam or something?
>

I attached a few chapters of Tanach to a previous email.

The TEI manual is actually one of the better-written specs I've read :-)

E L

Nov 3, 2012, 12:29:02 PM
to Efraim Feinstein, open...@googlegroups.com
On Fri, Nov 2, 2012 at 6:15 PM, Efraim Feinstein <efraim.f...@gmail.com> wrote:
> On 11/02/2012 05:01 AM, E L wrote:
>> almost a week after and my head hurts from reading the TEI manual;)
>> Can we start in a more simple way? Can you show an example for a simple book?
>> Like lets say a chapter from the rambam or something?
>
> I attached a few chapters of Tanach to a previous email.

It had tags on every word. Is that the usual use case of TEI? I assumed that the Tanach is special in some sense.

> The TEI manual is actually one of the better-written specs I've read :-)

Yes, and still very complicated for us mortals :)
Note that no open-content Hebrew project besides opensiddur is using that spec. And since
it's a good spec in general, that means something is scaring people away.
I think that if I go over the spec on the mailing list, no one besides you will bother reading it.
If we want to get more people to support and participate, we need to somehow simplify it,
both by making it simpler for people and by knowing which tools we need.

Ely

Efraim Feinstein

Nov 3, 2012, 10:00:47 PM
to E L, open...@googlegroups.com
Hi,


On 11/03/2012 09:29 AM, E L wrote:
> On Fri, Nov 2, 2012 at 6:15 PM, Efraim Feinstein <efraim.f...@gmail.com> wrote:
>> On 11/02/2012 05:01 AM, E L wrote:
>>> almost a week after and my head hurts from reading the TEI manual;)
>>> Can we start in a more simple way? Can you show an example for a simple book?
>>> Like lets say a chapter from the rambam or something?
>>
>> I attached a few chapters of Tanach to a previous email.
>
> It had tags on every word. Is that the usual use case of tei? I assumed that the tanach is special in some sense.

In TEI, it's the project that decides where to put markup. There's no requirement on what level to mark up. However, word markup means you can reference any word (e.g., for commentary), independent of what else happens (e.g., addition or subtraction of a word).
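
A short Python sketch of why word-level ids keep references stable when a word is added. The word lists, ids, and function names are invented for the example; it only contrasts looking a word up by its id with looking it up by its position.

```python
def resolve_by_id(words, word_id):
    """Find a word by its unique id; unaffected by insertions elsewhere."""
    for wid, text in words:
        if wid == word_id:
            return text
    raise KeyError(word_id)


def resolve_by_position(words, position):
    """Find a word by its zero-based position; shifts when words are added."""
    return words[position][1]


# Hypothetical text and a variant reading that inserts one extra word.
original = [("w1", "alpha"), ("w2", "beta"), ("w3", "gamma")]
variant = [("w1", "alpha"), ("w1a", "inserted"), ("w2", "beta"), ("w3", "gamma")]
```

A comment attached to id `w3` resolves to "gamma" in both lists, while a comment attached to position 2 resolves to "gamma" in `original` but to "beta" in `variant`.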


 
>> The TEI manual is actually one of the better-written specs I've read :-)
>
> Yes, and still very complicated for us mortals :)

Ever tried the CSS spec? :-)

> note that no open content Hebrew content beside opensiddur is using that spec.
> And since
> It's a good spec in general it means that something is scaring people away.

The WLC uses TEI internally.

TEI was developed by humanities scholars, and (1) the sample space is pretty small and (2) most open content Hebrew sources never thought seriously about many of the issues involved. In addition, programmers tend to run first to an RDBMS. That's fine for internal stuff, but it doesn't work for interchange.

> I think that if I go over the spec on the mailing list no one beside you will bother reading it.
> If we want to get more people to support and participate we to somehow simplify it.

I'm all for that! In fact, TEI is *made* for simplification. Nobody ever expects to use the TEI_all schema! I did some simplification for Open Siddur, and I think I sent you the schema docs over the list. (It's not fully documented specifically for us, because that's the hard part.) Where TEI has hundreds of tags, I think I'm down to ~100, and most are for various header-related things.