Hi, and thanks for posting!
(and Hi, Sefaria-dev, I'm Efraim, lead developer over at Open
Siddur)
I have a half-written email in my drafts folder about this that I
never got around to completing.
On 06/14/2012 01:41 PM, Brett Lockspeiser wrote:
Hi all,
Here are some more concrete thoughts on steps towards using
Sefaria's source sheet builder to demo the experience of custom
Siddur building using OpenSiddur's texts.
I've put up some info about the Sefaria API at
www.sefaria.org/developers. The GET section is only tangentially
relevant, but can give
some picture of how Sefaria works under the hood. The POST
section describes what needs to happen to get new texts into
Sefaria.
Ideally, I'd like texts to be able to be transferred
bidirectionally.
What Sefaria offers *right now* is a great UI (and a great test-bed
UI, even if it's not feature-complete from the Open Siddur perspective).
What Open Siddur offers is a server system intended to handle very
complex texts. Open Siddur internally stores in XML, which means
that our documents do not necessarily have uniform structure.
Fortunately, in the first pass, you can almost *map* an XML document
onto a uniform structure. Some of the issues are below.
I'm attaching a current schema documentation snapshot, which should
at least give you an idea of what elements exist and what they can
contain. Unfortunately, the documentation is not completely filled
in (particularly the human-readable part). If anyone intends to work
on this, I can make some first-pass documentation a priority. The
docs up at wiki.jewishliturgy.org are out of date since I just
reviewed and finalized the schema.
I'm also attaching 5 other files:
- an annotation document (this is textual annotation, there's also
conceptual annotation, but I don't have any ready examples)
- a bibliographic record
- Psalms 1 (which demonstrates both a stream of text and multiple
hierarchies)
- the entire book of Psalms (which demonstrates a resource that is
just combined resources)
- a contributor record
Apologies for the large attachments.
I know you already have a Tanach, but it's our demo text too. :-)
To get a working demo we need:
- A named list of Siddur components to work with (this list
can start small for a minimal initial prototype)
I *think* that what you call a "component", I've been calling a
"resource" using XML database terminology. Everyone else calls it a
"file". The key features are demonstrated in the 2 Psalms examples:
- the header has enough information to figure out the source and
who's responsible for each activity, e.g., transcription. There's also a
revision history, but that doesn't really exist until the documents
get in the db.
- a stream of text (conveniently called a streamText) made up of
small segments which should be "minimal units of meaning" (say, 1-5
words). This can be mapped onto a Sefaria structure with a few
caveats, like:
-- you can't preserve word identity unless the text is canonized
-- kri/ktiv (which is internal in the segment), spelling
regularization (the word "Yerushalayim" comes to mind, though, I
don't think I marked it up in my Tanach conversion), corrections of
a transcription (not relevant to Sefaria? Should it be?), divine
name markup (useful, for example, if you want to create a document
that is not sheimot), incidental transliteration (useful if you want
to regularize transliteration across an entire published siddur).
- the concurrent hierarchic layers section: Psalms is actually quite
regular, so its concurrencies (paragraphs--which are not marked up
here!, verses, and line groups/lines) don't do funky things like
cross boundaries, but be aware that it is entirely possible.
There are also specialized resources like annotation documents,
conditional documents (describe things like "times when this prayer
is said" including inline documentation), bibliographic documents
(example attached), contributor documents. Another relevant document
(which I don't have a ready example of) is a translation document,
which, instead of having a streamText, has a parallelText element
that looks like:
<parallelText>
<parallelGrp>
<ptr n="original" xml:lang="he"
target="/data/original/My_Original_Text#range(se_5,se_7)"/>
<ptr n="parallel" xml:lang="en"
target="/data/translation/en/My_Translation/My_Translated_Text#range(se_9,se_15)"/>
</parallelGrp>
<!-- more parallelGrps here -->
</parallelText>
The complication here with respect to Sefaria's data model is that
translation can align at any level down to the segment.
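
As a rough illustration of what a consumer would have to do with this, here is a minimal Python sketch that reads the pointers out of a parallelText like the one above. It assumes the elements are un-namespaced exactly as written in this email (real documents probably carry a namespace), and the function names parse_pointer and read_alignments are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# The xml: prefix is predefined, so ElementTree expands xml:lang to this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Copied from the example above; no namespace is assumed.
SAMPLE = """<parallelText>
  <parallelGrp>
    <ptr n="original" xml:lang="he"
         target="/data/original/My_Original_Text#range(se_5,se_7)"/>
    <ptr n="parallel" xml:lang="en"
         target="/data/translation/en/My_Translation/My_Translated_Text#range(se_9,se_15)"/>
  </parallelGrp>
</parallelText>"""

# Splits "path#range(start,end)" into its three parts.
RANGE_RE = re.compile(r"^(?P<path>[^#]+)#range\((?P<start>[^,]+),(?P<end>[^)]+)\)$")


def parse_pointer(ptr):
    """Turn one <ptr> element into a plain dict (hypothetical helper)."""
    target = ptr.get("target", "")
    match = RANGE_RE.match(target)
    if match is None:
        raise ValueError("unsupported target: " + target)
    return {
        "role": ptr.get("n"),
        "lang": ptr.get(XML_LANG),
        "path": match["path"],
        "start": match["start"],
        "end": match["end"],
    }


def read_alignments(xml_text):
    """Return one list of pointer dicts per <parallelGrp>."""
    root = ET.fromstring(xml_text)
    return [
        [parse_pointer(ptr) for ptr in group.findall("ptr")]
        for group in root.iter("parallelGrp")
    ]


if __name__ == "__main__":
    for group in read_alignments(SAMPLE):
        for pointer in group:
            print(pointer)
```

Each group pairs a range of original segments with a range of translated segments, and the two ranges need not be the same length. That is the part that does not fit a model where the translation has to line up with a fixed unit.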
- For each component, if it does not already exist in
Sefaria (i.e., not Tanach and Mishna), a description of its
structure. Components will need to be small enough that they
have a uniform structure. I imagine most components at this
size will only have a single level of structure, so this
description may just end up looking like ["Line"] (if a
component is composed of lines) or ["Blessing"] (if it's
composed of blessings) or whatever the appropriate name is.
These descriptions will need to be posted to /api/index/ or
entered manually in the site.
We don't differentiate document types like "blessing." Text is text.
It may be *annotated* as a blessing, if, for example, we wanted to
allow searches over all blessings.
- Each text will then need to be transformed into an array
of strings and posted to /api/texts/
I don't think this is currently possible without some loss of
information. However, that may not be such a bad thing. Some kind of
structured encoding is probably better than none. Many of the actual
siddurim we have were contributed in the form of MS Word documents,
and would require an outlay of effort to make them workable in
anyone's database.
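
To make the loss concrete, here is a minimal Python sketch of the flattening step. It is only a sketch under assumptions: the segment element name "seg", the sample markup, and the function name flatten_stream are hypothetical and not taken from the schema. It drops all inline markup and keeps the segment ids in a parallel list so the strings could be traced back later.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined, so ElementTree expands xml:id to this key.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Placeholder content; "seg" and "choice" are assumed element names.
SAMPLE = """<streamText>
  <seg xml:id="se_1">first segment</seg>
  <seg xml:id="se_2">second <choice>inline</choice> segment</seg>
</streamText>"""


def flatten_stream(xml_text, seg_tag="seg"):
    """Return (ids, texts): segment ids and their plain-text content.

    Inline markup inside a segment is discarded, which is exactly the
    information loss discussed above.
    """
    root = ET.fromstring(xml_text)
    ids, texts = [], []
    for seg in root.iter(seg_tag):
        ids.append(seg.get(XML_ID))
        # Join all nested text, then collapse runs of whitespace.
        texts.append(" ".join("".join(seg.itertext()).split()))
    return ids, texts


if __name__ == "__main__":
    ids, texts = flatten_stream(SAMPLE)
    print(ids)
    print(texts)
```

The texts list is the kind of array of strings that could be posted. The ids list is a side channel that the Sefaria side would not see unless it had somewhere to store it.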
- Once this is done, the Source Sheet builder can be used as
normal to add, order, title and comment on all these
components, making it possible to build and save a complete
Siddur. As discussed, I would not recommend trying to store
a whole Siddur in one sheet (which is loaded as one page),
but rather break it up into sections.
Can a source sheet include other source sheets? That's basically
Open Siddur's global data model (where s/source sheet/resource/g).
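
A minimal Python sketch of what "resources that include other resources" means in practice. The dictionary layout, the ("text", ...) and ("include", ...) tuples, and the resource names are all invented for illustration and are not either project's format.

```python
def expand(name, resources, _stack=()):
    """Recursively flatten a resource into its list of text items.

    resources maps a resource name to a list of items, where each item is
    ("text", string) or ("include", other_resource_name).
    """
    if name in _stack:
        chain = " -> ".join(_stack + (name,))
        raise ValueError("circular inclusion: " + chain)
    out = []
    for kind, value in resources[name]:
        if kind == "include":
            out.extend(expand(value, resources, _stack + (name,)))
        elif kind == "text":
            out.append(value)
        else:
            raise ValueError("unknown item kind: " + kind)
    return out


# Hypothetical resources: a siddur built from a service, which includes Kaddish.
RESOURCES = {
    "siddur": [("include", "shacharit")],
    "shacharit": [("text", "a"), ("include", "kaddish"), ("text", "b")],
    "kaddish": [("text", "k")],
}

if __name__ == "__main__":
    print(expand("siddur", RESOURCES))
```

If sheets could include sheets, a whole siddur would just be a top-level sheet of inclusions, and something like Kaddish would be stored once and reused.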
- A static table of contents can be made linking to each
individual sheet. That way also a list of Siddurim and a
nicely formatted table of contents for each Siddur could be
hosted on opensiddur.org and only link out to sefaria.org for the
contents.
As a first database-able pass, I think this would be great, as it
would resolve one major problem we have that I mentioned above:
getting data database-ready in any form. My only worry is too much
loss of information, requiring a second encoding pass. If we can
resolve that on Sefaria's end (or come up with a hack), that would
be even better.
One issue we have with our data as it stands now is that much of it
is not proofread. One goal I've had for a long time is to have a
Wikisource-style transcription editor where documents could be
proofread (and their public domain nature could be proved) against
page images.
- I think the hard parts here are steps 1 and 2 -- figuring out how
to fit the complex, annotated structures that you have into
Sefaria's simplistic data model. Details will be lost. For
example, "Mourner's Kaddish" is an appropriate component. Is
it clear in general how it should be segmented?
It depends on the context. The title, "Mourner's Kaddish" is a
heading (I think you have those). "Mourner's Kaddish" itself is a
resource. In the *best* case of the XML data model, "Kaddish" is a
resource and the variant types of Kaddish (and the nusach-based
textual variants) just have different associated conditionals.
- Handling instructional texts like "recited by the
congregation" is tricky as well, as I don't think it ought
to be included in the text of the Kaddish itself given
there's no way to differentiate it.
Instructions (like "On Shabbat, say" or "Recited by the
congregation") are treated both as annotations (human-readable) and
as conditionals (so a computer can work out not to include the text,
say, when it is not Shabbat).
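
A minimal Python sketch of that dual role, assuming a made-up Instruction record and a set of occasion names. None of this is Open Siddur's actual conditional format; it only shows one note serving both as text for the reader and as a test for the machine.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Instruction:
    note: str                            # human-readable annotation ("" for none)
    applies: Callable[[Set[str]], bool]  # machine-readable conditional
    segments: List[str]                  # the text the instruction governs


def render(instructions, occasions, show_notes=True):
    """Return the lines to print for the given set of occasions."""
    lines = []
    for ins in instructions:
        if not ins.applies(occasions):
            continue  # the conditional excludes this text entirely
        if show_notes and ins.note:
            lines.append("[" + ins.note + "]")
        lines.extend(ins.segments)
    return lines


# Placeholder content; the occasion name "shabbat" is an assumption.
SERVICE = [
    Instruction("", lambda occ: True, ["text said every day"]),
    Instruction("On Shabbat, say", lambda occ: "shabbat" in occ,
                ["Shabbat-only text"]),
]

if __name__ == "__main__":
    print(render(SERVICE, set()))
    print(render(SERVICE, {"shabbat"}))
```

A printed siddur for all occasions would show every note and every text. A generated Shabbat-only booklet could evaluate the conditions and drop the notes.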
- A current workaround is to build such instructions into
the Siddur as 'comments'. e.g., include Kaddish line 1-3,
then comment "Congregation reads and Mourner responds", then
include Kaddish line 4-8.
How does one tell what a comment applies to? What if there is
overlap between regions with instructions?
Example: Ya'aleh v'Yavo:
"On Rosh Chodesh, Yom Tov and Chol Hamoed, add:" (applies to the
whole thing)
"On Rosh Chodesh -- " (applies to one line...)
In Open Siddur, an annotation can link anything with an xml:id (down
to a word), though, by convention, I'd rather not link further down
than a segment if I can help it.
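
A minimal Python sketch of the overlap case, assuming annotations are stored as (first segment id, last segment id, note) over an ordered list of segment ids. The ids and the function name notes_per_segment are invented for the example.

```python
def notes_per_segment(segment_ids, annotations):
    """Map each segment id to the notes whose range covers it.

    annotations is a list of (start_id, end_id, note) tuples, inclusive.
    """
    position = {sid: i for i, sid in enumerate(segment_ids)}
    result = {sid: [] for sid in segment_ids}
    for start, end, note in annotations:
        for sid in segment_ids[position[start]:position[end] + 1]:
            result[sid].append(note)
    return result


# Hypothetical segment ids standing in for Ya'aleh v'Yavo.
SEGMENTS = ["se_1", "se_2", "se_3", "se_4"]
ANNOTATIONS = [
    ("se_1", "se_4", "On Rosh Chodesh, Yom Tov and Chol Hamoed, add:"),
    ("se_2", "se_2", "On Rosh Chodesh --"),
]

if __name__ == "__main__":
    for sid, notes in notes_per_segment(SEGMENTS, ANNOTATIONS).items():
        print(sid, notes)
```

Here se_2 carries two instructions at once. A comment placed between included lines has no explicit end point, so it cannot say that the first instruction is still in force while the second one covers a single line.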
In any case, you guys are the Siddur experts, so I'd love to hear
what you think. I believe this proposal is doable, but I think
it will take some real work to coordinate and I am sure we will
run into difficult cases with the text and then missing features
/ bugs in the source sheet builder. If we want to move forward I
suggest we move right into steps (1) & (2) maybe in a Google
Doc / Spreadsheet to start looking at specifics.
In the end, I would also like to have a UI for Open Siddur; I see
Sefaria as a good start there too. My ideal would be to be able to
use Open Siddur API calls on Sefaria's UI (understanding the caveat
about the different data models). I don't yet know enough to
determine whether that's possible, or whether it would be just as
much work fitting square pegs into round holes as writing a UI from
scratch. Thoughts on that?
Thanks again!
--
Efraim Feinstein
Lead Developer
Open Siddur Project
http://opensiddur.net
http://wiki.jewishliturgy.org