Fwd: [Sefaria Project] Should Sefaria texts be corrected or verbatim


Aharon Varady

Nov 28, 2012, 10:18:30 AM
to Open Siddur Technical Discussion List, open...@googlegroups.com
Below is a great example of an issue where our transcription policy can provide some direction, and where the goals of other projects must be taken into consideration when considering what "authenticity" means for a digital edition of a work in the Public Domain. It also highlights some of the issues of data exchange that some of us considered earlier. Before I reply, does anyone have thoughts to contribute?

Aharon




---------- Forwarded message ----------
From: Yehoshua Kahan <ravye...@gmail.com>
Date: Wed, Nov 28, 2012 at 9:47 AM
Subject: [Sefaria Project] Should Sefaria texts be corrected or verbatim
To: sef...@googlegroups.com


Shalom to everyone. I hope you've seen Massechet Berakhot in Sefaria gradually taking shape with all the Rashis and soon, IYH, Tosafot for each page. I'd like input from the Sefaria community on an issue I've encountered while reviewing Rashi comments on the Gemara. An example will illustrate the dilemma: on Berakhot 22a, the first three Rashis (which currently appear on Sefaria as Rashis 1, 2 and 4) actually belong to the end of the previous page, 21b. This is made explicit in an anonymous marginal note in the printed Vilna text adjoining the misplaced printed texts. Apparently, this fact escaped the Wikisource people, because the texts there (which serve as our source for Sefaria) reflect the printed Vilna layout.

So: what should our policy be - to reproduce faithfully the Vilna text, with its errors, or to correct those errors when we notice them? Here, the correction is easy - we'll just add those Rashis as sources to the correct lines on the previous page, and delete them from 22a. But in general, once we wade into textual emendation (that's effectively what this is), where do we stop? Or maybe here it's not a real question - it was just a printing error, and if anyone had anted up for a re-typesetting of the entire text, the widow and brothers Romm of Vilna would have been only too happy to oblige?

Thoughts?

Rav Berachot,

Yehoshua Kahan

--
You received this message because you are subscribed to the Google Groups "Sefaria Project" group.
To post to this group, send email to sef...@googlegroups.com.
To unsubscribe from this group, send email to sefaria+u...@googlegroups.com.
Visit this group at http://groups.google.com/group/sefaria?hl=en.
 
 



--
Aharon Varady
Founding Director, Hierophant
the Open Siddur Project
http://opensiddur.org

Efraim Feinstein

Nov 28, 2012, 11:51:04 AM
to opensid...@googlegroups.com, open...@googlegroups.com
Hi,

This is a well-known issue in transcription :-)

I don't think anything escaped the Wikisource people: I think they have transcription guidelines which involve literal transcription with no corrections (perhaps beyond obvious typos?).

The issue described below is a bit different, though, because only a layout feature of the text is in question: which page the text appeared on. Pagination is not one of the primary hierarchies of most texts, and, despite what many people believe, the Vilna edition pagination is not from Sinai. :-)

For the purposes of Open Siddur, I think a situation like this would be handled like this:
- Because the Vilna edition has a canonical pagination, it might be worth placing an overlapping "page" hierarchy. It is not one of the primary textual hierarchies, though.
- The page hierarchy would reflect the pagination in the original text, as would the links to the source transcriptions. This might mean using continuation markers (@next/@prev attributes) if there's a discontinuity.
- Internal hyperlinks would be correct. Pagination/source is irrelevant to them.
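
The three points above can be sketched roughly as follows. This is only an illustration in Python, not Open Siddur's actual encoding: the segment ids, the `PageFragment` record, and the `prev_id`/`next_id` fields (standing in for the @prev/@next attributes) are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Primary hierarchy: segments in logical (content) order.
# All ids here are hypothetical.
CONTENT = [
    "gemara-21b-line-x",
    "rashi-on-x",          # printed on 22a, but comments on a 21b line
    "gemara-21b-line-y",
    "gemara-22a-line-1",
]


@dataclass
class PageFragment:
    """One contiguous run of segments that sits on a single printed page."""
    frag_id: str
    page: str
    segments: List[str]
    prev_id: Optional[str] = None  # plays the role of @prev
    next_id: Optional[str] = None  # plays the role of @next


# Overlapping page hierarchy: it records where the printed source put things.
FRAGMENTS = {
    f.frag_id: f
    for f in [
        PageFragment("21b-1", "21b", ["gemara-21b-line-x"], next_id="21b-2"),
        PageFragment("22a-1", "22a", ["rashi-on-x"], next_id="22a-2"),
        PageFragment("21b-2", "21b", ["gemara-21b-line-y"], prev_id="21b-1"),
        PageFragment("22a-2", "22a", ["gemara-22a-line-1"], prev_id="22a-1"),
    ]
}

# Internal links point at segment ids, so pagination never enters into them.
LINKS = {"rashi-on-x": "gemara-21b-line-x"}


def segments_on_page(fragments, page):
    """Rebuild a printed page by following next_id from its first fragment."""
    current = next(
        (f for f in fragments.values() if f.page == page and f.prev_id is None),
        None,
    )
    found = []
    while current is not None:
        found.extend(current.segments)
        current = fragments.get(current.next_id) if current.next_id else None
    return found


def page_of(fragments, segment_id):
    """Return the printed page that holds a segment, or None if unknown."""
    for fragment in fragments.values():
        if segment_id in fragment.segments:
            return fragment.page
    return None


print(segments_on_page(FRAGMENTS, "22a"))
print(page_of(FRAGMENTS, "rashi-on-x"), page_of(FRAGMENTS, LINKS["rashi-on-x"]))
```

In this sketch the link from the comment to its Gemara line stays valid even though the two sit on different printed pages, because the link never mentions a page at all.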
 
Currently, Open Siddur's design is not intended to reproduce an existing printed layout, so I don't think it's a big issue for us.

For interchange, it may be an issue, depending on how the exporters/importers handle linkages. If the linkages are correct anyway, it should work. If the transcribers/encoders on the other side decided that the Vilna pagination is so canonical that it breaks internal linkages, there's no way to fix it other than manual correction.
-- 
---
Efraim Feinstein
Lead Developer
Open Siddur Project
http://opensiddur.net
http://wiki.jewishliturgy.org

E L

Nov 28, 2012, 12:03:26 PM
to Efraim Feinstein, opensid...@googlegroups.com, open...@googlegroups.com
That sounds like a good solution for opensiddur.
But somehow it doesn't help other projects :-)
It does show the importance of finally having a place with at least the basic texts in a normal format.
When are we going to start working on it? It will benefit all projects..

Ely

--
 
 

Efraim Feinstein

Nov 28, 2012, 12:15:48 PM
to E L, opensid...@googlegroups.com, open...@googlegroups.com
Hi,

On 11/28/2012 09:03 AM, E L wrote:
> That sounds like a good solution for opensiddur.
> But somehow it doesn't help other projects :-)

I think it could.

> It does show the importance of finally having a place with at least
> the basic texts in a normal format.

Note that neither of the ways of doing this (original source is king or
content is king) are "wrong". A "normal" format could canonize either or
both.

> When are we going to start working on it? It will benefit all projects..

I'm waiting for specific questions. :-)

E L

Nov 28, 2012, 12:22:07 PM
to Efraim Feinstein, opensid...@googlegroups.com, open...@googlegroups.com
On Wed, Nov 28, 2012 at 7:15 PM, Efraim Feinstein <efraim.f...@gmail.com> wrote:
> Hi,
>
> On 11/28/2012 09:03 AM, E L wrote:
>> That sounds like a good solution for opensiddur.
>> But somehow it doesn't help other projects :-)
>
> I think it could.
>
>> It does show the importance of finally having a place with at least the basic texts in a normal format.
>
> Note that neither of the ways of doing this (original source is king or content is king) are "wrong". A "normal" format could canonize either or both.
>
>> When are we going to start working on it? It will benefit all projects..
>
> I'm waiting for specific questions. :-)



What steps are needed to get the following into a good format, with tools that let other projects use it:
- Bible (versions + mefarshim, at least Rashi and Onkelos)
- Mishna (versions with Bartenura/Rambam)
- Gemara (versions + Rashi + Tosafot + Rosh)
- Mechiltas/Midrash Rabbah etc.

Ely

Efraim Feinstein

Nov 28, 2012, 1:25:26 PM
to open...@googlegroups.com
For *any* project, you need a good transcription, a defined archival
format, and some tools.

The most important inter-project thing is the defined archival format --
once you have that, you can build tools. Transcription can occur (and
already is occurring) in parallel.

E L

Nov 28, 2012, 4:01:40 PM
to Efraim Feinstein, open...@googlegroups.com
Hi,
I think it's a good idea to start with some books and add things to them.
This way we can see what is missing. Then develop a new format and continue
the iteration. Agile/Extreme is much better than waterfall in this case;)

Ely

--



Efraim Feinstein

Nov 28, 2012, 4:08:05 PM
to E L, open...@googlegroups.com
On 11/28/2012 01:01 PM, E L wrote:
> Hi,
> I think it's a good idea to start with some books and add things to them.
> This way we can see what is missing. Then develop a new format and
> continue
> the iteration. Agile/Extreme is much better than waterfall in this case;)
>

Formats are one of the few cases where you can save a lot of time by
getting things right (or close to right) at the start. There's a lot of
toolchain that goes into building useful, extensible (and not ad-hoc)
formats.

Since the whole Biblical text is already transcribed very well, it's a
good test case (and it's the one I've used). It's pretty regular, so it
will hit the features used most often, it has a good number of edgy
things in it, but it doesn't hit everything.

Brett Lockspeiser

Nov 28, 2012, 10:14:38 PM
to E L, opensid...@googlegroups.com, open...@googlegroups.com


On Wed, Nov 28, 2012 at 9:22 AM, E L <nak...@gmail.com> wrote:
> What steps are needed to get the following into a good format, with tools that let other projects use it:
> - Bible (versions + mefarshim, at least Rashi and Onkelos)
> - Mishna (versions with Bartenura/Rambam)
> - Gemara (versions + Rashi + Tosafot + Rosh)
> - Mechiltas/Midrash Rabbah etc.


This is very much what we're working on at www.sefaria.org. I've been following some of the threads on the opentora list, but haven't yet known how to jump in, partly because the formats we are working with are _so_ basic compared to the level of conversation. But it does sound like that's your preference as a way to start (and mine too). My goal was to have formats that make it very simple for a developer to build something quickly, at the expense of being perfect.

In short, our data model considers all texts to be nested arrays of strings. Each text can have its own depth of nesting and its own set of names for what each level in the array is called. So for Bible we have "Chapter">"Verse", for Mishna "Chapter">"Mishna", for Gemara "Daf">"Line" and for Midrash "Chapter">"Paragraph". Commentaries are texts in their own right that follow the structure of the text they comment on, plus an additional level of depth called "Comment". So the text "Rashi on Bereishit" has the structure "Chapter">"Verse">"Comment", which allows, e.g., three distinct Rashis on Bereishit 1:1.
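
A minimal sketch of the nested-array model just described, written in Python. The dictionary keys, the placeholder strings, the `get_segment` helper, and the 1-based addressing are assumptions for illustration only; they are not taken from Sefaria's code or data.

```python
# Hypothetical sample data: a text is a nested list of strings plus the
# names of its levels. Placeholder strings stand in for real text.
bereishit = {
    "title": "Bereishit",
    "section_names": ["Chapter", "Verse"],
    "text": [
        ["In the beginning...", "And the earth..."],  # chapter 1
        ["(text of 2:1)"],                            # chapter 2
    ],
}

# A commentary follows the structure of its base text and adds one level,
# so each verse position holds a list of comments (possibly empty).
rashi_on_bereishit = {
    "title": "Rashi on Bereishit",
    "section_names": ["Chapter", "Verse", "Comment"],
    "text": [
        [["(first comment on 1:1)", "(second comment on 1:1)",
          "(third comment on 1:1)"], []],
        [[]],
    ],
}


def get_segment(work, *address):
    """Walk the nested lists using 1-based numbers, one per named level."""
    if len(address) > len(work["section_names"]):
        raise ValueError("address is deeper than the text's structure")
    node = work["text"]
    for number in address:
        node = node[number - 1]
    return node


print(get_segment(bereishit, 1, 2))                # one verse
print(len(get_segment(rashi_on_bereishit, 1, 1)))  # number of comments on 1:1
```

A shorter address returns a whole sub-array, so `get_segment(bereishit, 1)` would give every verse of chapter 1 under the same assumptions.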

I'd like to output all the data that we have in simple JSON / XML formats here: https://github.com/blockspeiser/Sefaria-Data . I had been following this list to see if it gave any insight into the details of what this format should be, but maybe instead, like you're proposing, I should just come up with something simple first and then send it your way for any suggestions. There's only so much we can do anyway given the data we have.

Thanks,
Brett



Dovi Jacobs

Nov 29, 2012, 12:35:44 AM
to Aharon Varady, Open Siddur Technical Discussion List, open...@googlegroups.com, sef...@googlegroups.com
Hi, in cases like the misplaced Rashi, or for any text that goes beyond objective transcription and requires human common sense or serious thought, I highly suggest that any collaborative project like Sefaria do what we have been doing at Hebrew Wikisource, namely: Dedicate a technical page *about* the edition that describes a collaborative community decision for how the text should best be edited.

In this particular case, that would mean a page *about* the Talmud edition, which might even begin with a simple note saying that in cases of misplaced Rashis like this one, our edition places them on the proper amud. Any further decisions about how to edit the Talmud edition would also be added to that page, which can be developed over the long run just like the text itself.

It might also be wise to find a way to document the moving of the misplaced Rashis in the text itself, perhaps in the code underlying the text, in a way that does not distract the reader. In my opinion an open footnote would be too distracting. But that is ultimately up to the contributors themselves.
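
One possible way to keep such a record in the underlying data without showing it to the reader, sketched in Python. The `Comment` record and its field names are hypothetical and only illustrate the idea; neither Sefaria nor Wikisource is known to store relocations this way.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comment:
    """A commentary segment plus optional, reader-invisible provenance."""
    text: str
    placed_on: str                        # page where this edition puts it
    printed_on: Optional[str] = None      # page in the printed source, if different
    editorial_note: Optional[str] = None  # why the editors moved it


def render_for_reader(comment):
    """The reading view shows only the text, with no footnote."""
    return comment.text


def provenance(comment):
    """Describe a relocation for editors, or return None if nothing moved."""
    if comment.printed_on is None or comment.printed_on == comment.placed_on:
        return None
    note = comment.editorial_note or "no note recorded"
    return f"moved from {comment.printed_on} to {comment.placed_on}: {note}"


moved = Comment(
    text="(text of a relocated Rashi)",
    placed_on="21b",
    printed_on="22a",
    editorial_note="marginal note in the printed text assigns it to 21b",
)

print(render_for_reader(moved))
print(provenance(moved))
```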


From: Aharon Varady <aha...@opensiddur.org>
To: Open Siddur Technical Discussion List <opensid...@googlegroups.com>; open...@googlegroups.com
Sent: Wednesday, November 28, 2012 5:18 PM
Subject: Fwd: [Sefaria Project] Should Sefaria texts be corrected or verbatim

--
 
 


Dovi Jacobs

Nov 29, 2012, 12:49:40 AM
to Efraim Feinstein, opensid...@googlegroups.com, open...@googlegroups.com
>>I don't think anything escaped the Wikisource people: I think they have transcription guidelines which involve literal transcription with no corrections (perhaps beyond obvious typos?).

This is not correct. First of all, that Talmud example certainly DID escape us: At the time, the Talmud text at Wikisource was automatically copied from another source (as public domain material). Because it has mostly not enjoyed the privilege of human editing, editing guidelines have unfortunately never been formulated for it!

Furthermore, for books at Wikisource that do involve human editing (such as formatted editions, corrected editions, and editions with collaborative commentaries), we go far beyond correcting obvious typos. Intensive human labor that involves informed judgments about a text must also be made transparent. Therefore, such books often have technical wiki-pages dedicated to them which describe the editorial policies for the Wikisource edition. These technical pages are also developed collaboratively and accepted through community consensus. They can be quite rich, and sometimes may even involve a great deal of documentation of objective sources (just like good Wikipedia articles).

E L

Nov 30, 2012, 8:13:16 AM
to Brett Lockspeiser, opensid...@googlegroups.com, open...@googlegroups.com
Hi,

I think it's very simplified. IMHO we should do it once, in a good format that keeps all the information one
might need. Then different projects can build on top of it.
The Bible, for example, is very specific, and I don't think there is even a point in building a GUI for it.
On the other hand, the Mishna and Gemara might serve as a use case for how references should work.

After that each project can build libraries on top of it.

Ely 




--
 
 

Brett Lockspeiser

Dec 6, 2012, 12:52:23 PM
to Marc Stober, opensid...@googlegroups.com, open...@googlegroups.com
This is helpful to consider, and brings me to one specific question right off the bat: 

This format wraps chapter/verse markers, but doesn't wrap the text itself. I had assumed that the Sefaria simple XML output would look something like 

<book>
    <book-title>Genesis</book-title>
    <chapter num="1">
         <verse num="1">In the beginning...</verse>
         <verse num="2">And the earth...</verse>
         ....
    </chapter>
</book>

It seems to me (though I really am an XML noob, so I suspect you may be about to teach me a lesson) that this format makes it much easier to query for sections of a text, so that if you want to select a particular verse you just query for e.g., the 4th <verse> in the 3rd <chapter> and then you have the complete node you want to work with. Still possible with your format, but doesn't it require looking for the appropriate marker and then walking through to collect elements until you hit a termination, which could be either another verse marker or a chapter marker or an end of book?
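
The difference between the two lookups can be sketched in Python with the standard-library `xml.etree.ElementTree`. The nested sample follows the markup shown above; the marker-style sample (empty `<c>` and `<v>` elements with unwrapped text) is a made-up stand-in and may not match the actual 1917JPS.xml format.

```python
import xml.etree.ElementTree as ET

# Nested style: each verse is a complete node.
NESTED = """
<book>
    <book-title>Genesis</book-title>
    <chapter num="1">
        <verse num="1">In the beginning...</verse>
        <verse num="2">And the earth...</verse>
    </chapter>
</book>
"""

# Hypothetical marker style: empty markers, verse text left unwrapped.
MARKERS = """
<book>
    <book-title>Genesis</book-title>
    <c num="1"/><v num="1"/>In the beginning...
    <v num="2"/>And the earth...
    <c num="2"/><v num="1"/>(text of 2:1)
</book>
"""


def verse_nested(xml_text, chapter, verse):
    """One path query returns the whole verse node."""
    root = ET.fromstring(xml_text)
    node = root.find(f"./chapter[@num='{chapter}']/verse[@num='{verse}']")
    return None if node is None else node.text


def verse_markers(xml_text, chapter, verse):
    """Find the marker, then collect text until the next marker or the end."""
    root = ET.fromstring(xml_text)
    current_chapter = None
    collecting = False
    pieces = []
    for child in root:
        if child.tag in ("c", "v") and collecting:
            break  # the next marker terminates the verse
        if child.tag == "c":
            current_chapter = child.get("num")
        elif child.tag == "v":
            collecting = (current_chapter == str(chapter)
                          and child.get("num") == str(verse))
            if collecting:
                pieces.append(child.tail or "")
        elif collecting:
            # Any inline element inside the verse: keep its text and its tail.
            pieces.append("".join(child.itertext()))
            pieces.append(child.tail or "")
    text = "".join(pieces).strip()
    return text or None


print(verse_nested(NESTED, 1, 2))
print(verse_markers(MARKERS, 1, 2))
```

Both functions return the same verse here, but the marker version has to track the current chapter and watch for three possible terminators (a verse marker, a chapter marker, or the end of the book).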

Thanks,
Brett


On Thu, Dec 6, 2012 at 8:15 AM, Marc Stober <marcs...@gmail.com> wrote:
It's incomplete but I have a simple XML Bible format here, intended to be a format that other projects can easily work with:

http://marcstober.com/1917JPS-preview/1917JPS.xml

It doesn't attempt to be a format for all text, just for the Bible (and for a specific translation at that). Although the format in sefaria-data, while not using XML, is logically pretty similar, so it might be worth standardizing on that (which is a format already used by some Christian bible software if I'm not mistaken).


--
You received this message because you are subscribed to the Google Groups "opensiddur-tech" group.
To post to this group, send email to opensid...@googlegroups.com.
To unsubscribe from this group, send email to opensiddur-te...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/opensiddur-tech?hl=en.



--
marcs...@gmail.com ~ www.marcstober.com ~ twitter: marcstober
