Raw original language XML representation

Robert Hunt

Dec 18, 2009, 11:00:45 PM

to open-sc...@googlegroups.com

Hi everyone,

I only discovered this group the other week, and so I'm pretty new
to most of your work and ideas. To introduce myself, I'm a New Zealander
who has been doing Bible translation for a small language group in Asia,
a Linux user, a former programmer, and someone who is frustrated by
copyrighting of Christian materials (ranging from Scriptures to songs to
Bible study materials) which makes it hard (and sometimes impossible)
for Christians and other interested people in developing nations to
access these vital resources. It's a long story (which I hope to
eventually write up more fully at http://openscripture.blogspot.com/),
but the long-and-short of it is that I am currently looking for Hebrew
and Greek resources which can be made freely available (in a parsed
interlinear form) to Bible translators around the world.

Trying to keep only one topic in a particular thread, I want to
start by seeking your advice about Unicode XML encodings of original
language texts. Thinking about this, a parchment, scroll, or codex has
some associated metadata including its name(s), discovery date(s),
estimated creation date, discovery location(s), owner(s), author,
copyist, current location(s), material, ink type, pen type, form,
format, clarity, language, areas of the Bible covered, URL of facsimile,
etc., etc.

I guess that a typical ancient document contains a string of letters
(which may or may not contain spaces/word-breaks, vowel markings, accent
markings, punctuation marks, other markings, etc.) to which could
hopefully be added pointers to the facsimile URL, along with additional
location information such as page, column, and line numbers. There
will sometimes be approximately x letters missing from within the text,
and perhaps some places where a particular letter is either totally
undecipherable or else unclear and hence ambiguous between a range of
two or more letters.
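
To make the description above concrete, here is a minimal Python sketch of such a "raw" transcription. Every class and field name in it (`Glyph`, `Gap`, `Manuscript`, `plain_text`) is hypothetical, chosen only for illustration, and it is not taken from OSIS or any other existing format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Glyph:
    """One letter position in the transcription (hypothetical structure)."""
    # One reading = certain; several = ambiguous; none = undecipherable.
    readings: List[str]
    page: Optional[int] = None
    column: Optional[int] = None
    line: Optional[int] = None


@dataclass
class Gap:
    """A stretch of roughly `approx_letters` missing letters."""
    approx_letters: int


@dataclass
class Manuscript:
    """Document-level metadata plus the ordered letters and gaps."""
    name: str
    language: str
    facsimile_url: Optional[str] = None
    content: list = field(default_factory=list)  # Glyph and Gap items, in order


def plain_text(ms, unknown="?", gap="."):
    """Render the transcription as a string, marking gaps and unclear letters."""
    out = []
    for item in ms.content:
        if isinstance(item, Gap):
            out.append("[" + gap * item.approx_letters + "]")
        elif len(item.readings) == 1:
            out.append(item.readings[0])
        else:
            out.append(unknown)  # ambiguous or undecipherable
    return "".join(out)


ms = Manuscript("Sample fragment", "grc", content=[
    Glyph(["α"], page=1, column=1, line=1),
    Glyph(["β", "ρ"]),   # unclear: could be either letter
    Gap(3),              # about three letters missing
    Glyph([]),           # totally undecipherable
])
print(plain_text(ms))
```

Keeping the alternatives for an unclear letter as a list means no reading has to be chosen at transcription time.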

Then I guess there are often marginal notes (do any have
"footnotes"?) which presumably can be determined (with various degrees
of certainty) to apply to certain lines or words of the original text.

If it helps to think of it this way, the above is my first general
attempt at a useful digital format for an "OCR" representation of a
particular ancient document.

So this document and its digital representation may simply be one
long string of consonants or it may have word breaks and sentence
punctuation. On top of this basic structure we would have another kind
of metadata, such as word and sentence and paragraph breaks (including
an allowance for places where the breaks are ambiguous) and various
systems of chapter/verse numbering etc. Alternatively, this kind of
metadata could perhaps be encoded in a separate index file.
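
As a rough illustration of the separate-index idea, the following Python sketch keeps the raw text as one string and stores each versification system as a table of character ranges. The scheme names, references, and offsets are invented for the example, and plain character offsets are only the simplest possible anchor.

```python
# The raw text is stored once, with no verse markup inside it.
RAW = "ABCDEFGHIJ"

# Hypothetical standoff index: each scheme maps a reference to a
# half-open (start, end) character range in RAW.
INDEX = {
    "scheme_a": {"1.1": (0, 4), "1.2": (4, 10)},
    "scheme_b": {"1.1": (0, 6), "1.2": (6, 10)},
}


def lookup(scheme, ref):
    """Return the stretch of raw text that `ref` covers under `scheme`."""
    start, end = INDEX[scheme][ref]
    return RAW[start:end]


print(lookup("scheme_a", "1.1"))
print(lookup("scheme_b", "1.1"))
```

Because the index lives outside the text, adding another versification system means adding another table rather than editing the transcription.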

I am aware of OSIS and need to study it more carefully, but I don't
think it was designed to represent this level of "raw" text. So my
question is: Is this level of raw text or anything similar already
available? (I'm asking both about the XML format and about the
data/texts themselves.)

Thanks for your help and advice,
Robert Hunt.

Daniel Owens

Dec 18, 2009, 11:34:26 PM

to open-sc...@googlegroups.com

Robert,

Welcome. I just read your introductory blog post, and I think you will
find many of a similar mind to your own. I've been in SE Asia for the
past six years, which is partly why I am motivated to be involved.

I think OSIS can account for the kind of meta-data you mention. If you
are interested in, for example, representing the scribal notes such as
we find in the BHS, I think it is possible to represent them in the
text, but you'll likely have to tie that data to a verse. You may just
need to dig in the OSIS manual, and if you don't find something there
are ways of creating custom data types.

Daniel

> --
>
> You received this message because you are subscribed to the Google Groups "Open Scriptures" group.
> To post to this group, send email to open-sc...@googlegroups.com.
> To unsubscribe from this group, send email to open-scriptur...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/open-scriptures?hl=en.

Robert Hunt

Dec 19, 2009, 2:16:53 AM

to open-sc...@googlegroups.com

Thanks, Daniel. Are verses primary in OSIS? If so, that wouldn't handle
alternative versification systems very well??? Actually, I feel that the
tendency to use and quote Scripture by "verses" has often "cheapened"
use of the Scriptures and led to many things being quoted and used
totally out of context. All to say that I see lots of good reasons not
to tie things to verses.

And as for tying marginal notes to verses, that would seem rather
imprecise (even if better than nothing).

Ah, I found a sample OSIS text. It seems that chapter/verse numbers are
milestones inside blocks. But it seems that only one versification
system can be used within a particular document. That would seem to make
OSIS not very general in a worldwide context. :-( I think I like the
idea of putting the verse information in a separate index file better.

BTW, are the OSIS people still active? I got no response to an email a
while back noting a possible error in the 2.1 User Manual, and there is
no apparent progress on V2.5.

Blessings,
Robert.

Kari Valkama

Dec 19, 2009, 6:41:07 AM

to open-sc...@googlegroups.com

Hi all,

I am wondering if OXES (Open XML for Editing Scripture) format would be a good choice.

It is used by SIL in FieldWorks Translation editor.

I tried to Google it but did not find anything substantial, so it seems rather hidden for an open specification. Still, it is the format SIL has chosen for Scripture, so we might want to consider whether it has any relevance for open-scriptures.

If I am not mistaken OXES uses paragraphs as the basic unit instead of verses. Verses are like elements, if I am not mistaken. They are searchable, but they are not part of the tree structure, so different versification systems can coexist.

Yours,

Kari Valkama

David Troidl

Dec 19, 2009, 8:32:40 AM

to open-sc...@googlegroups.com

Hi Robert,

On 12/19/2009 2:16 AM, Robert Hunt wrote:
> Thanks, Daniel. Are verses primary in OSIS? If so, that wouldn't handle
> alternative versification systems very well??? Actually, I feel that the
> tendency to use and quote Scripture by "verses" has often "cheapened"
> use of the Scriptures and led to many things being quoted and used
> totally out of context. All to say that I see lots of good reasons not
> to tie things to verses.
>

There would still have to be some 'hooks' in the text that notes,
pointing and parsing could be attached to. Simple character count would
run afoul of the slightest change or corruption in the original document.

> And as for tying marginal notes to verses, that would seem rather
> imprecise (even if better than nothing).
>
> Ah, I found a sample OSIS text. It seems that chapter/verse numbers are
> milestones inside blocks. But it seems that only one versification
> system can be used within a particular document. That would seem to make
> OSIS not very general in a worldwide context. :-( I think I like the
> idea of putting the verse information in a separate index file better.
>

Yes, the current recommendation is to milestone verse divisions. The
osisID has the capacity to specify a specific text, identified in the
header, such as WLC:Gen.1.1, so alternate versification schemes could
conceivably be included.

> BTW, are the OSIS people still active? I got no response to an email a
> while back noting a possible error in the 2.1 User Manual, and there is
> no apparent progress on V2.5.
>

The last official OSIS release was in March 2006. The project is still
active, in some sense, under the developers at CrossWire.org, who
produce various Bible study software, with texts available in many of
the world's languages. And here at Open Scriptures, we are trying to
evaluate the situation. Can OSIS be revived? Is it the best choice for
our work? I tend to lean in that direction, but the question is just
now under consideration.

Peace,

David

David Troidl

Dec 19, 2009, 9:18:56 AM

to open-sc...@googlegroups.com

From the FieldWorks Translation Editor 3.0 ReadMe

NOTE: The OXES format is still being evaluated and refined, so it should be used only for transferring data between users and it should not be used for long-term archival.


Chris Little

Dec 19, 2009, 2:09:06 PM

to open-sc...@googlegroups.com

On 12/18/2009 11:16 PM, Robert Hunt wrote:
> Ah, I found a sample OSIS text. It seems that chapter/verse numbers are
> milestones inside blocks. But it seems that only one versification
> system can be used within a particular document. That would seem to make
> OSIS not very general in a worldwide context. :-( I think I like the
> idea of putting the verse information in a separate index file better.

David Troidl already gave an overview of this, but I want to be a little
more explicit.

OSIS allows milestoned or container chapters & verses. You can use
whichever you prefer, but within a document, you must be consistent in
using only one type (milestoned or container). All else being equal, the
book/section/paragraph hierarchy is preferred to the book/chapter/verse
hierarchy, as the primary structure of the document.

Each chapter/verse can take an osisID to identify itself according to
one or more reference schemes (identified in the header). And those
reference schemes need not align identically. So, assuming you have
identified two reference schemes named Vulg and GNT, it is permissible
to have a chunk of markup that looks like this:

So both Vulg and GNT would include "text text text" in Matt.1.1, but
only Vulg would include "word word word" within Matt.1.1.

You can pretty much mark any kind of overlapping versification you like,
and can employ as many different reference schemes within a single
document as you need.
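
As an illustration of overlapping reference schemes, here is a minimal Python sketch. The markup in it is an assumption and not the sample from the original message: it uses milestone-style `<verse>` elements with `sID`/`eID` pairs, one pair per scheme, and the `sID` values "a" and "b" are arbitrary. The helper `verse_texts` is hypothetical and handles only flat milestones inside a single parent element.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the Vulg verse spans both runs of text, the GNT verse
# spans only the first one.
SAMPLE = (
    '<p>'
    '<verse sID="a" osisID="Vulg:Matt.1.1"/>'
    '<verse sID="b" osisID="GNT:Matt.1.1"/>'
    'text text text'
    '<verse eID="b"/>'
    ' word word word'
    '<verse eID="a"/>'
    '</p>'
)


def verse_texts(xml_string):
    """Collect the text covered by each osisID between its milestones."""
    root = ET.fromstring(xml_string)
    open_ids = {}  # sID -> osisID values that are currently open
    result = {}

    def add(text):
        if not text:
            return
        for ids in open_ids.values():
            for osis_id in ids:
                result[osis_id] = result.get(osis_id, "") + text

    add(root.text)
    for el in root.iter("verse"):
        if "sID" in el.attrib:
            open_ids[el.get("sID")] = el.get("osisID", "").split()
        elif "eID" in el.attrib:
            open_ids.pop(el.get("eID"), None)
        add(el.tail)  # text that follows a milestone lives in its tail
    return result


for osis_id, text in verse_texts(SAMPLE).items():
    print(osis_id, "->", text)
```

Because the verses are empty milestones rather than containers, the two schemes can overlap freely without breaking XML nesting.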

> BTW, are the OSIS people still active? I got no response to an email a
> while back noting a possible error in the 2.1 User Manual, and there is
> no apparent progress on V2.5.

Unfortunately one of the Bible Technologies Group domains
(bibletechnologieswg.org) was allowed to lapse and has been purchased by
a domain squatter. So, if you were trying to email
osis-editors@bibletechnologieswg.org, it would likely bounce. You could try
osis-e...@crosswire.org (since CrossWire hosts the OSIS mailing
lists). Or you might try contacting Patrick Durusau directly, since he's
the manual's author. His email is on the first page of the manual.

Are there specific features you would like to see in a future version of
OSIS that are absent from 2.1.1?

--Chris

David Troidl

Dec 19, 2009, 3:14:31 PM

to open-sc...@googlegroups.com

I'm replying to this one too. Yes, there are features I'd like to see
added.

First, there should be better attributes for the 'w' element. Why
should we have to limp along with a single 'lemma' attribute, and then
have to put a prefix on everything that is supposed to go in there? A
separate 'Strong' attribute, at least, would be very helpful, if not
TWOT or TDNT, or a combination that could cover either, depending on
testament. Strong numbers are still widely used, and TWOT is supposed
to be more accurate.

I could also use better elements for the Strong's Hebrew Dictionary.
Something for derivation, definition and translations (from the KJV).

The 'hi' element is a little strange too, and is intended to cover so
much ground. Separate 'em' and 'strong' elements, at least, would clean
up the markup quite a bit.

Peace,

David


Chris Little

Dec 19, 2009, 10:19:42 PM

to open-sc...@googlegroups.com

On 12/19/2009 12:14 PM, David Troidl wrote:
> I'm replying to this one too. Yes, there are features I'd like to see
> added.
>
> First, there should be better attributes for the 'w' element. Why
> should we have to limp along with a single 'lemma' attribute, and then
> have to put a prefix on everything that is supposed to go in there. A
> separate 'Strong' attribute, at least, would be very helpful, if not
> TWOT or TDNT, or a combination that could cover either, depending on
> testament. Strong numbers are still widely used, and TWOT is supposed
> to be more accurate.

I would guess that this will not change. Adding attributes that make
specific reference to particular works is a can of worms that we
wouldn't want to be involved in. (It raises questions of which works and
how many works get this special treatment.)

More importantly, it's not necessary, since lemma can handle all of
these. The lemma attribute is actually a list of osisGenType values, so
(assuming all of the work prefixes have been identified in the header)
you could have an element that looks like:

<w lemma="Strong:G1234 louwNida:1.1a TDNT:23b ANLEX:νονσενς">νονσεντις</w>

Spaces delimit the values, so some substitution character must be used if you
want to include a space within an osisGenType value. (Based on some
discussions in osis-core, I've recommended using NBSP in place of space,
but that's more of an ad hoc solution than part of the standard.) Other
illegal characters may be escaped with a preceding \.
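
A minimal Python sketch of reading such a lemma attribute follows. The function name `parse_lemma` is hypothetical. It assumes the conventions described above (values separated by ordinary spaces, NBSP standing in for a space inside a value) and does not attempt to handle backslash escapes.

```python
NBSP = "\u00a0"


def parse_lemma(attr):
    """Split a lemma attribute into (work prefix, value) pairs."""
    pairs = []
    # Split on the ordinary space only: str.split() with no argument would
    # also split on NBSP, which is exactly the character to preserve here.
    for item in attr.split(" "):
        if not item:
            continue
        prefix, sep, value = item.partition(":")
        if not sep:  # no work prefix given
            prefix, value = "", item
        pairs.append((prefix, value.replace(NBSP, " ")))
    return pairs


print(parse_lemma("Strong:G1234 louwNida:1.1a TDNT:23b"))
print(parse_lemma("ANLEX:two\u00a0words"))
```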

> I could also use better elements for the Strong's Hebrew Dictionary.
> Something for derivation, definition and translations (from the KJV).

Dictionaries weren't ever really addressed, though they were a planned
future feature. I believe our direction was heading towards
encapsulating TEI within OSIS elements, since TEI already has
abundant support for dictionary markup. I've been happily employing
TEI P5 in place of OSIS for all manner of lexically-keyed documents.

For a few sparse details on what I've done in that area with TEI, you
can check out the CrossWire Wiki at
http://www.crosswire.org/wiki/TEI_Dictionaries. The customized TEI P5
schema I rolled adds the ability to link to/from content in OSIS by
adding osisID and osisRef to a number of TEI elements. Sample docs are
also provided.

> The 'hi' element is a little strange too, and is intended to cover so
> much ground. Separate 'em' and 'strong' elements, at least, would clean
> up the markup quite a bit.

I would have liked an <emph> element too, and argued for one at one
point. As a consolation, we got <hi type="emphasis">. I don't know that
those who felt so strongly that <emph> should be omitted could be
convinced that it is necessary.

Since OSIS has its roots in TEI, <emph> would be used, rather than
HTML's <em> and <strong>. I'm not sure that there's really a semantic
difference between <em> and <strong> sufficient to argue for two such
elements.

--Chris

Weston Ruter

Dec 20, 2009, 3:07:33 AM

to open-scriptures

2009/12/19 Chris Little <chrisc...@gmail.com>

>> The 'hi' element is a little strange too, and is intended to cover so
>> much ground. Separate 'em' and 'strong' elements, at least, would clean
>> up the markup quite a bit.
>
> I would have liked an <emph> element too, and argued for one at one
> point. As a consolation, we got <hi type="emphasis">. I don't know that
> those who felt so strongly that <emph> should be omitted could be
> convinced that it is necessary.
>
> Since OSIS has its roots in TEI, <emph> would be used, rather than
> HTML's <em> and <strong>. I'm not sure that there's really a semantic
> difference between <em> and <strong> sufficient to argue for two such
> elements.

Do you think that the <hi> element more closely corresponds to HTML5's <mark> element than to <em> or <strong>? http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#the-mark-element

Chris Little

Dec 20, 2009, 11:39:15 AM

to open-sc...@googlegroups.com

On 12/20/2009 12:07 AM, Weston Ruter wrote:

No, I would say <hi> is not similar to HTML5 <mark>. Like many OSIS
elements, <hi> has its roots in TEI, so see
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-hi.html

In TEI and OSIS, <hi> is chiefly a place to hang attributes indicating
text decoration (making no claim to their semantic significance). For
TEI, that is achieved with rend. For OSIS, owing to a great dislike for
rend, these values are contained in type.

HTML5 <mark> is for identifying a string with relevance in another
context. If it could be compared to anything in OSIS, I would say it is
similar to <catchWord>, except that the latter marks the string being
discussed within the discussion, not in its original context.

--Chris

David Troidl

Dec 20, 2009, 4:23:39 PM

to open-sc...@googlegroups.com

Chris,

Thanks so much for all the info.

One other issue that's come up on Open Scriptures is that Crossway has
said they would really like to go with OSIS for their ESV web
service, but need a way of serving valid OSIS fragments. Obviously,
throwing a complete header on a single verse would be overkill. My
thought is maybe there could be a simple envelope, possibly with a link
to a complete header maintained on the server, so the fragment would
validate in that form, but if any of the work or workPrefix elements
were needed, they could be easily accessed.
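
A rough Python sketch of that envelope idea is below. Nothing in it is part of OSIS: the `fragment` element, its `headerRef` attribute, the helper name `wrap_fragment`, and the example.org URL are all hypothetical names invented for illustration.

```python
import xml.etree.ElementTree as ET


def wrap_fragment(verse_xml, header_url):
    """Wrap one verse in a thin envelope that points at a shared header."""
    # Hypothetical envelope element; the full header stays on the server.
    envelope = ET.Element("fragment", {"headerRef": header_url})
    envelope.append(ET.fromstring(verse_xml))
    return ET.tostring(envelope, encoding="unicode")


out = wrap_fragment(
    '<verse osisID="Gen.1.1">In the beginning</verse>',
    "https://example.org/headers/sample-header.xml",  # placeholder URL
)
print(out)
```

A client that needs the work or workPrefix declarations would fetch the header once from the referenced URL and reuse it for later fragments.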

Peace,

David


Weston Ruter

Dec 20, 2009, 5:39:32 PM

to Chris Little, open-scriptures

Thanks, Chris.

> In TEI and OSIS, <hi> is chiefly a place to hang attributes indicating
> text decoration (making no claim to their semantic significance).

But HTML5's <em> and <strong> do have semantic significance:

The <em> element represents stress emphasis of its contents.

The <strong> element represents strong importance for its contents.

http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#the-em-element
http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#the-strong-element

Whereas for <mark>:

The mark element represents a run of text in one document marked or highlighted for reference purposes, due to its relevance in another context. When used in a quotation or other block of text referred to from the prose, it indicates a highlight that was not originally present but which has been added to bring the reader's attention to a part of the text that might not have been considered important by the original author when the block was originally written, but which is now under previously unexpected scrutiny. When used in the main prose of a document, it indicates a part of the document that has been highlighted due to its likely relevance to the user's current activity.

http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#the-mark-element

Thanks for the useful comparison between HTML5's <mark> and OSIS's <catchWord>.

Is the difference here that TEI's use of <hi> is strictly presentational (“a place to hang attributes indicating text decoration”) whereas HTML's usages are (supposed to be) strictly semantic? TEI is seeking to exactly represent the original *appearance* and structure of the texts it encodes, whereas HTML is seeking more to represent the text's semantics (where presentational markup is deprecated)?

Weston

David Troidl

Dec 20, 2009, 6:52:27 PM

to open-sc...@googlegroups.com

Hi Chris,

I've had a look at your TEI dictionary page. The schema is rather dense, not nearly as readable as the OSIS schema, right out of the box. Assuming the sample strongs.tei.xml is fairly representative, I'm wondering if it will cover the data I have. See the sample below of the entry for H1, or the full dictionary at http://open-scriptures.googlecode.com/svn/trunk/data/strongs-dictionaries/hebrew/.

The abuses of OSIS markup should be obvious, but also the requirements of the data. The data is from David Instone-Brewer's 2-Letter Lookup site, used by permission. Many of the Strong's entries have variant spellings, so the association of the transliteration and pronunciation with the Hebrew word they apply to seems important. Of course the basic Strong's dictionary doesn't have this problem, but I see no reason for losing the richness of the data in its current form.

The purpose of @n on the entry is for efficient access in PHP. @gloss is the TWOT number, @lemma is the pointed dictionary form, @morph is the part of speech (for want of a better attribute), @POS is the pronunciation, @xlit is the transliteration, @ID is the Strong's number (appropriate for a dictionary, the H prefix required for a valid XML ID). The content of the element is the unpointed form, useful for alphabetizing.

The foreign and list elements should be clear. The 3 note elements at the end are what I was referring to in the earlier email. They separate out the derivation information, the definition given by Strong, and the KJV translations. I was hoping something like this would eventually be incorporated into OSIS. The markup of your Strong's dictionary doesn't seem very promising in that direction. Any thoughts?

Peace,

David

<div type="entry" n="1">
  <w gloss="4a" lemma="אָב" morph="n-m" POS="awb" xlit="'ab" ID="H1" xml:lang="heb">אב</w>
  <foreign xml:lang="grc">
    <w gloss="G:1118" />
    <w gloss="G:2730" />
    <w gloss="G:3390" />
    <w gloss="G:3507" />
    <w gloss="G:3509" />
    <w gloss="G:3962" />
    <w gloss="G:3965" />
    <w gloss="G:3966" />
    <w gloss="G:3967" />
    <w gloss="G:3971" />
  </foreign>
  <list>
    <item>1) father of an individual</item>
    <item>2) of God as father of his people</item>
    <item>3) head or founder of a household, group, family, or clan</item>
    <item>4) ancestor</item>
    <item>4a) grandfather, forefathers — of person</item>
    <item>4b) of people</item>
    <item>5) originator or patron of a class, profession, or art</item>
    <item>6) of producer, generator (fig.)</item>
    <item>7) of benevolence and protection (fig.)</item>
    <item>8) term of respect and honour</item>
    <item>9) ruler or chief (spec.)</item>
  </list>
  <note type="exegesis">a primitive word;</note>
  <note type="explanation">
    <hi>father</hi>, in a literal and immediate, or figurative and remote application</note>
  <note type="translation">chief, (fore-) father(-less), [idiom] patrimony, principal. Compare names in 'Abi-'.</note>
</div>
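
For anyone wanting to consume an entry like this, here is a minimal Python sketch that reads a shortened copy of the H1 entry with the standard library. The function name `read_entry` and the keys of the dictionary it returns are hypothetical; the attribute meanings follow the description given above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the H1 entry, kept small for the example.
ENTRY = """<div type="entry" n="1">
  <w gloss="4a" lemma="אָב" morph="n-m" POS="awb" xlit="'ab" ID="H1" xml:lang="heb">אב</w>
  <foreign xml:lang="grc"><w gloss="G:1118" /><w gloss="G:3962" /></foreign>
  <list><item>1) father of an individual</item><item>4) ancestor</item></list>
  <note type="exegesis">a primitive word;</note>
  <note type="explanation"><hi>father</hi>, in a literal and immediate, or figurative and remote application</note>
  <note type="translation">chief, (fore-) father(-less), [idiom] patrimony, principal.</note>
</div>"""


def read_entry(xml_string):
    """Pull the main fields out of one dictionary entry."""
    div = ET.fromstring(xml_string)
    head = div.find("w")
    return {
        "strong": head.get("ID"),         # Strong's number
        "twot": head.get("gloss"),        # TWOT number
        "lemma": head.get("lemma"),       # pointed dictionary form
        "pronunciation": head.get("POS"),
        "xlit": head.get("xlit"),
        "unpointed": head.text,           # element content, used for sorting
        "greek": [w.get("gloss") for w in div.findall("foreign/w")],
        "senses": [item.text for item in div.findall("list/item")],
        # itertext() flattens nested markup such as <hi> inside a note.
        "notes": {n.get("type"): "".join(n.itertext()).strip()
                  for n in div.findall("note")},
    }


entry = read_entry(ENTRY)
print(entry["strong"], entry["twot"], entry["greek"])
print(entry["notes"]["explanation"])
```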


Chris Little

Dec 20, 2009, 7:47:39 PM

to open-scriptures

On 12/20/2009 2:39 PM, Weston Ruter wrote:
> Thanks, Chris.
>
> In TEI and OSIS, <hi> is chiefly a place to hang attributes indicating
> text decoration (making no claim to their semantic significance).
>
>
> But HTML5's <em> and <strong> do have semantic significance:
>
> The <em> element represents stress emphasis of its contents.
>
> The <strong> element represents strong importance for its contents.
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#the-em-element
> http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#the-strong-element

TEI <hi> has no semantic significance, but TEI <emph> does and appears
quite similar to HTML <em>. If any element of TEI closely correlates to
HTML's <strong>, I don't know what it is.

> Is the difference here that TEI's use of <hi> is strictly presentational
> (“a place to hang attributes indicating text decoration”) whereas HTML's
> usages are (supposed to be) strictly semantic? TEI is seeking to exactly
> represent the original *appearance* and structure of the texts it
> encodes, whereas HTML is seeking more to represent the text's semantics
> (where presentational markup is deprecated)?

TEI's <hi> is non-semantic and indicates that its contents are
presentationally distinct from surrounding text, but I think any TEI
element that can contain CDATA can also possess a rend attribute to
describe its presentation.

I don't know that any other TEI elements are focused on presentation,
whereas a fairly large number of HTML elements are focused on
presentation (though their number is fewer in HTML5), so I would not say
that TEI seeks to represent the original appearance. An application of
TEI like EpiDoc might be more specifically concerned with document
appearance, but I'm not particularly familiar with EpiDoc. (If you're at
all considering encoding papyri, manuscripts, etc. I would recommend
taking a look at EpiDoc.)

I imagine that the situations for using <hi> in TEI are similar to what
we proposed for its use in OSIS: When encoding documents from a 3rd
party (e.g. an old book) the original semantic distinction of a string
of highlighted text may not be reliably reconstructed. In such cases,
the most conservative solution is to simply record how the text differs
from its surroundings.

--Chris

Joel Leineweber

Dec 20, 2009, 9:02:20 PM

to open-sc...@googlegroups.com

Although it would be slightly more work, I thought I should propose that we present the data in multiple formats. Perhaps you could request OSIS or other XML, HTML 4, HTML 5, or JSON. HTML would be great for inexperienced users who don't want to parse XML or whatever. JSON is becoming very popular, especially for JavaScript services, and if we formatted it correctly, it wouldn't have the inherent problems of multiple hierarchies and overlapping elements, because it's essentially serialized objects.
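
As a sketch of what such JSON might look like, the Python below stores a passage as a flat list of tokens, each carrying the references it belongs to, so overlapping verse schemes need no nesting. The key names (`tokens`, `id`, `text`, `refs`) and the helper `words_in` are assumptions for illustration, not a proposed standard; the Vulg and GNT scheme names are borrowed from the earlier discussion.

```python
import json

# Flat token list: membership in a verse is a property of each token,
# so two schemes can disagree about verse boundaries without conflict.
document = {
    "tokens": [
        {"id": 1, "text": "text", "refs": ["Vulg:Matt.1.1", "GNT:Matt.1.1"]},
        {"id": 2, "text": "word", "refs": ["Vulg:Matt.1.1"]},
    ]
}


def words_in(doc, ref):
    """Return the token texts that belong to a given reference."""
    return [t["text"] for t in doc["tokens"] if ref in t["refs"]]


payload = json.dumps(document, ensure_ascii=False)  # what a service would send
restored = json.loads(payload)                      # what a client would read

print(words_in(restored, "Vulg:Matt.1.1"))
print(words_in(restored, "GNT:Matt.1.1"))
```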

On Sun, Dec 20, 2009 at 7:47 PM, Chris Little <chrisc...@gmail.com> wrote:

On 12/20/2009 2:39 PM, Weston Ruter wrote:
> Thanks, Chris.
>
> In TEI and OSIS, <hi> is chiefly a place to hang attributes indicating
> text decoration (making no claim to their semantic significance).
>
>
> But HTML5's and do have semantic significance:
>
> The element represents stress emphasis of its contents.
>
> The element represents strong importance for its contents.
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#the-em-element
> http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#the-strong-element

TEI <hi> has no semantic significance, but TEI <emph> does and appears
quite similar to HTML . If any element of TEI closely correlates to
HTML's , I don't know what it is.

> Is the difference here that TEI's use of <hi> is strictly presentational

> (“a place to hang attributes indicating text decoration”) whereas HTML's

> usages are (supposed to be) strictly semantic? TEI is seeking to exactly
> represent the original *appearance* and structure of the texts it
> encodes, whereas HTML is seeking more to represent the text's semantics
> (where presentational markup is deprecated)?

TEI's <hi> is non-semantic and indicates that its contents are
presentationally distinct from surrounding text, but I think any TEI
element that can contain CDATA can also possess a rend attribute to
describe its presentation.

I don't know that any other TEI elements are focused on presentation,
whereas a fairly large number of HTML entities are focused on
presentation (though their number is fewer in HTML5), so I would not say
that TEI seeks to represent the original appearance. An application of
TEI like EpiDoc might be more specifically concerned with document
appearance, but I'm not particularly familiar with EpiDoc. (If you're at
all considering encoding papyri, manuscripts, etc. I would recommend
taking a look at EpiDoc.)

I imagine that the situations for using <hi> in TEI are similar to what
we proposed for its use in OSIS: When encoding documents from a 3rd
party (e.g. an old book) the original semantic distinction of a string
of highlighted text may not be reliably reconstructed. In such cases,
the most conservative solution is to simply record how the text differs
from its surroundings.
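
To make the <hi> versus <emph> distinction concrete, here is a minimal Python sketch of how a converter might treat the two when producing HTML. It is only an illustration under stated assumptions: the rend values and the CSS mapping are made up for the example (TEI does not fix a closed vocabulary for rend), the function name is hypothetical, and the TEI namespace is ignored for brevity.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from rend values to CSS declarations (illustrative only).
REND_TO_CSS = {
    "italic": "font-style: italic",
    "bold": "font-weight: bold",
    "smallcaps": "font-variant: small-caps",
}


def tei_inline_to_html(elem: ET.Element) -> ET.Element:
    """Convert one TEI <hi> or <emph> element to an HTML element.

    <emph> carries meaning, so it becomes <em>.
    <hi> only records appearance, so it becomes a styled <span>
    that makes no semantic claim.
    """
    if elem.tag == "emph":
        out = ET.Element("em")
    elif elem.tag == "hi":
        out = ET.Element("span")
        css = REND_TO_CSS.get(elem.get("rend", ""))
        if css:
            out.set("style", css)
    else:
        raise ValueError(f"unsupported element: {elem.tag}")
    out.text = elem.text
    return out


hi = ET.fromstring('<hi rend="italic">LORD</hi>')
emph = ET.fromstring("<emph>must</emph>")
print(ET.tostring(tei_inline_to_html(hi), encoding="unicode"))
print(ET.tostring(tei_inline_to_html(emph), encoding="unicode"))
```

The point of the design is that the <hi> branch never guesses at a semantic element: if the original reason for the highlighting cannot be reconstructed, the output stays purely presentational.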

--Chris

Chris Little

Dec 20, 2009, 9:08:01 PM

to open-sc...@googlegroups.com

On 12/20/2009 3:52 PM, David Troidl wrote:
> Hi Chris,
>
> I've had a look at your TEI dictionary page. The schema is rather
> dense, not nearly as readable as the OSIS schema, right out of the box.
> Assuming the sample strongs.tei.xml is fairly representative, I'm
> wondering if it will cover the data I have. See the sample below of the
> entry for H1, or the full dictionary at
> http://open-scriptures.googlecode.com/svn/trunk/data/strongs-dictionaries/hebrew/.

I should say that the samples are not necessarily good samples. :)

TEI is capable of being vastly more expressive than the sample documents
might imply. I would recommend taking a look at the <entryFree>
documentation at
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-entryFree.html

The schema hosted at crosswire.org isn't meant to be readable, since
it's a customized schema generated via TEI's Roma tool (with the osisID
& osisRef parts added by hand). In general, I would recommend looking
through the TEI P5 docs, since they document every facet of the standard
in great detail with abundant examples.

> The abuses of OSIS markup should be obvious, but also the requirements
> of the data. The data is from David Instone-Brewer's 2-Letter Lookup
> site, used by permission. Many of the Strong's entries have variant
> spellings, so the association of the transliteration and pronunciation
> with the Hebrew word they apply to seems important. Of course the basic
> Strong's dictionary doesn't have this problem, but I see no reason for
> losing the richness of the data in its current form.
>
> The purpose of @n on the entry is for efficient access in PHP. @gloss
> is the TWOT number, @lemma is the pointed dictionary form, @morph is the
> part of speech (for want of a better attribute), @POS is the
> pronunciation, @xlit is the transliteration, @ID is the Strong's number
> (appropriate for a dictionary, the H prefix required for a valid XML
> ID). The content of the element is the unpointed form, useful for
> alphabetizing.

I think TEI would force you to extract most of these attributes into
elements. These are just quick guesses, without putting too much thought
into the exercise, but I think the following mappings from your markup
to TEI P5 would probably represent the text well:

The TWOT number would probably go in a <ref> element, since it's
basically a reference into another work.
The dictionary form would go in <orth>.
Morphological info/part of speech goes into <gramGrp> (or it could use
the various elements without the group element: <pos>, <gen>, <case>,
<tns>, <mood>, etc.)
Pronunciation goes into <pron>.
The transliterated form could go into a <pron> as well, with a type to
identify it as transliteration.

> The foreign and list elements should be clear. The 3 note elements at
> the end are what I was referring to in the earlier email. They separate
> out the derivation information, the definition given by Strong, and the
> KJV translations. I was hoping something like this would eventually be
> incorporated into OSIS. The markup of your Strong's dictionary doesn't
> seem very promising in that direction. Any thoughts?

The contents of your <foreign> element might be best placed into an <xr>
(cross-ref) or <re> (related entry) element.
OSIS <list> and <item> are copied from TEI, so those are exactly the
same, but you would enclose the whole list within a TEI <def> element.
Even better would be to replace the list items with individual <sense>
elements.
Your <note>s might require some type values in order to clarify their
intent, but I think:
The exegesis note could probably go inside an <etym> (etymology) element.
The explanation note sounds like it should go in <usg>.
The translation note looks like a good candidate for <gloss> (or its
individual parts could go there).
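
To show roughly what these mappings add up to, here is a small Python sketch that builds a TEI-style <entryFree> from the attribute-based markup David described. Treat it as a guess at the shape rather than a validated encoding: it has not been checked against the TEI P5 schema, the <form> wrapper and the type values are my own assumptions, the function name is hypothetical, and the sample values for H1 are illustrative placeholders rather than data copied from David's file.

```python
import xml.etree.ElementTree as ET


def strongs_to_tei(entry_id, unpointed, attrs, notes):
    """Build an <entryFree> element from attribute-style entry data.

    attrs uses the attribute names from the OSIS-based markup
    (gloss, lemma, morph, POS, xlit); notes is keyed by note type.
    """
    entry = ET.Element("entryFree", {"n": entry_id})

    # Written forms and pronunciations, grouped in <form> (an assumption).
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = attrs["lemma"]
    ET.SubElement(form, "orth", {"type": "unpointed"}).text = unpointed
    ET.SubElement(form, "pron").text = attrs["POS"]
    ET.SubElement(form, "pron", {"type": "transliteration"}).text = attrs["xlit"]

    # Part of speech goes into a grammatical group.
    gram = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram, "pos").text = attrs["morph"]

    # The TWOT number is a reference into another work.
    ET.SubElement(entry, "ref", {"type": "TWOT"}).text = attrs["gloss"]

    # The three notes, mapped as suggested above.
    ET.SubElement(entry, "etym").text = notes["exegesis"]
    ET.SubElement(entry, "usg").text = notes["explanation"]
    ET.SubElement(entry, "gloss").text = notes["translation"]
    return entry


# Placeholder values for illustration only.
h1 = strongs_to_tei(
    "H1",
    "אב",
    {"gloss": "4a", "lemma": "אָב", "morph": "n-m", "POS": "awb", "xlit": "ʼâb"},
    {
        "exegesis": "a primitive word",
        "explanation": "father, in a literal or figurative application",
        "translation": "chief, (fore-)father, principal",
    },
)
print(ET.tostring(h1, encoding="unicode"))
```

Because each value lands in its own element, a variant spelling could carry its own <orth> and <pron> pair inside a second <form>, which keeps the transliteration and pronunciation tied to the word they describe.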

Another great resource for TEI is Perseus. They have far better examples
of thoroughly marked dictionaries, though their TEI is from a much older
version of the standard and their documents are not underlyingly
Unicode. Their dictionaries include a couple of Liddell & Scott Greek
lexicons (cf.
http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3atext%3a1999.04.0057).
The underlying XML can be viewed for each entry, and the whole
collection can be downloaded via
http://www.perseus.tufts.edu/hopper/opensource/download.

--Chris

David Troidl

Dec 20, 2009, 9:26:56 PM

to open-sc...@googlegroups.com

Storing all those formats would be a maintenance nightmare. We are talking here about the base format to be used for storage. All the others could come by transformation. Style sheets or scripts would have to be developed in each case, but that's a whole different layer of the operation.

Peace,

David

On 12/20/2009 9:02 PM, Joel Leineweber wrote:

Although it would be slightly more work, I thought I should propose the idea that we present the data in multiple formats. Perhaps you can request OSIS or other XML, HTML 4, HTML 5, or JSON. HTML would be great for inexperienced users who don't want to parse XML or whatever. JSON is becoming very popular, especially for JavaScript services, and if we formatted it correctly, it wouldn't have the inherent problems of multiple hierarchies and overlapping elements, because it's essentially serialized objects.


Joel Leineweber

Dec 20, 2009, 9:39:01 PM

to open-sc...@googlegroups.com

Sorry, I misunderstood. For some reason I thought we were discussing a database-driven solution, not a situation where documents are stored in a repository.

If the documents are stored in a database similar to the schema Weston was working on, then as long as the schema accounts for all needs (and I think it was getting close), making the web services API spit out various formats would not be difficult.

Sorry for the confusion.


Ze'ev Clementson

Dec 21, 2009, 12:39:04 AM

to open-sc...@googlegroups.com

Wow, I go away for a week and there are over 100 messages on the list! Just a personal comment: I've found that David's WLC and Strongs OSIS XML files are very easy to work with and transform into other formats. I transformed them into SQL statements for populating a SQLite3 database. The attached image shows the database tables (and their relationships) that I created from David's XML files. I agree with David and don't think it's necessary (or even a good idea) to keep multiple formats in the openscriptures repository as there should really only be 1 canonical format (that can subsequently be transformed into other formats that people might prefer to work with in their own projects).
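
For anyone curious what such a transformation can look like, here is a minimal Python sketch that loads word elements from an OSIS-like fragment into SQLite. It is not the actual conversion described above: the sample fragment is simplified (no namespace, one verse), the table and column names are hypothetical, and it uses parameterized inserts directly instead of generating SQL statements as text.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment for illustration only.
SAMPLE = """<verse osisID="Gen.1.1">
  <w lemma="H7225">בראשית</w>
  <w lemma="H1254">ברא</w>
  <w lemma="H430">אלהים</w>
</verse>"""


def load_words(conn, xml_text):
    """Insert one row per <w> element and return the number of rows added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word ("
        "verse TEXT, position INTEGER, lemma TEXT, text TEXT)"
    )
    verse = ET.fromstring(xml_text)
    rows = [
        (verse.get("osisID"), position, w.get("lemma"), w.text)
        for position, w in enumerate(verse.iter("w"), start=1)
    ]
    conn.executemany("INSERT INTO word VALUES (?, ?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
load_words(conn, SAMPLE)
for row in conn.execute("SELECT position, lemma, text FROM word ORDER BY position"):
    print(row)
```

The XML stays the single canonical source here; the database is a derived view that can be rebuilt at any time.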

- Ze'ev

WLC-Strongs.png

Troy A. Griffitts

Dec 21, 2009, 11:17:59 PM

to open-sc...@googlegroups.com

Just a quick comment on the problem of encoding XML fragments.

Again, to restate the problems briefly:

Use Case: Retrieve from an API the contents of KJV:Matt.7.9

http://crosswire.org/study/fetchdata.jsp?format=raw&mod=KJV&key=Matt.7.9

Problem 1) Matt.7.9 is in the middle of a LONG quote of Jesus starting
in Matt.5.3. When the fragment is returned, how is the recipient to know
that the data is part of a quote? A <div>, a <q>, etc? All tags that
opened much earlier in the document than the fragment requested.

Problem 2) if using a validating parser, for an OSIS document to be
valid, it needs at least a basic header.

This was discussed at a few of the OSIS meetings.

One proposal was to use what TEI has defined for sending document
fragments. They also define a way to specify a URI to the document header:
http://www.tei-c.org/Guidelines/P4/html/SH.html

To support this in OSIS the schema would need to be updated to require
either a header or a URI to presumably a valid OSIS doc with just the
header.

Another proposal was to include a global "shadow" attribute, so that a
fragment could be returned with all the preceding context tags included
and marked with this global attribute,
e.g., <q who="jesus" shadow="true">

shadow="true" tags would be provided to open or close required tags to
make the fragment valid XML and also to supply needed context, but they
would be marked as "merely given for context purposes" with the shadow
attribute.
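
As a rough illustration of the shadow idea, here is a Python sketch that wraps a requested fragment in copies of its ancestor elements and marks each copy with shadow="true". Since the proposal was never finalized, everything here is an assumption: the function name, the way ancestors are passed in, and the simplified namespace-free markup are only for the example.

```python
import xml.etree.ElementTree as ET


def wrap_with_shadow(fragment, ancestors):
    """Wrap fragment in context elements flagged with shadow="true".

    ancestors is a list of (tag, attributes) pairs, outermost first.
    The fragment itself is left unmarked, so a recipient can tell
    real content from context that is merely given for orientation.
    """
    root = None
    parent = None
    for tag, attrs in ancestors:
        elem = ET.Element(tag, dict(attrs, shadow="true"))
        if parent is None:
            root = elem
        else:
            parent.append(elem)
        parent = elem
    if parent is None:
        return fragment  # no context to add
    parent.append(fragment)
    return root


verse = ET.fromstring(
    '<verse osisID="Matt.7.9">Or what man is there of you, '
    "whom if his son ask bread, will he give him a stone?</verse>"
)
context = [("div", {"type": "book", "osisID": "Matt"}), ("q", {"who": "jesus"})]
wrapped = wrap_with_shadow(verse, context)
print(ET.tostring(wrapped, encoding="unicode"))
```

A recipient that only wants the verse can strip every element carrying the shadow attribute, while one that needs to know the verse is part of a quotation can read it from the wrapper.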

All this to say, we have thought about the problem and never finalized a
solution. Now that you have Steve sounding off, maybe he can remember
which direction we were leaning.

Hope this helps a little,

Troy

Robert Hunt

Dec 27, 2009, 3:43:01 AM

to open-sc...@googlegroups.com

I think I'm still not really understanding this. According to http://openscriptures.org/blog/2009/03/initial-project-writeup/, this group seems to want an internationally universal XML DB and associated software/webware so you can tie (mash) English or Arabic to the Greek, say, to use the example given. But it seems to me that (OSIS) verse identifiers are not specific enough. Even some English versions disagree about whether some phrases are at the end of verse x or at the beginning of verse x+1. (Can't find a quick way to locate a specific example.) And then your mash wouldn't be so tidy.

And then it's made even worse by versification systems from different traditions. It seems to me to be a wrong principle to base so much referencing around verse numbers, even if every versification system could be encoded into the OSIS file as per the above email. (Is that the intention?)

Also some translations reorder the original material beyond the verse boundaries, so they might have no v4 and v5 but rather v3-5. Or they might relegate v28, say, to a footnote or omit it completely depending on their source text reading.

And then you still need external indexing, because different traditions put different "books" in different orders, or even break or combine "books" in different ways. That information should be in the DB, not in the applications, shouldn't it? Or does OSIS already handle that somehow?

Will open-scriptures specify just one of the hierarchies mentioned in the email above (thus an OSIS subset) or allow or provide both?

It seems to me that an application that is international and universal that displays aligned parallel versions must get VERY complex?

Sorry for asking so many questions, but it seems to me that there's a lot more to this than first meets the eye.

Robert.

Kari Valkama

Dec 27, 2009, 8:59:09 AM

to open-sc...@googlegroups.com

Hi Robert,

Here is the OXES manual from SIL's Insite web site.

It would be interesting to know, if this is better than OSIS for our purposes.

Yours,

Kari

oxes 1.1.3.pdf

Kari Valkama

Dec 27, 2009, 9:02:46 AM

to open-sc...@googlegroups.com

Hi all,

I am sorry for sending this to the list.

It was supposed to go to just Robert.

I apologize.

The phrase "our purposes" refers to an open source version of BART, which is a biblical analysis tool made by SIL.

Yours,

Kari


Weston Ruter

Dec 27, 2009, 7:24:50 PM

to Robert Hunt, open-scriptures

Hi Robert, thanks for the good observations:

> But it seems to me that (OSIS) verse identifiers are not specific enough. Even some English versions disagree about whether some phrases are at the end of verse x or at the beginning of verse x+1. (Can't find a quick way to locate a specific example.) And then your mash wouldn't be so tidy. And then it's made even worse by versification systems from different traditions. It seems to me to be a wrong principle to base so much referencing around verse numbers, even if every versification system could be encoded into the OSIS file as per the above email. (Is that the intention?)

Yes, each word must be individually identified. Versification systems cannot be relied upon for alignment between texts, as you have said: works use different systems, have different orderings, and may lack a system altogether. Two words from arbitrary locations in a source and its translation should be linkable. This not only is advantageous for aligning a translation with its source, but also for storing cross-references such as quotations from the Old Testament in the New, which are often so loose as to simply be allusions which cannot be mapped one-to-one.
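
A minimal Python sketch of what word-level linking could look like, independent of any versification system. This is not the project's schema: the class names, the work labels ("GRK", "ENG"), and the use of a running word position as the identifier are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True, order=True)
class Token:
    """One word, identified by its work and running position, not by verse."""
    work: str
    position: int
    text: str


@dataclass
class Alignment:
    """Many-to-many links between tokens of two works."""
    links: set = field(default_factory=set)

    def link(self, source: Token, target: Token) -> None:
        self.links.add((source, target))

    def targets_of(self, source: Token) -> list:
        return sorted(t for s, t in self.links if s == source)


greek = [Token("GRK", i, w) for i, w in enumerate(["Ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγος"], 1)]
english = [Token("ENG", i, w) for i, w in enumerate(["In", "the", "beginning", "was", "the", "Word"], 1)]

alignment = Alignment()
alignment.link(greek[0], english[0])  # Ἐν -> In
alignment.link(greek[1], english[1])  # ἀρχῇ -> the
alignment.link(greek[1], english[2])  # ἀρχῇ -> beginning
alignment.link(greek[2], english[3])  # ἦν -> was

print([t.text for t in alignment.targets_of(greek[1])])
```

Because links are stored between individual tokens, one source word can point to several target words (or to none), and the same structure could hold looser cross-references such as Old Testament allusions in the New.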

This thread should provide more answers to your questions: http://groups.google.com/group/open-scriptures/browse_thread/thread/e5e0cb29497b0d1

Does this answer your concerns? Any areas for improvement? Again, thank you for offering the fresh perspective and for joining in on the discussion!

Weston


Weston Ruter

Dec 27, 2009, 7:30:07 PM

to Troy A. Griffitts, open-scriptures, David Eyk

Thanks for this helpful information, Troy. So it looks like this shadow=true attribute is equivalent to the "virtual" attributes in Crossway's XML schema?

2009/12/21 Troy A. Griffitts <scr...@crosswire.org>

>> <w lemma="Strong:G1234 louwNida:1.1a TDNT:23b ANLEX:νονσενς">νονσεντις</w>
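
The lemma attribute in the quoted line packs several prefixed identifiers into one space-separated value. A small Python sketch of splitting it apart, assuming each item has the form prefix:identifier and that whitespace never occurs inside an item (the function name is hypothetical):

```python
def parse_lemma(value):
    """Split a space-separated lemma attribute into {prefix: [identifiers]}."""
    result = {}
    for item in value.split():
        prefix, sep, ident = item.partition(":")
        if not sep:
            raise ValueError(f"missing prefix in lemma item: {item!r}")
        result.setdefault(prefix, []).append(ident)
    return result


parsed = parse_lemma("Strong:G1234 louwNida:1.1a TDNT:23b ANLEX:νονσενς")
print(parsed)
```

Each prefix maps to a list, so a word tagged with two numbers from the same work would keep both.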
