Unicode conversion of the CATSS LXXM text

Nathan Smith

Mar 6, 2013, 12:16:43 AM
to openscr...@googlegroups.com
I had been looking for a morphologically-tagged LXX for research and
came across the CATSS LXXM text [1]. The one thing lacking for my use of
this text was that it was in betacode and not in unicode.

By searching I have found that many people have taken this text and
converted it to unicode for embedding in web sites, but to my knowledge
nobody is publishing the equivalent plain text files. The Unbound Bible
comes closest, but it publishes the text and the morphological analysis
in two separate files, which is suboptimal. So I decided to embark on
converting the LXXM to unicode.

Luckily James Tauber has shared a Greek betacode to unicode conversion
script [2] which took care of most of the hard work for me. Using this,
I was able to convert all of the texts from betacode to unicode. I am
sharing the result as a git archive [3]. Please take a look.
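
For readers unfamiliar with betacode, here is a minimal sketch of what such a conversion involves. It is not James Tauber's script and not the code used to produce the archive. It assumes uppercase Latin letters for the Greek alphabet, diacritics written after the letter (or between `*` and the letter for capitals), and a single word per call with no trailing punctuation. The function name `beta_to_unicode` is made up for illustration.

```python
import unicodedata

# Betacode letter -> lowercase Greek letter (assumed uppercase-Latin convention).
LETTERS = {
    "A": "α", "B": "β", "G": "γ", "D": "δ", "E": "ε", "Z": "ζ",
    "H": "η", "Q": "θ", "I": "ι", "K": "κ", "L": "λ", "M": "μ",
    "N": "ν", "C": "ξ", "O": "ο", "P": "π", "R": "ρ", "S": "σ",
    "T": "τ", "U": "υ", "F": "φ", "X": "χ", "Y": "ψ", "W": "ω",
}

# Betacode diacritic -> Unicode combining character.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute accent
    "\\": "\u0300",  # grave accent
    "=": "\u0342",   # circumflex
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def beta_to_unicode(word):
    """Convert one betacode word to precomposed (NFC) Unicode Greek."""
    out = []
    pending = []      # diacritics seen after '*', held until the capital letter
    capital = False
    for ch in word:
        if ch == "*":
            capital = True
        elif ch in DIACRITICS:
            (pending if capital else out).append(DIACRITICS[ch])
        elif ch in LETTERS:
            letter = LETTERS[ch]
            out.append(letter.upper() if capital else letter)
            out.extend(pending)
            pending = []
            capital = False
        else:
            out.append(ch)  # pass unknown characters through unchanged
    # A sigma at the very end of the word becomes final sigma.
    if out and out[-1] == "σ":
        out[-1] = "ς"
    # Combine base letters and combining marks into precomposed characters.
    return unicodedata.normalize("NFC", "".join(out))
```

For example, `beta_to_unicode("LO/GOS")` gives λόγος and `beta_to_unicode("*)IHSOU=S")` gives Ἰησοῦς. A real converter also has to deal with punctuation, editorial symbols, and whatever conventions a particular file uses, which this sketch ignores.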

My long term goal is to edit the text to add some features and clean
things up a bit. Basically I'd like the text to look more like the
MorphGNT. A part of that may include adding this text to the Open
Scriptures project. This will probably result in a lot of interesting
discussion on this list (e.g. regarding LXX versification, which
includes prologues, song titles, abc verses, verse "13/14", etc.).

Please note that this resource has a rather novel license which
requires users to fill out a user declaration and send it in to the
CCAT program at the University of Pennsylvania (see
0-user-declaration.txt in the repo).

Let me know if you find anything to be corrected or have questions.

[1] http://ccat.sas.upenn.edu/gopher/text/religion/biblical/lxxmorph/
[2] http://jtauber.com/blog/2005/01/27/betacode_to_unicode_in_python/
[3] https://gitorious.org/lxxmorph-unicode

--
Nathan Smith
http://nathan.smithfam.info/
PGP key ID 0x147aed15

David Troidl

Mar 6, 2013, 9:39:11 AM
to openscr...@googlegroups.com
Hi Nathan,

I've been using the LXXM for years, privately. I had converted it to
Unicode, and marked it up in OSIS. The license is the major problem,
that basically prevents sharing it with an open license. The text they
have that joins the LXX to the MT is also of great interest, but with
the same license problem. If there were a truly open LXXM or merging of
the LXX with the MT, I'd be doing a lot of work with it.

Peace,

David

Nathan Smith

Mar 6, 2013, 2:52:54 PM
to openscr...@googlegroups.com
On 3/6/13 6:39 AM, David Troidl wrote:
> Hi Nathan,
>
> I've been using the LXXM for years, privately. I had converted it to
> Unicode, and marked it up in OSIS. The license is the major problem,
> that basically prevents sharing it with an open license. The text they
> have that joins the LXX to the MT is also of great interest, but with
> the same license problem. If there were a truly open LXXM or merging of
> the LXX with the MT, I'd be doing a lot of work with it.
>
> Peace,
>
> David

I am not a lawyer. :-)

The license definitely does not allow sharing with an open license. But
it does not prevent sharing altogether. In a way it is analogous to the
CC BY-NC-SA license: no commercial use, and redistribution needs to be
under the same terms. The kicker is of course the notification requirement.

What is not clear from the license is the stance on derivative works. I
am working on contacting Robert Kraft or someone who is active at CCAT
to see if they can offer clarification on whether or not a derivative
work (like an OSIS-formatted version) would be permissible for
redistribution.

Short of that, I am considering just distributing a script which would
download the files from CCAT, post the caveat about user declaration,
and do the conversion to the desired format. In this way we could
provide the improved format as a service without actually distributing
anything, thereby steering clear of copyright constraints.
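
A rough sketch of what the download step of such a script might look like, written for Python 3. The base URL is the CCAT directory from [1]; the function name `fetch_sources`, the notice wording, and the file names passed in are all assumptions for illustration, not an existing tool.

```python
import os
import urllib.request

# CCAT directory holding the LXXM sources (reference [1] above).
BASE_URL = "http://ccat.sas.upenn.edu/gopher/text/religion/biblical/lxxmorph/"

# Hypothetical wording; the real caveat should come from the license itself.
NOTICE = (
    "The CATSS LXXM files are licensed by CCAT. Each user is required to\n"
    "fill out a user declaration and send it to CCAT before using the text."
)


def fetch_sources(names, dest_dir, opener=urllib.request.urlopen):
    """Print the license caveat, then download each named file into dest_dir.

    `names` are file names relative to BASE_URL, supplied by the caller.
    `opener` can be replaced for testing without network access.
    Returns the list of local paths written.
    """
    print(NOTICE)
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for name in names:
        with opener(BASE_URL + name) as response:
            data = response.read()
        path = os.path.join(dest_dir, name)
        with open(path, "wb") as handle:
            handle.write(data)
        paths.append(path)
    return paths
```

The point of the design is that every user fetches the files from CCAT directly and sees the declaration requirement first; the script itself ships none of the text.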

I'll refrain from publishing any derivative works until I have some
clarification from the copyright holders.

--
Nathan Smith
http://thelibrarybasement.com/

Jesse Griffin

Mar 6, 2013, 9:57:22 PM
to openscr...@googlegroups.com
On Wed, Mar 6, 2013 at 12:52 PM, Nathan Smith <nat...@smithfam.info> wrote:
> Short of that, I am considering just distributing a script which would
> download the files from CCAT, post the caveat about user declaration,
> and do the conversion to the desired format. In this way we could
> provide the improved format as a service without actually distributing
> anything, thereby steering clear of copyright constraints.

That sounds like a great idea. Even if you are able to get the permission to redistribute, having the scripts available is a big win for someone who wants to do something slightly different.

Thank you,
Jesse Griffin

Nathan Smith

Mar 13, 2013, 1:20:27 AM
to openscr...@googlegroups.com
I am in the midst of some correspondence with the folks behind this
text to submit the corrections I found and see if they are interested
in hosting the unicode version on the CCAT server.

In the meantime, I have put the script I used up in a git repo:
lxxm-convert.py in
https://gitorious.org/biblical-studies/biblical-studies/trees/master/catss

If you have Python 2.7, the script handles downloading the
sources, applying patches, converting to unicode, and renaming the
files (since some files were split to fit onto 1.4MB floppies,
apparently).
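
As an illustration of the rejoining step only (this is not taken from lxxm-convert.py), a split book can be put back together by concatenating its parts in order. The function name `join_parts` and the file names in the usage note are made up.

```python
import shutil


def join_parts(part_paths, dest_path):
    """Concatenate the given files, in the order listed, into dest_path."""
    with open(dest_path, "wb") as dest:
        for part in part_paths:
            with open(part, "rb") as src:
                # Stream each part so large files are not held in memory.
                shutil.copyfileobj(src, dest)
    return dest_path
```

A call would look something like `join_parts(["Gen.1.txt", "Gen.2.txt"], "Gen.txt")`, with the caller responsible for listing the parts in the right order.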

I plan on making some improvements, including adding some different
outputs. But this is a start, and can provide a basis for providing an
"improved" LXXM text without actually distributing the sources.

--
Nathan D. Smith