NLTK + SBLGNT


Seth Washeck

Apr 19, 2012, 5:04:47 PM
to openscr...@googlegroups.com
I'm assuming that this is the "go-to" place for things along the lines of analyzing a Greek NT corpus with NLTK. The problem is, though, that I have no idea how to even begin doing this. I was wondering if I could get some pointers on doing this. 

From what I understand, the real value of such an activity would come from having a treebank. Does that sound correct?

Nathan Smith

Feb 24, 2013, 6:46:04 PM
to openscr...@googlegroups.com
On Thursday, April 19, 2012 2:04:47 PM UTC-7, Seth Washeck wrote:
> I'm assuming that this is the "go-to" place for things along the lines of analyzing a Greek NT corpus with NLTK. The problem is, though, that I have no idea how to even begin doing this. I was wondering if I could get some pointers on doing this.
>
> From what I understand, the real value of such an activity would come from having a treebank. Does that sound correct?

 Sorry for the very late reply. :-)

I have taken the first step towards using the SBLGNT within NLTK today. Basically I have loaded the plain text into NLTK as a corpus [0]. This gives some basic natural language processing utility for this text.

However, you are right, in order to get more out of it, we'll need more advanced resources. I plan on taking a crack at using the new SBLGNT-based MorphGNT to create a tagged corpus. As for a treebank, with its syntactical analysis, we'd either need an existing free, electronic resource (anyone know of such a thing?), or one would have to be created. And as many who have studied NT grammar know, diagramming Greek sentences can be a subjective affair. :-)

[0] http://thelibrarybasement.com/2013/02/24/prep-the-sblgnt-for-use-as-an-nltk-corpus/
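
For anyone following along, loading a directory of plain-text files into NLTK might look roughly like the sketch below. It is not the code from the blog post above: the temporary directory, the file name `john.txt`, and the one-verse sample are stand-ins so that the example runs on its own.

```python
import os
import tempfile

from nltk import FreqDist
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Stand-in data: a single verse (John 1:1) written to a temporary directory.
# A real setup would point `root` at a folder of SBLGNT text files instead.
root = tempfile.mkdtemp()
with open(os.path.join(root, "john.txt"), "w", encoding="utf-8") as f:
    f.write("Ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν, καὶ θεὸς ἦν ὁ λόγος.\n")

# Every file whose name matches the pattern becomes one document in the corpus.
corpus = PlaintextCorpusReader(root, r".*\.txt", encoding="utf-8")

# The default word tokenizer separates words from punctuation.
words = corpus.words("john.txt")

# FreqDist maps each token to the number of times it occurs.
freq = FreqDist(words)

print(corpus.fileids())
print(len(words), "tokens;", freq["λόγος"], "occurrences of λόγος")
```

Only `words()` is used here because it needs no extra NLTK data downloads; sentence splitting would need a sentence tokenizer suited to Greek punctuation.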

--
Nathan Smith

Michael Aubrey

Feb 24, 2013, 7:30:15 PM
to openscr...@googlegroups.com
> However, you are right, in order to get more out of it, we'll need more advanced resources.
> I plan on taking a crack at using the new SBLGNT-based MorphGNT to create a tagged
> corpus. As for a treebank, with its syntactical analysis, we'd either need an existing free,
> electronic resource (anyone know of such a thing?), or one would have to be created. And
> as many who have studied NT grammar know, diagramming Greek sentences can be a
> subjective affair. :-)

Greek grammars hugely overemphasize the subjectivity of diagramming Greek syntax. It's as subjective as in any other language. And as in the vast majority of languages, the status of constituents at the level of grammatical relations in Greek is quite straightforward. The real problem isn't the subjectivity but the nature of the internal phrase structure itself, particularly in the more verbose authors, whose NPs can involve huge amounts of recursion.

There is no freely available treebank. This is partly because (1) few people have the ability to create one in the first place and (2) even fewer (i.e. none) have the time to do it for free. Crowd-sourcing has potential, but it would require significant training for the workers, and I'm not sure how many scholars would trust such a treebank to be reliable without serious constraints and oversight.

The most open treebank for the NT that I know of is probably the PROIEL database: http://www.hf.uio.no/ifikk/english/research/projects/proiel/

Mike Aubrey

Nathan Smith

Feb 25, 2013, 12:10:58 AM
to openscr...@googlegroups.com
On Sun, 2013-02-24 at 16:30 -0800, Michael Aubrey wrote:

> Greek grammars hugely overemphasize the subjectivity of diagramming
> Greek syntax. It's as subjective as any other language. And like the
> vast majority of languages, the status of constituents at the level of
> grammatical relations in Greek is quite straight forward. The real
> problem isn't the subjectivity, but the nature of the internal phrase
> structure itself, particularly in the more verbose authors whose NP's
> can involve huge amounts of recursion.
>
>
> There is no freely available tree bank. This is partially due to the
> fact that (1) there are few people who have the ability to create one
> in the first place and (2) even fewer (i.e. none) who have the time to
> do it for free. Crowd-sourcing has potential, but it would require
> significant training for the workers and I'm not sure how many
> scholars would trust such a treebank to be reliable without serious
> constraints and oversight.
>
>
>
> The most open Treebank for the NT that I know of is probably the
> PROIEL database:
> http://www.hf.uio.no/ifikk/english/research/projects/proiel/
>
>

Thanks Mike. I think the subjectivity I have perceived probably comes
from the need for training which you mentioned.

Short of a full treebank, are you aware of anyone having published a
context-free grammar of Koine Greek?

--
Nathan Smith

Dag Haug

Feb 25, 2013, 2:28:35 AM
to openscr...@googlegroups.com
Hi,

> The most open Treebank for the NT that I know of is probably the
> PROIEL database:
> http://www.hf.uio.no/ifikk/english/research/projects/proiel/
>
>
I think we are pretty close to being an open treebank for the NT. There
are two limitations:

1) It's CC BY-NC-SA. I know the NC bothers some people, but it should be
ok for your purposes?

2) It's not complete. But it covers 112172 of 137680 tokens, including
all of the gospels. And it has more than just syntax: we just released
the information structure annotation, which includes coreference
resolution in the gospels. As for the syntax, I hope we will finish it
this year.

If you are interested, it can be downloaded in various formats from the
following URLs:

Proprietary "proiel"-xml (this is the only version with coreference
resolution):
http://foni.uio.no:3000/exports/gnt-greeknt.xml
Tiger-xml
http://foni.uio.no:3000/exports/gnt-greeknt-tiger.xml
conll:
http://foni.uio.no:3000/exports/gnt-greeknt.conll


Although you need to register to enter the treebank pages, you can
download data from these sources without registering.
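
The CoNLL export is probably the easiest of the three to read without special tooling. Here is a minimal sketch, assuming the file follows the usual ten-column, tab-separated CoNLL-X layout with a blank line between sentences; that layout is an assumption here, not something confirmed in this thread. The sample rows, including their tags and relation labels, are made up for illustration and are not copied from the treebank.

```python
from collections import namedtuple

# Column names of the CoNLL-X layout (assumed for this export).
Token = namedtuple(
    "Token", "id form lemma cpos pos feats head deprel phead pdeprel"
)


def read_conll(lines):
    """Yield one list of Token per sentence from tab-separated CoNLL-X lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():              # a blank line closes the sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        cols += ["_"] * (10 - len(cols))  # pad rows that have fewer columns
        sentence.append(Token(*cols[:10]))
    if sentence:                          # last sentence without trailing blank
        yield sentence


def dependents(sentence, head_id):
    """Return the word forms whose head is the token with the given id."""
    return [tok.form for tok in sentence if tok.head == head_id]


# Invented sample: one short sentence, hypothetical tags and relations.
SAMPLE = (
    "1\tἐδάκρυσεν\tδακρύω\tV\tV-\t_\t0\tpred\t_\t_\n"
    "2\tὁ\tὁ\tS\tS-\t_\t3\taux\t_\t_\n"
    "3\tἸησοῦς\tἸησοῦς\tN\tNe\t_\t1\tsub\t_\t_\n"
    "\n"
)

sentences = list(read_conll(SAMPLE.splitlines()))
root = [tok for tok in sentences[0] if tok.head == "0"][0]
print(root.form, "->", dependents(sentences[0], root.id))
```

Since `read_conll` takes any iterable of lines, an open file handle for a downloaded export can be passed in directly.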

All best,
Dag





Nathan Smith

Feb 25, 2013, 10:51:31 AM
to openscr...@googlegroups.com
Thanks for the follow-up. This looks like a really fabulous resource!
The non-commercial clause should not be a problem. I'm glad to hear
about this project.

--
Nathan Smith

Nathan D. Smith

Mar 13, 2013, 5:58:28 PM
to openscr...@googlegroups.com
On 2/24/13 3:46 PM, Nathan Smith wrote:
> However, you are right, in order to get more out of it, we'll need more
> advanced resources. I plan on taking a crack at using the new
> SBLGNT-based MorphGNT to create a tagged corpus.

I have created an sblgnt-corpus archive which can be easily loaded as a
tagged, categorized corpus into NLTK.

Repo: https://gitorious.org/sblgnt-corpus
Blog with examples:
http://thelibrarybasement.com/2013/03/13/a-categorized-tagged-greek-new-testament-corpus/
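
The rough idea behind a tagged corpus like this is to turn each MorphGNT row into a `word/TAG` token, which is the notation NLTK's tagged corpus readers split on by default. The sketch below is not taken from the sblgnt-corpus repo. It assumes whitespace-separated MorphGNT columns in the order reference, part of speech, parsing code, text, word, normalized form, lemma; the function name and the choice to join part of speech and parsing code into a single tag are made up for illustration.

```python
def morphgnt_to_tagged(lines):
    """Turn MorphGNT-style rows into one 'word/TAG ...' line per verse."""
    verses = {}  # verse reference -> list of word/TAG tokens, in input order
    for line in lines:
        cols = line.split()
        if len(cols) < 5:        # skip blank or malformed rows
            continue
        ref, pos, parse, word = cols[0], cols[1], cols[2], cols[4]
        tag = pos + parse        # e.g. "N-" + "----NSF-" -> "N-----NSF-"
        verses.setdefault(ref, []).append(f"{word}/{tag}")
    return [" ".join(tokens) for tokens in verses.values()]


# Sample rows written in the assumed column layout (opening of Matthew 1:1).
ROWS = [
    "010101 N- ----NSF- Βίβλος Βίβλος βίβλος βίβλος",
    "010101 N- ----GSF- γενέσεως γενέσεως γενέσεως γένεσις",
    "010101 N- ----GSM- Ἰησοῦ Ἰησοῦ Ἰησοῦ Ἰησοῦς",
]

tagged_lines = morphgnt_to_tagged(ROWS)
for line in tagged_lines:
    print(line)
```

Writing one such line per verse into a text file per book, with the books sorted into folders by category, would give a layout that a categorized tagged corpus reader can work with.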

I think I'll take a look at working with the PROIEL corpus next.

--
Nathan D. Smith
http://nathan.smithfam.info/
PGP key ID 0x147aed15

jonatha...@gmail.com

Sep 6, 2013, 7:35:46 AM
to openscr...@googlegroups.com
On Sunday, February 24, 2013 6:46:04 PM UTC-5, Nathan Smith wrote:
> However, you are right, in order to get more out of it, we'll need more advanced resources. I plan on taking a crack at using the new SBLGNT-based MorphGNT to create a tagged corpus. As for a treebank, with its syntactical analysis, we'd either need an existing free, electronic resource (anyone know of such a thing?), or one would have to be created. And as many who have studied NT grammar know, diagramming Greek sentences can be a subjective affair. :-)
>
> [0] http://thelibrarybasement.com/2013/02/24/prep-the-sblgnt-for-use-as-an-nltk-corpus/

biblicalhumanities.org has released syntax trees for the Greek New Testament. I believe these are high quality:


Jonathan 

jonatha...@gmail.com

Sep 6, 2013, 7:38:29 AM
to openscr...@googlegroups.com
On Wednesday, March 13, 2013 5:58:28 PM UTC-4, Nathan Smith wrote:
> On 2/24/13 3:46 PM, Nathan Smith wrote:
>> However, you are right, in order to get more out of it, we'll need more
>> advanced resources. I plan on taking a crack at using the new
>> SBLGNT-based MorphGNT to create a tagged corpus.
>
> I have created an sblgnt-corpus archive which can be easily loaded as a
> tagged, categorized corpus into NLTK.
>
> Repo: https://gitorious.org/sblgnt-corpus
> Blog with examples:
> http://thelibrarybasement.com/2013/03/13/a-categorized-tagged-greek-new-testament-corpus/

This is great!

How can we leverage the biblicalhumanities.org syntax trees together with this? I've just started playing with NLTK, and I'd love to be able to use it with the GNT.

Jonathan 

Nathan Bierma

Sep 12, 2013, 10:30:34 AM
to openscr...@googlegroups.com, jonatha...@gmail.com
The syntax trees look like a fantastic resource; many thanks for sharing. What options or applications are there for displaying trees to users? Adding clause data as attributes in marked-up HTML, to which CSS indenting rules are assigned? A JS visualization library like D3 (which I think uses JSON)? I'm interested in the possibilities.
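
Both ideas need the tree in a nested form first. Here is a minimal sketch, assuming a toy tree written as nested Python tuples; the labels `CL`, `VP`, `NP` and the helper names are made up, and the real syntax trees will come in their own format. One function builds the `name`/`children` objects commonly used in D3 hierarchy examples, and the other emits nested HTML lists that CSS rules can indent.

```python
import json
from html import escape

# Toy tree: (label, child, child, ...); a plain string is a word.
TREE = ("CL", ("VP", "ἐδάκρυσεν"), ("NP", "ὁ", "Ἰησοῦς"))


def to_d3(node):
    """Convert the tuple tree into nested dicts with name/children keys."""
    if isinstance(node, str):
        return {"name": node}
    label, *kids = node
    return {"name": label, "children": [to_d3(kid) for kid in kids]}


def to_html(node):
    """Convert the tuple tree into nested <li>/<ul> markup, one class per label."""
    if isinstance(node, str):
        return f'<li class="word">{escape(node)}</li>'
    label, *kids = node
    inner = "".join(to_html(kid) for kid in kids)
    return f'<li class="{escape(label)}">{escape(label)}<ul>{inner}</ul></li>'


# JSON for a JS library, keeping the Greek readable instead of \u escapes.
print(json.dumps(to_d3(TREE), ensure_ascii=False, indent=2))

# Nested lists for plain HTML + CSS display.
print("<ul>" + to_html(TREE) + "</ul>")
```

With the HTML variant, a rule such as `ul { padding-left: 1.5em; }` would be enough to get the indenting, and per-label classes leave room for styling clauses and phrases differently.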


Jonathan Robie

Sep 12, 2013, 11:27:19 AM
to Nathan Bierma, openscr...@googlegroups.com
The data is there to be played with; the best way to display it probably depends on the environment in which it is being displayed.

We are currently working on SVG representations. You can see some alternative representations (PDF using tikz-qtree, an HTML representation, a text-based representation) here:

http://biblicalhumanities.org/viewtopic.php?f=34&t=135

I hadn't thought of using CSS indenting rules like this, and D3 might well be worth exploring for this.

What do you think the requirements should be? Feel like mocking up an approach you like and sharing it?

Jonathan

