Working with nltk.Tree and VSO languages


Peter Bekins

Jun 1, 2017, 9:30:36 AM
to nltk-users
Hello,

I work with Classical Hebrew and would like to experiment with using NLP techniques for corpus analysis over the summer. I have worked through the NLTK book over the last couple of weeks and am ready to play with some real data.

I have a set of texts tagged using a hybrid phrase/dependency schema (don't ask; it wasn't my project). For instance:

[N w [P1 yamr ] [S yhwh ] [P1 [C l nj ]]]
[N and [P1 said ] [S the Lord ] [P1 [C to Noah ]]]
'And the LORD said to Noah'

where: N = sentence; P = predicate phrase; S = subject; C = complement

Classical Hebrew has a high proportion of VS clauses like this one. The above schema solves the problem of discontinuity by co-indexing the predicate phrase (P1...P1). In the future I may write a script to convert to a more pure dependency grammar, but for now I would like to play with the tagged texts as they are. 

My question then is whether there is a best practice for using nltk.Tree with languages having free word order? I haven't found much help googling, so perhaps it is just avoided? For now I have been making the trees as they are (i.e., two P1 nodes under N) and accounting for the discontinuity in the scripts I use to traverse the tree, but it gets a little messy in more complicated examples. 
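
For readers following along, here is a minimal sketch of loading the example above into nltk.Tree exactly as annotated, with two sibling P1 nodes under N. It assumes the corpus lines can be fed to `Tree.fromstring` with square brackets as delimiters; the real corpus files may need preprocessing first.

```python
from nltk import Tree

# The bracketed example from the post, read with "[" and "]" as delimiters.
source = "[N w [P1 yamr ] [S yhwh ] [P1 [C l nj ]]]"
tree = Tree.fromstring(source, brackets="[]")

# Labels of N's immediate children; bare leaves are kept as strings.
child_labels = [c.label() if isinstance(c, Tree) else c for c in tree]
print(child_labels)

# Leaves come back in surface (VS) order.
print(tree.leaves())
```

The discontinuity is visible in `child_labels`: P1 appears twice, once on each side of S.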

Many thanks,
Peter

Alex Rudnick

Jun 1, 2017, 2:49:30 PM
to nltk-...@googlegroups.com
Hey Peter,

I guess it depends on the concrete thing you're trying to do! But also
it's more of a theoretical/syntactic question than an NLTK question as
such, you know?

If you believe the syntactic relationships you're dealing with can be
described by trees, then NLTK trees should be OK for the purpose --
they're really very general. Basically just nested lists.
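
A small sketch of the "nested lists" point, using the example sentence from the first post: nltk.Tree subclasses Python's list, so ordinary indexing works, and a tuple index walks down several levels at once.

```python
from nltk import Tree

tree = Tree.fromstring("[N w [P1 yamr ] [S yhwh ] [P1 [C l nj ]]]", brackets="[]")

is_list = isinstance(tree, list)   # Tree is a list subclass
first_child = tree[0]              # a bare leaf (a string)
second_p1 = tree[3]                # the second P1 subtree
deep_leaf = tree[3, 0, 1]          # tuple index: P1 -> C -> second leaf

# Paths from the root to every leaf, in left-to-right order.
leaf_paths = tree.treepositions("leaves")
print(is_list, first_child, deep_leaf, leaf_paths)
```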

Can you say a little more about what you want to do with the data set?



--
-- alexr

Peter Bekins

Jun 3, 2017, 4:19:18 AM
to nltk-users

Alex, thanks for the help

On Thursday, June 1, 2017 at 2:49:30 PM UTC-4, Alex Rudnick wrote:
> Hey Peter,
>
> I guess it depends on the concrete thing you're trying to do! But also
> it's more of a theoretical/syntactic question than an NLTK question as
> such, you know?

I am interested in verbal semantics and argument structure, so I would like to first pull a list of verbs out of the corpus, with a list of arguments and their prepositional marking (if any) for each verb. I would then like to see how well an algorithm might do at assigning basic semantic roles.  
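
One way the extraction step might look, as a rough sketch rather than a tested method. It assumes that the first bare word inside a P phrase is the verb, that a trailing digit on a label is a co-index, and that prepositions can be recognised from a word list. `PREPOSITIONS`, `base_label`, and `clause_frame` are hypothetical names, and only "l" ('to') comes from the example.

```python
import re
from nltk import Tree

PREPOSITIONS = {"l"}  # hypothetical list; only "l" ('to') is attested in the example


def base_label(label):
    """Strip a trailing co-index: 'P1' -> 'P'."""
    return re.sub(r"\d+$", "", label)


def clause_frame(clause):
    """Collect the verb and its S/C arguments from one N clause."""
    frame = {"verb": None, "args": []}

    def visit(node, inside_predicate):
        for child in node:
            if not isinstance(child, Tree):
                # Assumption: the first bare word inside a P phrase is the verb.
                if inside_predicate and frame["verb"] is None:
                    frame["verb"] = child
                continue
            role = base_label(child.label())
            if role == "P":
                visit(child, True)
            elif role in ("S", "C"):
                words = child.leaves()
                prep = words[0] if words and words[0] in PREPOSITIONS else None
                frame["args"].append((role, prep, words[1:] if prep else words))
            # Embedded N clauses are deliberately skipped in this sketch.

    visit(clause, False)
    return frame


tree = Tree.fromstring("[N w [P1 yamr ] [S yhwh ] [P1 [C l nj ]]]", brackets="[]")
frame = clause_frame(tree)
print(frame)
```

Because both P1 pieces are visited, the complement inside the second P1 ends up attached to the verb found in the first one without restructuring the tree.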
 
> If you believe the syntactic relationships you're dealing with can be
> described by trees, then NLTK trees should be OK for the purpose --
> they're really very general. Basically just nested lists.


Right. So a dependency tree would be better, but as I said, I have a corpus already tagged according to phrase structure. This works fine, *if* branches can cross, but I don't see any way in the nltk tree object to accommodate crossing branches. So for the example I gave earlier:

     [N w [P1 yamr ] [S yhwh ] [P1 [C l nj ]]]
     'And said the Lord to Noah'

I need to be able to read the structure as if it were:

    [N w                                              'And'
        [P yamr                     [C l nj ]]    'He said to Noah'
                       [S yhwh]]                    'The Lord'
    ]

Does this make sense? For now, I wrote a script to treat two P1 subtrees as a single tree if they are at the same height. This seems to work okay, but I haven't worked through whether there might be problems as the trees start to get more complicated (i.e., with multiple embedded clauses). 

I was just wondering if there was a more natural way to do this.  
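
A hedged sketch of the merging idea described above, not the actual script: it treats labels ending in a digit as co-indexed, merges only siblings under the same parent (a narrower condition than "same height"), and drops the index from the merged label. `merge_coindexed` is a hypothetical helper name.

```python
import re
from nltk import Tree

INDEXED = re.compile(r"^([A-Za-z]+)\d+$")  # assumed label shape, e.g. "P1"


def merge_coindexed(tree):
    """Return a copy in which co-indexed sibling subtrees are merged into one."""
    if not isinstance(tree, Tree):
        return tree  # a leaf
    merged = {}      # indexed label -> the subtree collecting all its pieces
    children = []
    for child in tree:
        child = merge_coindexed(child)
        if isinstance(child, Tree):
            match = INDEXED.match(child.label())
            if match:
                key = child.label()
                if key in merged:
                    # Later piece: append its children to the first piece.
                    merged[key].extend(child)
                    continue
                # First piece: keep it, with the index stripped from the label.
                child = Tree(match.group(1), list(child))
                merged[key] = child
        children.append(child)
    return Tree(tree.label(), children)


tree = Tree.fromstring("[N w [P1 yamr ] [S yhwh ] [P1 [C l nj ]]]", brackets="[]")
result = merge_coindexed(tree)
print(result)
```

Two caveats with this approach: the leaves of the merged tree no longer come out in surface order, and pieces of one phrase that sit under different parents (as could happen with embedded clauses) are not joined.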

Thanks!
Peter 

 

Alex Rudnick

Jun 5, 2017, 1:59:04 PM
to nltk-...@googlegroups.com
Ooh, crossing dependencies, interesting!

So as I understand it, if you've got crossing dependency branches,
then that really cannot be represented with trees (or equivalently,
nested phrases) -- it sounds like more of a problem with trees in
general than with NLTK trees particularly! A more expert syntax person
can maybe correct/clarify if I'm misremembering.
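
To make the point concrete, here is a small sketch that assumes each leaf is stored as a hypothetical (surface position, word) pair. A phrase counts as discontinuous when the positions it covers have a gap, which is exactly what nesting over the surface string cannot express.

```python
from nltk import Tree


def is_discontinuous(subtree):
    """True if the surface positions covered by this phrase have a gap."""
    positions = sorted(pos for pos, _word in subtree.leaves())
    return positions != list(range(positions[0], positions[-1] + 1))


# The intended structure from the thread, with surface positions on the leaves.
clause = Tree("N", [
    (0, "w"),
    Tree("P", [(1, "yamr"), Tree("C", [(3, "l"), (4, "nj")])]),
    Tree("S", [(2, "yhwh")]),
])

report = {st.label(): is_discontinuous(st) for st in clause.subtrees()}
print(report)
```

Only P is flagged: it covers positions 1, 3 and 4, with the subject at position 2 sitting in the gap.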

You might be able to do something that works in practice, but the
formalism seems wrong for your problem...

On Fri, Jun 2, 2017 at 4:40 AM, Peter Bekins <peter....@gmail.com> wrote:
> Right. So a dependency tree would be better, but as I said, I have a corpus
> already tagged according to phrase structure. This works fine, *if* branches
> can cross, but I don't see any way in the nltk tree object to accommodate
> crossing branches. So for the example I gave earlier:
>
> [N w [P1 yamr ] [S yhwh ] [P1 [C l nj ]]]
> 'And said the Lord to Noah'
>
> I need to be able to read the structure as if it were:
>
> [N w 'And'
> [P yamr [C l nj ]] 'He said to Noah'
> [S yhwh]] 'The Lord'
> ]
>
> Does this make sense? For now, I wrote a script to treat two P1 subtrees as
> a single tree if they are at the same height. This seems to work okay, but I
> haven't worked through whether there might be problems as the trees start to
> get more complicated (i.e., with multiple embedded clauses).

--
-- alexr

Dimitriadis, A. (Alexis)

Jun 23, 2017, 6:01:56 PM
to nltk-...@googlegroups.com
Crossing dependencies are nothing new, even for English. We use trees anyway because they express the main principle of sentence organization. The Penn Treebank expresses non-nested dependencies by means of special “trace” nodes. That’s probably similar to what you’re doing.  The Alpino corpus of Dutch uses a different approach: Its trees show words in nested dependency order, so that if you print out the leaves in order you get nonsense. (Each leaf includes its linear position as an attribute.) Both of these corpora are available with the nltk, so you can poke around and see what you think. (They’re named “treebank” and “alpino”).
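
Both conventions can be imitated on hand-built trees without downloading either corpus. The sketch below is illustrative only: the English sentence is an invented example in Penn bracketing style, and the second tree imitates the idea of leaves carrying their linear position rather than the actual Alpino file format.

```python
from nltk import Tree

# Penn style: a "-NONE-" node holds a trace co-indexed with the moved phrase.
penn = Tree.fromstring(
    "(SBARQ (WHNP-1 (WP What)) (SQ (VBD did) (NP-SBJ (PRP you)) "
    "(VP (VB say) (NP (-NONE- *T*-1)))))"
)
traces = [st[0] for st in penn.subtrees() if st.label() == "-NONE-"]
fillers = [st.label() for st in penn.subtrees() if st.label().endswith("-1")]
print(traces, fillers)

# Position-on-leaf style: the tree is in dependency order, and each leaf is a
# (surface position, word) pair, so the surface string is recovered by sorting.
dep_order = Tree("N", [
    (0, "w"),
    Tree("P", [(1, "yamr"), Tree("C", [(3, "l"), (4, "nj")])]),
    Tree("S", [(2, "yhwh")]),
])
tree_order = [word for _pos, word in dep_order.leaves()]
surface_order = [word for _pos, word in sorted(dep_order.leaves())]
print(tree_order, surface_order)
```

The second style keeps the predicate as one subtree, at the cost of sorting the leaves whenever the surface string is needed.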

Best,

Alexis


Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis
