Cohort merging


lupu...@gmail.com

Oct 27, 2013, 12:31:04 PM
to constrain...@googlegroups.com
Hello,

I am currently trying to use VISLCG in our Russian NLP project at the Ural Federal University (Yekaterinburg, Russia).
Could you please answer the following question? We can add, remove, and reorder cohorts, but how do we merge two wordforms?
For example, I wish to write a rule (or rules) that merges "will" + any infinitive into a single wordform (i.e. "<will>" + "<read>" -> "<will read>", etc.).
The contextual test is simple, but we need to read the current wordform, and I couldn't find any way to do that.
Is it possible to merge a collocation into a single wordform/cohort in VISLCG using general rules?

Thanks for any response,
Yury Lukach

Francis Tyers

Oct 28, 2013, 12:14:18 PM
to constrain...@googlegroups.com
On Sun, 27 Oct 2013 at 09:31 -0700, lupu...@gmail.com wrote:
I would probably just do something like:


SECTION

SUBSTITUTE (inf) (fut) (inf) (-1 ("will"));
REMCOHORT ("will") (1 (fut));

$ echo "^will/will<vaux>$ ^read/read<vblex><inf>$" | cg-conv -a | vislcg3 --trace --grammar /tmp/rule
; "<will>"
;	"will" vaux REMCOHORT:5
"<read>"
	"read" vblex fut SUBSTITUTE:4

Regards,

Fran


lupu...@gmail.com

Oct 28, 2013, 4:02:27 PM
to constrain...@googlegroups.com
Thank you, Fran, but this approach corrupts the original text ("will" is lost).
Almost all Indo-European languages have both analytic and synthetic wordforms, so it is natural to merge analytic forms and thereby reduce the complexity of sentences.

Another small but irritating problem: VISLCG currently does not skip the BOM in UTF-8 files, although most Windows text editors save files with a BOM.
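
Until the BOM is handled in VISLCG itself, it can be stripped in a preprocessing step. Below is a minimal Python sketch, assuming the grammar or input file is processed as raw bytes. The names `strip_utf8_bom` and `strip_bom_in_file` are made up for illustration and are not part of VISLCG.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, dropping a leading UTF-8 BOM (hypothetical helper)."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(strip_utf8_bom(data))
```

When the file is read as text rather than bytes, opening it with `encoding="utf-8-sig"` should have the same effect.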

Best regards,
Yury

Francis Tyers

Oct 28, 2013, 5:12:25 PM
to constrain...@googlegroups.com
On Mon, 28 Oct 2013 at 13:02 -0700, lupu...@gmail.com wrote:
> Thank you, Fran, but this approach corrupts the original text ("will"
> is lost).
> Almost all Indo-European languages have both analytic and synthetic
> wordforms, so it is natural to merge analytic forms and thereby
> reduce the complexity of sentences.

In that case, I would do this in the morphological analyser, or a
preprocessing stage.

Fran


lupu...@gmail.com

Oct 28, 2013, 9:20:14 PM
to constrain...@googlegroups.com
That is fine when parsing English texts, but in Slavic languages such analytic forms can be disjoint, and we have to analyze both the left and right contexts.

Francis Tyers

Oct 29, 2013, 9:35:50 AM
to constrain...@googlegroups.com
On Mon, 28 Oct 2013 at 18:20 -0700, lupu...@gmail.com wrote:
> No doubt when you parse English texts, but in Slavic languages such
> analytic forms can be disjoint. And we have to analyze left and right
> contexts.

You mean like discontiguous NPs (1) and embedded clitic pronouns (2) ?

1) Интересную они предложили моей дочке работу.
|_____________________________________|

2) Taj mi je pesnik napisao knjigu.
|__________|

In this case if you want to reorder them to some "canonical" order, then
I'd write a dependency parser with CG and do some move rules[1] on the
dependency tree.

Although it really depends on what your final application is. Any tips ?

Fran

1. http://beta.visl.sdu.dk/cg3/single/#keyword-move

Trosterud Trond

Oct 29, 2013, 10:35:37 AM
to constrain...@googlegroups.com

Francis Tyers <spe...@ivixor.net> wrote on 29 Oct 2013 at 15:35:

> On Mon, 28 Oct 2013 at 18:20 -0700, lupu...@gmail.com wrote:
>> No doubt when you parse English texts, but in Slavic languages such
>> analytic forms can be disjoint. And we have to analyze left and right
>> contexts.

>
> 1) Интересную они предложили моей дочке работу.
> |_____________________________________|
> 2) Taj mi je pesnik napisao knjigu.
> |__________|


It seems to me you are looking at the wrong application here.

On a morphological and syntactical level, Aux + V, V + Particle, Det + N, etc. are distinct. Now, for semantic reasons, you might want to join them, consider them as one, or whatever.

This should not be done in the syntactic analysis of vislcg3. Here we want to say that "ok, these are the words we get as input, what are they and how do they interact with each other?"

If, as you say, you have disjoint analytic forms ("some _dis-_ and in other ways badly _located_ forms", or "We _roll_ our analysis _out_"), then I would analyse them one at a time and then tag them, e.g. "out" with a tag like @V<Pcle (particle linked to its mother verb) combined with a dependency tag like #5->2. The eventual merge (dislocated, roll out) could then be done at lib (or has in a way already been done), but I do not think the CG ruleset itself should perform this merge.
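
A downstream merge of that kind could be sketched in Python as follows. This is only an illustration under stated assumptions: the input is disambiguated CG-3 output with one reading per cohort, dependency tags have the form #self->parent, and apart from @V<Pcle and #5->2 every tag in the sample is invented. `parse_readings` and `join_linked` are hypothetical helper names.

```python
import re
from typing import Dict, List

DEP_RE = re.compile(r"^#(\d+)->(\d+)$")

# Invented sample analysis of "We roll our analysis out".
SAMPLE = (
    '"<We>"\n\t"we" Pron @SUBJ #1->2\n'
    '"<roll>"\n\t"roll" V Pres @FMV #2->0\n'
    '"<our>"\n\t"our" Det @>N #3->4\n'
    '"<analysis>"\n\t"analysis" N Sg @OBJ #4->2\n'
    '"<out>"\n\t"out" Pcle @V<Pcle #5->2\n'
)


def parse_readings(text: str) -> List[dict]:
    """Read one reading per cohort and extract its dependency tag, if any."""
    tokens = []
    wordform = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"<') and stripped.endswith('>"'):
            wordform = stripped[2:-2]
        elif stripped.startswith('"') and wordform is not None:
            baseform, _, rest = stripped[1:].partition('"')
            tags = rest.split()
            dep = next((m for m in map(DEP_RE.match, tags) if m), None)
            tokens.append({
                "wordform": wordform,
                "baseform": baseform,
                "tags": tags,
                "self": int(dep.group(1)) if dep else None,
                "parent": int(dep.group(2)) if dep else None,
            })
            wordform = None
    return tokens


def join_linked(tokens: List[dict], link_tag: str = "@V<Pcle") -> Dict[int, str]:
    """Map each head position to its baseform joined with linked dependents."""
    by_id = {t["self"]: t for t in tokens if t["self"] is not None}
    joined: Dict[int, List[str]] = {}
    for t in tokens:
        if link_tag in t["tags"] and t["parent"] in by_id:
            head = by_id[t["parent"]]
            joined.setdefault(head["self"], [head["baseform"]]).append(t["baseform"])
    return {pos: " ".join(parts) for pos, parts in joined.items()}
```

Calling `join_linked(parse_readings(SAMPLE))` pairs the verb at position 2 with its particle, while the CG output itself keeps both cohorts intact. The head's baseform always comes first here, which would not be right for a prefix such as "dis-".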

Trond.

Kevin Brubeck Unhammer

Oct 30, 2013, 5:34:31 AM
to constrain...@googlegroups.com
Trosterud Trond <trond.t...@gmail.com> writes:

> Francis Tyers <spe...@ivixor.net> wrote on 29 Oct 2013 at 15:35:
>
>> On Mon, 28 Oct 2013 at 18:20 -0700, lupu...@gmail.com wrote:
>>> No doubt when you parse English texts, but in Slavic languages such
>>> analytic forms can be disjoint. And we have to analyze left and right
>>> contexts.
>
>>
>> 1) Интересную они предложили моей дочке работу.
>> |_____________________________________|
>> 2) Taj mi je pesnik napisao knjigu.
>> |__________|
>
>
> It seems to me you are looking at the wrong application here.
>
> On a morphological and syntactical level, Aux + V, V + Particle, Det +
> N, etc. are distinct. Now, for semantic reasons, you might want to
> join them, consider them as one, or whatever.
>
> This should not be done in the syntactic analysis of vislcg3. Here we
> want to say that "ok, these are the words we get as input, what are
> they and how do they interact with each other?"
>
> If, as you say, you have disjoint analytic forms ("some _dis-_ and in
> other ways badly _located_ forms", or "We _roll_ our analysis
> _out_"), then I would analyse them one at a time and then tag them,
> e.g. "out" with a tag like @V<Pcle (particle linked to its mother
> verb) combined with a dependency tag like #5->2.

Or even arbitrary ADDRELATIONS, where dependencies don't make sense but
you still want to precisely attach and not just point out the direction
(e.g. word 3 semantically depends on word 5 but syntactically
depends on word 2).


--
Kevin Brubeck Unhammer

GPG: 0x766AC60C

lupu...@gmail.com

Oct 30, 2013, 5:54:31 PM
to constrain...@googlegroups.com
It depends on the understanding of a "word". Let's consider three sentences:
(1) I crossed the street.
(2) I have crossed the street.
(3) I shall cross the street.

From my point of view all three are syntactically identical: S + V + O, and therefore they must have the same dependency tree.
For historical reasons the verb in (1) is written without spaces, while in (2) and (3) it contains spaces. But that is not a matter of syntax; it is a matter of orthography only.
Maybe some day people will begin to write "have-been-crossed" and "will-crossed": that would change nothing in English syntax.
The same can be said about compound prepositions ("instead of") and compound conjunctions ("as soon as").

If such a collocation has unambiguous morphological characteristics and an unambiguous syntactic function, I consider it a word.
In practice, merging these collocations makes the resulting dependency trees much clearer and more compact.

But that is only my opinion, and your mileage may vary.
All I propose is to add to CG3 a syntactic construction that allows reading the wordform of the cohort and the baseform of the reading matched by the current contextual test.
Some linguists will use this ability and some will not, depending on their approach.

Yury Lukach

Kevin Brubeck Unhammer

Oct 31, 2013, 4:49:48 AM
to constrain...@googlegroups.com
lupu...@gmail.com writes:

> It depends on the understanding of a "word". Let's consider three
> sentences:
> (1) I crossed the street.
> (2) I have crossed the street.
> (3) I shall cross the street.
>
> From my point of view all three are syntactically identical: S + V +
> O, and therefore they must have the same dependency tree.
> For historical reasons the verb in (1) is written without spaces,
> while in (2) and (3) it contains spaces. But that is not a matter of
> syntax; it is a matter of orthography only.

You can put words in between them, and this works productively:

I have not ever crossed the street.
I shall not ever cross the street.
I will not ever cross the street.

You can utter them in isolation:

I have.
I shall.
I will.

A native speaker asked to read slowly will pause before and after "have"
(but not between "cross" and "ed"). And I'm sure a phonetician will be
able to point out the phonetic boundaries and how they correspond to
other things we call words in the language, but not to the within-word
contour.

Word boundaries can be a difficult question sometimes, but not in this
case.

[…]

> But that is only my opinion, and your mileage may vary.
> All I propose is to add to CG3 a syntactic construction that allows
> reading the wordform of the cohort and the baseform of the reading
> matched by the current contextual test.

You can split it into three modules:

First you have a vislcg3 file containing rules like

LIST merge-if-before-verb = "<shall>" "<will>" "<have>" ; # etc
ADD (mergeright) merge-if-before-verb (1 (verb)) ;

Then a script (awk/perl/what have you) to turn

"<will>"
	"will" verb pres mergeright
"<cross>"
	"cross" verb inf

into

"<will cross>"
	"will cross" verb pres verb inf

or whatever you feel that should be, and then your dependency CG.
Assuming input to the first module is disambiguated
(one-reading-per-cohort), that's a rather simple script.
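
A minimal Python sketch of such a middle script, assuming disambiguated input with exactly one reading per cohort and the `mergeright` tag from the rule above. The function names are made up for illustration, and lines that are neither a wordform nor a reading (removed readings starting with `;`, plain text between cohorts) are simply dropped.

```python
import re
from typing import List, Tuple

Cohort = Tuple[str, str, List[str]]  # (wordform, baseform, tags)

WORDFORM_RE = re.compile(r'^"<(.*)>"\s*$')
READING_RE = re.compile(r'^\s*"([^"]*)"\s*(.*)$')


def parse_cohorts(text: str) -> List[Cohort]:
    """Parse CG text into cohorts, keeping the first reading of each."""
    cohorts: List[Cohort] = []
    wordform = None
    for line in text.splitlines():
        m = WORDFORM_RE.match(line)
        if m:
            wordform = m.group(1)
            continue
        m = READING_RE.match(line)
        if m and wordform is not None:
            cohorts.append((wordform, m.group(1), m.group(2).split()))
            wordform = None
    return cohorts


def merge_right(cohorts: List[Cohort], tag: str = "mergeright") -> List[Cohort]:
    """Merge every cohort carrying `tag` with the cohort that follows it."""
    merged: List[Cohort] = []
    i = 0
    while i < len(cohorts):
        wordform, baseform, tags = cohorts[i]
        while tag in tags and i + 1 < len(cohorts):
            next_wordform, next_baseform, next_tags = cohorts[i + 1]
            wordform = f"{wordform} {next_wordform}"
            baseform = f"{baseform} {next_baseform}"
            # Drop the marker tag, then append the tags of the absorbed cohort.
            tags = [t for t in tags if t != tag] + next_tags
            i += 1
        merged.append((wordform, baseform, tags))
        i += 1
    return merged


def format_cohorts(cohorts: List[Cohort]) -> str:
    """Write cohorts back out in CG text format."""
    lines = []
    for wordform, baseform, tags in cohorts:
        lines.append(f'"<{wordform}>"')
        lines.append("\t" + " ".join([f'"{baseform}"'] + tags))
    return "\n".join(lines) + "\n"
```

Chaining the three calls, `format_cohorts(merge_right(parse_cohorts(text)))`, turns the "will" + "cross" input above into the single "<will cross>" cohort. A `mergeright` cohort at the very end of the input has nothing to merge with and is passed through unchanged.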

lupu...@gmail.com

Oct 31, 2013, 1:03:33 PM
to constrain...@googlegroups.com
Thank you, I know how to walk on crutches :)
But I'd prefer instead to develop my own CG engine with extended functionality.

Tino Didriksen

Oct 31, 2013, 6:38:48 PM
to constrain...@googlegroups.com
On 31 October 2013 18:03, <lupu...@gmail.com> wrote:
> Thank you, I know how to walk on crutches :)
> But I'd prefer instead to develop my own CG engine with extended functionality.

Now that I'm finally back in a place that has a net connection the whole day...

What would you like to be able to write? Give me a concrete example rule or two and input with expected output - then I'll see how feasible it is to add to CG-3.

Your exact example could be done with a Substitute and RemCohort combination, using varstrings and regex capture in the substitute to create a new wordform from the old ones, but I'll admit it is currently messy.

-- Tino Didriksen