tag order sensitivity in CG

3 views
Skip to first unread message

Eckhard Bick

unread,
Jun 10, 2017, 1:40:46 AM6/10/17
to Constraint Grammar
Hello everybody,

I would like to suggest a temporary fix to the tag order sensitivity
problem, while we wait for a permanent solution. Please forgive me, if
anybody has suggested this method before, but I think the idea is new:

1. We introduce a magic tag LINE, maintained by the compiler,
constituted by the *whole* reading line (plus the word form at the
start) as *one* tag, i.e. *without breaking on space*.

2. If LIST or on-the-fly definitions use a tag parenthesis with space,
e.g. (Tag1 Tag2), in a rule with the flag TAGORDER, this will be
converted internally to /^(.* )?Tag1 Tag2( .*)?$/r.

REMOVE TAGORDER (Tag3) IF (*1 (Tag1 Tag2)) ;

3. This complex and order-sensitive tag is then checked against the LINE
tag as a normal reg.ex. match, like also otherwise done for //r tags.

With this solution we need only very little coding (I hope) for now.
This is, admittedly, an unoptimized hack, but if it works, people can
start writing the order-sensitive rules they need, while we wait for an
integrated and optimized solution later.

In addition to, or instead of, TAGORDER at the rule level, we could also
introduce the concept of a "nonbreaking space character", e.g. ·
(mini-bullet) or double underscore, to allow flexible use of tag order
down at the level of individual contexts: (Tag1·Tag2) or (Tag1__Tag2).

Those of you who need tag order, what do you think?

Tino, is my intuition correct that it would not be so hard to turn this
algorithmical idea into code? And what would it cost, speed-wise? Given
that it would be relevant only for some rules, I guess, it can't be too bad.

Best

Eckhard

--
Eckhard Bick,
cand.med., dr.phil.
University of Southern Denmark
e-mail: eckhar...@gmail.com
web: http://beta.visl.sdu.dk

Eckhard Bick

unread,
Jun 10, 2017, 1:52:51 AM6/10/17
to constrain...@googlegroups.com
An additional remark: Of course the tags don't have to be adjacent, I
meant /^(.* )?Tag1( | .+ )Tag2( .*)?$/r.

-- Eckhard

Kevin Brubeck Unhammer

unread,
Jun 10, 2017, 9:27:12 AM6/10/17
to constrain...@googlegroups.com
Eckhard Bick <eckhar...@mail.dk> čálii:

> Hello everybody,
>
> I would like to suggest a temporary fix to the tag order sensitivity
> problem, while we wait for a permanent solution. Please forgive me, if
> anybody has suggested this method before, but I think the idea is new:
>
> 1. We introduce a magic tag LINE, maintained by the compiler,
> constituted by the *whole* reading line (plus the word form at the
> start) as *one* tag, i.e. *without breaking on space*.
>
> 2. If LIST or on-the-fly definitions use a tag parenthesis with space,
> e.g. (Tag1 Tag2), in a rule with the flag TAGORDER, this will be
> converted internally to /^(.* )?Tag1 Tag2( .*)?$/r.
>
> REMOVE TAGORDER (Tag3) IF (*1 (Tag1 Tag2)) ;

If this was in response to my question about COPY order, I think it may
be solving a different problem. At least for the use-case in the Divvun
grammar checker (and in Apertium MT), we typically only care about order
when the string of tags changes (tags are inserted or removed, as in
SUBSTITUTE/COPY), not when context conditions are matched. The reason is
that this string of tags is sent to an order-sensitive FST, e.g. for
word-form generation.

That said, before we had subreadings, a TAGORDER option might have
solved the issue with overlapping tags in compounds, previously
represented as

"<kaffikake>"
"kaffi" n m sg ind + "kake" n m pl def

(though we'd still have to match on LINE using a regex, to differentiate
def-at-end-of-LINE vs def-in-the-middle). Now we represent compounds as

"<kaffikake>"
"kake" n m pl def
"kaffi" n m sg ind

and the SUB:-option lets us pick the right subreading.

But perhaps there are other use cases I haven't thought of?


--

Kevin Brubeck Unhammer

Tino Didriksen

unread,
Jun 13, 2017, 2:46:57 PM6/13/17
to constrain...@googlegroups.com
Replied inline...

On 10 June 2017 at 07:40, Eckhard Bick <eckhar...@mail.dk> wrote:
1. We introduce a magic tag LINE, maintained by the compiler, constituted by the *whole* reading line (plus the word form at the start) as *one* tag, i.e. *without breaking on space*.

That part is easy and basically already done. The reading already stores an ordered list of tags - this is not the problem.


2. If LIST or on-the-fly definitions use a tag parenthesis with space, e.g. (Tag1 Tag2), in a rule with the flag TAGORDER, this will be converted internally to /^(.* )?Tag1 Tag2( .*)?$/r.

    REMOVE TAGORDER (Tag3) IF (*1 (Tag1 Tag2)) ;

That's where it breaks. To change how some sets are compiled (or even worse, recompiled for non-inline sets) based on a rule flag is a major change and kludge. There is currently zero interaction between these two parts, and there shouldn't be. Sets don't know they are being parsed in the context of a rule, and it would be messy to add markers to not deduplicate these new kinds of sets.

 
In addition to, or instead of, TAGORDER at the rule level, we could also introduce the concept of a "nonbreaking space character", e.g. · (mini-bullet) or double underscore, to allow flexible use of tag order down at the level of individual contexts: (Tag1·Tag2) or (Tag1__Tag2).

In the final solution, I will need to introduce regex-like * . .+ .* (or whatever) as placeholders for zero, one, one-or-more, zero-or-more any-tags to let writers express everything.

 
Tino, is my intuition correct that it would not be so hard to turn this algorithmical idea into code? And what would it cost, speed-wise? Given that it would be relevant only for some rules, I guess, it can't be too bad.

Code-wise, not worth delaying the actual implementation for. It'd reach far into many corners, without actually getting us any closer to the correct solution.

Speed-wise, it would be bad.

-- Tino Didriksen

Eckhard Bick

unread,
Jun 14, 2017, 1:16:22 AM6/14/17
to constrain...@googlegroups.com
Not sure I see the problem ...

I wasn't talking about changing sets on the fly at run time, only a
*one-time* rewriting: Tag1·Tag2 tag strings being rewritten as //r
reg.ex'es in order to preserve order. This would only be rewritten once,
at compile time, not run time. And if we basically have a LINE tag
already, the re-written tag strings could be matched against this LINE.

The tag string rewriting is so simple that it could be done by a grammar
preprocessor rather than the actual compiler.

What would need to be adapted is only that this ordered-tag reg.ex'es
would be recognized as such and tested against LINE rather than ordinary
tags or set in the cohort. Isn't that just a simple IF branch during
target/context matching? A bit like treating $$ unification sets
differently?

Of course I'm talking algorithmically here, not claiming to predict the
complexity of an implementation. But it looks feasible to me.

-- Eckhard


On 06/13/2017 08:46 PM, Tino Didriksen wrote:
> Replied inline...
>
> On 10 June 2017 at 07:40, Eckhard Bick <eckhar...@mail.dk
> --
> You received this message because you are subscribed to the Google
> Groups "Constraint Grammar" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to constraint-gram...@googlegroups.com
> <mailto:constraint-gram...@googlegroups.com>.
> To post to this group, send email to
> constrain...@googlegroups.com
> <mailto:constrain...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/constraint-grammar.
> For more options, visit https://groups.google.com/d/optout.

Tino Didriksen

unread,
Jun 14, 2017, 7:25:24 AM6/14/17
to constrain...@googlegroups.com
A temporary hack is now implemented in svn and latest nightly packages.

Regexes in the form /.../l (lower-case L) will match against all tags on a reading.

The reading is appended to a single string with single space between tags, so to match a literal space escape it with \

__ (two underscores) is treated special and means space or start or end or any non-space tags - it is shortcut for (^|$| | .+? )

Examples:
- /A\ B/l will match input tags DCA BCD due to substring "A B"
- /__A\ B__/l will match input tags A B but not A C B or DCA BCD
- /__A__B__/l will match input tags A C D E B but not B C A

This is a temporary hack that will be removed later.

-- Tino Didriksen

Eckhard Bick

unread,
Jun 14, 2017, 10:44:14 AM6/14/17
to constrain...@googlegroups.com, Per Langgård, Machine Translation
On 06/14/2017 01:24 PM, Tino Didriksen wrote:
> - /__A__B__/l will match input tags A C D E B but not B C A
>
Great! This should be useful for Greenlandic :)

On the same note, I'm assuming that for two lists

LIST L1 = a b c ;

LIST L2 = A B C ;

order sensitivity of type (*1 L1+L2) - meaning an L1 tag appearing in
the same reading before an L2 tag, can be expressed as:

/__(a|b|c)__(A|B|C)__/l

possibly with backslashes for the parentheses and/or the vertical bars.

With an (external or internal) grammar-preprocessor I guess this could
even be written as:

/__$$L1__$$L2__/l

using the $$ convention for list variables known from unification, or
possibly some tailor-made new prefix such as 'List:' instead of $$.

Still far from the full combinatorial context possibilities of a full
tag order sensitivity implementation, but - tell me if I'm wrong, Per -
a substantial part of what's on the wishlist for Greenlandic.

-- Eckhard
Reply all
Reply to author
Forward
0 new messages