Feature request: [#15] Pass-through tags {{...}}

Nebojsa Vasiljevic

Mar 4, 2014, 3:40:10 AM
to unitex-...@googlegroups.com
Pass-through tags are intended for the following scenario:
custom pre-processing -> Unitex processing (Locate, Cassys,...) -> custom post-processing

In this scenario you may need to insert some information in the pre-processing phase that will pass through Unitex processing and be used again in the post-processing phase. Normally a pass-through tag will not be recognized as a token; instead it will be treated as part of a token delimiter, like space and new-line characters.

The proposed structure of a pass-through tag is {{...}} (anything inside double curly brackets).

Currently we can use the offset-tracking feature for similar purposes, but offset tracking can be limiting and too complicated in the case of custom pre/post-processing.

Implementation remarks:

Unitex currently expects only a single-character token delimiter (space or new line) in a normalized text. The pass-through tags feature will require either:
- some change in the Unitex architecture to allow multi-character token delimiters, or
- treating pass-through tags as a special type of token and handling them appropriately in the graph matching algorithm
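
A rough sketch of the second option, written in Python rather than taken from Unitex: pass-through tags and whitespace runs are emitted as a separate "delim" kind of token, so that a matcher could skip them. The names TOKEN_RE and tokenize, and the rule for what counts as a word, are assumptions made only for illustration.

```python
import re

# Hypothetical token pattern (not the Unitex tokenizer): a pass-through tag,
# a run of whitespace, a run of letters, or any other single character.
TOKEN_RE = re.compile(r"(\{\{.*?\}\})|(\s+)|([^\W\d_]+)|(.)", re.DOTALL)


def tokenize(text):
    """Return (kind, value) pairs, where kind is 'delim' or 'token'."""
    result = []
    for m in TOKEN_RE.finditer(text):
        tag, space, _word, _other = m.groups()
        if tag is not None or space is not None:
            # Pass-through tags share the delimiter class with whitespace.
            result.append(("delim", m.group(0)))
        else:
            result.append(("token", m.group(0)))
    return result


print(tokenize("a {{x}} b"))
```

For "a {{x}} b" this gives the token "a", three consecutive delimiters (" ", "{{x}}", " ") and the token "b", which is the "multiple delimiters between regular tokens" situation that the feature would have to support.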

Advanced options:

We could consider allowing explicit pass-through tag matching in graphs.

This feature request is also registered in the Unitex tracking system:


Regards,
Nebojša

Denis Maurel

Mar 4, 2014, 4:14:49 AM
to Nebojsa Vasiljevic, unitex-...@googlegroups.com


Hi Nebojsa,

I agree with the idea of revising the Unitex treatment of space, tab, CRLF... characters, for post-processing but also for parsing. What is currently missing is the possibility to parse the CRLF character. The idea that these form a new class of tokens is good, and it still allows them to be treated, if necessary, in the same way as they are now.

Another idea about tokens concerns the treatment of numbers. Why not treat digits like letters, i.e. consider a sequence of digits as one token rather than each digit alone? E.g. "100 meters" currently corresponds to 4 tokens: "1", "0", "0" and "meters". I suggest that in a future Unitex it correspond to 2 tokens, "100" and "meters", just like "few meters". Advantage: "100" could be used in a box for parsing. Currently, a box "100" recognizes 100, but also 2100, 56100 and so on.
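
The difference between the two tokenizations can be sketched in a few lines of Python. This is only an illustration of the idea, not the Unitex tokenizer; the function names and the regular expressions are assumptions.

```python
import re


def tokenize_digits_separately(text):
    # Behaviour described above as current: letters are grouped,
    # but every digit is a token on its own. Whitespace is dropped.
    return re.findall(r"[^\W\d_]+|\d|[^\w\s]", text)


def tokenize_digit_runs(text):
    # Suggested behaviour: a run of digits is one token, like a run of letters.
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]", text)


def contains(tokens, pattern):
    """True if the token list `pattern` occurs contiguously in `tokens`."""
    n = len(pattern)
    return any(tokens[i:i + n] == pattern for i in range(len(tokens) - n + 1))


print(tokenize_digits_separately("100 meters"))
print(tokenize_digit_runs("100 meters"))
# A box "100" is the sequence "1","0","0" in the first scheme,
# so it is also found inside "2100":
print(contains(tokenize_digits_separately("2100 meters"), ["1", "0", "0"]))
# With digit runs, "100" and "2100" are different tokens:
print(contains(tokenize_digit_runs("2100 meters"), ["100"]))
```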

Best regards,

Denis Maurel


____________________________________
Professor Denis Maurel
Université François Rabelais Tours
LI (Computer Science Research Laboratory)
EPU-DI
64 avenue Jean-Portalis
37200 Tours
France
Phone: 33-2.47.36.14.35
Fax: 33-2.47.36.14.22
mailto:denis....@univ-tours.fr

http://www.univ-tours.fr/maurel

http://www.li.univ-tours.fr
http://tln.li.univ-tours.fr/



--
You received this message because you are subscribed to the Google Groups "Unitex-GramLab" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unitex-gramla...@googlegroups.com.
To post to this group, send email to unitex-...@googlegroups.com.
Visit this group at http://groups.google.com/group/unitex-gramlab.
To view this discussion on the web visit https://groups.google.com/d/msgid/unitex-gramlab/a2512738-d7a3-4c4c-b671-dd090586b90b%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

eric.laporte

Mar 4, 2014, 12:20:41 PM
to unitex-...@googlegroups.com
Dear Nebojsa,

I am not comfortable with the idea of opening feature requests without prior discussion in the forum.
A new feature can have several costs: speed, readability of resources... A feature that would offer an additional possibility to do something that was already possible in another way might be a useless (and, therefore, harmful) complexification.
The forum provides the possibility for both users and developers to discuss their advantages and drawbacks.
I suggest we open feature requests only after a consensus is reached through the forum.
That way, developers can safely tackle feature requests without the risk that their generous work might later be rejected by the community.

Best regards,

Eric

Nebojsa Vasiljevic

Mar 4, 2014, 3:14:47 PM
to unitex-...@googlegroups.com
Eric,

Sorry for the misunderstanding. Maybe my previous post sounded too formal, but I was just trying to use the tracking system that was announced a month ago on this forum.

Normally, a tracking system is a place where new feature proposals are initiated, discussed and even voted on by users. You evidently assume that only agreed feature requests should be placed in the tracking system. We can agree on whatever role we want for the tracking system, but I want to note that my intention was not to force this request through by putting it there.

So, I've deleted the item from the tracking system, and I would like to go back to discussing the proposed feature itself.

Regards,
Nebojsa

eric.laporte

Mar 6, 2014, 10:36:22 AM
to unitex-...@googlegroups.com
Dear Nebojsa,
I am not enthusiastic about the {{...}} syntax for tags to be ignored. Why not use XML tags? In the 'Suggestions for Unitex and Gramlab' topic, take a look at the 'Processing XML documents' post. If someone implements it, will your suggestion still be valid?
Best,
Eric

Nebojsa Vasiljevic

Mar 8, 2014, 8:18:40 PM
to unitex-...@googlegroups.com
Dear Eric,

I had considered the ideas from the 'Processing XML documents' post, but there's a difference in focus between the two initiatives. In the 'Processing XML documents' post, XML support is not native: it is implemented with additional pre/post-processing and is based on offset tracking as a native Unitex feature. I, on the other hand, am talking about minimal native markup support in Unitex, primarily as an alternative to offset tracking and to methods based on it.

Generally, I think that offset tracking is a workaround, and that native markup support is the proper method for attaching additional information to a text. In particular, XML based on offset tracking is limited in several ways:
- There are cases that just can't be covered with offset tracking. For instance, when a piece of input text ends up in grammar output (in replace mode), offset tracking can't cover this piece of text (native markup could survive in this case). Offset tracking covers only pieces of text that are not affected by grammar processing.
- When you use offset tracking, all additional information in the original source is completely out of reach of grammar processing (but sometimes we may need to explicitly match markup in the graph).
- You have very different views when you look at the original text source and when you look at the .snt file, and this is just an additional level of complexity for the user.
- If something goes wrong in any step of the offset tracking chain, it may be hard to resolve the problem.
- It is not easy to implement offset tracking in your custom processing (you need to understand Unitex internals). Alternatively, you may rely on some already implemented method based on offset tracking (like the method from the 'Processing XML documents' post), but then you have an additional layer of pre/post-processing all the time, and you still have all the other limitations of offset-tracking-based methods.

So, that's why I'm looking for proper native markup support in Unitex. I suggested the {{...}} structure because it can't interfere with normal text, since it is an extension of the already existing tag notation in Unitex. If I use standard XML tags, then I will also need to avoid some characters in the rest of the text. For instance, I will need to replace "<" and ">" with "&lt;" and "&gt;", and "&" with "&amp;". And this will lead to a switch to XML as the native text format in Unitex. I don't think this is a bad idea, but it is a big step. If we prefer a small-steps approach, then we should extend the existing tag notation to implement a new kind of tag.
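
To make the escaping cost concrete, here is a small Python sketch using the standard library's xml.sax.saxutils module. It only illustrates what a custom pre-processing step would have to do to the rest of the text if raw XML tags were used as markup; the sample text is made up.

```python
from xml.sax.saxutils import escape, unescape

# Made-up sample text that contains the three characters XML reserves.
text = 'if a < b && b > c then "ok"'

# escape() replaces "&", "<" and ">" (quotes are left alone by default).
escaped = escape(text)
print(escaped)

# Post-processing would have to undo the replacement again.
print(unescape(escaped) == text)
```

Every "<", ">" and "&" in the running text changes, so grammars and dictionaries applied in between would see the escaped forms rather than the original characters.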

Regards,
Nebojša

Denis Maurel

Mar 9, 2014, 12:59:49 PM
to Nebojsa Vasiljevic, unitex-...@googlegroups.com


Hi Nebojsa and Eric,

I suggest using CasSys cascades for parsing XML text, as in the tutorial from the last Unitex workshop. This tutorial is available at http://tln.li.univ-tours.fr/Tln_CasEN.html.

Best regards,

Denis Maurel



Denis Maurel

Mar 10, 2014, 4:26:49 AM
to Nebojsa Vasiljevic, unitex-...@googlegroups.com


Hi Nebojsa,

Your proposal is very interesting. I agree with it.
The problem is that sometimes we need to hide XML tags and sometimes to use them.
With a cascade, we can use them, but we always have to include them in local grammars...

When we parse transcriptions of oral texts, we have the same problem with disfluencies like
"his name is hum John"
Your proposal makes it possible to rewrite this text as "his name is {{hum}} John" and to parse only "his name is John" with a local grammar! Very useful.
I hope to see this improvement!
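
The disfluency example can be sketched in Python as follows. This only shows which text a local grammar would effectively see if {{...}} tags were ignorable; it is not how Unitex would implement the feature, and the names PASS_THROUGH_RE and text_seen_by_grammar are invented for the illustration.

```python
import re

# Assumed tag shape from the proposal: anything inside double curly brackets.
PASS_THROUGH_RE = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def text_seen_by_grammar(text):
    """Drop pass-through tags and collapse the whitespace left around them."""
    without_tags = PASS_THROUGH_RE.sub("", text)
    return re.sub(r"\s+", " ", without_tags)


annotated = "his name is {{hum}} John"
print(text_seen_by_grammar(annotated))  # the view used for matching
print(annotated)                        # the stored text keeps the tag
```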

Thank you very much.

Best regards,

Denis Maurel






Nebojsa Vasiljevic

Mar 10, 2014, 4:29:05 AM
to unitex-...@googlegroups.com, Nebojsa Vasiljevic, denis....@univ-tours.fr
Denis,

In the tutorial examples you mentioned, the first graph in the first cascade (xml.grf) packs all XML tags into Unitex lexical tags. So, cascades rely on Unitex's native support for lexical tags (extended with some kind of nesting, as I understand it). My idea is to improve tagging support at the native level. The particular feature proposal is to introduce native tags that can be implicitly skipped in graph matching, but kept in the matched text. For instance, if we have the sentence:

Don't use {<I>,.xml+formatting}italic{</I>,.xml+formatting} too much.

and graph (imagine boxes in the place of square brackets): 

[use]-[italic]-[too]-[much]

Then this graph doesn't match any text in the previous sentence, since the graph doesn't recognize the inserted lexical tags. I would like to be able to write something like:

Don't use {{<I>}}italic{{</I>}} too much.

and to find "use {{<I>}}italic{{</I>}} too much" in the concordances for the previous graph (and potentially in an input variable). And I also would like to be able to explicitly put {{<I>}} in a graph box if I need to match it explicitly. 

So, there are annotations that are relevant in some context and not relevant in some other context, but I want to have them in the text all the time.
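
The requested behaviour (tags skipped during matching, but kept in the concordance) can be sketched in Python. This is a toy matcher over a linear sequence of boxes, not the Unitex Locate algorithm; PIECE_RE, is_delimiter and locate are hypothetical names.

```python
import re

# A piece is a pass-through tag, a whitespace run, a word, or one other character.
PIECE_RE = re.compile(r"\{\{.*?\}\}|\s+|\w+|.", re.DOTALL)


def is_delimiter(piece):
    return piece.isspace() or piece.startswith("{{")


def locate(text, words):
    """Return the original text span matched by the word sequence, or None."""
    pieces = [(m.group(0), m.start(), m.end()) for m in PIECE_RE.finditer(text)]
    # Matching only looks at regular tokens; delimiters are filtered out.
    tokens = [p for p in pieces if not is_delimiter(p[0])]
    n = len(words)
    for i in range(len(tokens) - n + 1):
        if [t[0] for t in tokens[i:i + n]] == words:
            # The concordance is cut from the original text, tags included.
            return text[tokens[i][1]:tokens[i + n - 1][2]]
    return None


graph = ["use", "italic", "too", "much"]
print(locate("Don't use {{<I>}}italic{{</I>}} too much.", graph))
print(locate("Don't use {<I>,.xml+formatting}italic{</I>,.xml+formatting} too much.", graph))
```

The first call returns the span with the {{...}} tags still inside it. The second returns None: the sketch splits the single-brace lexical tags into several ordinary tokens, which is not how Unitex represents them, but the outcome is the same as described above because they sit between "use" and "italic" as regular tokens.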


Regards,
Nebojša Vasiljević

eric.laporte

Mar 17, 2014, 1:02:00 PM
to unitex-...@googlegroups.com
Dear Nebojša,

I understand that the proposed stripping of XML tags (and restoration after Unitex processing) does not fit your needs.
Summing up the functionality you are suggesting:
- The {{...}} annotations in the text would be invisible to Locate Pattern when the next symbol in the grammar is not of the {{...}} form.
- They would be visible to Locate Pattern when the next symbol in the grammar is of this form.
So it would be possible for a grammar to check that the next symbol in the text is a pass-through one, but impossible to check that it is not, except with special devices such as negative contexts.
I fear grammars using this functionality would be less readable.
Best,

Eric

Nebojsa Vasiljevic

Mar 17, 2014, 4:35:47 PM
to unitex-...@googlegroups.com
Dear Eric,

The {{...}} annotations in the text would also be invisible to Locate Pattern when you have several {{...}} annotations in sequence. The only exception could be when you put a {{...}} annotation explicitly in a graph box.

Regards,
Nebojsa

eric.laporte

Mar 20, 2014, 1:16:07 PM
to unitex-...@googlegroups.com
Dear Nebojša,

It would be possible for a grammar to check that the next symbol in the text is a pass-through one, but impossible to check that it is not, except with special devices such as negative contexts.
I fear grammars using this functionality would be less readable than our present standards.
Best,

Eric

Nebojsa Vasiljevic

Mar 20, 2014, 3:42:47 PM
to eric.laporte, unitex-...@googlegroups.com
Eric,

You don't need a negative context. Let <DELIMITER> mean any sequence of spaces, new lines (and other white-space characters) and {{...}} annotations.

The desired semantics is equivalent to implicitly placing <DELIMITER> before and after each box in the graph. Basically, we need to introduce a special class of tokens that we will call delimiters, and to allow multiple delimiters between regular tokens. We need this improvement anyway, to be able to keep multiple new lines, tabs, etc. in the text and even to match them when necessary. If we implement delimiters as a special kind of token, then {{...}} annotations could be just one kind of delimiter.
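
One way to read these semantics, sketched in Python for a single linear path of boxes: before each box, any number of delimiters is skipped, unless the box itself names the delimiter at the current position. This is an interpretation for illustration only; match_at, find and the choice not to include leading delimiters in the result are assumptions, not Unitex behaviour.

```python
import re

PIECE_RE = re.compile(r"\{\{.*?\}\}|\s+|\w+|.", re.DOTALL)


def is_delimiter(piece):
    return piece.isspace() or piece.startswith("{{")


def match_at(pieces, start, boxes):
    """Match the boxes starting at pieces[start]; return the end index or None."""
    pos = start
    for box in boxes:
        # Implicit <DELIMITER>: skip delimiters that the box does not name.
        while pos < len(pieces) and is_delimiter(pieces[pos]) and pieces[pos] != box:
            pos += 1
        if pos >= len(pieces) or pieces[pos] != box:
            return None
        pos += 1
    return pos


def find(text, boxes):
    """Return the first matched span of the original text, or None."""
    if not boxes:
        return None
    pieces = PIECE_RE.findall(text)
    for start in range(len(pieces)):
        # Do not start a match on a delimiter that would only be skipped.
        if is_delimiter(pieces[start]) and pieces[start] != boxes[0]:
            continue
        end = match_at(pieces, start, boxes)
        if end is not None:
            return "".join(pieces[start:end])
    return None


sentence = "Don't use {{<I>}}italic{{</I>}} too much."
print(find(sentence, ["use", "italic", "too", "much"]))  # tags skipped implicitly
print(find(sentence, ["use", "{{<I>}}", "italic"]))      # tag matched explicitly
print(find(sentence, ["use", "{{<B>}}", "italic"]))      # explicit tag absent
```

Because a delimiter can always be skipped, taking the first occurrence of an explicitly named tag never rules out a later match, so this simple left-to-right scan needs no backtracking for a single path.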

Regards,
Nebojša


Denis Maurel

Mar 20, 2014, 4:52:45 PM
to Nebojsa Vasiljevic, eric.laporte, unitex-...@googlegroups.com


Hi Nebojsa,

I support this idea!
Thanks



Best regards,

Denis Maurel



eric.laporte

Mar 21, 2014, 9:00:23 AM
to unitex-...@googlegroups.com, eric.laporte
Dear Nebojsa,
With this feature, how do you make a path in a graph where you make sure that a given token, say <John>, is **not** followed by any {{...}} annotation, without a negative context in the sense of section 6.3.1 of the manual?
Best,
Eric

Nebojsa Vasiljevic

Mar 22, 2014, 7:55:14 AM
to eric.laporte, unitex-...@googlegroups.com
Eric,

With the semantics I explained in my previous post, you just can't do that. Under these semantics, whatever text a graph recognizes, if you modify this text by inserting a delimiter immediately before or after any regular token that is recognized in normal context, the same graph will also recognize the modified text.

So, the primary idea of {{...}} annotations is that they are ignorable, and the secondary idea is that they are matchable (to the extent allowed by the primary idea).

If we need full matchability for an annotation, then we need an additional feature to declare that this annotation (or a set of delimiters, or all {{...}} annotations) is to be considered a regular token in a particular part of a graph. We could implement this with a new open-close pair of a special kind of box (like we have for input/output variables, morphological mode, right context, etc.).

If all of that seems too complex, we could start with just the ignorability part: introduce delimiters as a special kind of token; allow multiple delimiters between regular tokens; consider {{...}} annotations as delimiters; and match only regular tokens in graphs. The other two possibilities (explicitly matching a delimiter, and declaring some delimiters to be considered regular tokens in some parts of a graph) could be postponed to further development if we find them important.

Also, we could consider a {<...>} notation, to be more XML-like.

Regards,
Nebojša Vasiljević

