Locate numbers in tagged texts


Laurent Kevers

Nov 21, 2014, 11:33:20 AM
to unitex-...@googlegroups.com
Hello,

I am trying to write a graph that matches tokens containing numbers (days of the month: 1-31). The text is already tagged, so I have this kind of token:

{31,31.ADJNUM}
{28,28.NCMIN}
{1,1.ADJNUM}

I tried different things (see attached graphs):
- test1: one box per character
- test2: same, but with '#' to forbid spaces between boxes
- test3: with morphological mode $<  $>

All three graphs failed to retrieve the tokens described above.

I finally succeeded with morphological filters (test4).

My question: are morphological filters the only way to detect this kind of token in a tagged text?

Thanks.

Laurent
test1.grf
test2.grf
test3.grf
test4.grf

Nebojsa Vasiljevic

Nov 22, 2014, 10:18:57 AM
to unitex-...@googlegroups.com
The other way is a single box containing:

<.ADJNUM>+<.NCMIN>

But if filtering by syntactic/semantic and inflectional codes is not enough, you need to go down to the morphological level, since a lexical tag is tokenized as a single token.
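
To illustrate why one-box-per-character graphs fail on tagged text, here is a minimal Python sketch. It is not Unitex code: the `tokenize` function and its regular expression are simplifying assumptions (a lexical tag is one token, a run of letters is one token, any other non-space character is a token of its own, and whitespace is dropped).

```python
import re

# Assumed, simplified tokenization rules (not the actual Unitex tokenizer):
#   1. a lexical tag {form,lemma.CODES} is kept as ONE token
#   2. a run of letters is one token
#   3. any other non-space character (e.g. a digit) is a token on its own
TOKEN_RE = re.compile(r"\{[^{}]*\}|[^\W\d_]+|\S")


def tokenize(text):
    """Split text into tokens following the simplified rules above."""
    return TOKEN_RE.findall(text)


# In the tagged text the number is hidden inside a single token,
# so a box matching "3" followed by a box matching "1" finds nothing.
tagged = tokenize("le {31,31.ADJNUM} mars")

# In the untagged text each digit is a separate token.
plain = tokenize("le 31 mars")

print(tagged)
print(plain)
```

Under these assumptions the tagged text yields three tokens and the untagged text four, which is why character-level boxes only have something to match in the untagged case.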

Regards,
Nebojša

Laurent Kevers

Nov 24, 2014, 3:46:45 AM
to unitex-...@googlegroups.com

OK, I see: in the tok_by_*.txt files I have the full lexical tags, like {13,13.NCMIN}, so, as you said, each one is probably treated as a single token. That explains why my first graphs failed to locate these units.
Since I need fine control over the located units, filtering on the codes will not be enough. I will therefore use morphological filters, as in my test4 graph.

Thank you.
Best regards,

Laurent

Nebojsa Vasiljevic

Nov 24, 2014, 4:13:48 AM
to Laurent Kevers, unitex-...@googlegroups.com
Just don't forget to apply reasonable filtering at the lexical level before the additional filtering at the morphological level.
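
The two-stage idea can be sketched in Python as follows. This is only an analogy, not how Unitex evaluates a graph: `parse_tag`, `find_days`, and the day-of-month regular expression are hypothetical names and choices made for this example, and the tag format is assumed to be `{form,lemma.CODE+CODE:inflection}`.

```python
import re

# Assumed tag layout: {form,lemma.CODES}; helper names are hypothetical.
TAG_RE = re.compile(r"\{([^,{}]+),([^.{}]+)\.([^{}]+)\}")

# Morphological-level condition: the form is a day of the month (1-31).
DAY_RE = re.compile(r"^([1-9]|[12][0-9]|3[01])$")


def parse_tag(token):
    """Return (form, lemma, codes) for a lexical tag, or None otherwise."""
    m = TAG_RE.fullmatch(token)
    if m is None:
        return None
    form, lemma, codes = m.groups()
    # Drop inflectional codes after ':' and split the rest on '+'.
    return form, lemma, codes.split(":")[0].split("+")


def find_days(tokens, wanted_codes=("ADJNUM", "NCMIN")):
    """Filter first on lexical codes, then on the shape of the form."""
    hits = []
    for tok in tokens:
        parsed = parse_tag(tok)
        if parsed is None:
            continue
        form, _lemma, codes = parsed
        # Stage 1: lexical filter, keeps only number-like tags.
        if not any(code in wanted_codes for code in codes):
            continue
        # Stage 2: morphological filter, applied to the few survivors.
        if DAY_RE.match(form):
            hits.append(form)
    return hits


sample = ["{31,31.ADJNUM}", "{28,28.NCMIN}", "{1,1.ADJNUM}",
          "{113,113.NCMIN}", "{mars,mars.N}", "le"]
print(find_days(sample))
```

Running the regular expression only on tokens that already passed the code check keeps the more expensive test off the bulk of the text.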

Nebojsa

--
You received this message because you are subscribed to the Google Groups "Unitex-GramLab" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unitex-gramla...@googlegroups.com.
To post to this group, send email to unitex-...@googlegroups.com.
Visit this group at http://groups.google.com/group/unitex-gramlab.
To view this discussion on the web visit https://groups.google.com/d/msgid/unitex-gramlab/726e7c1e-0f38-4b37-aeb1-9b05288e47ff%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Denis Maurel

Nov 24, 2014, 5:45:35 AM
to Laurent Kevers, unitex-...@googlegroups.com


Hi Laurent,

The problem is that each digit is a token.

1#2 recognizes 12, but also the 12 in 912, etc.

The solution is a box:
<NB><<^[1-9]$>>
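
The effect of the `^` and `$` anchors can be reproduced with ordinary regular expressions. The Python sketch below is only an analogy: the sample text and variable names are invented for illustration, and the 1-31 pattern is an assumed extension for the days-of-the-month case from the first message.

```python
import re

text = "rooms 12, 912 and 7"

# Unanchored search: "12" is found twice, once inside "912".
unanchored = re.findall(r"12", text)

# First isolate whole digit sequences (roughly what <NB> stands for),
# then test each complete sequence against an anchored pattern.
numbers = re.findall(r"\d+", text)

# Anchored single digit 1-9, as in the filter <<^[1-9]$>>.
single_digits = [n for n in numbers if re.search(r"^[1-9]$", n)]

# Assumed extension: days of the month, 1 to 31.
days = [n for n in numbers if re.search(r"^([1-9]|[12][0-9]|3[01])$", n)]

print(unanchored, numbers, single_digits, days)
```

Because the anchored pattern must cover the whole digit sequence, 912 is rejected as a unit instead of contributing a spurious 12.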


Best regards,

Denis Maurel


____________________________________
Professor Denis Maurel
Université François Rabelais Tours
LI (Computer Science Research Laboratory)
EPU-DI
64 avenue Jean-Portalis
37200 Tours
France
Phone: 33-2.47.36.14.35
Fax: 33-2.47.36.14.22
mailto:denis....@univ-tours.fr

http://www.univ-tours.fr/maurel

http://www.li.univ-tours.fr
http://tln.li.univ-tours.fr/




Laurent Kevers

Nov 24, 2014, 6:45:21 AM
to unitex-...@googlegroups.com, laurentke...@gmail.com, denis....@univ-tours.fr

Hi Denis,

With tagged numbers I don't have such cases (I currently have a tagged text with numbers like {13,13.NCMIN}). But in the future I may stop tagging numbers, and then I will run into the problem you described.

I read in the manual (page 120): "Quotation marks can also be used to enforce spacing. By default, Unitex considers that a space is possible between two boxes. To require a space, put it between quotation marks. To forbid a space, use the special symbol #."

I tried using the space character " " to force a space before and after a number: box1=" "; box2=13; box3=" " (I simplified box2 to a single number for this example), but it only works if I explicitly add a <TOKEN> before and after (as box0 and box4).
I thought I could use this kind of pattern with left and right contexts (sections 6.3.1 and 6.3.2 of the manual) to keep these <TOKEN> out of the recognized sequence, but I did not succeed.

Best regards,

Laurent

Denis Maurel

Nov 24, 2014, 10:13:52 AM
to Laurent Kevers, unitex-...@googlegroups.com


Hi Kevers,

With numbers tagged like {13,13.NCMIN}, you can use a lemma search: <13> matches exactly 13, provided 113 is tagged {113,113.NCMIN}.
You can also use <NCMIN> for all numbers tagged {x,x.NCMIN}.

Best regards,

Denis Maurel





Paea LePendu

Apr 25, 2015, 10:33:03 AM
to unitex-...@googlegroups.com
Hello,

I am stuck on a similar problem. I want to recognize the following string, essentially the digit one followed by a period and preceded by white space, the point being to find the first item of a list:

" 1. "

I cannot seem to do so without also getting:

" 21. "

It seems to come down to two things:
1. The <NB> construct is only sometimes contiguous, not always, so <NB><<^1$>> does not do what I'd expect.
2. The " " construct is ignored at the start of a grammar.

Please help. Thanks!

Paea

eric.laporte

Sep 22, 2015, 8:17:56 AM
to Unitex-GramLab

Dear Paea,

I suggest a graph that also recognizes the token before "1.". The only case where it would miss an occurrence is when the corpus begins with "1.". The following graph recognizes the token before "1." and checks that it is not a number:



Best,

Eric

Denis Maurel

Sep 22, 2015, 10:17:45 AM
to eric.laporte, Unitex-GramLab


Dear Paea

For the beginning of the corpus, you can use {^}.
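
Taken together, the two suggestions (check that the preceding token is not a number, and treat the start of the corpus as a valid left neighbour) can be sketched in Python. This is an illustration of the logic only, not a Unitex graph: it assumes a token list in which each digit is its own token and whitespace has been dropped, and `find_list_starts` and `START` are hypothetical names.

```python
# Stand-in for the start-of-text marker {^}; hypothetical constant name.
START = "{^}"


def find_list_starts(tokens):
    """Return indices of "1" tokens that are followed by "." and are not
    preceded by another digit token (so the "1." inside "21." is skipped)."""
    seq = [START] + list(tokens)  # prepend the start marker
    hits = []
    for i in range(1, len(seq) - 1):
        if (seq[i] == "1"
                and seq[i + 1] == "."
                and not seq[i - 1].isdigit()):
            hits.append(i - 1)  # index in the original token list
    return hits


# Assumed token stream for: "1. Intro 21. Other"
tokens = ["1", ".", "Intro", "2", "1", ".", "Other"]
print(find_list_starts(tokens))
```

The start marker is not a digit, so an occurrence at the very beginning of the text passes the same left-neighbour test as any other occurrence.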



Best regards,

Denis Maurel




