Locate numbers in tagged texts

Laurent Kevers

Nov 21, 2014, 11:33:20 AM

to unitex-...@googlegroups.com

Hello,

I am trying to write a graph that matches tokens containing numbers (days of the month: 1-31). The text is already tagged, so I have tokens like these:

{31,31.ADJNUM}
{28,28.NCMIN}
{1,1.ADJNUM}

I tried different things (see attached graphs):
- test1: one box per character
- test2: same, but with '#' in order to avoid spaces between boxes
- test3: with morphological mode $< $>

All three graphs failed to match the tokens described above.

I finally succeeded with morphological filters (test4).

My question: are morphological filters the only way to detect this kind of token in a tagged text?
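
For readers without the attached graphs, the check a morphological filter performs on such tokens can be approximated outside Unitex. The Python sketch below is an illustration only: it assumes the tag layout {form,lemma.CODE} shown above, and the day-of-month regular expression is a guess at what is needed, not the content of test4.grf (which is not reproduced in this thread). The names TAG, DAY and day_tags are hypothetical.

```python
import re

# Lexical tag as shown in the post: {inflected form,lemma.CODE}
TAG = re.compile(r"\{([^,{}]+),([^.{}]+)\.([^{}]+)\}")

# Days of the month 1-31, anchored so the whole form must match.
DAY = re.compile(r"^(?:[1-9]|[12][0-9]|3[01])$")


def day_tags(text):
    """Return (form, code) pairs for tags whose inflected form is a day 1-31."""
    return [
        (m.group(1), m.group(3))
        for m in TAG.finditer(text)
        if DAY.match(m.group(1))
    ]


sample = "{31,31.ADJNUM} {28,28.NCMIN} {1,1.ADJNUM} {113,113.NCMIN} {0,0.NCMIN}"
print(day_tags(sample))
```

The anchors ^ and $ matter here: without them, the form 113 would be accepted because it contains 11 or 13.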

Thanks.

Laurent

test1.grf

test2.grf

test3.grf

test4.grf

Nebojsa Vasiljevic

Nov 22, 2014, 10:18:57 AM

to unitex-...@googlegroups.com

The other way is a single box containing:

<.ADJNUM>+<.NCMIN>

But if filtering by syntactic/semantic and inflectional codes is not enough, you need to go down to the morphological level, since a lexical tag is tokenized as a single token.
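
The point about tokenization can be illustrated with a small Python sketch. The token rule below is a simplified assumption made for this example (a {...} lexical tag, or a run of letters, or any other single character, with whitespace skipped); it is not the Unitex tokenizer itself. The names TOKEN and tokenize are hypothetical.

```python
import re

# Simplified token rule (an assumption, not the Unitex implementation):
# a {...} lexical tag, or a run of letters, or any single other character.
# Whitespace is skipped here to keep the output short.
TOKEN = re.compile(r"\{[^{}]*\}|[^\W\d_]+|\S")


def tokenize(text):
    """Split text into tokens according to the simplified rule above."""
    return TOKEN.findall(text)


print(tokenize("le {13,13.NCMIN} mai"))  # the tag stays in one piece
print(tokenize("le 13 mai"))             # each digit becomes its own token
```

Under this rule a graph with one box per character cannot match inside {13,13.NCMIN}, because the digits are never exposed as separate tokens.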

Regards,

Nebojša

Laurent Kevers

Nov 24, 2014, 3:46:45 AM

to unitex-...@googlegroups.com

OK, I see: in the tok_by_*.txt files I have the full lexical tags, like {13,13.NCMIN}, so, as you said, each one is probably treated as a single token. That explains why my first graphs failed to locate these units.
Since I need fine control over the located units, filtering on the codes will not be enough. I will therefore use morphological filters, as in my test4 graph.

Thank you.
Best regards,

Laurent

Nebojsa Vasiljevic

Nov 24, 2014, 4:13:48 AM

to Laurent Kevers, unitex-...@googlegroups.com

Just don't forget to apply reasonable filtering at the lexical level before the additional filtering at the morphological level.

Nebojsa

Nebojša Vasiljević

nebojsa.v...@gmail.com

http://linkedin.com/in/vasiljevic

--
You received this message because you are subscribed to the Google Groups "Unitex-GramLab" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unitex-gramla...@googlegroups.com.
To post to this group, send email to unitex-...@googlegroups.com.
Visit this group at http://groups.google.com/group/unitex-gramlab.
To view this discussion on the web visit https://groups.google.com/d/msgid/unitex-gramlab/726e7c1e-0f38-4b37-aeb1-9b05288e47ff%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Denis Maurel

Nov 24, 2014, 5:45:35 AM

to Laurent Kevers, unitex-...@googlegroups.com

Hi Laurent,

The problem is that each digit is a token of its own.

1#2 recognizes 12, but also the 12 inside 912, etc.
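
This over-matching can be reproduced outside Unitex with a minimal Python sketch. It assumes the text has already been split into tokens with one digit per token and whitespace left out, so that two neighbouring entries stand for the "no space" case expressed by #. The function names are hypothetical, and the neighbour check is only one possible way to reject partial numbers.

```python
def find_sequence(tokens, pattern):
    """Start indices where `pattern` occurs as a run of adjacent tokens."""
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == pattern]


def find_whole_number(tokens, pattern):
    """Same search, but reject hits with a digit token right before or after."""
    hits = []
    for i in find_sequence(tokens, pattern):
        end = i + len(pattern)
        before = tokens[i - 1] if i > 0 else ""
        after = tokens[end] if end < len(tokens) else ""
        if not before.isdigit() and not after.isdigit():
            hits.append(i)
    return hits


tokens = ["le", "9", "1", "2", "et", "le", "1", "2"]
print(find_sequence(tokens, ["1", "2"]))      # also hits the 12 inside 912
print(find_whole_number(tokens, ["1", "2"]))  # keeps only the standalone 12
```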

The solution is a box:

Best regards,

Denis Maurel

____________________________________
Professor Denis Maurel
Université François Rabelais Tours
LI (Computer Science Research Laboratory)
EPU-DI
64 avenue Jean-Portalis
37200 Tours
France
Phone: 33-2.47.36.14.35
Fax: 33-2.47.36.14.22
mailto:denis....@univ-tours.fr

http://www.univ-tours.fr/maurel

http://www.li.univ-tours.fr
http://tln.li.univ-tours.fr/

Laurent Kevers

Nov 24, 2014, 6:45:21 AM

to unitex-...@googlegroups.com, laurentke...@gmail.com, denis....@univ-tours.fr

Hi Denis,

With tagged numbers I don't have such cases (my current text is tagged, with numbers like {13,13.NCMIN}). But in the future I may stop tagging numbers, and then I will run into the problem you describe.

I read in the manual (page 120): "Quotation marks can also be used to enforce spacing. By default, Unitex considers that a space is possible between two boxes. To force the presence of a space, put it between quotation marks. To forbid the presence of a space, use the special symbol #."

I tried to use the space character " " to force a space before and after a number: box1=" "; box2=13; box3=" " (box2 is simplified to a single number for this example), but it only works if I explicitly add a <TOKEN> before and after (as box0 and box4).
I thought I could use this kind of pattern with left and right contexts (sections 6.3.1 and 6.3.2 of the manual) to keep these <TOKEN> out of the recognized sequence, but I did not succeed.
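
As a point of comparison outside Unitex, the goal described here (check what surrounds the number without including it in the match) corresponds to lookarounds in ordinary regular expressions. The Python sketch below is an analogy only; it does not show Unitex context syntax, and the name THIRTEEN and the sample text are made up for the example.

```python
import re

# Lookarounds inspect the neighbouring characters without consuming them,
# similar in spirit to left and right contexts in a graph.
THIRTEEN = re.compile(r"(?<!\d)13(?!\d)")

text = "le 13 mai, le 113 mai, le 131 mai, (13)"

# Only the standalone occurrences are kept; 113 and 131 are rejected.
for m in THIRTEEN.finditer(text):
    print(m.start(), repr(text[max(0, m.start() - 1):m.end() + 1]))
```

Testing for "no digit on either side" rather than "a space on either side" also covers numbers next to punctuation or at the edges of the text.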

Best regards,

Laurent

Denis Maurel

Nov 24, 2014, 10:13:52 AM

to Laurent Kevers, unitex-...@googlegroups.com

Hi Kevers,

With numbers tagged {13,13.NCMIN}, you can use the lemma search <13> to get exactly 13, provided 113 is tagged {113,113.NCMIN}.

You can also use <NCMIN> for all numbers tagged {x,x.NCMIN}.
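
The difference between the two searches can be sketched in Python on already-parsed tags. This is an illustration under simplifying assumptions: tags have the form {form,lemma.CODE} with a single code, and the names Tag, parse_tags, by_lemma and by_code are hypothetical helpers, not part of Unitex.

```python
import re
from typing import NamedTuple


class Tag(NamedTuple):
    form: str
    lemma: str
    code: str


# Assumed tag layout: {form,lemma.CODE} with a single code.
TAG = re.compile(r"\{([^,{}]+),([^.{}]+)\.([^{}]+)\}")


def parse_tags(text):
    """Extract every lexical tag found in the text."""
    return [Tag(*m.groups()) for m in TAG.finditer(text)]


def by_lemma(tags, lemma):
    """Analogue of a lemma pattern such as <13>: the whole lemma must be equal."""
    return [t for t in tags if t.lemma == lemma]


def by_code(tags, code):
    """Analogue of a code pattern such as <NCMIN>."""
    return [t for t in tags if t.code == code]


tags = parse_tags("{13,13.NCMIN} {113,113.NCMIN} {1,1.ADJNUM}")
print([t.form for t in by_lemma(tags, "13")])
print([t.form for t in by_code(tags, "NCMIN")])
```

Because the comparison is made on the whole lemma, 113 is not returned by the search for 13, which is the condition stated above.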

Best regards,

Denis Maurel

Paea LePendu

Apr 25, 2015, 10:33:03 AM

to unitex-...@googlegroups.com

Hello,

I am stuck on a similar problem. I want to recognize the following string, essentially the digit one followed by a period and preceded by white space, the point being to find the first item of a list:

" 1. "

I cannot seem to do so without also getting:

" 21. "

It seems to have to do with two things:
1. The <NB> construct is only sometimes contiguous, not always, so <NB><<^1$>> does not do what I'd expect.
2. The " " construct is ignored at the start of a grammar.

Please help. Thanks!

Paea

eric.laporte

Sep 22, 2015, 8:17:56 AM

to Unitex-GramLab

Dear Paea,

I suggest a graph that would recognize the token before "1.". The only case where it would not recognize an occurrence is if the corpus begins with "1." The following graph recognizes the token before "1." and checks it is not a number:

Best,

Eric

Denis Maurel

Sep 22, 2015, 10:17:45 AM

to eric.laporte, Unitex-GramLab

Dear Paea

For the beginning of the corpus, you can use {^}.
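
The two suggestions combined (look at the token before "1." and accept the start of the text as a special case) can be sketched in Python as follows. This is not the graph from the thread, which is not reproduced here; it assumes a token list with one digit per token and whitespace left out, and first_items is a hypothetical name.

```python
def first_items(tokens):
    """Indices of a '1' token followed by '.' and not preceded by a digit.

    Index 0 is accepted as well, which plays the role of the text-start case.
    """
    hits = []
    for i in range(len(tokens) - 1):
        if tokens[i] == "1" and tokens[i + 1] == ".":
            if i == 0 or not tokens[i - 1].isdigit():
                hits.append(i)
    return hits


# "1." at the start, "21." in the middle, and a second list starting later.
tokens = ["1", ".", "Intro", "2", "1", ".", "Fin", "1", "."]
print(first_items(tokens))
```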

Best regards,

Denis Maurel
