Errors in detection/annotation when using <TOKEN>


Drat Lucie

Jul 2, 2014, 11:39:32 AM
to unitex-...@googlegroups.com
Hi,

I'm using GramLab to develop a grammar that detects temporal expressions in tourism documents. The annotations added to the output are in TimeML, which has tags for linking two elements together through their IDs.

First, I described all the kinds of temporal expressions that are present in our corpora. So I have a main graph that calls each of the subgraphs describing and annotating them. There was no problem until I tried to modify it to add links between the elements.
I wanted to add a link in the output (just <TLINK/> for tests) when a date is followed by a time expression, to link the two of them. As they can be separated by any kind of characters (letters, numbers, punctuation, etc.), I used <TOKEN>*.

But it matched almost all, or all, of the text following the date. So I was advised to change <TOKEN>* to (![<TIME>] <TOKEN>)*, for example, to prevent it from matching something described in the subgraph TIMEX3_TIME.
It worked, but now I get detection and annotation errors that I cannot understand. In the first example below, the TIMEX3 for the date should contain the year "2014", and in the second it should be "lundi 24 février" and not "lundi 2".

All these kinds of temporal expressions have been described:
day_of_week date month year
day_of_week date month
day_of_week date
I apply the grammar in "longest matches" mode, so it should have detected "dimanche 1er décembre 2014" and "lundi 24 février".

I think there is still an issue with <TOKEN> consuming part of the temporal expressions, but I don't know how to prevent that. Does anyone know how to deal with this?
Thanks,

Lucie


Dimanche 1er décembre 2014 : COUSSEGREY - 40ème Brevet des Dagoniots. Randonnée Pédestre de 8-15-20 km. Départ à 10 h sur la place.
<TIMEX3 tid="t1" type="DATE" value="XXXX-XX-XX" temporalFunction="true" valueFromFunction="tf1">
Dimanche
1er
décembre
</TIMEX3>
 2014 : COUSSEGREY - 40ème Brevet des Dagoniots. Randonnée Pédestre de 8-15-20 km. Départ à
<TIMEX3 tid="t1" type="TIME" value="TXX:XX">
10 h
</TIMEX3>
<TLINK/>
 sur la place.


Lundi 24 février : SAINT-MARDS-EN-OTHE - CINEMA - Sur la terre des Dinosaures à 17 h au lavoir. 
<TIMEX3 tid="t1" type="DATE" value="XXXX-XX-XX" temporalFunction="true" valueFromFunction="tf1">
Lundi
2
</TIMEX3>
4 février : SAINT-MARDS-EN-OTHE - CINEMA - Sur la terre des Dinosaures à
<TIMEX3 tid="t1" type="TIME" value="TXX:XX">
17 h
</TIMEX3>
<TLINK/>
 au lavoir.
graphe_link_date-time.JPG

Denis Maurel

Jul 4, 2014, 11:28:26 AM
to Drat Lucie, unitex-...@googlegroups.com


Hi Lucie,

Parsing numbers is not simple, because each digit is a token!

We use a cascade to first build numbers as multiwords. For instance
First graph: 24 -> {24,.number+1a31}
Second one: 24 février is recognized by: <number+1a31> février
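To make the point about digits concrete, here is a minimal Python sketch, not Unitex itself. The function names `tokenize` and `merge_digits` are hypothetical, and the tokenization rule (a run of letters is one token, any other character is its own token) is only an approximation assumed for illustration. It shows why "24" arrives as two tokens and what a first pass that rebuilds numbers could look like.

```python
import re

def tokenize(text):
    """Toy tokenizer (assumed rule): a run of letters is one token,
    every other character, including each digit and each space, is its own token."""
    return re.findall(r"[^\W\d_]+|\s|\S", text)

def merge_digits(tokens):
    """Toy first pass of a cascade: glue adjacent digit tokens into one number token."""
    merged = []
    prev_was_digit = False
    for tok in tokens:
        if tok.isdigit() and prev_was_digit:
            merged[-1] += tok      # extend the number started by the previous digit
        else:
            merged.append(tok)
        prev_was_digit = tok.isdigit()
    return merged

tokens = tokenize("Lundi 24 février")
print(tokens)                # "2" and "4" are separate tokens
print(merge_digits(tokens))  # "24" is now a single unit
```

After such a first pass, a second graph can refer to the whole number as one unit, which is what the tagged form in the example above achieves.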

Best regards,

Denis Maurel


____________________________________
Professor Denis Maurel
Université François Rabelais Tours
LI (Computer Science Research Laboratory)
EPU-DI
64 avenue Jean-Portalis
37200 Tours
France
Phone: 33-2.47.36.14.35
Fax: 33-2.47.36.14.22
mailto:denis....@univ-tours.fr

http://www.univ-tours.fr/maurel

http://www.li.univ-tours.fr
http://tln.li.univ-tours.fr/




eric.laporte

Jul 7, 2014, 5:12:22 AM
Hi Lucie,

The problem is that the transducer with (![<TIME>] <TOKEN>)* is ambiguous: several matches for the same text with the same length give different outputs. In that case, when you produce an output, the system chooses any of the matches with the same length. The match that reaches the end of the date is not particularly preferred by the system as compared to those that shift to (![<TIME>] <TOKEN>)* from inside the date. The Longest match option does not help, because all these matches end at the same point in the text.
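The ambiguity can be illustrated with a small Python sketch. This is a toy model, not the Unitex matcher: the word lists, the function name `date_prefix_lengths`, and the simplified date grammar (weekday, then one or two digit tokens, then an optional month) are all assumptions made for the example. It lists every point at which the date part could stop and the filler loop could take over.

```python
WEEKDAYS = {"lundi", "dimanche"}   # assumed toy lexicon
MONTHS = {"février", "décembre"}   # assumed toy lexicon

def date_prefix_lengths(tokens):
    """Return every k such that tokens[:k] is a date under the toy grammar."""
    lengths = []
    if not tokens or tokens[0].lower() not in WEEKDAYS:
        return lengths
    for n_digits in (1, 2):
        day = tokens[1:1 + n_digits]
        if len(day) == n_digits and all(t.isdigit() for t in day):
            k = 1 + n_digits
            lengths.append(k)                  # weekday + day number
            if k < len(tokens) and tokens[k].lower() in MONTHS:
                lengths.append(k + 1)          # weekday + day number + month
    return lengths

# Digits are separate tokens; spaces are left out to keep the example short.
tokens = ["Lundi", "2", "4", "février", ":", "CINEMA", "à", "17h"]
for k in date_prefix_lengths(tokens):
    date, filler = tokens[:k], tokens[k:-1]
    # Every analysis covers all of `tokens`, so they all have the same length.
    print("date =", date, "| filler =", filler, "| time =", tokens[-1])
```

All three analyses end at the same token, so a longest-match criterion has nothing to distinguish them, and the one that stops the date at "Lundi 2" is as acceptable as the one that reaches "février".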
You have two solutions:
- Use a cascade (user manual chapter 12 p. 245), as Denis suggests. In a first pass, you recognise the dates and the times separately. In a second pass, you link dates with times.
- Add weights (user manual section 5.2.4 page 104) to control the choice among the paths.
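The two-pass idea behind the first solution can be sketched in Python with regular expressions. This is not CasSys and does not use Unitex graphs; the patterns, the `<DATE>`/`<TIME>` placeholder tags and the function names are hypothetical stand-ins chosen for the example. The first pass annotates dates and times independently, and the second pass only looks at the tags already in place, so the filler between them can no longer eat into a date.

```python
import re

# Pass 1 patterns (assumed, covering only the examples in this thread).
DATE = re.compile(
    r"\b(?:lundi|dimanche)\s+\d{1,2}(?:er)?"
    r"(?:\s+(?:février|décembre)(?:\s+\d{4})?)?",
    re.IGNORECASE,
)
TIME = re.compile(r"\b\d{1,2}\s*h\b")

# Pass 2 pattern: an already tagged date, any text, then an already tagged time.
LINK = re.compile(r"(<DATE>.*?</DATE>)(.*?)(<TIME>.*?</TIME>)", re.DOTALL)

def first_pass(text):
    """Tag dates and times separately, each with its own pattern."""
    text = DATE.sub(lambda m: f"<DATE>{m.group(0)}</DATE>", text)
    return TIME.sub(lambda m: f"<TIME>{m.group(0)}</TIME>", text)

def second_pass(text):
    """Add a link marker after a tagged time that follows a tagged date."""
    return LINK.sub(r"\1\2\3<TLINK/>", text)

def annotate(text):
    return second_pass(first_pass(text))

print(annotate("Lundi 24 février : CINEMA à 17 h au lavoir."))
```

Because the date boundaries are fixed before the linking step runs, the second pass has a single way to read the text. A real cascade would also need to decide what to do when several dates or times occur in the same sentence, which this sketch does not address.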
Best regards,

Eric Laporte

Drat Lucie

Jul 10, 2014, 10:06:52 AM
to unitex-...@googlegroups.com
Hi,
I tried with CasSys and it worked, thanks a lot.

Drat Lucie