Sentence segmentation issue


Patrick Watrin

Jan 22, 2014, 12:28:29 PM
to unitex-...@googlegroups.com
Hello,

(Houston) We have a problem.

Consider these two texts:

1. {Napoléon,.N+PERS} donne aussitôt les ordres pour le départ. {Napoléon,.N+PERS} se précipite aux avant - postes et , sans doute à la faveur d’ une éclaircie , parvient à distinguer nettement le mouvement de l’ armée ennemie.

2. {Napoléon,.N+PERS} donne aussitôt les ordres pour le départ. Il se précipite aux avant - postes et , sans doute à la faveur d’ une éclaircie , parvient à distinguer nettement le mouvement de l’ armée ennemie.

In the first case, the second sentence begins with a pre-tagged token. This is not the case for the second text.

When you process these two texts, you obtain one sentence for the first text and two for the second. This is quite a big problem for me.

I'm pretty sure the sentence graph is not responsible for this "feature". Actually no... but I can't figure out what is happening.

Thanks in advance for your help.

Patrick.

Denis Maurel

Jan 22, 2014, 3:16:59 PM
to patrick watrin, unitex-...@googlegroups.com


Hi Patrick,

Of course it is the sentence graph: it does not use dictionaries, so it does not recognize dictionary tags!

I added a subgraph that describes dictionary tags, placed before <MAJ> and so on.

We will put it in the Unitex package if no one has a problem with it... Could you try it on other examples?

Best regards,

Denis Maurel


____________________________________
Professor Denis Maurel
Université François Rabelais Tours
LI (Computer Science Research Laboratory)
EPU-DI
64 avenue Jean-Portalis
37200 Tours
France
Phone: 33-2.47.36.14.35
Fax: 33-2.47.36.14.22
mailto:denis....@univ-tours.fr

http://www.univ-tours.fr/maurel

http://www.li.univ-tours.fr
http://tln.li.univ-tours.fr/




Sentence.grf
Etiquette.grf
Sentence.fst2
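
The behaviour discussed above can be illustrated with a small Python sketch. This is not the Unitex Sentence.grf graph: it is a simplified regex stand-in, assuming that a boundary is only proposed when a period is followed by an uppercase word, and that the fix amounts to also accepting a lexical tag such as {Napoléon,.N+PERS} in that position. The names TAG, BEFORE, AFTER and split_sentences are hypothetical.

```python
import re

# Hypothetical stand-in for a lexical tag such as {Napoléon,.N+PERS}.
TAG = r"\{[^{}]*\}"

# Assumed behaviour before the fix: a boundary is a period followed by
# whitespace and an uppercase letter (a rough approximation of <MAJ>/<PRE>).
BEFORE = re.compile(r"(?<=\.)\s+(?=[A-ZÀ-Ý])")

# Assumed behaviour after the fix: a lexical tag may also start a sentence.
AFTER = re.compile(r"(?<=\.)\s+(?=[A-ZÀ-Ý]|" + TAG + r")")


def split_sentences(text, pattern):
    """Split text at every boundary matched by pattern."""
    return pattern.split(text)


text1 = "{Napoléon,.N+PERS} donne les ordres. {Napoléon,.N+PERS} se précipite."
text2 = "{Napoléon,.N+PERS} donne les ordres. Il se précipite."

for name, pattern in (("before", BEFORE), ("after", AFTER)):
    for text in (text1, text2):
        print(name, len(split_sentences(text, pattern)))
```

With the BEFORE pattern the first text stays in one piece because the character after the period is "{", not an uppercase letter; with the AFTER pattern both texts are split in two.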

Patrick Watrin

Jan 22, 2014, 3:40:00 PM
to denis....@univ-tours.fr, unitex-...@googlegroups.com
Hello Denis,

Thanks a lot! I will try your modifications on the whole corpus this evening or early tomorrow and will come back to the list right after.

Thanks again for your help.

Patrick.

Denis Maurel

Jan 23, 2014, 3:03:40 AM
to patrick watrin, unitex-...@googlegroups.com


Hi Patrick,

Sorry, yesterday evening I forgot the first <MAJ>+<PRE>+<NB> in the subgraph!
Sentence.fst2
Etiquette.grf

Patrick Watrin

Jan 24, 2014, 8:10:36 PM
to denis....@univ-tours.fr, unitex-...@googlegroups.com
Hello,

Thanks a lot, Denis, your solution was perfect. We just modified the "etiquette" graph to fit our needs. You will find it attached.

Patrick.
Etiquette.grf

eric.laporte

Jan 26, 2014, 5:28:58 AM
to unitex-...@googlegroups.com, denis....@univ-tours.fr, patrick...@knowbel.com
Hi,
Here is a version of the Etiquette subgraph (attached graph) that should fit Patrick's needs (including his example with Napoleon) but also checks the structure of the lexical tag, as in Denis' version.
Eric Laporte
Etiquette.grf

Denis Maurel

Jan 26, 2014, 12:12:58 PM
to eric.laporte, unitex-...@googlegroups.com, patrick watrin


Hi Eric,

You and I both forgot "=" in features!
Is it really possible to use ',?,\,/ etc. in features?

Best regards,

Denis Maurel



eric.laporte

Mar 1, 2014, 11:57:13 AM
to unitex-...@googlegroups.com, patrick watrin, denis....@univ-tours.fr
Dear Denis,
You are right. I added '=' among the possible characters in Etiquette.grf (attached file). I don't know about the other characters in semantic features (' ? \ /), but I think they are harmless in this graph. I also allowed commas and dots in words (O.N.U.). I will commit the new versions soon, unless someone complains.
Thanks,
Eric
Etiquette.grf
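
To make the structural check on lexical tags more concrete, here is a minimal Python sketch. It is not a transcription of Etiquette.grf: it assumes a tag of the form {form,lemma.CODE+feature...:inflection...}, lets the form contain commas and dots (as in O.N.U.), and accepts "=" inside features. The names LEXICAL_TAG and parse_tag, and the feature Org=yes, are hypothetical; backslash escaping is not handled.

```python
import re

# Sketch of the assumed tag structure {form,lemma.CODE+feat+key=value:infl}.
LEXICAL_TAG = re.compile(
    r"\{"
    r"(?P<form>[^{}]+)"               # form; greedy, so it may contain , and .
    r","
    r"(?P<lemma>[^{},.]*)"            # lemma; may be empty
    r"\."
    r"(?P<code>[^{}+:,.]+)"           # grammatical code, e.g. N
    r"(?P<features>(?:\+[^{}+:]+)*)"  # features such as +PERS or +key=value
    r"(?P<inflection>(?::[^{}:]+)*)"  # optional inflectional codes
    r"\}"
)


def parse_tag(token):
    """Return the parts of a lexical tag, or None if the token is not one."""
    match = LEXICAL_TAG.fullmatch(token)
    if match is None:
        return None
    return {
        "form": match.group("form"),
        "lemma": match.group("lemma"),
        "code": match.group("code"),
        "features": [f for f in match.group("features").split("+") if f],
    }


print(parse_tag("{Napoléon,.N+PERS}"))
print(parse_tag("{O.N.U.,.N+Org=yes}"))
print(parse_tag("Il"))
```

Because the form is matched greedily, the split between form and lemma falls on the last comma that still allows the rest of the tag to match, which is what lets O.N.U. survive as a single form in this sketch.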

Denis Maurel

Mar 2, 2014, 12:25:04 PM
to eric.laporte, unitex-...@googlegroups.com, patrick watrin


Hi Eric,

It is fine with me. Could you also update the Unitex manual, Figure 2.10 page 39?
Thanks

Best regards,

Denis Maurel

