Cannot locate pattern when file is too large using <TOKEN>


Amal Htait

Apr 24, 2015, 10:09:52 AM
to unitex-...@googlegroups.com
Dear All,

I'm trying to locate a certain pattern in XML data (stored in a .txt file).
The pattern is shown in the attached image (flow.png) and can be described as follows:

<div>
<bibl>  any characters (so I used <TOKEN>)  </bibl>  -> this line can be repeated several times
</div>

I have also attached the example file (data.txt) that I'm using, but the Unitex graph does not detect anything in it.
It contains 17 lines of the form: <bibl>  any characters </bibl>
but when I use only the first 8 (or the last 11), the graph detects the pattern.

It seems that when there is too much data, the pattern is not detected.

Any idea how to fix it?

Thank you,
Amal HTAIT
flow.png
data.txt

Denis Maurel

Apr 24, 2015, 10:58:56 AM
to Amal Htait, unitex-...@googlegroups.com


Dear Amal,

The problem is the following: <TOKEN> matches any token, so you expect the match to end at the first </bibl>, but it can also end at the second one, and so on. So the maximal possible search is reached once there are more than 8 lines...
You have to replace <TOKEN> with <MOT>+<NB>+,+... (everything except "<").
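
A rough analogy in Python regular expressions (not Unitex syntax) may help show the ambiguity described above. The sample text and the variable names are made up for illustration; "any character" plays the role of <TOKEN>, and "anything except <" plays the role of the restricted box.

```python
import re

# Hypothetical sample: two <bibl> elements inside one <div>.
text = "<div><bibl>Anis, Bruxelles :</bibl><bibl>Second entry</bibl></div>"

# "Any character" between the tags: the match may run past the first </bibl>
# and only stop at the last one.
greedy = re.findall(r"<bibl>(.*)</bibl>", text)

# Anything except "<": the match is forced to stop at the first closing tag.
bounded = re.findall(r"<bibl>([^<]*)</bibl>", text)

print(greedy)   # a single over-long match spanning both elements
print(bounded)  # one match per <bibl> element
```

This only sketches why an unrestricted wildcard creates many possible end points; it does not reproduce how Unitex explores or limits those paths.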

Best regards,

Denis Maurel


____________________________________
Professor Denis Maurel
Université François Rabelais Tours
LI (Computer Science Research Laboratory)
EPU-DI
64 avenue Jean-Portalis
37200 Tours
France
Phone: 33-2.47.36.14.35
Fax: 33-2.47.36.14.22
mailto:denis....@univ-tours.fr

http://www.univ-tours.fr/maurel

http://www.li.univ-tours.fr
http://tln.li.univ-tours.fr/


Amal Htait

Apr 24, 2015, 4:31:54 PM
to denis....@univ-tours.fr, unitex-...@googlegroups.com
Dear Denis,

Thank you for your reply :)

Your solution is actually what I was working on before I switched to <TOKEN>.
But I have to handle a wide range of possible characters between <bibl> and </bibl>, and I also ran into some strange problems such as the one below:

If you replace <TOKEN> with <MOT>+#+\,+\:+\.
and then test it on:
<div>
<bibl>Anis, Bruxelles:</bibl>
</div>

It will detect the pattern.

But when tested on the following (I only added a space before and after the colon):
<div>
<bibl>Anis, Bruxelles : </bibl>
</div>

It won't detect it.

Thanks,
Amal HTAIT

Paea LePendu

Apr 25, 2015, 11:02:45 AM
to unitex-...@googlegroups.com
You can try a few things:

1. Shortest match in a cascade (tricky)

2. A morphological filter on the token to stop the match at the first angle bracket symbol: <TOKEN><<[^<]>> (problematic with nested XML)

3. A negative right context that looks ahead, detects a specific stop pattern, and backtracks: $![</bibl>$] (also tricky)
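
Options 2 and 3 can be compared with a loose Python regex analogy (again not Unitex syntax; the sample text with a nested <hi> tag is invented for illustration). Excluding "<" altogether breaks on nested tags, while a negative lookahead on the closing tag does not.

```python
import re

# Hypothetical sample: the first <bibl> contains a nested <hi> tag.
text = "<div><bibl>Anis, <hi>Bruxelles</hi> : 1990</bibl><bibl>Second</bibl></div>"

# Analogue of option 2: forbid "<" entirely between the tags.
# The first element is missed because of the nested tag.
no_angle = re.findall(r"<bibl>([^<]*)</bibl>", text)

# Analogue of option 3: accept any character, but only at positions
# where "</bibl>" does not start (negative lookahead).
lookahead = re.findall(r"<bibl>((?:(?!</bibl>).)*)</bibl>", text)

print(no_angle)
print(lookahead)
```

The lookahead version stops at the first closing tag even when other tags appear in between, which is the behaviour being asked for in this thread.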

Paea

Denis Maurel

Apr 26, 2015, 3:37:25 PM
to Amal Htait, unitex-...@googlegroups.com


Dear Amal,

Of course that is not enough. My mail was not clear; it was just the idea. You have to complete the path with all the possible separators, except "<".

Amal Htait

Apr 27, 2015, 6:42:28 AM
to denis....@univ-tours.fr, unitex-...@googlegroups.com
Dear Denis,

Ok thanks, I'll work on that.
But do you think I can add something to <TOKEN> to stop it from skipping the first </bibl> ? (one last try :) )
I've tried: <TOKEN><<^([^\<]|(\<[^\/])|(\<\/[^b])|(\<\/b[^i])|(\<\/bi[^b])|(\<\/bib[^l])|(\<\/bibl\>.+))>>
but it didn't work correctly.

Thank you and best regards,
Amal HTAIT

Stephanie Weiser

Apr 27, 2015, 9:38:30 AM
to Amal Htait, denis....@univ-tours.fr, unitex-...@googlegroups.com
Dear Amal,

In my opinion, the easiest way to solve your problem is to use a negation on </bibl>, as in this example:

I hope it helps!

Best regards,
Stéphanie Weiser

Denis Maurel

Apr 28, 2015, 10:17:14 AM
to Stephanie Weiser, Amal Htait, unitex-...@googlegroups.com


Dear Amal and Stephanie,

Is it correct?
I am not sure, because <bibl> is 3 tokens and you compare it to one token (see Figure 6.15 of the Unitex manual).

But your idea, Stephanie, is very good with one token. There are 2 solutions:
1) If you do not have any "<" before <bibl>: exactly your idea, just with "<".
2) If you have other XML tags, you can use a cascade to recognize XML tags as tokens (see our cascade, graph "toolXml.grf", at http://tln.li.univ-tours.fr/Tln_CasEN.html).
neg_token.png

Stephanie Weiser

Apr 28, 2015, 10:36:50 AM
to denis....@univ-tours.fr, Amal Htait, unitex-...@googlegroups.com
Dear Denis,
I've tested my graph and it seems to work, although I don't really know how...
It goes from <bibl> to the next </bibl> even with other xml tags in between.

Denis Maurel

Apr 28, 2015, 11:05:39 AM
to Stephanie Weiser, Amal Htait, unitex-...@googlegroups.com


Dear Stephanie,

OK, that's fine.
But I don't understand Figure 6.15 of the Unitex manual. Does someone have an explanation? Why does Stephanie's graph work?
neg_token.png

eric.laporte

May 6, 2015, 4:41:48 AM
to unitex-...@googlegroups.com
Dear Amal,

In your April 27 post below, your morphological filter assumes </bibl> is a token, but it is not (unless you have redefined the alphabet). Unitex tokenization is alphabet based and each non-letter is a token (manual, 2.5.4), so that </bibl> is a sequence of 4 tokens. A morphological filter which follows <TOKEN> matches only within a single token (manual, 4.7).
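
The token count can be illustrated with a simplified Python sketch. This is only an approximation of alphabet-based tokenization, not the actual Unitex tokenizer: the function name `tokenize` is hypothetical, "letter" here means any Unicode letter rather than the letters of a Unitex alphabet file, and whitespace is dropped for brevity.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into runs of letters and single non-letter characters."""
    # [^\W\d_]+ matches a run of Unicode letters; every other
    # non-whitespace character becomes a token on its own.
    # Assumption: whitespace is discarded here to keep the sketch short.
    return re.findall(r"[^\W\d_]+|\S", text)

print(tokenize("</bibl>"))  # four tokens: "<", "/", "bibl", ">"
print(tokenize("<bibl>"))   # three tokens: "<", "bibl", ">"
```

A filter applied to a single token therefore never sees the whole closing tag at once.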
Best,
Eric Laporte

eric.laporte

May 6, 2015, 5:09:39 AM
to unitex-...@googlegroups.com, step...@earlytracks.com, denis....@univ-tours.fr
Dear Denis and Stephanie,


On Tuesday, 28 April 2015 17:05:39 UTC+2, Denis MAUREL wrote:
<<

But I don't understand Figure 6.15 of Unitex manual?
>>
In Figure 6.15, the box in the negative right context (with the green symbols) checks whether the right context (from that point in the text) is recognized by  <V:K> (verb in the past participle). The recognition is interrupted if it is recognized. If it is not recognized, then the recognition proceeds with the <A> box, back again from the point in the text where the control was before checking the negative right context. Thus, the graph recognizes an <A> which is not a <V:K>.

<<
why the Stephanie's graph is good?
>>
Stephanie's graph follows the same principle. The path in the negative right context is compared to the text, and if there is no match, the box after it is compared too, beginning again from the same point in the text. The matching with the negative right context begins from the same point in the text as the matching with the box after it. But note that there is no requirement that the two matchings would end at the same point. In Stephanie's graph, the path in the negative context checks up to 4 tokens, whereas the box after it checks one. Similarly, in Figure 6.15, <A> and <V:K> may match with text portions of different lengths, e.g. in case one of them is a simple word and the other is multi-word. For example, the negative right context in Figure 6.15 prevents the recognition of a multi-word <A> which begins with a simple-word <V:K>, or of a simple-word <A> which is the first component of a multi-word <V:K>.
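
The principle can be sketched in Python at the token level. This is a hypothetical illustration, not Unitex code: the names `CLOSE_TAG`, `negative_context_matches` and `match_bibl` are invented, and the token list is written by hand. The point is that the negative context inspects up to 4 tokens while the box after it consumes only one, both starting from the same position.

```python
OPEN_TAG = ["<", "bibl", ">"]
CLOSE_TAG = ["<", "/", "bibl", ">"]

def negative_context_matches(tokens, pos):
    """Check whether the closing tag starts at pos (looks at up to 4 tokens)."""
    return tokens[pos:pos + len(CLOSE_TAG)] == CLOSE_TAG

def match_bibl(tokens, start):
    """Return the index just after </bibl> for a match starting at start, or None."""
    if tokens[start:start + len(OPEN_TAG)] != OPEN_TAG:
        return None
    pos = start + len(OPEN_TAG)
    # Loop: the negative context must fail at pos, then the <TOKEN>-like box
    # consumes exactly one token from that same position.
    while pos < len(tokens) and not negative_context_matches(tokens, pos):
        pos += 1
    if negative_context_matches(tokens, pos):
        return pos + len(CLOSE_TAG)
    return None

# Hand-written tokens for: <bibl>Anis,<hi>Bruxelles</hi></bibl><bibl>Second</bibl>
tokens = ["<", "bibl", ">", "Anis", ",", "<", "hi", ">", "Bruxelles",
          "<", "/", "hi", ">", "<", "/", "bibl", ">",
          "<", "bibl", ">", "Second", "<", "/", "bibl", ">"]

first_end = match_bibl(tokens, 0)
second_end = match_bibl(tokens, first_end)
print(first_end, second_end)
```

The nested </hi> tag does not stop the loop because only the full four-token sequence for </bibl> satisfies the negative context check.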
Best,

Denis Maurel

May 6, 2015, 3:45:27 PM
to eric.laporte, unitex-...@googlegroups.com, step...@earlytracks.com


Dear Eric,
The Unitex manual suggests that the two boxes contain one token each, or the same number of tokens. Here you compare four tokens with one token. It is amazing. Maybe you can add this to the Unitex manual.


Best regards,

Denis Maurel



eric.laporte

May 12, 2015, 5:06:25 AM
to unitex-...@googlegroups.com, denis....@univ-tours.fr, denis....@univ-tours.fr
Dear Denis,
Actually, a negative right context is not a comparison. And, to be sure that the manual does not imply that "the two boxes contain the same number of tokens", I inserted an example similar to Stephanie's (Figure 6.16, section 6.3.1) in the English and French manuals. I also added information on weights in right and left contexts (part 6.3) and on Cassys (parts 6.10.7 and 12.2.1). The new version is online.
Best regards,
Eric