Dear all,
With the SRX file below, I get two segments when segmenting the sentence
Die Test GmbH + Co. KG mit Sitz in Stuttgart ist cool.
but I would expect only one. Is this a bug, or am I missing something?
Thank you very much in advance!
best
Marc
=== SRX file ===
<?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20"
xmlns:okpsrx="http://okapi.sf.net/srx-extensions"
version="2.0">
<header segmentsubflows="yes" cascade="no">
<formathandle type="start"
include="no"></formathandle>
<formathandle type="end" include="yes"></formathandle>
<formathandle type="isolated"
include="no"></formathandle>
<okpsrx:options oneSegmentIncludesAll="no"
trimLeadingWhitespaces="yes" trimTrailingWhitespaces="yes"
useJavaRegex="yes" useIcu4JBreakRules="no"
treatIsolatedCodesAsWhitespace="no"></okpsrx:options>
<okpsrx:sample language="de" useMappedRules="yes">Die Test GmbH + Co. KG mit Sitz in Stuttgart ist cool.</okpsrx:sample>
<okpsrx:rangeRule></okpsrx:rangeRule>
</header>
<body>
<languagerules>
<languagerule languagerulename="German">
<rule break="no">
<beforebreak>\bCo\.\s</beforebreak>
<afterbreak></afterbreak>
</rule>
<!--sentence final punctuation (incl. quotation marks) -
GERMAN-->
<rule break="yes">
<beforebreak>[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]+\s*[\p{Pe}\p{Pf}\p{Po}"'"'‘’“”]*\s*[\.?!]+\s*[\p{Pe}\p{Pf}\p{Po}"'"'‘’“”]*</beforebreak>
<afterbreak>\s+['"\p{Ps}\(]*[\p{Lu}\p{N}]</afterbreak>
</rule>
</languagerule>
</languagerules>
<maprules>
<languagemap languagepattern="(DE|de).*"
languagerulename="German"></languagemap>
</maprules>
</body>
</srx>
--
You received this message because you are subscribed to the Google Groups "okapi-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to okapi-users...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/okapi-users/f8ea90fc-d17e-4ebe-bace-dfd6bac2796e%40marcmittag.de.
Hi Chase, hi all,
Thank you very much, this is very interesting. Thank you for looking into this!
We now know how to solve this problem. Still, I'm not sure whether the SRX implementation is correct this way. If I define a no-break rule above the break rule and the no-break rule matches, I would not expect any break rule further down to override it.
Can anyone else shed light on how SRX is meant to work here?
best
Marc
Hi Chase,
The test-okapi.srx that I sent in my last mail did not work because of my own mistake in the afterbreak pattern of the no-break rule: it simply did not match the text after the dot.
So all is good: if both rules match at the same position, the no-break rule is applied.
Please see my attached SRX file, which works.
best
Marc
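For readers without the attachment, here is a minimal sketch of what a working no-break rule could look like. It is an illustration, not the attached file, and it assumes the fix is to end beforebreak directly after the dot and move the whitespace into afterbreak. With the original `\bCo\.\s` and an empty afterbreak, the no-break rule matches at the position after the space, while the break rule matches at the position directly after the dot, so the two rules never compete for the same position.

```xml
<!-- Sketch only, not the attached file: the no-break position now sits
     directly after "Co.", the same position the break rule would match. -->
<rule break="no">
<beforebreak>\bCo\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>
```

The rule still has to stay above the break="yes" rule inside the German languagerule, because the first rule that matches at a given position decides whether a break is made there.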
Dear Chase, dear all,
I just added this to the wiki
https://okapiframework.org/wiki/index.php/SRX#Hint:_Knowing_when_a_no-break_rule_will_match
to make that clear for the future.
I think it probably makes sense that SRX works that way, because you have to choose whether the first or the last character matched by a regex is taken into account when deciding whether a no-break rule should prevent a split. Both ways have their pros and cons, so you simply need to know how it works.
I hope the documentation makes sense that way. If not, please improve it.
best
Marc