Bug in Okapi SRX implementation, or am I missing something?


Marc Mittag

Jan 8, 2026, 12:53:15 PM
to okapi-users

Dear all,

With the SRX file below, I get two segments when segmenting the sentence

Die Test GmbH + Co. KG mit Sitz in Stuttgart ist cool.


but I would expect to get only one. Is this a bug, or am I missing something?

Thank you very much in advance!

best

Marc

=== SRX file ===

<?xml version="1.0" encoding="UTF-8"?>
<srx xmlns="http://www.lisa.org/srx20" xmlns:okpsrx="http://okapi.sf.net/srx-extensions" version="2.0">
<header segmentsubflows="yes" cascade="no">
<formathandle type="start" include="no"></formathandle>
<formathandle type="end" include="yes"></formathandle>
<formathandle type="isolated" include="no"></formathandle>
<okpsrx:options oneSegmentIncludesAll="no" trimLeadingWhitespaces="yes" trimTrailingWhitespaces="yes" useJavaRegex="yes" useIcu4JBreakRules="no" treatIsolatedCodesAsWhitespace="no"></okpsrx:options>
<okpsrx:sample language="de" useMappedRules="yes">Die Test GmbH + Co. KG mit Sitz in Stuttgart ist cool.</okpsrx:sample>
<okpsrx:rangeRule></okpsrx:rangeRule>
</header>
<body>
<languagerules>
<languagerule languagerulename="German">
<rule break="no">
<beforebreak>\bCo\.\s</beforebreak>
<afterbreak></afterbreak>
</rule>
<!--sentence final punctuation (incl. quotation marks) - GERMAN-->
<rule break="yes">
<beforebreak>[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]+\s*[\p{Pe}\p{Pf}\p{Po}"'"'‘’“”]*\s*[\.?!]+\s*[\p{Pe}\p{Pf}\p{Po}"'"'‘’“”]*</beforebreak>
<afterbreak>\s+['"\p{Ps}\(]*[\p{Lu}\p{N}]</afterbreak>
</rule>
</languagerule>
</languagerules>
<maprules>
<languagemap languagepattern="(DE|de).*" languagerulename="German"></languagemap>
</maprules>
</body>
</srx>


Chase Tingley

Jan 8, 2026, 2:54:14 PM
to Marc Mittag, okapi-users
I don't have a full answer, but I debugged into this a little bit and learned something.


The segmenter goes through each rule in order, and for every match it records the character offset of the end of that match (not the start!).
In this case, your "no" rule matches the range "Co. " (note the trailing whitespace), so it is recorded at position 20. The "yes" rule only matches "Co." (no trailing whitespace), so it is recorded at position 19.

The list of final split positions is built from the set of positions that have a "yes" rule, which in this case is just 19. So the "no" rule ends up doing nothing. I'm not an expert on SRX, but with this implementation it looks like it would work correctly if the "yes" and "no" rules ended at the same offset: in that case, the earlier "no" rule would block the later "yes" rule from being recorded at that location. So that means either modifying the "no" rule to use a zero-width lookahead for the whitespace, or tweaking "yes" to consume the whitespace as well.

ct


Marc Mittag

Jan 8, 2026, 3:34:19 PM
to Chase Tingley, okapi-users

Hi Chase, hi all,

Thank you very much for looking into this, it is very interesting!

We now know for sure how to work around the problem. Still, I'm not sure whether the SRX implementation is correct this way. If I define a no-break rule above the break rule and the no-break rule matches, I would not expect any break rule further down to override it.

Is there anyone else who can explain how SRX is meant to work here?

best

Marc

Marc

Jan 9, 2026, 8:27:55 AM
to okapi-users
Hi Chase,
sorry, I tested again and this is still unclear. I removed the whitespace from the beforebreak pattern of the no-break rule, so it should now also match at position 19. But it still breaks. Does that mean that the no-break rule must match at a position even lower than the break rule to have any effect?
That would be really weird, error-prone, and counter-intuitive.
Please see the attached example SRX file.
best
Marc

test-okapi.srx

Chase Tingley

Jan 9, 2026, 1:02:21 PM
to Marc, okapi-users
Hi Marc,

I'll have to go back to the code to see what effect the <afterbreak/> rule is having -- I'm not sure why it's not working from looking at it.

I was able to get it to work with the attached, using (?=\s) in the <beforebreak> rule to look for the trailing space.  This does match at position 19, and seems to correctly block the other rule.


lookahead.srx
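
For readers without the attachment, here is a minimal sketch of what such a no-break rule might look like. It is not the content of lookahead.srx (which is not reproduced in this thread), only an illustration of the (?=\s) idea described above. It assumes the header options of the original file (useJavaRegex="yes") and that the rule sits before the break="yes" rule inside the German <languagerule>.

```xml
<!-- Hypothetical sketch, not the attached file: the lookahead checks for the
     space without consuming it, so the match ends right after "Co."
     (position 19 in the sample), the same position the break rule records. -->
<rule break="no">
<beforebreak>\bCo\.(?=\s)</beforebreak>
<afterbreak></afterbreak>
</rule>
```

Since both rules are then recorded at position 19, the earlier no-break rule should take precedence there, which is consistent with the behavior reported in this thread.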

Marc Mittag

Jan 10, 2026, 2:46:24 AM
to Chase Tingley, okapi-users

Hi Chase,

The test-okapi.srx that I sent in my last mail did not work because of my own mistake in the afterbreak pattern of the no-break rule: it simply did not match the part after the dot.

So all is good: if both rules match at the same position, the no-break rule is taken into account.

Please see my attached SRX file, which works.

best

Marc

test-okapi.srx
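
The attachment is not reproduced in this thread. As a rough sketch of this variant, under the assumption that the whitespace was moved out of <beforebreak> and into <afterbreak>: the before part then ends directly after the dot, and the after part only has to match what follows. The exact <afterbreak> pattern in the attached file is not known from the thread; \s is a guess that fits the sample sentence.

```xml
<!-- Hypothetical sketch, not the attached file: <beforebreak> ends after "Co."
     (position 19 in the sample); <afterbreak> only checks the text that follows. -->
<rule break="no">
<beforebreak>\bCo\.</beforebreak>
<afterbreak>\s</afterbreak>
</rule>
```

As with the lookahead variant, the point is that the no-break rule and the break rule are recorded at the same position, so the rule listed first wins.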

Marc Mittag

Jan 10, 2026, 3:26:36 AM
to Chase Tingley, okapi-users

Dear Chase, dear all,

I just added this to the wiki

https://okapiframework.org/wiki/index.php/SRX#Hint:_Knowing_when_a_no-break_rule_will_match

to make this clear for the future.

I think it probably makes sense that SRX works this way, because you have to choose whether the first or the last matching character of a regex is used when deciding whether a no-break rule should prevent a split. Both approaches have their pros and cons, so you simply need to know how it works.

I hope the documentation makes sense as written. If not, please improve it.

best

Marc
