Possible state cleaning issue with MarkDownFilter...

19 views
Skip to first unread message

jim

unread,
Jan 28, 2022, 5:00:42 PM1/28/22
to Group: okapi-devel
*Note: the fourth item uses the Unicode character for [Roman numeral four][2].*

First filter of the content above produces correct open/close tags for "*" as expected.

Close the filter but do not re-create it and filtering the same content makes the "*" open/close isolated tags.

I suspect that not all the reset logic is being applied to MarkDownFilter.close()

Jim

krsk...@gmail.com

unread,
Jan 29, 2022, 7:22:54 PM1/29/22
to okapi-devel
Jim,
Do you have a unit test case? I added this test case to MarkdownFilterTest but it passes...

@Test
public void testCloseMightNotResettingStatusProperly() throws Exception {

    String snippet = "*Note: the fourth item uses the Unicode character for [Roman numeral four][2].*";

    try (MarkdownFilter filter = new MarkdownFilter()) {
        ArrayList<Event> events = FilterTestDriver.getEvents(filter, snippet, null);
        ArrayList<Event> events2 = FilterTestDriver.getEvents(filter, snippet, null);
        boolean areEventListsSame = FilterTestDriver.compareEvents(events, events2);
        assertTrue(areEventListsSame);
    }
}

By "isolated tag", I assume you are referring to the marker used in the text. If I remember correctly, MarkdownFilter always uses  TextFragment.MARKER_ISOLATED (0xE103) for "*". 

Chase Tingley

unread,
Jan 29, 2022, 8:30:24 PM1/29/22
to Group: okapi-devel
I believe the filter was changed fairly recently to use paired tags for a number of markdown formatting constructs, including *.  (Regardless, it should be consistent across runs...)

--
You received this message because you are subscribed to the Google Groups "okapi-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to okapi-devel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/okapi-devel/6cbaccbb-1041-4cb6-8c64-0388ac950a83n%40googlegroups.com.

krsk...@gmail.com

unread,
Jan 29, 2022, 10:37:21 PM1/29/22
to okapi-devel
I ran this unit test case after pulling from dev and rebuilt Okapi. I inspected the text field and the marker is \xE103.
The corresponding Code has its tagType="OPENING", however. Maybe that is what you are talking about?

jim

unread,
Jan 31, 2022, 2:00:05 PM1/31/22
to okapi...@googlegroups.com, krsk...@gmail.com
Yes, the second pass keeps the "*" tags OPEN/CLOSE but the char marker in TextFragment is changed to isolated.

I found this with the integration tests which reproduce a realistic pipeline - the code matcher found the differences and complained. That's the easiest way to test it. Just drop a new file in with the example string and create a new test case for that specific file.

Jim

jim

unread,
Jan 31, 2022, 2:05:27 PM1/31/22
to okapi...@googlegroups.com, krsk...@gmail.com
Kuro - I would also look at the mutator methods like setParameters to see if those have any side effects. I suspect its just a buffer not cleared out of a flag reset. But just a guess.

I'm slammed with other tasks sorry couldn't give a better unit test.

Jim

krsk...@gmail.com

unread,
Feb 1, 2022, 3:44:59 PM2/1/22
to okapi-devel


On Monday, January 31, 2022 at 11:00:05 AM UTC-8 Jimbo wrote:
Yes, the second pass keeps the "*" tags OPEN/CLOSE but the char marker in TextFragment is changed to isolated.


What is puzzling me is that if I run my unit test case, the isolated marker is used in the first pass too.

 
I found this with the integration tests which reproduce a realistic pipeline - the code matcher found the differences and complained. That's the easiest way to test it. Just drop a new file in with the example string and create a new test case for that specific file.

Could you tell me which integration test I should try? 


jim

unread,
Feb 1, 2022, 3:57:32 PM2/1/22
to okapi...@googlegroups.com, krsk...@gmail.com
Add this to the MarkDownRoundtripIT test. Then you can step through extraction, segmentation, merge etc... Also see the xliff output file.

@Test
    public void debug() throws FileNotFoundException, URISyntaxException {
        final File file = root.in("/markdown/test.md").asFile();
        runTest(false, file, null, null, new FileComparator.EventComparator());
--
You received this message because you are subscribed to the Google Groups "okapi-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to okapi-devel...@googlegroups.com.

krsk...@gmail.com

unread,
Feb 4, 2022, 2:09:18 AM2/4/22
to okapi-devel
I ran this test case as prescribed by Jim under the debugger.  The test case failed in a different way.

    runTest(false, file, null, null, new FileComparator.EventComparator());
eventually called net.sf.okapi.common.integration.EventRoundTripIT.runTest(EventRoundTripIT.java:79):

                        try (RawDocument ord = new RawDocument(Util.toURI(original), "UTF-8", source, target);
                                        RawDocument trd = new RawDocument(Util.toURI(tkitMerged), "UTF-8", source, target)) {
                                filter.setFilterConfigurationMapper(mapper);
                                final List<Event> o = IntegrationtestUtils.getEvents(filter, ord, params);
                                final List<Event> t = IntegrationtestUtils.getEvents(filter, trd, params);
                                assertTrue("Compare Lines: " + f, comparator.compare(t, o));
                        }

The test failed because the compare(t, o) returned false.

The o[1] was the TextUnit event and it's resourse.parts[0].text.text was using the ISOLATED marker 0xE103 as shown here:
oTextUnit.png
The same place in t[1] didn't have a code marker at all. It was replaced by an open square bracket ("["):
tTextUnit.png

I ran the round trip test using tikal.sh (tikal.sh -x test.md followed by tikal.sh -m test.md.xlf), and symptom was different.
These error messages were displayed at the merge time:

$ ./tikal.sh -m test.md.xlf
...
Merging
...

Code mismatch in TextUnit tu1
Can't find matching Code(s) id='3' originalId='' data='<g id=3>', Closing data='2'
Can't find matching Code(s) id='3' originalId='3' data='</g>'
Can't find matching Code(s) id='1' originalId='' data='*'
Done in 1.635s

$ cat test.out.md
Note: the fourth item uses the Unicode character for [Roman numeral four][2.]
$ cat test.md


*Note: the fourth item uses the Unicode character for [Roman numeral four][2].*

Not only both occurrences of "*" have been removed,  "[2]." has become "[2.]".

I'll create an issue tomorrow and analyze it further.

jim

unread,
Feb 4, 2022, 2:05:08 PM2/4/22
to okapi...@googlegroups.com, krsk...@gmail.com
Kuro,

We have to be very careful when using simplified codes like x and g (is that default for Tikal?). All our integration tests use non-simplified codes. Technically it should work both ways, but I have noticed that simplified codes can be hard to match between source and target because they loose info like the original data.

Can you try tikal without simplified codes and see what difference that makes? I may need to make a special code compare case for simplified codes - but we don't mark them in any special way.

Jim

jim

unread,
Feb 4, 2022, 2:31:24 PM2/4/22
to okapi...@googlegroups.com, krsk...@gmail.com

almost forgot - please use the branch "fix_textunitmerger" (now PR https://bitbucket.org/okapiframework/okapi/pull-requests/590) this may fix the actual code match issue you saw with Tikal.

Jim

krsk...@gmail.com

unread,
Feb 4, 2022, 5:14:28 PM2/4/22
to okapi-devel
Tikal, built from the fix_textunitmerger branch, behaves slightly differently.
These errors were displayed:

Code mismatch in TextUnit tu1

Can't find matching Code(s) id='1' originalId='' data='</Xpt>'


Can't find matching Code(s) id='1' originalId='' data='*'

Code mismatch in TextUnit tu1

Can't find matching Code(s) id='1' originalId='' data='</Xpt>'
Can't find matching Code(s) id='4' originalId='' data=']'

The end result was:

$ cat test.out.md
*Note: the fourth item uses the Unicode character for [Roman numeral four][2.


$ cat test.md
*Note: the fourth item uses the Unicode character for [Roman numeral four][2].*

The first occurrence of "*" was kept but the last square bracket and the last "*" were lost.

I'm not familiar with (or I forgot) the simplified and non-simplified codes. Is this a mode of operation of which steps?
I browsed the tikal code, and I don't see a switch to change the mode. So it must be using whichever the default mode.

Jim Hargrave

unread,
Feb 4, 2022, 5:26:32 PM2/4/22
to krsk...@gmail.com, okapi-devel
The simplified mode is a function of the xliff writer parameters. My opinion tkal should be using the non-simplified mode as default.

jim

unread,
Feb 5, 2022, 1:06:00 PM2/5/22
to okapi...@googlegroups.com, krsk...@gmail.com
Kuro,

If you just run your simple initial unit test can you confirm that the "*" and "*" tags are converted to true OPEN/CLOSE and not ISOLATED? I think you mentioned they were ISOLATED, but this doesn't seem correct to me as they are in the same segment (unless they are separated somehow?). Any tags in the same TextFragment should never be marked as isolated.

Our new logic for code matching is more precise and relies on finding OPEN/CLOSE mates in both source and target. If there are ISOLATED tags we may not get the best match.

Also, here is the code in Tikal to change the xliff placeholder mode. Should we update this permanently? We loose info with simplified placeholders.

        // Filter events to raw document final step (using the XLIFF writer)
        FilterEventsWriterStep fewStep = new FilterEventsWriterStep();
        XLIFFWriterParameters paramsXliff;
        XLIFFWriter writer = new XLIFFWriter();
        fewStep.setFilterWriter(writer);
        paramsXliff = writer.getParameters();
        paramsXliff.setPlaceholderMode(true);  <---- Change to false to get full bpt/ept tags
        paramsXliff.setCopySource(extOptCopy);
        paramsXliff.setIncludeAltTrans(extOptAltTrans);
        paramsXliff.setIncludeCodeAttrs(extOptCodeAttrs);

krsk...@gmail.com

unread,
Feb 8, 2022, 7:46:21 PM2/8/22
to okapi-devel
Jim,
After changing the Tikal code as you suggested, calling paramsXliff.setPlaceholderMode(false), the manual round trip test with tikal succeeded!

But, the generated test.md.xlf suggest that the pair of "*" are still using the isolated marker:

<source xml:lang="en"><it id="1" pos="open">*</it>Note: the fourth item uses the Unicode character for <bpt id="2">[</bpt>Roman numeral four<ept id="2">]</ept><bpt id="3">[</bpt><ept id="3">2</ept><it id="4" pos="close">]</it>.<it id="1" pos="close">*</it></source>

This matches with my memory (dissatisfaction) of MarkdownFilter code: many inline codes, including "*emphasis*" that should be treated as a pair aren't represented as such.

Also notice that "[2]" part is paired incorrectly. The number 2 is marked as a closing tag for "[" and the real closing tag "]" is an isolated closing tag. Perhaps Flexmarker parser that MarkdownFilter uses simply doesn't support Full Reference Link

krsk...@gmail.com

unread,
Feb 8, 2022, 7:47:47 PM2/8/22
to okapi-devel
Forgot to mention that these errors were reported during the merge:

Code mismatch in TextUnit tu1
Can't find matching Code(s) id='1' originalId='' data=']'

Can't find matching Code(s) id='4' originalId='' data=']'

krsk...@gmail.com

unread,
Feb 8, 2022, 8:23:14 PM2/8/22
to okapi-devel

jim

unread,
Feb 9, 2022, 9:18:28 AM2/9/22
to okapi...@googlegroups.com, krsk...@gmail.com
The first pass of the integration test produces bpt/ept codes (segmentation step maybe?), the merge pass produces it codes. I get the same logged warnings as you.

I think this might be fixed with a proper balanceCodes call in the filter.  The open/close tags have the same id.

I will make some adjustments and add it to my branch if possible.

I also think we need to change the Tikal xliff default output - but I will check on that as well as in this case id's should have been enough to match.

thanks!!

Jim

jim

unread,
Feb 9, 2022, 1:00:04 PM2/9/22
to okapi...@googlegroups.com, krsk...@gmail.com
I confirmed the Segmenter is changing the isolated codes to paired codes (bpt/ept). It happens here:    getSource().getSegments().create(segmenter.getRanges());

Since we get both the **source** and target content from the xliff file - then try to match the original file source (unsegmented) on merge - that's why we are getting mismatches.

I think what we need for the MarkDownFilter is a smart post-processor that can repair these illegal codes before they leave the filter.

Jim

krsk...@gmail.com

unread,
Feb 17, 2022, 3:20:01 AM2/17/22
to okapi-devel
I am sure that MarkdownFilter is not handling the full link references correctly and that is partly to blame (if not fully responsible) for a strange behavior. I created issue #1125 Markdown should generate separate translation unit for link titles that probably causing a round-trip test failure, and a related issue #1124 MarkdownFilter is not handling full reference links correctly, generating three codes for [link-label].

jim

unread,
Feb 17, 2022, 8:40:09 AM2/17/22
to okapi...@googlegroups.com, krsk...@gmail.com

krsk...@gmail.com

unread,
Feb 17, 2022, 5:10:18 PM2/17/22
to okapi-devel
Oops, I referenced these two new issues incorrectly.
#1124 MarkdownFilter is not handling full reference links correctly, generating three codes for [link-label] is the direct issue.
Sorry for the confusion.
I will take a look at #1124 in a couple of days.

Reply all
Reply to author
Forward
0 new messages