Step for inline codes

Alessandro Falappa

Aug 10, 2021, 3:51:12 AM
to Group: okapi-devel
Hello all,
  I was wondering why the inline code finder logic is always woven inside the filters (not all filters have it) rather than factored out into an Okapi Step that can then be added to any pipeline.
Is there a technical reason or was it an explicit design choice?

I am asking because more and more often we find ourselves wanting to look for different placeholder styles in different formats; having a step would allow us to mix and match inline code finding with filters.

Thank you in advance.

Best Regards,


Alessandro Falappa
Integrations Team Leader

Yves Savourel

Aug 10, 2021, 4:06:45 AM
to okapi...@googlegroups.com

Hi Alessandro,

 

I think we did it as part of the filters initially because we tended to put all “filter-related” action in the filter.

But nothing prevents doing this in a step after extraction.

It seems the tendency nowadays is to have several steps to perform the “full” extraction/filtering. It’s more flexible and powerful.

 

Cheers,

-ys


jim

Aug 10, 2021, 3:13:43 PM
to okapi...@googlegroups.com, Yves Savourel
Yes, I was going to say the same thing. In the early days steps were "post-processing".

Sergei's "PreprocessingFilter" filter is interesting in that it creates a single filter from a main filter and any number of steps. Code finder could be one of those steps.

Some day we will clean this up and remove CodeFinder from the filters.

Jim

jim

Aug 10, 2021, 3:26:00 PM
to okapi...@googlegroups.com, Yves Savourel
One of the limitations in some filters is that you can't use code finder and a subfilter at the same time. Moving code finder and (sub-)filters to a step would cleanly solve this issue.

I'm hoping in the near future we can address this redesign.

Jim

Mihai Nita

Aug 11, 2021, 1:01:44 AM
to Group: okapi-devel, Yves Savourel
Although "philosophically" the codes are part of the filter in some cases.

For example `%s` would be part of the filter for .po files, and Java properties, and Android strings xml. But not for json, csv, yaml.

Maybe implementation-wise we would have the functionality in a library, called by the filter by default when it's part of the file format definition, and as a step otherwise.
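
A minimal sketch of that split, written in Python for brevity rather than in Okapi's Java. Every name here (`find_inline_codes`, `po_filter`, `json_filter`, `code_finder_step`) is hypothetical and not an Okapi API; fragments are plain `(kind, text)` tuples standing in for real text-unit content.

```python
import re

def find_inline_codes(text, pattern):
    """Shared library function: split text into ("text" | "code", str) fragments."""
    fragments, pos = [], 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            fragments.append(("text", text[pos:m.start()]))
        fragments.append(("code", m.group()))
        pos = m.end()
    if pos < len(text):
        fragments.append(("text", text[pos:]))
    return fragments

# Simplified printf-style pattern, for illustration only.
PRINTF = r"%(?:\d+\$)?[sd]"

def po_filter(lines):
    """The format defines the placeholder syntax, so the filter calls the library itself."""
    return [find_inline_codes(line, PRINTF) for line in lines]

def json_filter(values):
    """Generic format: no placeholder syntax is implied, so no code finding is built in."""
    return [[("text", v)] for v in values]

def code_finder_step(units, pattern):
    """Optional step: applies the same library to the text fragments of any filter's output."""
    result = []
    for unit in units:
        pieces = []
        for kind, s in unit:
            if kind == "text":
                pieces.extend(find_inline_codes(s, pattern))
            else:
                pieces.append((kind, s))  # existing codes pass through untouched
        result.append(pieces)
    return result
```

With this shape, `po_filter(["Hello %s!"])` finds `%s` on its own, while `code_finder_step(json_filter(["Hi {name}"]), r"\{\w+\}")` adds code finding to a filter that has none.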

Mihai

Alessandro Falappa

Aug 11, 2021, 7:41:42 AM
to Group: okapi-devel
Hi Mihai,
  yes I agree that some formats explicitly define a base placeholder syntax, but we are seeing more and more users sending us documents in generic formats like JSON, CSV and YAML (and even XLSX!) with placeholders in several syntaxes.

Base placeholder syntax could, and maybe should, be a filter concern, but having a Step to put in a pipeline with other filters allows for more flexibility. Of course inline code finding within a filter should be controllable (but I think it already is).

Regards


Alessandro Falappa
Integrations Team Leader

jim

Aug 11, 2021, 10:06:43 AM
to okapi...@googlegroups.com
We are seeing the same. That's why I'd like to get the subfilter-as-step code working so that we can add as many subfilters as needed. The inline code finder could be just another subfilter step.

We will need to come up with a nice way to configure all this.

Jim

krsk...@gmail.com

Aug 11, 2021, 11:33:22 AM
to okapi-devel
I wasn't sure what this discussion is really about but if we are talking about making a step that does subfiltering, let me remind you that this was attempted and failed twice.


I thought I had written up the difficulties I faced, with examples, at least once, maybe twice, but I couldn't find that post :-(

I think the Subfiltering Step can work if we accept a restriction: each part of the TU between a pair of codes from the main filter must be well-balanced from the subfilter's point of view. What does this mean? Suppose the lines below are in an OpenXML text document and we want to apply the HTML filter as a subfilter:

(1) This <em> can be </em> processed by Subfiltering Step.

(2) This <em> cannot be </em> processed by Subfiltering Step.


For (1), the main filter, the OpenXML Filter in this case, will create a TU like:
[C1]This [/C1]<em> can be </em> [C2]processed by[/C2] Subfiltering Step.
where [Cx] is a code representing the beginning of a run of the same style and [/Cx] is the closing of the run.
If we inspect each part:
This: This doesn't have any HTML tag, so it is balanced.
<em> can be </em>: Balanced.
processed by: Balanced.
Subfiltering Step.: Balanced.

For (2), the main filter produces:
[C1] This <em> cannot[/C1] be </em> processed by Subfiltering Step.
Why can't this be processed?
This <em> cannot: The ending em tag is missing.
be </em> processed by Subfiltering Step.: The starting em tag is missing.

Even if we accept this restriction, flagging violations of it is difficult.
If a part includes "<p>This may be OK." Is this balanced? HTML does not require </p>, but what if </p> appears in the later part?
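
To make the restriction concrete, here is a rough per-part balance check, sketched in Python. It is only an illustration under stated assumptions: `is_balanced`, the `TAG` regex, and the small `OPTIONAL_END` and `VOID` sets are hypothetical, and a real check would have to follow the subfilter's own parsing rules.

```python
import re

# Matches an opening or closing tag and captures ("/" or "", tag name).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

# Small illustrative subsets, not the full lists from the HTML specification.
OPTIONAL_END = {"p", "li", "td", "tr"}   # end tag may be omitted
VOID = {"br", "hr", "img"}               # never have an end tag

def is_balanced(part):
    """True if no tag in `part` depends on a tag outside of it."""
    stack = []
    for closing, name in TAG.findall(part):
        name = name.lower()
        if name in VOID:
            continue
        if not closing:
            stack.append(name)
            continue
        # An end tag implicitly closes optional-end elements still open above it.
        while stack and stack[-1] != name and stack[-1] in OPTIONAL_END:
            stack.pop()
        if not stack or stack[-1] != name:
            return False          # end tag without a start tag in this part
        stack.pop()
    # Whatever is still open must be allowed to stay open.
    return all(name in OPTIONAL_END for name in stack)

# The two parts of example (2), split at the main filter's codes.
parts = ["This <em> cannot", " be </em> processed by Subfiltering Step."]
print([is_balanced(p) for p in parts])   # both parts fail
```

The ambiguity shows up directly: `is_balanced("<p>This may be OK.")` passes because `</p>` is optional, yet a later part that contains only the stray `</p>` fails, and the check cannot tell which of the two parts is at fault.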

Alessandro Falappa

Aug 11, 2021, 12:02:32 PM
to Group: okapi-devel
Hi Kuro,
  no, I didn't mean a step that does sub-filtering but a step that does inline code finding.

The idea presented by Jim is however intriguing.

Regards


Alessandro Falappa
Integrations Team Leader

jim

Aug 11, 2021, 2:06:39 PM
to okapi...@googlegroups.com, krsk...@gmail.com
Kuro - that restriction is already in place with the current subfilter design, so I don't see a new problem. You expect the parent filter to produce a block that will be "well-formed" from the subfilter's perspective, whatever well-formed means for the sub-format.


  >>If a part includes "<p>This may be OK." Is this balanced? HTML does not require </p>, but what if </p> appears in the later part?

It is the responsibility of the parent filter to extract the self-contained block. But the only requirement should be that the block won't cause a parse error given the specific subfilter+configuration. In your specific example the HTML filter can parse "<p>This may be OK." just fine. If there are more tags then the parent filter should include those too.

Some formats may be too difficult to isolate blocks for subfiltering (but again that's not a new problem) - in that case the feature won't be supported. However, it is a good question of how that can be enforced in the pipeline.

One thing we do need to move forward with this is Mihai's stream modifications so we can expand and contract the events. A subfilter would create more events than the parent filter (one to many), etc.
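
As a rough illustration of the one-to-many case (not Mihai's actual stream design, which is not described in this thread), a step can be modeled as a generator that may yield several events for each event it receives. The `(kind, payload)` tuples and the `subfilter_step` and `split_paragraphs` functions below are hypothetical stand-ins, written in Python rather than Okapi's Java.

```python
import re

def subfilter_step(events, should_expand, subfilter):
    """Pass events through, replacing selected ones with the subfilter's output."""
    for event in events:
        if should_expand(event):
            yield from subfilter(event)   # one event in, possibly many out
        else:
            yield event

def split_paragraphs(event):
    """Toy 'subfilter': turn each <p>...</p> chunk into its own text unit."""
    kind, payload = event
    for chunk in re.split(r"</?p>", payload):
        if chunk.strip():
            yield (kind, chunk.strip())

parent_events = [
    ("START_DOCUMENT", ""),
    ("TEXT_UNIT", "<p>One</p><p>Two</p>"),
    ("END_DOCUMENT", ""),
]
expanded = list(
    subfilter_step(parent_events, lambda e: e[0] == "TEXT_UNIT", split_paragraphs)
)
```

Here the parent's single text unit becomes two, while the document-level events pass through unchanged.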

Anyway, this is all a ways off in the future. Plenty of time to discuss later. I just wanted to make sure this is still on the radar.

Jim
 

Mihai Nita

Aug 12, 2021, 4:04:59 PM
to Group: okapi-devel
Absolutely!
I agree that we should have a step.
And it should be configurable to understand the most common "variants" (%s, %2.4d, %1$s, {0}, {foo}, ${foo}, {$foo}, whatever we can identify as widespread).
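
One possible shape for such a configuration, sketched in Python: a table of named placeholder styles that can be switched on per pipeline. The style names, the regular expressions, and `find_codes` are assumptions made for illustration; the printf pattern in particular covers only a simplified subset of conversions.

```python
import re

# Hypothetical style table; names and patterns are illustrative only.
PLACEHOLDER_STYLES = {
    "printf": r"%(?:\d+\$)?[-+0#]*\d*(?:\.\d+)?[sdif]",   # %s, %2.4d, %1$s
    "numbered_brace": r"\{\d+\}",                          # {0}
    "named_brace": r"\{[A-Za-z_]\w*\}",                    # {foo}
    "dollar_brace": r"\$\{[A-Za-z_]\w*\}",                 # ${foo}
    "brace_dollar": r"\{\$[A-Za-z_]\w*\}",                 # {$foo}
}

def find_codes(text, styles):
    """Return (start, end, matched_text) for each placeholder of the enabled styles."""
    if not styles:
        return []
    pattern = re.compile("|".join(f"(?:{PLACEHOLDER_STYLES[s]})" for s in styles))
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]

codes = find_codes(
    "Hello %s, {0} of ${total} at %2.4d",
    ["printf", "numbered_brace", "dollar_brace"],
)
```

Keeping the styles separate means a pipeline for JSON or CSV can enable only the syntaxes a given customer actually uses, which limits false positives.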

Mihai
