Okapi Parameters schema and config format...


Jim Hargrave

Jan 29, 2024, 2:01:35 PM
to Group: okapi-devel

Just wanted to clarify a few points on the new Okapi configuration file discussion. Based on our current understanding, there will be two types of configs:

1. Schema File (one file to rule them all).

The current options for the schema file are ProtoBuffers (PB) or JSON Schema. Personally I'm leaning toward ProtoBuffers since we already use it, and adding another layer with JSON Schema would mean learning yet another technology. PB has an import feature that will let us reduce duplication across modules (filters and steps). PB also works across many languages and is well supported by Google.

2. User oriented config file.

This is probably the most controversial. This is a file that a non-programmer will edit. It should have these minimal features:

  • Visually simple and easy to edit manually.
  • Minimal supported types: String, Boolean, Integer, Array, Map and any combination or embedding of these basic types to create complex objects.
  • Editor Support (color coded, high level validation support)
  • Config file is *fully* validated by the schema above.
  • Must support and preserve comments.
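
The minimal type set in the list above could be checked with a small recursive walk over an already-parsed config. This is only a sketch under assumptions: `ConfigTypes` and `isSupported` are hypothetical names (not Okapi APIs), and requiring String keys in maps is my own assumption, not something stated in the requirements.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Okapi.
final class ConfigTypes {
    private ConfigTypes() {}

    /**
     * Returns true if the value is built only from String, Boolean, Integer,
     * List and Map (assumed to have String keys), nested to any depth.
     */
    static boolean isSupported(Object value) {
        if (value instanceof String || value instanceof Boolean || value instanceof Integer) {
            return true;
        }
        if (value instanceof List<?>) {
            for (Object item : (List<?>) value) {
                if (!isSupported(item)) {
                    return false;
                }
            }
            return true;
        }
        if (value instanceof Map<?, ?>) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                if (!(e.getKey() instanceof String) || !isSupported(e.getValue())) {
                    return false;
                }
            }
            return true;
        }
        // null, Double, Date and anything else fall outside the minimal set.
        return false;
    }
}
```

Full validation against the schema would still be a separate step; this only enforces the shape of the data.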

YAML is one option, though we would need to limit the syntax to prevent strange variations that might confuse the user. I think using YAML would make migration easier for the filters that already use it. JSON could be used, but in order to preserve comments we would have to write our own parser and writer (not a big ask).

I'm leaning toward YAML as it is cleaner and designed for humans rather than machines. Converting the YAML to JSON, which could then be loaded and validated by PB, would be simple to do. As a programmer, I'm not strongly opposed to using JSON directly, but I think it will be more frustrating for non-programmers, since you have to get every quote and curly brace correct. Visually it is also more cluttered.

Thoughts?

Jim



Mihai Nita

Jan 30, 2024, 7:12:13 PM
to Group: okapi-devel
If we go with protobuffers for schema, that will generate the Java classes for us, to use programmatically.

But that also gives us serialization to/from json and proto text/binary for free.

I have a prototype; I'll post it in the sandbox.

M


Mihai Nita

Jan 30, 2024, 7:24:24 PM
to Group: okapi-devel
Note: I am also somewhat unhappy with the JSON verbosity.

I would like to be able to do:
{
    // line comments; keys without quotes
    foo: "some value",
    /* block comment; trailing comma */
    bar: 42,
}

That would be enough to make me happy :-)
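
A relaxed syntax like the one above could be normalized to strict JSON before handing it to an ordinary JSON parser. The following is a rough sketch, not a real parser: `RelaxedJson` and `toStrictJson` are hypothetical names, and it only handles `//` and `/* */` comments, bare identifier keys, and trailing commas. It assumes no comment sits between a key and its colon.

```java
// Hypothetical pre-processor for illustration only; not an Okapi or library API.
final class RelaxedJson {
    private RelaxedJson() {}

    /** Strips comments and trailing commas, and quotes bare keys. */
    static String toStrictJson(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (c == '"') {
                // Copy a string literal verbatim, honoring backslash escapes.
                int j = i + 1;
                while (j < n && src.charAt(j) != '"') {
                    if (src.charAt(j) == '\\') {
                        j++;
                    }
                    j++;
                }
                j = Math.min(j + 1, n);
                out.append(src, i, j);
                i = j;
            } else if (src.startsWith("//", i)) {
                // Line comment: skip up to (but not including) the newline.
                int j = src.indexOf('\n', i);
                i = (j < 0) ? n : j;
            } else if (src.startsWith("/*", i)) {
                // Block comment: skip through the closing marker.
                int j = src.indexOf("*/", i + 2);
                i = (j < 0) ? n : j + 2;
            } else if (c == '}' || c == ']') {
                // Drop a trailing comma that precedes the closing bracket.
                int k = out.length() - 1;
                while (k >= 0 && Character.isWhitespace(out.charAt(k))) {
                    k--;
                }
                if (k >= 0 && out.charAt(k) == ',') {
                    out.deleteCharAt(k);
                }
                out.append(c);
                i++;
            } else if (Character.isLetter(c) || c == '_') {
                // Bare word: quote it only when it is followed by a colon (a key).
                int j = i;
                while (j < n && (Character.isLetterOrDigit(src.charAt(j)) || src.charAt(j) == '_')) {
                    j++;
                }
                int k = j;
                while (k < n && Character.isWhitespace(src.charAt(k))) {
                    k++;
                }
                boolean isKey = k < n && src.charAt(k) == ':';
                String word = src.substring(i, j);
                out.append(isKey ? "\"" + word + "\"" : word);
                i = j;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Note that this throws the comments away, so it only covers reading; preserving comments on write needs something more.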


M.

Jim Hargrave

Jan 31, 2024, 11:27:42 AM
to okapi...@googlegroups.com, Mihai Nita

Yeah, that gets closer to JSON-like formats such as HOCON. I'm starting to lean toward TOML again. It wouldn't be hard to have the protobuf code serialize/deserialize to TOML.

We produce json (JSON Lines actually) for extraction and I've been having to read that format - it's not easy - even with editor support.

Jim

Mihai Nita

Jan 31, 2024, 7:24:51 PM
to Jim Hargrave, Group: okapi-devel
The thing that I dislike (a lot :-) about YAML is the "let's try to be smart" part,
where it tries to guess a type based on rules that can very easily trip you up.





Mihai




 

Jim Hargrave

Jan 31, 2024, 10:08:15 PM
to Mihai Nita, Group: okapi-devel

How about TOML?

Jim

Jim Hargrave

Feb 2, 2024, 3:44:48 PM
to Mihai Nita, Group: okapi-devel

Mihai, I do agree with you on this. This is an opportunity for us to remove YAML config from Okapi. Unfortunately, we still need to support YamlFilter, which could be a security risk (more on this in the near future).

However, I still believe that using a clean, readable and supported config format would be a big win for the users, with TOML and HOCON being the main candidates. But I'm open to other solutions, even if we have to come up with our own format.

One clarification: I don't see a reason why we can't configure Okapi with direct ProtoBuf output as well as a human-readable config. We get the best of both worlds. It is simply a matter of adding some wrapper code around ProtoBuf to read the human config format.

Hopefully this will resolve most of the concerns.

Jim

Mihai Nita

Feb 8, 2024, 2:26:01 AM
to Jim Hargrave, Group: okapi-devel
Playing with things a bit more I realized that we pretty much must have fields for comments in the Java (and template) layers.

Because no matter what the serialization format is, when we read it we read it into some kind of ParametersNew Java class.
(ParametersNew is a provisional name so that I can talk about it in this email :-)

And from that object we create an SWT dialog.
Or even without UI, we deserialize (from json/yaml/whatever) => Java ParametersNew => prop.setFoo(34) => serialize

So no matter the support for comments in the serializer, the ParametersNew will drop them.

In fact, it would even need some special APIs to deal with comments.
Something like this (just ideas):

setFoo(/*new value*/ 42, /*comment*/ "The answer to everything")
    result  > // The answer to everything
    result  > foo = 42
setFoo(/*new value*/ 13, /*preserve existing comment*/ true)
    result  > // The answer to everything
    result  > foo = 13
setFoo(/*new value*/ 13, /*preserve existing comment*/ false)
    result  > foo = 13
getFoo()
    result  > 13
getFooComment()
    result  > "The answer to everything"

Or maybe have a tuple of value + comment for both setters / getters.

But the point is: we need comments at Java level in the ParametersNew class, or we lose them.
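
To make the idea concrete, here is a minimal sketch of the value + comment tuple variant, using only the JDK. All names (`CommentedParameters`, `Field`, `set`, `getComment`, `serialize`) are hypothetical and the `name = value` output is just a stand-in for whatever serialization format ends up being used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names are Okapi APIs.
final class CommentedParameters {
    /** A value paired with its optional field-level comment. */
    static final class Field {
        final Object value;
        final String comment; // null means "no comment"

        Field(Object value, String comment) {
            this.value = value;
            this.comment = comment;
        }
    }

    private final Map<String, Field> fields = new LinkedHashMap<>();

    /** Sets a value and replaces any existing comment. */
    void set(String name, Object value, String comment) {
        fields.put(name, new Field(value, comment));
    }

    /** Sets a value, keeping or dropping the existing comment. */
    void set(String name, Object value, boolean preserveComment) {
        Field old = fields.get(name);
        String comment = (preserveComment && old != null) ? old.comment : null;
        fields.put(name, new Field(value, comment));
    }

    Object get(String name) {
        Field f = fields.get(name);
        return f == null ? null : f.value;
    }

    String getComment(String name) {
        Field f = fields.get(name);
        return f == null ? null : f.comment;
    }

    /** Writes an optional "// comment" line, then "name = value", in insertion order. */
    String serialize() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Field> e : fields.entrySet()) {
            if (e.getValue().comment != null) {
                sb.append("// ").append(e.getValue().comment).append('\n');
            }
            sb.append(e.getKey()).append(" = ").append(e.getValue().value).append('\n');
        }
        return sb.toString();
    }
}
```

Because the comment lives in the object itself, a deserialize, modify, serialize round trip keeps it regardless of whether the file format has comment syntax.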

And that means (in my mind):
* That they become first class citizens in json and yaml, for example.
   They would be real fields, not // or /* */.
   So we don't depend on the serialization format to support comments.
* That we can only have field-level comments.
   One comment for a full map, or array, not per array elements:
   Works:
       // Comment for foo
       foo : [ 1, 2, 3 ]
       // Comment for bar
       bar : { "one": 1, "two": 2, "three" : 3 }
   Does not work:
       foo : [ 1, /* here */ 2, 3 ]
       bar : {
          "one": 1,
          // comment for some entry
          "two": 2,
          "three" : 3
        }
  

Mihai

Mihai Nita

Feb 8, 2024, 4:50:17 AM
to Jim Hargrave, Group: okapi-devel
I've submitted some proto example with serialization at

and all relevant code is in app/src/main/java/test_proto/App.java

The out folder contains the serialized proto in binary, text, and json format.

===

With these tests I've found a behavior that feels a bit unpleasant: the default values specified in the .proto file (the template) are not saved when serialized.
They are hard-coded in the generated code, so when you call a getter you get the expected default value as specified in the proto.
But it means that someone without the .proto does not know what values to expect.

Imagine you get a .json file like this:
{
  "someString": "New value",
  "someBoolean": true
}

Nothing in the file tells you that loading it and querying someInteger returns 42.
There is also no way to tell (by looking at the json) what all the available options are, so you cannot edit it "by hand" without a .proto for reference.

The JSON serializer has an option to serialize the defaults (JsonFormat.printer().includingDefaultValueFields().print(param))
And you can see the result in default_printer_all.proto.json
But you have to remember to configure the printer to do that.

For binary and text there is no such option.
But it is probably less critical, because those are protobuf-specific formats, so you are pretty much expected to either have the .proto file or use a class generated from it.

Another option is to not specify any defaults in the .proto file, and have each filter / step return a default (maybe a static method).
With the default values specified by the filter, not the .proto file, they will always be saved, in all formats.
Example:
class FooFilter {
    public static Parameters getDefaultParameters() {
        return Parameters.newBuilder()
            .setFoo("foo") // default value for Foo
            .setBar(42) // default value for Bar
            .build();
    }
}
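
The difference between the two approaches can be mimicked in plain Java. This is only an analogy using the JDK, not the protobuf API: `DefaultsDemo`, `SCHEMA_DEFAULTS`, and the `key=value` output are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-JDK analogy for illustration; this is NOT how protobuf is implemented.
final class DefaultsDemo {
    private DefaultsDemo() {}

    // Stand-in for defaults hard-coded in the generated code.
    static final Map<String, Object> SCHEMA_DEFAULTS = Map.of("someInteger", 42);

    /** Getter behavior: an explicitly set value wins, otherwise the schema default. */
    static Object get(Map<String, Object> explicit, String name) {
        return explicit.containsKey(name) ? explicit.get(name) : SCHEMA_DEFAULTS.get(name);
    }

    /** Serializer behavior: only explicitly set fields are written. */
    static String serialize(Map<String, Object> explicit) {
        StringBuilder sb = new StringBuilder();
        explicit.forEach((k, v) -> sb.append(k).append('=').append(v).append('\n'));
        return sb.toString();
    }

    /** Filter-supplied defaults: every field is set explicitly up front. */
    static Map<String, Object> filterDefaults() {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("someString", "foo");
        m.put("someInteger", 42);
        return m;
    }
}
```

With schema defaults, `get` returns 42 for someInteger but the serialized text never mentions it; with filter-supplied defaults, the value is visible in the output.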

In fact, proto2 supports custom default values and proto3 does not.
They said the feature was dropped because it was easy to misuse.
So maybe this is one of the reasons.

Mihai

Jim Hargrave

Feb 8, 2024, 2:10:06 PM
to Mihai Nita, Group: okapi-devel

PB does have the advantage of allowing non-Java code to manipulate Okapi Parameters. That is desirable, but maybe could be done outside of Okapi? Check out these projects:

https://github.com/protostuff/protostuff

https://github.com/protostuff/protostuff-compiler

These projects are basically trying to solve the same problem we are. I'm not suggesting we should use them in Okapi. But if we did have proper, consistent Java Beans, we could use these tools during deployment to generate other formats like PB and even HTML documentation. We could also have an optional sister project that could provide APIs to make all this easier.

I think the way forward is to use pure Java Beans. Basically a refactor of IParameters to be more consistent and support more datatypes and add comment and doc fields (as Mihai explained). UI is something that requires more thought. But once we can convert our Parameters to various formats maybe one format could be JSON Forms?

So the basic strategy would be to keep everything simple (pure JDK only) in Okapi, but use or implement other tools for conversion, doc creation, etc. Possible formats: ProtoBuf, JSON, TOML, etc.
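
As a rough illustration of the pure-JDK idea, consistent beans can be read generically with `java.beans.Introspector`, and the resulting map could then be handed to any external converter. The class names `BeanExport` and `SampleParameters` and the two properties are hypothetical, not existing Okapi code.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not existing Okapi code.
final class BeanExport {
    private BeanExport() {}

    /** Made-up parameters bean with conventional getters and setters. */
    public static class SampleParameters {
        private String encoding = "UTF-8";
        private boolean extractComments = true;

        public String getEncoding() { return encoding; }
        public void setEncoding(String encoding) { this.encoding = encoding; }
        public boolean isExtractComments() { return extractComments; }
        public void setExtractComments(boolean extractComments) { this.extractComments = extractComments; }
    }

    /** Reads every readable bean property into a sorted name -> value map. */
    static Map<String, Object> toMap(Object bean) {
        Map<String, Object> out = new TreeMap<>();
        try {
            // Object.class as stop class keeps the "class" pseudo-property out.
            PropertyDescriptor[] props =
                    Introspector.getBeanInfo(bean.getClass(), Object.class).getPropertyDescriptors();
            for (PropertyDescriptor pd : props) {
                Method getter = pd.getReadMethod();
                if (getter != null) {
                    out.put(pd.getName(), getter.invoke(bean));
                }
            }
        } catch (IntrospectionException | ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read bean properties", e);
        }
        return out;
    }
}
```

Comment and doc fields would need their own convention on top of this (for example annotations or companion properties); that part is left open here.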

Jim

Jim Hargrave

Feb 9, 2024, 2:20:18 PM
to Mihai Nita, Group: okapi-devel

Here is an example of how documentation is generated with these projects:

https://protostuff.github.io/samples/protostuff-compiler/html/#/

Not bad, someone with CSS experience could enhance if needed.

Jim
