SQL support for JavaScript Object Notation (JSON)
Hello. As you might know, Red Soft has implemented SQL JSON functions, and I hope we can start working on pull requests for Firebird 6 soon. To reduce the time spent on review and discussion, the proposal has been split into several parts, and here is the first one. This part provides a text explanation of the implementation details. Once all the most controversial points have been discussed, I will send a PR.
All functionality is implemented according to the SQL standard. In the first part, I cover the simplest function of the standard — JSON_ARRAY, along with all its related features (there are plenty of them).
MAIN POINTS
To make things clear, let's set out the main points of the entire implementation:
This is the implementation of the JSON SQL standard 2017 (ISO/IEC TR 19075-6:2017).
There is no explicit JSON data type (which appears only in SQL 2023) or a binary JSON type – JSON is stored as text via text types.
The implementation adds JSON functions, JSON_TABLE record source, and all its associated components.
All functions are named according to the SQL standard.
Almost all functionality is already implemented and awaiting its turn to be proposed for merging. However, to avoid one huge change of 15k+ new lines of code, the whole proposal is divided into 8 parts.
Parts 1-3 cover the JSON generation functions, parts 4-6 the JSON extract (query) functions, and parts 7-8 JSON_TABLE:
The core description of the IO interface implementation and JSON_ARRAY as an example.
The PR with JSON_ARRAY implementation.
The PR with remaining JSON generation functions' implementation.
The description of JSON Get functions.
The implementation of JSON_QUERY and IS JSON functions (PR).
The implementation of JSON_VALUE and JSON_EXISTS functions (PR).
The description of JSON_TABLE implementation.
The JSON_TABLE PR.
I wanted the code to be covered by unit tests, but that is quite hard because the engine relies heavily on tdbb, which is not defined while testing. It is possible to partially define it, but serious limitations remain. So the JSON code is kept separated from the engine as much as possible, which means using dsc only for input and output values.
All the basic JSON functions (JSON_VALUE, JSON_QUERY, JSON_ARRAY, and JSON_OBJECT) are implemented as ExprNodes; JSON_TABLE is a RecordSourceNode; and JSON_ARRAYAGG and JSON_OBJECTAGG are AggNodes.
All the functions produce compact JSON. Therefore, all excess spaces, tabs, and new lines will be omitted.
Some parts of the standard are not very handy, so there will be a couple of additions that go outside the standard.
The main goals of the implementation are:
Implement most of SQL JSON.
Adhere to the SQL standard as closely as possible.
Keep JSON code as independent from Jrd as possible to enable writing unit tests.
Keep it user-friendly.
The current document is focused on the description of the IO interface, so there will be only one function (with 2 constructors):
JSON_ARRAY
<JSON array constructor> ::=
<JSON array constructor by enumeration>
| <JSON array constructor by query>
<JSON array constructor by enumeration> ::=
JSON_ARRAY <left paren>
[ <JSON value expression> [ { <comma> <JSON value expression> }... ]
[ <JSON constructor null clause> ] ]
[ <JSON output clause> ]
<right paren>
<JSON array constructor by query> ::=
JSON_ARRAY <left paren>
<query expression>
[ <JSON input clause> ]
[ <JSON constructor null clause> ]
[ <JSON output clause> ]
<right paren>
<JSON constructor null clause> ::=
ABSENT ON NULL
| NULL ON NULL
<JSON output clause> ::=
RETURNING <data type>
[ FORMAT <JSON representation> ]
<JSON representation> ::=
JSON
| SQL
| AUTO
<JSON value expression> ::=
<value expression> [ <JSON input clause> ]
<JSON input clause> ::=
FORMAT <JSON representation>
<JSON representation> ::=
JSON
| SQL
The JSON representation is declared in the standard as:
<JSON representation> ::=
JSON [ ENCODING { UTF8 | UTF16 | UTF32 } ]
| <implementation-defined JSON representation option>
ENCODING is what I believe is the CHARSET of our string. And I add representation options: SQL and AUTO. Their purpose will be documented later.
Examples
SQL> select JSON_ARRAY(false, 'false', 'false' format json) from rdb$database;
========
[false,"false",false]
SQL> SELECT JSON_ARRAY(SELECT RDB$RELATION_ID FROM RDB$RELATIONS WHERE RDB$RELATION_ID < 5) FROM RDB$DATABASE;
========
[0,1,2,3,4]
There aren't many 1-byte BLR codes left, and, as of now, the full PR will require 11 verbs. So, I've decided to create a simple factory. Only one BLR code will be used as the primary node. Based on the secondary BLR code, the factory will determine the type of node and create it. Currently, it is implemented as a switch statement. I tried a factory with class registration, but it does not look good for such a small number of nodes. However, if you say no, the factory with registered classes will be used.
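To make the idea concrete, here is a minimal sketch of such a switch-based factory. It is only an illustration: every identifier below (the sub-codes, the node classes, the parse function) is hypothetical and not an actual Firebird BLR code or class name.

// Hypothetical sketch: one primary BLR verb, a secondary sub-code selects the node type.
#include <cstdint>
#include <memory>
#include <stdexcept>

enum JsonBlrSubCode : uint8_t
{
    blr_json_array_sub  = 1,
    blr_json_object_sub = 2,
    blr_json_value_sub  = 3
    // ... the remaining JSON verbs
};

struct JsonNode { virtual ~JsonNode() = default; };
struct JsonArrayNode  : JsonNode {};
struct JsonObjectNode : JsonNode {};
struct JsonValueNode  : JsonNode {};

// The factory reads the secondary code and builds the matching node.
std::unique_ptr<JsonNode> parseJsonNode(uint8_t subCode)
{
    switch (subCode)
    {
    case blr_json_array_sub:  return std::make_unique<JsonArrayNode>();
    case blr_json_object_sub: return std::make_unique<JsonObjectNode>();
    case blr_json_value_sub:  return std::make_unique<JsonValueNode>();
    default:
        throw std::runtime_error("unknown JSON BLR sub-code");
    }
}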
Output
Let's start with the simplest JSON function - JSON_ARRAY with the enumeration constructor. The function takes a set of arguments and combines them into a JSON array. The function has an alternative syntax with a subquery (the query expression must return exactly one column according to the SQL/JSON standard). The first problem is the default output type. The SQL/JSON 2017 standard says that JSON is an "implementation-dependent string of Unicode characters". But there is one problem - it is hard to predict the output length. Another issue is that JSON can be longer than the VARCHAR limit (which, in characters, is reduced by a factor of 4 if you use utf8 encoding). So the safest default type is a text blob. Yes, blobs are slow and not very convenient, but at least it is clear to the user that the function works, instead of failing with an obscure "string right truncation" error. The good news is that the standard provides a RETURNING clause. It can be used to specify the output type.
SELECT JSON_ARRAY(1, 'hello', false RETURNING VARCHAR(30)) FROM RDB$DATABASE;
So if a user cares about optimization, there will not be any problems.
But there is one more issue - the FORMAT clause. This is basically a workaround for the missing JSON type. The format can be specified for output and for input. Its meaning is simple: it tells the JSON function how to interpret a value: as a string or as JSON. For example:
SQL> select JSON_ARRAY('{"a":42}', '{"b":16}' format json) from rdb$database;
["{\"a\":42}",{"b":16}]
The first object is interpreted as a string, so all special symbols are escaped. An input value can also be the result of another JSON function. By default, the format of the RETURNING clause is set to JSON, even when the RETURNING clause itself is omitted.
Here are some examples:
SQL>SELECT JSON_ARRAY(JSON_ARRAY(1), JSON_ARRAY(1) FORMAT JSON, JSON_ARRAY(3 RETURNING VARCHAR (30) FORMAT JSON)) FROM RDB$DATABASE;
[[1],[1],[3]]
SQL>SELECT JSON_ARRAY(JSON_ARRAY(1), JSON_ARRAY(1) FORMAT SQL, JSON_ARRAY(3 RETURNING VARCHAR(30) FORMAT SQL)) FROM RDB$DATABASE;
[[1],"[1]","[3]"]
To specify the format, a descriptor flag named DSC_json is used. The problem is that flags are really easy to lose; this can happen in a cast, for example. So how is the RETURNING clause implemented? First, the JSON_ARRAY node sets the flag. If there is a RETURNING clause, the JSON node is wrapped in an artificial CastNode. Then the cast node is put inside a JsonFormatNode. The JsonFormatNode sets or clears the flag after the cast (a small sketch follows the mapping list below).
JSON_ARRAY -> JsonArrayNode
JSON_ARRAY Returning -> JsonFormatNode(CastNode(JsonArrayNode))
JSON_ARRAY Returning format -> JsonFormatNode(CastNode(JsonArrayNode))
For JSON format - add the DSC_json flag. For SQL format - remove the DSC_json flag.
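To illustrate the resulting tree shape, here is a minimal sketch of the wrapping. The classes and members below are stand-ins invented for the example, not the real Firebird node types.

// Hypothetical sketch of the node wrapping used for the RETURNING clause.
#include <memory>

struct ExprNode { virtual ~ExprNode() = default; };
struct JsonArrayNode : ExprNode { /* arguments would live here */ };

struct CastNode : ExprNode
{
    std::unique_ptr<ExprNode> source;   // the target type from RETURNING <data type> would also be stored here
};

struct JsonFormatNode : ExprNode
{
    std::unique_ptr<ExprNode> source;
    bool isJson = true;                 // true: set DSC_json after the cast, false: clear it
};

// Builds the tree for: JSON_ARRAY(...) [RETURNING <type> [FORMAT JSON|SQL]]
std::unique_ptr<ExprNode> wrapReturning(std::unique_ptr<ExprNode> jsonNode,
                                        bool hasReturning, bool formatIsJson)
{
    if (!hasReturning)
        return jsonNode;                        // JSON_ARRAY -> JsonArrayNode

    auto cast = std::make_unique<CastNode>();   // artificial cast to the requested type
    cast->source = std::move(jsonNode);

    auto fmt = std::make_unique<JsonFormatNode>();
    fmt->source = std::move(cast);
    fmt->isJson = formatIsJson;                 // restores (or clears) the flag the cast would lose
    return fmt;
}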
Input
As mentioned earlier, input arguments can also be specified with the FORMAT clause. So let's use our JsonFormatNode once again:
VALUE FORMAT JSON -> JsonFormatNode(ValueExprNode)
VALUE FORMAT SQL -> JsonFormatNode(ValueExprNode)
But what format should we use by default if we are not sure about the source of the value? This is where the AUTO format comes in. It simply keeps the value as is.
VALUE FORMAT AUTO -> ValueExprNode
VALUE -> ValueExprNode
FUNCTION IMPLEMENTATION
Input
So, we get an input descriptor. Now we need to read it. For the SQL format, we also need to escape all the special symbols.
The main problem is a blob (again). We (usually) can't just read it as a whole, so we need some kind of chunk reader. But we also need to deal with simple strings. Therefore, we need a storage class. I named it InputJsonText.
InputJsonText
How it works: we get a descriptor. If it has a simple type, just convert it to a string and add it to the Read Buffer. If it is a blob, read it chunk by chunk into the Read Buffer. For a VARCHAR, put it into the Read Buffer but also process it chunk by chunk. It may be overcomplicated, as keeping 64KB in memory is not so scary, but what if the VARCHAR limit is extended in the future? Therefore, I decided to handle VARCHAR chunk by chunk. But I might be wrong. All the stuff about limits, buffer sizes, etc., is really controversial, so please correct me. The Read Buffer size is 16KB. OK, we read the chunk. What is next? Next, we need a class to store the output JSON.
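Below is a rough, self-contained sketch of the chunked-reading idea behind InputJsonText. The class name, the 16KB buffer constant and the plain string source are stand-ins for the real descriptor, blob and VARCHAR handling, so treat it only as an illustration of the flow.

// Hypothetical sketch of a chunked input reader in the spirit of InputJsonText.
#include <algorithm>
#include <cstddef>
#include <string_view>

class InputJsonTextSketch
{
public:
    static constexpr std::size_t READ_BUFFER_SIZE = 16 * 1024;   // 16KB read buffer

    explicit InputJsonTextSketch(std::string_view source) : src(source) {}

    // Returns the next chunk (at most READ_BUFFER_SIZE bytes); empty when the input is exhausted.
    std::string_view nextChunk()
    {
        if (pos >= src.size())
            return {};
        const std::size_t len = std::min(READ_BUFFER_SIZE, src.size() - pos);
        std::string_view chunk = src.substr(pos, len);
        pos += len;
        return chunk;
    }

private:
    std::string_view src;    // stands in for a blob, a big VARCHAR or a converted simple value
    std::size_t pos = 0;
};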
OutputJsonText
This class may seem overcomplicated, but all this stuff was initially created for the JSON extraction functions like JSON_QUERY. So please do not ask to burn it with fire right now, and wait for proposals #4 and #5. Back to output JSON. We have some input, which is then added to the output. The OutputJsonText produces a descriptor (dsc). But what should we put into the dsc? So we have an append method. Get the data via InputJsonText and put it into the main buffer. If it is marked as SQL format, escape it. If the buffer is larger than the limit (also 16KB), add the buffer to storage. Basically, there are 2 types of storage: blob and tempspace. For the Generate functions, we only need the first one. So, add the input to the blob.
When the output descriptor is requested, if there is no data in storage and the buffer fits into the VARCHAR limit, produce a text descriptor. If there is any data in storage, produce a blob.
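Here is a hypothetical sketch of the behaviour described above: append data, escape values marked as SQL format, spill to storage once the 16KB buffer limit is exceeded, and later decide between a VARCHAR-sized result and a blob. The class name and the VARCHAR limit constant are assumptions made for the illustration, not taken from the actual code.

// Hypothetical sketch of the OutputJsonText idea: small results stay in an
// in-memory buffer, anything above the limit spills into storage (a blob in the real engine).
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class OutputJsonTextSketch
{
public:
    // Appends a piece of input; escapes and quotes it if it is in SQL (plain string) format.
    void append(const std::string& chunk, bool sqlFormat)
    {
        if (sqlFormat)
            appendEscaped(chunk);
        else
            buffer += chunk;

        if (buffer.size() > SPILL_LIMIT)
            spillToStorage();
    }

    // True if the result can be returned as a plain text descriptor instead of a blob.
    bool fitsInVarchar() const
    {
        return storage.empty() && buffer.size() <= VARCHAR_LIMIT;
    }

private:
    static constexpr std::size_t SPILL_LIMIT = 16 * 1024;   // 16KB buffer limit
    static constexpr std::size_t VARCHAR_LIMIT = 32765;     // assumed VARCHAR byte limit

    void appendEscaped(const std::string& s)
    {
        buffer += '"';                   // an SQL-format value becomes a JSON string literal
        for (char c : s)
        {
            switch (c)
            {
            case '"':  buffer += "\\\""; break;
            case '\\': buffer += "\\\\"; break;
            case '\n': buffer += "\\n";  break;
            case '\t': buffer += "\\t";  break;
            default:   buffer += c;
            }
        }
        buffer += '"';
    }

    void spillToStorage()
    {
        // The real class would write to a blob (or temp space for the extraction functions).
        storage.push_back(std::move(buffer));
        buffer.clear();
    }

    std::string buffer;
    std::vector<std::string> storage;
};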
JSON_ARRAY
The resulting pipeline is as follows (a small sketch is shown after the list):
Create an OutputJsonText field;
Add each argument and a comma;
Then finish the generation and request a DSC.
Mark the DSC as JSON.
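As a toy illustration of that pipeline, here is a sketch where a plain string stands in for OutputJsonText and the resulting dsc; escaping is omitted and all names are hypothetical.

// Hypothetical sketch of the JSON_ARRAY generation loop.
#include <string>
#include <vector>

struct JsonArgSketch
{
    std::string text;    // the argument already converted to text
    bool sqlFormat;      // true = plain SQL value: would be quoted/escaped as a JSON string
};

std::string buildJsonArray(const std::vector<JsonArgSketch>& args)
{
    std::string out = "[";                  // 1. create the output
    bool first = true;
    for (const auto& arg : args)            // 2. add each argument and a comma
    {
        if (!first)
            out += ",";
        first = false;
        out += arg.sqlFormat ? "\"" + arg.text + "\"" : arg.text;
    }
    out += "]";                             // 3. finish the generation
    return out;                             // 4. the real code requests a dsc and marks it as JSON
}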
The function produces JSON without any spaces or new lines. First, it saves space for VARCHAR output. Second, I believe JSON is used mostly for machine processing, so few people would care about pretty output. If it is really necessary, a formatter clause can be implemented.
Validation
Back to implementation. There is a problem - the output may be invalid. For example, the user passed something like this: '9&*H[3' FORMAT JSON. So it would be nice to validate the output JSON array or each argument. But it slows the function down, and I am not sure if it is really necessary.
Summary (A.K.A. TLDR)
JSON nodes are implemented as factory products;
The default return type is text BLOB for Generate (JSON_ARRAY) functions;
Use a special class as an input reader;
Keep small outputs in the buffer, and large ones in BLOB;
Do not include spaces/newlines/tabs in the output;
Mark the output descriptor as JSON via the DSC_json flag;
Validate the output JSON (try parsing).
DISCUSSION POINTS
Naming is my weak point, so I expect many corrections in this area.
Read big varchars by chunks.
JSON Factory and its implementation. Switch VS. Register class.
Validation of JSON_ARRAY output.
Where should we store new Nodes? ExprNodes.h/cpp?
Where should we store error messages? impl/msg/jrd.h?
Limits
Use of a flag for setting the format seems strange to me. The current set of descriptor flags may be applied to any datatype, but what does DSC_json mean when set, for example, on a float value? Next, what if in addition to the json/sql formats we need one more (csv or xml)?
I suggest adding a separate datatype dtype_json instead and using the subtype to encode formats. This would be universal and consistent with the rest of the engine.
> ENCODING is what I believe is the CHARSET of our string.
RFC 8259 is clear: JSON must be in UTF-8 only. If ANSI has a different opinion, they are wrong.
> And I add representation options: SQL and AUTO. Their purpose will be documented later.
I don't see where their purpose is documented. If JSON is fixed to BLOB, they are not needed IMHO.
> There aren't many 1-byte BLR codes left, and, as of now, the full PR will require 11 verbs.
Why is any verb required at all? AFAICS JSON is defined as a set of functions, and functions don't need their own BLR verbs.
DSC flags are not carried to the client API, so it doesn't solve any problem for client-side representation.
Adding a BLOB subtype is much easier, so I would go for it.
Also, Google tells me that there are different kinds of binary JSON, so it is not clear which one Firebird should support.
The current version of ISO/IEC 19075-6 is ISO/IEC 19075-6:2021, so why
base it on an outdated version (ISO/IEC TR 19075-6:2017)?
Also, as far as I'm aware part of the specification for JSON support is
in ISO 9075-2(:2023), and not in 19075-6, so make sure it's covered.
All,
Let me try to summarize the feedback. Feel free to correct me if I got
something wrong.
Internally (inside the engine) it makes sense to add a dedicated dtype_json type and make all JSON functions receive/return it instead of using DSC_json with blobs. dtype_json would still be backed by the blob mechanics, as it is now.