Hello. While discussing the JSON functions proposal, a native JSON data type was also suggested for implementation. Here is the proposal for it.
========== STORAGE ==========
The main problem is storage. Currently, users may store JSON either as a VARCHAR for fast and simple I/O, or as a BLOB if the JSON is too big to fit into the string data type. Each type is good for some scenarios, and it would be a pity to lose half of the advantages. So I decided to combine both approaches and create a new BLR type: the inline blob. An inline blob is a blob whose level-0 data is stored inline, like a string. It looks as follows:
struct FB_INLINE_BLOB_t
{
    ISC_USHORT inlineLength;
    ISC_USHORT flags;
    union
    {
        ISC_QUAD id;
        unsigned char inlineContent[1];
    };
};
typedef struct FB_INLINE_BLOB_t FB_JSON_t;
inlineLength: This field indicates the size of the inline data. A value of 0 means that a blob ID is used instead of inline data. The actual data follows the structure, similar to the page structures. Based on my tests, performance is optimal when the inline size is close to the page size, so ISC_USHORT is sufficient and ULONG is unnecessary here.
flags: This field is not used for the JSON data itself; it is included to make the padding explicit and to prevent potential issues when binding the structure in other languages.
union: The union is not strictly necessary, but it is used here to avoid wasting memory when the inline data is used. However, it might be more efficient to place the ISC_QUAD id at the beginning of the structure to maintain binary compatibility with the blob type and simplify the code.
I have tested the inline blob in both insert and select operations, and it performs similarly to VARCHAR when the inline part is used (as shown in Table 1). This ensures that the approach is efficient and has no size limit.
Table 1. Insert and select 50,000 strings, each 12,000 characters long.
TEST                      Insert (seconds)  Select (seconds)
========================  ================  ================
inline blob (16184)       9.509             7.016
varchar(16184)            9.944             7.251
blob                      16.053            8.436
This FB_INLINE_BLOB_t type and the corresponding BLR code represent a logical type and are not tied directly to JSON. Instead, this storage type is meant to be reused by different data types — for example, Spatial Data Type.
For JSON, the inline blob will be extended as follows:
struct FB_JSON : FB_JSON_t
{
    template<typename T>
    void setInlineLength(const T length) = delete;

    void setInlineLength(const USHORT length)
    {
        inlineLength = length;
        if (length == 0)
            id = {};
    }

    void setBlob(const ISC_QUAD blobId)
    {
        inlineLength = 0;
        this->id = blobId;
    }

    template<typename T>
    void setInlineContent(const UCHAR* data, const T length) = delete;

    void setInlineContent(const UCHAR* data, const USHORT length)
    {
        memcpy(inlineContent, data, length);
        inlineLength = length;
    }

    inline bool isInline() const noexcept
    {
        // Inline if there is inline data, or if the blob ID is empty (empty content)
        return inlineLength != 0 || (id.gds_quad_high == 0 && id.gds_quad_low == 0);
    }

    inline bool isBlobStorage() const noexcept
    {
        return !isInline();
    }

    unsigned char* getInlineContent() const noexcept
    {
        if (!isInline())
            return nullptr;

        return (unsigned char*) inlineContent;
    }

    std::optional<ISC_QUAD> getBlobId() const noexcept
    {
        if (isInline())
            return std::nullopt;

        return this->id;
    }
};
constexpr ISC_ULONG FB_JSON_INLINE_OFFSET = offsetof(struct FB_JSON, inlineContent);
constexpr ISC_ULONG JSON_DEFAULT_TYPE_SIZE = 8092;
constexpr ISC_ULONG JSON_DEFAULT_INLINE_SIZE = JSON_DEFAULT_TYPE_SIZE - FB_JSON_INLINE_OFFSET;

static_assert(offsetof(struct FB_JSON, inlineLength) < FB_JSON_INLINE_OFFSET, "inlineContent must be the last field because the data follows it");
static_assert(FB_JSON_INLINE_OFFSET == 4, "Keep the alignment");
static_assert(sizeof(FB_JSON) <= 16, "The type in the dsc union must fit into 16 bytes");
If the content is small enough, it will be stored as inline data. Otherwise, a BLOB will be created.
By default, the engine will use an 8 KiB inline length allocated in the impure area (the same as for a regular VARCHAR).
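To make the write path explicit, here is a minimal sketch of the intended behaviour, assuming a hypothetical storeJsonValue() helper; createBlobAndGetId() is only a stand-in for the actual blob-creation code:

// Sketch only: routes a value either into the inline part or into a regular blob.
void storeJsonValue(FB_JSON* json, const UCHAR* data, ULONG length)
{
    if (length <= JSON_DEFAULT_INLINE_SIZE)
    {
        // Small enough: the content is copied right after the header
        json->setInlineContent(data, static_cast<USHORT>(length));
    }
    else
    {
        // Too big for the inline part: create a regular blob and keep only its ID
        const ISC_QUAD blobId = createBlobAndGetId(data, length); // hypothetical stand-in
        json->setBlob(blobId);
    }
}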
The flags field is not used by JSON itself; it is added mainly to make padding explicit when redefining this structure in different programming languages (for drivers).
Simple read and write operations are implemented in the client OO API.
Regarding the JSON type, there will be two subtypes: text and binary.
The binary subtype is supported by popular RDBMSs (PostgreSQL, Oracle, SQL Server), so it would be nice to have it implemented in Firebird as well, since it significantly improves performance.
To summarize:
ISO SQL JSON Type -> Firebird JSON [SUB_TYPE] TEXT
Binary JSON -> Firebird type for storing key–value pairs and arrays in binary format.
It also allows storing duplicate keys, just like the TEXT JSON type.
========== DDL ==========
The declaration syntax is as follows:
JSON [[SUB_TYPE] <type>] [<optimize_size>]
Where:
<type> := TEXT | BINARY
<optimize_size> := [OPTIMIZE] SIZE <nonneg_short_integer>
The TEXT format is straightforward.
During casting or insertion, the engine will parse the JSON text to validate its syntax.
For the BINARY format, no client-side mechanisms will be provided to insert or read the content directly. It may be extended in the future, if requested.
The ALTER operation for relation field types is implemented as follows:
From NUMBER, BOOL, NULL, DATETIME, TEXT, or BLOB TEXT → to JSON BINARY or JSON TEXT
From JSON TEXT → to BLOB TEXT
From JSON BINARY → to BLOB TEXT
As for the SIZE alteration, I am not sure whether it should be allowed. If the answer is yes, the rules should be the same as for changing the size of a VARCHAR: only extension is permitted.
========== DML ==========
In the SQL syntax, there is a way to specify that an input value is JSON:
<JSON parse> ::=
    JSON <left paren> <string value expression>
        [ <JSON input clause> ]
        [ <JSON key uniqueness constraint> ]
        <right paren>
I propose to allow an implicit TEXT parsing when an input function — such as JSON_TABLE, JSON_QUERY, or JSON_VALUE — expects a JSON input type.
The same behavior should apply to columns defined with the JSON type:
when inserting or updating values, textual JSON input should be parsed automatically by the engine, without requiring explicit casting.
INSERT INTO JSON_REL (JSON_TEXT) VALUES ('[1,2,3]');
The content will be cast to JSON on the engine side.
Regarding the JSON functions (JSON_QUERY with string quotes kept, JSON_ARRAY(AGG), JSON_OBJECT(AGG), JSON_TABLE FORMATTED COLUMN): if at least one of the input arguments is JSON BINARY, the output type will also be JSON BINARY.
If all input arguments are JSON TEXT, the output will be JSON TEXT.
In all other cases, the result will be JSON, but I am not sure which subtype should be preferred: TEXT is more standard-friendly, but BINARY is more efficient.
========== CASTS ==========
The SQL standard explicitly states that JSON values can only be cast to JSON.
To create a JSON value, the <JSON parse> syntax must be used.
However, I propose a more flexible approach: allowing casts to and from most common SQL types: NUMERIC, CHAR, VARCHAR, BLOB, BOOLEAN, and DATETIME.
========== COMPARISON ==========
When a scalar value is compared with a JSON value that contains only a scalar, a simple descriptor-versus-descriptor comparison is performed, just as for SQL scalars.
For equality evaluation, both JSON items must be of the same type and contain identical nested values.
The greater/less comparison is more complex.
The following rules apply:
* NULL < SQL/JSON scalar < array < object (according to the standard).
* Between two arrays, the one with fewer elements is considered smaller (according to the standard).
* JSON scalars are compared like SQL scalars (according to the standard).
* JSON objects are internally unwrapped into stored arrays of [key, value] pairs for comparison (proposed implementation).
Both JSON TEXT and JSON BINARY values are comparable.
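To illustrate the proposed ordering, here is a small, purely illustrative sketch; JsonKind and JsonItem are hypothetical simplified types, not the engine's actual descriptors:

#include <string>
#include <vector>

// Hypothetical, simplified model of a parsed JSON item, used only to illustrate
// the comparison rules above; the engine would work on its own structures instead.
enum class JsonKind { Null = 0, Scalar = 1, Array = 2, Object = 3 };

struct JsonItem
{
    JsonKind kind = JsonKind::Null;
    std::string scalar;              // scalar value (already normalized to a string here)
    std::vector<JsonItem> elements;  // array elements
    std::vector<std::string> keys;   // object keys...
    std::vector<JsonItem> values;    // ...and the corresponding object values
};

int compareJson(const JsonItem& a, const JsonItem& b)
{
    // Rule 1: NULL < SQL/JSON scalar < array < object
    if (a.kind != b.kind)
        return a.kind < b.kind ? -1 : 1;

    switch (a.kind)
    {
    case JsonKind::Null:
        return 0;

    case JsonKind::Scalar:
        // Rule 3: scalars are compared like SQL scalars (plain string comparison here for brevity)
        return a.scalar.compare(b.scalar);

    case JsonKind::Array:
        // Rule 2: the array with fewer elements is considered smaller
        if (a.elements.size() != b.elements.size())
            return a.elements.size() < b.elements.size() ? -1 : 1;
        for (std::size_t i = 0; i < a.elements.size(); ++i)
            if (const int c = compareJson(a.elements[i], b.elements[i]))
                return c;
        return 0;

    case JsonKind::Object:
        // Rule 4: objects are unwrapped into arrays of [key, value] pairs
        if (a.keys.size() != b.keys.size())
            return a.keys.size() < b.keys.size() ? -1 : 1;
        for (std::size_t i = 0; i < a.keys.size(); ++i)
        {
            if (const int c = a.keys[i].compare(b.keys[i]))
                return c;
            if (const int c = compareJson(a.values[i], b.values[i]))
                return c;
        }
        return 0;
    }

    return 0;
}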
========== CONCATENATION ==========
I cannot find any rules for JSON concatenation and applying string-style concatenation would not be meaningful.
I see two possible approaches:
* Raise an error.
* Define a logical concatenation rule, for example: {"A": 1} || {"B": 2} -> {"A": 1, "B": 2}, or: {"A": 1} || {"B": 2} -> [{"A": 1}, {"B": 2}]. The second form can already be achieved using JSON_ARRAY('{"A":1}', '{"B":2}'), so the first form (merging) is the one that may provide practical value for users.
======== SORT / GROUP BY / DISTINCT ========
Sorting needs binary-comparable keys, and it is not clear how to build such a key from a composite type. So, for now, SORT, GROUP BY and DISTINCT are not supported.
========== BINARY JSON ==========
Binary JSON is designed to minimize memory usage by performing serialization on-the-fly. Its content is stored both in the cache and on disk (in a BLOB).
The cache structure is organized as follows:
struct CachePointer
{
    JsonbType type = JsonbType::Empty;
    CacheId id = NOT_IN_CACHE; // m_objects or m_arrays or m_scalars id
    JsonbPosition positionInFile = INVALID_POSITION;
};

using CachedObject = Firebird::LeftPooledMap<FieldId, CachePointer>;
using CachedArray = Firebird::Array<CachePointer>;
using CachedDuplicateKeys = Firebird::Array<CachePointer>;

// Data-oriented approach
Firebird::ObjectsArray<CachedObject> m_objects;
Firebird::ObjectsArray<CachedArray> m_arrays;
Firebird::ObjectsArray<JsonScalar> m_scalars;
Firebird::ObjectsArray<CachedDuplicateKeys> m_duplicateKeys;
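To show how these pieces fit together, here is a rough sketch of how a CachePointer would be resolved; getObject() and loadObjectFromFile() are hypothetical stand-ins, not the actual implementation:

// Hypothetical lookup inside the binary-JSON cache: the pointer's type selects the
// container, the id indexes into it, and an unloaded bucket is parsed on demand.
const CachedObject& getObject(const CachePointer& ptr)
{
    fb_assert(ptr.type == JsonbType::Object);

    if (ptr.id == NOT_IN_CACHE)
    {
        // Not loaded yet: parse the object at ptr.positionInFile from the underlying
        // storage and append it to m_objects; loadObjectFromFile() stands in for that.
        return m_objects[loadObjectFromFile(ptr.positionInFile)];
    }

    return m_objects[ptr.id];
}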
On disk, the data will be stored using the following structure:
<JSON BINARY> := <HEADER> <ITEM> <FIELDS>
<ITEM> := <SCALAR> | <BUCKET>
<BUCKET> := <OBJECT> | <ARRAY>
<HEADER, 4 bytes> := <version, uint8> <flags, uint8> <reserved, 2 bytes>
<SCALAR> := <item id, uint8> <SCALAR SIZE, uint16> <scalar type, 1 byte> <SCALAR DATA>
<OBJECT> := <BUCKET HEADER> <DATA SIZE> [<NESTED OBJECT ITEMS> <KEYS POSITION>]
<NESTED OBJECT ITEMS> := {<FIELD ID> <ITEM>, ...}
<KEYS POSITION> := {<NESTED ITEM POSITION>, ...}
<ARRAY> := <BUCKET HEADER> [<NESTED ARRAY ITEMS> <KEYS POSITION>]
<NESTED ARRAY ITEMS> := {<ITEM>, ...}
<BUCKET HEADER> := <item id, uint8> <TOTAL SIZE> <DATA SIZE>
<FIELDS> := {<FIELD ID> <FIELD NAME>, ...}
<FIELD ID> := <uint16>
<FIELD NAME> := <size, uint16> <string>
<TOTAL SIZE> := <uint64>
<DATA SIZE> := <uint64>
When reading, objects and arrays are loaded into the cache.
For scalars, there can be either the same preload mechanism or a runtime-only cache implemented as follows:
Firebird::NonPooledMap<JsonbPosition, CacheId> m_fetchedScalars;
The <BUCKET HEADER> contains a size field, which allows fast skipping of items during parsing.
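For illustration, a minimal sketch of that skipping logic, assuming <TOTAL SIZE> covers the whole bucket including its header (the struct and function below are hypothetical, and the real parser would read from the BLOB/TempSpace stream rather than a flat buffer):

#include <cstdint>
#include <cstring>

// Hypothetical in-memory mirror of <BUCKET HEADER> := <item id, uint8> <TOTAL SIZE> <DATA SIZE>.
struct BucketHeader
{
    uint8_t  itemId;
    uint64_t totalSize;  // assumed to be the full size of the bucket, nested items included
    uint64_t dataSize;
};

// Reads the bucket header at `pos` and returns the position right after the bucket,
// so the parser can jump over items it is not interested in without parsing them.
uint64_t skipBucket(const unsigned char* buffer, uint64_t pos)
{
    BucketHeader header;
    header.itemId = buffer[pos];
    std::memcpy(&header.totalSize, buffer + pos + 1, sizeof(header.totalSize));
    std::memcpy(&header.dataSize, buffer + pos + 1 + sizeof(header.totalSize), sizeof(header.dataSize));

    return pos + header.totalSize;
}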
However, there is a challenge with delayed writing. Ideally, the BLOB should be used as the disk storage, but the current implementation does not support random writes. I implemented a workaround for this limitation, but it involves a significant amount of copy-pasted code, and I am not entirely confident in its stability.
Using TempSpace appears to be a better solution in this case; however, it still requires transferring all the data into a BLOB at the end of the process.
========== UTILITIES ==========
gbak, replication, extract, and show utilities have been adapted to support JSON (inline BLOB) without requiring any special handling.
isql prints JSON values as strings, and the engine automatically converts them to JSON TEXT during execution.
When printing, the inline length is used as the column width.
As for the API, there will be a simple Inline Blob reader.
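To give an idea of what this could look like for a driver or API user, here is a rough sketch of reading a JSON column from a fetched message buffer, assuming the field carries the FB_INLINE_BLOB_t layout shown above (readInlineBlob() is hypothetical, not the final API):

#include <string>
#include "firebird/Interface.h"

// Sketch only: return the textual content of a JSON/inline-blob field.
// `fieldPtr` points at the FB_INLINE_BLOB_t data inside the message buffer.
std::string readInlineBlob(Firebird::IAttachment* att, Firebird::ITransaction* tra,
    Firebird::ThrowStatusWrapper* status, const unsigned char* fieldPtr)
{
    const FB_INLINE_BLOB_t* value = reinterpret_cast<const FB_INLINE_BLOB_t*>(fieldPtr);

    if (value->inlineLength != 0)
    {
        // Inline case: the content follows the header directly
        return std::string(reinterpret_cast<const char*>(fieldPtr) + FB_JSON_INLINE_OFFSET,
            value->inlineLength);
    }

    if (value->id.gds_quad_high == 0 && value->id.gds_quad_low == 0)
        return std::string();  // empty content, no blob was created

    // Blob case: open the blob by its ID and read it segment by segment, as usual
    ISC_QUAD blobId = value->id;
    Firebird::IBlob* blob = att->openBlob(status, tra, &blobId, 0, nullptr);

    std::string result;
    char segment[16384];

    for (;;)
    {
        unsigned length = 0;
        const int rc = blob->getSegment(status, sizeof(segment), segment, &length);
        if (rc != Firebird::IStatus::RESULT_OK && rc != Firebird::IStatus::RESULT_SEGMENT)
            break;
        result.append(segment, length);
    }

    blob->close(status);
    return result;
}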
========== IMPLEMENTATION ==========
I plan to split the pull request into several smaller parts to avoid review hell with 100+ modified files.
All DSC/BLR-related changes are ready and can be submitted in a separate PR, independent of the JSON-related functions or parser.
========== JSON INDEX ==========
Indexing is a completely separate subject and is definitely not part of the Firebird 6 design at this stage. For now, I have an idea for full JSON indexing, but I am not sure it is possible to implement.
Each element within a JSON document can be represented by a path.
For example:
[{"reads": 42, "writes": 50}]
The corresponding paths and values would be:
* Path $[0].reads -> value 42
* Path $[0].writes -> value 50
The idea is to use the path itself as the key in an index.
However, this implies that a single JSON record will generate multiple keys.
For example, consider the following table:
| id | stats |
| -- | ------------------------------- |
| 1 | [{"reads": 42, "writes": 50}] |
| 2 | [{"reads": 0, "writes": 100}] |
The resulting index entries for the stats field would be:
| key | value | row_id |
| ------------- | ----- | ------ |
| $[0].reads | 42 | 1 |
| $[0].writes | 50 | 1 |
| $[0].reads | 0 | 2 |
| $[0].writes | 100 | 2 |
The question is whether such an approach is feasible. A single JSON record may contain 10 or even 100 paths, depending on the number of elements.
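To make the key generation concrete, here is a small, purely illustrative sketch that enumerates (path, value) pairs from a parsed JSON value; it reuses the hypothetical JsonItem model from the comparison sketch above and is not engine code:

#include <functional>
#include <string>

// Walks a parsed JSON item recursively and emits one (path, value) pair per scalar;
// these pairs are exactly what the index would store as keys.
void enumeratePaths(const JsonItem& item, const std::string& path,
    const std::function<void(const std::string&, const std::string&)>& emit)
{
    switch (item.kind)
    {
    case JsonKind::Null:
    case JsonKind::Scalar:
        emit(path, item.scalar);
        break;

    case JsonKind::Array:
        for (std::size_t i = 0; i < item.elements.size(); ++i)
            enumeratePaths(item.elements[i], path + "[" + std::to_string(i) + "]", emit);
        break;

    case JsonKind::Object:
        for (std::size_t i = 0; i < item.keys.size(); ++i)
            enumeratePaths(item.values[i], path + "." + item.keys[i], emit);
        break;
    }
}

// For [{"reads": 42, "writes": 50}] and a starting path of "$", this emits
// ("$[0].reads", "42") and ("$[0].writes", "50").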
========== OPEN QUESTIONS ==========
How should concatenation behave for JSON values?
What should the default subtype be for the output of JSON functions?
Is everyone okay with the proposed JSON object comparison rules?
Is everyone okay with relaxed casting rules?
When a user explicitly queries JSON binary via API, should it be sent as binary?
Should a direct binary JSON insertion be allowed from the client side? I personally see JSON Binary as an engine-only format.
Should Inline size alteration be allowed in DDL?
JSON Indexes (discussion for the future)