I made a nasty PR for a nasty problem; it can be found at
https://github.com/greenplum-db/gpdb/pull/10746 . It tries to
handle illegal escape sequences in JSON string fields in a way
that is acceptable to users.
The "illegal" escape sequences handled here include:
- '\u0000', which is perfectly legal in the JSON standard but cannot be handled
by PG.
- Ill-formed escape sequences, such as missing hex digits, invalid surrogate
pairs, and so on.
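The two classes of problem can be reproduced outside the database. Python's json module is more permissive than PG's parser, which makes it convenient for illustration (a sketch, not PG's actual behavior -- note that '\u0000' is UTF-8-encodable but cannot live in a PG text value):

```python
import json

# Case 1: \u0000 is legal JSON, and Python decodes it happily...
s = json.loads('"\\u0000"')
assert s == "\x00"            # a real NUL character
# ...but PG text values are NUL-terminated C strings, so this
# character cannot be stored in a text column.

# Case 2: Python's json parser also accepts a lone surrogate,
# yet the result cannot be encoded as valid UTF-8:
lone = json.loads('"\\ud800"')
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate is not encodable as UTF-8")

# Case 3: a truncated escape (missing hex digits) is rejected outright:
try:
    json.loads('"\\ud8"')
except json.JSONDecodeError as e:
    print("ill-formed escape:", e.msg)
```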
Similar topics have been discussed on the PG mailing list many times, e.g.:
https://www.postgresql.org/message-id/flat/51B56F2C.3020305%40dunslane.net#df085185143856c449734f08a91ee490
https://www.postgresql.org/message-id/flat/CA%2BTgmoapNgKpPiwVyR%3DwxCj%3D1m9RqL3311gA6fibbXijMv%3Drtg%40mail.gmail.com#d221e89ba047beb3429880aabd01a24f
Basically, PG/GP reports an error in these cases because there is no
way to maintain the correctness and consistency of such JSON data.
Put simply, PG/GP wants to ensure that json/jsonb read back from the db is
exactly the same as the original JSON representation. That is not possible
here: '\u0000' cannot be stored in PG's NUL-terminated text strings, and the
other ill-formed sequences cannot be encoded into a legal UTF-8 string.
The complaints came from GP streaming-data users. They want to save all
streaming data into GP first before processing it. But the streaming
source is not always reliable -- the data could contain a '\u0000', an
invalid escape, or other strange sequences. This could be caused by a non-UTF-8
encoding on the source server, a software bug in the streaming source, or an
expected source-side behavior (like PEP 383). But users assume GP can store
the faulty data first, before they start investigating/cleaning it. For
jsonb, the situation is not too bad, since the data is verified before it is
stored. For json, an illegal JSON string is stored as-is anyway, and it can
even block the user's normal queries on the correct data:
postgres=# drop table if exists test_bad_unicode;
DROP TABLE
postgres=# create table test_bad_unicode(j json);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause, and no column type is suitable for a distribution key. Creating a NULL policy entry.
CREATE TABLE
postgres=# INSERT INTO test_bad_unicode VALUES('{"name":"john", "value": 1, "bad_text": "\u0000"}');
INSERT 0 1
postgres=# INSERT INTO test_bad_unicode VALUES('{"name":"john", "value": 1, "bad_text": "all right"}');
INSERT 0 1
postgres=# SELECT j->'name' from test_bad_unicode;
ERROR: unsupported Unicode escape sequence (seg0 slice1
127.0.0.1:7002 pid=2462888)
DETAIL: \u0000 cannot be converted to text.
CONTEXT: JSON data, line 1: {"name":"john", "value": 1, "bad_text":...
The PR tries to solve the issue in a user-aware way.
Two GUCs, 'gp_json_preserve_ill_formed' and 'gp_json_preserve_ill_formed_prefix',
are added to preserve those "illegal" escape sequences as the original
un-escaped string, tagged with a user-given prefix.
So instead of:
postgres=# select json '{ "a": "\u0000" }' ->> 'a';
ERROR: unsupported Unicode escape sequence
DETAIL: \u0000 cannot be converted to text.
CONTEXT: JSON data, line 1: { "a":...
the user can choose to have:
postgres=# set gp_json_preserve_ill_formed = true;
SET
postgres=# set gp_json_preserve_ill_formed_prefix = "##";
SET
postgres=# select json '{ "a": "\u0000" }' ->> 'a';
?column?
----------
##\u0000
(1 row)
Then users can select the incorrect data by filtering on the prefix '##'.
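The intended decoding rule can be sketched in a few lines. This is a hypothetical Python model of the proposed GUC behavior, not the PR's actual C code, and it is simplified: valid surrogate pairs are not recombined here.

```python
import re

ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unescape_preserving(raw: str, prefix: str = "##") -> str:
    """Decode \\uXXXX escapes, but keep any escape PG cannot represent
    (NUL, surrogates, truncated escapes) verbatim, with a prefix."""
    out = []
    i = 0
    while i < len(raw):
        if raw.startswith("\\u", i):
            m = ESCAPE.match(raw, i)
            if m:
                cp = int(m.group(1), 16)
                if cp != 0 and not (0xD800 <= cp <= 0xDFFF):
                    out.append(chr(cp))              # ordinary escape: decode
                else:
                    out.append(prefix + m.group(0))  # NUL/surrogate: preserve
                i = m.end()
            else:
                out.append(prefix + "\\u")           # truncated: preserve
                i += 2
        else:
            out.append(raw[i])
            i += 1
    return "".join(out)

print(unescape_preserving("null \\u0000 escape"))  # -> null ##\u0000 escape
```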
The PR does introduce some inconsistencies; consider the queries below:
select json '{ "a": "null \u0000 escape" }' -> 'a' as it_is;
it_is
----------------------
"null \u0000 escape"
(1 row)
select json '{ "a": "null \u0000 escape" }' ->> 'a' as preserved_with_prefix;
preserved_with_prefix
-----------------------
null ##\u0000 escape
(1 row)
select '{ "a": "dollar \u0000 character" }'::jsonb::json->'a' as preserved_with_prefix;
preserved_with_prefix
------------------------------
"dollar ##\\u0000 character"
(1 row)
Users could be confused that the prefix '##' is not always shown in the query
results. In particular, the 3rd query uses '->' to get the original character
sequence from the JSON, yet it returns text with the prefix '##'. Still, all
of this is the expected behavior and, IMO, acceptable.
Other communities/companies have made efforts to solve similar problems, e.g.:
- WTF-8 (http://simonsapin.github.io/wtf-8/#encode-to-wtf-8)
- CESU-8 & MUTF-8 (https://en.wikipedia.org/wiki/UTF-8#CESU-8)
They all create a superset of UTF-8 and cannot guarantee lossless conversion
back to standard UTF-8. Supporting them would mean significant changes to the
pg codebase, and users might still lose the original ill-formed escape
sequences.
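The lossless-conversion problem is easy to demonstrate with the PEP 383 behavior mentioned above (a Python sketch; the byte value is illustrative):

```python
# PEP 383 ("surrogateescape") smuggles undecodable bytes into a str
# as lone surrogates -- one way such data reaches a streaming source:
s = b"caf\xff".decode("utf-8", "surrogateescape")
assert s == "caf\udcff"

# The result is not valid UTF-8, so a strict re-encode fails:
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate cannot be re-encoded as strict UTF-8")

# Round-tripping requires the same non-standard error handler, which
# is exactly what WTF-8-style supersets of UTF-8 generalize:
assert s.encode("utf-8", "surrogateescape") == b"caf\xff"
```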
The PR is not 100% done yet. Before I continue working on it, I'd like
to hear your opinions. In particular, please let me know if there are better
solutions.