JSON & ill-formed escape sequences


Mulong Chen

Sep 1, 2020, 3:22:34 AM
to gpdb...@greenplum.org
Hi devs,

I made a nasty PR to solve a nasty problem; it can be found at
https://github.com/greenplum-db/gpdb/pull/10746 . It tries to
handle illegal escape sequences in JSON string fields in a way that is
acceptable for users.

The "illegal" escape sequences handled here include:
- '\u0000', which is perfectly legal in the JSON standard but cannot be handled
  by PG.
- Ill-formed escape sequences such as missing hex digits, illegal surrogate pairs,
  and so on.

A similar topic has been discussed on the PG mailing list many times, e.g.:
https://www.postgresql.org/message-id/flat/51B56F2C.3020305%40dunslane.net#df085185143856c449734f08a91ee490
https://www.postgresql.org/message-id/flat/CA%2BTgmoapNgKpPiwVyR%3DwxCj%3D1m9RqL3311gA6fibbXijMv%3Drtg%40mail.gmail.com#d221e89ba047beb3429880aabd01a24f

Basically, PG/GP reports errors in those cases because there is no
way to maintain the correctness and consistency of this kind of JSON data.
Simply put, PG/GP wants to ensure that when json/jsonb is read from the db, it is
exactly the same as the original JSON representation. This is not possible,
since some of those escape sequences cannot be represented in a legal UTF-8
text value.
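Both halves of the problem can be demonstrated with Python's json module (an illustrative sketch, not GPDB code): '\u0000' is accepted as legal JSON but decodes to a NUL character that PG cannot store in text, while a lone surrogate escape decodes to a string that has no valid UTF-8 encoding at all.

```python
import json

# "\u0000" is legal JSON (RFC 8259), but it decodes to a real NUL
# character, which PostgreSQL cannot store in a text value.
val = json.loads('{"a": "\\u0000"}')["a"]
assert val == "\x00"

# A lone surrogate escape is accepted by many parsers (including Python's,
# which is deliberately non-strict here), but the resulting string cannot
# be encoded as standard UTF-8.
lone = json.loads('"\\ud800"')
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    print("no UTF-8 encoding:", exc.reason)
```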

The complaints came from GP streaming data users. They want to save all
streaming data into GP first, before processing it. But the streaming
source is not always reliable -- the data could contain a '\u0000', an
invalid escape, or other strange sequences. This could be caused by a non-UTF-8
encoding on the source server, a software bug in the streaming source, or
expected source-side behavior (like PEP 383). But users expect GP to store
the faulty data first, before they start investigating/cleaning it. For
jsonb, the situation is not too bad, since the data is verified before being stored.
For json, an illegal JSON string is stored as is anyway, and it can even block
the user's normal queries on the correct data:

postgres=# drop table if exists test_bad_unicode;
DROP TABLE
postgres=# create table test_bad_unicode(j json);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause, and no column type is suitable for a distribution key. Creating a NULL policy entry.
CREATE TABLE
postgres=# INSERT INTO test_bad_unicode VALUES('{"name":"john", "value": 1, "bad_text": "\u0000"}');
INSERT 0 1
postgres=# INSERT INTO test_bad_unicode VALUES('{"name":"john", "value": 1, "bad_text": "all right"}');
INSERT 0 1
postgres=# SELECT j->'name' from test_bad_unicode;
ERROR:  unsupported Unicode escape sequence  (seg0 slice1 127.0.0.1:7002 pid=2462888)
DETAIL:  \u0000 cannot be converted to text.
CONTEXT:  JSON data, line 1: {"name":"john", "value": 1, "bad_text":...


The PR tries to solve the issue in a user-aware way.
GUCs 'gp_json_preserve_ill_formed' and 'gp_json_preserve_ill_formed_prefix' are
added to preserve those "illegal" escape sequences as the original un-escaped
string, with a user-given prefix.

So instead of:

postgres=# select json '{ "a":  "\u0000" }' ->> 'a';
ERROR:  unsupported Unicode escape sequence
DETAIL:  \u0000 cannot be converted to text.
CONTEXT:  JSON data, line 1: { "a":...

the user can choose to have:

postgres=# set gp_json_preserve_ill_formed = true;
SET
postgres=# set gp_json_preserve_ill_formed_prefix = "##";
SET
postgres=# select json '{ "a":  "\u0000" }' ->> 'a';
 ?column?
----------
 ##\u0000
(1 row)

Then the user can select those incorrect rows by filtering on the prefix '##'.
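The proposed behavior can be mimicked in a few lines of Python (a hypothetical sketch of the GUC semantics, not the actual C implementation; valid surrogate pairs and the other JSON escapes are ignored for brevity):

```python
import re

def unescape_preserving(s: str, prefix: str = "##") -> str:
    """Unescape \\uXXXX sequences in a JSON string body, but keep the ones
    PG cannot convert (NUL, lone surrogates) verbatim, tagged with a
    user-visible prefix. Hypothetical sketch only; valid surrogate pairs
    and escapes like \\n, \\t are not handled here."""
    def repl(m: re.Match) -> str:
        cp = int(m.group(1), 16)
        if cp == 0 or 0xD800 <= cp <= 0xDFFF:   # NUL or lone surrogate
            return prefix + m.group(0)          # preserve the original text
        return chr(cp)
    return re.sub(r'\\u([0-9a-fA-F]{4})', repl, s)

print(unescape_preserving(r"null \u0000 escape"))  # -> null ##\u0000 escape
```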


The PR does bring some inconsistencies; consider the queries below:

select json '{ "a":  "null \u0000 escape" }' -> 'a' as it_is;
        it_is
----------------------
 "null \u0000 escape"
(1 row)

select json '{ "a":  "null \u0000 escape" }' ->> 'a' as preserved_with_prefix;
 preserved_with_prefix
-----------------------
 null ##\u0000 escape
(1 row)


select '{ "a":  "dollar \u0000 character" }'::jsonb::json->'a' as preserved_with_prefix;
    preserved_with_prefix    
------------------------------
 "dollar ##\\u0000 character"
(1 row)

The user could be confused that the prefix '##' is not always shown in the query
result. In particular, the 3rd query uses '->' to get the original character
sequence from the JSON, yet it returns the text with the prefix '##'. Still, all
of this is the expected behavior and IMO acceptable.

Different communities/companies have made efforts to solve
similar problems, e.g.:
WTF-8 (http://simonsapin.github.io/wtf-8/#encode-to-wtf-8)
CESU-8 & MUTF-8 (https://en.wikipedia.org/wiki/UTF-8#CESU-8)
They all create a superset of UTF-8 and cannot guarantee lossless
conversion to standard UTF-8. Supporting them would mean significant changes
in the PG codebase, and the user may still lose the original ill-formed escape
sequences.
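The round-trip problem these supersets share can be seen with Python's PEP 383 "surrogateescape" handler (mentioned above), which deliberately maps undecodable bytes to lone surrogates -- a string that only the same loose handler can re-encode:

```python
# PEP 383 maps undecodable bytes to lone surrogates; the resulting
# string cannot be re-encoded as standard UTF-8.
raw = b"caf\xff"                                   # not valid UTF-8
s = raw.decode("utf-8", errors="surrogateescape")
assert s == "caf\udcff"                            # lone surrogate U+DCFF

try:
    s.encode("utf-8")                              # strict UTF-8 refuses it
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError:
    pass

# Only the same non-standard handler round-trips the data.
assert s.encode("utf-8", errors="surrogateescape") == raw
```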

The PR is not 100% done yet. Before I continue working on it, I'd like
to hear your opinions. In particular, please let me know if there are better
solutions.

Regards,
Chen Mulong

Jasper Li

Sep 1, 2020, 5:22:44 AM
to Mulong Chen, gpdb...@greenplum.org
I'd like to add some small comments on this PR. It is not intended to solve the limitations of the json type in Postgres (details are here: https://www.postgresql.org/docs/9.5/datatype-json.html).

The typical use case is that a user needs to extract content from a JSON message. An illegal character will also prevent the user from extracting the values of other fields. For example,
```
select json_extract_path_text(('{"bad":"\u0000","name":"bb"}')::json,'name');
```
raises an error even though we only extract the value of "name", which is perfectly valid.

It is not uncommon for all messages in the same streaming topic to contain the same illegal characters, because all of them come from the same source. The JSON RFC doesn't officially say what the correct behavior is for illegal characters (and \u0000 is in fact a legal JSON escape). Postgres chooses to error out, but that breaks streaming ingestion in Greenplum.
Currently, the error handling of Greenplum cannot catch such failures, so this PR could help to remove this limitation. It would no longer raise errors or break streaming jobs when there is a spurious encoding error in a JSON message.

Regards,
Jasper


From: Mulong Chen <cmu...@vmware.com>
Sent: September 1, 2020 15:22
To: gpdb...@greenplum.org <gpdb...@greenplum.org>
Subject: JSON & ill-formed escape sequences

Heikki Linnakangas

Sep 1, 2020, 6:17:13 AM
to Jasper Li, Mulong Chen, gpdb...@greenplum.org
On 01/09/2020 12:22, Jasper Li wrote:
> I'd like to add some small comments for this PR. It is not intended to solve the limitation of json type of postgres (details is here: https://www.postgresql.org/docs/9.5/datatype-json.html
>
> The possible use case is that user needs to extract content from a json message. The illegal character will also prevent user from extracting the value of other fields. For example,
> ```
> select json_extract_path_text(('{"bad":"\u0000","name":"bb"}')::json,'name');
> ```
> it will raise error even if we only extract the value of "name", which is totally good.
>
> It is not uncommon that all messages in the same streaming topic have the same illegal characters because all of them are from the same source. JSON RFC doesn't tell officially what is the correct behavior in the case of illegal characters ( and \u0000 is a legal json character in fact).

Right. Exactly the same as PostgreSQL, nothing Greenplum specific here.

As a work-around, you could do something like:

select json_extract_path_text(replace(inputcol, '\u0000', '##\\u0000'));

> Postgres choose to error out but it will break the streaming
> ingestion in Greenplum. Currently, the error handling of Greenplum
> couldn't catch such failure so this PR could help to solve this
> limitation. It won't raise errors or break streaming jobs any more if
> there is an unnecessary encoding error in JSON message.

It sounds like that's the key question here that's relevant to
Greenplum: Why does Greenplum's streaming error handling fail to catch this?

- Heikki

Jasper Li

Sep 1, 2020, 6:33:14 AM
to Heikki Linnakangas, Mulong Chen, gpdb...@greenplum.org
> As a work-around, you could do something like:

> select json_extract_path_text(replace(inputcol, '\u0000', '##\\u0000'));

This is exactly what we tell customers to do now. However, it doesn't work well because:
  1. it can't distinguish '\\u0000' and other similar strings, which are legal
  2. besides \u0000, there might be other, unknown illegal Unicode escapes
  3. performance is very bad when the JSON is big and the number of columns is big.
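Point 1 can at least be addressed on the client side by only rewriting a \u0000 that is a real JSON escape, i.e. one preceded by an even number of backslashes (a hypothetical sketch; it does nothing about unknown escapes or the performance cost):

```python
import re

# Only rewrite a \u0000 that is a *real* JSON escape; a literal '\\u0000'
# (escaped backslash followed by 'u0000') is left alone. Illustrative
# sketch only -- it addresses objection 1, not objections 2 and 3.
NUL_ESCAPE = re.compile(r'(?<!\\)((?:\\\\)*)\\u0000')

def tag_nul_escapes(json_text: str, prefix: str = "##") -> str:
    return NUL_ESCAPE.sub(lambda m: m.group(1) + prefix + r'\u0000', json_text)

print(tag_nul_escapes(r'{"a": "\u0000", "b": "\\u0000"}'))
# -> {"a": "##\u0000", "b": "\\u0000"}
```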

> It sounds like that's the key question here that's relevant to
> Greenplum: Why does Greenplum's streaming error handling fail to catch this?

Single-row error handling only works with `ERRCODE_DATA_EXCEPTION` during the scan node; such errors often come from `ExecProject`. We are still working on this, but there has been no progress since this PR (https://github.com/greenplum-db/gpdb/pull/9880) was rejected.



From: Heikki Linnakangas <linnak...@vmware.com>
Sent: September 1, 2020 18:17
To: Jasper Li <lja...@vmware.com>; Mulong Chen <cmu...@vmware.com>; gpdb...@greenplum.org <gpdb...@greenplum.org>
Subject: Re: JSON & ill-formed escape sequences
 

Heikki Linnakangas

Sep 1, 2020, 8:03:26 AM
to Jasper Li, Mulong Chen, gpdb...@greenplum.org
On 01/09/2020 13:33, Jasper Li wrote:
>> As a work-around, you could do something like:
>
>> select json_extract_path_text(replace(inputcol, '\u0000', '##\\u0000'));
>
> This is exactly what we let customer do now. However, it doesn't work well because,
>
> 1. it can't recognize '\\u0000' or other similar strings, which are legal
> 2. Besides \u0000, there might be other unknown illegal Unicode characters

You can make it as sophisticated as you need to.

> 3. performance is very bad when json is big and the column number is big.

Numbers please. It's not free, for sure, but how bad is "very bad"?

I did a quick little test on my laptop, with vanilla PostgreSQL. I used
the first json example document I could find at
https://json.org/example.html, loaded it into a table 1 million times,
and timed:

explain analyze select json_extract_path_text(t::json, 'xxx') from jsontest;

vs.

explain analyze select json_extract_path_text(replace(t, '\u0000',
'')::json, 'xxx') from jsontest;

The difference was about 10%. With different data it could be more, of
course, but in the big scheme of things, it really doesn't seem so bad
to me.

>> It sounds like that's the key question here that's relevant to
>> Greenplum: Why does Greenplum's streaming error handling fail to catch this?
>
> Single row error handling only work with `ERRCODE_DATA_EXCEPTION` during scan node. Such error comes from `ExecProject` often. We are still working on this but no progress by now after this PR<https://github.com/greenplum-db/gpdb/pull/9880> is rejected.

Cool, let's continue the work on that, then, based on the discussions we
had then.

- Heikki

Jasper Li

Sep 1, 2020, 8:11:18 AM
to Heikki Linnakangas, Mulong Chen, gpdb...@greenplum.org, Wen Lin
> Cool, let's continue the work on that, then, based on the discussions we had then.

Any suggestions?
This PR is unnecessary (on 7 and later, at least) as long as the error handling can catch such errors.

From: Heikki Linnakangas <linnak...@vmware.com>
Sent: September 1, 2020 20:03
Subject: Re: JSON & ill-formed escape sequences
 

Heikki Linnakangas

Sep 1, 2020, 8:31:30 AM
to Jasper Li, Mulong Chen, gpdb...@greenplum.org, Wen Lin
On 01/09/2020 15:11, Jasper Li wrote:
>> Cool, let's continue the work on that, then, based on the discussions we had then.
>
> Any suggestions?
> this PR is needless (on 7 and later at least) so long as the error handling can catch such errors.

Quoting myself from the previous discussions:

On 11/04/2020 19:17, Heikki Linnakangas wrote:
> If we're happy with a subtransaction for every row, can you just do
> something like this:
>
> CREATE EXTERNAL TABLE ext_fromtbl1 (j json) LOCATION (...);
>
> CREATE FUNCTION transform (src ext_fromtbl1) RETURNS boolean
> LANGUAGE plpgsql
> as $$
> begin
> return (j->>1)::boolean;
> exception when others then
> -- log error here
> raise notice '%', sqlerrm;
> end;
> $$;
>
> select transform(x) from ext_fromtbl1 x;

And:

On 11/04/2020 12:17, Heikki Linnakangas (Pivotal) wrote:
> On 10/04/2020 08:04, Divya Bhargov wrote:
>> The intention of being able to continue processing the records, without
>> aborting, during a transformation error and logging it into an error
>> table is definitely a valuable fix in the ETL process.
>
> That is a hard problem. In PostgreSQL in general, it is *not* safe to
> continue processing after an error has been thrown with ereport(). The
> classic example is a duplicate key violation. The way that works is that
> the system first inserts the row into the heap, and then updates the
> indexes, and the possible duplicate key violation is detected during the
> index update. But at that point, you've already inserted the tuple into
> the heap, so you *must* abort the transaction, or you're left with a
> tuple in the table with no corresponding index entry.
>
> We've hacked through a few cases in GPDB that are deemed safe enough in
> practice. We're assuming that if an error is thrown while parsing the
> input row or when calling the input functions, it's OK to continue
> processing. Even that is a bit sketchy. For example, you could imagine a
> user-defined datatype with an input function that uses a 3rd party
> library, which might leak library handles or get otherwise confused if
> an ereport() doesn't lead to a transaction abort. But we've accepted
> that risk.
>
> In what ways would this PR extend that risk? We definitely cannot ignore
> errors from arbitrary expression. But maybe there are some more special
> cases we could allow? We need to be very careful here and clearly
> document the exceptions to the usual "ereport() always leads to abort" rule.
>
> I would prefer to see a different approach to this whole problem though,
> one that wouldn't require skipping the transaction abort. For example,
> what if we did the parsing and executed the expressions in a different
> backend, in a different transaction, separate from the backend that
> actually inserts the data to the table? Then we could easily roll back
> the transaction that the expression is executed in, without having to
> roll back the transaction that actually does the loading. That would
> assume that the expressions don't do INSERTs or other update of their
> own, with effects that would need to be preserved. But if they did such
> things, then catching and ignoring the errors would be unsafe anyway, so
> at least the behavior would be more sane and consistent if it was done
> in a different transaction.
>
> I'm not wedded to that particular idea, maybe there is some even better
> way to handle this. We need to find a solution that doesn't involve
> catching arbitrary ereports() and skipping the aborts.
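The parse-in-a-separate-process idea above can be sketched in miniature (a hypothetical Python illustration, not GPDB code; `transform_row` and the sample rows are invented for the example):

```python
import json
from multiprocessing import Pool

def transform_row(line: str):
    """Runs in a worker process, playing the role of the separate backend:
    any parse error is caught and returned as data, never raised into the
    process that performs the actual load."""
    try:
        doc = json.loads(line)
        return ("ok", doc.get("name"))
    except ValueError as exc:          # json.JSONDecodeError is a ValueError
        return ("error", str(exc))

if __name__ == "__main__":
    lines = ['{"name": "john"}', '{"name": broken', '{"name": "bb"}']
    with Pool(2) as pool:              # the "different backend"
        results = pool.map(transform_row, lines)
    good = [v for status, v in results if status == "ok"]
    bad = [v for status, v in results if status == "error"]
    print(good)  # rows the loader would insert: ['john', 'bb']
    print(bad)   # rows that would go to an error log
```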

- Heikki

Jasper Li

Sep 1, 2020, 8:52:35 AM
to Heikki Linnakangas, Mulong Chen, gpdb...@greenplum.org, Wen Lin, Peifeng Qiu
I see.
Then we will surely start with this and initiate related discussion if we make any progress.
Much appreciated! Thanks, Heikki!

From: Heikki Linnakangas <linnak...@vmware.com>
Sent: September 1, 2020 20:31
To: Jasper Li <lja...@vmware.com>; Mulong Chen <cmu...@vmware.com>; gpdb...@greenplum.org <gpdb...@greenplum.org>; Wen Lin <li...@vmware.com>
Subject: Re: JSON & ill-formed escape sequences
 

Wen Lin

Sep 18, 2020, 11:24:18 AM
to Heikki Linnakangas, Mulong Chen, Peifeng Qiu, Jasper Li, gpdb...@greenplum.org
Hi, All,

I have run a simple performance test using a function to catch projection errors. It is very common in customers' environments to hit a projection error while loading data.
Below are the steps and results.

  1. Create a file, named "exttab.data", which contains 1 million integers.
  2. Create an external table:
    CREATE EXTERNAL TABLE exttab_basic_1( i int)
    LOCATION ('file://linw-test:6000//home/gpadmin/exttab.data') FORMAT 'TEXT' (DELIMITER '|')
    LOG ERRORS SEGMENT REJECT LIMIT 2;
  3. Create two functions: the first returns a default value on a division_by_zero error; the second uses an SRF and returns a set.
    CREATE FUNCTION transform1 (src exttab_basic_1) RETURNS integer as $$
        begin
          return src.i;
        exception
          when division_by_zero then
            raise notice '%', sqlerrm;
            return 0;
        end;
     $$ LANGUAGE plpgsql;

     CREATE FUNCTION transform2 (src exttab_basic_1) RETURNS SETOF integer as $$
        begin
          return NEXT src.i;
        exception
          when division_by_zero then
            raise notice '%', sqlerrm;
            RETURN;
        end;
     $$ LANGUAGE plpgsql;
  4. Run the SELECT queries:
    select * from exttab_basic_1;
    1. Time: 994.783 ms
    2. Time: 738.306 ms
    3. Time: 578.568 ms
    select transform1(exttab_basic_1) from exttab_basic_1;
    1. Time: 9846.092 ms
    2. Time: 9866.108 ms
    3. Time: 9826.879 ms
    select transform2(exttab_basic_1) from exttab_basic_1;
    1. Time: 12911.812 ms
    2. Time: 14521.931 ms
    3. Time: 14829.304 ms

We can see that if we use a function to catch the errors, the performance is bad. So what can we do about this issue, or are there other solutions?
Thanks!

Wen


From: Jasper Li <lja...@vmware.com>
Sent: Tuesday, September 1, 2020 8:52 PM
To: Heikki Linnakangas <linnak...@vmware.com>; Mulong Chen <cmu...@vmware.com>; gpdb...@greenplum.org <gpdb...@greenplum.org>; Wen Lin <li...@vmware.com>; Peifeng Qiu <peif...@vmware.com>
Subject: Re: JSON & ill-formed escape sequences
 