Pass SQL null values to Parquet

551 views
Skip to first unread message

Rory Campbell-Lange

unread,
Jan 22, 2023, 3:57:04 PM1/22/23
to golan...@googlegroups.com
I'm writing a PostgreSQL dump file to parquet converter as a hobby project.

PostgreSQL dump files are simply plain-text files containing, as the man page puts it "the SQL commands required to reconstruct the database to the state it was in at the time it was saved." Null SQL values are represented by "\N".

I have implemented a fairly efficient line-wise dump file reader which includes a rudimentary table typer. However I'm not sure how to represent nulls in my go code so that I can pass these onto parquet encoders.

Is a struct something like the following a good idea for each type?

type pgInt64 struct {
val int64
isNull bool
}

(This somewhat mirrors the sort of sidecar "validity" design of Arrow and definition levels of Parquet as described at https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/).

Simple tests encoding parquet format files with github.com/xitongsys/parquet-go and github.com/fraugster/parquet-go work ok. However I'm equally unsure how to pass these nulls.

Thanks for any insights,
Rory

Nick White

unread,
Jan 22, 2023, 5:22:10 PM1/22/23
to golan...@googlegroups.com
On Sun, Jan 22, 2023 at 08:56:12PM +0000, Rory Campbell-Lange wrote:
> I have implemented a fairly efficient line-wise dump file reader which includes a rudimentary table typer. However I'm not sure how to represent nulls in my go code so that I can pass these onto parquet encoders.
>
> Is a struct something like the following a good idea for each type?
>
> type pgInt64 struct {
> val int64
> isNull bool
> }

I'm not experienced with Parquet at all, but that struct plan is
almost exactly the same as the format go's database/sql package
reads into:

$ go doc sql NullInt64
package sql // import "database/sql"

type NullInt64 struct {
Int64 int64
Valid bool // Valid is true if Int64 is not NULL
}

...

So yes, it seems like a reasonable way to represent data that may
contain nulls. Whether this is the easiest way to pass to parquet, I
don't know, but I'd imagine it shouldn't be too bad.

Nick

Brian Candler

unread,
Jan 23, 2023, 3:00:03 AM1/23/23
to golang-nuts
If you're using a third-party parquet encoding library, you'll need to pass null/not-null using whatever convention that library requires.

I took a quick look at xitongsys/parquet-go's front-page documentation, and it seems it handles null (OPTIONAL) using pointers:

That is, an int32 which maybe null is passed as *int32, with a nil pointer for null.

Reply all
Reply to author
Forward
0 new messages