
SQLAlchemy: JSON vs. PickleType vs. raw string for serialised data


Loris Bennett

Feb 28, 2022, 4:12:04 AM
Hi,

I have an SQLAlchemy class for an event:

class UserEvent(Base):
    __tablename__ = "user_events"

    id = Column('id', Integer, primary_key=True)
    date = Column('date', Date, nullable=False)
    uid = Column('gid', String(64), ForeignKey('users.uid'), nullable=False)
    info = ??

The event may have arbitrary, but dict-like data associated with it,
which I want to add in the field 'info'. This data never needs to be
modified, once the event has been inserted into the DB.

What type should the info field have? JSON, PickleType, String, or
something else?

I couldn't find any really reliable-sounding information about the relative
pros and cons, apart from a Reddit thread claiming that pickled dicts
are larger than dicts converted to JSON or String.
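That size claim is at least easy to measure; a quick sketch with made-up
example data (the keys and values here are purely illustrative):

```python
# Serialise the same small dict with pickle and with JSON and compare
# the encoded sizes.
import json
import pickle

info = {"user": "alice", "action": "login", "node": "node001"}

json_bytes = json.dumps(info).encode("utf-8")
pickle_bytes = pickle.dumps(info)

print(len(json_bytes), len(pickle_bytes))
```

For small, string-only dicts like this, the pickle tends to be the larger
of the two because of its protocol header and per-object framing.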

Cheers,

Loris

--
This signature is currently under construction.

Albert-Jan Roskam

Feb 28, 2022, 12:15:04 PM

I think you need a
BLOB. https://docs.sqlalchemy.org/en/14/core/type_basics.html#sqlalchemy.types.LargeBinary

Robert Latest

Feb 28, 2022, 4:29:36 PM
Loris Bennett wrote:
> The event may have arbitrary, but dict-like data associated with it,
> which I want to add in the field 'info'.  This data never needs to be
> modified, once the event has been inserted into the DB.
>
> What type should the info field have?  JSON, PickleType, String, or
> something else?
>
> I couldn't find any really reliable sounding information about the
> relative
> pros and cons, apart from a Reddit thread claiming that pickled dicts
> are larger than dicts converted to JSON or String.

I've done exactly this. Since my data was strictly ASCII I decided to go for
JSON. But in the end you're the only one who can decide this because only you
know the data. That's why you won't find any hard and fast rule for this.

Greg Ewing

Feb 28, 2022, 5:17:31 PM
On 1/03/22 6:13 am, Albert-Jan Roskam wrote:
> I think you need a
> BLOB. https://docs.sqlalchemy.org/en/14/core/type_basics.html#sqlalchemy.types.LargeBinary

That won't help on its own, since you still need to choose a
serialisation format to store in the blob.

I'd be inclined to use JSON if the data is something that can
be easily represented that way.

--
Greg

Cameron Simpson

Feb 28, 2022, 5:47:15 PM
On 28Feb2022 10:11, Loris Bennett <loris....@fu-berlin.de> wrote:
>I have an SQLAlchemy class for an event:
>
> class UserEvent(Base):
>     __tablename__ = "user_events"
>
>     id = Column('id', Integer, primary_key=True)
>     date = Column('date', Date, nullable=False)
>     uid = Column('gid', String(64), ForeignKey('users.uid'), nullable=False)
>     info = ??
>
>The event may have arbitrary, but dict-like data associated with it,
>which I want to add in the field 'info'. This data never needs to be
>modified, once the event has been inserted into the DB.
>
>What type should the info field have? JSON, PickleType, String, or
>something else?

I would use JSON; it expresses dicts well, provided the dicts contain
only basic types (strings, numbers, and other dicts/lists of basic types,
recursively).

I have personal problems with pickle because non-Python code can't read
it.
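Both caveats show up in a couple of lines (the example data is
illustrative):

```python
# json.dumps handles basic types only; pickle accepts almost anything,
# but only Python can read the result back.
import datetime
import json
import pickle

ok = {"user": "alice", "count": 3}
json.dumps(ok)  # basic types: fine

awkward = {"when": datetime.date(2022, 2, 28)}
pickle.dumps(awkward)  # pickle takes it without complaint
try:
    json.dumps(awkward)
except TypeError as exc:
    print("json refused:", exc)  # dates need an explicit encoding first
```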

Cheers,
Cameron Simpson <c...@cskk.id.au>

Loris Bennett

Mar 1, 2022, 2:36:04 AM
Thanks for the various suggestions. The data I need to store is just a
dict with maybe 3 or 4 keys and short string values probably of less
than 32 characters each per event. The traffic on the DB is going to be
very low, creating maybe a dozen events a day, mainly triggered via a
command-line interface, although I will probably set up one or two cron
jobs, each of which might generate another 0 to maybe 5 records a day.

I could go for JSON (or rather LONGTEXT, as MariaDB's JSON type is just
an alias for LONGTEXT, and that alias is not available in the version of
MariaDB I am using). However, that seems like overkill, since I am never
going to store anything near 4 GB in the field. So I should probably in
fact just use, say, VARCHAR(255).
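A minimal sketch of that plain-VARCHAR route (class and column names are
illustrative, the model is trimmed to the columns that matter here, and
SQLite stands in for MariaDB):

```python
# Plain String(255) column: serialise by hand with json.dumps on insert,
# parse by hand with json.loads on read.
import datetime
import json

from sqlalchemy import Column, Date, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class UserEvent(Base):
    __tablename__ = "user_events"

    id = Column(Integer, primary_key=True)
    date = Column(Date, nullable=False)
    info = Column(String(255), nullable=False)  # JSON text, written once

engine = create_engine("sqlite://")  # stand-in for the real MariaDB URL
Base.metadata.create_all(engine)

# Serialise on the way in...
with Session(engine) as session:
    session.add(UserEvent(date=datetime.date(2022, 3, 1),
                          info=json.dumps({"action": "login",
                                           "node": "node001"})))
    session.commit()

# ...and parse on the way out.
with Session(engine) as session:
    stored = session.query(UserEvent).first()
    print(json.loads(stored.info)["action"])
```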

WDYT?

Cheers,

Loris

Robert Latest

Mar 1, 2022, 4:51:11 AM
Loris Bennett wrote:
> Thanks for the various suggestions. The data I need to store is just a
> dict with maybe 3 or 4 keys and short string values probably of less
> than 32 characters each per event. The traffic on the DB is going to be
> very low, creating maybe a dozen events a day, mainly triggered via a
> command-line interface, although I will probably set up one or two cron
> jobs, each of which might generate another 0 to maybe 5 records a day.
>
> I could go for JSON (or rather LONGTEXT, as MariaDB's JSON type is just
> an alias for LONGTEXT, and that alias is not available in the version of
> MariaDB I am using). However, that seems like overkill, since I am never
> going to store anything near 4 GB in the field. So I should probably in
> fact just use, say, VARCHAR(255).
>
> WDYT?

Using a TypeDecorator to transparently convert between a dict and its JSON
string representation, plus MutableDict to track changes, you get a
completely transparent attribute that works just like a dict. Make sure to
check that the generated JSON fits into your column width. I once got
bitten by the fact that a VARCHAR(x) can hold only x/4 characters in the
utf8mb4 character set.
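For the record, a minimal sketch of that TypeDecorator (the class name is
illustrative, not from any library):

```python
# A TypeDecorator that stores a dict as a JSON string in a VARCHAR
# column: the column holds text, but the mapped attribute is a dict.
import json

from sqlalchemy.types import String, TypeDecorator

class JSONEncodedDict(TypeDecorator):
    """Serialise a dict to JSON on the way in, parse it on the way out."""

    impl = String(255)
    cache_ok = True

    def process_bind_param(self, value, dialect):
        return json.dumps(value) if value is not None else None

    def process_result_value(self, value, dialect):
        return json.loads(value) if value is not None else None

# Usage in a model:
#   info = Column(JSONEncodedDict)
# For in-place change tracking, wrap it with the mutable extension:
#   from sqlalchemy.ext.mutable import MutableDict
#   info = Column(MutableDict.as_mutable(JSONEncodedDict))
```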

Loris Bennett

Mar 1, 2022, 5:16:37 AM
Thanks for pointing out TypeDecorator - I wasn't aware of that. I won't
need to track changes in the JSON data, because the events I am
recording form an audit trail and so are written and read, but never
modified.

Dennis Lee Bieber

Mar 1, 2022, 11:27:22 AM
On Tue, 01 Mar 2022 08:35:05 +0100, Loris Bennett
<loris....@fu-berlin.de> declaimed the following:

>Thanks for the various suggestions. The data I need to store is just a
>dict with maybe 3 or 4 keys and short string values probably of less
>than 32 characters each per event. The traffic on the DB is going to be
>very low, creating maybe a dozen events a day, mainly triggered via a
>command-line interface, although I will probably set up one or two cron
>jobs, each of which might generate another 0 to maybe 5 records a day.
>
>I could go for JSON (or rather LONGTEXT, as MariaDB's JSON type is just
>an alias for LONGTEXT, and that alias is not available in the version of
>MariaDB I am using). However, that seems like overkill, since I am never
>going to store anything near 4 GB in the field. So I should probably in
>fact just use, say, VARCHAR(255).
>
>WDYT?
>

Having taken a few on-line short courses on database normalization and
SQL during my first lay-off, my view would be to normalize everything
first... Which, in your description, means putting that dictionary into a
separate table of the form (I also tend to define an autoincrement primary
key for all tables):

DICTDATA(*ID*, _eventID_, dictKey, dictValue)

where * delimits primary key, _ delimits foreign key to parent (event?)
record.
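In SQLAlchemy's declarative style, that layout might look something like
this (table and attribute names are illustrative):

```python
# Normalised key/value layout: each dict entry becomes one row in a
# child table, with a foreign key back to the parent event.
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class UserEvent(Base):
    __tablename__ = "user_events"

    id = Column(Integer, primary_key=True, autoincrement=True)
    info = relationship("DictData", backref="event")

class DictData(Base):
    __tablename__ = "dict_data"

    id = Column(Integer, primary_key=True, autoincrement=True)
    event_id = Column(Integer, ForeignKey("user_events.id"),
                      nullable=False)
    dict_key = Column(String(64), nullable=False)
    dict_value = Column(String(64), nullable=False)

# Reassembling the original dict for one event is then a comprehension:
#   {row.dict_key: row.dict_value for row in event.info}
```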



Caveat: While I have a book on SQLAlchemy, I confess it makes no sense to
me -- I can code SQL joins faster than figuring out how to represent the
same join in SQLAlchemy.


--
Wulfraed Dennis Lee Bieber AF6VN
wlf...@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/

Loris Bennett

Mar 2, 2022, 2:26:47 AM
Dennis Lee Bieber <wlf...@ix.netcom.com> writes:

> On Tue, 01 Mar 2022 08:35:05 +0100, Loris Bennett
> <loris....@fu-berlin.de> declaimed the following:
>
>>Thanks for the various suggestions. The data I need to store is just a
>>dict with maybe 3 or 4 keys and short string values probably of less
>>than 32 characters each per event. The traffic on the DB is going to be
>>very low, creating maybe a dozen events a day, mainly triggered via a
>>command-line interface, although I will probably set up one or two cron
>>jobs, each of which might generate another 0 to maybe 5 records a day.
>>
>>I could go for JSON (or rather LONGTEXT, as MariaDB's JSON type is just
>>an alias for LONGTEXT, and that alias is not available in the version of
>>MariaDB I am using). However, that seems like overkill, since I am never
>>going to store anything near 4 GB in the field. So I should probably in
>>fact just use, say, VARCHAR(255).
>>
>>WDYT?
>>
>
> Having taken a few on-line short courses on database normalization and
> SQL during my first lay-off, my view would be to normalize everything
> first... Which, in your description, means putting that dictionary into a
> separate table of the form (I also tend to define an autoincrement primary
> key for all tables):
>
> DICTDATA(*ID*, _eventID_, dictKey, dictValue)
>
> where * delimits primary key, _ delimits foreign key to parent (event?)
> record.

Ah, yes, you are right. That would indeed be the correct way to do it.
I'll look into that. Up to now I was thinking I would only ever want to
read out the dict in its entirety, but that's probably not correct.

> Caveat: While I have a book on SQLAlchemy, I confess it makes no sense to
> me -- I can code SQL joins faster than figuring out how to represent the
> same join in SQLAlchemy.

I currently can't code SQL joins quickly anyway, so although doing it in
SQLAlchemy might be relatively slower, in absolute terms there's probably
not going to be much difference :-)