pyDAL: uploads clarification

86 views
Skip to first unread message

Niphlod

unread,
Feb 11, 2015, 4:14:55 PM2/11/15
to web2py-d...@googlegroups.com
Hi all,
    I have not given much thought to (py)DAL uploads used with blobs (i.e. raw content on a blob field) but IMHO we should rectify some small doubts.

I have investigated the code upon this thread https://groups.google.com/d/msg/web2py/_klMYy4mp3Q/xhSG6eqYQEAJ , but that's another unrelated story.

What I thought (and said) was pretty much what's on the book, namely this excerpt:

blob fields are also special. By default, binary data is encoded in base64 before being stored into the actual database field, and it is decoded when extracted.

Immediately after this there's

This has the negative effect of using 25% more storage space than necessary in blob fields, but has two advantages. On average it reduces the amount of data communicated between web2py and the database server, and it makes the communication independent of back-end-specific escaping conventions.


Now... correct me if I'm wrong but....IMHO we should rectify the following pieces:
- "by default binary data is encoded in base64" --> perfectly fine by me, but are we suggesting to newbies that there's a "not default" way to NOT encode the contents ?
- "This has the negative effect of using 25% more storage space than necessary in blob fields" --> IMHO this is plain wrong. base64 (confirmed by wikipedia) expansion is approximately n * 4 / 3 = n*1.33 , so it's not 25%, it's 33%.
- "On average it reduces the amount of data communicated between web2py and the database server" --> uhm........doubtful. At the net cost of drivers being "screwy" (i.e. be more dumb than web2py itself, that inflates the contents of a third),  what happens really is that web2py buffers the contents of the file, base64.encode()s it in memory (ARGH!), sends over the wire the base64.encode()d representation of the contents. This means waiting - at least - 33% more of time for network contention, at the - not negligible - cost of base64.encode()ing the file on submit, and base64.decode()ing on read
- "it makes the communication independent of back-end-specific escaping conventions" --> wrong. it just makes our implementation of represent()ing/expand()ing a blob more linear, at the expense of the former specs. Sure, we don't have to deal with backend-specific code in each adapter, but the expense is quite a deal. The only thing more linear is the representation of "insert statement", and some backends (seem to recall firebird) have a taddle bit of problem dealing with, e.g. 100MB file uploads that turn out to an insert statement 130MB long. And that's probably because the DB API states almost nothing in regards of blobs, so that's our way to streamline a bit the APIs of different backends

And this is just about the little piece of documentation we have on blobs.

Now, another matter entirely....analyzing "the usual suspects" I found out that:
- sqlite --> 'blob'
- mysql --> 'longblob'
- mssql --> 'image'
- postgresql --> 'bytea'
- oracle --> 'clob'

this clashes really bad with what web2py does, that is "predigesting" the contents to a text representation, sending it in an insert statement, and the database storing it in a binary-type field, that is capable of storing much more than a base64 string.
Changing those "blob" columns with 'text'(like) ones doesn't hurt web2py in any way, because everything is passed through base64...so why choosing "binary-like" types for 'blob' fields ?
The current behaviour is quite erratic at best...
insert: raw file --> web2py --> stores in tempfile --> fetch it in memory --> base64 in memory --> long insert statement --> backend digestion --> backend implicit conversion to "binary storage" for something that is cleartext
select: backend spitting binary --> driver fetching a binary-like type --> implicit conversion (luckily for us, str()ify it just returns the text represenation that matches the base64 "original" string) --> web2py in memory --> base64 in memory --> result

Then I thought ... there could be some discrepancies between "upload" and "blob" fields, and web2py provides "blob" to deal with legacy tables... and tried... and tried ... and smashed... and found no way of dealing with a legacy table containing "correctly stored" (as per backend specs) binaries...

Anyone cares for commenting and/or explaining ?

Michele Comitini

unread,
Feb 11, 2015, 4:41:12 PM2/11/15
to web2py-developers
as I am a fun of delegation, each driver should store things the best way it can.
base64 could eventually be used as a "lingua franca" for csv exports.
exactly as it should happen with the boolean.

mic

--
-- mail from:GoogleGroups "web2py-developers" mailing list
make speech: web2py-d...@googlegroups.com
unsubscribe: web2py-develop...@googlegroups.com
details : http://groups.google.com/group/web2py-developers
the project: http://code.google.com/p/web2py/
official : http://www.web2py.com/
---
You received this message because you are subscribed to the Google Groups "web2py-developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to web2py-develop...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Michele Comitini

unread,
Feb 11, 2015, 4:42:23 PM2/11/15
to web2py-developers
fun => fan (with moderate fun)

Niphlod

unread,
Feb 16, 2015, 3:06:35 PM2/16/15
to web2py-d...@googlegroups.com
ok, so I'll make a PR to fix at least the docs.
On further thoughts, I think the choice to use 'binary-like' column types is done because some backends have - albeit pretty large - limits on text, while 'binary-like' are larger.
In any case if we want to switch there will be a clear breakage: fields stored as base64 encoded strings won't be compatible with "properly stored" binaries. Yet another arg in DAL init() or a totally new field type ?

Michele Comitini

unread,
Feb 17, 2015, 3:04:37 AM2/17/15
to web2py-developers
We could have 2 types:
binaryblob
base64blob

and blob would be an alias of the two based on the n-th kwarg in DAL

Massimo DiPierro

unread,
Feb 20, 2015, 12:55:23 PM2/20/15
to web2py-d...@googlegroups.com
I would just call them binary and base64 but I like the idea.
Reply all
Reply to author
Forward
0 new messages