Hi all,
I have not given much thought to (py)DAL uploads used with blobs (i.e. raw content on a blob field), but IMHO we should clear up some small doubts.
I have investigated the code following this thread
https://groups.google.com/d/msg/web2py/_klMYy4mp3Q/xhSG6eqYQEAJ , but that's another, unrelated story.
What I thought (and said) was pretty much what's on the book, namely this excerpt:
blob fields are also special. By default, binary data is encoded in base64 before being stored into the actual database field, and it is decoded when extracted.
Immediately after this there's
This has the negative effect of using 25% more storage space than necessary in blob fields, but has two advantages. On average it reduces the amount of data communicated between web2py and the database server, and it makes the communication independent of back-end-specific escaping conventions.
Now... correct me if I'm wrong, but IMHO we should rectify the following pieces:
- "by default, binary data is encoded in base64" --> perfectly fine by me, but are we suggesting to newbies that there's a non-default way to NOT encode the contents?
- "This has the negative effect of using 25% more storage space than necessary in blob fields" --> IMHO this is plain wrong. base64 expansion (confirmed by Wikipedia) is approximately n * 4 / 3 = n * 1.33, so it's not 25%, it's 33%.
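The 4/3 expansion is easy to verify with plain Python (nothing web2py-specific, just an arbitrary payload size divisible by 3 so there's no padding):

```python
import base64

# base64 maps every 3 input bytes to 4 output characters,
# so the stored size grows by a factor of 4/3.
raw = b"\x00" * 3_000_000
encoded = base64.b64encode(raw)

print(len(encoded))             # 4000000
print(len(encoded) / len(raw))  # 1.333... -> ~33% more, not 25%
```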
- "On average it reduces the amount of data communicated between web2py and the database server" --> uhm... doubtful. Barring drivers that are "screwy" (i.e. dumber than web2py itself, which already inflates the contents by a third), what really happens is that web2py buffers the contents of the file, base64.encode()s it in memory (ARGH!), and sends the base64.encode()d representation of the contents over the wire. This means waiting - at least - 33% longer on network contention, at the - not negligible - cost of base64.encode()ing the file on submit and base64.decode()ing it on read.
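The buffer-then-encode pattern described above can be sketched like this (a deliberately simplified illustration, not web2py's actual code; `store_blob` and `read_blob` are made-up names):

```python
import base64
import io

def store_blob(fileobj):
    raw = fileobj.read()          # whole file buffered in RAM
    return base64.b64encode(raw)  # second full copy, ~33% bigger

def read_blob(stored):
    return base64.b64decode(stored)  # decoded again on every select

payload = b"binary \x00\xff data" * 1000
stored = store_blob(io.BytesIO(payload))
assert read_blob(stored) == payload   # round-trip is lossless...
assert len(stored) > len(payload)     # ...but the wire/storage cost grows
```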
- "it makes the communication independent of back-end-specific escaping conventions" --> wrong. It just makes our implementation of represent()ing/expand()ing a blob more uniform, at the expense of the former specs. Sure, we don't have to deal with backend-specific code in each adapter, but the cost is quite a deal. The only thing made more uniform is the representation of the insert statement, and some backends (Firebird, if I recall correctly) have a tad bit of a problem dealing with, e.g., a 100MB file upload that turns into an insert statement 130MB long. And that's probably because the DB API says almost nothing about blobs, so this is our way to streamline a bit the APIs of different backends.
And this is just about the little piece of documentation we have on blobs.
Now, another matter entirely....analyzing "the usual suspects" I found out that:
- sqlite --> 'blob'
- mysql --> 'longblob'
- mssql --> 'image'
- postgresql --> 'bytea'
- oracle --> 'clob'
this clashes really badly with what web2py does, that is, "predigesting" the contents into a text representation, sending it in an insert statement, and having the database store it in a binary-type field that is capable of storing much more than a base64 string.
Changing those "blob" columns to 'text'(-like) ones wouldn't hurt web2py in any way, because everything is passed through base64... so why choose "binary-like" types for 'blob' fields?
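A quick illustrative check (again, not web2py code) that a base64-encoded value is plain ASCII, so a text-like column would hold it just as well as the binary column types listed above:

```python
import base64

blob = bytes(range(256))            # arbitrary binary content
stored = base64.b64encode(blob)

stored.decode("ascii")              # never raises: base64 output is pure ASCII
assert base64.b64decode(stored) == blob  # and it still round-trips intact
```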
The current behaviour is quite erratic at best...
insert: raw file --> web2py --> stores in tempfile --> fetch it in memory --> base64 in memory --> long insert statement --> backend digestion --> backend implicit conversion to "binary storage" for something that is cleartext
select: backend spitting binary --> driver fetching a binary-like type --> implicit conversion (luckily for us, str()ifying it just returns the text representation that matches the base64 "original" string) --> web2py in memory --> base64 in memory --> result
Then I thought... there could be some discrepancies between "upload" and "blob" fields, given that web2py provides "blob" to deal with legacy tables... so I tried... and tried... and crashed... and found no way of dealing with a legacy table containing "correctly stored" (as per backend specs) binaries.
Does anyone care to comment and/or explain?