Using apsw in C extension

35 views
Skip to first unread message

Nikolaus Rath

unread,
Oct 7, 2011, 6:19:52 PM10/7/11
to python...@googlegroups.com
Hello,

Currently my application is written entirely in Python and uses apsw to
access an SQLite database.

At times, it is necessary to dump the entire database into a file. At
the moment, I am using cPickle for that. After LZMA compression, this is
significantly smaller than using the (LZMA compressed) SQLite database
(even after VACUUM) - presumably because it does not contain any
indices.

I would like to speed up this operation and reduce the resulting
file size.

I believe that I'm getting a lot of overhead from using Python and
pickle, and that I'm not getting optimal compression because I'm not
taking advantage of the fact that there is a lot freedom in the order in
which columns, rows and tables can be saved.

My idea is to use delta compression on successive rows, because most of
them contain increasing integer values. In order to do that efficiently,
I'd like to factor out the dumping code into a C extension.

My question (sorry for the long introduction) is this: can I somehow use
the native SQLite API in my C extension, i.e. is there a way to get a C
pointer to the sqlite DB object from apsw? If possible, I would like to
avoid the superfluous conversion of values to Python objects by apsw,
and back to C values by the planned C extension.

Also, am I going to run into linking problems if both apsw and my
extension link to libsqlite.so, and then Python tries to load both of
them?


Best,

-Nikolaus

--
»Time flies like an arrow, fruit flies like a Banana.«

PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C

Roger Binns

unread,
Oct 7, 2011, 9:21:21 PM10/7/11
to python...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 07/10/11 15:19, Nikolaus Rath wrote:
> significantly smaller than using the (LZMA compressed) SQLite database
> (even after VACUUM) - presumably because it does not contain any
> indices.

You could always make a backup of the SQLite database, drop the indices in
it, vacuum and then compress the result.

> I believe that I'm getting a lot of overhead from using Python and
> pickle,

Size or speed? Certainly pickle binary encoding isn't as space efficient
as SQLite's integer encoding. For APSW's dump code I wrote a C routine
for dumping values because dealing with strings and blobs was far more
efficient that way (speed):

http://apidoc.apsw.googlecode.com/hg/apsw.html#apsw.format_sql_value

> and that I'm not getting optimal compression because I'm not taking
> advantage of the fact that there is a lot freedom in the order in which
> columns, rows and tables can be saved.

You'll still likely need one byte per value since you'll want some bits to
indicate type and some more bits to give an integer delta. SQLite's
encoding is substantially similar to that:

http://www.sqlite.org/fileformat2.html#record_format

> i.e. is there a way to get a C pointer to the sqlite DB object from
> apsw?

It has been available for several years:


http://apidoc.apsw.googlecode.com/hg/connection.html#apsw.Connection.sqlite3pointer

> Also, am I going to run into linking problems if both apsw and my
> extension link to libsqlite.so, and then Python tries to load both of
> them?

The operating system will only load the same shared library once. Just do
not use the amalgamation with APSW otherwise it will have a private
internal copy of SQLite.

What I'd recommend you do is add whatever new methods you want to the APSW
source code until you have something working well, and then figure out how
to carve that out separately.

However I'm will to bet that a SQLite database that has been vacuumed and
indices removed will be as good or only slightly worse than anything you
come up with :-)

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iEYEARECAAYFAk6PpZAACgkQmOOfHg372QSZrACfTlbB49Ky8a9V3pFfbPK2NIup
yJAAoMBQPk/YFcYPlhLkax63tg/auBq1
=GQD2
-----END PGP SIGNATURE-----

Nikolaus Rath

unread,
Oct 8, 2011, 11:29:33 AM10/8/11
to python...@googlegroups.com
Roger Binns <rogerb-JFdGOZ7...@public.gmane.org> writes:
> On 07/10/11 15:19, Nikolaus Rath wrote:
>> significantly smaller than using the (LZMA compressed) SQLite database
>> (even after VACUUM) - presumably because it does not contain any
>> indices.
>
> You could always make a backup of the SQLite database, drop the indices in
> it, vacuum and then compress the result.

Yes, but that's what I'm comparing with already. It seems that the
unremovable indices for the primary keys take up a significant amount of
space.

>> I believe that I'm getting a lot of overhead from using Python and
>> pickle,
>
> Size or speed?

Both.

> Certainly pickle binary encoding isn't as space efficient as SQLite's
> integer encoding. For APSW's dump code I wrote a C routine for dumping
> values because dealing with strings and blobs was far more efficient
> that way (speed):
>
> http://apidoc.apsw.googlecode.com/hg/apsw.html#apsw.format_sql_value

I already tried apsw's table dump functionality. Before compression the
result is (obviously) way larger than the binary pickle, but after
compression it's about the same. There is also no significant time
difference, it seems that cPickle and apsw are pretty much heads-on
here.

>> and that I'm not getting optimal compression because I'm not taking
>> advantage of the fact that there is a lot freedom in the order in which
>> columns, rows and tables can be saved.
>
> You'll still likely need one byte per value since you'll want some bits to
> indicate type and some more bits to give an integer delta.

Almost, there is only one type of data in every column, and I can
hardcode the type.

>> i.e. is there a way to get a C pointer to the sqlite DB object from
>> apsw?
>
> It has been available for several years:
>
> http://apidoc.apsw.googlecode.com/hg/connection.html#apsw.Connection.sqlite3pointer
>
>> Also, am I going to run into linking problems if both apsw and my
>> extension link to libsqlite.so, and then Python tries to load both of
>> them?
>
> The operating system will only load the same shared library once. Just do
> not use the amalgamation with APSW otherwise it will have a private
> internal copy of SQLite.

Great, thanks for the pointer! So I can just link my extension to
libsqlite without any special precautions, as long as apsw is also
dynamically linked?


> What I'd recommend you do is add whatever new methods you want to the APSW
> source code until you have something working well, and then figure out how
> to carve that out separately.
>
> However I'm will to bet that a SQLite database that has been vacuumed and
> indices removed will be as good or only slightly worse than anything you
> come up with :-)

You already lost the bet, I'm already doing substantially better using
pickle:

DB size (after dropping all droppable indices and VACUUM): 232 MB
After LZMA compression: 51 MB
Size of pickle dump: 184 MB
After LZMA compression: 42 MB
Size of apsw dump: 409 MB
After compression: 41 MB


I have attached the output of sqlite3_analyzer for the DB if you're
interested.

I think I can do even better than that if I make use of the knowledge
that I have about the data.


Best,

-Nikolaus

analyzer.txt

Roger Binns

unread,
Oct 8, 2011, 2:34:33 PM10/8/11
to python...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 08/10/11 08:29, Nikolaus Rath wrote:
> It seems that the unremovable indices for the primary keys take up a
> significant amount of space.

Note that there isn't an actual index for the primary key. The primary
key is used to find rows and hence is used as the key for btree nodes but
isn't stored in an index anywhere. Well, btree keys are a form of index
but you get the jist - they are different than regular indices.

Which brings up another thought - have you tried using a larger page size
for the compressible database such as 64kb?

> Great, thanks for the pointer! So I can just link my extension to
> libsqlite without any special precautions, as long as apsw is also
> dynamically linked?

Yes.

> You already lost the bet,

Even with a larger page size?

> DB size (after dropping all droppable indices and VACUUM): 232 MB After
> LZMA compression: 51 MB Size of pickle dump: 184 MB After LZMA
> compression: 42 MB Size of apsw dump: 409 MB After compression: 41 MB

That implies that using a LZMA dump should work! However with knowledge
of the data you should do better, although LZMA is really good. The
pickle format is fairly verbose which is why it isn't much better than
ASCII text.

> I have attached the output of sqlite3_analyzer for the DB if you're
> interested.

I keep thinking of porting that to Python or making some sort of low level
"database explorer".

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iEYEARECAAYFAk6Ql7kACgkQmOOfHg372QRhkgCePKFeDmNugjBFFuuxUVeBinV6
IXUAn15yLHkinmWygVwh46T2zXICm9Hn
=Jl3X
-----END PGP SIGNATURE-----

Nikolaus Rath

unread,
Oct 8, 2011, 3:29:28 PM10/8/11
to python...@googlegroups.com
Roger Binns <rogerb-JFdGOZ7...@public.gmane.org> writes:
> On 08/10/11 08:29, Nikolaus Rath wrote:
>> It seems that the unremovable indices for the primary keys take up a
>> significant amount of space.
>
> Note that there isn't an actual index for the primary key. The primary
> key is used to find rows and hence is used as the key for btree nodes but
> isn't stored in an index anywhere. Well, btree keys are a form of index
> but you get the jist - they are different than regular indices.
>
> Which brings up another thought - have you tried using a larger page size
> for the compressible database such as 64kb?

Yep, all the way up to 128kb. No significant changes. I can report back
with the results of my delta compression once I have them if you're
interested.

Roger Binns

unread,
Oct 8, 2011, 3:47:21 PM10/8/11
to python...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 08/10/11 12:29, Nikolaus Rath wrote:
> I can report back with the results of my delta compression once I have
> them if you're interested.

I am. Also worth trying is plain old boring CSV which should beat pickle
and SQL text. I suspect sorting the rows before they go through LZMA will
improve compression a fair bit irrespective of the intermediary format
being SQL, pickle etc.

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iEYEARECAAYFAk6QqMkACgkQmOOfHg372QRV3QCfXuUu4GFN1eQq0h8P1+jpj2aN
SXIAoLdt19XLHHFngN5NzXTn1Q82mtqP
=69Z7
-----END PGP SIGNATURE-----

Nikolaus Rath

unread,
Oct 18, 2011, 8:44:32 PM10/18/11
to python...@googlegroups.com
Roger Binns <rogerb-JFdGOZ7...@public.gmane.org> writes:
>> Great, thanks for the pointer! So I can just link my extension to
>> libsqlite without any special precautions, as long as apsw is also
>> dynamically linked?
>
> Yes.

Is there a way to determine (at runtime of apsw and compile time of my
extension) if apsw has been linked dynamically or statically?

I think when comparing apsw.sqlitelibversion() with sqlite3_libversion()
(called by my C extension), I could get a match if apsw has been
statically linked against exactly the same sqlite version that I'm
linking to dynamically. Or would that be no problem?

Roger Binns

unread,
Oct 18, 2011, 11:00:51 PM10/18/11
to python...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 18/10/11 17:44, Nikolaus Rath wrote:
> Is there a way to determine (at runtime of apsw and compile time of my
> extension) if apsw has been linked dynamically or statically?

You can tell if the amalgamation was compiled in:

http://apidoc.apsw.googlecode.com/hg/apsw.html#apsw.using_amalgamation

However if the amalgamation was not used then APSW doesn't know internally
if it was dynamically or static linked. The linker is passed -lsqlite3
and it resolves accordingly.

At build time on Unix you can run ldd against the APSW shared library
which will show dynamic links which should answer your question. Under
Windows you can use a similar tool to look for imports.

> I think when comparing apsw.sqlitelibversion() with
> sqlite3_libversion() (called by my C extension), I could get a match if
> apsw has been statically linked against exactly the same sqlite version
> that I'm linking to dynamically. Or would that be no problem?

A better version check is the sourceid:

http://apidoc.apsw.googlecode.com/hg/apsw.html#apsw.sqlite3_sourceid

This just tells you they are using the same source code, but not
necessarily the same library binary, but is more accurate than the library
version.

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iEYEARECAAYFAk6ePWMACgkQmOOfHg372QQ7KwCcCcBc4/N6vksn9D6ws8kiH2wD
QMEAn2shJ0RElWoAPwu5aUacfh4nRzTy
=0sQH
-----END PGP SIGNATURE-----

Nikolaus Rath

unread,
Oct 19, 2011, 9:27:42 AM10/19/11
to python...@googlegroups.com
Roger Binns <rogerb-JFdGOZ7...@public.gmane.org> writes:
>> I think when comparing apsw.sqlitelibversion() with
>> sqlite3_libversion() (called by my C extension), I could get a match if
>> apsw has been statically linked against exactly the same sqlite version
>> that I'm linking to dynamically. Or would that be no problem?
>
> A better version check is the sourceid:
>
> http://apidoc.apsw.googlecode.com/hg/apsw.html#apsw.sqlite3_sourceid
>
> This just tells you they are using the same source code, but not
> necessarily the same library binary, but is more accurate than the library
> version.

So would this check be enough, i.e. is it safe to run a statically
linked apsw with a dynamically linked extension, as long as both are
linked against the same source id?

Roger Binns

unread,
Oct 19, 2011, 12:21:45 PM10/19/11
to python...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 19/10/11 06:27, Nikolaus Rath wrote:
> So would this check be enough, i.e. is it safe to run a statically
> linked apsw with a dynamically linked extension, as long as both are
> linked against the same source id?

Do you really mean statically linked or do you mean using the
amalgamation? Also just because the code was the same doesn't mean that
the same compilation options were used.

Alternatively ensure everyone is using the same shared library and refuse
to work otherwise (use ldd to check).

The only safe way is to write your code as a SQLite extension. When the
extension is loaded it is passed a structure with pointers to the various
SQLite functions and hence will end up using the exact same SQLite no
matter what else is going on. I don't think you can be both a SQLite and
Python extension simultaneously, but it may work providing the OS doesn't
reload the DLL.

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iEYEARECAAYFAk6e+RUACgkQmOOfHg372QSQ/wCfVqxNqKSiLKHgqFKFCaDI9wQc
PWUAoMExgQvrEc6QpZyR0Gro+2c72coy
=GNMd
-----END PGP SIGNATURE-----

Nikolaus Rath

unread,
Oct 19, 2011, 4:15:59 PM10/19/11
to python...@googlegroups.com
Roger Binns <rogerb-JFdGOZ7...@public.gmane.org> writes:
> On 08/10/11 12:29, Nikolaus Rath wrote:
>> I can report back with the results of my delta compression once I have
>> them if you're interested.
>
> I am. Also worth trying is plain old boring CSV which should beat pickle
> and SQL text. I suspect sorting the rows before they go through LZMA will
> improve compression a fair bit irrespective of the intermediary format
> being SQL, pickle etc.

Here are some preliminary results for just one table:

raw : 0.85 sek, 12.84 MB
raw + zlib : 10.12 sek, 3.46 MB
raw + BZIP2 : 2.45 sek, 3.53 MB
raw + LZMA : 13.85 sek, 2.58 MB
apsw : 3.51 sek, 8.92 MB
apsw + zlib : 11.62 sek, 4.17 MB
apsw + BZIP2 : 5.41 sek, 4.22 MB
apsw + LZMA : 13.72 sek, 3.17 MB
pickle : 10.01 sek, 28.22 MB
pickle + zlib : 13.59 sek, 4.27 MB
pickle + BZIP2 : 16.95 sek, 3.61 MB
pickle + LZMA : 52.69 sek, 2.82 MB

'raw' means a direct binary dump (no type information) without any delta
compression yet.

It seems that:

- zlib compression sucks
- apsw dump is actually a lot faster than pickle dump, I guess my last
test was with an old apsw version
- raw dump is still much faster than apsw dump
- pickle dump compresses slightly better than apsw dump, but
compression takes much more time
- raw dump compresses most, but compression of the raw data
takes more time than compression of the apsw dump

Roger Binns

unread,
Oct 19, 2011, 5:30:38 PM10/19/11
to python...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 19/10/11 13:15, Nikolaus Rath wrote:
> Here are some preliminary results for just one table:

It doesn't look like any of them are CSV which I suspect would work best.
What exactly is a "raw" dump?

> - apsw dump is actually a lot faster than pickle dump, I guess my last
> test was with an old apsw version

The APSW dump speed significantly improved in 3.7.0 when I moved the code
that formats values (strings and blobs were slow) from Python into C:

http://apidoc.apsw.googlecode.com/hg/apsw.html#apsw.format_sql_value

Unlike SQLite's shell it makes sure that embedded nulls in strings are
correctly output.

> - pickle dump compresses slightly better than apsw dump, but
> compression takes much more time

Pickle's format has several prefix bytes before each value to set their
type (eg an int takes 3 type bytes). Larger ints are represented as ASCII
rather than binary like smaller ones. Consequently the compressor is
going to see lots of repeated boiler plate and large numbers like inodes
are represented exactly the same as in SQL and CSV.

The ascii SQL encoding from APSW is more efficient (eg the integer 1 takes
one byte) but *every* row has an "INSERT INTO TABLE xxxx VALUES (?,?,?)"
which is a waste. The rest of the row is substantially similar to CSV
which is why I recommend you try it.

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iEYEARECAAYFAk6fQX4ACgkQmOOfHg372QTFRgCfeA5ru6OLh7qSDmw48zT54NsC
U/UAn1OyeG2wB3VUQDrKcjb/Nq6tsBWj
=Pj50
-----END PGP SIGNATURE-----

Nikolaus Rath

unread,
Oct 19, 2011, 7:25:55 PM10/19/11
to python...@googlegroups.com
Roger Binns <rogerb-JFdGOZ7...@public.gmane.org> writes:
> On 19/10/11 13:15, Nikolaus Rath wrote:
>> Here are some preliminary results for just one table:
>
> It doesn't look like any of them are CSV which I suspect would work best.
> What exactly is a "raw" dump?

Every integer as 64bit binary, every blob with its (fixed length) binary
representation.

> The ascii SQL encoding from APSW is more efficient (eg the integer 1 takes
> one byte) but *every* row has an "INSERT INTO TABLE xxxx VALUES (?,?,?)"
> which is a waste. The rest of the row is substantially similar to CSV
> which is why I recommend you try it.

Will do.

Nikolaus Rath

unread,
Oct 19, 2011, 7:57:43 PM10/19/11
to python...@googlegroups.com
Nikolaus Rath <Nikolaus-B...@public.gmane.org> writes:

> Roger Binns <rogerb-JFdGOZ7s+BwWQnjQ7V0...@public.gmane.org> writes:
>> On 08/10/11 12:29, Nikolaus Rath wrote:
>>> I can report back with the results of my delta compression once I have
>>> them if you're interested.
>>
>> I am. Also worth trying is plain old boring CSV which should beat pickle
>> and SQL text. I suspect sorting the rows before they go through LZMA will
>> improve compression a fair bit irrespective of the intermediary format
>> being SQL, pickle etc.
>
> Here are some preliminary results for just one table:
>
> raw : 0.85 sek, 12.84 MB
> raw + zlib : 10.12 sek, 3.46 MB
> raw + BZIP2 : 2.45 sek, 3.53 MB
> raw + LZMA : 13.85 sek, 2.58 MB
> apsw : 3.51 sek, 8.92 MB
> apsw + zlib : 11.62 sek, 4.17 MB
> apsw + BZIP2 : 5.41 sek, 4.22 MB
> apsw + LZMA : 13.72 sek, 3.17 MB
> pickle : 10.01 sek, 28.22 MB
> pickle + zlib : 13.59 sek, 4.27 MB
> pickle + BZIP2 : 16.95 sek, 3.61 MB
> pickle + LZMA : 52.69 sek, 2.82 MB

As a matter of fact, the "pickle" and "apsw" labels are swapped. It's
the apsw ".dump" command that takes longer and compresses better. This
makes a lot more sense, the SQL dump with all the boilerplate statements
ought to have the biggest uncompressed size, and pickle should indeed be
faster since it doesn't have to worry about escaping special characters.


Here's the comparison with correct labels for the entire database. I
also added CSV output using the apsw shell with .mode csv and "SELECT
* from [table]":

pickle : 15.81 sek, 56.86 MB
pickle + zlib : 43.59 sek, 26.07 MB
pickle + BZIP2 : 23.41 sek, 24.27 MB
pickle + LZMA : 65.44 sek, 15.27 MB
apsw_csv : 77.87 sek, 71.80 MB
apsw_csv + zlib : 98.12 sek, 21.21 MB
apsw_csv + BZIP2: 86.98 sek, 18.54 MB
apsw_csv + LZMA : 165.38 sek, 11.63 MB
apsw_sql : 33.44 sek, 128.26 MB
apsw_sql + zlib : 50.21 sek, 24.99 MB
apsw_sql + BZIP2: 51.45 sek, 20.73 MB
apsw_sql + LZMA : 160.66 sek, 14.14 MB

Nikolaus Rath

unread,
Oct 21, 2011, 6:53:01 PM10/21/11
to python...@googlegroups.com
Roger Binns <rogerb-JFdGOZ7...@public.gmane.org> writes:
> On 19/10/11 06:27, Nikolaus Rath wrote:
>> So would this check be enough, i.e. is it safe to run a statically
>> linked apsw with a dynamically linked extension, as long as both are
>> linked against the same source id?
>
> Do you really mean statically linked or do you mean using the
> amalgamation? Also just because the code was the same doesn't mean that
> the same compilation options were used.

Duh, good point. I forgot about that. So I really need to make sure that
it's dynamically linked, neither static nor amalgamation are acceptable.

> Alternatively ensure everyone is using the same shared library and refuse
> to work otherwise (use ldd to check).

Yeah, I'll do that. It's just that parsing ldd output is not exactly an
elegant solution.

> The only safe way is to write your code as a SQLite extension.

I want to use Cython, so that's not an option.

Roger Binns

unread,
Oct 21, 2011, 7:38:33 PM10/21/11
to python...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 21/10/11 15:53, Nikolaus Rath wrote:


> Roger Binns <rogerb-JFdGOZ7...@public.gmane.org> writes:
>> The only safe way is to write your code as a SQLite extension.
>
> I want to use Cython, so that's not an option.

It might be. I believe that if the you load a shared library a second
time the existing loaded in memory copy is used. Consequently you can
have both SQLite and Cython entry points. Use the SQLite entry point to
set function pointers to the various functions (sqlite3ext.h has macros to
do this for you automagically).

Roger

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iEYEARECAAYFAk6iAnkACgkQmOOfHg372QQgrQCgj9oRfIh58+6PZr1B6hDvBJQV
drgAn1qDe/yo+JCgKDnlDXl1UZncm5j8
=SdCO
-----END PGP SIGNATURE-----

Nikolaus Rath

unread,
Nov 22, 2011, 10:36:11 PM11/22/11
to python...@googlegroups.com
Nikolaus Rath <Nikolaus-B...@public.gmane.org> writes:

> Roger Binns <rogerb-JFdGOZ7s+BwWQnjQ7V0...@public.gmane.org> writes:
>> On 08/10/11 12:29, Nikolaus Rath wrote:
>>> I can report back with the results of my delta compression once I have
>>> them if you're interested.
>>
>> I am. Also worth trying is plain old boring CSV which should beat pickle
>> and SQL text. I suspect sorting the rows before they go through LZMA will
>> improve compression a fair bit irrespective of the intermediary format
>> being SQL, pickle etc.
>
> Here are some preliminary results for just one table:

And here are the final results for two example databases.

DB 1, 317 MB:
delta_f : 21.26 sek, 112.73 MB
delta_f + zlib : 136.65 sek, 65.21 MB
delta_f + BZIP2 : 40.16 sek, 58.03 MB
delta_f + LZMA : 180.27 sek, 53.70 MB
pickle : 54.24 sek, 202.73 MB
pickle + zlib : 208.49 sek, 80.30 MB
pickle + BZIP2 : 85.90 sek, 69.74 MB
pickle + LZMA : 288.56 sek, 46.71 MB
apsw_csv : 326.86 sek, 256.04 MB
apsw_csv + zlib : 404.47 sek, 68.43 MB
apsw_csv + BZIP2: 365.05 sek, 58.26 MB
apsw_csv + LZMA : 694.18 sek, 38.24 MB
apsw_sql : 134.66 sek, 452.74 MB
apsw_sql + zlib : 198.06 sek, 79.08 MB
apsw_sql + BZIP2: 231.54 sek, 63.55 MB
apsw_sql + LZMA : 670.06 sek, 46.08 MB

DB 2, 79 MB
delta_f : 4.10 sek, 34.39 MB
delta_f + zlib : 26.23 sek, 22.98 MB
delta_f + BZIP2 : 10.42 sek, 21.65 MB
delta_f + LZMA : 44.30 sek, 19.90 MB
pickle : 14.33 sek, 60.24 MB
pickle + zlib : 47.89 sek, 27.49 MB
pickle + BZIP2 : 24.48 sek, 25.42 MB
pickle + LZMA : 76.15 sek, 16.06 MB
apsw_csv : 92.79 sek, 76.38 MB
apsw_csv + zlib : 117.07 sek, 22.39 MB
apsw_csv + BZIP2: 104.34 sek, 19.51 MB
apsw_csv + LZMA : 203.24 sek, 12.32 MB
apsw_sql : 39.65 sek, 135.88 MB
apsw_sql + zlib : 59.55 sek, 26.21 MB
apsw_sql + BZIP2: 63.66 sek, 21.68 MB
apsw_sql + LZMA : 194.53 sek, 14.73 MB

Overall, delta compression + bzip2 compression seems to be the best
compromise between speed and output size by a good margin. A reason for
this might be that the reduced size before compression allows the
compression algorithm to run much faster.

For delta compressed dumps, LZMA isn't able to do much better than
bzip2. For all other dumps, however, LZMA compresses better than bzip2
(but also takes more time).

Best compression can be reached by dumping in CSV or SQL format and then
LZMA compressing. It beats me why that might be the case. After all, CSV
really isn't much more than a base64 encoded delta dump.

Roger Binns

unread,
Nov 22, 2011, 11:08:07 PM11/22/11
to python...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 22/11/11 19:36, Nikolaus Rath wrote:
> Overall, delta compression + bzip2 compression seems to be the best
> compromise between speed and output size by a good margin.

- From my reading the CSV is the smallest compressed file size. I'd
strongly recommend you follow this route since you'll then have standard
format files.

Note that the APSW Shell's CSV implementation is slow - implement your own
export. The reason it is slow is because it supports any encoding, the
Python csv module has issues with encodings and my code has to jump
through many hoops to make that always work correctly. Behind the scenes
it works on a line by line basis which adds another layer of slowness.

You however already know exactly what encoding you'll use, what the fields
are (mostly integers IIRC) and should be able to produce a CSV very quickly.

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)

iEYEARECAAYFAk7McaYACgkQmOOfHg372QSM+QCg03RApp1rE9Q+bb1lqczBXonb
8H8Anjv+TpZkMGNzkOQZisxnE22JyGpX
=Bqmn
-----END PGP SIGNATURE-----

Nikolaus Rath

unread,
Nov 23, 2011, 10:29:55 AM11/23/11
to python...@googlegroups.com
Roger Binns <rogerb-JFdGOZ7...@public.gmane.org> writes:
> On 22/11/11 19:36, Nikolaus Rath wrote:
>> Overall, delta compression + bzip2 compression seems to be the best
>> compromise between speed and output size by a good margin.
>
> From my reading the CSV is the smallest compressed file size.

True, but only when combined with LZMA compression.

> Note that the APSW Shell's CSV implementation is slow - implement your own
> export. The reason it is slow is because it supports any encoding, the
> Python csv module has issues with encodings and my code has to jump
> through many hoops to make that always work correctly. Behind the scenes
> it works on a line by line basis which adds another layer of slowness.

But this isn't actually the bottleneck. Compressing CSV data takes 368
seconds with LZMA and 39 seconds with bzip2. So even if I could make CSV
export as fast as the delta dump (which compresses in 19 seconds using
bzip2, and LZMA doesn't do any better), it would still take twice as
long to reach the same final size, and 19 times as long to actually get
an improvement.

> I'd strongly recommend you follow this route since you'll then have
> standard format files.

I don't think my files will ever be read by anything but my application,
so I don't think the decrease in speed is worth it.

Reply all
Reply to author
Forward
0 new messages