
tcl + tdbc + sqlite3 -> slow?


Andreas Leitgeb

Dec 20, 2015, 12:49:27 PM
I once wrote a parser for IBM's "portable heapdump" (phd) format. If
I feed it a really large .phd file (>1GB) and let it write its output
into CSV files, it finishes in less than an hour on my machine.

Then I thought it might be faster (or at least more convenient for the
next steps of analysis) to write the data into a database instead. I read
a tutorial on the web <www.magicsplat.com/articles/tdbc.html> that showed
me how to open a database, create tables, and use prepared statements
for repeating tasks (such as inserting rows into tables).

I picked sqlite3 because I don't need concurrent access or user-management.

But when I ran my parser to write into that DB, it was so slow that
after an hour it had only produced a few kilobytes of database and
journal files, and the script hadn't reached the first (of 206 for that
input file) progress marks. After I aborted it, the db actually did
contain some consistent data, so it did successfully write rows to the
tables, just way too slow.

I suppose I've done something very foolish, or maybe "insert" is just
sqlite3's weak spot (performance-wise), or maybe there are settings I'd
have to set in the newly created db ("pragma ...") before inserts become
fast, or maybe sqlite3's tdbc driver is just that one "proof-of-concept"
pure-Tcl driver ...

Any hints to an sqlite-newbie?

PS: I've dealt with Oracle and Informix databases from C++ and Java in
the past, so I'm not a general newbie with db stuff, just with tdbc and
sqlite.

Christian Gollwitzer

Dec 20, 2015, 1:31:17 PM
On 20.12.15 18:46, Andreas Leitgeb wrote:
> I once wrote a parser for IBM's "portable heapdump" (phd) format. If
> I feed it a really large .phd file (>1GB) and let it write its output
> into CSV files, it finishes in less than an hour on my machine.
>
> Then I thought it might be faster (or at least more convenient for the
> next steps of analysis) to write the data into a database instead. I read
> a tutorial on the web <www.magicsplat.com/articles/tdbc.html> that showed
> me how to open a database, create tables, and use prepared statements
> for repeating tasks (such as inserting rows into tables).
>
> I picked sqlite3 because I don't need concurrent access or user-management.
>
> But when I ran my parser to write into that DB, it was so slow that
> after an hour it had only produced a few kilobytes of database and
> journal files, and the script hadn't reached the first (of 206 for that
> input file) progress marks. After I aborted it, the db actually did
> contain some consistent data, so it did successfully write rows to the
> tables, just way too slow.

A few things come to my mind, though I haven't benchmarked sqlite
myself. Running a sequence of queries to insert the data is likely not
the fastest way. If autocommit is on, then it will be much faster if you
switch it off, insert a large chunk of data, and commit manually.
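
In tdbc terms, that advice might look roughly like this (a hedged
sketch; the connection, table, and column names are invented for
illustration, and $records is assumed to be a flat Tcl list):

```tcl
package require tdbc::sqlite3

# Open the database and prepare the insert once, outside the loop.
tdbc::sqlite3::connection create db heap.db
set stmt [db prepare {insert into objects(addr, size) values(:addr, :size)}]

# One transaction around the whole batch: a single commit at the end
# instead of one implicit commit (and disk sync) per inserted row.
db transaction {
    foreach {addr size} $records {
        $stmt allrows [list addr $addr size $size]
    }
}

$stmt close
db close
```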

Have you tried to import the csv file using the sqlite shell? It has a
.import command to read mass data from a CSV file. I'd expect that this
is the fastest way to import data, albeit not from Tcl - I don't know if
there is a Tcl side interface.
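
As a sketch (file and table names are made up), the shell's .mode csv
plus .import pair does such a bulk load; with .mode csv set first,
.import parses each input line as one comma-separated row:

```shell
# Create a small sample CSV of hypothetical addr,size pairs.
printf '0x1,16\n0x2,32\n0x3,8\n' > dump.csv

# Bulk-load it via the sqlite3 shell's dot-commands; creating the
# table first ensures the CSV's first line is treated as data.
sqlite3 heap.db <<'EOF'
CREATE TABLE IF NOT EXISTS objects(addr TEXT, size INTEGER);
.mode csv
.import dump.csv objects
EOF

sqlite3 heap.db 'select count(*) from objects;'
```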

In general, sqlite should be not much slower than a CSV file, and a lot
faster when you subsequently run queries.

Christian

jda...@gmail.com

Dec 20, 2015, 3:02:57 PM

> I suppose I've done something very foolish, or maybe "insert" is just
> sqlite3's weak spot (performance-wise), or maybe there are settings I'd
> have to set in the newly created db ("pragma ...") before inserts become
> fast, or maybe sqlite3's tdbc driver is just that one "proof-of-concept"
> pure-Tcl driver ...
>
> Any hints to an sqlite-newbie?
>
> PS: I've dealt with Oracle and Informix databases from C++ and Java in
> the past, so I'm not a general newbie with db stuff, just with tdbc and
> sqlite.

I suspect you are not using "transactions", and are committing on each insert instead of after all the inserts.

package require sqlite3
file delete db.sdb
sqlite3 db db.sdb
db eval "create table if not exists kv(key, value)"

proc notrans {n} {
    for {set i 0} {$i < $n} {incr i} {
        db eval "replace into kv(key, value) values($i, $i)"
    }
}


proc wtrans {n} {
    db transaction {
        notrans $n
    }
}


puts "no transactions [time {notrans 1000}]"
puts "w/ transactions [time {wtrans 1000}]"

With n=1000 I get 4000000 and 8000 usec, respectively.
The time appears to be dominated by file operations.

I'm using the Tcl sqlite package, not tdbc, but there is probably a way to do the same thing through that interface.

Dave B

Andreas Leitgeb

Dec 20, 2015, 3:10:51 PM
Christian Gollwitzer <auri...@gmx.de> wrote:
> On 20.12.15 18:46, Andreas Leitgeb wrote:
>> I picked sqlite3 because I don't need concurrent access or user-management.
>>
>> But when I ran my parser to write into that DB, it was so slow that
>> after an hour it had only produced a few kilobytes of database and
>> journal files, and the script hadn't reached the first (of 206 for that
>> input file) progress marks. ...
>
> A few things come to my mind, though I haven't benchmarked sqlite
> myself. Running a sequence of queries to insert the data is likely not
> the fastest way. If autocommit is on, then it will be much faster if you
> switch it off, insert a large chunk of data, and commit manually.

Mentioning "autocommit" was definitely a great help.

Right after opening the database I now use "$DB begintransaction",
and about every million records processed (that's the previously
mentioned progress marks) I commit and begintransaction again, and
it seems to run much faster now.
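
Spelled out, that pattern might look like this (a sketch; apart from
the begintransaction/commit calls on the tdbc connection $DB,
everything here, including the input channel $in, is illustrative):

```tcl
# Commit every millionth record so no single transaction grows
# without bound, while still batching the vast majority of inserts.
$DB begintransaction
set count 0
while {[gets $in line] >= 0} {
    # ... parse $line and run the prepared insert here ...
    if {[incr count] % 1000000 == 0} {
        $DB commit            ;# flush the finished batch to disk
        $DB begintransaction  ;# open the next batch
    }
}
$DB commit                    ;# don't forget the final partial batch
```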

I didn't know that sqlite even implemented any kind of transaction
concept, but apparently it does :-)

Thanks a lot!

> Have you tried to import the csv file using the sqlite shell?
> It has a .import command to read mass data from a CSV file.

That would be a workaround (Plan B). I've still kept the produced
CSV files (almost 10 GB altogether), so if anything goes wrong
with the bigger transactions, then I can still fall back to that.

> I'd expect that this is the fastest way to import data, albeit
> not from Tcl - I don't know if there is a Tcl side interface.

If nothing else, then exec sqlite3 $dbfile << ".import ..."
would be an option :-)
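
Tcl's [exec] does support << to feed a value to the child's stdin, so
that fallback could be sketched as follows (file and table names are
invented):

```tcl
# Drive the sqlite3 shell's bulk CSV import from Tcl: the << word
# becomes the shell's standard input, mixing a dot-command with SQL.
set dbfile  heap.db
set csvfile dump.csv
exec sqlite3 $dbfile << ".mode csv\n.import $csvfile objects\n"
```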

> In general, sqlite should be not much slower than a CSV file, and a lot
> faster when you subsequently run queries.

Thanks, again

Andreas Leitgeb

Dec 20, 2015, 7:13:48 PM
Andreas Leitgeb <a...@logic.at> wrote:
> Christian Gollwitzer <auri...@gmx.de> wrote:
>> Running a sequence of queries to insert the data is likely not
>> the fastest way. If autocommit is on, then it will be much faster ...
> Mentioning "autocommit" was definitely a great help.
> Right after opening the database I now use "$DB begintransaction",
> and about every million records processed (that's the previously
> mentioned progress marks) I commit and begintransaction again, and
> it seems to run much faster now.

Well, it's faster now, but still not working for me. :-(

After reaching the first progress mark (1/206th of the total), it has
created a db file that is now a couple of megabytes in size, but the
process (tclsh) already has a virtual size (according to "top") of a
couple of gigabytes, and the machine is grinding to unresponsiveness
due to paging (swapping)...

Is there a possibility of a massive memory leak in the tdbc driver
for sqlite3, or is this just sqlite's normal overhead (like for
indices and stuff) ?

Andreas Leitgeb

Dec 20, 2015, 7:19:16 PM
jda...@gmail.com <jda...@gmail.com> wrote:
>> Any hints to an sqlite-newbie?
>
> I suspect you are not using "transactions", and are committing
> on each insert instead of after all the inserts.

Thanks, you suspected correctly, but Christian was a bit faster
to post this suggestion.

Please see my followups in his subthread.

Rich

Dec 20, 2015, 8:44:40 PM
No, this is not normal overhead for Sqlite3. A couple megabyte DB file
with gigabytes of core implies either a coding error or a memory leak
somewhere.

Andreas Leitgeb

Dec 20, 2015, 9:38:23 PM
Rich <ri...@example.invalid> wrote:
> Andreas Leitgeb <a...@logic.at> wrote:
>> Is there a possibility of a massive memory leak in the tdbc driver
>> for sqlite3, or is this just sqlite's normal overhead (like for
>> indices and stuff) ?
> No, this is not normal overhead for Sqlite3.

You're right, it is indeed not normal, unless some newbie
tdbc'er such as me forgets to deal with the results from
executing statements...

I apologize for the noise and for suspecting leaks everywhere
else than in my own code. :-}

Christian Gollwitzer

Dec 21, 2015, 3:04:53 AM
On 21.12.15 03:35, Andreas Leitgeb wrote:
> You're right, it is indeed not normal, unless some newbie
> tdbc'er such as me forgets to deal with the results from
> executing statements...

Ah, you had to free the result objects manually, which contain nothing
for an insert statement...

Honestly, I do not know what you gain from using tdbc. Sqlite3 has an
excellent Tcl interface, which does not return anything from an insert
statement. sqlite3 also automatically creates prepared statements
for repeated queries, if used properly (i.e. placeholders for the
values, not Tcl string interpolation), which basically means only
ever using curly braces for SQL queries. tdbc is useful if you want to
connect to different database engines using a unified syntax. Caveat
emptor: I've never used it because sqlite3 has fulfilled all my needs so far.
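
What Christian describes, i.e. braced SQL with named placeholders so
the sqlite3 binding can cache one compiled statement, might be
sketched like this (table and variable names are invented):

```tcl
package require sqlite3
sqlite3 db :memory:
db eval {create table kv(key, value)}

foreach {key value} {a 1 b 2 c 3} {
    # The braced, constant SQL string is compiled once and cached;
    # :key and :value are bound from the local Tcl variables on each
    # iteration, with no string interpolation (and no injection risk).
    db eval {insert into kv(key, value) values(:key, :value)}
}
db close
```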

On a side note, if in Tcl we had the possibility to create a Tcl_Obj
from Tcl, and a list iteration interface, we could fake a value-based
interface and let the interpreter perform the garbage collection of the
results.

Christian

Alexandre Ferrieux

Dec 21, 2015, 5:04:12 AM
On Monday, December 21, 2015 at 9:04:53 AM UTC+1, Christian Gollwitzer wrote:
>
> On a side note, if in Tcl we had the possibility to create a Tcl_Obj
> from Tcl, and a list iteration interface, we could fake a value-based
> interface and let the interpreter perform the garbage collection of the
> results.

Sorry if I'm being dense, but I fail to grok this sentence. At the same time, it sounds interesting. Care to rephrase/elaborate ?

-Alex

Andreas Leitgeb

Dec 21, 2015, 6:33:08 AM
Christian Gollwitzer <auri...@gmx.de> wrote:
> On 21.12.15 03:35, Andreas Leitgeb wrote:
>> You're right, it is indeed not normal, unless some newbie
>> tdbc'er such as me forgets to deal with the results from
>> executing statements...
> Ah, you had to free the result objects manually, which contain nothing
> for an insert statement...

Yes. I merely changed "execute" to "allrows" and was done.
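
For the record, the difference might be sketched as follows ($stmt and
the bound names are illustrative): "execute" hands back a result-set
object that must be closed even for an INSERT, while "allrows" drains
and disposes of it internally.

```tcl
# Leaky variant: every insert allocates a tdbc result-set object ...
set rs [$stmt execute [list key $k value $v]]
$rs close    ;# ... which leaks unless it is closed explicitly

# Convenient variant: allrows returns a (here empty) list of rows and
# cleans up the underlying result set by itself.
$stmt allrows [list key $k value $v]
```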

> tdbc is useful if you want to connect to different database engines
> using a unified syntax.

That was my point. I wasn't sure whether sqlite3 was suitable for my
current need, so I might have switched to mysql or postgres if it had
turned out otherwise... I might still do that later to see what
difference in performance it would have made.

Also, I took the chance to get a productive glimpse into tdbc. :-)

In the end, the conversion of the original .phd file (>1GB) into an
sqlite3 db took about 400 minutes, thus 10 times longer than the CSV
export. It ran overnight, and I've already successfully tried some
queries on the result.

> On a side note, [... Tcl_Obj ...]
That goes over my head.

Christian Gollwitzer

Dec 21, 2015, 11:07:44 AM
On 21.12.15 11:04, Alexandre Ferrieux wrote:
Right now I'm a bit short on time due to Christmas stress etc. I'll
write up some more about the idea, maybe in the coming days. A simple
example could be a command to provide ordered sets, let's call it "oset".
This can be implemented as a dict with all values set to 1 or {} or
similar. From C, it is easy to create a new Tcl_ObjType which looks like
a list, plus a command to operate on it, implemented as a hashtable. If,
in addition, there were list iterators, as proposed by dkf earlier, you
could "foreach" over such an oset without destroying the internal hash map.
The same approach could be used to implement the database results, which
ARE pure values but use objects only because they might be large.
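
At the script level, the dict half of that idea can already be
sketched today (the command names are invented; the Tcl_ObjType part
would of course need C):

```tcl
# An "ordered set" faked as a dict whose values are all empty: Tcl
# dicts preserve insertion order, and membership tests are hash lookups.
proc oset-add {varName elem} {
    upvar 1 $varName s
    dict set s $elem {}
}
proc oset-contains {s elem} {
    dict exists $s $elem
}

set s {}
foreach e {b a c a b} {oset-add s $e}
puts [dict keys $s]    ;# -> b a c
```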

Christian

Alexandre Ferrieux

Dec 21, 2015, 5:27:16 PM
Yes, we've been dreaming about listlike polymorphism for some time, but stopped short because of the big activation energy of the necessary overhaul of dozens of clients of the "list interface" (which is actually a pervasive idiom taking for granted the list implementation). But given your recent achievements with VecTcl, I bet you have that energy :)

As for using value semantics for query results, of course you're right about the main motivation for not doing so in the first place (size). Specifically, in similar situations I've missed a way for the stringUpdateProc to bail out (like raising a Tcl error), meaning "sorry Dave...". The idea is that there are perfectly useful workloads where huge "values" could be manipulated throughout their lifecycle, in complete respect of the EIAS principle, without ever needing their stringproc to be called. If we do this light-handedly on giga-sized sets (for example), a single slip with [puts] ends up in Tcl_Panic("out of memory").

One possible trick would be to let the UpdateString spit out some ridiculous placeholder instead. "Ridiculous" meaning you can hardly miss it if ever it sees the light. In the context of list-iterable things, this can be made practical with list-invalid syntax like {handle}set0x12345678. But that does not generalize to non-lists. Dunno.

Lack of energy doesn't mean lack of interest. Please keep me posted on your thoughts about all this.

-Alex

Christian Gollwitzer

Jun 30, 2016, 10:55:51 AM
On 21.12.15 23:27, Alexandre Ferrieux wrote:
> On Monday, December 21, 2015 at 5:07:44 PM UTC+1, Christian Gollwitzer wrote:
>> On 21.12.15 11:04, Alexandre Ferrieux wrote:
>>> On Monday, December 21, 2015 at 9:04:53 AM UTC+1, Christian
>>> Gollwitzer wrote:
>>>>
>>>> On a side note, if in Tcl we had the possibility to create a
>>>> Tcl_Obj from Tcl, and a list iteration interface, we could fake a
>>>> value-based interface and let the interpreter perform the garbage
>>>> collection of the results.
>>>
>>> Sorry if I'm being dense, but I fail to grok this sentence. At the
>>> same time, it sounds interesting. Care to rephrase/elaborate ?
>>
> Lack of energy doesn't mean lack of interest. Please keep me posted on your thoughts about all this.

Voilà:

https://www.youtube.com/watch?v=7mCDd-h89gQ

The slides are here:

http://www.eurotcl.tcl3d.org/presentations/EuroTcl2016-Gollwitzer-EIAS.pdf

The last example shows a possible solution. Thanks for the inspiration
to the talk!

Christian

Harald Oehlmann

Jul 1, 2016, 9:34:04 AM
Dear Christian (I suppose it's you),

thank you for the great videos:

https://www.youtube.com/playlist?list=PLHNnTryxvDnf0YotQXyOu6fBWGLJALkd5

My last wish would be to have the TCT panel on video ;-)

Thank you, great,
Harald


Christian Gollwitzer

Jul 1, 2016, 10:19:32 AM
Hi Harald,

On 01.07.16 15:34, Harald Oehlmann wrote:
I'm still awaiting approval for some videos. Once the set is complete,
I want to post it to this newsgroup. But never mind :-)

> My last wish would be to have the TCT pannel on video ;-)

the video is there, but it has to be approved by all 5 TCT members.
Additionally, I had planned to send the link only to participants.
I'm waiting for approval and further instructions from the TCT. (A few
have indicated that they don't mind whatever happens to the recordings.)

Christian