Cython and Sqlite (or databases in general)


Aman Thakral

Nov 17, 2011, 4:55:28 PM
to cython...@googlegroups.com
Hi All,

I've got a model that I'm working on that I've optimized using Cython.  The results of profiling now show that the main bottleneck is the repeated queries that I'm making to an SQLite database.  The database is small (about 10 MB, so it can easily fit in memory), but I was wondering if there is a way to access SQLite from Cython to speed up the data retrieval step.  If all else fails, I could just store the entire database in a nested Python dict and write a routine to query the data that way.  I would, however, prefer to query it from the database and store the result of the query in a Python dict.
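
For concreteness, the nested-dict fallback might look something like this (a sketch; the table and column names are invented for illustration):

    # Cache the whole table up front, keyed station -> date -> values.
    cache = {}
    for station, date, temp, wind in cursor.execute("SELECT station, date, temp, wind FROM obs"):
        cache.setdefault(station, {})[date] = [temp, wind]
    # A lookup is then just a pair of dict indexing operations:
    # cache['YYZ']['2009-07-07']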

Has anyone had any experience using Cython with SQLite?  Would it be faster if I were using a different database (or type of storage, e.g. HDF5 via PyTables)?

Any insight would be greatly appreciated.

Thanks,
Aman

Sturla Molden

Nov 17, 2011, 6:41:46 PM
to cython...@googlegroups.com
Do you keep the database on disk or in memory? You can make a database
in memory by opening a database file with the name ":memory:". Or you
can use a ramdisk to keep it in shared memory, which will allow
multiple processes to share the database. Most likely your bottleneck
is the disk access, so get rid of that first.
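
A minimal sketch of the in-memory route with the stdlib sqlite3 module
('weather.db' stands in for the real file):

    import sqlite3

    disk = sqlite3.connect('weather.db')   # hypothetical on-disk database
    mem = sqlite3.connect(':memory:')
    # iterdump() yields the SQL statements needed to recreate the database
    mem.executescript('\n'.join(disk.iterdump()))
    disk.close()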

BSD DB can be faster than Sqlite if you don't need SQL (i.e. it's more
like a Python dict).
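
A sketch of that dict-like interface, using the Python 2 stdlib bsddb
module (the key/value layout here is invented for illustration):

    import bsddb

    db = bsddb.hashopen('weather.bdb', 'c')        # persistent, dict-like store
    db['station17|2009-07-07'] = '21.5,0.0,3.2'    # keys and values are strings
    print db['station17|2009-07-07']
    db.close()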

Both Sqlite and BSD DB have C APIs you can use from Cython, though I
hardly think the Python wrapper is the important bottleneck here.

HDF5/PyTables is typically used for something other than Sqlite: that
is, storing large array-based scientific data. That can be easier (and
faster) than e.g. keeping pickled NumPy arrays in BLOBs in Sqlite.

Sturla

Aman Thakral

Nov 17, 2011, 8:43:20 PM
to cython...@googlegroups.com
I tried using apsw to create an in-memory backup and query that instead, but it did not provide an improvement in speed.  I'll take a look at BSD DB.  I do like using a database, since I'm working with spatiotemporal data (meteorological data from numerous weather stations) and it's very easy to do an SQL query for a time range for a particular station.  I'm not familiar with BSD DB, so I'm not sure how well it's suited to the type of data that I have.  I guess I'll have to dive into the Cython docs and learn about using external C libraries with it.
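
For illustration, the kind of query meant here (table and column names invented):

    # a hedged sketch of the station/time-range lookup described above
    query = ("SELECT date, temp, wind, rain FROM observations "
             "WHERE station = ? AND date BETWEEN ? AND ?")
    records = cursor.execute(query, ('YYZ', '2009-01-01', '2009-12-31'))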

Thanks for the insight!

Aman

Dag Sverre Seljebotn

Nov 18, 2011, 3:15:35 AM
to cython...@googlegroups.com
PyTables has support for quickly selecting a subset of rows from a table (it uses database-style indexes, etc. -- this used to be in the Pro version, but now the whole thing is OSS). So if you do not need RDBMS features like joins across tables, then I'd at least benchmark it.
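
A sketch of what that looks like with the 2011-era camelCase PyTables API (renamed to snake_case in PyTables 3; the table layout here is invented):

    import tables

    # hypothetical layout: one row per observation
    class Observation(tables.IsDescription):
        station = tables.Int32Col()
        time    = tables.Float64Col()   # e.g. an ordinal date
        temp    = tables.Float64Col()

    h5 = tables.openFile('weather.h5', 'w')
    tbl = h5.createTable('/', 'obs', Observation)
    row = tbl.row
    row['station'], row['time'], row['temp'] = 17, 733042.0, 21.5
    row.append()
    tbl.flush()
    tbl.cols.station.createIndex()      # the indexing that used to be Pro-only

    # in-kernel query: only matching rows are materialised in Python
    hits = tbl.readWhere('(station == 17) & (time >= 733000) & (time < 733365)')
    h5.close()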

DS
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Sturla Molden

Nov 18, 2011, 11:06:30 AM
to cython...@googlegroups.com
On 18.11.2011 02:43, Aman Thakral wrote:

> I guess I'll have to dive into the Cython docs and learn
> about using external C libraries with it.

Why? Do you have any reason to think the bottleneck is in the Python
wrappers?

If the database is the problem, calling it from C will not help. Sqlite
will still be sqlite, neither faster nor slower.

Sturla


Aman Thakral

Nov 18, 2011, 11:40:22 AM
to cython...@googlegroups.com
Well, I think it may be the fact that the loop for storing the data is in Python.  Here is an example:

    cursor = connection.cursor()
    records = cursor.execute(query)
    data = {}
    for record in records:
        # parsing the date string is the expensive step, so it is disabled:
        #date = datetime.datetime.strptime(record[0],'%Y-%m-%d')
        date = record[0]                               # keep the raw string
        data[date] = [float(x) for x in record[1:]]    # remaining columns as floats

Ideally, I would like to keep the date as a datetime object, but this causes an objective function call in my code to run in 11.2 seconds.  If I keep the date as a string (which is how it's stored in SQLite), the objective function call takes 2.1 seconds.
I'm going to be optimizing some parameter values, so I'd like the objective function to run as quickly as possible.  I was hoping that rewriting this chunk of code so that the loop is done in Cython would speed things up.  Perhaps I'm mistaken, though.  Also, maybe it's just an approach issue and there are some tools in NumPy that I'm not aware of that would help.
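
One cheap fix, since weather timestamps repeat heavily, would be to memoize the conversion so strptime runs only once per distinct string (a sketch, not from the thread):

    import datetime

    _date_cache = {}

    def parse_date(s):
        # strptime is slow; cache the result per distinct date string
        d = _date_cache.get(s)
        if d is None:
            d = _date_cache[s] = datetime.datetime.strptime(s, '%Y-%m-%d')
        return d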

Sturla Molden

Nov 18, 2011, 12:26:22 PM
to cython...@googlegroups.com
Looks like a design problem. Running that chunk of code in Cython will
not help.
Cython is not a cure for bad program design.

Why do you use a Python dict (an unordered container)? Why not a NumPy
array of e.g. int?
Why do you use multiple lists of float instead of a 2D NumPy array?
Why do you even use a database?

You are overcomplicating this.
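
For example, the dict-and-lists step could collapse to something like this (a sketch; it assumes every column after the date is numeric):

    import numpy as np

    rows = cursor.fetchall()                       # all records at once
    dates = np.array([r[0] for r in rows])         # 1-D array of date strings
    values = np.array([r[1:] for r in rows], dtype=float)   # one 2-D float array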

Sturla

Aman Thakral

Nov 18, 2011, 12:38:34 PM
to cython...@googlegroups.com
I use the dict as a way to create a pandas.DataFrame object.  pandas.DataFrame inherits from numpy.ndarray.  I use the database to facilitate the lookup of my weather data: I only need a particular subset of the data, selected by location and time range.

Sturla Molden

Nov 19, 2011, 3:56:51 PM
to cython...@googlegroups.com

Run the Python profiler to identify what is taking so much time. Then rectify the problem.
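
A minimal profiling run might look like this ('objective' and 'params' stand in for the actual names):

    import cProfile, pstats

    cProfile.run('objective(params)', 'profile.out')
    stats = pstats.Stats('profile.out')
    stats.sort_stats('cumulative').print_stats(15)   # the 15 biggest time sinks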

Sturla

mauro

Nov 20, 2011, 2:24:19 AM
to cython-users
[Mauro's posting arrived garbled by the Google Groups web interface: his explanation and his vectorised date-conversion script ran together without line breaks. A readable version appears in his repost of Nov 21 below.]

Stefan Behnel

Nov 20, 2011, 3:59:07 AM
to cython...@googlegroups.com
mauro, 20.11.2011 08:24:

> [Mauro's garbled posting of Nov 20, quoted in full; elided here -- see
> his readable repost below.]

Hi,

your e-mail is completely unreadable. Could you resend it as a plain text
message? You appear to be posting through a web mail interface, so you may
have to reconfigure it to fix this.

Also, note that you sent your e-mail twice; I only let one copy pass
through to the list. Postings from first-time senders require explicit
approval in order to prevent spam, so it may take a bit for them to
reach the list.

Stefan

mauro

Nov 21, 2011, 3:18:02 AM
to cython-users
Yes, completely unreadable. I used the web interface on
groups.google.com, so I'm not sure how I should reconfigure that to fix
it... Anyway, I'll re-post from my email client once I get a non-digest
email in this thread to which I can respond. And yes, I did send it
twice, as I forgot that the mail would need to be approved. Let's hope
that my future postings are more successful, and sorry for the noise!

Mauro

Mauro

Nov 21, 2011, 11:16:27 PM
to cython...@googlegroups.com
Here again is my post, which was garbled last time; I hope it works now:

Aman Thakral wrote on 18 Nov:


> Well, I think it may be the fact that the loop for storing the data is in
> Python.  Here is an example:
>
>     cursor = connection.cursor()
>     records = cursor.execute(query)
>     data = {}
>     for record in records:
>         #date = datetime.datetime.strptime(record[0],'%Y-%m-%d')
>         date = record[0]
>         data[date] = [float(x) for x in record[1:]]

I recently wrote some non-Cython code to convert date strings into
matplotlib (float) date numbers to avoid a loop like yours above;
maybe it can be of help. (As an aside, note that the newest NumPy
version supports its own date objects, and they may be of use too.)
Below is my mostly vectorized code to convert an array of date-strings
to matplotlib date-floats. It converts the array of strings into an
array of characters and operates on those in a vectorised manner. Note
that this hasn't been tested much yet and that it could be done more
cleverly. It's about 40x faster than just using
datetime.datetime.strptime for a time series that has many datapoints
per month. If there are only a few points per month the speed
improvement will be marginal, and you'd have to hand-code the
datetime.date call.

Mauro


import numpy as np
import datetime
import pylab as plt

isostrings = np.array(["2009-07-07 00:00:00","2010-07-07 00:01:00","2010-09-07 03:01:00"])
str_len = isostrings.dtype.itemsize

# string array viewed as bytes (characters)
isobytes = isostrings.view(np.byte)
isobytes = isobytes.reshape((isostrings.shape[0], str_len))
# now this is an array of characters

isoints = isobytes - 48  # subtracting 48 from ASCII digits gives their integer values
isoints[isoints==-48] = 0  # set padding (NUL) bytes to zero

# assemble the dates
years = np.sum(isoints[:,0:4]*np.array([1000,100,10,1]),1)
months = np.sum(isoints[:,5:7]*np.array([10,1]),1)
years_months = years+months/100.

# make a hash for all possible year-months between years[0] and
# years[-1] (assumes the strings are sorted by date) -> this should be
# efficient for time series which have many datapoints per month
year_month_hash = {}
for year in range(years[0], years[-1]+1):
    for month in range(1,13):
        year_month = year+month/100.
        year_month_hash[year_month] = datetime.date(year, month, 1).toordinal()

# convert into days (in matplotlib date-format)
date_floats = np.empty(len(isostrings))
for k in year_month_hash:
    date_floats[years_months==k] = year_month_hash[k] - 1

# and the rest is easy
HOURS_PER_DAY = 24.
MINUTES_PER_DAY = 60.*HOURS_PER_DAY
SECONDS_PER_DAY = 60.*MINUTES_PER_DAY

date_floats += np.sum(isoints[:,8:10]*np.array([10,1]),1)
date_floats += 1/HOURS_PER_DAY * np.sum(isoints[:,11:13]*np.array([10,1]),1)
date_floats += 1/MINUTES_PER_DAY * np.sum(isoints[:,14:16]*np.array([10,1]),1)
date_floats += 1/SECONDS_PER_DAY * np.sum(isoints[:,17:19]*np.array([10,1]),1)
if str_len>19:  # have fractional seconds too
    # one weight per fractional digit (the original used a fixed count of 10)
    date_floats += 1/SECONDS_PER_DAY * np.sum(isoints[:,20:]*np.logspace(-1, -(str_len-20), str_len-20),1)

# tests: print if wrong
for ii in range(0,date_floats.shape[0]):
    if date_floats[ii] != plt.date2num(datetime.datetime.strptime(isostrings[ii], '%Y-%m-%d %H:%M:%S')):
        print date_floats[ii], plt.date2num(datetime.datetime.strptime(isostrings[ii], '%Y-%m-%d %H:%M:%S'))


