BSD DB can be faster than Sqlite if you don't need SQL (i.e. it's more
like a Python dict).
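For example, the bsddb module in the Python 2 standard library exposes exactly that dict-like interface. A rough sketch (the file name is made up; keys and values have to be byte strings, so numeric data would need to be serialized e.g. with pickle or struct):

import bsddb

db = bsddb.hashopen('cache.db', 'c')   # 'c': create the file if it does not exist
db['2009-07-07'] = '1.0,2.0,3.0'       # keys and values must be (byte) strings
print db['2009-07-07']
db.close()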
Both Sqlite and BSD DB have C APIs you can use from Cython, though I
hardly think the Python wrapper is the important bottleneck here.
HDF5/PyTables is typically used for a different purpose than Sqlite:
storing large array-based scientific data. That can be easier (and
faster) than e.g. keeping pickled NumPy arrays in BLOBs in Sqlite.
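For illustration, a minimal PyTables sketch (file and node names are made up) that stores a NumPy array and reads back only a slice of it:

import numpy as np
import tables   # PyTables

data = np.random.rand(10000, 50)             # some array-based data
h5 = tables.openFile('data.h5', mode='w')    # PyTables 2.x-style API
h5.createArray(h5.root, 'mydata', data, title='example array')
h5.close()

h5 = tables.openFile('data.h5', mode='r')
subset = h5.root.mydata[100:200]             # reads only the requested rows from disk
h5.close()

The point is that slicing a stored array pulls just that part off disk, with no unpickling step in between.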
Sturla
Why? Do you have any reason to think the bottleneck is in the Python
wrappers?
If the database is the problem, calling it from C will not help. Sqlite
will still be Sqlite, neither faster nor slower.
Sturla
Why do you use a Python dict (an unordered container)? Why not a NumPy
array of e.g. int?
Why do you use multiple lists of float instead of a 2D NumPy array?
Why do you even use a database?
You have been thinking too complicated.
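For example, a minimal sketch (reusing the connection and query names from the snippet quoted further down) that puts the whole result set into one array of date strings plus one 2D float array:

import numpy as np

cursor = connection.cursor()
rows = cursor.execute(query).fetchall()

dates  = np.array([r[0] for r in rows])                      # one date string per row
values = np.array([r[1:] for r in rows], dtype=np.float64)   # rows x fields, 2D float array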
Sturla
Hi,
your e-mail is completely unreadable. Could you resend it as a plain text
message? You appear to be posting through a web mail interface, so you may
have to reconfigure it to fix this.
Also, note that you sent your e-mail twice; I only let one copy pass
through to the list. Postings from first-time senders require explicit
approval in order to prevent spam, so it may take a bit for them to reach
the list.
Stefan
Mauro
Aman Thakral wrote on 18 Nov:
> Well, I think it may be the fact that the loop for storing the data is in
> python. Here is an example:
>
> cursor = connection.cursor()
> records = cursor.execute(query)
> data = {}
> for record in records:
>     # date = datetime.datetime.strptime(record[0], '%Y-%m-%d')
>     date = record[0]
>     data[date] = [float(x) for x in record[1:]]
I recently wrote some non-Cython code to convert date strings into
matplotlib (float) date numbers, to avoid a loop like yours above;
maybe that could be of help. (As an aside, note that the newest numpy
version supports its own date objects, and they may be of use too.)
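For the record, something along these lines (exact parsing behaviour depends on the numpy version, so treat this as a sketch):

import numpy as np

strings = np.array(["2009-07-07T00:00:00", "2010-07-07T00:01:00"])
dates = strings.astype("datetime64[s]")   # numpy's native date/time type, second resolution
seconds = dates.astype("int64")           # seconds since the epoch, if plain numbers are needed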
Below is my mostly vectorized code to convert an array of date strings to
matplotlib date floats. It converts the array of strings into an array
of characters and operates on those in a vectorized manner. Note that
this hasn't been tested much yet and that it could be done more
cleverly. It's about 40x faster than just using
datetime.datetime.strptime for a time series that has many data points
per month. If it has only a few points per month, the speed improvement
will be marginal and you'd have to hand-code the datetime.date call.
Mauro
import numpy as np
import datetime
import pylab as plt
isostrings = np.array(["2009-07-07 00:00:00","2010-07-07 00:01:00","2010-09-07 03:01:00"])
str_len = isostrings.dtype.itemsize
# string array viewed as bytes (characters)
isobytes = isostrings.view(np.byte)
isobytes = isobytes.reshape((isostrings.shape[0], str_len))
# now this is an array of characters
isoints = isobytes - 48 # subtracting 48 (ord('0')) turns ASCII digit codes into their integer values
isoints[isoints==-48] = 0 # NUL padding bytes become -48; set them to zero
# add times
years = np.sum(isoints[:,0:4]*np.array([1000,100,10,1]),1)
months = np.sum(isoints[:,5:7]*np.array([10,1]),1)
years_months = years+months/100.
# make a hash for all possible year-months between
# years[0] and years[-1] -> this should be efficient for time series which have many datapoints per month
year_month_hash = {}
for year in range(years[0], years[-1]+1):
    for month in range(1, 13):
        year_month = year + month/100.
        year_month_hash[year_month] = datetime.date(year, month, 1).toordinal()
# convert into days (in matplotlib date-format)
date_floats = np.empty(len(isostrings))
for k in year_month_hash:
    date_floats[years_months==k] = year_month_hash[k] - 1
# and the rest is easy
HOURS_PER_DAY = 24.
MINUTES_PER_DAY = 60.*HOURS_PER_DAY
SECONDS_PER_DAY = 60.*MINUTES_PER_DAY
date_floats += np.sum(isoints[:,8:10]*np.array([10,1]),1)
date_floats += 1/HOURS_PER_DAY * np.sum(isoints[:,11:13]*np.array([10,1]),1)
date_floats += 1/MINUTES_PER_DAY * np.sum(isoints[:,14:16]*np.array([10,1]),1)
date_floats += 1/SECONDS_PER_DAY * np.sum(isoints[:,17:19]*np.array([10,1]),1)
if str_len > 19: # have fractional seconds too
    # one weight per fractional digit: 10**-1, 10**-2, ..., 10**-(str_len-20)
    date_floats += 1/SECONDS_PER_DAY * np.sum(isoints[:,20:]*np.logspace(-1, -(str_len-20), str_len-20), 1)
# tests: print if wrong
for ii in range(0, date_floats.shape[0]):
    if date_floats[ii] != plt.date2num(datetime.datetime.strptime(isostrings[ii], '%Y-%m-%d %H:%M:%S')):
        print date_floats[ii], plt.date2num(datetime.datetime.strptime(isostrings[ii], '%Y-%m-%d %H:%M:%S'))
At Mon, 21 Nov 2011 00:18:02 -0800 (PST),