Streamlined dictionary-style numpy array storage in a one-to-many relationship

Ruben Di Battista

Jul 27, 2017, 10:50:23 AM
to sqlalchemy
Hello, I'm trying to figure out a streamlined way to store some child values that are kept in numpy arrays on the Python side. As an example, let's assume I have a parent object that is a sensor with some readings associated with it:

from sqlalchemy import Column, DateTime, Float, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()


class Sensor(Base):
    __tablename__ = 'sensor'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    readings = relationship("Reading", backref="sensor")


class Reading(Base):
    __tablename__ = 'reading'
    id = Column(Integer, primary_key=True)
    date = Column(DateTime)
    voltage = Column(Float)
    value = Column(Float)

    sensor_id = Column(Integer, ForeignKey('sensor.id'))


What I would like to achieve is something like:
sensor = Sensor(name='Bedroom Sensor')
dates, voltages, values = get_sensor_data_from_somewhere()  # <-- returns three numpy arrays (datetime, float, float), all of the same length!

sensor.readings['date'] = dates
sensor.readings['voltage'] = voltages
sensor.values['value'] = values

session.add(sensor)

Is this possible somehow? It's similar to the attribute_mapped_collection, but I need to map three different keys to the three attributes of the Reading object. 
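(For reference, a minimal sketch of how attribute_mapped_collection is normally used, assuming the mapping above; it keys the dictionary by a single attribute of each child Reading, which is why it doesn't directly cover three array-valued keys:)

from sqlalchemy.orm.collections import attribute_mapped_collection

class Sensor(Base):
    __tablename__ = 'sensor'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    # dict-like collection keyed by each Reading's `date` attribute:
    #   sensor.readings[some_datetime] -> a single Reading object
    readings = relationship(
        "Reading",
        collection_class=attribute_mapped_collection('date'),
        backref="sensor",
    )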

Ruben Di Battista

Jul 27, 2017, 11:49:19 AM
to sqlalchemy
sensor.values['value'] = values

This is a typo. Should be this: 

sensor.readings['value'] = values

Mike Bayer

Jul 27, 2017, 4:43:52 PM
to sqlal...@googlegroups.com
Application of appropriate __getitem__() and __setitem__() methods on
your Sensor class would achieve this.


Mike Bayer

Jul 27, 2017, 4:50:27 PM
to sqlal...@googlegroups.com
oh, .readings. Unfortunately you'd need to develop a custom
collection class which achieves this and make it available via a
@property on Sensor:

class Sensor(...):
    @property
    def readings(self):
        return MagicReadingCollection(self)

then you do the __getitem__ / __setitem__ thing on MagicReadingCollection:

class MagicReadingCollection(object):
    def __init__(self, sensor):
        self.sensor = sensor

    def __setitem__(self, field, values):
        # grow the readings collection as needed, then set the given
        # field on each Reading in order
        start_appending_at = len(self.sensor.readings)
        for index, value in enumerate(values):
            if index >= start_appending_at:
                reading = Reading()
                self.sensor.readings.append(reading)
            else:
                reading = self.sensor.readings[index]
            setattr(reading, field, value)

Mike Bayer

Jul 27, 2017, 4:51:09 PM
to sqlal...@googlegroups.com
correction, you'd also name the relationship "_readings", e.g.
self.sensor._readings
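(Folding that correction into the previous snippet, a rough consolidated sketch; the __getitem__ for reading values back out is an assumption, not something from the replies:)

class Sensor(Base):
    __tablename__ = 'sensor'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    # the real relationship lives under a private name so that the
    # "readings" property can own that attribute name
    _readings = relationship("Reading", backref="sensor")

    @property
    def readings(self):
        return MagicReadingCollection(self)


class MagicReadingCollection(object):
    def __init__(self, sensor):
        self.sensor = sensor

    # __setitem__ as in the previous message, but appending to and
    # indexing self.sensor._readings instead of self.sensor.readings

    def __getitem__(self, field):
        # assumption: hand one "column" of the readings back as a plain list
        return [getattr(reading, field) for reading in self.sensor._readings]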

Jonathan Vanasco

Jul 28, 2017, 3:28:19 PM
to sqlalchemy
Unless you need to use all the readings immediately, have you considered just making a custom method on the Sensor model and then inserting all the readings via SQLAlchemy Core? That would allow you to insert them without creating ORM objects, which people using numpy with a lot of data often like to avoid.

Then you could do...

     s = Sensor()
     session.add(s)
     s.add_readings(dates, voltages, values)

it would look something like this...

class Sensor(Base):
    def add_readings(self, dates, voltages, values):
        if not self.id:
            # flush this to the session
            session = object_session(self)
            session.flush(objects=[self])
        # insert via core
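        # (illustrative completion, not part of the original reply: one way
        # the "insert via core" step could look, reusing the Reading mapping
        # from the first message; executing an insert() construct with a
        # list of parameter dicts performs an executemany without ORM objects)
        session.execute(
            Reading.__table__.insert(),
            [
                {'date': d, 'voltage': v, 'value': val, 'sensor_id': self.id}
                for d, v, val in zip(dates, voltages, values)
            ],
        )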

Ruben Di Battista

Jul 30, 2017, 10:36:56 AM
to sqlalchemy
Thanks, this is in fact what I implemented now as a method on the Sensor class, also exploiting bulk_insert_mappings since the number of readings is quite large (400k each time):

    def store_readings(self, session):

        if not self.id:
            session.add(self)
            session.flush()

        sensor_id = self.id

        times, voltages = self.get_values_from_somewhere()

        n = len(times)
        chunk_size = 100000
        for i in six.moves.range(0, n, chunk_size):
            begin = i
            end = i+chunk_size
            time_chunk = times[begin:end]
            voltage_chunk = voltages[begin:end]

            session.bulk_insert_mappings(PassageData, [
                {
                    'time': t,
                    'voltage': v,
                    'sensor_id': sensor_id
                } for t, v in
                zip(time_chunk, voltage_chunk)]
            )

        session.commit()


The problem with the custom collection is that we might lose the performance gained from numpy's powerful indexing. I still need to study Mike's suggestion a bit.

Thanks, 

Jonathan Vanasco

Jul 31, 2017, 4:44:26 PM
to sqlalchemy
Using `bulk_insert_mappings` is much more performant than using the ORM and custom collections. Right now you are bypassing ORM object creation and state management. Your operations might be 10x slower with a collection.

IIRC, `bulk_insert_mappings` will iterate over the payload like this:

     INSERT INTO foo VALUES (bar);
     INSERT INTO foo VALUES (bar);
     INSERT INTO foo VALUES (bar);

and you may be able to increase performance by batching the payload into groups of 100+ and constructing a SQLAlchemy Core statement of the form:

     INSERT INTO foo VALUES (bar), (bar), (bar)...;
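(A rough sketch of that form, not from the original post, reusing the chunk names from the store_readings() snippet above; insert().values() with a list of dicts renders one multi-row VALUES statement:)

     stmt = PassageData.__table__.insert().values([
         {'time': t, 'voltage': v, 'sensor_id': sensor_id}
         for t, v in zip(time_chunk, voltage_chunk)
     ])
     session.execute(stmt)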

There is a lot of discussion on this list about "high volume inserts", different ways to approach it for performance, and the overhead of using the ORM. At 4k entries, people avoid the ORM. At 400k, you really want to avoid the ORM.