How to formulate a query using sqlalchemy ORM constructs that calculates the difference between entries in rows

563 views
Skip to first unread message

Björn Nadrowski

unread,
Apr 20, 2018, 12:31:45 PM4/20/18
to sqlalchemy
Hello, 

I would like to formulate a query that returns the difference between items stored in the same row.
I modified it slightly to fit my example program below.

My program is the following:
from sqlalchemy import Column, Integer, String, DateTime, Float
from sqlalchemy import create_engine, text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from datetime import datetime as dttm
import pandas as pd
import sys
Base = declarative_base()

class CTestTable(Base):
    __tablename__ = 'testTable'
    id = Column(Integer, primary_key=True, index=True)
    datetime = Column(DateTime(timezone=False))
    val = Column(Float)
    info = Column(String(100))

    def __repr__(self):
       return "<User(name='%s', fullname='%s', password='%s')>" % (
                            self.name, self.fullname, self.password)

def main():
    # create engine
    engine = create_engine('sqlite:///test.sqlitedb', echo=False)

    # create session
    Session = sessionmaker()
    Session.configure(bind=engine)
    session = Session()

    # create tables in database
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

    # add test data
    testTables = []
    testTables.append(CTestTable(datetime=dttm(2018, 1, 1), val=1.1, info='t1'))
    testTables.append(CTestTable(datetime=dttm(2018, 1, 3), val=1.4, info='t1'))
    testTables.append(CTestTable(datetime=dttm(2018, 1, 4), val=1.5, info='t1'))

    testTables.append(CTestTable(datetime=dttm(2018, 1, 1), val=11.1, info='t2'))
    testTables.append(CTestTable(datetime=dttm(2018, 1, 2), val=11.4, info='t2'))
    testTables.append(CTestTable(datetime=dttm(2018, 1, 4), val=11.5, info='t2'))

    session.add_all(testTables)
    session.commit()

    # print the table
    qs = (session.query(CTestTable))
    df = pd.read_sql(qs.statement, qs.session.bind)
    print(df)

    # applying SQL text:
    ttext = (f"""select id,
                        datetime,
                        val,
                        info,
                        julianday("datetime") - julianday("PreviousDate") as datediff
                 from ( select  id,
                                datetime,
                                val,
                                info,
                                (select max(datetime)
                                from testTable T2
                                where T2.info=T1.info
                                and T2.datetime < T1.datetime
                                ) as PreviousDate
                  from testTable T1
                  ) as T""")
    stmt = text(ttext)
    df =  pd.read_sql(stmt, session.bind)
    print(df)

if __name__ == '__main__':
    sys.exit(main())


The output is the following:
   id   datetime   val info
0   1 2018-01-01   1.1   t1
1   2 2018-01-03   1.4   t1
2   3 2018-01-04   1.5   t1
3   4 2018-01-01  11.1   t2
4   5 2018-01-02  11.4   t2
5   6 2018-01-04  11.5   t2
   id                    datetime   val info  datediff
0   1  2018-01-01 00:00:00.000000   1.1   t1       NaN
1   2  2018-01-03 00:00:00.000000   1.4   t1       2.0
2   3  2018-01-04 00:00:00.000000   1.5   t1       1.0
3   4  2018-01-01 00:00:00.000000  11.1   t2       NaN
4   5  2018-01-02 00:00:00.000000  11.4   t2       1.0
5   6  2018-01-04 00:00:00.000000  11.5   t2       2.0
 
The second table is the desired output. 
As you can see, the SQL works perfectly ok (for sqlite) (aside from performance issues on tables with >1. e7 rows)
But what is the  way to perform this using sqlalchemy's ORM structures?
Somehow using alias(), select(), nested subqueries...
But how do I do that?
One expected benefit would be, of course, that the ORM way would be database-backend independent (the backend could be changed to mysql or other, the program would still run).

Thanks for any help!


Mike Bayer

unread,
Apr 23, 2018, 8:37:43 AM4/23/18
to sqlal...@googlegroups.com
We can help you with this query but I'd first ask that you have
reviewed the documentation that covers these patterns, within the Core
tutorial: http://docs.sqlalchemy.org/en/latest/core/tutorial.html
specifically, how to do aliased subqueries
(http://docs.sqlalchemy.org/en/latest/core/tutorial.html#using-aliases),
functions (http://docs.sqlalchemy.org/en/latest/core/tutorial.html#functions),
etc.
> --
> SQLAlchemy -
> The Python SQL Toolkit and Object Relational Mapper
>
> http://www.sqlalchemy.org/
>
> To post example code, please provide an MCVE: Minimal, Complete, and
> Verifiable Example. See http://stackoverflow.com/help/mcve for a full
> description.
> ---
> You received this message because you are subscribed to the Google Groups
> "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to sqlalchemy+...@googlegroups.com.
> To post to this group, send email to sqlal...@googlegroups.com.
> Visit this group at https://groups.google.com/group/sqlalchemy.
> For more options, visit https://groups.google.com/d/optout.

Björn Nadrowski

unread,
May 3, 2018, 6:05:41 PM5/3/18
to sqlalchemy

Thanks for your reply. I guessed that this should be possible using alias and functions. But I was unable to write a simple script that achieves that. 
Could you help me with that?
Or do you want me to show the things that I tried but do not work?

Thanks, Bjoern

Mike Bayer

unread,
May 3, 2018, 8:12:54 PM5/3/18
to sqlal...@googlegroups.com
On Thu, May 3, 2018 at 6:05 PM, Björn Nadrowski <bjrnf...@gmail.com> wrote:
>
> Thanks for your reply. I guessed that this should be possible using alias
> and functions. But I was unable to write a simple script that achieves that.
> Could you help me with that?
> Or do you want me to show the things that I tried but do not work?

no need, can you try this query?

T2 = aliased(CTestTable, name="T2")
T1 = aliased(CTestTable, name="T1")

max_subq = session.query(
func.max(T2.datetime)
).filter(T2.info == T1.info).filter(
T2.datetime < T1.datetime).correlate(T1).label("PreviousDate")

subq = session.query(
T1.id, T1.datetime, T1.val, T1.info, max_subq).subquery("T")
q = session.query(
subq.c.id, subq.c.datetime, subq.c.val, subq.c.info,
(
func.julianday("datetime") -
func.julianday("PreviousDate")).label("datediff")
).select_from(
subq
)

q.all()

log output is:

SELECT "T".id AS "T_id", "T".datetime AS "T_datetime", "T".val AS
"T_val", "T".info AS "T_info", julianday(?) - julianday(?) AS datediff
FROM (SELECT "T1".id AS id, "T1".datetime AS datetime, "T1".val AS
val, "T1".info AS info, (SELECT max("T2".datetime) AS max_1
FROM "testTable" AS "T2"
WHERE "T2".info = "T1".info AND "T2".datetime < "T1".datetime) AS "PreviousDate"
FROM "testTable" AS "T1") AS "T"
2018-05-03 20:11:30,082 INFO sqlalchemy.engine.base.Engine
('datetime', 'PreviousDate')


is that right? note the uppercase names need to be quoted, literal
values are turned into bind parameters, etc.





>
> Thanks, Bjoern
>
>
> Am Montag, 23. April 2018 14:37:43 UTC+2 schrieb Mike Bayer:
>>
>> We can help you with this query but I'd first ask that you have
>> reviewed the documentation that covers these patterns, within the Core
>> tutorial: http://docs.sqlalchemy.org/en/latest/core/tutorial.html
>> specifically, how to do aliased subqueries
>> (http://docs.sqlalchemy.org/en/latest/core/tutorial.html#using-aliases),
>> functions
>> (http://docs.sqlalchemy.org/en/latest/core/tutorial.html#functions),
>> etc.
>>
>>

Björn Nadrowski

unread,
May 11, 2018, 3:17:31 PM5/11/18
to sqlalchemy


Hi Mike, 
that looks very good, 

However, I am still left with three problems:

a) the query returns "None" in the datediff column. It seems that the strings "datetime" and "PreviousDate" are somehow not recognized. When I replace 
        "( 
            func.julianday("datetime") - 
            func.julianday("PreviousDate")).label("datediff") "
by 
        (max_subq).label("datediff") 

then I do get a non-None result. The constructed SQL is:
SELECT "T".id, "T".datetime, "T".val, "T".info, (SELECT max("T2".datetime) AS max_1 
  FROM "testTable" AS "T2", "testTable" AS "T1" 
  WHERE "T2".info = "T1".info AND "T2".datetime < "T1".datetime) AS datediff 
  FROM (SELECT "T1".id AS id, "T1".datetime AS datetime, "T1".val AS val, "T1".info AS info, (SELECT max("T2".datetime) AS max_1 
  FROM "testTable" AS "T2" 
  WHERE "T2".info = "T1".info AND "T2".datetime < "T1".datetime) AS "PreviousDate" 
  FROM "testTable" AS "T1") AS "T"

The "datediff" column contains, in this case, the value "2018-01-03" in each of the 6 rows.

b) I do not know what the "correlate(T1)" statement does. I looked up the docs for correlate but was unable to find out what that really does. 
The SQL produced by your code is identical regardless of whether correlate(T1) is used or not.

c) The code is still sqlite-specific (use of julianday()). Suppose I find out the  reason why the code does not work and succeed reproducing the desired output. Then the code would still be SQLite-specific and not generic which is the main reason for using sqlalchemy. What do I have to do to make it databse-backend-indpeendent?

Thanks for your help, 
Bjoern

Mike Bayer

unread,
May 11, 2018, 3:50:47 PM5/11/18
to sqlal...@googlegroups.com
On Fri, May 11, 2018 at 3:17 PM, Björn Nadrowski <bjrnf...@gmail.com> wrote:
>
>
> Hi Mike,
> that looks very good,
>
> However, I am still left with three problems:
>
> a) the query returns "None" in the datediff column. It seems that the
> strings "datetime" and "PreviousDate" are somehow not recognized. When I
> replace
> "(
> func.julianday("datetime") -
> func.julianday("PreviousDate")).label("datediff") "
> by
> (max_subq).label("datediff")
>
> then I do get a non-None result.

does the raw SQL statement previously w/ the julianday function work?
I'm not familiar with this function on SQLite.

The constructed SQL is:
> SELECT "T".id, "T".datetime, "T".val, "T".info, (SELECT max("T2".datetime)
> AS max_1
> FROM "testTable" AS "T2", "testTable" AS "T1"
> WHERE "T2".info = "T1".info AND "T2".datetime < "T1".datetime) AS datediff
> FROM (SELECT "T1".id AS id, "T1".datetime AS datetime, "T1".val AS val,
> "T1".info AS info, (SELECT max("T2".datetime) AS max_1
> FROM "testTable" AS "T2"
> WHERE "T2".info = "T1".info AND "T2".datetime < "T1".datetime) AS
> "PreviousDate"
> FROM "testTable" AS "T1") AS "T"
>
> The "datediff" column contains, in this case, the value "2018-01-03" in each
> of the 6 rows.
>
> b) I do not know what the "correlate(T1)" statement does. I looked up the
> docs for correlate but was unable to find out what that really does.
> The SQL produced by your code is identical regardless of whether
> correlate(T1) is used or not.

then you can omit it, correlation is automatic if there are no
correlation directives given, documentation at
http://docs.sqlalchemy.org/en/latest/core/tutorial.html#correlated-subqueries


>
> c) The code is still sqlite-specific (use of julianday()). Suppose I find
> out the reason why the code does not work and succeed reproducing the
> desired output. Then the code would still be SQLite-specific and not generic
> which is the main reason for using sqlalchemy. What do I have to do to make
> it databse-backend-indpeendent?

unfortunately date arithmetic is complicated and extremely different
on pretty much every database, and SQLAlchemy doesn't have an
abstraction layer for date arithmetic functions. What I usually do is
figure out the correct expression for each of the databases I'm
targeting, then I build custom function, example at
http://docs.sqlalchemy.org/en/latest/core/compiler.html#utc-timestamp-function
.
Reply all
Reply to author
Forward
0 new messages