convert subset to dictionary

Justvuur

Jul 2, 2020, 3:46:31 PM
to sqlalchemy
Hi there,

I'm struggling to find an efficient way to get a two-column subset into dictionary form.

I have an entity that has a subset of data. The subset is linked to the entity via Id. The order of the subset of data is defined in another table.

Example:
Student - Id, firstname, lastname
Subjects - StudentId, SubjectCode, SubjectName

At the moment I'm looping through the SQLAlchemy result of "student.subjects" in Python and creating a dictionary from that. It's quite heavy, especially when there are 2000+ students with potentially 100+ subjects each.

For each student, how do I get the subjects as a dictionary where the key is the SubjectCode and the value is the SubjectName?
Better yet, how can I get a result set of Id, firstname, lastname, SubjectCode x, SubjectCode y, etc. (where the SubjectName becomes the value and the SubjectCode becomes the column)?

Regards,
Justin

Simon King

Jul 3, 2020, 6:31:03 AM
to sqlal...@googlegroups.com
Are you trying to optimise the database access (ie. minimize the
number of queries), or provide a nice dictionary-style API for your
Student objects? What do you mean when you say that looping over
student.subjects is quite heavy?

An association proxy can be used to get dict-style access to a relationship:

https://docs.sqlalchemy.org/en/13/orm/extensions/associationproxy.html#proxying-to-dictionary-based-collections
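For example, a minimal sketch (the model and column names here are assumptions based on your description, not your actual schema):

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
from sqlalchemy.orm.collections import attribute_mapped_collection

Base = declarative_base()

class Student(Base):
    __tablename__ = 'students'
    id = Column(Integer, primary_key=True)

    # dict keyed by Subject.code; values are Subject instances
    subjects_by_code = relationship(
        'Subject',
        collection_class=attribute_mapped_collection('code'),
    )
    # proxies through to Subject.name, giving {code: name}
    subject_names = association_proxy(
        'subjects_by_code', 'name',
        creator=lambda code, name: Subject(code=code, name=name),
    )

class Subject(Base):
    __tablename__ = 'subjects'
    student_id = Column(Integer, ForeignKey('students.id'), primary_key=True)
    code = Column(String(50), primary_key=True)
    name = Column(String)

With that, student.subject_names behaves like a plain dict mapping SubjectCode to SubjectName.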

There are also a couple of examples in the SQLAlchemy docs that
provide a dictionary-style API:

https://docs.sqlalchemy.org/en/13/orm/examples.html#module-examples.dynamic_dict

https://docs.sqlalchemy.org/en/13/orm/examples.html#module-examples.vertical

Hope that helps,

Simon

Justvuur

Jul 3, 2020, 8:36:18 AM
to sqlalchemy
Hi Simon, thanks for the help! I've never used that before, it's quite handy.

I'm looping through all the students and writing them and their subject details to a CSV file.
What makes things a tad complicated is that the subjects must appear in a specific order.
There is a table that holds the subject code and order number (ordered_subjects, used below, is the result set from it).
I printed out the timing and found the problem to be with a nested for loop.

I was hoping to reduce that processing time by using a map that gets populated automatically, instead of having to build it on the fly.

Before adding the subjects_collection "attribute_mapped_collection":
********************************************************************************************
for row in students:
    row_no += 1

    # build the code -> value map for this student,
    # resetting it on every iteration
    student_subjects = {}
    for subject in row.subjects:
        student_subjects[subject.code] = subject.value

    csv_row = [row_no]
    csv_row += [student_subjects.get(x.code, '') for x in ordered_subjects]
    csv_row += [row.created_on, row.updated_on]

    writer.writerow([x.encode('utf-8') if isinstance(x, unicode) else x for x in csv_row])


After adding the subjects_collection "attribute_mapped_collection", I unfortunately did not see a change in performance.

After adding the subjects_collection "attribute_mapped_collection":
********************************************************************************************
for row in students:
    row_no += 1
    csv_row = [row_no]
    # note: .get(x.code, '').value raises AttributeError for missing
    # codes, since '' has no .value; guard with a membership test
    csv_row += [
        row.subjects_collection[x.code].value if x.code in row.subjects_collection else ''
        for x in ordered_subjects
    ]
    csv_row += [row.created_on, row.updated_on]

    writer.writerow([x.encode('utf-8') if isinstance(x, unicode) else x for x in csv_row])


from sqlalchemy.orm.collections import attribute_mapped_collection


class Subject(db.Model):
    __tablename__ = 'subjects'

    student_id = db.Column(db.Integer, db.ForeignKey('students.id'), primary_key=True)
    code = db.Column(db.String(50), primary_key=True)
    value = db.Column(db.String)

    def __init__(self, code, value):
        self.code = code
        self.value = value


class Student(ResourceMixin, db.Model):
    __tablename__ = 'students'

    id = db.Column(db.Integer, primary_key=True)

    subjects = db.relationship('Subject', backref='student')

    # second relationship over the same foreign key, kept read-only
    # so the two collections don't fight over writes
    subjects_collection = db.relationship(
        'Subject',
        collection_class=attribute_mapped_collection('code'),
        viewonly=True,
    )

Can you see a way I can optimize this? Any ideas?

Simon King

Jul 3, 2020, 11:07:43 AM
to sqlal...@googlegroups.com
Are you eager-loading the "student.subjects" relationship? If not,
that will give you the biggest performance increase. Without that, you
will be issuing a separate DB query for each of the students, to load
that student's subjects. Eager-loading allows you to preload the
subjects for every student in a single query:

https://docs.sqlalchemy.org/en/13/orm/loading_relationships.html#joined-eager-loading
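Something like this (a sketch; db.session and the relationship name are taken from your snippets):

from sqlalchemy.orm import joinedload

# load every student and their subjects in one joined query
students = (
    db.session.query(Student)
    .options(joinedload(Student.subjects))
    .all()
)

selectinload(Student.subjects) is worth trying too; instead of one big join it emits a second query with an IN clause, which is often faster for large collections.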

Simon

Justvuur

Jul 6, 2020, 2:34:25 PM
to sqlalchemy
I added the eager-loading, but it seems to slow down the SQL query quite a lot.
It's as if the SQL query now takes longer but generating the file is quicker hehe... I guess the queries that populate subjects are now all fired up front.
It's still taking roughly the same total time, though.

Simon King

Jul 7, 2020, 6:50:13 AM
to sqlal...@googlegroups.com
How long is it taking? You mentioned 2000 students with 100 subjects
each, so there are something like 200,000 rows in the Subjects table,
and you need to load all of them. I wouldn't expect that to take longer
than a couple of seconds, though.

Simon

Justvuur

Jul 7, 2020, 7:53:32 AM
to sqlalchemy
I'm currently testing with 7000 students and 181 subjects each.
I first went over to the DB and ran the query directly there, as an inner join between the two tables.
It took about 2:25 minutes. Selecting only specific columns from the two tables dropped that to about 2 minutes.
I then tried a "select subjects where student ID in (...)" query, and that executed a lot quicker than the others.

So then, on the Python side of things, I first fetched the students, selecting only the id, created_on and updated_on columns using the with_entities option.
That took less than 1 second. Then, for each student, I executed a DB query to fetch the subjects, again using with_entities.
For some reason this works faster than eager-loading/students.subjects/attribute_mapped_collection, which take about 2 minutes longer than this (supposedly inefficient) approach.
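Roughly, the current approach looks like this (a sketch; column names are taken from the models above, and db.session is assumed):

# one cheap query for the students...
students = (
    db.session.query(Student)
    .with_entities(Student.id, Student.created_on, Student.updated_on)
    .all()
)

for student_id, created_on, updated_on in students:
    # ...then one small query per student for its subjects
    rows = (
        db.session.query(Subject)
        .with_entities(Subject.code, Subject.value)
        .filter(Subject.student_id == student_id)
        .all()
    )
    student_subjects = dict(rows)  # code -> value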

Simon King

Jul 7, 2020, 9:56:04 AM
to sqlal...@googlegroups.com
2:25 minutes, even for 1,267,000 rows, sounds like quite a long time
to me. Are you physically a long way from the database, querying
across the internet? Do the tables contain many columns, or perhaps
large BLOB or TEXT fields?

If there are large fields that you don't need very often, you could
consider marking them as "deferred", then they wouldn't be loaded by
default:

https://docs.sqlalchemy.org/en/13/orm/loading_columns.html#deferred-column-loading
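For example (a sketch; "notes" is a hypothetical large column on a hypothetical model, not one of your actual fields):

from sqlalchemy.orm import deferred

class Document(db.Model):
    __tablename__ = 'documents'
    id = db.Column(db.Integer, primary_key=True)
    # not loaded until first accessed, so ordinary queries
    # don't drag the big field across the wire
    notes = deferred(db.Column(db.Text))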

I suppose if your database server was underpowered, loading over a
million rows might exhaust its memory and cause it to start swapping,
which would hurt performance, in which case querying for smaller
amounts of data might be better.

Simon

Justvuur

Jul 7, 2020, 12:46:29 PM
to sqlalchemy
Nope, I'm running it on a Xubuntu VM with only 4 GB of RAM and 4 cores assigned to it.
And no large fields: the students table has created_on, updated_on, id, 3 integer columns and a boolean column.
The subjects table has student_id, code and name columns; code is character varying(100) and name is character varying.

So it's not over the net, it's running locally.
That's a good idea! Maybe I'll try that!
Thanks for your help on this, I really appreciate it.

Jonathan Vanasco

Jul 7, 2020, 1:31:56 PM
to sqlalchemy
Based on what you shared above:

* The "Subject" table is: `StudentId, SubjectCode, SubjectName`
* There are 181 subjects

It looks like you don't have a "Subject" table, but a "StudentAndSubject" table.

I think you'd get a bigger performance improvement by normalizing that data into two tables:

    Subject:  SubjectId (primary key), SubjectCode, SubjectName
    Student2Subject: StudentId, SubjectId, (primary key is both)

Assuming this can be done with your data... the database performance should improve because

1. The raw file storage will decrease
2. The in-memory dataset size will decrease

You could then either

1. use the Subject table as part of a joined query, to keep things simple, or
2. just select off a join of Student+Student2Subject, and query all the Subjects separately (sketched below). Even if there are 2000 subjects total, it should only take a few ms to get all of them into a Python data structure used to generate your CSV.
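A sketch of that normalized layout (model, table and column names here are assumptions, following the shape above):

class Subject(db.Model):
    __tablename__ = 'subjects'
    id = db.Column(db.Integer, primary_key=True)
    code = db.Column(db.String(50), unique=True)
    name = db.Column(db.String)

class StudentToSubject(db.Model):
    __tablename__ = 'student_to_subject'
    student_id = db.Column(db.Integer, db.ForeignKey('students.id'), primary_key=True)
    subject_id = db.Column(db.Integer, db.ForeignKey('subjects.id'), primary_key=True)

With ~181 subjects, each code/name string is stored once instead of once per (student, subject) row.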


Justvuur

Jul 8, 2020, 3:22:17 AM
to sqlalchemy
I'd like to redesign the DB, but that's not on the cards at the moment (or ever, hehe).

Thanks for the feedback, I appreciate all the help, I really do.
What puzzles me now is: why would looping through each student (using with_entities) and fetching the subjects for each one (also using with_entities) be faster (from 2:25 min down to 0:19 min) than the student.subjects approach?
I know I'm selecting 3 or 4 fewer columns, but surely including them can't cost 2 minutes?

Simon King

Jul 8, 2020, 5:33:31 AM
to sqlal...@googlegroups.com
You should start by enabling SQLAlchemy logging to see the actual
queries that are being run. The easiest way is to pass "echo=True" to
your create_engine call. (You could also try echo="debug", but since
you've got hundreds of thousands of rows, you'll be swamped.)

Verify that the queries look correct (for example, they have the right
join conditions). If they look OK but they run slowly, use your
database's tools (eg. EXPLAIN or EXPLAIN ANALYZE) to understand why.
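For example (a sketch; the connection URL is a placeholder):

from sqlalchemy import create_engine

engine = create_engine(
    'postgresql://user:password@localhost/school',  # placeholder URL
    echo=True,  # log each SQL statement as it is emitted
)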

Simon

Justvuur

Jul 8, 2020, 6:49:08 AM
to sqlalchemy
Thanks, yes, this will help a lot.