There is some overhead in using an ORM: this is inevitable. On the other hand, you didn't post your ORM calls, and the queries are incomplete. If you think your queries are optimized (as I understand they are from your description), a workaround is to avoid using the ORM for those specific queries. Or review your Models; they might not be optimized for what you want. Can you try to reduce the number of queries by combining them into fewer, more complex ones? Another possibility is creating a new Model that recovers information from a view/stored procedure on the database... you see, you want to get 10k to 50k items... that's a lot (to serialize, too).
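To illustrate the view-backed Model idea, here is a minimal sketch, assuming a hypothetical database view named todo_item_export (the field names are invented):

    from django.db import models

    class ItemExport(models.Model):
        # Field names here are invented for illustration.
        id = models.IntegerField(primary_key=True)
        version = models.IntegerField()
        photo_count = models.IntegerField()

        class Meta:
            managed = False                # Django won't create or migrate it
            db_table = 'todo_item_export'  # hypothetical database view

Because the model is unmanaged, you keep the ORM's query interface while the heavy JOIN work lives in the view definition on the database side.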
Besides, you "time" from a shell is not the appropriate way to measure DB response. Try to use whatever is available for your database to measure that.
Also, see
http://blogs.perl.org/users/steffen_mueller/2010/09/your-benchmarks-suck.html.
Don't be fooled by the article being on a Perl related blog,
the concepts on the post are applicable everywhere.
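For PostgreSQL, one way to measure inside the database itself, sketched from Django (the query is only an example, not the OP's exact one):

    from django.db import connection

    project_id = '...'  # placeholder

    # EXPLAIN ANALYZE runs the query and reports planning/execution time
    # as measured by PostgreSQL itself, excluding transfer and Python.
    with connection.cursor() as cursor:
        cursor.execute(
            'EXPLAIN ANALYZE SELECT * FROM todo_item '
            'WHERE project_id = %s AND deleted_at IS NULL',
            [project_id],
        )
        for (line,) in cursor.fetchall():
            print(line)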
--
Alceu Rodrigues de Freitas Junior
This may be a very basic thing to ask, but there is no harm in double-checking: are you using select_related instead of all() for those two specific Models?
Check out https://django-debug-toolbar.readthedocs.io/en/stable/index.html if you aren't using it yet; it will help you see that very easily (I'm assuming you have a view that lists all those instances for those Models).
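For example, something like this (assuming an Item model with a created_by foreign key, as the thread's queries suggest):

    # Accessing created_by triggers a separate query per item -> N+1:
    for item in Item.objects.all():
        print(item.created_by.username)

    # A single query with a JOIN on auth_user:
    for item in Item.objects.select_related('created_by'):
        print(item.created_by.username)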
On Sunday 11 June 2017 09:15:09 Miika Huusko wrote:
> Thanks for the response!
>
> Yep, timing with "time" is not the best way to compare SQL query
> times. The reason I added that "time" test is that on Stack Overflow
> I was asked to confirm that data transfer is not a factor here. I
> timed the SQL queries with "EXPLAIN ANALYZE" and "\timing" first.
> Those don't take data transfer into account, so "time" served as a
> quick test that data transfer is not the problem.
>
> About timing: as an example, I can reduce the problem to serializing
> only Items and related Photos. That results in only two queries. For
> example, a dataset of about 40k items produces:
>
> django.db.backends: (1.252) SELECT "todo_item"."version", *... all
> item properties ...* FROM "todo_item" WHERE ("todo_item"."project_id"
> = '...' AND "todo_item"."deleted_at" IS NULL);
>
> django.db.backends: (0.883) SELECT "photos_photo"."version", *... all
> item properties ...* FROM "photos_photo" WHERE
> "photos_photo"."note_id" IN (349527, 349528, 349529, *... and rest of
> the 40k IDs ...* );
Why did you opt to use prefetch_related and not select_related?
--
Melvyn Sopacua
On Sunday 11 June 2017 13:41:55 Miika Huusko wrote:
> @Melvyn Sopacua, I opted to use prefetch_related because I'm fetching
> "Items" that are the ForeignKey target for "Photos" and "Events".
> select_related is meant for cases where you have a OneToOne or
> ForeignKey relation (a single item) that you need to fetch. If I were
> fetching "Photos", I would use select_related to JOIN "Items" at the
> SQL level. Since I'm fetching "Items", the relations to "Photos" and
> "Events" are reverse-ForeignKey many-to-one sets, and I need to use
> prefetch_related.
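In code, the pattern being described looks roughly like this (the reverse related names are assumptions, not taken from the thread):

    # Three queries total: the items, then one IN-query each for the
    # related photos and events.
    items = Item.objects.prefetch_related('photo_set', 'event_set')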
You are right, in part.
But created_by is a foreign key on items, not the other way around.
So this:
> > django.db.backends: (0.001) SELECT "auth_user"."id", ...
> > everything
> >
> > ... FROM "auth_user" WHERE "auth_user"."id" IN (1, 2, ... some IDs
> > ...);>
can be avoided.
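A sketch of the combination, assuming the same names as above:

    # select_related folds the created_by user into the main query via
    # a JOIN, so the separate auth_user query above disappears; the
    # reverse relations still need prefetch_related.
    items = (
        Item.objects
        .filter(deleted_at__isnull=True)
        .select_related('created_by')
        .prefetch_related('photo_set', 'event_set')
    )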
Last but not least, this is why APIs use pagination. If for whatever reason you must provide such a complex model (instead of having the consumer make more API calls for photos they don't have yet), you paginate; but that's not very flexible and it scales badly.
The proper design pattern here would be to hash your photos, use the hash as the identifier in the API, and put the burden on the consumer to only request photos they don't have the ID for yet.
This is, for example, how Zendesk handles email attachments.
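For the pagination route, a minimal sketch of DRF's built-in page-number pagination (the page size is an arbitrary example value):

    # settings.py
    REST_FRAMEWORK = {
        'DEFAULT_PAGINATION_CLASS':
            'rest_framework.pagination.PageNumberPagination',
        'PAGE_SIZE': 500,
    }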
--
Melvyn Sopacua
I'm guessing here that REST would not be appropriate for handling "large" amounts of data, and that's probably why Melvyn suggested using pagination.
If your customer needs the data for offline use, I bet it would be better to generate a file in the appropriate format, compress it, and redirect the request so the file is downloaded.
For that, you could skip the ORM entirely, or build an appropriate Model for it. I would stick with the former (and export the data straight from the DB with a stored procedure); I don't see you reusing this new Model elsewhere in your Django app.
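A minimal sketch of that export, assuming a hypothetical PostgreSQL set-returning function named export_project_items:

    import json
    from django.db import connection

    def export_items(project_id):
        with connection.cursor() as cursor:
            cursor.execute(
                'SELECT * FROM export_project_items(%s)', [project_id]
            )
            columns = [col[0] for col in cursor.description]
            rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
        return json.dumps(rows)  # write to a file and compress as needed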
On Sunday 11 June 2017 14:41:22 Miika Huusko wrote:
> The reason I'm providing all "Items" and other related stuff with one
> request is that I want to make the customer's (that's also me and my
> good friends) life easier. I want to keep all the difficult and heavy
> work on the server side. The customer needs all the data because of
> offline requirements, and would therefore need to loop through
> pagination or perform the same big JOIN if "Photos" and other related
> objects were requested separately. Doing that on mobile devices is
> going to take longer than on an Ubuntu server, even if the server
> takes longer than I would expect with Python + Django. That said,
> it's good to point out opinions on design patterns. I haven't listed
> all the background information behind these decisions.
Think about how Google Maps does this with offline datasets for areas you've marked. It caches them for x days, then prompts you to sync when you have a free moment.
It's not a burden in practice and makes life a lot easier. Sometimes we set requirements too harshly for ourselves.
> I'm not quite sure what you mean by "hash photos and use that as
> identifier".
Sorry, I wasn't complete, as it's been a while. I built an application that fetches Zendesk data into Django for trend analysis. Getting an attachment for a given message worked like this:
- get the message id: /messages/id
- get the attachment IDs: /messages/id/attachments (this is your reverse relation)
- for each one you don't have yet, fetch it: /files/id
Hashing came into play because corporate idiots - I mean people - attach their logo to each and every email they send. By hashing attachments, you eliminate duplicates and scale down your storage requirements and (query) data size.
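A minimal sketch of that hashing, assuming attachments arrive as binary file-like objects (the function name is invented):

    import hashlib

    def attachment_fingerprint(fileobj):
        """Content hash used as the attachment's identifier; identical
        logos hash to the same value, so duplicates are stored once."""
        h = hashlib.sha256()
        for chunk in iter(lambda: fileobj.read(8192), b''):
            h.update(chunk)
        return h.hexdigest()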
> What I could do - and I believe this is what you mean(?) - is to
> provide a list of "Photo" IDs (or hashes) for each "Item" and ask the
> customer to fetch "Photo" details separately with another request(s).
> Or - easier for the API - leave "Photo" IDs out of the "Item"
> information completely and ask the customer to make the JOIN with
> separately requested "Photos". The JOIN could be done with hashes or
> IDs. Either way, it would be the customer who makes the JOIN.
But it makes a lot more sense now, with the mobile and offline requirements, why you have set up the API the way you did. Two more thoughts for you:
- Use a cacheable dataset and incremental updates to be done "when wifi or better is available". Complex, but worth it in the long run.
- Ditch DRF serialization and get the queryset as values(), dumping it to JSON directly, thus eliminating both serializer objects and model objects (see the sketch below).
It's easy to test the performance gain of the second, especially if you can stream the JSON (as opposed to in-memory-then-flush).
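A minimal sketch of that second idea, streaming straight from values() (field names are assumed from the queries earlier in the thread):

    import json
    from django.http import StreamingHttpResponse

    def items_json(request, project_id):
        rows = (
            Item.objects
            .filter(project_id=project_id, deleted_at__isnull=True)
            .values('id', 'version')   # plain dicts, no model instances
            .iterator()                # don't hold the whole set in memory
        )

        def stream():
            yield '['
            first = True
            for row in rows:
                if not first:
                    yield ','
                first = False
                yield json.dumps(row)
            yield ']'

        return StreamingHttpResponse(stream(), content_type='application/json')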
--
Melvyn Sopacua