Equivalent of multi-table JOIN (another post on reverse select_related)

Samuel Abels

Nov 6, 2017, 2:46:41 PM
to Django users
I am working on a reporting feature that allows users to query arbitrary models and fields and presents the result as a table. For example, consider the following object model:

Package
   |
   v
Device <- Component
   ^         ^
   |         |
   |---- Interface <---2--- Connection
            ^^^
           / | \
          /  |  \
         /   |   \
  Sampling  IP   Policy

(The arrows show the direction of a ForeignKey.)
To produce a report, I chose a three-step process:

1. The user interface returns a list of fields to be included in the report, such as

args = {'Device.hostname__contains': 'localhost', 'Package.name__icontains': 'unix', 'IP.address__contains': '192'}

2. Given the list of args, find the shortest path that connects all required models. For the example above, the result is a tuple:

path = Device, Package, Interface, IP

3. In theory, I could now perform the following SQL request:

SELECT * FROM myapp_device d
LEFT JOIN myapp_package pa ON pa.device_id=d.id
LEFT JOIN myapp_interface ifc ON ifc.device_id=d.id
LEFT JOIN myapp_ip ip ON ip.interface_id=ifc.id;

But of course, I want to avoid the raw SQL. I considered the following options:

- Using Device.objects.select_related() does not work, because Device has a 1:n relation to Package (and also to Unit), which Django's select_related() does not support.

- Using prefetch_related() does not work, because it prefetches everything, which is too much in our case (>100 million rows if a user queries on all tables), and it does not provide us with a total of the number of rows selected. In practice, I want to count(*) everything for displaying a total, and fetch only a subset, using LIMIT.

Our tests showed that the raw SQL query with LEFT JOIN is fast enough for production, regardless of what fields and objects are being queried. The craziest query I built took about 20 seconds, which is ok for what we are trying to do.

Any other options?

-Samuel
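
Step 2 of the post above is not spelled out in the thread. A minimal sketch of one way to approach it, assuming the ForeignKey arrows in the diagram are treated as an undirected graph and Device is taken as the root, is a breadth-first search whose root-to-model paths are merged. The names EDGES, bfs_parents and connecting_path are hypothetical, the models are plain strings rather than Django classes, and merging shortest paths is not guaranteed to give the smallest connecting tree for every graph.

```python
from collections import deque

# ForeignKey arrows from the diagram, written as (source, target) pairs.
EDGES = [
    ("Package", "Device"),
    ("Component", "Device"),
    ("Interface", "Device"),
    ("Interface", "Component"),
    ("Connection", "Interface"),
    ("Sampling", "Interface"),
    ("IP", "Interface"),
    ("Policy", "Interface"),
]


def build_graph(edges):
    """Build an undirected adjacency list from the edge pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


def bfs_parents(graph, root):
    """Return {node: parent} for a breadth-first tree rooted at root."""
    parents = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return parents


def connecting_path(graph, root, required):
    """Merge the shortest root-to-model paths, listing the root first."""
    parents = bfs_parents(graph, root)
    path = [root]
    for model in required:
        chain = []
        node = model
        # Walk towards the root until we reach a model already in the path.
        while node is not None and node not in path:
            chain.append(node)
            node = parents[node]
        path.extend(reversed(chain))
    return tuple(path)


graph = build_graph(EDGES)
path = connecting_path(graph, "Device", ["Device", "Package", "IP"])
print(path)
```

For the three models named in the example args, this yields Device, Package, Interface, IP, the same tuple as in step 2.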

Matthew Pava

Nov 6, 2017, 2:56:29 PM
to django...@googlegroups.com

Though it doesn’t directly answer your query, you might be interested in this package:

https://github.com/burke-software/django-report-builder

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users...@googlegroups.com.
To post to this group, send email to django...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/ec0e2d37-9b0f-4623-8f60-c6feeef0eb9e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Samuel Abels

Nov 6, 2017, 2:58:30 PM
to Django users
Thanks, I have seen that and plan to use it, but for this particular feature I need something more tailored.

-Samuel


Matthew Pava

Nov 6, 2017, 3:15:09 PM
to django...@googlegroups.com

Maybe you are expecting too much from the user interface. Shouldn’t you at least request from the user what primary object you are looking for? Explicit is better than implicit. Your example indicates that your primary object is a Device, but your UI dict gives no indication whatsoever that that’s what you want.

If you want to look for related fields in a model, check out the Model _meta API:

https://docs.djangoproject.com/en/1.11/ref/models/meta/

Ultimately, your Django ORM code would look like this for your example, if I follow your arrows correctly:

Device.objects.filter(hostname__contains='localhost', package__name__contains='unix', interface__ip__address__contains='192')

It would seem that you want your UI to pass in the primary model and with all other models how they are related to the primary model.
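
One way to picture that suggestion: a minimal sketch, assuming the UI supplies a mapping from each model name to its lookup prefix relative to the primary model. RELATION_PREFIX and to_filter_kwargs are hypothetical names, and the prefixes assume Django's default lowercase related query names, which the thread does not confirm for these models.

```python
# Hypothetical mapping: how each model is reached from the primary model (Device).
RELATION_PREFIX = {
    "Device": "",
    "Package": "package",
    "IP": "interface__ip",
}


def to_filter_kwargs(args, prefixes):
    """Turn 'Model.field__lookup' keys into lookups relative to the primary model."""
    kwargs = {}
    for key, value in args.items():
        model, lookup = key.split(".", 1)
        prefix = prefixes[model]
        kwargs[f"{prefix}__{lookup}" if prefix else lookup] = value
    return kwargs


args = {
    "Device.hostname__contains": "localhost",
    "Package.name__icontains": "unix",
    "IP.address__contains": "192",
}
kwargs = to_filter_kwargs(args, RELATION_PREFIX)
print(kwargs)
```

With Django available, the resulting dict could be passed on as Device.objects.filter(**kwargs).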


Samuel Abels

Nov 6, 2017, 3:21:58 PM
to Django users


On Monday, November 6, 2017 at 9:15:09 PM UTC+1, Matthew Pava wrote:

Maybe you are expecting too much from the user interface.  Shouldn’t you at least request from the user what primary object you are looking for?


The primary model is always the one that is closest to "device"; step 2 of the process already takes this into account and returns the path in the best order, with the primary model being returned first.
 

Ultimately, your Django ORM code would look like this for your example, if I follow your arrows correctly:

Device.objects.filter(hostname__contains='localhost', package__name__contains='unix', interface__ip__address__contains='192')


This would perform the right query, but would not provide me with a total. It would also be impossible to do paging, because slicing the result does not take into account that the LEFT JOIN multiplies the number of total rows.

-Samuel

Matthew Pava

Nov 6, 2017, 3:36:11 PM
to django...@googlegroups.com

Is it really that bad?  Maybe I’m missing something in your situation.  I use my own custom page_queryset function.  I never got around to looking at the built-in Django way of doing it.  I think there is a generic view that can do paging.

q = Device.objects.filter(hostname__contains='localhost', package__name__contains='unix', interface__ip__address__contains='192')
total_count = q.count()

from django.db.models import QuerySet


def page_queryset(qs, page, count_per_page):
    """
    :param qs: the queryset or list to slice
    :param page: the current page to get records from (1-based)
    :param count_per_page: how many items are part of each page
    :return: a tuple: the index of the last page, the sliced queryset
    """
    slice_begin = (page - 1) * count_per_page
    slice_end = slice_begin + count_per_page

    # QuerySet.count() issues a COUNT query; plain lists use len().
    if isinstance(qs, QuerySet):
        max_count = qs.count()
    else:
        max_count = len(qs)

    slice_end = min(slice_end, max_count)
    # Ceiling division, so an exact multiple does not add an empty last page.
    last_page = max(1, -(-max_count // count_per_page))
    return last_page, qs[slice_begin:slice_end]


Samuel Abels

Nov 6, 2017, 3:51:48 PM
to Django users
On Monday, November 6, 2017 at 9:36:11 PM UTC+1, Matthew Pava wrote:

Is it really that bad? Maybe I’m missing something in your situation. 


Ooooh, it isn't really. I incorrectly assumed that the query would behave as if it had an implicit DISTINCT(device.id). But it does in fact return exactly the right thing:

>>> q = Device.objects.filter(pk='localhost', package__name__contains='i', unit__ip__address__contains='1')
>>> q.count()
1257

Yeah, that solves the issue. Thanks a lot!

-Samuel
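
The behaviour described above, where a filter across a one-to-many relation yields one row per joined match rather than one row per device, can be illustrated without Django. This is a minimal sketch using plain SQL through the standard-library sqlite3 module, not the ORM; the table layout and sample rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device (id INTEGER PRIMARY KEY, hostname TEXT);
    CREATE TABLE package (id INTEGER PRIMARY KEY, device_id INTEGER, name TEXT);
    INSERT INTO device VALUES (1, 'localhost');
    INSERT INTO package VALUES (1, 1, 'vim'), (2, 1, 'git'), (3, 1, 'zip');
""")

# One device joined to three packages whose names all contain 'i'.
JOIN = "FROM device d JOIN package pa ON pa.device_id = d.id WHERE pa.name LIKE '%i%'"

# Counting joined rows gives the report total, one row per (device, package) pair.
joined_rows = conn.execute("SELECT COUNT(*) " + JOIN).fetchone()[0]

# Counting distinct devices is what an implicit DISTINCT would have produced.
distinct_devices = conn.execute("SELECT COUNT(DISTINCT d.id) " + JOIN).fetchone()[0]

# LIMIT pages over the joined rows, so the total and the pages stay consistent.
first_page = conn.execute(
    "SELECT d.hostname, pa.name " + JOIN + " ORDER BY pa.id LIMIT 2"
).fetchall()

print(joined_rows, distinct_devices, first_page)
```

Here the joined count is 3 while the distinct device count is 1, and the first page holds two of the three joined rows.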