[GSOC] Weekly update

1,303 views
Skip to first unread message

Daniel Pyrathon

unread,
May 19, 2014, 1:56:06 PM5/19/14
to django-d...@googlegroups.com
Hi All,

Today I will be starting my weekly updates on my SoC project: refactoring Meta to a stable API. For anyone who missed out, you will be able to view it here: https://docs.google.com/document/d/1yp2_skqkxyrc0egdRv6ofnRGCI9nmvxDFBkCXgy0Jwo/edit

This week is the first official week of SoC. Me and my mentor (Russell) are initially approaching the work in the following way:
  • Document the existing Meta API
    For each endpoint, document the following:
      - Input parameters and return type
      - Caching pattern used
      - Where it's called from (internally and externally to Meta)
      - Why is it being called
      - When is it being called

  • Propose an initial refactor plan
    Once the documentation has been done, I should have a better idea of the current implementation. This will allow me to mock a proposed implementation that will be reviewed at my next update call, on Monday.
My next update will be posted on Friday, just to make sure the community is informed of my progress. For any major updates that require community approval, I will be creating separate threads.
My name on the internet is pirosb3, so if you want to have a chat about my progress feel free to contact me! The branch I am currently working on is https://github.com/PirosB3/django/tree/meta_documentation

Regards,
Daniel Pyrathon

Josh Smeaton

unread,
May 20, 2014, 9:25:45 AM5/20/14
to django-d...@googlegroups.com
Best of luck!

Daniel Pyrathon

unread,
May 23, 2014, 3:05:02 PM5/23/14
to django-d...@googlegroups.com
Hi all,

In the last days I have built a documentation of the current state of Options. Based on feedback and prototyping I have thought of a potential interface for _meta that can solve the issues currently present, such as redundancy (in code and in caching systems). The interface has also been thought to be maintainable and is a base that can be used to create custom meta stores.
Obviously this is far from perfect, It will need many iterations and maybe it is too complex. I would really love to gain as much feedback as possible so it can be discussed with Russell and the community on Monday.

I will be refining the document in the next days, I will also be publishing the docs on a webserver and will be linking a URL soon.

My proposal has been published here:
In the next days I will be iterating over the feedback gained, and based on one very interesting suggestion on IRC, I will try to see how my API syntax looks in modelforms.py.

As said previously, and feedback is greatly appreciated.

Hi from Pycon IT!

Daniel Pyrathon

Chris Beaven

unread,
May 24, 2014, 1:44:15 AM5/24/14
to django-d...@googlegroups.com
Hi Daniel,

The proposal looks interesting - I've only skimmed it so far but one question: you mention User.get_model() several times -- do you mean User.get_meta()?

Josh Smeaton

unread,
May 24, 2014, 4:37:49 AM5/24/14
to django-d...@googlegroups.com
Hi Daniel,

Nice work putting that document together. Is the meta document you put together the current API or is it the API you are proposing? If the latter, a few suggestions (and if others disagree, please point that out):

- Remove all mention of caching. That should be an implementation detail only, and not a requirement for other implementations.
- the *_with_model methods really rub me up the wrong way. I would prefer always returning the _with_model variant, and letting the caller discard the model if they don't need it.
- I'm not a fan of virtual and concrete fields, though I have to admit I'm not sure how they're different, especially in the context of different implementations.
- Not sure that m2m should be differentiated from related.
- init_name_map should be an implementation detail.
- normalize_together should be an implementation detail.

Regards,

Josh

Daniel Pyrathon

unread,
May 25, 2014, 7:19:11 PM5/25/14
to django-d...@googlegroups.com
Hi Chris,

Oh sorry about that! big typo over there. Modifying the gist.

Thanks,
Daniel Pyrathon

Daniel Pyrathon

unread,
May 25, 2014, 7:26:27 PM5/25/14
to django-d...@googlegroups.com
Hi Josh,

The meta API specified in the docs (https://github.com/PirosB3/django/blob/meta_documentation/docs/ref/models/meta.txt) is the current API. I have documented this in order to understand more of the current implementation and it will be good to show a comparison when a new meta API will be approved.

My current proposal (https://gist.github.com/PirosB3/371704ed40ed093d5a82) and it will be discussed tomorrow with Russell. I will post as soon as I have updates.

Daniel Pyrathon 

Daniel Pyrathon

unread,
May 25, 2014, 7:28:31 PM5/25/14
to django-d...@googlegroups.com
Hi All,

Just to make you know, I have put up the current _meta API documentation here:
As always, feel free to ask questions.

Daniel

Daniel Pyrathon

unread,
Jun 1, 2014, 12:10:53 PM6/1/14
to django-d...@googlegroups.com
Hi All,

An update on my side, some interesting work has happened this week: me and Russell have decided to start on the implementation early in order to understand better the internals of Options. Currently I am working on the following:

Providing one single function: get_fields(types, opts, **kwargs)
Previously we had identified a number of functions that contained similar data but had a different return type. We are going to provide one function that takes a set of field types + options and returns the same output everywhere: ((field_name, field), ...). This has the benefit of simplicity and is more maintainable than the previous approach.

TYPES = DATA, M2M, FK, RELATED_OBJECT, RELATED_M2M
OPTS = NONE, LOCAL_ONLY, CONCRETE, INCLUDE_HIDDEN, INCLUDE_PROXY, VIRTUAL

Providing two functions for retrieving details of a specific field
As specified in my previous document, in many parts of the code we just want to retrieve a field object by name, other times we have a field but need other metadata such as: owner (model_class), direct (bool), m2m (bool). We will provide two functions:

get_field(field_name) -> field_instance

get_field_details(field_instance) -> direct, m2m, (still to be defined)

While we still have not entirely defined what get_field_details will return, this will be done soon.


Building a test suite for the existing API
The new API development will be driven by a test suite that will compare the current (legacy) API with the new implementation. While return types will be different, we are asserting that all the correct fields and metadata are returned. Building a test suite means we can start implementing before a final API spec is finalised. It also means we can iterate faster and, from my perspective, I also understand a lot more of the current implementation. We are testing each combination of fields and options together.

My current implementation is visible here: https://github.com/PirosB3/django/compare/soc2014_meta_refactor

For any questions or suggestions, let me know.

Daniel Pyrathon

Daniel Pyrathon

unread,
Jun 6, 2014, 1:03:36 PM6/6/14
to django-d...@googlegroups.com
Hi All,

Based on the work done last week, this week I have worked on the following:

1) Covered the current _meta implementation of unittests
The current Options is not covered by unit tests at all, I have created the model_options test module containing one or more unit tests for each endpoint and usage. Each endpoint is tested with many different models and fields (data, m2m, related objects, related m2m, virtual, ..). Each unit test asserts equality in field naming and field type. Endpoints that return the model are also tested for equality.


2) Pulled in tests from soc2014_meta_unittests and tested the new implementation
The previous branch that I developed on contains the new API implementation, I have pulled in the tests from soc2014_meta_unittests and I have switched the old API calls to the new API calls (with a few adjustments). I have successfully made all tests pass even though I have found some design issues that need to be addressed in my next update call.


3) Created a new branch that maps the old implementation with the new
Today I started putting my new API in "production". This is obviously nowhere near a finalised version but it helps me spot some edge cases that are not in the unit-tests. Each issue found has been converted into a standalone unit-test and has been proved to fail.
Unfortunately, this made me realise of other design issues related to the new return type of get_fields -> (field_name, field), A decision will be taken on monday.

This branch is found here: https://github.com/PirosB3/django/tree/soc2014_meta_refactor_implementation as is for personal use only

For any questions, let me know!
Dan

Daniel Pyrathon

unread,
Jun 13, 2014, 1:53:41 PM6/13/14
to django-d...@googlegroups.com
Hi All,

This week's work was a follow-up of last week's work, with some new discoveries with regards to the new API:

1) Improved the current _meta implementation of unittests
I refactored the current meta unittest branch into a more compact version. I also added test cases for Generic Foreign Keys, RelatedField and more improvements on virtual fields. This section will actually be our first merge to master: a unit test suite for the current implementation.
2) Refactored the new _meta API spec
By implementing the new API, I found new redundancies in the current implementation that we can avoid in the new API spec:

1) Only return field instances
the current implementation has a common return pattern: (field_object, model, direct, m2m). After a few tests I realized that model, direct, m2m and field_name are all redundant and can be derived from only field_object. Since there are only 3-4 places that actually use direct and m2m it makes sense to remove this function from the new API spec.
Here I show how all the information can be derived form field_instance (https://gist.github.com/PirosB3/6cb4badbb1b8c2e41a96/revisions), I ran the entire test suite and things look good. The only issue I can see now is from a performance point of view.

2) Provide only 2 API endpoints

 - get_fields(types, opts) -> (field_instance, field_instance, ...)
 - get_field(field_name) -> field_instance

The rest is all (I might be very wrong! don't trust me) redundant. By continuing with this iterative approach, I will soon find out if I am correct or not ;)
For any questions, as usual, let me know
Dan

Aymeric Augustin

unread,
Jun 13, 2014, 4:11:41 PM6/13/14
to django-d...@googlegroups.com
I like this simplification a lot. I hope you can make it work.

Just check that you don’t "overload" fields, by storing in field objects information that doesn’t belong there.

-- 
Aymeric.



-- 
You received this message because you are subscribed to the Google Groups "Django developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/78f45c61-9626-4aa9-a854-64b11ba44572%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Daniel Pyrathon

unread,
Jun 21, 2014, 5:57:27 AM6/21/14
to django-d...@googlegroups.com
Hi All,

This week I have done the following

- Openend PR and merged Options (_meta) unittests
This is a working test suite of model/options.py file. Is is currently testing the current Options implementation.

- Re-implemented test-suite in the new API
My new proposed API has been implemented on top of the Options unittests. This means new and old API give the same results! and therefore the new API is a working re-implementation.

 - Performance optimizations
Currently, the biggest issue I have had (in the new API) regards performance. I am doing small optimisations and benchmarking with cProfile.
If anyone has some good hints on how to benchmark single functions in Django please let me know!

Regards,
Dan


-- 
Aymeric.



To unsubscribe from this group and stop receiving emails from it, send an email to django-developers+unsub...@googlegroups.com.

Curtis Maloney

unread,
Jun 21, 2014, 8:14:56 PM6/21/14
to django-d...@googlegroups.com
Is there somewhere I can look at your work in progress code?


To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.

To post to this group, send email to django-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers.

Daniel Pyrathon

unread,
Jun 22, 2014, 5:49:05 AM6/22/14
to django-d...@googlegroups.com
Sure!


This contains the following:
 - Recently merged unit test suite for model/options
 - New API implementation: get_new_fields, get_new_field.
 - Implementation of recently merged unit test suite with new API
 - Small optimizations (BIG work in progress)

If you have any suggestions please let me know!

Dan

-- 
Aymeric.



To unsubscribe from this group and stop receiving emails from it, send an email to django-developers+unsubscrib...@googlegroups.com.

Daniel Pyrathon

unread,
Jun 22, 2014, 5:52:20 AM6/22/14
to django-d...@googlegroups.com
RE work in progress code

The main functions to optimize are get_new_fields, get_new_field, they are found here: https://github.com/PirosB3/django/blob/a92c37f0cad6bdd7a3b24ef235c81a7dab6bee72/django/db/models/options.py

Daniel Pyrathon

unread,
Jun 27, 2014, 2:09:58 PM6/27/14
to django-d...@googlegroups.com
Hi All,

This week I have been working on the following:

- Made all Django unit test suite pass with the new API
The entire Django test suite is passing with the new refactored API. This means we (should) have a fully working version. I have tested with not problems at all with the following:
 - Django unit test suite
 - new Django projects

- Optimized new API
As said in my last post, there have been big issues regarding performance with the new API. This week I managed to decrease the cumulative time by more than half (thanks to the help received on IRC!). Optimisation made mi also find out potential bottlenecks and issues, that have been successfully resolved.

Unfortunately, the current API is still not as efficient as the current one, and that is because we are providing less endpoints and this makes it harder to cache single parts. Also, a lot of latency will be resolved by replacing the new API with the old one throughout the system (instead of just binding one call to the other).

I have a couple of ideas I would like to reason this weekend and, by chatting with Russell, we will come up with a better optimisation plan. Obviously, any core decisions will be posted here.


Dan

Daniel Pyrathon

unread,
Jul 5, 2014, 5:27:31 AM7/5/14
to django-d...@googlegroups.com
Hi All,

This week I have been working on the following:

- Optimisations on the new API
As we now have a working implementation, I have been optimising the API to improve performance while still maintaining a good design.
Yesterday I have been working on benchmarks that prove speed is similar to the old API. Unfortunately there still has not been a major speedup compared to the current version (which I am very upset of) but at least we have a newly designed (and soon documented) API that people will be able to officially use.

With regards to optimisation:
To all the community: If you have a second, please review my new API (models/options). You might be able to spot some interesting optimisations that I have not thought of currently. Comment on the pull request any ideas you might find helpful.

- Benchmarking the new API
One day was spent on benchmarking performance. Results show very little difference compared to the old API. I have been using my own fork of Djangobench where I have introduced median measure (to avoid random CPU peaks or swapping peaks) PR https://github.com/django/djangobench/pull/19.
I have collected 10 runs of each benchmark and, once results have been formatted in a nice way, I will post a link for everyone to see.

- New API Naming
Part of this week was also to propose an upgrade and deprecation plan for the new API. Most of the discussion happened on IRC with Tim Graham and on my PR https://github.com/django/django/pull/2874
The proposed decision is the following:

the new API naming will be:
get_fields(types, opts) -> (field, field, ...)
get_field(field_name, include_related=False, **kwargs) -> field
There already is a get_field, so there are backwards compatibility concerns. As the two functions do the same thing, we will keep the function and raise a warning if old parameters (such as many_to_many) are used.

- API access:
As the new API will be public, I propose a public method get_meta() on models.Model to officially access _meta.

As always, let me know for any questions and please look at the PR
Dan

Josh Smeaton

unread,
Jul 5, 2014, 6:11:07 AM7/5/14
to django-d...@googlegroups.com
Excellent work, well done. I have a couple of questions/ideas though.

1. The use of bit flags irks me. Would it be possible just to use numerous keyword arguments to the new get_fields ?
2. Since you've reduced the API to effectively two functions (get_fields, get_field), is it still necessary to contain those methods on an inner object? Can't the new methods live directly on the model? If they can't be put directly on the model for some reason, I would consider a better name than "meta". I can't think of a reason that the name meta should be used other than history. If you were to think about the purpose of the API without the associated history, what would you call it?

log = Log.objects.get(pk=1)
log.get_fields(concrete=True, m2m=True, related=True)
log.get_field('log_level') # we're being explicit about the field name, no need for options

Thoughts?

I'll take a better look at the PR and see if I can offer anything with regards to your performance optimisations.

Josh 

Curtis Maloney

unread,
Jul 5, 2014, 6:38:00 AM7/5/14
to django-d...@googlegroups.com
Can I ask as a favour to future works that you record all the performance tuning approaches you use?

Many of these will be obvious to seasoned Python programmers, but some of them won't be, and it would be nice to have a catalogue of "things to look out for" to apply to critical paths in Django.

--
Curtis



--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers.

Daniel Pyrathon

unread,
Jul 11, 2014, 10:53:42 AM7/11/14
to django-d...@googlegroups.com
Hi Josh,

Thanks to your advice.


On Saturday, July 5, 2014 12:11:07 PM UTC+2, Josh Smeaton wrote:
Excellent work, well done. I have a couple of questions/ideas though.

1. The use of bit flags irks me. Would it be possible just to use numerous keyword arguments to the new get_fields ?

The new API dosen't use bit flags anymore, it has all been refactored out this week. It is much better now.
 
2. Since you've reduced the API to effectively two functions (get_fields, get_field), is it still necessary to contain those methods on an inner object? Can't the new methods live directly on the model? If they can't be put directly on the model for some reason, I would consider a better name than "meta". I can't think of a reason that the name meta should be used other than history. If you were to think about the purpose of the API without the associated history, what would you call it?

That's an interesting point! Do you think this is feasible? how many people will want to look fields up? I expect very few people using this API (only people that know what they are doing), but I might be very wrong.
Do we want to discuss this on a separate thread? I know Russell would like to write his own opinions too.

Daniel Pyrathon

unread,
Jul 11, 2014, 10:55:49 AM7/11/14
to django-d...@googlegroups.com
Hi Curtis,

Of course! I will be happy to open a document with my performance tuning experience. Said this, I am very far away from being a "performance master" and I am still looking at places where my code can improve. It would be nice if it was a wiki page, where everyone can share and correct. Collective knowledge!
Where would you like me to write the document?

Daniel Pyrathon

unread,
Jul 11, 2014, 10:59:09 AM7/11/14
to django-d...@googlegroups.com
Hi All,

Following this week's work. I have made progress in optimisation and design of the API.

I have opened a wiki page that contains all:
- Main concepts
- Decision process
- Benchmarks
- API designs
- How you can help!


For anything, as usual, just let me know :)
Daniel

Daniel Pyrathon

unread,
Jul 18, 2014, 1:32:04 PM7/18/14
to django-d...@googlegroups.com
Hi All,

As usual, here are my updates:

- Formalised pull request, looking for a formal review
Last week I published the wiki page for my meta implementation. This week I have been working on improving internal code documentation, optimising a few bits and bobs, and added the formal documentation.
As usual, all code is available here: https://github.com/django/django/pull/2894.
Now it's time for a real full review of the code, as maybe a few concepts and terms will change but the code is formal and most things might not change.

- Published new benchmarks, and explanations for some slowdowns
The benchmarks for my new implementation are available on the PR (https://github.com/django/django/pull/2894). These benchmarks are taken from DjangoBench (all numbers are the median of 2000 runs of each test). You might see that, while overall performance has increased compared to the current implementation, there are some very odd results. This week I have spent some time investigating the reason for some of the major slowdowns and I have come to the following conclusion:
DjangoBench profiling is implemented by spinning up a separate process for each benchmark. (https://github.com/django/djangobench/blob/master/djangobench/main.py#L135). If you look at the chapter "Compute inverse relation map on first access" (https://code.djangoproject.com/wiki/new_meta_api#Computeinverserelationmaponfirstaccess) you will see that the new caching model is more expensive on startup, but is only done once per process. This means we pay a little more price on startup, but then the Options internals do much less work. Although I cannot say I am 100% correct (and I will be investigating further) I believe the slowdown on some benchmarks is related to this, and therefore it may be related more on the way it is benchmarked, rather than the benchmark itself.

- Started on my GMail Store
While I was waiting for review, I have started an implementation of my GMail store. The implementation is documented here (https://docs.google.com/document/d/1yp2_skqkxyrc0egdRv6ofnRGCI9nmvxDFBkCXgy0Jwo/edit#heading=h.wyxx4xxijubt) and will be my final deliverable.
So far, I have had some interesting results. With very little code, and the new Options API, I didn't have to modify a single file inside Django for my implementation to work. All code just overrides some of the main Django classes (Manager, Query, QuerySet).

Currently, I have managed to:
  • Browse my GMail threads through Django admin
  • View a single GMail thread
  • Read an entire email!
  • Use the Console to do everything said above!
I have attached a picture.


The next step is to improve my implementation. I will be (hopefully) soon releasing a GitHub project that you will be able to use to browse, read and send your mail through Django admin and forms!

- I WILL BE ATTENDING EUROPYTHON
I'd love to meet up with anyone attending EuroPython this week. I will obviously participate at the sprints but if you want to catch up before it would be great. It will be a great idea to discuss the new API together and get to know the community more.
I might also be doing a lightning talk on the new Options API, or the GMail store depending on progress.

For anything, as usual, let me know
Daniel Python

Raffaele Salmaso

unread,
Jul 19, 2014, 1:38:58 AM7/19/14
to django-d...@googlegroups.com
On Fri, Jul 18, 2014 at 7:32 PM, Daniel Pyrathon <pir...@gmail.com> wrote:
- Started on my GMail Store
While I was waiting for review, I have started an implementation of my GMail store. The implementation is documented here (https://docs.google.com/document/d/1yp2_skqkxyrc0egdRv6ofnRGCI9nmvxDFBkCXgy0Jwo/edit#heading=h.wyxx4xxijubt) and will be my final deliverable.
So far, I have had some interesting results. With very little code, and the new Options API, I didn't have to modify a single file inside Django for my implementation to work. All code just overrides some of the main Django classes (Manager, Query, QuerySet).

Currently, I have managed to:
  • Browse my GMail threads through Django admin
  • View a single GMail thread
  • Read an entire email!
  • Use the Console to do everything said above!
I have attached a picture.


The next step is to improve my implementation. I will be (hopefully) soon releasing a GitHub project that you will be able to use to browse, read and send your mail through Django admin and forms!
Pirosb3, just three words: really nice work!
 
--
| Raffaele Salmaso
| http://salmaso.org
| https://bitbucket.org/rsalmaso
| http://gnammo.com

Andre Terra

unread,
Jul 19, 2014, 4:42:55 PM7/19/14
to django-d...@googlegroups.com
On Sat, Jul 19, 2014 at 2:38 AM, Raffaele Salmaso <raff...@salmaso.org> wrote:
On Fri, Jul 18, 2014 at 7:32 PM, Daniel Pyrathon <pir...@gmail.com> wrote:
- Started on my GMail Store
(...)
Pirosb3, just three words: really nice work!
 

I absolutely agree! The possibilities are endless.

Congratulations on delivering such a major contribution to the framework! I am sure your work will be a key benchmark to future GSOC applicants.


Cheers,
AT

Daniel Pyrathon

unread,
Aug 3, 2014, 9:11:04 AM8/3/14
to django-d...@googlegroups.com
Hi All,

First of all Thank you SO MUCH for your comments. It's really nice to hear great feedback from the community.
These last two weeks I have been working on improving the existing API implementation and terminology. Here is an overview of the 3 main tasks:

1) get_fields should return a list instead of a tuple
Previously, get_fields would return a tuple. The main reason was related to memory and immutability. After discussing with Russell, we decided this endpoint should actually return a list as it is a more suited datastructure.
The API has changed to return a list on each API call, unfortunately this has led to a number of problems:

Manipulating lists inappropriately can cause cache errors
As lists can be manipulated, this can cause a series of issues. get_fields stores results in cache for efficiency, if the results are then manipulated on the "outside" this will directly manipulate the cache results stored. A solution to this is to always provide a "copy" of the cache, but after doing some cProfile I found out that copy is the second most expensive call throughout all operations.
As this is a public API, it brings us to a design decision: do we want to prioritise performance and document that fields results should be never manipulated or should we return a copy? There are pros and cons of each aspect, and I will be discussing with Russell tomorrow.

2) Move tree cache out of the apps registry
The main optimisation for the new API (described here https://code.djangoproject.com/wiki/new_meta_api#Mainoptimizationpoints) is currently storing information on the Apps registry. After discussing with Russell we agree this shouldn't be the correct place to keep this. A solution is to store related fields information separately on each model (https://github.com/PirosB3/django/pull/5/files#diff-98e98c694c90e830f918eb5279134ab9R275). This has been done, and all tests pass.
Unfortunately, there are performance implications with this approach: we are storing a new list on each model regardless of if it has related fields or not. I will be discussing with Russell about performance tomorrow.
Finally, for a better design, cache is now stored in a namedtuple called RelationTree.


3) Revisiting field types and options
Last week I opened this: https://groups.google.com/forum/#!topic/django-developers/s2Lp2lTEjAE in order to discuss with the community possible new field and option names. This week I have actually revisited the fields and options for get_field/s and I have come up with the following:

Removed concrete fields
Technically speaking, concrete fields are data fields without a column. The only concrete field in the codebase is ForeignObject. There are 2 classes that inherit from ForeignObject: GenericRelation and ForeignKey, but each of them fall into different categories (virtual and data). Given that ForeignObject is an internal structure, and given that Django only looks for concrete fields on very few occasions, it makes sense to remove include_concrete from get_fields options.

Removed include_proxy option
In only 2 parts of the codebase (deletion.py) we want to fetch all related objects for a model, including any relations to proxies of that model. This is used in order to perform a "bubble up" delete of relations (an example of this can be seen at https://github.com/django/django/blob/master/tests/delete_regress/tests.py#L209). Given that most developers will not need this option, and given that it can be easily computed from the outside, it makes sense to remove it from the get_fields options.

I think that by removing these two options, and by revisiting the fields terminology, we will come up with a simpler and clearer API.
Below is the new, revisited, API for get_fields:

        """
        Returns a list of fields associated to the model. By default will only search in data.
        This can be changed by enabling or disabling field types using
        the flags available.

        Fields can be any of the following:
        - data:             any field that has an entry on the database
        - m2m:              a ManyToManyField defined on the current model
        - related_objects:  a one-to-many relation from another model that points to the current model
        - related_m2m:      a M2M relation from another model that points to the current model
        - virtual:          fields that do not necessarily have an entry on the database (like GenericForeignKey)

        Options can be any of the following:
        - include_parents:        include fields derived from inheritance
        - include_hidden:         include fields that have a related_name that starts with a "+"
        """

Regarding virtual:
I have also passed a bit of time in understanding what the properties of virtual are and what they should be. A virtual field does not have a direct entry on the database, but uses the data of 1 or more other fields to create special, abstract, structures.
An example of this could be a "composite field" such as a Point2D:

class City(models.Model):
  x = models.FloatField()
  y = models.FloatField()
  position = Point2D('x', 'y')

position is a virtual field because it does not have any presence on the database and relies on the information of 'x' and 'y'.

Finally
I apologise for posting late, EuroPython was a really good conference and it was great to meet some of the core developers, such as Tom, Honza, Baptiste. I feel I worked 50% of my time, and I will recover in the future.
In the mean-time, I hope to make up for it with:

 - Latest benchmarks! (coming soon)
 - A picture of me with Baptiste, at the sprints!

Aymeric Augustin

unread,
Aug 3, 2014, 10:24:27 AM8/3/14
to django-d...@googlegroups.com
On 3 août 2014, at 15:11, Daniel Pyrathon <pir...@gmail.com> wrote:

1) get_fields should return a list instead of a tuple
Previously, get_fields would return a tuple. The main reason was related to memory and immutability. After discussing with Russell, we decided this endpoint should actually return a list as it is a more suited data structure.

Can you clarify the reasons for this choice? If it’s just the purity of using the “proper” data type, I’m not sure it beats the practicality of returning an immutable iterable and avoiding the cost of a copy or the risk of incorrect manipulation.

The bugs that would occur if the cached value is altered inadvertently could be very hard to track down.

2) Move tree cache out of the apps registry
The main optimisation for the new API (described here https://code.djangoproject.com/wiki/new_meta_api#Mainoptimizationpoints) is currently storing information on the Apps registry. After discussing with Russell we agree this shouldn't be the correct place to keep this.

+1

A solution is to store related fields information separately on each model (https://github.com/PirosB3/django/pull/5/files#diff-98e98c694c90e830f918eb5279134ab9R275). This has been done, and all tests pass.
Unfortunately, there are performance implications with this approach: we are storing a new list on each model regardless of if it has related fields or not. I will be discussing with Russell about performance tomorrow.

Is the extra cost incurred only at start up ? How large is it? If it isn’t too bad and and run time performance is unchanged, it could be acceptable.

3) Revisiting field types and options
Last week I opened this: https://groups.google.com/forum/#!topic/django-developers/s2Lp2lTEjAE in order to discuss with the community possible new field and option names. This week I have actually revisited the fields and options for get_field/s and I have come up with the following: (…)

The latest iteration looks clear, consistent and straightforward. I like it.

-- 
Aymeric.

Daniel Pyrathon

unread,
Aug 3, 2014, 10:48:04 AM8/3/14
to django-d...@googlegroups.com
Hi Aymeric,

Thanks for writing back


On Sunday, August 3, 2014 4:24:27 PM UTC+2, Aymeric Augustin wrote:
On 3 août 2014, at 15:11, Daniel Pyrathon <pir...@gmail.com> wrote:

1) get_fields should return a list instead of a tuple
Previously, get_fields would return a tuple. The main reason was related to memory and immutability. After discussing with Russell, we decided this endpoint should actually return a list as it is a more suited data structure.

Can you clarify the reasons for this choice? If it’s just the purity of using the “proper” data type, I’m not sure it beats the practicality of returning an immutable iterable and avoiding the cost of a copy or the risk of incorrect manipulation.

This is a design decision, tuple performs better. 
 

The bugs that would occur if the cached value is altered inadvertently could be very hard to track down.

 

2) Move tree cache out of the apps registry
The main optimisation for the new API (described here https://code.djangoproject.com/wiki/new_meta_api#Mainoptimizationpoints) is currently storing information on the Apps registry. After discussing with Russell we agree this shouldn't be the correct place to keep this.

+1

A solution is to store related fields information separately on each model (https://github.com/PirosB3/django/pull/5/files#diff-98e98c694c90e830f918eb5279134ab9R275). This has been done, and all tests pass.
Unfortunately, there are performance implications with this approach: we are storing a new list on each model regardless of if it has related fields or not. I will be discussing with Russell about performance tomorrow.

Is the extra cost incurred only at start up ? How large is it? If it isn’t too bad and and run time performance is unchanged, it could be acceptable.

Yes! only at start-up. Once cache warming has finished, everything should be as fast (maybe faster, given that we are accessing properties on the single Options instance).
 

3) Revisiting field types and options
Last week I opened this: https://groups.google.com/forum/#!topic/django-developers/s2Lp2lTEjAE in order to discuss with the community possible new field and option names. This week I have actually revisited the fields and options for get_field/s and I have come up with the following: (…)

The latest iteration looks clear, consistent and straightforward. I like it.

Thanks! Please feel free to comment with doubts or suggestions.
 

-- 
Aymeric.

Tom Christie

unread,
Aug 4, 2014, 8:25:50 AM8/4/14
to django-d...@googlegroups.com
>>>> 1) get_fields should return a list instead of a tuple
>>> Previously, get_fields would return a tuple. The main reason was related to memory and immutability. After discussing with Russell, we decided this endpoint should actually return a list as it is a more suited data structure.
>> Can you clarify the reasons for this choice? If it’s just the purity of using the “proper” data type, I’m not sure it beats the practicality of returning an immutable iterable and avoiding the cost of a copy or the risk of incorrect manipulation.
> This is a design decision, tuple performs better.

I'm confused - does it return a list or a tuple? Your previous answer mentioned that it was a list, but you later state "This is a design decision, tuple performs better.".

Given that it ought to be immutable, I can't see any reason for it *not* to be a tuple.

Great work on all of this!

All the best,

  Tom

Collin Anderson

unread,
Aug 4, 2014, 8:44:38 AM8/4/14
to django-d...@googlegroups.com
if we do really need a list, could we gain some performance by caching tuples and converting them to lists instead of caching lists?

Daniel Pyrathon

unread,
Aug 4, 2014, 10:14:39 AM8/4/14
to django-d...@googlegroups.com

Łukasz Rekucki

unread,
Aug 4, 2014, 10:54:45 AM8/4/14
to django-developers
On 4 August 2014 16:14, Daniel Pyrathon <pir...@gmail.com> wrote:
> Hi All,
>
> This has been resolved by using the ImmutableList datastructure
>
> https://github.com/PirosB3/django/blob/soc2014_meta_refactor_upgrade_flags_get_field_tree/django/db/models/options.py#L629
>

But why? What's the benefit over using a tuple? ImmutableList is not
even a list, because it inherits from tuple.

The only other use of this data structure I could find is in upload
handlers and the rationale is that the field suddenly changes from
mutable to immutable. I don't think your case is the same, as fields
are always immutable.

Also, if fields() is immutable, then so should concrete_fields(), etc.


>
> On Monday, August 4, 2014 2:44:38 PM UTC+2, Collin Anderson wrote:
>>
>> if we do really need a list, could we gain some performance by caching
>> tuples and converting them to lists instead of caching lists?
>
> --
> You received this message because you are subscribed to the Google Groups
> "Django developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to django-develop...@googlegroups.com.
> To post to this group, send email to django-d...@googlegroups.com.
> Visit this group at http://groups.google.com/group/django-developers.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/django-developers/f4795467-24c7-4a61-af78-1a5b1a16299d%40googlegroups.com.
>
> For more options, visit https://groups.google.com/d/optout.


--
Łukasz Rekucki

Russell Keith-Magee

unread,
Aug 5, 2014, 8:33:53 PM8/5/14
to Django Developers
On Tue, Aug 5, 2014 at 12:54 AM, Łukasz Rekucki <lrek...@gmail.com> wrote:
On 4 August 2014 16:14, Daniel Pyrathon <pir...@gmail.com> wrote:
> Hi All,
>
> This has been resolved by using the ImmutableList datastructure
>
> https://github.com/PirosB3/django/blob/soc2014_meta_refactor_upgrade_flags_get_field_tree/django/db/models/options.py#L629
>

But why? What's the benefit over using a tuple? ImmutableList is not
even a list, because it inherits from tuple.

Communication. 

From a purist theoretical perspective, there shouldn't be any argument - the data we're talking about is a list. Lists have homogeneous elements; Tuples have heterogeneous elements, but have *positional* homogeneity. A "Point" is a tuple, because element 0 of the tuple "means" the x coordinate. A Database row is a tuple - The first element is the primary key (an integer), second is the "name" column (a string), and so on.

A tuple is *not* "just an immutable list".

From a pragmatic perspective, tuples are faster to work with, because they aren't carrying around all the baggage needed to support mutability. And, in this case, there is a risk of an end-user accidentally mutating the result of get_fields(), which, due to cache-related optimizations, means there's a risk of side effects.

So - in this case, at a theoretical level, we are dealing with an list of homogeneous objects (fields) that must be either immutable, or copied every time it is used. Copying is a bit expensive (not *completely* prohibitive, but noticeable), so that means we need to look to immutability. At a theoretical I'm not wild about the fact that ImmutableList subclassing tuple - it should be subclassing *list* - but I'm willing to defer to pragmatism. Given that we're dealing with a relative internal detail that most users won't be exposed to, I'm willing to hold my nose and accept optimised solutions over theoretically pure solutions in the interests of *all* users having better performance.

Yours,
Russ Magee %-)

Collin Anderson

unread,
Aug 6, 2014, 8:14:01 AM8/6/14
to django-d...@googlegroups.com
Communication. 

From a purist theoretical perspective, there shouldn't be any argument - the data we're talking about is a list. Lists have homogeneous elements; Tuples have heterogeneous elements, but have *positional* homogeneity. A "Point" is a tuple, because element 0 of the tuple "means" the x coordinate. A Database row is a tuple - The first element is the primary key (an integer), second is the "name" column (a string), and so on.

A tuple is *not* "just an immutable list".
 
So maybe we should be using lists instead of tuples in our list_display docs? https://docs.djangoproject.com/en/1.6/intro/tutorial02/#customize-the-admin-change-list 

Ryan Hiebert

unread,
Aug 6, 2014, 8:39:07 AM8/6/14
to django-d...@googlegroups.com
No, the list_display is a template of a _row_ in the list display, which is properly represented as a tuple, just like a database row.

Daniel Pyrathon

unread,
Aug 6, 2014, 9:25:47 AM8/6/14
to django-d...@googlegroups.com
Hi Łukasz,

Sorry for getting back to you now. Thanks Russell for the comments.
As Russell says, it's more of a conceptual choice. My first implementation of this API was using tuples for the same
reason you said above. I think ImmutableList is a good compromise between list and tuple. It overrides a tuple (so it should be performant) and is still conceptually a list.
Is this clear? please let me know if you want me to explain further our decision on this.

Daniel

Tom Christie

unread,
Aug 6, 2014, 9:46:42 AM8/6/14
to django-d...@googlegroups.com
> This has been resolved by using the ImmutableList datastructure

Looks fine. Behaviourally it's the same as a tuple, but with more explicit errors if you attempt to modify it, which seems reasonable.

Given that `.get_fields()` and `.fields` return `ImmutableList`, presumably `.field_names`, `.concrete_fields` and `.local_concrete_fields` should do as well right?


> A tuple is *not* "just an immutable list".

Interestingly, that's a point where we'd differ. In my mental model a programming construct is defined *purely* by it's behaviour, with no room for any 'conceptual' or 'theroretical' difference - as far as I'm concerned an (unnamed) python tuple really *is* just an immutable python list.

I think any further conversation on that is out of the scope of this list, but I've heard both positions advocated, and figured it's an interesting point to note. :)

(And it does have the occasional impact on how we'd choose to write documentation examples etc...)

All the best & keep up the great work,

  Tom

Daniel Pyrathon

unread,
Aug 8, 2014, 12:37:06 PM8/8/14
to django-d...@googlegroups.com


On Wednesday, August 6, 2014 3:46:42 PM UTC+2, Tom Christie wrote:
> This has been resolved by using the ImmutableList datastructure

Looks fine. Behaviourally it's the same as a tuple, but with more explicit errors if you attempt to modify it, which seems reasonable.

Given that `.get_fields()` and `.fields` return `ImmutableList`, presumably `.field_names`, `.concrete_fields` and `.local_concrete_fields` should do as well right?

Of course, I am pushing a new version of this now

Daniel Pyrathon

unread,
Aug 9, 2014, 10:10:31 AM8/9/14
to django-d...@googlegroups.com
Hi All,

This week I have been working on mainly 3 tasks:

- Formalizing the PR
2 days were dedicated on fixing all the small issues on the PR. I am now ready for another review.

- Simplified the API for get_field
There are 3 used combinations of get_field() in the entire Django codebase

  • get_field(data=True, m2m=True) > 60% of the occurrences
  • get_field(data=True, m2m=True, related_objects=True, related_m2m=True, virtual=True) > 39.9% of the occurrences
  • get_field(data=False, m2m=True) once only! (0.01% of occurrences).

The new API replaces the first 2 cases by generalizing: get_field(field_name, include_related=False, **kwargs).
The API is still 100% backwards compatible with the old API, until Django 2.0.

Benefits:

  • better and simpler API
  • better memory management (caching)
The mailer
As my final deadline is near, I continued working on the mailer. The mailer is a custom store that allows Django admin and forms to interact with the GMail API.
I have published the source code for this here: https://github.com/PirosB3/django-mailer/
While the code is still "a big hack", I aim on getting a PoC fully working and then do some refactoring. This week I managed to get relations between custom models working, all using the new Options API! The result is I can now browse my mail through the admin panel!

- Help me get people excited about this!
The new Options API works, and allows developers to create custom stores. This will allow us to create official NoSQL stores, SqlAlchemy stores, and even a IMAP store. I think it's a great feature that has the potential to make Django more decoupled and more powerful.
Said this, it still needs loads of work. And I am sure the API can still improve. Let's get more people aware of the new possibilities and help me make a better API:


Thanks!
Daniel

Daniel Pyrathon

unread,
Aug 14, 2014, 1:57:12 PM8/14/14
to django-d...@googlegroups.com
Hi All,

Since last week, the new API has changed: we went from 5 field types to 9 field types. Below is a list of all field types, the bold ones are the ones that have just been added.

 * pure data (CharField etc)
 * relation data (FK)
 * pure virtual (Point)
 * relation virtual (GFK)
 * m2m (badly named; nothing at present, but possibly in a document store)
 * relation m2m (badly named; M2M fields)
 * related object (Reverse FK) 
 * related m2m (Reverse M2M)
 * related virtual (GenericRelation)

More information on the new API can be seen on my documentation server at: http://5.101.98.142:49156/ref/models/meta.html#module-django.db.models.options


I will be posting again as soon as benchmarks finish. In the meantime it would be great to have some feedback.

Thanks,
Daniel 

Collin Anderson

unread,
Aug 14, 2014, 2:58:42 PM8/14/14
to django-d...@googlegroups.com
Summarizing the conversation I just had with Daniel in IRC:

What if we thought about just documenting the high-level api, like what ModelForms uses, and not necessarily document what the django low levels: ORM, migrations, etc would need to use.
Could we get away with not documenting _meta.get_fields(**kwargs) and just document these cached attributes:
_meta.data_fields (currently _meta.concrete_fields)
_meta.m2m_fields (currently _meta.many_to_many)
_meta.virtual_fields
_meta.related_objects
_meta.related_m2m
_meta.get_field('field_name') (no kwargs)
_meta.field_names
The question then is: is this a good API.
You can chain those together or filter them all you want by hand, and deal with caching yourself.
In the high level API, you shouldn't care how a particular field is stored in the database. If the high level API really wants do distinguish between local_fields and parent_fields (even though it shouldn't care), it can check something like field.model = TheModel.

My thoughts on the proposal:

It seems to me m2m _is_ a virtual field, and in the high-level API, I think we should try to avoid whether or not a particular field is virtual or not. What's virtual for one database could be native in a different database.
I don't have a clear picture in my head of how to split up the fields, and I wonder if just exposing a _meta.fields (which includes data + m2m + virtual) and instead having attrs on fields to describe themselves, like field.rel, field.db_column, field.cant_be_used_until_after_model_has_an_id, field.rel.is_m2m, etc.

Collin Anderson

unread,
Aug 14, 2014, 3:05:05 PM8/14/14
to django-d...@googlegroups.com
Also, I still believe that just because "class Meta:" stole the word "meta", doesn't mean we should take what it should really be called ("options") and put it on this. Can we call the two "meta options api" and "meta fields api"? or "meta schema api".
Reply all
Reply to author
Forward
0 new messages