Good morning!
My name is Jason Ledbetter and I'm a student hoping to participate in
Summer of Code 2008 through the Django team. I thought I should post
here first to introduce myself and throw out some of the ideas I have
baking to make sure I'm not a) replicating someone else's work, b)
completely on the wrong track or c) missing something obvious(ly bad)
about my design approach.
Just a smidge about me: I've been using Django for around 2 years to
design and maintain CMS tools for my company, Commonwealth
Underwriters. At 28 years old, I'm a little old to be a college senior
just getting my bachelor's degree. I took the long way. ;)
Django appeals to me on a few levels: it's in python, which is the
only language other than C or Erlang that I enjoy using. It's designed
by journalists; I became a programmer after I dropped out of Ball
State's journalism school. There's an obvious obsession with doing
things the Right Way, even if that means scrapping a bunch of code
that's shown itself to be the Wrong Way.
On to the actual content! My first impulse for a project is to add a
series of tools, each somewhat small in scope, that could smooth out
certain aspects of development. I figure if I tackle four month-long
projects in somewhat diverse areas, I could learn more about the code
base and gain a broader range of experience.
(Don't let the style of any of my sample code scare you; when writing
project code I'll have a list of the Django style guidelines next to
my keyboard.)
Here's what I have so far:
1. Batch Test Data Creation
Unless I'm missing something, there's no clean/easy *pythonic* way to
generate sets of test data for a given application. A large part of
what I do at work models with a lot of little details, a lot of models
strung together in complex ways, and a need to scale when confronted
with large data sets and lots of connections.
Between testing the models themselves and testing the viability of
certain reporting methods, it's important to have a sizable data set
to get various levels of real-world performance metrics.
Granted, there are ways to generate the data either by hand-writing
python scripts or by taking it down to the database level. But part of
django's philosophy (as I see it) is to make simple, monotonous things
easy to write, read and execute.
On my CMS I store report definitions as django models. I wrote a hard-
coded library to help me generate a given report in python code, so
that I can quickly create a given base set of reports on a new
installation. Using it looks like this:
http://python.pastebin.com/m2c9e1f17
The way this works is relatively simple: the outer "Report" is
actually a function, and everything else is an atom defined in my
easydefine.py. I'm not hung up on this particular syntax per-se, just
on the concept.
My code automatically links each model to the next one up on the tree
(e.g., each page is linked to the report). It also lazily links the
many-to-many objects together through refhandle and bound_page atoms.
I want to take this library and make it generic. Using introspection
(_meta) info, any given model could be registered with an 'easy
creator' system which could create the necessary atoms automatically.
Or batches could be implemented as a custom manager or part of the
default manager. Here's roughly what I'm envisioning:
http://python.pastebin.com/m365c09de
(I just noticed one of my sample fields conflicts with python reserved
words; ignore plz)
When you execute batch.run(), this would create 15 policies with a
random insured name pulled from an included dictionary of test human
names. Premium and class are randomly generated integers. Coverage
names are created by randomly pairing words from different lists; I'd
include a utilty object to easily create new sets of words to string
together which can be passed to the batch creator. The claim's loss
amount is set to the same amount every time; each setting atom will
accept standard python data objects.
Each policy would have one coverage which is automatically linked to
that policy; the batch creator will look for foreign keys which point
to the model one step up on the tree. The 5 claims made for each
coverage will also automatically link upward. In the case of multiple
keys to the surrounding object type, the batch creator will raise a
warning.
That's my idea in rough. As I mentioned earlier, I'm not hell-bent on
that particular syntax. I think the utility would be very useful,
however.
I'd make it easy to create other random dicts or other content
generators to pass a given attribute atom. Say, if someone wanted to
create a LorumIpsum() generator to fill a char field with lorum ipsum
sample text.
2. Type Coercion for Fields
Right now fields aren't enforced or coerced at all except to
(de)serialize for database communication or if a validation check is
run. I've yet to find a consistently implemented way to make sure the
value I pull from a given object is in the form that the given field
expects.
For example, using our above models, if I have a policy and I set its
premium to the string '100000', then it'll save fine. But if while
iterating through its fields, if I run a
field.to_python(field.value_from_object(policy)) then it doesn't
return int('100000'), it just returns '100000'. Granted, for complex
types like Decimal the to_python method works fine.
The reason I'm asking is, I don't understand why all fields don't
enforce such coercion. It'd be useful for certain tools, such as (for
example) a changelog that needs to compare the current values assigned
to an object instance versus the last known good values of said object
instance. If you don't make sure that the two values are of the same
basic type, the comparison won't be valid.
I'd be lying if I claimed I knew all the issues involved off-hand. So
on this one, I'm looking for more information and to find out any work
to get us from where I think I am to where I'd like to be.
Which brings me to my next idea.
3. The Union Field
Referenced code:
http://python.pastebin.com/m606219e5
In a few situations (like the above change log) I've known that I want
to store a given value but it's impossible to determine until run time
what sort of value I want to store. If we have a changelog object like
the first model then we have to coerce to and from a string value in
order to do the proper comparison. Which isn't terrible, but it tends
to fill code with a lot of ugly double-checking, which is boilerplate,
which violates DRY.
What if instead we could use the second model's method? This would
work a lot like generic relations. Now when each change object is
saved, the value types can be set to a field type on the fly.
old_value and new_value could be properties which handle the coercion
to and from string values to python types automatically. It could also
be a property that automatically contributes the correct field type to
the object upon instancing.
One thing I haven't decided: if it's implemented as a property,
setting the value to a given python object type could automatically
set the value_type field. So if you run "ch.old_value =
datetime.date(2006, 1, 1)" then it'd automatically set the value_type
to be DateField.
I have other ideas that I'd like to run by you guys but I'll stop here
for now; I don't want my first post to be a novella. In general, I
thought that four fairly hefty and useful development tools written
one per month could make for a pretty successful Summer of Code. This
is my idea for two such tools.
Thanks in advance for your feedback. If I'm totally on the wrong
track, slap me onto the right one!
- Jason Ledbetter