Built-in form fields: how "valid" should they require data to be?

James Bennett

Mar 31, 2007, 11:30:20 AM

to django-d...@googlegroups.com

I was looking over some of the things in localflavor this morning, and
noticed that for a couple of countries there are already pretty good
validation routines for social security numbers (or locally-named
equivalent), which got me thinking I'd write one for US SSNs.

So my question, from a Django design perspective, is how much
validation built-in fields should do. The Finnish localflavor, for
example, validates check digits on its SSN field[1], and I'd be happy
to sit down and work out the logic for "fully" validating a US SSN
(rejecting reserved groups and invalid combinations, etc.), but wanted
to make sure this was the preferred method before going forward.

Additionally, in the case of the US SSN, the "valid" numbers do change
occasionally, since the Social Security Administration can choose to
allocate previously unused blocks of numbers (right now, a number
starting with any group higher than 799 is invalid, and probably will
be for a while, but eventually those numbers will come into use); if
we try to validate as much as possible, should that include rejecting
currently-unused blocks and adding them in if/when the SSA decides to
put them into service?

Another area where this is likely to come up is credit-card numbers --
if we ever ship a validator for those, will it just verify the number
of digits, or should it also know how to examine the number for
tentative validity?

[1] http://code.djangoproject.com/browser/django/trunk/django/contrib/localflavor/fi/forms.py

--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."

chris....@gmail.com

Mar 31, 2007, 12:06:19 PM

to Django developers

This is a good question. You could also ask a similar question for
zip codes and phone numbers. For instance, you could theoretically try
to validate that the zip code corresponds to the correct state which
corresponds to the correct area code in a phone number.

Maybe what we should have is the ability to have progressive levels of
validation:
1. Basic field validation of format (e.g. xxx-yy-zzzz for SSNs, or
xxxxx for ZIP codes)
3. Correct data based on other fields. For instance, given a zip code
and state, we can validate that they match up.

As far as SSNs, we should validate format and maybe some versions
that are obviously wrong (000-00-0000, etc) but otherwise I think if
it is too smart, then it gets to be a pain to manage. If I really
need to collect SSN's or some other data there will probably be some
sort of validation I need to do on my end after I get the data.

I do think Luhn checksums for credit cards would be a nice thing to
add. I do have one in Satchmo here -
http://www.satchmoproject.com/trac/browser/satchmo/trunk/satchmo/shop/views/utils.py

and would be happy to add a patch if folks would like to see it.

Also, I've created the Barnum project to generate fake but accurate
data. It's sort of the reverse of what you're trying to do but I do
have a large file of US Zip Codes corresponding to cities and states.
You can see all the code here-
http://barnum.googlecode.com/svn/trunk/

-Chris

> [1]http://code.djangoproject.com/browser/django/trunk/django/contrib/loc...

James Bennett

Mar 31, 2007, 3:56:23 PM

to django-d...@googlegroups.com

On 3/31/07, chris....@gmail.com <chris....@gmail.com> wrote:
> As far as SSNs, we should validate format and maybe some versions
> that are obviously wrong (000-00-0000, etc) but otherwise I think if
> it is too smart, then it gets to be a pain to manage. If I really
> need to collect SSN's or some other data there will probably be some
> sort of validation I need to do on my end after I get the data.

Yeah. I feel like the best option is sort of a compromise here -- in
the case of SSNs, there are some combinations which will always be
invalid (e.g., any block of all zeroes, any "666" area number, and
anything in the reserved blocks used for advertising/explanation).

I'm just not sure whether we want even that amount of logic, or if we
want to just verify that it's a 9-digit number with optional dashes
and correct grouping.

jkoch...@gmail.com

Mar 31, 2007, 4:35:10 PM

to Django developers

On Mar 31, 2:56 pm, "James Bennett" <ubernost...@gmail.com> wrote:

> On 3/31/07, chris.moff...@gmail.com <chris.moff...@gmail.com> wrote:
>
> > As far as SSNs, we should validate format and maybe some versions
> > that are obviously wrong (000-00-0000, etc) but otherwise I think if
> > it is too smart, then it gets to be a pain to manage. If I really
> > need to collect SSN's or some other data there will probably be some
> > sort of validation I need to do on my end after I get the data.
>
> Yeah. I feel like the best option is sort of a compromise here -- in
> the case of SSNs, there are some combinations which will always be
> invalid (e.g., any block of all zeroes, any "666" area number, and
> anything in the reserved blocks used for advertising/explanation).
>
> I'm just not sure whether we want even that amount of logic, or if we
> want to just verify that it's a 9-digit number with optional dashes
> and correct grouping.

If you *do* put in that much logic, something like a strict=False
argument to the constructor would be a good idea. False should
possibly even be the default.

Joseph

Malcolm Tredinnick

Mar 31, 2007, 11:12:07 PM

to django-d...@googlegroups.com

Can you elaborate on the logic behind this request? These are meant to
validate the fields, right? So you are asking for validation that
doesn't validate.

Given how easy it is to write custom cleaning functions, I'd rather we
shipped reasonably correct versions and if people wanted less strict
constraints, they can write their own.

Regards,
Malcolm

Malcolm Tredinnick

Mar 31, 2007, 11:29:00 PM

to django-d...@googlegroups.com

On Sat, 2007-03-31 at 10:30 -0500, James Bennett wrote:
> I was looking over some of the things in localflavor this morning, and
> noticed that for a couple of countries there are already pretty good
> validation routines for social security numbers (or locally-named
> equivalent), which got me thinking I'd write one for US SSNs.
>
> So my question, from a Django design perspective, is how much
> validation built-in fields should do. The Finnish localflavor, for
> example, validates check digits on its SSN field[1],

The Norwegian one can tell if you're male or female. There's some
sort of features arms race going on between our Nordic contributors. :-)

> and I'd be happy
> to sit down and work out the logic for "fully" validating a US SSN
> (rejecting reserved groups and invalid combinations, etc.), but wanted
> to make sure this was the preferred method before going forward.
>
> Additionally, in the case of the US SSN, the "valid" numbers do change
> occasionally, since the Social Security Administration can choose to
> allocate previously unused blocks of numbers (right now, a number
> starting with any group higher than 799 is invalid, and probably will
> be for a while, but eventually those numbers will come into use); if
> we try to validate as much as possible, should that include rejecting
> currently-unused blocks and adding them in if/when the SSA decides to
> put them into service?

Don't go overboard, would be my suggestion. The pain with having to
retrofit a bunch of existing production products because numbers in the
"for future expansion" range started being used is non-trivial. I've
played this game in banking systems. It is the opposite of fun, because
you have to upgrade quickly and very carefully at the same time. It's
not like you're going to notice the problem much in advance of somebody
complaining that their number doesn't work.

> Another area where this is likely to come up is credit-card numbers --
> if we ever ship a validator for those, will it just verify the number
> of digits, or should it also know how to examine the number for
> tentative validity?

As you can see from the Satchmo code Chris posted, even just "number of
digits" is non-trivial, because it varies by card type. That also gives
an example of why being too prescriptive can hurt: Satchmo's card
validation looks pretty correct to me (I have some domain experience
here), but there's at least one major credit card, used throughout
Australia, that would be needlessly rejected by their system because
it's not in the whitelist. That's always going to be a problem with
being too strict.

I think Django should err on the side of permissiveness a little bit in
cases like this. It takes five minutes to write your own, more
restrictive data cleanser function if you want something stricter.

Cheers,
Malcolm

Malcolm Tredinnick

Mar 31, 2007, 11:50:30 PM

to django-d...@googlegroups.com

On Sun, 2007-04-01 at 13:29 +1000, Malcolm Tredinnick wrote:
[...]

> As you can see from the Satchmo code Chris posted, even just "number of
> digits" is non-trivial, because it varies by card type. That also gives
> an example of why being too prescriptive can hurt: Satchmo's card
> validation looks pretty correct to me (I have some domain experience
> here), but there's at least one major credit card, used throughout
> Australia, that would be needlessly rejected by their system because
> it's not in the whitelist. That's always going to be a problem with
> being too strict.

By the way, this isn't a dig at Satchmo. You can't default to "accept" in
e-payments, so they have to use a whitelist. It was an example of the
drawbacks of the (necessary in this case) whitelist approach.

Cheers,
Malcolm

Adrian Holovaty

Apr 1, 2007, 1:11:33 AM

to django-d...@googlegroups.com

On 3/31/07, James Bennett <ubern...@gmail.com> wrote:
> Additionally, in the case of the US SSN, the "valid" numbers do change
> occasionally, since the Social Security Administration can choose to
> allocate previously unused blocks of numbers (right now, a number
> starting with any group higher than 799 is invalid, and probably will
> be for a while, but eventually those numbers will come into use); if
> we try to validate as much as possible, should that include rejecting
> currently-unused blocks and adding them in if/when the SSA decides to
> put them into service?

This is a good question. It seems like the policy should be: "Include
as strict of validation as possible, without being so strict that the
validation may have to be frequently loosened in the future."

So, for the case of U.S. ZIP codes, it should validate that it's a
five-digit number, because it's doubtful that the ZIP code system will
change to six digits any time soon (ignoring ZIP+4). But it shouldn't
actually validate the numbers themselves, because new ZIP codes get
added occasionally.

Another example -- and sorry for being American-centric here -- is
U.S. state abbreviations. That validator checks not only that the
abbreviation is two letters, but that it's one of the valid
state/territory abbreviations. States/territories don't get added very
often, so it's worth the extra level of validation in this case.

How does that sound as a policy?

Adrian

--
Adrian Holovaty
holovaty.com | djangoproject.com
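
To make the proposed policy concrete, here is a minimal Python sketch of the two cases Adrian describes: a ZIP code checked only for its five-digit shape, and a state abbreviation checked against a fixed list. The names clean_zip, clean_state and STATE_ABBREVIATIONS are hypothetical and are not taken from localflavor, and the abbreviation set is a deliberately tiny subset for illustration.

```python
import re

# Shape only: five ASCII digits. Which ZIP codes actually exist is not checked,
# because new ones get added occasionally.
ZIP_RE = re.compile(r"[0-9]{5}")

# Illustrative subset only (an assumption for this sketch); a real list would
# cover every state and territory.
STATE_ABBREVIATIONS = {"CA", "IL", "KS", "NY", "TX"}


def clean_zip(value):
    """Return the stripped ZIP code, or raise ValueError if it is not five digits."""
    value = value.strip()
    if ZIP_RE.fullmatch(value) is None:
        raise ValueError("Enter a five-digit ZIP code.")
    return value


def clean_state(value):
    """Return the upper-cased abbreviation, or raise ValueError if it is unknown."""
    abbreviation = value.strip().upper()
    if abbreviation not in STATE_ABBREVIATIONS:
        raise ValueError("Enter a valid state or territory abbreviation.")
    return abbreviation
```

The asymmetry is the point of the policy: the ZIP check should rarely need loosening, while the state list changes so seldom that maintaining it is cheap.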

James Bennett

Apr 1, 2007, 1:39:20 AM

to django-d...@googlegroups.com

On 4/1/07, Adrian Holovaty <holo...@gmail.com> wrote:
> How does that sound as a policy?

Sounds good to me. Based on that, I'll whip up a US SSN field, and I
think that following this it will:

1. Validate the number of digits (9).
2. Validate that the number is not one which is known to be
permanently invalid (there aren't many of these and they're easy to
test for).

And leave it at that. Sound good to everyone?

Also, @Malcolm: in theory, a US SSN validator could tell which
state/territory you were in when you applied for the number (which in
most cases will be the state/territory in which you were born), but I
don't think it needs to go that far; we'll let the Nordic SSN arms
race continue unchallenged ;)

Adrian Holovaty

Apr 1, 2007, 1:43:03 AM

to django-d...@googlegroups.com

On 4/1/07, James Bennett <ubern...@gmail.com> wrote:
> Sounds good to me. Based on that, I'll whip up a US SSN field, and I
> think that following this it will:
>
> 1. Validate the number of digits (9).
> 2. Validate that the number is not one which is known to be
> permanently invalid (there aren't many of these and they're easy to
> test for).
>
> And leave it at that. Sound good to everyone?

Sounds good. You've probably already thought of this, but it should
accept numbers with or without hyphens, and normalize it to the number
*with* hyphens.
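
The behaviour agreed on so far (nine digits, hyphens optional, permanently invalid numbers rejected, output normalized with hyphens) can be sketched roughly as follows. This is not the patch James uploaded; the function name clean_us_ssn and the error messages are made up for illustration. It rejects only the cases named in this thread (all-zero blocks and the 666 area number) and leaves out the reserved advertising numbers.

```python
import re

# Three groups of 3, 2 and 4 ASCII digits; each hyphen is optional.
SSN_RE = re.compile(r"([0-9]{3})-?([0-9]{2})-?([0-9]{4})")


def clean_us_ssn(value):
    """Return the SSN normalized to XXX-XX-XXXX, or raise ValueError."""
    match = SSN_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError("Enter a 9-digit SSN, with or without hyphens.")
    area, group, serial = match.groups()
    # Any block of all zeroes, and the 666 area number, is permanently invalid.
    if area in ("000", "666") or group == "00" or serial == "0000":
        raise ValueError("This SSN is permanently invalid.")
    # Deliberately no check on currently unallocated area numbers.
    return "%s-%s-%s" % (area, group, serial)
```

For example, clean_us_ssn("123456789") and clean_us_ssn("123-45-6789") both return "123-45-6789", while "666-12-3456" raises ValueError.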

Russell Keith-Magee

Apr 1, 2007, 2:28:37 AM

to django-d...@googlegroups.com

On 4/1/07, Adrian Holovaty <holo...@gmail.com> wrote:
>

> On 4/1/07, James Bennett <ubern...@gmail.com> wrote:
> > Sounds good to me. Based on that, I'll whip up a US SSN field, and I
> > think that following this it will:
> >
> > 1. Validate the number of digits (9).
> > 2. Validate that the number is not one which is known to be
> > permanently invalid (there aren't many of these and they're easy to
> > test for).
> >
> > And leave it at that. Sound good to everyone?

+1

> Sounds good. You've probably already thought of this, but it should
> accept numbers with or without hyphens, and normalize it to the number
> *with* hyphens.

+1

I noticed that the UK postal code will raise a validation error if you
don't type a space separating the two character groups. Brazilian
phone numbers are similarly affected. IMHO, this is something that the
cleaner should fix, not raise as an error. We should be liberal in
what we accept, conservative in what we save, and all that jazz.

Two other quick localflavor related things -

1) Is there any particular reason that the Brazilian validation
messages are in Portuguese, rather than i18n wrapped English?

2) Is there any reason not to normalize localflavor.usa to
localflavor.us (to match the 2 letter country code scheme used by
other flavors)?

Yours,
Russ Magee %-)
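
As an illustration of the "cleaner should fix it" idea for UK postcodes, a normalization step could look something like the sketch below. It assumes only that the inward code is always the last three characters; the function name normalize_uk_postcode is hypothetical, and real format validation would still have to run on the result.

```python
def normalize_uk_postcode(value):
    """Upper-case a UK postcode and put a single space before the last three characters."""
    # Drop all whitespace so "sw1a1aa", "SW1A 1AA" and " sw1a  1aa " look the same.
    compact = "".join(value.split()).upper()
    # Outward code is 2-4 characters and inward code is 3, so 5-7 in total.
    if not 5 <= len(compact) <= 7:
        raise ValueError("Enter a valid postcode.")
    return compact[:-3] + " " + compact[-3:]
```

With this in front of the existing check, a user who types "sw1a1aa" gets "SW1A 1AA" back instead of a validation error.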

Adrian Holovaty

Apr 1, 2007, 2:33:26 AM

to django-d...@googlegroups.com

On 4/1/07, Russell Keith-Magee <freakb...@gmail.com> wrote:
> 1) Is there any particular reason that the Brazilian validation
> messages are in Portuguese, rather than i18n wrapped English?

Good call -- those should be in English, I would think.

> 2) Is there any reason not to normalize localflavor.usa to
> localflavor.us (to match the 2 letter country code scheme used by
> other flavors)?

None of this stuff is documented yet, and it's still in flux, so I
think it's fine to rename "usa" to "us" in localflavor. I hadn't
expected such an explosion of, well, great local flavor when I first
created the package. :)

James Bennett

Apr 1, 2007, 2:35:32 AM

to django-d...@googlegroups.com

On 4/1/07, Adrian Holovaty <holo...@gmail.com> wrote:

> Sounds good. You've probably already thought of this, but it should
> accept numbers with or without hyphens, and normalize it to the number
> *with* hyphens.

Ha. Uploaded a patch right before I saw this; it's late and I didn't
think of it, but that's easy enough to do.

Malcolm Tredinnick

Apr 1, 2007, 2:35:20 AM

to django-d...@googlegroups.com

On Sun, 2007-04-01 at 14:28 +0800, Russell Keith-Magee wrote:
> On 4/1/07, Adrian Holovaty <holo...@gmail.com> wrote:
> >
> > On 4/1/07, James Bennett <ubern...@gmail.com> wrote:
> > > Sounds good to me. Based on that, I'll whip up a US SSN field, and I
> > > think that following this it will:
> > >
> > > 1. Validate the number of digits (9).
> > > 2. Validate that the number is not one which is known to be
> > > permanently invalid (there aren't many of these and they're easy to
> > > test for).
> > >
> > > And leave it at that. Sound good to everyone?
>
> +1
>
> > Sounds good. You've probably already thought of this, but it should
> > accept numbers with or without hyphens, and normalize it to the number
> > *with* hyphens.
>
> +1
>
> I noticed that the UK postal code will raise a validation error if you
> don't type a space separating the two character groups. Brazilian
> phone numbers are similarly affected. IMHO, this is something that the
> cleaner should fix, not raise as an error. We should be liberal in
> what we accept, conservative in what we save, and all that jazz.

Agreed.

>
> Two other quick localflavor related things -
>
> 1) Is there any particular reason that the Brazilian validation
> messages are in Portuguese, rather than i18n wrapped English?

Whoops, my bad. :-( I'd noticed that originally and then forgot about it
when I was going through things to commit. We should fix it. I'll get in
touch with the contributor.

> 2) Is there any reason not to normalize localflavor.usa to
> localflavor.us (to match the 2 letter country code scheme used by
> other flavors)?

One other thing we'll want to keep an eye on is how much duplication of
code goes in there. Up to now, I haven't noticed any great opportunity
for factoring out common pieces that will save space, but it might come
up in the future.

Regards,
Malcolm

Russell Cloran

Apr 1, 2007, 6:03:55 AM

to django-d...@googlegroups.com

Hi,

On 4/1/07, Malcolm Tredinnick <mal...@pointy-stick.com> wrote:

> One other thing we'll want to keep an eye on is how much duplication of
> code goes in there. Up to now, I haven't noticed any great opportunity
> for factoring out common pieces that will save space, but it might come
> up in the future.

I'm going to contribute some of the South African localflavour that I have around. South African ID numbers can be checked with a Luhn checksum ("the mod10 algorithm"), and I was wondering how that can best be packaged.

Should it just live in the forms.py for now?

Russell
--
echo http://russell.rucus.net/spam/ | sed 's,t/.*,t,;P;s,.*//,,;s,\.,@,;'

Malcolm Tredinnick

Apr 1, 2007, 6:41:48 AM

to django-d...@googlegroups.com

On Sun, 2007-04-01 at 12:03 +0200, Russell Cloran wrote:
[...]

> I'm going to contribute some of the South African localflavour that I
> have around. South African ID numbers can be checked with a Luhn
> checksum ("the mod10 algorithm"), and I was wondering how that can
> best be packaged.
>
> Should it just live in the forms.py for now?

Try this: create a django/utils/checksums.py file and put it in there.
It may be useful in more than just form validation (otherwise it could
go in newforms/utils.py). I suspect we might see some more weighted
checksums in the future, too (such as the Norwegian social security
number) and we can extract out the general algorithm into that file at
some point.

Regards,
Malcolm
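
For reference, the Luhn ("mod 10") check mentioned above is small enough to sketch in a few lines of Python. This is only an illustration of the algorithm, not the Satchmo code or an actual django/utils/checksums.py; the function name luhn_is_valid is an assumption.

```python
import re


def luhn_is_valid(candidate):
    """Return True if the digit string passes the Luhn (mod 10) checksum."""
    candidate = str(candidate)
    # Only plain ASCII digits are accepted; anything else fails outright.
    if re.fullmatch(r"[0-9]+", candidate) is None:
        return False
    total = 0
    # Walk from the rightmost digit; every second digit is doubled.
    for position, char in enumerate(reversed(candidate)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                # Same as adding the two digits of the doubled value.
                digit -= 9
        total += digit
    return total % 10 == 0
```

Because it takes a plain string of digits, the same helper could serve credit-card numbers and South African ID numbers alike, with each field handling its own length and formatting rules.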

Nebojša Đorđević

Apr 1, 2007, 7:36:14 PM

to Django developers

On 4/1/07 5:12 AM, "Malcolm Tredinnick" <mal...@pointy-stick.com> wrote:

> Can you elaborate on the logic behind this request? These are meant to
> validate the fields, right? So you are asking for validation that
> doesn't validate.

Well, I need this for the Serbian JMBG validation (something similar to SSN)
because there are numbers which are invalid when validated and still in use
(strange, don't ask me why ;)).

--
Nebojša Đorđević - nesh, ICQ#43799892, http://www.linkedin.com/in/neshdj
Studio Quattro - Niš - Serbia
http://studioquattro.biz/ | http://code.google.com/p/django-utils/
Registered Linux User 282159 [http://counter.li.org]

Malcolm Tredinnick

Apr 1, 2007, 7:41:48 PM

to django-d...@googlegroups.com

On Mon, 2007-04-02 at 01:36 +0200, Nebojša Đorđević wrote:
> On 4/1/07 5:12 AM, "Malcolm Tredinnick" <mal...@pointy-stick.com> wrote:
>
> > Can you elaborate on the logic behind this request? These are meant to
> > validate the fields, right? So you are asking for validation that
> > doesn't validate.
>
> Well, I need this for the Serbian JMBG validation (something similar to SSN)
> because there are numbers which are invalid when validated and still in use
> (strange, don't ask me why ;)).

You could never use the truly strict validation in this case. Remember,
these are things designed to be used on a website. So imagine you are
constructing a website that accepts JMBG entries. You have to accept all
in-use numbers (which I would argue are, by definition, valid). So a
validator that only used some particular algorithm and rejected certain
legally in-use numbers is not a validator at all, since it generates
false negatives. My point is that, in this case, there aren't two
possible settings, there is only one -- the other one doesn't accept
the right numbers.

Regards,
Malcolm

Nebojša Đorđević

Apr 1, 2007, 8:10:09 PM

to Django developers

On 4/2/07 1:41 AM, "Malcolm Tredinnick" <mal...@pointy-stick.com> wrote:

> You could never use the truly strict validation in this case. Remember,
> these are things designed to be used on a website. So imagine you are
> constructing a website that accepts JMBG entries. You have to accept all
> in-use numbers (which I would argue are, by definition, valid). So a

Well, these numbers are valid; the problem is that lots of them were
created manually, so some of them have incorrect checksum fields. :)

> validator that only used some particular algorithm and rejected certain
> legally in-use numbers is not a validator at all, since it generates
> false negatives. My point is that, in this case, there aren't two
> possible settings, there is only one -- the other one doesn't accept
> the right numbers.

(That was a quick response; while I was working on the validator, a strict
option seemed like a good idea.)

I agree. In my case I must accept invalid data, so strict=True/False is
not an option. I planned just to do some basic checks (without checksum
calculations) and, maybe, show a warning when an incorrect entry is detected.

OTOH, there are fields which can be validated in full so for them strict
validation is a must.
