As of python 3.7, the `datetime` class includes a `fromisoformat` method
which handles the happy path much faster than the regex based method used
by `parse_datetime`
The values it cannot parse, from the whole Django test suite, are given as
examples below (duplicates remain). For those cases (eg: the recently
handled #32727) it still requires going through the regex variant:
{{{
'2010-1-10 14:49:00'
'2006-10-25 14:30:45.0002'
'2014-09-23T22:34Z'
'2014-09-23T22:34Z'
'2014-09-23T28:23'
'2011-09-01T10:20:30Z'
'2012-4-9 4:8:16'
'2012-04-23T09:15:00Z'
'2012-4-9 4:8:16-0320'
'2012-04-23T10:20:30.400+02'
'2012-04-23T10:20:30.400-02'
'2012-04-23T10:20:30,400-02'
'2012-04-23T10:20:30.400 +0230'
'2012-04-23T10:20:30,400 +00'
'2012-04-23T10:20:30 -02'
'2012-04-56T09:15:90'
'2011-10-32 10:10'
}}}
The only difference in output as far as I can tell is that the `timezone`
instance attached to the `tzinfo` parameter sets the `_name` attribute to
the fixed offset (`str`) where the stdlib one doesn't set it, and thus
it's `None`. I think that's fine, but YMMV.
On the happy path, using the regex version of `parse_datetime` as is:
{{{
'2021-06-30 10:42:12.638719' -> 5.99 µs ± 88.3 ns per loop (mean ±
std. dev. of 7 runs, 100000 loops each)
'2008-04-02 12:00:00' -> 5.57 µs ± 50.3 ns per loop (mean ±
std. dev. of 7 runs, 100000 loops each)
'2011-11-04T00:05:23+04:00' -> 8.39 µs ± 98.1 ns per loop (mean ±
std. dev. of 7 runs, 100000 loops each)
'2011-11-04 00:05:23.283+00:00' -> 8.81 µs ± 93.5 ns per loop (mean ±
std. dev. of 7 runs, 100000 loops each)
}}}
vs using `fromisoformat`:
{{{
'2021-06-30 10:42:12.638719' -> 227 ns ± 0.935 ns per loop (mean ±
std. dev. of 7 runs, 1000000 loops each
'2008-04-02 12:00:00' -> 216 ns ± 2.41 ns per loop (mean ±
std. dev. of 7 runs, 1000000 loops each)
'2011-11-04T00:05:23+04:00' -> 266 ns ± 0.612 ns per loop (mean ±
std. dev. of 7 runs, 1000000 loops each)
'2011-11-04 00:05:23.283+00:00' -> 234 ns ± 2.1 ns per loop (mean ±
std. dev. of 7 runs, 1000000 loops each)
}}}
when combining both, such as:
{{{
def parse_datetime(value):
""" ... """
try:
return datetime.datetime.fromisoformat(value)
except (TypeError, ValueError):
match = datetime_re.match(value)
if match:
...
}}}
it's easy enough to spot which formats end up hitting the regex case:
{{{
'2021-06-30 10:42:12.638719' -> 230 ns ± 1.57 ns per loop (mean ±
std. dev. of 7 runs, 1000000 loops each)
'2008-04-02 12:00:00' -> 215 ns ± 2.45 ns per loop (mean ±
std. dev. of 7 runs, 1000000 loops each)
'2011-11-04T00:05:23+04:00' -> 270 ns ± 4.28 ns per loop (mean ±
std. dev. of 7 runs, 1000000 loops each)
'2011-11-04 00:05:23.283+00:00' -> 234 ns ± 1.23 ns per loop (mean ±
std. dev. of 7 runs, 1000000 loops each)
'2012-04-23T10:20:30.400 +0230' -> 10.3 µs ± 281 ns per loop (mean ±
std. dev. of 7 runs, 100000 loops each)
'2012-04-23T10:20:30 -02' -> 9.14 µs ± 136 ns per loop (mean ±
std. dev. of 7 runs, 100000 loops each)
}}}
Most of the speedup gain is going to be found when using sqlite3 as a DB
backend, because it has convertery UDF type things for them, and generally
speaking Django stores and parses them in the well-formed format:
{{{
Database.register_converter("datetime", decoder(parse_datetime))
Database.register_converter("timestamp", decoder(parse_datetime))
}}}
when using `cProfile` to execute `tuple(User.objects.all())` with 101
users, `parse_datetime` accounts for 10+% of the overall time (with
`Model.__init__` being the only bigger component), ordered by tottime
(column 2), for the purposes of amortising the noise elsewhere, the call
has been executed `1000` times in a loop:
{{{
101000 0.451 0.000 0.628 0.000
django/db/models/base.py:406(__init__)
102000 0.379 0.000 0.837 0.000
django/utils/dateparse.py:98(parse_datetime)
102000 0.202 0.000 0.699 0.000
django/db/models/sql/compiler.py:1128(apply_converters)
102000 0.158 0.000 0.158 0.000
django/utils/dateparse.py:124(<dictcomp>)
}}}
but after adapting as above, `parse_datetime` is about 1%, same ordering:
{{{
101000 0.433 0.000 0.606 0.000
django/db/models/base.py:406(__init__)
102000 0.198 0.000 0.682 0.000
django/db/models/sql/compiler.py:1128(apply_converters)
102000 0.031 0.000 0.048 0.000
django/utils/dateparse.py:98(parse_datetime)
}}}
(the dictcomp `{k: int(v) for k, v in kw.items() if v is not None}` is
never accounted for because it never executes)
It would be possible to restrict the variants passed through to the
`fromisoformat` branch further, because the minimum acceptable length
string is `10` (for `YYYY-MM-DD`) and then there's ... I dunno, some other
number of valid options for the time parsing
(`HH[:MM[:SS[.fff[fff]]]][+HH:MM[:SS[.ffffff]]]` - eg, the tz string is
only `5` `8` or `15` chars long), but I doubt it's worth fussing over
that.
I suspect the same is true for `parse_date` which could use
`date.fromisoformat` if the input value is exactly `10` long, and fall
back to the regex method for the optional leading-zeroes `date_re` allows
for.
Zero tests seem to fail with the adapted method proposed above.
--
Ticket URL: <https://code.djangoproject.com/ticket/32892>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* stage: Unreviewed => Accepted
Comment:
Sounds reasonable, thanks.
--
Ticket URL: <https://code.djangoproject.com/ticket/32892#comment:1>
* owner: nobody => Hasan Ramezani
* status: new => assigned
* has_patch: 0 => 1
Comment:
I just created a [https://github.com/django/django/pull/14581 patch] based
on ticket.
--
Ticket URL: <https://code.djangoproject.com/ticket/32892#comment:2>
Comment (by Keryn Knight):
Ah, I've just come back to this to set the has patch info and I see
there's already one :/
Oh well, here's mine anyway: https://github.com/django/django/pull/14582
(which, to be fair, looks to have failed cos of flake8, aren't I great?
:))
FWIW, across the test suite, the `parse_date` and `parse_time` values
which weren't handled by the builtin `fromisoformat` for the appropriate
type are:
{{{
2009-7-8
2009-5-4
2012-4-9
2012-04-56
10:20:30,400
4:8:16
09:15:90
2011-13-10
2011-10-32
2011-10-32
25:50
}}}
by the look of it, which at least for the dates is entirely to be expected
due to the stricter nature of the stdlib handler (must have 2 digit padded
months/days etc)
--
Ticket URL: <https://code.djangoproject.com/ticket/32892#comment:3>
Comment (by Hasan Ramezani):
Sorry, @Keryn for that. I've just assigned it to myself because it didn't
own by anybody.
BTW, @Mariusz, feel free to close my PR if you want to continue with
@Keryn PR.
--
Ticket URL: <https://code.djangoproject.com/ticket/32892#comment:4>
* owner: Hasan Ramezani => Keryn Knight
--
Ticket URL: <https://code.djangoproject.com/ticket/32892#comment:5>
Comment (by Keryn Knight):
Hey Hasan, absolutely no worries. I was surprised to be in a race (and
losing!) , but we're all here for the same reason :)
-------
For historical note, here's the equivalent timings for `parse_date` and
`parse_time` with similar (expected) improvements for the happy path:
Before, using the regex method only. Any and all variants are roughly the
same 3µs timing
{{{
parse_time('00:05:23+04:00') -> 3.56 µs ± 50 ns per loop (mean ± std.
dev. of 7 runs, 100000 loops each)
parse_time('00:05: ') -> 3 µs ± 41.3 ns per loop (mean ± std.
dev. of 7 runs, 100000 loops each)
parse_date('2000-01-01') -> 3.14 µs ± 28.1 ns per loop (mean ± std.
dev. of 7 runs, 100000 loops each)
}}}
After, using the regex method as a fallback and the builtin stricter
format by preference:
{{{
parse_time('00:05:23+04:00') -> 1.05 µs ± 2.65 ns per loop (mean ± std.
dev. of 7 runs, 1000000 loops each)
parse_time('00:05: ') -> 3.66 µs ± 15.2 ns per loop (mean ± std.
dev. of 7 runs, 100000 loops each)
parse_time('00:05:00') -> 987 ns ± 6.68 ns per loop (mean ± std.
dev. of 7 runs, 1000000 loops each)
parse_time('00:05:00.001') -> 1.03 µs ± 12.9 ns per loop (mean ± std.
dev. of 7 runs, 1000000 loops each)
parse_date('2000-01-01') -> 228 ns ± 2.89 ns per loop (mean ± std.
dev. of 7 runs, 1000000 loops each
}}}
Which handily suggests that the most expensive part of parsing the whole
iso string is the time+timezone handling, as you might expect.
--
Ticket URL: <https://code.djangoproject.com/ticket/32892#comment:6>
* stage: Accepted => Ready for checkin
--
Ticket URL: <https://code.djangoproject.com/ticket/32892#comment:7>
* status: assigned => closed
* resolution: => fixed
Comment:
In [changeset:"f35ab74752adb37138112657c1bc8b91f50e799b" f35ab74]:
{{{
#!CommitTicketReference repository=""
revision="f35ab74752adb37138112657c1bc8b91f50e799b"
Fixed #32892 -- Optimized django.utils.dateparse functions by using
fromisoformat().
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/32892#comment:8>