1) Improved the testing harness so it can run all the tests on Chrome
from the command line. I'm using their TestShell executable as the
integration point, which has the same cookie logic as the full
browser.
2) Wrote a number of tests probing the parsing of cookies and values.
3) Replaced the bogus Set-Cookie grammar in the draft with something
more accurate. So far, we're covering only the cookie-name and the
cookie-value. I'm planning on doing the cookie-attributes (i.e.,
their names and opaque values) next.
Currently, I've expressed the syntax as a parsing algorithm because
that's easier to write. Eventually, I'd like to turn this into the
more traditional grammar style. It might be slightly too early to do
that yet.
I'm already finding a number of cases where the major implementations
differ. I've tried to pick the most reasonable behavior based on
testing and historical precedent. I've documented all the cases I
know of in tests.
Useful Areas for Contribution:
1) Expand the test harness to be able to run more implementations
automatically. This is extremely valuable because it ensures we
understand the compatibility impact of our decisions.
a) Adding Safari should be fairly straightforward using their
DumpRenderTree executable in a similar way to how we're currently
using Chromium's TestShell.
b) Adding Internet Explorer is probably a matter of writing a simple
WinInet application that functions like CURL.
c) Are there other implementations we should be testing? Is there a
command line utility that uses libsoup?
2) Cookie date examples. I imagine specifying the cookie-date syntax
and semantics will be challenging. If you or anyone you know has a
corpus of cookie dates we can use, please let me know.
Thanks,
Adam
_______________________________________________
http-state mailing list
http-...@ietf.org
https://www.ietf.org/mailman/listinfo/http-state
Checking... what's the purpose of "picking" a specific behavior when
right now implementations differ? Doesn't it make more sense to just
document the areas that do not have inter-operability, and go on?
BR, Julian
Mostly so we can make progress. We should view these picks as
tentative, just like adding the text from 2109 was tentative.
Hopefully the new text is more accurate than the text it's replacing,
but we should continue to iterate.
> Doesn't it make more sense to just document the
> areas that do not have inter-operability, and go on?
I've documented everything I know so far in tests. I expect we'll end
up discussing each case that differs between implementations on the
mailing list. That's why it's important to integrate more
implementations into the testing harness. Once we do that, we can read
out all the interoperability issues with a few commands and have a
broader view when we make decisions.
Adam
> c) Are there other implementations we should be testing?
Wget is a very popular command line tool with a cookie parser.
> 2) Cookie date examples. I imagine specifying the cookie-date syntax and
> semantics will be challenging. If you or anyone you know has a corpus of
> cookie dates we can use, please let me know.
Dan Winship's note of 2009-08-05* doesn't give specific numbers, but it
lists five formats that are actually in use:
Wdy, DD-Mon-YYYY HH:MM:SS GMT
Wdy, DD Mon YYYY HH:MM:SS GMT
Wdy, DD-Mon-YY HH:MM:SS GMT
Weekday, DD-Mon-YY HH:MM:SS GMT
Wdy Mon DD HH:MM:SS YYYY GMT
Maybe someone with access to a proxy could extract more logs on this.
--
> c) Are there other implementations we should be testing? Is there a
> command line utility that uses libsoup?
I could write one easily enough (and in fact, once we have a good set of
tests, I'll probably add them to the libsoup regression tests). But
there is no "compatibility impact" to understand wrt libsoup's behavior;
wherever it doesn't behave like the majority of other browsers, it's
just a bug and I'm going to change it. So I don't think there's really
any reason for the spec's test suite to be testing it.
> 2) Cookie date examples. I imagine specifying the cookie-date syntax
> and semantics will be challenging. If you or anyone you know has a
> corpus of cookie dates we can use, please let me know.
OK, here is a breakdown of the 2973 expires tokens in the cookies I'd
collected a few years ago (from a not at all representative sample of
web sites). Counts represent number of Set-Cookie headers, not number of
unique cookies or sites or whatever.
  1987  "Mon, 10-Dec-2007 17:02:24 GMT"
        Revised Netscape spec format
   533  "Wed, 09 Dec 2009 16:27:23 GMT"
        rfc1123-date
   239  "Thursday, 01-Jan-1970 00:00:00 GMT"
        4-digit-year version of Netscape spec example (see below).
        Seems to only come from sites using PHP, but it's not PHP
        itself; maybe some framework?
    89  "Mon Dec 10 16:32:30 2007 GMT"
        The not-quite-asctime format used by Amazon. (Still not fixed!)
    62  "Wednesday, 01-Jan-10 00:00:00 GMT"
        The syntax used by the example text in the Netscape spec,
        although the actual grammar uses abbreviated weekday names
    31  "Mon, 10-Dec-07 20:35:03 GMT"
        Original Netscape spec
    12  "Wed, 1 Jan 2020 00:00:00 GMT"
        If this had "01 Jan" it would be an rfc1123-date. This *is* a
        legitimate rfc822 date, though not an rfc2822 date because "GMT"
        is deprecated in favor of "+0000" there.
     8  "Saturday, 8-Dec-2012 21:24:09 GMT"
        Would match the "weird php" syntax above if it was "08-Dec"
     3  "Thu, 31 Dec 23:55:55 2037 GMT"
        God only knows what they were thinking. This came from a
        hit-tracker site, and it's possible that it's just totally
        broken and no one parses it "correctly"
     2  "Sun,  9 Dec 2012 13:42:05 GMT"
        Another kind of rfc822 / nearly-rfc1123 date, using superfluous
        whitespace.
     2  "Wed Dec 12 2007 08:44:07 GMT-0500 (EST)"
        Another kind of "let's throw components together at random". The
        site that this cookie came from has apparently been fixed since
        then. (It uses the Netscape spec format now.)
     2  "Mon, 01-Jan-2011 00: 00:00 GMT"
        Note whitespace inside the time component. Also, the cookie came
        with a domain= attribute that didn't match the domain it was
        being sent from (at all). Nice job all around.
     1  "Sun, 1-Jan-1995 00:00:00 GMT"
     1  "Wednesday, 01-Jan-10 0:0:00 GMT"
     1  "Thu, 10 Dec 2009 13:57:2 GMT"
        Because fixed-width fields are for sissies.
So, you can match 96.6% of those with a parser that basically does
rfc1123-date, except:
- it accepts long or short day names
- it allows the day-of-month to be "1*2DIGIT" or "SPC DIGIT"
- it allows " " or "-" around month name
- it accepts 2 or 4 digit years (which is extra tricky for cookies
since there are legitimate reasons for sending both distant-past
and distant-future dates... we need to test how clients interpret
these).
You can get another 3% by accepting asctime-date, and being lenient if
they have " GMT" at the end. Given that HTTP clients are required to
accept asctime-dates in the Date header anyway, this isn't that harsh.
And also, it's Amazon, so you have no choice.
So that gets you to 99.6% of the cookies I saw, and that's probably good
enough for the grammar, though we could note that there are
even-more-broken dates out there.
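
For illustration, the lenient rfc1123-date matcher described above could
be sketched with a regular expression. This is only a sketch under my
own assumptions: the function name looks_like_lenient_rfc1123 is made
up, the pattern checks the shape of the string rather than calendar
validity, and it does not cover the asctime-style dates.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` has the lenient rfc1123-like shape
// (long or short day name, 1- or 2-digit day, " " or "-" around the
// month, 2- or 4-digit year). It does not validate the calendar date.
bool looks_like_lenient_rfc1123(const std::string& s) {
    static const std::regex pattern(
        "(Mon|Tue|Wed|Thu|Fri|Sat|Sun|Monday|Tuesday|Wednesday|Thursday|"
        "Friday|Saturday|Sunday), "
        "( ?[0-9]|[0-9]{2})"                 // day: 1*2DIGIT or SPC DIGIT
        "[ -]"
        "(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
        "[ -]"
        "([0-9]{2}|[0-9]{4})"                // 2- or 4-digit year
        " [0-9]{2}:[0-9]{2}:[0-9]{2} GMT");  // fixed-width time, then GMT
    return std::regex_match(s, pattern);
}
```

With this pattern, "Wed, 1 Jan 2020 00:00:00 GMT" and
"Thursday, 01-Jan-1970 00:00:00 GMT" match, while Amazon's
"Mon Dec 10 16:32:30 2007 GMT" and "Thu, 10 Dec 2009 13:57:2 GMT" do
not.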
-- Dan
Ok. If you'd like to implement it, I'll add it to the harness.
Currently we have support for:
* Firefox
* Safari
* Chrome
* CURL
I'd like to add IE to that list, but that will probably have to wait
until I get back to California and my Windows machine.
This is great data. Would you like to take a crack at writing a date grammar?
For reference here are the tests from Chrome's cookie date parser.
const CookieDateParsingCase tests[] = {
{ "Sat, 15-Apr-17 21:01:22 GMT", true, 1492290082 },
{ "Thu, 19-Apr-2007 16:00:00 GMT", true, 1176998400 },
{ "Wed, 25 Apr 2007 21:02:13 GMT", true, 1177534933 },
{ "Thu, 19/Apr\\2007 16:00:00 GMT", true, 1176998400 },
{ "Fri, 1 Jan 2010 01:01:50 GMT", true, 1262307710 },
{ "Wednesday, 1-Jan-2003 00:00:00 GMT", true, 1041379200 },
{ ", 1-Jan-2003 00:00:00 GMT", true, 1041379200 },
{ " 1-Jan-2003 00:00:00 GMT", true, 1041379200 },
{ "1-Jan-2003 00:00:00 GMT", true, 1041379200 },
{ "Wed,18-Apr-07 22:50:12 GMT", true, 1176936612 },
{ "WillyWonka , 18-Apr-07 22:50:12 GMT", true, 1176936612 },
{ "WillyWonka , 18-Apr-07 22:50:12", true, 1176936612 },
{ "WillyWonka , 18-apr-07 22:50:12", true, 1176936612 },
{ "Mon, 18-Apr-1977 22:50:13 GMT", true, 230251813 },
{ "Mon, 18-Apr-77 22:50:13 GMT", true, 230251813 },
// If the cookie came in with the expiration quoted (which in terms of
// the RFC you shouldn't do), we will get string quoted. Bug 1261605.
{ "\"Sat, 15-Apr-17\\\"21:01:22\\\"GMT\"", true, 1492290082 },
// Test with full month names and partial names.
{ "Partyday, 18- April-07 22:50:12", true, 1176936612 },
{ "Partyday, 18 - Apri-07 22:50:12", true, 1176936612 },
{ "Wednes, 1-Januar-2003 00:00:00 GMT", true, 1041379200 },
// Test that we always take GMT even with other time zones or bogus
// values. The RFC says everything should be GMT, and in the worst case
// we are 24 hours off because of zone issues.
{ "Sat, 15-Apr-17 21:01:22", true, 1492290082 },
{ "Sat, 15-Apr-17 21:01:22 GMT-2", true, 1492290082 },
{ "Sat, 15-Apr-17 21:01:22 GMT BLAH", true, 1492290082 },
{ "Sat, 15-Apr-17 21:01:22 GMT-0400", true, 1492290082 },
{ "Sat, 15-Apr-17 21:01:22 GMT-0400 (EDT)",true, 1492290082 },
{ "Sat, 15-Apr-17 21:01:22 DST", true, 1492290082 },
{ "Sat, 15-Apr-17 21:01:22 -0400", true, 1492290082 },
{ "Sat, 15-Apr-17 21:01:22 (hello there)", true, 1492290082 },
// Test that if we encounter multiple : fields, that we take the first
// that correctly parses.
{ "Sat, 15-Apr-17 21:01:22 11:22:33", true, 1492290082 },
{ "Sat, 15-Apr-17 ::00 21:01:22", true, 1492290082 },
{ "Sat, 15-Apr-17 boink:z 21:01:22", true, 1492290082 },
// We take the first, which in this case is invalid.
{ "Sat, 15-Apr-17 91:22:33 21:01:22", false, 0 },
// amazon.com formats their cookie expiration like this.
{ "Thu Apr 18 22:50:12 2007 GMT", true, 1176936612 },
// Test that hh:mm:ss can occur anywhere.
{ "22:50:12 Thu Apr 18 2007 GMT", true, 1176936612 },
{ "Thu 22:50:12 Apr 18 2007 GMT", true, 1176936612 },
{ "Thu Apr 22:50:12 18 2007 GMT", true, 1176936612 },
{ "Thu Apr 18 22:50:12 2007 GMT", true, 1176936612 },
{ "Thu Apr 18 2007 22:50:12 GMT", true, 1176936612 },
{ "Thu Apr 18 2007 GMT 22:50:12", true, 1176936612 },
// Test that the day and year can be anywhere if they are unambiguous.
{ "Sat, 15-Apr-17 21:01:22 GMT", true, 1492290082 },
{ "15-Sat, Apr-17 21:01:22 GMT", true, 1492290082 },
{ "15-Sat, Apr 21:01:22 GMT 17", true, 1492290082 },
{ "15-Sat, Apr 21:01:22 GMT 2017", true, 1492290082 },
{ "15 Apr 21:01:22 2017", true, 1492290082 },
{ "15 17 Apr 21:01:22", true, 1492290082 },
{ "Apr 15 17 21:01:22", true, 1492290082 },
{ "Apr 15 21:01:22 17", true, 1492290082 },
{ "2017 April 15 21:01:22", true, 1492290082 },
{ "15 April 2017 21:01:22", true, 1492290082 },
// Some invalid dates
{ "98 April 17 21:01:22", false, 0 },
{ "Thu, 012-Aug-2008 20:49:07 GMT", false, 0 },
{ "Thu, 12-Aug-31841 20:49:07 GMT", false, 0 },
{ "Thu, 12-Aug-9999999999 20:49:07 GMT", false, 0 },
{ "Thu, 999999999999-Aug-2007 20:49:07 GMT", false, 0 },
{ "Thu, 12-Aug-2007 20:61:99999999999 GMT", false, 0 },
{ "IAintNoDateFool", false, 0 },
};
In general, I think we should plan to iterate on each part of the
draft several times instead of trying to get it perfect in one pass.
Dan has presented some pretty detailed data on what kinds of date
formats he's seeing on the web. Writing up a grammar for that would
be a big improvement over the big blank space in the current draft.
:)
> Once we have that, we can specify an "official" date format[1] that should be used, and alternative formats that should be parsable if encountered[2].
Yep. We should probably recommend that server implementors use
whichever sane date format is the most widely implemented. We should
also explain how user agent implementors should cope with enough crazy
formats to achieve some desired level of compatibility with existing
practice.
> And as Dan pointed out[3], we should test how two-digit years are interpreted -- for my own date parser, anything >= 40 is considered 20th century and anything <= 39 is considered 21st century. It'd be good to know how the UAs handle it and provide a recommendation in the spec.
Definitely. If you investigate this, let us know what you find.
>> // Some invalid dates
>> { "98 April 17 21:01:22", false, 0 },
>
> Why is that considered invalid? Because it's unclear if it's 1998 or 2098?
Not sure. I haven't looked into date parsing in detail yet.
Adam
> This is great data. Would you like to take a crack at writing a date grammar?
Basically they all have two to five types of data:
A) Date (day, month and year)
B) Time (hour, minute, second)
C) Time zone (named or relative UTC)
D) Day (name of the day, mostly pointless)
E) Additional junk (entirely pointless)
[All these with all sorts of separators]
I'm sure all the date parsers we use work basically like this: the parser
scans the string and identifies the individual components it specifies. Once
it has enough details (A and B are required), it doesn't matter if more junk
(E) is found or added to the date string.
The individual ordering of the components is mostly uninteresting to a parser,
except if you really want to make the parser check for a strict syntax.
Of course, if the same A, B or C type appears more than once, the outcome is
undefined, because the parser cannot reliably pick which one is correct.
Given the look of that list of dates for the Chrome tests, it seems similar in
spirit to the curl parser.
Combined, it makes it a pain to write a formal syntax spec from.
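
To make the component-based approach above concrete, here is a minimal
C++ sketch. It is not curl's or Chrome's actual code; parse_loose_date
and its rules are assumptions of mine: tokens are runs of letters,
digits and ':', the first usable value wins for each field, weekday,
zone and junk are ignored, everything is treated as GMT, and two-digit
years 69-99 are read as 19xx.

```cpp
#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <string>

// Days from 1970-01-01 to y-m-d in the proleptic Gregorian calendar.
static long long days_from_civil(long long y, unsigned m, unsigned d) {
    y -= m <= 2;
    const long long era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);
    const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + static_cast<long long>(doe) - 719468;
}

static bool is_token_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == ':';
}

// Hypothetical sketch: stores seconds since the epoch in *epoch and
// returns true only if a date (A) and a time (B) were both found.
bool parse_loose_date(const std::string& s, long long* epoch) {
    static const char* const kMonths[12] = {"jan", "feb", "mar", "apr",
        "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"};
    int day = -1, month = -1, year = -1, hh = -1, mm = 0, ss = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        if (!is_token_char(s[i])) { ++i; continue; }  // separator
        const std::size_t start = i;
        while (i < s.size() && is_token_char(s[i])) ++i;
        const std::string tok = s.substr(start, i - start);

        if (tok.find(':') != std::string::npos) {     // B: time
            int h, m, sec;
            char extra;
            if (hh < 0 &&
                std::sscanf(tok.c_str(), "%d:%d:%d%c", &h, &m, &sec, &extra) == 3 &&
                h >= 0 && h <= 23 && m >= 0 && m <= 59 && sec >= 0 && sec <= 59) {
                hh = h; mm = m; ss = sec;
            }
        } else if (std::isalpha(static_cast<unsigned char>(tok[0]))) {
            // Month name; weekday, zone and junk (C, D, E) are ignored.
            if (month < 0 && tok.size() >= 3) {
                std::string key = tok.substr(0, 3);
                for (char& c : key)
                    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
                for (int k = 0; k < 12; ++k)
                    if (key == kMonths[k]) month = k;
            }
        } else if (tok.find_first_not_of("0123456789") == std::string::npos) {
            const long v = std::strtol(tok.c_str(), nullptr, 10);
            if (day < 0 && tok.size() <= 2 && v >= 1 && v <= 31) {
                day = static_cast<int>(v);            // A: day of month
            } else if (year < 0 && tok.size() == 4) {
                year = static_cast<int>(v);           // A: 4-digit year
            } else if (year < 0 && tok.size() == 2) {
                year = static_cast<int>(v >= 69 ? 1900 + v : 2000 + v);
            }
        }
    }
    if (day < 0 || month < 0 || year < 0 || hh < 0) return false;
    *epoch = days_from_civil(year, static_cast<unsigned>(month + 1),
                             static_cast<unsigned>(day)) * 86400LL
             + hh * 3600 + mm * 60 + ss;
    return true;
}
```

Because component order does not matter to this sketch, it gives the
same value for "Sat, 15-Apr-17 21:01:22 GMT" and "15 Apr 21:01:22 2017".
It makes no attempt to reproduce the stricter choices in the Chrome
table, such as rejecting a string whose first time-like field is
invalid.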
--
With HTML5 I've found that rather than defining formal syntaxes for this
kind of thing, it's easier just to define imperative parsing steps that
lead to the right behaviour.
For example:
http://www.whatwg.org/specs/web-apps/current-work/#rules-for-parsing-floating-point-number-values
--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
They might be useful for implementers of parsers, but they are almost
unreadable for producers.
BR, Julian
I think we first need to understand which UAs can successfully parse the various date formats. Once we have that, we can specify an "official" date format[1] that should be used, and alternative formats that should be parsable if encountered[2]. And as Dan pointed out[3], we should test how two-digit years are interpreted -- for my own date parser, anything >= 40 is considered 20th century and anything <= 39 is considered 21st century. It'd be good to know how the UAs handle it and provide a recommendation in the spec.
> // Some invalid dates
> { "98 April 17 21:01:22", false, 0 },
Why is that considered invalid? Because it's unclear if it's 1998 or 2098?
- Bil
[1] I'm leaning toward the revised Netscape spec format of "Mon, 10-Dec-2007 17:02:24 GMT"
[2] The list of alternative formats would be made up of those that are parsable by the majority (all?) of UAs.
[3] http://groups.google.com/group/http-state/browse_thread/thread/ba2b98c340eed3b8#msg_1e91e84ebc1df4f1
They're not intended for producers (in fact they're hidden in the HTML5
spec when you select the "author" option) so that's not really that
surprising. For producers, you want the much simpler description of what
is valid, which often has little bearing on the parsing rules.
--
Ian Hickson
I ran that nice list of date formats through the libcurl date parser, and
there were quite a few that didn't parse or didn't match your output:
> { "WillyWonka , 18-Apr-07 22:50:12 GMT", true, 1176936612 },
> { "WillyWonka , 18-Apr-07 22:50:12", true, 1176936612 },
> { "WillyWonka , 18-apr-07 22:50:12", true, 1176936612 },
> { "Partyday, 18- April-07 22:50:12", true, 1176936612 },
> { "Partyday, 18 - Apri-07 22:50:12", true, 1176936612 },
> { "Wednes, 1-Januar-2003 00:00:00 GMT", true, 1041379200 },
> { "Sat, 15-Apr-17 21:01:22 DST", true, 1492290082 },
> { "Sat, 15-Apr-17 21:01:22 (hello there)", true, 1492290082 },
> { "Sat, 15-Apr-17 21:01:22 11:22:33", true, 1492290082 },
> { "Sat, 15-Apr-17 ::00 21:01:22", true, 1492290082 },
> { "Sat, 15-Apr-17 boink:z 21:01:22", true, 1492290082 },
> { "2017 April 15 21:01:22", true, 1492290082 },
> { "15 April 2017 21:01:22", true, 1492290082 },
All the above formats are invalid to libcurl. The last two are about the only
ones I was a bit surprised it doesn't handle and I might take a deeper look at
that later on.
> { "Thu, 012-Aug-2008 20:49:07 GMT", false, 0 },
libcurl parses this to be 1218574147.
> { "Thu, 12-Aug-31841 20:49:07 GMT", false, 0 },
> { "Thu, 12-Aug-9999999999 20:49:07 GMT", false, 0 },
In cases where the format is fine but the specified date is out of range (like
these examples), libcurl returns MAX_INT with the assumption that it should be
enough to not expire them for quite some time.
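
The two policies for an out-of-range date (reject it, as the Chrome
table does, or saturate, as described here for libcurl) can be
contrasted in a few lines. This is a hedged sketch, not code from
either project; store_expiry_32 is a made-up name, and it assumes the
timestamp was first computed in 64 bits and that int is 32 bits wide.

```cpp
#include <climits>

// Hypothetical helper. `seconds` is the parsed expiry in seconds since
// the epoch. If it does not fit in an int, either reject it
// (saturate == false) or clamp it to INT_MAX / INT_MIN.
bool store_expiry_32(long long seconds, bool saturate, int* out) {
    if (seconds > INT_MAX || seconds < INT_MIN) {
        if (!saturate) return false;           // reject the date
        *out = seconds > 0 ? INT_MAX : INT_MIN; // keep the cookie alive
        return true;
    }
    *out = static_cast<int>(seconds);
    return true;
}
```

Saturating keeps a far-future cookie instead of discarding it, at the
cost of silently changing the date the server asked for.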
All the rest were parsed and got the same results as the table Adam posted.
--
The most-commonly-used format right now is the revised Netscape spec
format, but unless we can find a cookie implementation that doesn't
parse rfc1123-date (the second-most-common format), I think we should
recommend that, because that's the standard date format in HTTP.
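
For servers, producing an rfc1123-date takes little code. The following
is a minimal sketch using the standard C library; format_rfc1123_date
is a made-up name, and the sketch assumes the default "C" locale so that
%a and %b yield English abbreviations.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper: render a Unix timestamp as an rfc1123-date,
// e.g. "Wed, 09 Dec 2009 16:27:23 GMT". Returns "" on failure.
std::string format_rfc1123_date(std::time_t t) {
    char buf[64];
    const std::tm* utc = std::gmtime(&t);  // note: not thread-safe
    if (utc == nullptr) return std::string();
    const std::size_t n =
        std::strftime(buf, sizeof buf, "%a, %d %b %Y %H:%M:%S GMT", utc);
    return std::string(buf, n);
}
```

For the timestamp 1492290082 from the Chrome table, this yields
"Sat, 15 Apr 2017 21:01:22 GMT".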
>> And as Dan pointed out[3], we should test how two-digit years are interpreted -- for my own date parser, anything >= 40 is considered 20th century and anything <= 39 is considered 21st century.
Yeah, I was going to suggest 69 as the dividing line, for basically the
same reason; time_t==0 (1970) has to be in the past, and time_t==2^31-1
(2038) has to be in the future. It may be that browsers are inconsistent
about years from 39 to 68...
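
The two proposals differ only in where the pivot sits, which a small
sketch makes explicit. The function name and the parameter
first_1900s_year are hypothetical. For the "69" rule I assume the split
used by POSIX strptime's %y (69-99 map to 1969-1999, 00-68 to
2000-2068); the message above does not say which side 69 itself falls
on.

```cpp
// Hypothetical helper: expand a two-digit year. Values at or above
// `first_1900s_year` become 19xx, everything below becomes 20xx.
//   first_1900s_year = 40 reproduces the ">= 40 is 20th century" rule.
//   first_1900s_year = 69 is the assumed reading of the "69" rule.
int expand_two_digit_year(int yy, int first_1900s_year) {
    return (yy >= first_1900s_year ? 1900 : 2000) + yy;
}
```

Both pivots agree on 70 (1970) and 38 (2038); they disagree only for
40 through 68, for example 50 becomes 1950 with a pivot of 40 and 2050
with a pivot of 69.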
-- Dan
If that's workable w.r.t. existing implementations, that would be great.
Adam
> Maybe just:
>
> cookie-date = rfc1123-like-date | mystery-date
> rfc1123-like-date = weekday "," SP rfc1123-like-dmy SP time SP "GMT"
> weekday = "Monday" | "Mon" | "Tuesday" | "Tue" | ...
> rfc1123-like-dmy = day dmy-div month dmy-div year
> dmy-div = SP | "-"
> day = 2DIGIT | *1SP DIGIT
> month = "Jan" | "Feb" | ...
> year = 2DIGIT | 4DIGIT
> time = 2DIGIT ":" 2DIGIT ":" 2DIGIT
>
> mystery-date = *CHAR ; see below
>
> and then we explain some of the possibilities of mystery-date parsing,
> showing a few examples, but noting that the rfc1123-like-date grammar
> covers 99% of cookies
I think that sounds perfectly fine.
I think that's a good start. Reminder: please use RFC5234-style ABNF, so
"/" instead of "|" etc.
With respect to mystery-date: maybe that could be defined as something like:
*( year / month / dayofmonth / weekday / time / tz / WSP / separators )
and then have ultra-liberal definitions for each of these components,
and also prose that disallows forms where the same component repeats
multiple times?
BR, Julian
Yeah, RFC 2616 says "Recipients of date values are encouraged to be
robust in accepting date values that may have been sent by non-HTTP
applications" so probably all browsers just have a single all-purpose
very-relaxed date parser.
> Combined, it makes it a pain to write a formal syntax spec from.
Maybe just:
cookie-date = rfc1123-like-date | mystery-date
rfc1123-like-date = weekday "," SP rfc1123-like-dmy SP time SP "GMT"
weekday = "Monday" | "Mon" | "Tuesday" | "Tue" | ...
rfc1123-like-dmy = day dmy-div month dmy-div year
dmy-div = SP | "-"
day = 2DIGIT | *1SP DIGIT
month = "Jan" | "Feb" | ...
year = 2DIGIT | 4DIGIT
time = 2DIGIT ":" 2DIGIT ":" 2DIGIT
mystery-date = *CHAR ; see below
and then we explain some of the possibilities of mystery-date parsing,
showing a few examples, but noting that the rfc1123-like-date grammar
covers 99% of cookies, and there's a long tail of crap after that. (And
the major browsers probably aren't completely consistent about what
parts of that tail they accept.)
-- Dan
Perhaps we should specify the rfc1123-date (Dan Winship's suggestion) as the official date format and reference a separate date parsing spec that provides coverage for the other common formats (and more).
- Bil
I've added this grammar to the draft (with the / characters suggested
by Julian). I haven't tackled the mystery-date format yet.
Adam
FWIW, here are some date formats that Mozilla handles:
---8<---
 * Many formats are handled, including:
 *
 *   14 Apr 89 03:20:12
 *   14 Apr 89 03:20 GMT
 *   Fri, 17 Mar 89 4:01:33
 *   Fri, 17 Mar 89 4:01 GMT
 *   Mon Jan 16 16:12 PDT 1989
 *   Mon Jan 16 16:12 +0130 1989
 *   6 May 1992 16:41-JST (Wednesday)
 *   22-AUG-1993 10:59:12.82
 *   22-AUG-1993 10:59pm
 *   22-AUG-1993 12:59am
 *   22-AUG-1993 12:59 PM
 *   Friday, August 04, 1995 3:54 PM
 *   06/21/95 04:24:34 PM
 *   20/06/95 21:07
 *   95-06-08 19:32:48 EDT
 *
 * If the input string doesn't contain a description of the timezone,
 * we consult the `default_to_gmt' to decide whether the string should
 * be interpreted relative to the local time zone (PR_FALSE) or GMT (PR_TRUE).
 * The correct value for this argument depends on what standard specified
 * the time string which you are parsing.
(from: http://mxr.mozilla.org/mozilla1.8.0/source/nsprpub/pr/src/misc/prtime.c#961)
--->8---
- Bil