psl DAFSA auto-detection


Daniel Kahn Gillmor

Jul 5, 2016, 2:07:44 PM
to libps...@googlegroups.com
Hi Tim and other libpsl folks--

i'm looking at updating libpsl in debian from 0.11 to 0.13. I started
to review psl_load_fp(), and it looks to me like there are a few issues
with the autodetection:

* it introduces a call to rewind(), whose return code is not checked.
It's possible that someone passes psl_load_fp() a file descriptor
that is not seekable (e.g. a pipe or stdin), in which case rewind()
could fail silently here. It'd be better to avoid rewinding entirely
so that PSL data can be passed on arbitrary file descriptors.

* depending on the underlying OS and filesystem, the initial fread()
might be a very short read (e.g. n < 16). In that case, the
autodetection would claim that the file is a DAFSA simply because
there was a short read.

* The autodetection seems to assume that any non-DAFSA PSL will have
the text "This Source Code Form is subject to" somewhere in the first
256 bytes. But this isn't the case, since a legitimate PSL input
file could be under any license, or could be pre-stripped of all
comments.
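
The short-read concern in the second point can be illustrated with a small sketch. It is not libpsl code: `read_up_to` and `TrickleStream` are hypothetical names, and the three-byte trickle only stands in for a pipe that delivers data slowly. A single read may return fewer bytes than requested, so a detector that needs N bytes has to loop until it has them or hits EOF.

```python
import io


def read_up_to(stream, n):
    """Collect up to n bytes, looping over short reads until EOF."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # an empty read means EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class TrickleStream(io.RawIOBase):
    """Hypothetical non-seekable stream that yields at most 3 bytes per read."""

    def __init__(self, data):
        self._data = data

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buf):
        n = min(3, len(buf), len(self._data))
        buf[:n] = self._data[:n]
        self._data = self._data[n:]
        return n


data = b"// This Source Code Form is subject to"
print(TrickleStream(data).read(16))        # one read: only the first 3 bytes
print(read_up_to(TrickleStream(data), 16)) # looped: the first 16 bytes
```

A detector that treats "fewer than 16 bytes came back" as evidence of a DAFSA would misclassify the first case even though the input is an ordinary text PSL.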

It seems like it would be better for the autodetection to work the other
way around: have a "magic string" to start any DAFSA file that would not
be a legitimate start of an existing PSL file (e.g. something that is
neither a comment nor lowercase UTF-8), and detect that string in the
file descriptor in question.

Then, if the magic string was detected, skip over it and load the
remainder of the file into psl->dafsa.

If the magic string was just: "PSL_DAFSA\n", we could put the entire
test inside the fgets loop.

The only potential problem here is if there is existing code that
depends on finding DAFSA files *without* such a magic string. I can
guarantee that's not the case in debian currently (since debian has
0.11), but i don't know about outside of it.

Lastly, i'm wondering whether pre-compiled DAFSAs might be
architecture-dependent in any way -- if i build one on a 32-bit
little-endian platform with aligned RAM access, and ship it to a 64-bit
big-endian platform with packed RAM representations, will it be
interpreted correctly?

What do you think?

--dkg

Tim Ruehsen

Jul 8, 2016, 6:57:51 AM
to libps...@googlegroups.com, Daniel Kahn Gillmor
On Tuesday, July 5, 2016 2:07:39 PM CEST Daniel Kahn Gillmor wrote:
> Hi Tim and other libpsl folks--
>
> i'm looking at updating libpsl in debian from 0.11 to 0.13. I started
> to review psl_load_fp(), and it looks to me like there are a few issues
> with the autodetection:
>
> * it introduces a call to rewind(), whose return code is not checked.
> It's possible that someone passes psl_load_fp() a file descriptor
> that is not seekable (e.g. a pipe or stdin), in which case rewind()
> could fail silently here. It'd be better to avoid rewinding entirely
> so that PSL data can be passed on arbitrary file descriptors.

rewind() could be worked around.

> * depending on the underlying OS and filesystem, the initial fread()
> might be a very short read (e.g. n < 16). In that case, the
> autodetection would claim that the file is a DAFSA simply because
> there was a short read.

The initial read should be a fgets(). That handles small chunks of incoming
data and also makes it easy to avoid rewind(). We could also read line by line
(storing the lines for later assembling if we detect DAFSA), see below.

> * The autodetection seems to assume that any non-DAFSA PSL will have
> the text "This Source Code Form is subject to" somewhere in the first
> 256 bytes. But this isn't the case, since a legitimate PSL input
> file could be under any license, or could be pre-stripped of all
> comments.
>
> It seems like it would be better for the autodetection to work the other
> way around: have a "magic string" to start any DAFSA file that would not
> be a legitimate start of an existing PSL file (e.g. something that is
> neither a comment nor lowercase UTF-8), and detect that string in the
> file descriptor in question.
>
> Then, if the magic string was detected, skip over it and load the
> remainder of the file into psl->dafsa.
>
> If the magic string was just: "PSL_DAFSA\n", we could put the entire
> test inside the fgets loop.
>
> The only potential problem here is if there is existing code that
> depends on finding DAFSA files *without* such a magic string. I can
> guarantee that's not the case in debian currently (since debian has
> 0.11), but i don't know about outside of it.

I am against adding a magic word.
We are talking about the Mozilla PSL, not about Debian or any private PSL
which could have any form.
The MPSL holds meta-data within comments (start/stop of sections). That means
a complete strip of comments renders the PSL unusable (semantically, not
technically).
The first non-empty line *MUST* be a comment. Also, before we see the first
rule, there must be a comment line starting with '// ===BEGIN '. It's a bit
like a magic word.

> Lastly, i'm wondering whether pre-compiled DAFSAs might be
> architecture-dependent in any way -- if i build one on a 32-bit
> little-endian platform with aligned RAM access, and ship it to a 64-bit
> big-endian platform with packed RAM representations, will it be
> interpreted correctly?

The lookup code works byte-by-byte. There is no endian-dependent code. That
means the generator is (or *must* be) endian-independent as well.
As a last proof we should check whether 'make_dafsa.py' produces exactly the
same output on little- and big-endian systems. But I have no big-endian system
at hand.
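
One way to run that comparison is to hash the generated file on each machine and compare the digests. This is only a sketch: `dafsa_digest` and `file_digest` are hypothetical helper names, and the file name in the usage note is an assumption.

```python
import hashlib


def dafsa_digest(data):
    """Return the SHA-256 hex digest of a DAFSA blob given as bytes."""
    return hashlib.sha256(data).hexdigest()


def file_digest(path):
    """Hash a generated DAFSA file so digests from two machines can be compared."""
    with open(path, "rb") as f:
        return dafsa_digest(f.read())
```

Running `file_digest("psl.dafsa")` on a little-endian and a big-endian build host and comparing the two strings would show whether the generator output is byte-identical.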

Tim

Daniel Kahn Gillmor

Jul 8, 2016, 6:02:19 PM
to Tim Ruehsen, libps...@googlegroups.com
On Fri 2016-07-08 12:57:18 +0200, Tim Ruehsen wrote:
> rewind() could be worked around.

great! that would be good to resolve.

> The initial read should be a fgets(). That handles small chunks of incoming
> data and also makes it easy to avoid rewind(). We could also read line by line
> (storing the lines for later assembling if we detect DAFSA), see below.

how are you proposing to detect DAFSA? currently, libpsl detects
*not-DAFSA* by looking for magic comments :/

> I am against adding a magic word.

a magic word before the DAFSA and fgets() go really well together though
:)

> We are talking about the Mozilla PSL, not about Debian or any private PSL
> which could have any form.

libpsl is capable of loading any public suffix list. the PSL format
specification [0] says nothing about magic comments.

[0] https://publicsuffix.org/list/

> The MPSL holds meta-data within comments (start/stop of sections). That means
> a complete strip of comments renders the PSL unusable (semantically, not
> technically).

Such a list would only be unusable for the purposes of distinguishing
ICANN registries from private registries. It would still be usable for
all the other uses the PSL is put to today.

> The first non-empty line *MUST* be a comment.

no code in libpsl currently verifies this, and nothing in the spec
mandates it.

> Also, before we see the first rule, there must be a comment line
> starting with '// ===BEGIN '. It's a bit like a magic word.

But again, the spec that libpsl is written against doesn't have this
constraint.

I'd much rather use a magic word prefix for the relatively-unused DAFSA
format than change our expectations for an already widely-deployed file
format. It would also let us unequivocally distinguish even the most
perversely-generated DAFSA from a document that meets the PSL
specification: starting with a line that is not a valid domain name
should do the trick. and we get fgets for free as long as we
terminate the magic word with a newline.

> The lookup code works byte-by-byte. There is no endian dependant code. That
> means the generator is (or *must* be) endian independent as well.
> As a last proof we should check wether 'make_dafsa.py' produces exactly the
> same output on little and big endian systems. But have no big endian system at
> hand.

cool. if we can get this version into debian, we'll get it building on
all sorts of machines and we can compare then.

--dkg

Tim Ruehsen

Jul 12, 2016, 9:36:55 AM
to libps...@googlegroups.com, Daniel Kahn Gillmor
On Saturday, July 9, 2016 12:02:16 AM CEST Daniel Kahn Gillmor wrote:
> On Fri 2016-07-08 12:57:18 +0200, Tim Ruehsen wrote:
> > rewind() could be worked around.
>
> great! that would be good to resolve.
>
> > The initial read should be a fgets(). That handles small chunks of
> > incoming
> > data and also makes it easy to avoid rewind(). We could also read line by
> > line (storing the lines for later assembling if we detect DAFSA), see
> > below.
> how are you proposing to detect DAFSA? currently, libpsl detects
> *not-DAFSA* by looking for magic comments :/
>
> > I am against adding a magic word.
>
> a magic word before the DAFSA and fgets() go really well together though
> :)
>
> > We are talking about the Mozilla PSL, not about Debian or any private PSL
> > which could have any form.
>
> libpsl is capable of loading any public suffix list. the PSL format
> specification says nothing about magic comments.
>
> [0] https://publicsuffix.org/list/

see 'Divisions'
"The Public Suffix List is subdivided, using markers in the comments, into two
sections, labelled as ICANN domains and PRIVATE domains."

The explicit "magic comment" can be found in the PSL linter used for each
commit on Travis-CI (https://github.com/publicsuffix/list/tree/master/linter).
(BTW, I am the author of this linter, but it has been reviewed and accepted by
the PSL maintainers).

There is also a current guideline for editing the PSL:
https://github.com/publicsuffix/list/wiki/Guidelines

> > The MPSL holds meta-data within comments (start/stop of sections). That
> > means a complete strip of comments renders the PSL unusable
> > (semantically, not technically).
>
> Such a list would only be unusable for the purposes of distinguishing
> ICANN registries from private registries. It would still be usable for
> all the other uses the PSL is taken today.
>
> > The first non-empty line *MUST* be a comment.
>
> no code in libpsl currently verifies this, and nothing in the spec
> mandates it.

We currently parse these comments in psl_load_fp() to assign a rule to ICANN
or PRIVATE.
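
A rough sketch of that kind of section tracking, for illustration only. This is not the psl_load_fp() implementation: `classify_rules` is a hypothetical name, and the marker strings are the ones used in the published list.

```python
ICANN, PRIVATE, UNKNOWN = "icann", "private", "unknown"


def classify_rules(lines):
    """Tag each rule with the section whose BEGIN marker most recently preceded it."""
    section = UNKNOWN
    rules = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("// ===BEGIN ICANN DOMAINS"):
            section = ICANN
        elif line.startswith("// ===BEGIN PRIVATE DOMAINS"):
            section = PRIVATE
        elif line.startswith("// ===END "):
            section = UNKNOWN
        elif line and not line.startswith("//"):
            # A rule line is only read up to the first whitespace.
            rules.append((line.split()[0], section))
    return rules
```

With the comments stripped, every rule would land in the "unknown" section: the list still parses, but the ICANN/PRIVATE distinction is lost.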

> > Also, before we see the first rule, there must be a comment line
> > starting with '// ===BEGIN '. It's a bit like a magic word.
>
> But again, the spec that libpsl is written against doesn't have this
> constraint.
>
> I'd much rather use a magic word prefix for the relatively-unused DAFSA
> format than change our expectations for an already widely-deployed file
> format. It would also let us unequivocally distinguish even the most
> perversely-generated DAFSA from a document that meets the PSL
> specification, just by starting with a line that is not a valid domain
> name should do the trick. and we get fgets for free as long as we
> terminate the magic word with a newline.

What exactly do you propose ?
There should be at least some version information, just in case the format
evolves.

Tim

Daniel Kahn Gillmor

Jul 12, 2016, 5:12:08 PM
to Tim Ruehsen, libps...@googlegroups.com
On Tue 2016-07-12 15:36:24 +0200, Tim Ruehsen wrote:
>> [0] https://publicsuffix.org/list/
>
> see 'Divisions'
> "The Public Suffix List is subdivided, using markers in the comments, into two
> sections, labelled as ICANN domains and PRIVATE domains."

hm, i agree that is there, but there's no explicit specification for
what the comment actually is -- they could change the strings or reverse
the order of the domains without violating that clause, right? (your
linter would complain though).

maybe the spec needs updating to say explicitly how the sections are
structured, and should guarantee something about how the PSL list is
itself introduced? If it did that then we could just use that part as
the "magic string", right?

> There is also a current guideline for editing the PSL:
> https://github.com/publicsuffix/list/wiki/Guidelines

yep, i see that stuff, and i understand the reason that the two parts of
the list exist. There are even more reasons to use the list, though
(see the discussion around the possible uses of DBOUND). In the
abstract, i think what they're specifying are two distinct PSLs:

* a PSL for cookies (ICANN + PRIVATE)
* a PSL for X.509 wildcard issuance (ICANN)

it's entirely possible that the PSL for, say, DMARC would be distinct
from (though partially overlapping with) either of these.

If you're loading a PSL for a specific purpose, it'd be great to just
load it and get boolean answers back :)

> [ dkg wrote: ]
>> I'd much rather use a magic word prefix for the relatively-unused DAFSA
>> format than change our expectations for an already widely-deployed file
>> format. It would also let us unequivocally distinguish even the most
>> perversely-generated DAFSA from a document that meets the PSL
>> specification, just by starting with a line that is not a valid domain
>> name should do the trick. and we get fgets for free as long as we
>> terminate the magic word with a newline.
>
> What exactly do you propose ?
> There should be at least some version information, just in case the format
> evolves.

how about this leading string:

'.DAFSA@PSL_0   \n'

in hex: 2e 44 41 46 53 41 40 50 53 4c 5f 30 20 20 20 0a

This has several advantages:

* the leading . explicitly violates the spec for the non-DAFSA file:
"Each rule lists a public suffix, with the subdomain portions
separated by dots (.) as usual. There is no leading dot."

* it contains both @ and _, characters traditionally not allowed in
host labels, and very unlikely to be present in the suffix of a
public registry despite more recent loosening of generic DNS label
constraints.

* it is newline-terminated, so it is a natural thing to parse when
using fgets().

* it is exactly 16 octets, so mmapping the file on aligned
architectures should leave the start of the DAFSA itself well-aligned
with typical memory architectures up to 128-bits, if alignment ever
matters to the layout.

* it has (after the _) a version number, so that it is potentially
extendable.

* it distinguishes both the data structure (DAFSA) and the intent for
the data structure (PSL), so that such a structure won't be
accidentally mis-used for other purposes.
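
A minimal sketch of how a loader could test for this header on a stream without seeking or rewinding. It is not the libpsl implementation: `split_dafsa` is a hypothetical name, and the bounded `readline()` plays the role fgets() would play in C.

```python
import io

# The proposed 16-octet header: 12 visible characters, 3 spaces, newline.
MAGIC = b".DAFSA@PSL_0   \n"


def split_dafsa(stream):
    """Return (is_dafsa, payload) for a binary stream, never seeking backwards."""
    # Read at most one header's worth of bytes, stopping early at a newline.
    first = stream.readline(len(MAGIC))
    if first == MAGIC:
        return True, stream.read()  # the rest is the DAFSA itself
    # Not a DAFSA: hand back everything, including the bytes already consumed.
    return False, first + stream.read()


print(len(MAGIC))  # 16
print(split_dafsa(io.BytesIO(MAGIC + b"\x01\x02")))
print(split_dafsa(io.BytesIO(b"// a comment\ncom\n")))
```

Because the bytes already read are kept and prepended in the text case, the same approach works on pipes and stdin, where rewind() cannot.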

wdyt?

--dkg


Tim Ruehsen

Jul 13, 2016, 3:37:03 AM
to libps...@googlegroups.com, Daniel Kahn Gillmor
On Tuesday, July 12, 2016 11:12:01 PM CEST Daniel Kahn Gillmor wrote:
> On Tue 2016-07-12 15:36:24 +0200, Tim Ruehsen wrote:
> >> [0] https://publicsuffix.org/list/
> >
> > see 'Divisions'
> > "The Public Suffix List is subdivided, using markers in the comments, into
> > two sections, labelled as ICANN domains and PRIVATE domains."
>
> hm, i agree that is there, but there's no explicit specification for
> what the comment actually is -- they could change the strings or reverse
> the order of the domains without violating that clause, right? (your
> linter would complain though).
>
> maybe the spec needs updating to say explicitly how the sections are
> structured, and should guarantee something about how the PSL list is
> itself introduced? If it did that then we could just use that part as
> the "magic string", right?

Right. But you already convinced me to add your proposed magic, see below :-)

Anyway, this should be reported (as an issue or, better, as a pull request).

> > There is also a current guideline for editing the PSL:
> > https://github.com/publicsuffix/list/wiki/Guidelines
>
> yep, i see that stuff, and i understand the reason that the two parts of
> the list exist. There are even more reasons to use the list, though
> (see the discussion around the possible uses of DBOUND). In the
> abstract, i think what they're specifying are two distinct PSLs:
>
> * a PSL for cookies (ICANN + PRIVATE)
> * a PSL for X.509 wildcard issuance (ICANN)
>
> it's entirely possible that the PSL for, say, DMARC would be distinct
> from (though partially overlapping with) either of these.
>
> If you're loading a PSL for a specific purpose, it'd be great to just
> load it and get boolean answers back :)

I definitely agree.
Very good !
I can't think of anything else to consider.

I am going to implement that for 0.14.0

Tim

Tim Ruehsen

Jul 13, 2016, 5:20:17 AM
to libps...@googlegroups.com, Daniel Kahn Gillmor
Added code to branch 'develop' for review/testing.

Generate `psl.dafsa` from `list/public_suffix_list.dat`:

$ src/make_dafsa.py --output-format=binary --input-format=psl list/public_suffix_list.dat psl.dafsa

Test the result:

$ tools/psl --load-psl-file psl.dafsa aeroclub.aero

Tim

Daniel Kahn Gillmor

Jul 13, 2016, 7:06:15 AM
to Tim Ruehsen, libps...@googlegroups.com
this looks reasonable to me -- maybe the docstring for words_to_binary()
could be fixed, but otherwise i think it's fine.

Any reason not to ship make_dafsa.py in a binary package in debian? I'm
thinking of putting it in either "psl" or "libpsl-dev". With its
current name, i probably wouldn't put it directly in $PATH at all, but
someplace like /usr/share/libpsl-dev/make_dafsa.py

Having it ship in a binary package would mean i can just make
publicsuffix build-depend on that binary package, and ship the dafsa
directly in /usr/share/publicsuffix/public_suffix_list.dafsa. Then we
can automate rapid updates to the publicsuffix package in between releases
of libpsl and have libpsl pick them up at no significant performance
cost.

Do you have a preference for where make_dafsa.py should be shipped?

--dkg

Tim Ruehsen

Jul 13, 2016, 7:14:49 AM
to Daniel Kahn Gillmor, libps...@googlegroups.com
That sounds great !
That obsoletes our old discussion about libpsl being automatically built on
publicsuffix updates, very good.

> Do you have a preference for where make_dafsa.py should be shipped?

No. You are free to decide.

Tim

Daniel Kahn Gillmor

Jul 13, 2016, 8:50:10 AM
to Tim Ruehsen, libps...@googlegroups.com
On Wed 2016-07-13 13:14:19 +0200, Tim Ruehsen wrote:
> That sounds great !
> That obsoletes our old discussion about libpsl being automatically built on
> publicsuffix updates, very good.

yes, i'm pretty happy with it. It does introduce a weird
build-dependency cycle (publicsuffix build-depends on libpsl-dev, which
build-depends on publicsuffix), but i'll ask folks who have been around
the release cycle longer than i have about whether that kind of thing is
going to cause any problems.

If there is a problem, i'll report back and we can talk more about how
we might clear that dependency cycle.

>> Do you have a preference for where make_dafsa.py should be shipped?
>
> No. You are free to decide.

Ok, i'm leaning toward libpsl-dev at the moment, but if anyone makes a
convincing argument to the contrary before 0.14 is released, i'm open to
persuasion.

all the best,

--dkg

Daniel Kahn Gillmor

Jul 14, 2016, 5:58:25 AM
to Tim Ruehsen, libps...@googlegroups.com
On Wed 2016-07-13 14:50:07 +0200, Daniel Kahn Gillmor wrote:
> yes, i'm pretty happy with it. It does introduce a weird
> build-dependency cycle (publicsuffix build-depends on libpsl-dev, which
> build-depends on publicsuffix), but i'll ask folks who have been around
> the release cycle longer than i have about whether that kind of thing is
> going to cause any problems.
[...]
> Ok, i'm leaning toward libpsl-dev at the moment, but if anyone makes a
> convincing argument to the contrary before 0.14 is released, i'm open to
> persuasion.

To make it easier to avoid the build-dep cycle, i propose renaming
make_dafsa.py to psl-make-dafsa and documenting it.

Please see https://github.com/rockdaboot/libpsl/pull/52 for suggested
changes. Thanks!

--dkg

Tim Ruehsen

Jul 14, 2016, 6:08:51 AM
to libps...@googlegroups.com, Daniel Kahn Gillmor
Thanks, accepted via Github. All your text is now part of the commit message
8')

Tim

Tim Ruehsen

Jul 29, 2016, 5:19:15 AM
to libps...@googlegroups.com, Daniel Kahn Gillmor
Are we ready for a 0.14 release yet ? I am still waiting for something... but
not sure for what ;-)

Regards, Tim

Daniel Kahn Gillmor

Jul 29, 2016, 5:43:17 PM
to Tim Ruehsen, libps...@googlegroups.com
On Fri 2016-07-29 05:18:34 -0400, Tim Ruehsen wrote:
> Are we ready for a 0.14 release yet ? I am still waiting for
> something... but not sure for what ;-)

Hm, i thought i was waiting on you, but i'm not sure what for either :P

If you're up for 0.14, i'm game to try to get it packaged.

--dkg

Tim Rühsen

Jul 30, 2016, 8:10:06 AM
to libps...@googlegroups.com, Daniel Kahn Gillmor
Released, but the new man pages are not in the tarball... (just saw that
after release). Not sure if we should just add them to EXTRA_DIST or even move
them to docs/.

Regards, Tim