golang.org/x/net/publicsuffix: Separate git repo? Automatic "go generate" and mail?


Nigel Tao

Jun 8, 2016, 8:45:14 AM
to golang-dev, Brad Fitzpatrick, Volker Dobler
publicsuffix.org maintains a list of public suffixes, used to e.g.
scope HTTP cookies.

The golang.org/x/net/publicsuffix package contains a 'compiled'
version of the plain text list [0]. That compiled version is more
efficient at runtime, but not human-readable.

The canonical upstream list gets ad hoc updates roughly once a week in
recent times [1]. We (usually me, sometimes Brad or Volker)
occasionally update the Go package, maybe once every few months [2]. I
hesitate to do so more frequently because even a one line change to
the upstream list pretty much changes the entire generated table ([3]
is a typical diff), and I don't want to bloat the golang.org/x/net git
repo with lots of essentially binary changes. I'm really only hand
waving and guessing here, as I don't really know how git works under
the hood, but for example, the generated table.go file currently
weighs 528K. At [4], Brad (who's away for some weeks) said that each
publicsuffix commit grows the x/net repo by 0.1MB, which isn't huge,
but it would add up over time.

On the other hand, we've had a number of CLs sent in over the years
because the checked in, generated version becomes stale, and these CLs
have arrived more frequently in recent months. It may be time to
update the generated form more frequently, if not automatically. Any
thoughts or experience out there with automatic code gen and "git
codereview mail"?

Automatic or not, it might then make sense to move the package to its
own dedicated git repo, instead of filling golang.org/x/net with noisy
churn. If so, any bikeshedding opinions between
golang.org/x/publicsuffix (gerrit + codereview) or
github.com/golang/publicsuffix (vanilla git) or something else?

Any other thoughts, golang-dev?



golang.org/issue/15518 also has some discussion of building the list
at runtime instead of at "go generate" time, if users supply a
public_suffix_list.dat file at runtime, but it is a nice property that
the package is currently usable out of the box with "go get" and
without having to supply a separate list. Also, if you need a process
to update that separate list, you might as well have a process to
update the Go package, modulo the repo size issue, if the Go package
was updated more frequently.
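
For illustration, here is a rough sketch of what the runtime approach could look like: parse text in the public_suffix_list.dat format into maps and pick the longest matching rule. This is not how golang.org/x/net/publicsuffix works, and the names ruleSet, parseRules and publicSuffix are hypothetical. It is simplified on purpose (no punycode handling, no validation), and only assumes the list format of "//" comments plus plain, "*." wildcard and "!" exception rules.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// ruleSet holds rules parsed from text in the public_suffix_list.dat format.
// All names here are hypothetical; this is not the x/net/publicsuffix API.
type ruleSet struct {
	plain     map[string]bool // e.g. "com", "co.uk"
	wildcard  map[string]bool // "*.ck" is stored as "ck"
	exception map[string]bool // "!www.ck" is stored as "www.ck"
}

// parseRules reads one rule per line, skipping blank lines and "//" comments.
func parseRules(data string) (*ruleSet, error) {
	rs := &ruleSet{
		plain:     map[string]bool{},
		wildcard:  map[string]bool{},
		exception: map[string]bool{},
	}
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 || strings.HasPrefix(fields[0], "//") {
			continue
		}
		rule := strings.ToLower(fields[0])
		switch {
		case strings.HasPrefix(rule, "!"):
			rs.exception[rule[1:]] = true
		case strings.HasPrefix(rule, "*."):
			rs.wildcard[rule[2:]] = true
		default:
			rs.plain[rule] = true
		}
	}
	return rs, sc.Err()
}

// publicSuffix returns the public suffix of domain under the parsed rules.
// Candidates are tried from longest to shortest, so the first hit is the
// longest matching rule. If nothing matches, the last label is returned,
// which corresponds to the implicit default rule "*".
func (rs *ruleSet) publicSuffix(domain string) string {
	labels := strings.Split(strings.ToLower(domain), ".")
	for i := range labels {
		candidate := strings.Join(labels[i:], ".")
		if rs.exception[candidate] {
			// An exception rule yields the rule minus its leftmost label.
			return strings.Join(labels[i+1:], ".")
		}
		if rs.plain[candidate] {
			return candidate
		}
		// "*.parent" matches any single label in front of parent.
		if i+1 < len(labels) && rs.wildcard[strings.Join(labels[i+1:], ".")] {
			return candidate
		}
	}
	return labels[len(labels)-1]
}

func main() {
	const sample = `
// a tiny excerpt in the list's format, for illustration only
com
uk
co.uk
*.ck
!www.ck
`
	rs, err := parseRules(sample)
	if err != nil {
		panic(err)
	}
	for _, d := range []string{"www.example.co.uk", "foo.bar.ck", "www.ck"} {
		fmt.Printf("%s -> %s\n", d, rs.publicSuffix(d))
	}
}
```

The map lookups rebuild a string per candidate, so this trades the compiled table's lookup speed and memory footprint for a format that can be swapped at runtime.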

It is also trivial for third parties like chromium or letsencrypt to
fork the package and run "go generate" at whatever cadence they want
to, although all this might lead to is having N replicated problems
instead of 1 centralized problem.

[0] https://publicsuffix.org/list/public_suffix_list.dat
[1] https://github.com/publicsuffix/list/commits/master
[2] https://go.googlesource.com/net/+log/master/publicsuffix
[3] https://go.googlesource.com/net/+/d58ca6618b994150e624f6888d871f4709db51a0%5E%21/#F0
[4] https://github.com/letsencrypt/boulder/issues/1374#issuecomment-182429297

js...@letsencrypt.org

Jun 8, 2016, 7:15:54 PM
to golang-dev, brad...@golang.org, dr.volke...@gmail.com
A lot of the increased pull request problem probably results from the fact that Let's Encrypt groups its rate limits based on the Public Suffix List. This has led to a number of free DNS providers that hadn't previously known about the PSL suddenly getting pressure from their customers to get added to the list. My apologies that this has had the unexpected knock-on effect of increasing pull request volume on this package.

Simone Carletti, a PSL maintainer, has written an alternate implementation at https://github.com/weppos/publicsuffix-go that can read directly from a public_suffix_list.dat, and is planning to send a Boulder pull request to incorporate that library: https://github.com/letsencrypt/boulder/issues/1479#issuecomment-224543735. Hopefully that will reduce the pressure on golang.org/x/net/publicsuffix to update frequently.

Andrew Gerrand

Jun 9, 2016, 1:19:38 AM
to Nigel Tao, golang-dev, Brad Fitzpatrick, Volker Dobler

On 8 June 2016 at 18:45, Nigel Tao <nige...@golang.org> wrote:
> golang.org/x/publicsuffix (gerrit + codereview)

If we're going for a new repo, this is what I recommend.

Nigel Tao

Jun 9, 2016, 5:25:13 AM
to js...@letsencrypt.org, golang-dev, Brad Fitzpatrick, Volker Dobler
On Thu, Jun 9, 2016 at 4:45 AM, <js...@letsencrypt.org> wrote:
> Simone Carletti, a PSL maintainer, has written an alternate implementation
> at https://github.com/weppos/publicsuffix-go that can read directly from a
> public_suffix_list.dat, and is planning to send a Boulder pull request to
> incorporate that library:
> https://github.com/letsencrypt/boulder/issues/1479#issuecomment-224543735.
> Hopefully that will reduce the pressure on golang.org/x/net/publicsuffix to
> update frequently.

That might relieve the update pressure from letsencrypt on
golang.org/x/net/publicsuffix, but the most recent CL I saw
(https://go-review.googlesource.com/#/c/23832/) came from a Chromium
developer, in response to a Chromium bug
(https://github.com/chromium/hstspreload/issues/79).

In any case, I've just submitted
https://go-review.googlesource.com/#/c/23930/ which discards the
for-debugging comments, shrinking the generated table.go file from
538K to 139K.

Dave MacFarlane

Jun 9, 2016, 3:27:45 PM
to Nigel Tao, golang-dev, Brad Fitzpatrick, Volker Dobler
On Wed, Jun 8, 2016 at 4:45 AM, Nigel Tao <nige...@golang.org> wrote:
> publicsuffix.org maintains a list of public suffixes, used to e.g.
> scope HTTP cookies.

> The golang.org/x/net/publicsuffix package contains a 'compiled'
> version of the plain text list [0]. That compiled version is more
> efficient at runtime, but not human-readable.

> The canonical upstream list gets ad hoc updates roughly once a week in
> recent times [1]. We (usually me, sometimes Brad or Volker)
> occasionally update the Go package, maybe once every few months [2]. I
> hesitate to do so more frequently because even a one line change to
> the upstream list pretty much changes the entire generated table ([3]
> is a typical diff), and I don't want to bloat the golang.org/x/net git
> repo with lots of essentially binary changes. I'm really only hand
> waving and guessing here, as I don't really know how git works under
> the hood, but for example, the generated table.go file currently
> weighs 528K. At [4], Brad (who's away for some weeks) said that each
> publicsuffix commit grows the x/net repo by 0.1MB, which isn't huge,
> but it would add up over time.


Internally git stores a zlib compressed version of the file contents in a file
under .git/objects named after the SHA1 hash of the file contents for every
version of every file tracked by git, but since it's a SHA1 hash it's not
duplicated if different files or commits have the same content.
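
To make the storage scheme concrete, here is a small sketch of how a file's object name and loose-object size can be computed. It relies on one detail not spelled out above: git hashes the contents prefixed with a short "blob <size>\0" header, not the bare contents. The function names (blobObject, blobID, looseObjectSize) are made up for this example, and the compressed size is only indicative, since git's zlib settings may differ from Go's defaults.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// blobObject returns the bytes git hashes and stores for a file:
// the header "blob <size>\x00" followed by the raw contents.
func blobObject(content []byte) []byte {
	header := fmt.Sprintf("blob %d\x00", len(content))
	return append([]byte(header), content...)
}

// blobID returns the hex SHA-1 object name for the given file contents.
// Identical contents always give the same name, which is why the same
// file is never stored twice.
func blobID(content []byte) string {
	sum := sha1.Sum(blobObject(content))
	return hex.EncodeToString(sum[:])
}

// looseObjectSize returns the size of the zlib-compressed object, using
// Go's default compression level (an approximation of what git writes).
func looseObjectSize(content []byte) (int, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(blobObject(content)); err != nil {
		return 0, err
	}
	if err := w.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	content := []byte("hello\n")
	id := blobID(content)
	size, err := looseObjectSize(content)
	if err != nil {
		panic(err)
	}
	// Loose objects live under .git/objects/<first 2 hex chars>/<remaining 38>.
	fmt.Printf(".git/objects/%s/%s (%d bytes compressed)\n", id[:2], id[2:], size)
}
```

Running blobID over two revisions of a generated file would give two unrelated names, so each revision is stored whole until git packs them.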

Occasionally, either when you run "git gc" or git decides to on its own, it takes the
seldom used objects and compresses them into a pack file, which is the
same as the format used to clone or fetch a repo over the wire. It's a file containing a
bunch of objects where each one is represented as either the same zlib
compressed entire contents, or as a (zlib compressed) binary delta from either
another object or a delta from an absolute offset in the pack file.

So if your "essentially binary" changes are localized to one part of the file (or
get repetitive inside the file itself in a way that compresses well), then it will eventually
get compressed in a way that's disk-space efficient. Regardless, git will use that compressed
version to be network efficient while cloning: the git client and server negotiate which objects
they have in common beforehand to minimize bandwidth. But you don't have any
control over when that will happen in other people's local repos. Otherwise, it'll continuously grow
at the rate that you're seeing forever.

(At least, that's my understanding from having tried to write a pure Go git client and
giving up before getting it to a usable state due to the complexity of git's command line.)
 
> Automatic or not, it might then make sense to move the package to its
> own dedicated git repo, instead of filling golang.org/x/net with noisy
> churn. If so, any bikeshedding opinions between
> golang.org/x/publicsuffix (gerrit + codereview) or
> github.com/golang/publicsuffix (vanilla git) or something else?

> Any other thoughts, golang-dev?

Speaking as a Go user, the fact that some things are golang.org/x/ and some things are github.com/golang/x has
always been weird and inconsistent to me. golang.org/x looks more "official", while I actually initially thought that 
github.com/golang was just a mirror.

- Dave

Travis Johnson

Jun 12, 2016, 11:09:28 AM
to golang-dev, nige...@golang.org, brad...@golang.org, dr.volke...@gmail.com
> So if your "essentially binary" changes are localized to one part of the file (or
> get repetitive inside the file itself in a way that compresses well)

They're not. As I understand it, the data being committed is basically already "compressed" in a sense, and because the line breaks are result-length based and not content-length based (think "gzip --rsyncable"), you end up with diffs that don't work well with git's (frankly black-magic-esque) diff compression. Though it does still get zlib'd, as you say.

The removal of the comments seems to have shrunk the file pretty dramatically, but if this is going to be updated regularly it may still be a concern. I wonder if using a different format to store the data may help? Like storing one node/literal per line (bigger/longer file, but equivalent results and better diffs?), though I don't fully understand the method being used in this so I can't speak as to other potential options.

Volker Dobler

Jun 12, 2016, 5:53:20 PM
to Travis Johnson, golang-dev, Nigel Tao, Brad Fitzpatrick
The labels are compressed (in the sense that overlapping labels are
combined into one large string). The actual table is stored in a form
which allows fast lookup; it is not "compressed", it is the direct memory
representation of the tree used to look up the public suffix of a domain.
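
As a rough illustration of the "overlapping labels combined into one large string" idea (a simplified greedy sketch, not the package's actual generator; packLabels and span are hypothetical names): each label becomes an (offset, length) pair into a shared string, and a label that already appears inside the string costs no extra bytes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// span locates one label inside the shared text.
type span struct {
	offset, length int
}

// packLabels combines labels into one string and returns where each label
// lives in it. Longer labels are placed first so that shorter ones can be
// found inside them; otherwise the label is appended, reusing any suffix of
// the text that is also a prefix of the label.
func packLabels(labels []string) (string, map[string]span) {
	sorted := append([]string(nil), labels...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return len(sorted[i]) > len(sorted[j])
	})
	text := ""
	spans := make(map[string]span, len(sorted))
	for _, label := range sorted {
		if i := strings.Index(text, label); i >= 0 {
			spans[label] = span{offset: i, length: len(label)}
			continue
		}
		// Find the longest overlap between the end of text and the start of label.
		k := len(label)
		if len(text) < k {
			k = len(text)
		}
		for k > 0 && !strings.HasSuffix(text, label[:k]) {
			k--
		}
		spans[label] = span{offset: len(text) - k, length: len(label)}
		text += label[k:]
	}
	return text, spans
}

func main() {
	labels := []string{"com", "blogspot", "co", "spot", "pot"}
	text, spans := packLabels(labels)
	total := 0
	for _, l := range labels {
		s := spans[l]
		total += len(l)
		fmt.Printf("%-8s offset=%d length=%d -> %q\n", l, s.offset, s.length, text[s.offset:s.offset+s.length])
	}
	fmt.Printf("shared text uses %d bytes instead of %d\n", len(text), total)
}
```

This also shows why a one-line change upstream can shift many offsets at once: adding or removing a label can move everything placed after it in the shared string.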

The source was never meant to be optimised, only the memory footprint
of the list during runtime (and the time spent for a lookup).
A more efficient, more diffable, or more compressible source code
representation would probably require building up the internal representation
of the list from the source-code-efficient data either during init or on
first use. I would consider this worse than the bytes wasted because git
cannot compress the source well.
 
V.

lga...@chromium.org

May 22, 2018, 4:50:10 AM
to golang-dev
Following up on this: this issue prevented preloading HSTS for 1password.com using hstspreload.org, because the data in golang.org/x/net/publicsuffix is 3.5 months out of date.

On the other hand, it's nice that I haven't seen this problem come up in the last two years!