Public Suffix Tree?

Ryan Williams

Apr 15, 2024, 8:40:28 PM

to psl-discuss

I have started using the Public Suffix List for my own web application, as a way to reliably parse a hostname into its parts (subdomains, private domain, public suffix). In order to accomplish this, I parse the Public Suffix List and transform it into a Tree-like JSON object.

The structure is simple: The root object contains keys corresponding to the Top Level Domains found in the Public Suffix List. Each of these keys leads to another object containing all of the Second Level Domains that are found under that TLD, and so on and so forth.

Wildcard entries are registered under the key '*'. Don't register any child nodes under a wildcard node.

For an exception, like "!except", register the domain label without the exclamation ("except"), but also add a special property to that node to mark it as an exception (`@exception: true`). Also don't register any child nodes under an exception node.

For a node that corresponds to the left-most label in a rule, you can likewise add a special property to demarcate it (`@leaf: true`). All wildcard and exception entries will also be `@leaf` nodes.
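The construction described above could be sketched roughly as follows. This is a minimal illustration in Python, not the code from my application: the function name `build_tree` is made up for the example, and it assumes each rule is the first whitespace-delimited token on a line, that lines starting with `//` are comments, and that wildcards only appear as the left-most label of a rule.

```python
def build_tree(lines):
    """Turn Public Suffix List lines into a nested dict (hypothetical helper)."""
    root = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and comment lines
        rule = line.split()[0].lower()

        is_exception = rule.startswith("!")
        if is_exception:
            rule = rule[1:]  # register the label without the exclamation mark

        # Walk the labels right to left: TLD first, left-most label last.
        node = root
        for label in reversed(rule.split(".")):
            node = node.setdefault(label, {})

        # The node for the left-most label of the rule is marked as a leaf.
        node["@leaf"] = True
        if is_exception:
            node["@exception"] = True
    return root


sample_rules = ["// sample rules", "com", "uk", "co.uk", "*.ck", "!www.ck"]
tree = build_tree(sample_rules)
```

With these sample rules, `tree["uk"]["co"]` is a `@leaf` node, `tree["ck"]["*"]` holds the wildcard, and `tree["ck"]["www"]` carries both `@leaf` and `@exception`.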

Once this transformation is done, it becomes trivial to look up the public suffix of a hostname. Simply break the lowercased hostname into its parts at the delimiter (`.`), and then traverse the Public Suffix Tree as far as possible using these parts, starting from the TLD. As you traverse, keep track of the last `@leaf` node encountered and its depth.

If you have matched as far as possible and the current `@leaf` node is not an exception, check to see if there is a wildcard '*' entry under that leaf node. If so, update your results accordingly. In either case, you now know the depth of the public suffix for the hostname in question. The private domain is at that depth + 1, and any subdomains are found at any depth greater than that.
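A rough Python sketch of that lookup, for illustration only. The name `parse_hostname` and the small `TREE` literal are made up for the example. It also makes a few assumptions beyond what is described above: it checks for a '*' child at every step rather than only at the end (so a wildcard still matches when its parent label is not itself a rule), it treats an exception as placing the public suffix one label above the exception node, and it falls back to the TLD when no rule matches, as the published PSL algorithm does.

```python
# Hypothetical miniature tree; a real one would be built from the full list.
TREE = {
    "com": {"@leaf": True},
    "uk": {"@leaf": True, "co": {"@leaf": True}},
    "ck": {"*": {"@leaf": True}, "www": {"@leaf": True, "@exception": True}},
}


def parse_hostname(tree, hostname):
    """Split a hostname into subdomains, private domain and public suffix."""
    labels = hostname.lower().rstrip(".").split(".")
    suffix_len = 0  # number of labels that make up the public suffix
    node = tree

    # Traverse from the TLD leftwards, one label per level of the tree.
    for depth, label in enumerate(reversed(labels), start=1):
        if "*" in node:
            suffix_len = depth  # a wildcard rule covers this label
        child = node.get(label)
        if not isinstance(child, dict):
            break  # no deeper match in the tree
        if child.get("@exception"):
            suffix_len = depth - 1  # exception: suffix ends one label above
            break
        if child.get("@leaf"):
            suffix_len = depth
        node = child

    if suffix_len == 0:
        suffix_len = 1  # assumed default when nothing matches: the TLD

    has_private = len(labels) > suffix_len
    return {
        "public_suffix": ".".join(labels[-suffix_len:]),
        "private_domain": labels[-(suffix_len + 1)] if has_private else None,
        "subdomains": labels[:-(suffix_len + 1)] if has_private else [],
    }
```

For example, `parse_hostname(TREE, "www.Example.co.uk")` yields the public suffix `co.uk`, the private domain `example`, and the subdomains `["www"]`.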

I have not had a need to keep track of comments for my application, but could probably work the comments into the structure as well by adding a `@comment` attribute to leaf nodes, or perhaps a `@commentId`, and keeping a separate hash or array of comments that can be used to look up the corresponding comment contents.

-----------------------------

Now, the nice thing about this Public Suffix Tree structure is not only that it can easily be used to parse the public suffix of a hostname, but it can also be readily serialized and deserialized as a simple, standard JSON object. This means that the Public Suffix List could be pre-processed into the Public Suffix Tree structure and be made available for download the same way that the Public Suffix List is now. And since modern programming languages have built-in support for parsing JSON, this makes it that much easier to make use of the Public Suffix Tree.
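To illustrate the serialization point, here is a small Python sketch of the round trip using only the standard `json` module. The miniature `tree` and the helper names `dump_tree` and `load_tree` are made up for the example; no published .json file or format exists yet.

```python
import json

# Hypothetical miniature tree; a real one would be generated from the full list.
tree = {
    "com": {"@leaf": True},
    "uk": {"@leaf": True, "co": {"@leaf": True}},
    "ck": {"*": {"@leaf": True}, "www": {"@leaf": True, "@exception": True}},
}


def dump_tree(tree):
    """Serialize the tree to JSON text with stable key order."""
    return json.dumps(tree, indent=2, sort_keys=True)


def load_tree(text):
    """Parse JSON text back into the nested dict structure."""
    return json.loads(text)


restored = load_tree(dump_tree(tree))
```

Because the tree uses only nested objects and booleans, `restored` compares equal to `tree` and no custom encoder or decoder is needed.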

I would like to propose that we perform such pre-processing and make the Public Suffix Tree .json document available for download alongside the existing Public Suffix List .dat document.

Message has been deleted

Simone Carletti

Jun 15, 2024, 12:33:03 PM

to psl-discuss

I kept this thread in my TODO list as, being both an implementer and a Ruby coder, I am particularly interested in the subject.
Any idea what happened to the other conversations on this thread? I see they have been deleted.

The current format is indeed a bit challenging to parse, but adopting any optimized/compressed version could be even more challenging for consumers. As Simon pointed out, most consumers these days manipulate the list within their own library, and the format can vary significantly depending on the language conventions and the project itself. Just a few examples:

Mozilla/Chromium compile it into a static list strongly optimized for their need
Go x/net/publicsuffix lib compiles it into a heavily compressed zip-like format, optimizing for performance and lookup speed, with less focus on frequent updates
Go publicsuffix-go lib compiles into native Go code, optimizing for compilation, maintenance, and performance
libpsl pre-processes it in a format suitable for C code

The list goes on and on. JSON is an easy-to-read format, but not a very common one for libraries that focus on optimization. While using a Tree is a good approach, it is actually not the most effective one. Should we ever, finally, get to a point where we provide a slightly more structured format than TEXT, I would vote for keeping the simplest structure that represents a list of items that any lib can easily load and pre-process as it wants, without having to un-process a pre-processed list.

My two cents.

Best,

-- Simone

Simon Friedberger

Jul 26, 2024, 5:11:48 AM

to psl-discuss

JSON parsers are available everywhere and developers can trivially import a JSON PSL. However, how are you going to determine if something is a public suffix from the JSON? You'll have to implement lookups in the PSL and take exceptions into account (or not if you're lazy, because there are only 8 of them) and now you might just be better off using a library anyway.

We might actually want to not provide JSON to encourage use of good libraries instead of everybody hacking their own thing.

The PSL is just a list, as Jothan likes to say, and that is one of our problems: the interpretations vary. This is one of the reasons why it is so hard to fix some of our problems, like how to more efficiently encode Amazon's long list of domains.