implementation experience and questions

15 views
Skip to first unread message

Marc Lehmann

unread,
Oct 12, 2018, 8:13:04 PM10/12/18
to libps...@googlegroups.com
Hi!

I just used libpsl in various places (perl interface, C++), and wanted to
share my experiences with adopting libpsl.

First of all, I looked at https://github.com/rockdaboot/libpsl for
documentation, and found some links there to be outdated (e.g. mailing list
archive, public suffix download link).

When trying to use the library, I ran into multiple issues, some of which
I could only resolve by studying the source code or running tests.

* it's not clear that psl_load_file can load both dafsa files and the actual
public suffix list, the documentation doesn't mention either.

* psl_free is only documented to work for pointers returned by psl_load_file
&c, which requires users to track whether the pointer is 0, returned by
psl_builtin, or loaded in addition to the context value.
looking at the source, psl_free works fine for pointers returned by
psl_builtin, and 0 as well. it would be great if this could be officially
supported in the future.

* this is similarly true for many other functions which have safe defaults
for NULL psl_ctx_t *. since the library already does the overhead for
testing for this, maybe this could be officially supported?

* alternatively, psl_builtin currently always returns the same pointer. if true,
it would be nice if it were documented, so one can compare against it
before calling psl_free.

* dafsa files are not only much smaller, but also much much slower (in my
tests, about 4-5 times compared to a text list). this surprised me, I
expected dafsa to be faster. It would be cool if this were documented
somehow, so one can make a proper trade-off between speed and file/memory
size.

* overall, libpsl feels surprisingly slow (it parses a few million domains
per second on a fast machine with a text list, much less with dafsa) for
what it does. maybe it has to be so slow, but to my not-very-informed
self, it feels it could be much faster.

* it would be somewhat nice if there was a programmatic way other than to
call an external python script to create the dafsa blob, although I can
understand that the effort to do this would be very high (and it isn't an
issue for me, as I have to avoid it for performance reasons).

* I need the public suffix list for various applications, but most of them
need to split domains into three parts, the part before the registrable
domain, the middle part and the unregistrable domain part. Currently, this
requires two function calls, which compounds the speed problem. Looking at
the source, libpsl only supports a limited subset of the public suffix list
rules though. so I guess I could rely on that and assume the middle part
is exactly one label long. Might be useful if this were documented.
(also, if libpsl is ever updated to support the full psl list syntax, it
might be nice to have a function that can return the wildcard parts of
the domain in some way).

* It seemsthis code is redundant in _psl_is_public_suffix:

#if defined(WITH_LIBIDN) || defined(WITH_LIBIDN2) || defined(WITH_LIBICU)
if (psl == &_builtin_psl)
need_conversion = 0;
#endif

Last not least, despite the above I was able to quickly make use of libpsl
and, given the source, could work around the problems, so take the above
as a simple "implementation experience" report by some random user of the
library.

I also have two questions:

* as mentioned earlier, libpsl currently only (seems to) support a small
subset of the public suffix set rule syntax, namely rules with at most
one "*", and it must be in the beginning. Are there plans to support
the current file format in the future or will libpsl simply misbehave
at some point in the future? (Maybe I misread the source code here, my
source for the current syntax is https://publicsuffix.org/list/, and it
seems the current version of the list doesn't make use of rules that libpsl
does not implement)

* The libpsl source code uses a very large number of C reserved identifiers
(basically everything starting with "_"). Are there plans to make libpsl
more portable by porting it to ISO C?

Thanks for providing and maintaining libpsl (and thanks for not forgetting
static for internal identifiers, as so many libraries do these days :).

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sch...@schmorp.de
-=====/_/_//_/\_,_/ /_/\_\

Darshit Shah

unread,
Oct 12, 2018, 8:51:50 PM10/12/18
to libps...@googlegroups.com
Hi Marc,

Interesting set of points and thanks for the amazing write up of issues you faced. This is indeed quite useful.

Since, I'm on phone I wont comment on the other things right away, but could you please share the code you used to benchmark dafsa vs. plain text lists? Because the DAFSA should indeed be faster. I will run some tests tomorrow myself, but having concrete numbers and a set of code that causes troubles could be very useful.

Also, what machine and operating system are you running on? Since one major goal with libpsl is indeed performance.
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Marc Lehmann

unread,
Oct 12, 2018, 9:35:41 PM10/12/18
to Darshit Shah, libps...@googlegroups.com
On Sat, Oct 13, 2018 at 12:51:47AM +0000, Darshit Shah <dar...@gmail.com> wrote:
>
> Since, I'm on phone I wont comment on the other things right away, but could you please share the code you used to benchmark dafsa vs. plain text lists? Because the DAFSA should indeed be faster. I will run some tests tomorrow myself, but having concrete numbers and a set of code that causes troubles could be very useful.

I don't think it's useful to you, I simply ran a few one-liner benchmarks
with a number of domains via my perl interface, doing 1 million calls,
e.g.:

perl -MEV -Mxqp::psl -e 'xqp::psl::ctx_default "public_suffix_list.dat"; $x=EV::time;xqp::psl::registrable_domain "my.abel.dyndns.org.now.sh" for 1..1e6; warn EV::time-$x'

I varied the domain to be something from beginning or end of the public
suffix list, in case that makes a difference and either used the public
suffix list downloaded today (text) or the builtin one (presumably
dafsa). A typical result after stabilising (cpu power save &c) would be:

* 0.41s for the text based list
* 1.72s for the builtin ruleset
* 0.19s with NULL ruleset, measuring function call overhead, which is quite
high due to going via perl, a script language.

So, it's not as slow as I assumed at first (I didn't take the function
call overhead into account), but still, I feel it could be faster, and the
builtin rules are very much slower.

The reason why I think it's unnaturally slow is because I initially
considered rolling my own solution with libhyperscan (a pcre-like regex
engine), and to test the achievabke performance, I matched a mock regex
with 15000+ simple domains, basically:

(aaa|aab|...|zzz|dyndns\.org)\.([^.]+)$

against the above domain, and the benchmark takes 0.46s. If I reverse the
domain and match with an ^ anchor (the regex is not exactly< reversed, but
that doesn't change the results), i.e.:

^([^.]+)\.(aaa|aab|...|zzz|dyndns\.org)

Then perl does it in 0.17s, including perl overhead. Of course, the mock
regex is simpler than a real regex would be (especially when taking
exceptions into account - libhyperscan can match millions of regexes
concurrently though, so it should be doable - especially given the simpler
rule subset that libpsl implements), but still, I would not expect a
generic regex engine in a script language to outperform a specialised
algorithm in libpsl - and perl can even give me both registrable and
unregistrable domain offsets in one go (libhyperscan not so much :).

In the end, the headache of writing my own vs. using libpsl was too
much of a premature optimisation to seriously consider, so libpsl wins
hands down until I run into actual performance issues.

> Also, what machine and operating system are you running on? Since one major goal with libpsl is indeed performance.

Standard PC with i7-4790K @4.7GHz and Debian GNU/Linux 9/10. I used the
Debian libpsl5 package version 0.20.2-1 from testing.

Tim Rühsen

unread,
Oct 13, 2018, 4:34:26 PM10/13/18
to Marc Lehmann, libps...@googlegroups.com
Hi Marc,

great and useful write up. Thanks for having taken time for it.

Let me answer in short

* the links have been fixed by now

* the docs will be amended as you suggest

* DAFSA performance is lower due to it's compressed format, see
src/lookup_string_in_fixed_set.c for details. Loading the non-DAFSA PSL
from file stores it into a sorted array which allows binary search,
which seems to be a bit faster.
I won't do anything here since browsers and other web clients normally
do not need more then 10-20 lookups per second. If you have a special
application that continuously needs 1M lookups / s or more, please
optimize the code as needed and send us a patch. Or let us know the
details + test data...

* creating a fast tool for DAFSA encoding would be nice and has been
discussed. I simply don't have the time to work on it. Such a tool would
also be relevant for libhsts DAFSA encoding.

* "full psl list syntax" with any kind of wildcard/star has been
discussed upstream and has been dropped. There is no need for it and a
change in the PSL would break most tools out there.
But if you see a real life feature is missing, let us know (best is to
open an issue at Github.)

* "The libpsl source code uses a very large number of C reserved
identifiers". The _PSL_FLAG_* defines or is there anything more ? NP to
change that, but is there a real life issue with them ? On which system
do you have a portability issue with it ?

Let me know if I missed a point.

Again, thanks for your hints and suggestions !

Regards, Tim
signature.asc

Marc Lehmann

unread,
Oct 13, 2018, 8:23:15 PM10/13/18
to Tim Rühsen, libps...@googlegroups.com
On Sat, Oct 13, 2018 at 10:34:25PM +0200, Tim Rühsen <tim.r...@gmx.de> wrote:
> * DAFSA performance is lower due to it's compressed format, see

So it's a classical memory/speed trade-off, that's fine with me, basing
my applications on the text list when performance is needed makes things
much simpler anyway, so win-win for me.

My issue was merely that this was not mentioned anywhere, it seems, and would
be highly useful to know :)

> I won't do anything here since browsers and other web clients normally
> do not need more then 10-20 lookups per second. If you have a special
> application that continuously needs 1M lookups / s or more, please

I figured as much. I also don't need 1M lookups/s, but even if I only need
100k/s, then libpsl already takes 20% of my available time budget if it
does 1M/s, for something that probably isn't the main job of my app (which
is dns analysis and involves hahs lookups, dns decoding and a lot more).

> optimize the code as needed and send us a patch. Or let us know the
> details + test data...

As mentioned, if I ever need it, I would base my solution on libhyperscan.
The fact that the simplified syntax seems to be the official syntax will
majorly help here. This solution would probably not be applicable to
psl, though, as libhyperscan is another (big-ish, maybe less portable)
dependency, and there is a place in the world for a libpsl that isn't
hyperfast but compact and easy to use...

> * creating a fast tool for DAFSA encoding would be nice and has been
> discussed. I simply don't have the time to work on it. Such a tool would
> also be relevant for libhsts DAFSA encoding.

I think the issue at hand is not the speed, but ease of use - not having to
run an external python interpreter (and having it on your system in the first
place) from your C application would help *iff* you need DAFSA.

Now, since DAFSA is actually slower (on purpose), I think this is a
non-issue, since the only real reason to use it is when you have some kind
of distribution scheme, in which case the dependency probably isn't an
issue: every systems where you can have python as a dependency should be
fine with having the parsed text list in memory instead.

It's certainly not an issue for me at all. It would be more of an issue if
dafsa gave me speed improvements, in which case it would be tempting to
use it :)

> * "full psl list syntax" with any kind of wildcard/star has been
> discussed upstream and has been dropped.

Ah, upstream is mozilla? That is great news, because I can't imagine a
use case where multiple non-contiguous wildcards would be used, and it
of course greatly complicates both implementations and their API. I hope
mozilla updates their spec, so that people don't implement it in vain.

> But if you see a real life feature is missing, let us know (best is to
> open an issue at Github.)

As far as I analyzed the public suffix list, there are no rules with more
than one wildcard, and it is always at the beginning, so I think the
current psl should work with libpsl.

However, if this is the case, it again would be great if this were
documented, so users of psl_(un)registrable_domain would know that the
"registrable" domain part is always exactly one label longer than the
unregistrable domain part.

Thinking about it, this hasn't anything to do with wildcards at all, it's
basically how psl defines registrable domain.

Still, would be nice if it were spelled out explicitly.

> * "The libpsl source code uses a very large number of C reserved
> identifiers". The _PSL_FLAG_* defines or is there anything more

Anything starting with _ is a reserved identifier in C (e.g. _utf8_to_utf32,
_psl_idna_t, _vector_get...).

I haven't run into problems yet, but I will only ever use it on GNU/Linux
systems.

If you wait long enough and have enough platforms, it is, in my experience,
just a matter of time until you get clashes, especially with generic names
such as _vector_alloc or _utf8_to_utf32.

Given that the cost of _not_ using reserved identifiers is trivial, my
_personal_ choice would always be to not bother with them, but I don't have
to maintain libpsl (and I hope even if upstream goes away, debian will
probably maintain it a bit more for me, so it isn't an isue for me).

Anyways, thanks again, for providing and maintaining libpsl, it
probably saved me a few days of implementation and debugging, and, more
importantly, removes one more thing I have to maintain.

Marc Lehmann

unread,
Oct 13, 2018, 8:34:01 PM10/13/18
to Tim Rühsen, libps...@googlegroups.com
On Sun, Oct 14, 2018 at 02:23:13AM +0200, Marc Lehmann <sch...@schmorp.de> wrote:
> _psl_idna_t, _vector_get...).

I forgot to mention, _psl_idna_t is even "doubly" reserved, both due to
leading underscore (C), and trailing _t (stdint.h, POSIX).

Tim Rühsen

unread,
Oct 14, 2018, 5:39:34 AM10/14/18
to Marc Lehmann, libps...@googlegroups.com
On 14.10.18 02:23, Marc Lehmann wrote:
> On Sat, Oct 13, 2018 at 10:34:25PM +0200, Tim Rühsen <tim.r...@gmx.de> wrote:
>> optimize the code as needed and send us a patch. Or let us know the
>> details + test data...
>
> As mentioned, if I ever need it, I would base my solution on libhyperscan.

Not sure how to use regexes when there are multiple sections in the PSL,
e.g. if you want to lookup an ICANN domain only.

Also there are exceptions from * rules, like
*.kawasaki.jp
!city.kawasaki.jp

There are rules with "gaps", like
.name is a public suffix
his.name is *not* a public suffix
forgot.his.name is a public suffix

I am not a real regex expert and here I don't know a regex solution
apart from scanning multiple times.

But maybe an inexact lookup is sufficient for your use case.


> It's certainly not an issue for me at all. It would be more of an issue if
> dafsa gave me speed improvements, in which case it would be tempting to
> use it :)

One starting point would be to 'uncompress' the current DAFSA files into
a more C-like structure. That *should* be much faster than the binary
search.


>> * "full psl list syntax" with any kind of wildcard/star has been
>> discussed upstream and has been dropped.
>
> Ah, upstream is mozilla? That is great news, because I can't imagine a
> use case where multiple non-contiguous wildcards would be used, and it
> of course greatly complicates both implementations and their API. I hope
> mozilla updates their spec, so that people don't implement it in vain.

Upstream repo is https://github.com/publicsuffix/list.


>> But if you see a real life feature is missing, let us know (best is to
>> open an issue at Github.)
>
> As far as I analyzed the public suffix list, there are no rules with more
> than one wildcard, and it is always at the beginning, so I think the
> current psl should work with libpsl.

This is the convention I agreed upon with the maintainers.
BTW, libpsl is used in their Travis CI to check for compliance of any
changes made to the PSL.


> However, if this is the case, it again would be great if this were
> documented, so users of psl_(un)registrable_domain would know that the
> "registrable" domain part is always exactly one label longer than the
> unregistrable domain part.
>
> Thinking about it, this hasn't anything to do with wildcards at all, it's
> basically how psl defines registrable domain.
>
> Still, would be nice if it were spelled out explicitly.
>

See above, wrong assumption.


>> * "The libpsl source code uses a very large number of C reserved
>> identifiers". The _PSL_FLAG_* defines or is there anything more
>
> Anything starting with _ is a reserved identifier in C (e.g. _utf8_to_utf32,
> _psl_idna_t, _vector_get...).

Not sure where this myth comes from, it's only about #defines and _ with
uppercase letter. Apart from that, identifiers with leading _ are file
scope.

From the C99 standard:

7.1.3 Reserved identifiers

Each header declares or defines all identifiers listed in its associated
subclause, and optionally declares or defines identifiers listed in its
associated future library directions subclause and identifiers which are
always reserved either for any use or for use as file scope identifiers.

All identifiers that begin with an underscore and either an
uppercase letter or another underscore are always reserved for any use.
All identifiers that begin with an underscore are always reserved
for use as identifiers with file scope in both the ordinary and tag name
spaces.
Each macro name in any of the following subclauses (including the
future library directions) is reserved for use as specified if any of
its associated headers is included; unless explicitly stated otherwise
(see 7.1.4).
All identifiers with external linkage in any of the following
subclauses (including the future library directions) are always reserved
for use as identifiers with external linkage.154
Each identifier with file scope listed in any of the following
subclauses (including the future library directions) is reserved for use
as a macro name and as an identifier with file scope in the same name
space if any of its associated headers is included.

No other identifiers are reserved. If the program declares or defines an
identifier in a context in which it is reserved (other than as allowed
by 7.1.4), or defines a reserved identifier as a macro name, the
behavior is undefined.

If the program removes (with #undef) any macro definition of an
identifier in the first group listed above, the behavior is undefined.


Regards, Tim

signature.asc

Marc Lehmann

unread,
Oct 14, 2018, 10:11:42 AM10/14/18
to Tim Rühsen, libps...@googlegroups.com
On Sun, Oct 14, 2018 at 11:39:28AM +0200, Tim Rühsen <tim.r...@gmx.de> wrote:
> On 14.10.18 02:23, Marc Lehmann wrote:
> > As mentioned, if I ever need it, I would base my solution on libhyperscan.
>
> I am not a real regex expert and here I don't know a regex solution
> apart from scanning multiple times.

None of this should be an obstacle - as I see it, given the list of
matches, you just have to prioritize them (when loading), by length and
whether it was an exception or not.

Or to see it another way, you should be able to make a corretc decision by
knowing which of the entries match.

> > It's certainly not an issue for me at all. It would be more of an issue if
> > dafsa gave me speed improvements, in which case it would be tempting to
> > use it :)
>
> One starting point would be to 'uncompress' the current DAFSA files into
> a more C-like structure. That *should* be much faster than the binary
> search.

That indeed sounds like a lot of work :)

> > Ah, upstream is mozilla? That is great news, because I can't imagine a
>
Hmm, doesn't seem to have a spec, the official place for the spec seems to be
publicsuffix.org then.

> > As far as I analyzed the public suffix list, there are no rules with more
> > than one wildcard, and it is always at the beginning, so I think the
> > current psl should work with libpsl.
>
> This is the convention I agreed upon with the maintainers.
> BTW, libpsl is used in their Travis CI to check for compliance of any
> changes made to the PSL.

Both give me a very safe feeling then :)

> > However, if this is the case, it again would be great if this were
> > documented, so users of psl_(un)registrable_domain would know that the
> > "registrable" domain part is always exactly one label longer than the
> > unregistrable domain part.
> >
> > Thinking about it, this hasn't anything to do with wildcards at all, it's
> > basically how psl defines registrable domain.
> >
> > Still, would be nice if it were spelled out explicitly.
>
> See above, wrong assumption.

I am sorry, I can't find what you are referring to. So basically,
registrable/unregistrable domain can give me points differeing by more than
one label?

That means speed has just been halved, although I don't see that in the
libpsl code. Is this a future thing?

> > Anything starting with _ is a reserved identifier in C (e.g. _utf8_to_utf32,
> > _psl_idna_t, _vector_get...).
>
> Not sure where this myth comes from, it's only about #defines and _ with
> uppercase letter. Apart from that, identifiers with leading _ are file
> scope.

The latter covers all my examples, though, does it not? All the examples
are file scope identifiers with leading _, and therefore reserved by the
rules you even quoted. In addition, trailing _t is reserved under certain
circumstances, e.g. when stdint.h is included. You don't include that
directly, but since libpsl uses uint8_t and so on, it's likely that some
of the libraries you use include it for you.

> All identifiers that begin with an underscore are always reserved
> for use as identifiers with file scope in both the ordinary and tag name
> spaces.

Maybe you misunderstand the word "reserved" here - it doesn't mean they
are reserved for use in libpsl or other programs in file scope, it means
they are reserved for the implementation/language, but only for file scope
(so header files won't #define _vector_get, but they might have a function
called _vector_get).

_psl_idna_t is in the tag namespace, and _vector_get is in file scope,
both contexts in which identifierfs with leading underscores are reserved.

This is made explicit in the part you quoted as well:

> If the program declares or defines an identifier in a context in which
> it is reserved (other than as allowed by 7.1.4), or defines a reserved
> identifier as a macro name, the behavior is undefined.

So, it's not so much a myth but hard, cold, reality that libpsl invokes
undefined behaviour with all these identifiers.

Again, it's trivial to avoid, so why not make it correct C.

Marc Lehmann

unread,
Oct 14, 2018, 9:11:09 PM10/14/18
to libps...@googlegroups.com
Hi!

This is possibly not on-topic for this list, but after the rather fruitful
discussion I had, I was curious what kind of performance *really* could be
had with a regex, and came up with this demo:

http://data.plan9.de/mypsl

This is a quick pure-perl hack that effectively implements psl parsing and
execution, at least in spirit. It is (on my machine, with my perl) about
25% faster than calling libpsl in "text mode", and more than five times
faster than libpsl with builtin dafsa, using a pcre/perl-compatible regex.

It's not meant as a replacement algorithm for libpsl in any way,
obviously, it's just me playing around to see how libpsl performance
compares to a non-optimised (for this case) script language regex engine.

It basically builds a mammoth regex (114kb text size, 50kb compressed
with lzf - not the best or fastest compression algorithm, but it has a
very tiny code size) that first matches the exception list, then the
non-exception entries, longest first. It matches on the lowercased,
reversed domain, doesn't care about IDNs, and only reports whether a
domain is matched by an exception, or the longest match. Computing actual
domains from this is somewhat trivial in C, but somewhat slow in perl, so
I didn't bother with it.

In short, I think it computes enough information to trivially compute the
unregistrable domain length.

It is probably also completely buggy - it's really just meant for a speed
comparison :)

Playing around with it gave me some insight into the existing
performance. I think using a regex engine, while competitive to non-dafsa
and certainly much faster than dafsa mode (at higher memory cost, without
doubt) is not the best match, and dafsa might not be, either.

If I really wanted to get higher performance, I would try a tree of
hashtables by label, going backwards, which can exploit the fact that for
the psl always match whole labels, something that dafsa cannot exploit (it
is character by character).

I am just guessing here, but it's likely that this could rival the size of
the dafsa blob, and would almost certainly be faster than using a generic
regex engine which is already way faster than dafsa mode. It would also
likely be relatively easy to implement in C (the hash table being the
biggest concern).

Anyway, and again, this is not meant as criticism for libpsl, or a request
to change it in some way, but merely as a performance comparison of what
is possible (I was simply curious), and maybe it will inspire somebody or
something.

Greetings,

PS: I will likely unsubscribe soon from this list. Thanks for answering my
questions and improving the documentation!
Reply all
Reply to author
Forward
0 new messages