On Thursday 03 July 2014 21:38:56 Rockdaboot wrote:
> ---------- Forwarded message ----------
> From: Daniel Kahn Gillmor <d...@fifthhorseman.net>
> Date: 03.07.2014 18:36
> Subject: Re: libpsl 0.5.0 released
> To: Tim Ruehsen <rockd...@gmail.com>
> Cc:
>
> > (replying offlist, since your message to me was offlist; i have no
> > objection to any of this being posted publicly)
Me fighting with the mobile client ;-)
> > On 07/03/2014 12:17 PM, Tim Ruehsen wrote:
> > > I would like to use libicu runtime since it has the most complete
> > > features
> > > and seems pretty common on a standard debian install.
> > > The builtin library doesn't matter... All 3 libs generate the same data
> > > currently.
> > > What do you think?
> >
> > in debian, the packages have these priorities:
> >
> > 0 dkg@alice:~$ for x in libidn11 libidn2-0 libicu52; do printf "%s " $x; apt-cache show $x | grep Priority | head -n1; done
> > libidn11 Priority: standard
> > libidn2-0 Priority: extra
> > libicu52 Priority: optional
> > 0 dkg@alice:~$
> >
> > the priority ordering goes:
> > required > important > standard > optional > extra
> > see: https://www.debian.org/doc/debian-policy/ch-archive.html#s-priorities
> >
> > So libidn is most likely to be available everywhere, but if you think
> > the featureset for libicu will be superior, i have no problem going with
> > that (i confess i don't understand the specific tradeoffs).
$ apt-rdepends --reverse --follow=Depends libidn2-0 2>/dev/null | awk '/^[^ ]/' | wc -l
4
$ apt-rdepends --reverse --follow=Depends libidn11 2>/dev/null | awk '/^[^ ]/' | wc -l
3767
$ apt-rdepends --reverse --follow=Depends libicu52 2>/dev/null | awk '/^[^ ]/' | wc -l
2558
libidn11 is tiny in size, but IDNA2003 is very outdated.
libidn2-0 is also tiny in size, but IDNA2008 is outdated and has
incompatibilities with IDNA2003. It is not used by any Debian package (except
the idn2 and libidn2-0* packages).
libicu is huge, but has IDNA2008 UTS#46 (TR46). It also has iconv
functionality built in, as well as case conversion.
Using iconv() from the libc6 will use (more or less) the same data tables as
libicu does (found on amd64 in /usr/lib/x86_64-linux-gnu/gconv/).
I have to read more about the C90 standard Unicode conversion functions. They
always use the current locale/encoding, so I am not sure how this can be
handled in multi-threaded code when threads need to handle different encodings
at the same time, especially when the input comes from different sources with
different encodings. I guess multithreading was not on anyone's mind in the
1980s. But it would be cool to have psl_str_to_utf8lower() use only standard
functions.
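To illustrate the concern, here is a minimal sketch of the alternative design:
pass the source encoding explicitly with every call instead of reading it from
the process-wide locale. It is written in Python only for brevity.
to_utf8_lower() is a hypothetical helper, not the libpsl function, and
Python's str.lower() only approximates the case mapping an IDNA library does.
Since no global state is consulted, each thread can handle its own encoding.

```python
from concurrent.futures import ThreadPoolExecutor


def to_utf8_lower(raw: bytes, encoding: str) -> bytes:
    """Decode raw with an explicitly named encoding, lowercase, re-encode as UTF-8.

    Hypothetical helper for illustration; it never consults the locale.
    """
    return raw.decode(encoding).lower().encode("utf-8")


# The same domain name arriving from sources that use different encodings.
inputs = [
    ("Übel.com".encode("iso-8859-1"), "iso-8859-1"),
    ("Übel.com".encode("utf-8"), "utf-8"),
    ("Übel.com".encode("utf-16-le"), "utf-16-le"),
]

# Each worker thread converts its own input independently of the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda item: to_utf8_lower(*item), inputs))

print(results)
```

All three inputs should come out as the same UTF-8 byte string, regardless of
which thread handled which encoding.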
> > I definitely think it would be wise to use the same library for runtime
> > as for builtin, though, as fun as it would be to learn about library
> > incompatibilities that way :)
Right now the PSL contains data that leads to exactly the same output for
libidn, libidn2 and libicu. That might change when the PSL is extended.
I configured Travis CI to check all combinations of libraries for runtime and
builtin - I should also extend test-is-public-all.c to test each PSL file
entry against the builtin data.
With my current knowledge, I would choose libicu for libpsl Debian packages.
### some measurements nobody asked for ###
Some performance measurements (instruction counts on a 3.1GHz i3 Sandy Bridge).
Here psl uses the built-in PSL data for lookup, but calls
psl_str_to_utf8lower() once for 'Übel.com':
$ LD_LIBRARY_PATH=/usr/oms/src/libpsl/src/.libs valgrind --tool=callgrind tools/.libs/psl Übel.com
runtime with libicu:  1,992,663
runtime with libidn2:   385,056
runtime with libidn:    411,543
The libpsl code itself (psl_is_public) takes <1% of the instructions,
psl_str_to_utf8lower() is not even shown by kcachegrind.
setlocale() takes 85k, the remaining cycles are due to library loading.
Another test for IDNA library performance - instead of using the built-in
data, we load the current PSL file:
$ LD_LIBRARY_PATH=/usr/oms/src/libpsl/src/.libs valgrind --tool=callgrind tools/.libs/psl --load-psl-file data/effective_tld_names.dat Übel.com
libicu: 13,594,978 (just 895k instructions for 292x punycode conversions*)
idn2:   29,222,281 (17,080k instructions for 292x punycode conversions*)
idn:    38,991,549 (28,001k instructions for 292x punycode conversions*)
(*in the current PSL data there are 292 non-ASCII domain names)
Conclusion: libicu adds ~1,900k instructions for library loading but does
punycode conversion 31x faster than libidn and still 19x faster than libidn2.
So if speed and/or functionality matters, libicu seems unbeaten.
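For anyone who wants to check the 31x and 19x figures, this small Python
sketch recomputes them from the instruction counts reported for the 292
punycode conversions (895k, 17,080k and 28,001k). The dictionary and function
names are only illustrative.

```python
# Instruction counts (in thousands) spent on the 292 punycode conversions,
# copied from the callgrind measurements in this mail.
conversion_kinstr = {"libicu": 895, "libidn2": 17_080, "libidn": 28_001}


def slowdown_vs(baseline: str, other: str) -> float:
    """Return how many times more instructions `other` needs than `baseline`."""
    return conversion_kinstr[other] / conversion_kinstr[baseline]


for lib in ("libidn", "libidn2"):
    print(f"{lib}: {slowdown_vs('libicu', lib):.1f}x the instructions of libicu")
```

The ratios come out at roughly 31.3 for libidn and 19.1 for libidn2.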
Tim