Hi Daniel,
On Friday, 11 December 2015, 22:32:03, Daniel Kahn Gillmor wrote:
> Hi libpsl folks--
>
> i had a couple recent thoughts about libpsl in debian in conversation
> with pabs on IRC:
>
> a) I think libpsl should Recommend: publicsuffix. the normal
> installation should be to see that updated file. I'll probably go
> ahead and make that change to the debian packaging shortly.
Yes, as a first step (before the next release of libpsl).
I am still waiting for Darshit's .travis.yml patch (and he is waiting for
Travis to add the libunistring package to their new CI environment). This
should happen within the next few days.
> b) i think the main reason to not use an external file each time is the
> speed of parsing. I also notice that upstream is working with a
> DAWG/DAFSA approach borrowed from Chromium's analysis of gperf data.
>
> I could have the publicsuffix package ship both a plaintext version
> and a DAWG/DAFSA-compiled byte object, which libpsl could then mmap
> and work from. Then we don't need a builtin list at all, we can
> just Depends: publicsuffix (stronger than Recommends:) to keep the
> byte object up-to-date.
Very reasonable for Debian packaging. This would also solve the 'packaging
libpsl triggered by new version of publicsuffix' issue (we talked about that
in the past).
In my experience, mmap() isn't faster than read() on current Linux when
reading a whole file into memory. Since the data isn't very big either (~32 kB
currently), read() is preferable IMO. It is also more portable than mmap().
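To illustrate the read() approach, here is a minimal sketch that slurps a
whole file into a malloc'ed buffer. The helper name read_whole_file() is
hypothetical (it is not part of libpsl), and the sketch assumes a POSIX
system and a regular file whose size fstat() can report.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of libpsl: read a whole regular file into
 * a NUL-terminated malloc'ed buffer using plain read().
 * Returns NULL on error; on success stores the byte count in *len. */
char *read_whole_file(const char *path, size_t *len)
{
    struct stat st;
    char *buf;
    size_t size, off = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;

    if (fstat(fd, &st) != 0 || st.st_size < 0) {
        close(fd);
        return NULL;
    }
    size = (size_t) st.st_size;

    buf = malloc(size + 1);
    if (!buf) {
        close(fd);
        return NULL;
    }

    /* read() may return fewer bytes than requested, so loop. */
    while (off < size) {
        ssize_t n = read(fd, buf + off, size - off);

        if (n < 0) { /* EINTR handling omitted for brevity */
            free(buf);
            close(fd);
            return NULL;
        }
        if (n == 0)
            break; /* file is shorter than fstat() reported */
        off += (size_t) n;
    }

    close(fd);
    buf[off] = '\0';
    *len = off;
    return buf;
}
```

The caller owns the returned buffer and has to free() it. With ~32 kB of
data, a single read() call will usually fetch everything in one go.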
We'll need another function like psl_load_dafsa() or psl_load_file2() with an
additional file type argument. Or do you think we could auto-detect the type
of file? (The PSL text file begins with // ..., but I am not sure whether the
DAFSA data could also happen to start with such a string.)
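One way auto-detection could work, sketched under assumptions: the compiled
file gets a fixed magic prefix (the ".DAFSA" string below is purely
hypothetical, nothing is defined yet), and anything starting with // is
treated as PSL text. The enum and function names are made up for
illustration and are not libpsl API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration only. */
enum list_format { LIST_UNKNOWN, LIST_TEXT, LIST_DAFSA };

/* Assumed magic prefix written by the DAFSA generator (not defined yet). */
#define DAFSA_MAGIC ".DAFSA"

/* Guess the format of a loaded file from its first bytes. */
enum list_format guess_list_format(const char *buf, size_t len)
{
    const size_t magic_len = sizeof(DAFSA_MAGIC) - 1;

    /* Check the magic first, so the result does not depend on what the
     * raw DAFSA bytes happen to look like. */
    if (len >= magic_len && memcmp(buf, DAFSA_MAGIC, magic_len) == 0)
        return LIST_DAFSA;

    /* The PSL text file starts with a // comment. */
    if (len >= 2 && buf[0] == '/' && buf[1] == '/')
        return LIST_TEXT;

    return LIST_UNKNOWN;
}
```

With a magic prefix, the question whether raw DAFSA bytes could begin with
// no longer matters, at the cost of a few extra bytes in the file.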
Should libpsl provide an amended make_dafsa.py to convert PSL text into DAFSA?
Right now it converts to C code (an array).
Or should make_dafsa.py go into the publicsuffix/list project?
BTW, right now psl2c filters out plain TLDs, since they are covered by the
implicit '*' rule anyway. Just a few bytes less to care about.
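As a rough sketch of that filter (not the actual psl2c code; the function
name is hypothetical): a rule counts as a plain TLD when it is non-empty and
contains no dot. Wildcard and exception rules in the list always contain a
dot, so they are kept.

```c
#include <string.h>

/* Hypothetical check, not taken from psl2c: returns 1 if the rule is a
 * single label such as "com", which the implicit '*' rule already covers. */
int is_plain_tld(const char *rule)
{
    return rule[0] != '\0' && strchr(rule, '.') == NULL;
}
```

A generator could skip every rule for which this returns 1 before building
the DAFSA.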
Tim