Bug#1019980: lintian: source-is-missing check for HTML is much too sensitive

Colin Watson

Sep 17, 2022, 7:20:03 PM

Package: lintian
Version: 2.115.3
Severity: normal

Lintian issues these errors for putty 0.77-1:

E: putty source: source-is-missing [doc/html/AppendixA.html]
E: putty source: source-is-missing [doc/html/AppendixB.html]
E: putty source: source-is-missing [doc/html/AppendixE.html]
E: putty source: source-is-missing [doc/html/Chapter10.html]
E: putty source: source-is-missing [doc/html/Chapter2.html]
E: putty source: source-is-missing [doc/html/Chapter3.html]
E: putty source: source-is-missing [doc/html/Chapter4.html]
E: putty source: source-is-missing [doc/html/Chapter5.html]
E: putty source: source-is-missing [doc/html/Chapter7.html]
E: putty source: source-is-missing [doc/html/Chapter8.html]
E: putty source: source-is-missing [doc/html/Chapter9.html]
E: putty source: source-is-missing [doc/html/IndexPage.html]

This is pretty oversensitive. Firstly, it's HTML, which is still often
enough written by hand anyway. As it happens, these particular HTML
files are generated from halibut input that's also provided in the
source package, though I can't see how Lintian could possibly expect to
know that.

I tried to work out whether I should be overriding this or whether it's
a bug in Lintian, and I think it's the latter. The current relevant
code is this in lib/Lintian/Check/Files/SourceMissing.pm:

    sub visit_patched_files {
        my ($self, $item) = @_;

        return
          unless $item->is_file;
    [...]
        return
          if !defined $longest || $line_length{$longest} <= $VERY_LONG_LINE_LENGTH;
    [...]

        if ($item->basename =~ /\.(?:x?html?\d?|xht)$/i) {

            # html file
            $self->pointed_hint('source-is-missing', $item->pointer)
              unless $self->find_source($item, {'.fragment.js' => $DOLLAR});
        }

        return;
    }

So it issues a diagnostic for every HTML file with a somewhat long line
(over 512 characters) unless it has an associated .fragment.js somewhere
(I think - the find_source sub is undocumented and a bit obscure to me)?
That doesn't sound right - surely that would catch far too many false
positives.
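
To make the behaviour described above concrete, here is a rough sketch in Python (not Lintian's Perl, and not its actual code). It models only the three conditions quoted from SourceMissing.pm and ignores the elided parts. The function name and the has_fragment_js flag are hypothetical; the flag is merely a stand-in for whatever find_source really does.

```python
import re

# Threshold mentioned in the report (lines "over 512 characters").
VERY_LONG_LINE_LENGTH = 512

# Same file-name pattern as in the quoted Perl, case-insensitive.
HTML_NAME = re.compile(r"\.(?:x?html?\d?|xht)$", re.IGNORECASE)


def flags_source_is_missing(basename, text, has_fragment_js=False):
    """Approximate the check as described in the report.

    has_fragment_js is a hypothetical stand-in for Lintian's
    find_source() lookup, whose exact behaviour is not documented here.
    """
    longest = max((len(line) for line in text.splitlines()), default=0)
    if longest <= VERY_LONG_LINE_LENGTH:
        return False  # no very long line: nothing is reported
    if not HTML_NAME.search(basename):
        return False  # only the HTML branch is modelled here
    return not has_fragment_js


# An ordinary paragraph kept on one physical line is enough to trigger it.
prose = "<p>" + "word " * 200 + "</p>"
print(flags_source_is_missing("Chapter4.html", prose))
```

Under these assumptions, any HTML file containing one long unwrapped paragraph is reported, whether or not it contains any script at all.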

Next, I went looking through git history to try to figure out where this
was introduced. I found this commit:

https://salsa.debian.org/lintian/lintian/-/commit/4f24ab7fca

The commit message makes it sound as though it was probably just
refactoring, but it wasn't. The corresponding bit of code there was
previously in a warn_prebuilt_javascript sub called from a
warn_long_lines sub, which in turn was called in two places: once for
certain kinds of .js files, and once from this sub:

    # check javascript in html file
    sub check_html_cruft {
        my ($self, $item, $lowercase) = @_;

        my $blockscript = $lowercase;
        my $indexscript;

        while (($indexscript = index($blockscript, '<script')) > $ITEM_NOT_FOUND) {

            $blockscript = substr($blockscript,$indexscript);

            # sourced script ok
            if ($blockscript =~ m{\A<script\s+[^>]*?src="[^"]+?"[^>]*?>}sm) {

                $blockscript = substr($blockscript,$+[0]);
                next;
            }

            # extract script
            if ($blockscript =~ m{<script[^>]*?>(.*?)</script>}sm) {

                $blockscript = substr($blockscript,$+[0]);

                my $lcscript = $1;
                $self->check_js_script($item, $lcscript);

                return 0
                  if $self->warn_long_lines($item, $lcscript);

                next;
            }

            # here we know that we have partial script. Do the check nevertheless
            # first check if we have the full <script> tag and do the check
            # if we get <script src=" "
            # then skip
            if ($blockscript =~ /\A<script[^>]*?>/sm) {

                $blockscript = substr($blockscript,$+[0]);
                $self->check_js_script($item, $blockscript);
            }

            return 0;
        }

        return 1;
    }

This made much more sense! I could get on board with issuing a
diagnostic for <script> tags in HTML files that look like minified
JavaScript, and that appears to be what this check was originally meant
to do. Unfortunately, it looks like that extra logic was dropped in
this "Further rationalize cruft check; separate concerns" commit, and
now we have a very much broader check on HTML files with no indication
that this change was intentional. Something like this <script> check is
still present in Lintian, but in a different context, and it's no longer
used for the source-is-missing check.

I suggest restoring something like this code to check for <script> tags
around the source-is-missing check for HTML files. I suspect that this
might also deal with reports such as #1017094 and #1017966, though I've
filed this separately as I'm not sure of that.
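
As a rough illustration of the suggestion above, the following Python sketch applies the long-line test only to the bodies of inline <script> elements and skips scripts loaded via src=. It is not the old Lintian code and not a proposed patch; the function name and regular expressions are assumptions made for this example, and the threshold is the 512 characters mentioned earlier.

```python
import re

VERY_LONG_LINE_LENGTH = 512  # threshold mentioned earlier in the report

# Complete <script ...>...</script> elements; attributes and body captured.
SCRIPT = re.compile(r"<script\b([^>]*)>(.*?)</script>",
                    re.IGNORECASE | re.DOTALL)
HAS_SRC = re.compile(r"\bsrc\s*=", re.IGNORECASE)


def inline_script_has_long_line(html):
    """Return True if any inline <script> body contains a very long line.

    Hypothetical helper: long lines outside <script> elements are ignored,
    and scripts that only reference an external file are skipped.
    """
    for match in SCRIPT.finditer(html):
        attrs, body = match.group(1), match.group(2)
        if HAS_SRC.search(attrs):
            continue  # sourced script: nothing inline to inspect
        if any(len(line) > VERY_LONG_LINE_LENGTH
               for line in body.splitlines()):
            return True
    return False


long_prose = "<p>" + "word " * 200 + "</p>"
minified = "<script>" + "a=1;" * 200 + "</script>"
print(inline_script_has_long_line(long_prose))
print(inline_script_has_long_line(minified))
```

With a guard like this in front of the source-is-missing hint, a long paragraph of prose would no longer be enough to trigger the tag, while a long inline script still would.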

Thanks,

--
Colin Watson (he/him) [cjwa...@debian.org]

Paul Wise

Sep 17, 2022, 8:20:04 PM

On Sun, 18 Sep 2022 00:14:07 +0100 Colin Watson wrote:

> This is pretty oversensitive.  Firstly, it's HTML, which is still often
> enough written by hand anyway.  As it happens, these particular HTML
> files are generated from halibut input that's also provided in the
> source package, though I can't see how Lintian could possibly expect to
> know that.

I am not a lintian maintainer, but:

HTML is very often generated and there are many different ways to
generate it. I think the right thing for lintian to do here is to know
about more of the source formats and when there is generated HTML in
the tarball but source is also present, then emit a new lower severity
generated-files tag instead of the existing source-is-missing tag.

I think the right thing for putty here is for upstream to remove the
HTML from their VCS and tarballs, then add the generation process to
their build system and continuous integration, so that they always know
when there are problems with generating the HTML. If they refuse then
you could exclude the HTML from Debian's copy of the upstream tarball.

Until either lintian changes or the putty HTML gets removed, overriding
the lintian warning in putty seems the correct thing to do.

PS: I note that manual pages are similar to HTML in this regard and I
think the same reasoning above applies to the putty manual pages and to
lintian's treatment of manual pages in source packages.

> I suggest restoring something like this code to check for <script>
> tags around the source-is-missing check for HTML files.

If that is done, I think lintian should add more heuristics to detect
other generated HTML. The halibut generated HTML doesn't make that easy
but there are some signals that can be added I think, like this:

halibut-1.3/bk_html.c: html_raw(&ho, "<!-- version IDs:\n");

--
bye,
pabs

https://wiki.debian.org/PaulWise

Colin Watson

Sep 18, 2022, 8:10:03 AM

On Sun, Sep 18, 2022 at 08:11:03AM +0800, Paul Wise wrote:
> I think the right thing for putty here is for upstream to remove the
> HTML from their VCS and tarballs, then add the generation process to
> their build system and continuous integration, so that they always know
> when there are problems with generating the HTML.

The HTML files have never been in PuTTY upstream's VCS. They are
generated automatically as part of PuTTY's build system for release
tarballs, as a convenience to people who want to build PuTTY without
Halibut, since it's a somewhat niche documentation tool. Since I agree
with upstream that this is a reasonable convenience, I'm not going to
ask them to stop doing it.

> If they refuse then you could exclude the HTML from Debian's copy of
> the upstream tarball.

We're not talking about opaque object code here. This is perfectly
readable plain HTML that just happens to be generated from another
perfectly readable text format. It's not the preferred form of
modification, sure (I wouldn't edit it directly since I have the Halibut
input files available, but if nobody told me that those existed then I'd
happily edit the HTML without even noticing), but this package isn't
covered by the GPL so that's not very relevant.

I'm not going to waste a second on editing Debian's copy of the upstream
tarball for this complete non-issue. I already take care to ensure that
the package rebuilds the documentation from source, and there's no DFSG
issue with the pre-generated files being present so there's no reason to
remove them from the tarball. The only reason that the presence of
pre-generated files is even coming up is because Lintian's heuristics
are misfiring in a way that seems clearly incorrect and probably
unintentional.

> Until either lintian changes or the putty HTML gets removed, overriding
> the lintian warning in putty seems the correct thing to do.

Done.

> If that is done, I think lintian should add more heuristics to detect
> other generated HTML. The halibut generated HTML doesn't make that easy
> but there are some signals that can be added I think, like this:
>
> halibut-1.3/bk_html.c: html_raw(&ho, "<!-- version IDs:\n");

Firstly, that's conditional and not present in the generated PuTTY
documentation. (I've sent a patch upstream to add a suitable <meta
name="generator"> tag, since that seems like a reasonable thing to
include in any event.)

Secondly, with respect, this is a distraction from the point of this
bug. Feel free to file a separate bug for more detailed heuristics, but
Lintian should start by making its current heuristics not entirely wrong
(the presence of a .fragment.js file obviously has nothing to do with
whether general HTML files are generated, only ones that have certain
kinds of <script> tags). That's what I'm requesting here.
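
For reference, a <meta name="generator"> tag of the kind mentioned above is easy to pick up mechanically. The sketch below uses Python's standard html.parser module; the class and function names are made up for this example, the sample generator string is invented, and nothing here reflects how Lintian actually recognises generated HTML.

```python
from html.parser import HTMLParser


class GeneratorFinder(HTMLParser):
    """Record the content of the first <meta name="generator"> tag seen."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if tag == "meta" and name == "generator" and self.generator is None:
            self.generator = attributes.get("content")


def find_generator(html):
    """Hypothetical helper: return the generator string, or None."""
    finder = GeneratorFinder()
    finder.feed(html)
    finder.close()
    return finder.generator


# "ExampleGen 1.0" is an invented value used only for illustration.
sample = '<html><head><meta name="generator" content="ExampleGen 1.0"></head></html>'
print(find_generator(sample))
```

A signal like this only says that the file was generated; it says nothing about where the source lives.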

Bill Allombert

Oct 14, 2023, 2:30:04 PM

On Sun, Sep 18, 2022 at 12:14:07AM +0100, Colin Watson wrote:
> Package: lintian
> Version: 2.115.3
> Severity: normal
>
> Lintian issues these errors for putty 0.77-1:
>
> E: putty source: source-is-missing [doc/html/AppendixA.html]
> E: putty source: source-is-missing [doc/html/AppendixB.html]
> E: putty source: source-is-missing [doc/html/AppendixE.html]
> E: putty source: source-is-missing [doc/html/Chapter10.html]
> E: putty source: source-is-missing [doc/html/Chapter2.html]
> E: putty source: source-is-missing [doc/html/Chapter3.html]
> E: putty source: source-is-missing [doc/html/Chapter4.html]
> E: putty source: source-is-missing [doc/html/Chapter5.html]
> E: putty source: source-is-missing [doc/html/Chapter7.html]
> E: putty source: source-is-missing [doc/html/Chapter8.html]
> E: putty source: source-is-missing [doc/html/Chapter9.html]
> E: putty source: source-is-missing [doc/html/IndexPage.html]
>
> This is pretty oversensitive. Firstly, it's HTML, which is still often
> enough written by hand anyway. As it happens, these particular HTML
> files are generated from halibut input that's also provided in the
> source package, though I can't see how Lintian could possibly expect to
> know that.

Dear Lintian maintainers,

This test is causing hundreds of false positives and should be disabled as
soon as possible. This is a huge waste of time for everybody.

If you need help with that, please tell me, I have worked on lintian in the past.

Cheers,
--
Bill. <ball...@debian.org>

Imagine a large red swirl here.

Santiago Ruano Rincón

Feb 8, 2024, 1:40:05 PM

Dear Lintian maintainers,

I cannot offer the same help as ballombe, but I also find it would help
to disable these errors. At least, could they be "demoted" to warnings?

Thanks in advance,

Santiago

Bastien Roucariès

Feb 8, 2024, 1:50:05 PM

On Thursday, 8 February 2024, 18:31:28 UTC, Santiago Ruano Rincón wrote:
> On Sat, 14 Oct 2023 20:23:18 +0200 Bill Allombert <ball...@debian.org> wrote:
> > On Sun, Sep 18, 2022 at 12:14:07AM +0100, Colin Watson wrote:
> > > Package: lintian
> > > Version: 2.115.3
> > > Severity: normal
> > >
> > > Lintian issues these errors for putty 0.77-1:
> > >
> > > E: putty source: source-is-missing [doc/html/AppendixA.html]
> > > E: putty source: source-is-missing [doc/html/AppendixB.html]
> > > E: putty source: source-is-missing [doc/html/AppendixE.html]
> > > E: putty source: source-is-missing [doc/html/Chapter10.html]
> > > E: putty source: source-is-missing [doc/html/Chapter2.html]
> > > E: putty source: source-is-missing [doc/html/Chapter3.html]
> > > E: putty source: source-is-missing [doc/html/Chapter4.html]
> > > E: putty source: source-is-missing [doc/html/Chapter5.html]
> > > E: putty source: source-is-missing [doc/html/Chapter7.html]
> > > E: putty source: source-is-missing [doc/html/Chapter8.html]
> > > E: putty source: source-is-missing [doc/html/Chapter9.html]
> > > E: putty source: source-is-missing [doc/html/IndexPage.html]
> > >
> > > This is pretty oversensitive. Firstly, it's HTML, which is still often
> > > enough written by hand anyway. As it happens, these particular HTML
> > > files are generated from halibut input that's also provided in the
> > > source package, though I can't see how Lintian could possibly expect to
> > > know that.

Are you sure it is not an embedded base64-encoded PNG or minified JavaScript?

If not, we could try to find out why it chokes.

In this particular case, it is the check on the source package that chokes. If halibut included the name of the source
in the HTML, we could automatically suppress the source-is-missing warnings.

Another alternative: if we could determine that the file was compiled by halibut, we could demote this to a pedantic warning
and ask for a repack, to be sure that the documentation is recompiled from source.

Thanks

Bill Allombert

Feb 8, 2024, 3:10:06 PM

There are far too many different HTML generators out there to handle.
You would need to define a standard way to indicate the path to the source in
the generated file.
But some generator authors might consider this an unacceptable data leak, so
this would only be done if some environment variable is defined.

In the short term, I suggest disabling it, since there is no policy requirement
for the source code to be in a particular path, so it is not an error.

At the very least, it should not be generated more than once per package.

Bastien Roucariès

Feb 8, 2024, 3:40:06 PM

We have done this for doxygen and sphinx, so why not for more?
> You would need to define a standard way to indicate the path to the source in
> the generated file.
> But some generator authors might consider this an unacceptable data leak, so
> this would only be done if some environment variable is defined.
For doxygen and sphinx, we only detect a particular string in the HTML file and whitelist it.

A "Generated by <something>" string would work.

Moreover, a missing-source override can be added by manually creating a symlink under debian/missing-sources/, named with the file's full name and pointing to the right location.

We also search for known sources automatically, using some heuristics in SourceMissing.pm.

So the basic framework is there; we only need to add more rules.

Bastien

Bill Allombert

Feb 8, 2024, 4:40:06 PM

On Thu, Feb 08, 2024 at 08:27:40PM +0000, Bastien Roucariès wrote:
> > > > > > source package, though I can't see how Lintian could possibly expect to
> > > > > > know that.
> > >
> > > Are you sure it is not an embedded base64-encoded PNG or minified JavaScript?
> > >
> > > If not, we could try to find out why it chokes.
> > >
> > > In this particular case, it is the check on the source package that chokes. If halibut included the name of the source
> > > in the HTML, we could automatically suppress the source-is-missing warnings.
> > >
> > > Another alternative: if we could determine that the file was compiled by halibut, we could demote this to a pedantic warning
> > > and ask for a repack, to be sure that the documentation is recompiled from source.
> >
> > There are far too many different HTML generators out there to handle.
>
> We have done this for doxygen and sphinx, so why not for more?

This is two out of how many?

For example, my packages use TtH, GAPDoc, hevea, pod2html.

I do not think it is sustainable.

Colin Watson

Feb 9, 2024, 4:30:06 AM

On Thu, Feb 08, 2024 at 06:39:18PM +0000, Bastien Roucariès wrote:
> Are you sure it is not an embedded base64-encoded PNG or minified JavaScript?

Yes, I'm absolutely certain.

> If not, we could try to find out why it chokes.

I already gave a full explanation of this in my first message, which for
some reason people are ignoring:

"""
So it issues a diagnostic for every HTML file with a somewhat long line
(over 512 characters) unless it has an associated .fragment.js somewhere
"""

The HTML files it's issuing a diagnostic on here are perfectly innocuous
and readable. Here's an example of one of the "offending" lines:

In version 0.51 and before, local echo could not be separated from local line editing (where you type a line of text locally, and it is not sent to the server until you press Return, so you have the chance to edit it and correct mistakes <em>before</em> the server sees it). New in version 0.52, local echo and local line editing are separate options, and by default PuTTY will try to determine automatically whether to enable them or not, based on which protocol you have selected and also based on hints from the server. If you have a problem with PuTTY's default choice, you can force each option to be enabled or disabled as you choose. The controls are in the Terminal panel, in the section marked &#8216;Line discipline options&#8217;.

I mean, come on. Sure, there are a couple of character entities (which
have nothing to do with the diagnostic here anyway), but otherwise you
can't tell me with a straight face that that's some kind of obscure
compiled format; I would have written it exactly the same way by hand
except for the word-wrapping.

> Another alternative: if we could determine that the file was compiled by halibut, we could demote this to a pedantic warning
> and ask for a repack, to be sure that the documentation is recompiled from source.

Or we could fix the ridiculously-oversensitive diagnostic.

On the matter of repacking (which I will not do in this case), please
see my comment in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1019980#15.