Re: Problem with encoding of filenames - SOLVED, follow-up question

Thomas 'PointedEars' Lahn

Oct 28, 2017, 6:46:24 PM
[Will you *please* stop this amok-crossposting? Usenet is _not_ your
personal private support forum/playground. If you must crosspost, then
crosspost to the *correct* newsgroup (see charters and taglines), and
*set Followup-To*. In particular, Apache is _not_ a UNIX-*only* Web server
(RTFM).

X-Post & F’up2 <news:comp.infosystems.www.authoring.misc>]


Ivan Shmakov wrote in <news:comp.infosystems.www.authoring.html>:

[Fixed quotes; see <http://www.netmeister.org/news/learn2quote.html>]

> Hendrik Maryns […] writes:
>> The strange redirects are due to some experimenting with .htaccess,
>> I’ll have to fix that, disabled it for now.
>
> (I’ve suspected something along these lines.)

Me too.

>> Ivan Shmakov also noted that I claim html4 compliance but should move
>> to html5 if I want to use “unencoded” UTF-8 in ‘href’.

That was and is “not even wrong”. Sorry to break this to you, but you have
been listening to a *wannabe*.

<https://unicode.org/faq/>
<https://www.w3.org/TR/html/links.html#element-attrdef-a-href>

>> Clicking the button in the footer seems to indeed validate, so I wonder
>> what the exact problem is. I vaguely remember that in the past I
>> decided not to move to html5, but forgot for what reason. Maybe I will
>> for this reason.
>
> Frankly, I’m unsure if HTML4 allowed whitespace in href

It does not, and that is not hard to find out either. Just RTFSpec:

<http://www.w3.org/TR/1999/REC-html401-19991224/struct/links.html#adef-href>

> (and I’m pretty sure it didn’t allow UTF-8;

Percent-encoded characters according to RFC 3986 & children: no problem.

Unescaped non-ASCII characters: *big* problem.

> hence I suspect that failing to catch that may be due to a bug in the
> validator),

Sure, blame the Validator for your incompetence. What else is new? :->

> but at least the validator at [1] correctly reports space characters as
> (HTML5?) errors:

It is more likely that an HTML5-supporting validator will catch this error
because HTML5 is not based on a DTD that can be checked against. This
encourages validator developers to check more carefully against the
Specification *prose*.

It certainly is so in the case of the *W3C* Validator. Why are
you not using *it* instead (<https://validator.w3.org/>)? It has been
supporting HTML5 for several years now (although as an implicit switch to
the HTML5 validator – the “Nu Html Checker” at
<https://validator.w3.org/nu/> – when the HTML5 doctype is recognized or
selected).

> 3. Error: Bad value Antropozofi/Valentin Wember – Waar gaan we
> eigenlijk heen%3F.pdf for attribute href on element a: Illegal
> character in path segment: space is not allowed.

Correct. Neither are unescaped non-ASCII characters. Supportive UA
behavior to the contrary is *implementation-dependent*.
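
A minimal sketch of the percent-encoding in question, in Python. It assumes
the file is really named with a literal “?” (the quoted error message shows
it already escaped as %3F), and it uses the standard library's
urllib.parse.quote, which encodes non-ASCII characters as UTF-8 octets by
default. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# Assumed raw file name on disk (hypothetical reconstruction: the "?" is
# taken from the "%3F" seen in the validator's error message).
raw_path = "Antropozofi/Valentin Wember – Waar gaan we eigenlijk heen?.pdf"

# Percent-encode every character that may not appear literally in a path
# segment. safe="/" keeps the path separator; non-ASCII is encoded as UTF-8.
href = quote(raw_path, safe="/")
print(href)

# Decoding the escaped form gives back the original name.
assert unquote(href) == raw_path
```

The resulting value contains only ASCII characters, so it no longer depends
on how a particular UA treats spaces or unescaped non-ASCII in “href”.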

> [2] http://httpd.apache.org/docs/2.4/mod/core.html#errordocument
>
> However, the problem is not in the “document,” but rather in the
> Content-Type: header, which is:
>
> Content-Type: text/html; charset=iso-8859-1
>
> At the same time, Apache includes the (supposed) filename in the
> response “as is”: in UTF-8.
>
> Curiously, adding ‘AddDefaultCharset utf-8’ [3] to my .htaccess
> didn’t seem to have any effect on the 404 response header,

[3] says

| AllowOverride: FileInfo

On the other hand, if the error message files are UTF-8 encoded – and

| $ file -i /usr/share/apache2/error/HTTP_NOT_FOUND.html.var
| /usr/share/apache2/error/HTTP_NOT_FOUND.html.var: text/html; charset=utf-8
|
| $ dpkg -S /usr/share/apache2/error/HTTP_NOT_FOUND.html.var
| apache2-data: /usr/share/apache2/error/HTTP_NOT_FOUND.html.var
|
| $ dpkg -l apache2-data | awk '/^.i/ {print $3}'
| 2.4.23-4

suggests just that – and “AddDefaultCharset” is stupidly set to “On” (the
previous default) or “iso-8859-1”, and that setting is in effect on the OP’s
server, then it would be no surprise that the error messages are garbled.

> so I’m interested in how it can be fixed, too.

AddDefaultCharset off

or (with Apache 2.4.x+)

# AddDefaultCharset on

(disabling it, therefore falling back to the default, which should be “off”)
in the httpd.conf/apache2.conf. LART that stuck-in-the-1980s server admin
if necessary. (Unicode 1.0.0 was published in 1991.)

> Cross-posting to
> news:comp.infosystems.www.servers.unix, as the question is
> specific to server software, not HTML.)

,-------------.
: ↑ Go to top :
`-------------'

> [3] http://httpd.apache.org/docs/2.4/mod/core.html#adddefaultcharset

As you can read there, “AddDefaultCharset” != “off” is a *deprecated*
approach:

,-<http://httpd.apache.org/docs/2.4/mod/core.html.en#adddefaultcharset>
|
| […]
| AddDefaultCharset should only be used when all of the text resources to
| which it applies are known to be in that character encoding and it is too
| inconvenient to label their charset individually. One such example is to
| add the charset parameter to resources containing generated content, such
| as legacy CGI scripts, that might be vulnerable to cross-site scripting
| attacks due to user-provided data being included in the output. Note,
| however, that a better solution is to just fix (or delete) those scripts,
| since setting a default charset does not protect users that have enabled
| the "auto-detect character encoding" feature on their browser.

It has been deprecated for more than 10 years:

<https://bz.apache.org/bugzilla/show_bug.cgi?id=23421>

Fun fact: Before the Apache default was changed in 2004 CE, the problem with
this default was *obvious* in the Bugzilla interface (but IIRC using a
different URI then) because the reporter of this bug (Martin Dürst) has a
name that contains a non-ASCII character which Bugzilla properly served
UTF-8-encoded, but Apache’s header field default caused HTML UAs to
interpret it as ISO-8859-1 regardless of the correct Content-Type “meta”
element (IIRC); so his name was displayed as “Martin DÃ¼rst” there for quite
some time.
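
The effect is easy to reproduce. The following Python sketch only mimics the
decoding step (UTF-8 octets read as ISO-8859-1); it does not involve Apache
or Bugzilla, and the variable names are illustrative.

```python
# The name as it should be displayed.
name = "Martin Dürst"

# What the application sends: the name encoded as UTF-8.
wire_bytes = name.encode("utf-8")

# What a UA shows if the header field declares charset=iso-8859-1:
# each UTF-8 octet is taken as one ISO-8859-1 character.
misread = wire_bytes.decode("iso-8859-1")
print(misread)

# What a UA shows if the declared encoding matches the actual one.
correct = wire_bytes.decode("utf-8")
print(correct)
```

The “ü” occupies two octets in UTF-8 (0xC3 0xBC), which is why it turns into
two characters when each octet is read as one ISO-8859-1 character.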

> (Reading [4] wasn’t enlightening so far, either.
>
> [4] http://httpd.apache.org/docs/2.4/mod/mod_mime.html

This module has nothing to do with the problem.


PointedEars
--
Sometimes, what you learn is wrong. If those wrong ideas are close to the
root of the knowledge tree you build on a particular subject, pruning the
bad branches can sometimes cause the whole tree to collapse.
-- Mike Duffy in cljs, <news:Xns9FB6521286...@94.75.214.39>