Re: Problem with encoding of filenames

Ivan Shmakov

Oct 25, 2017, 11:24:42 PM
>>>>> James Moe <jimoe...@sohnen-moe.com> writes:
>>>>> On 10/25/2017 06:16 AM, Hendrik Maryns wrote:

>> Since the switch to a new hoster, the files on
>> http://hendrikmaryns.name/antro.shtml are no longer downloadable,
>> due to garbling of utf8 filenames. How to solve this?

That depends on what kind of access you have to your public
directory on the server. A solution that I expect to work for a
variety of cases would be to remove all the files with mangled
filenames and reupload them under proper ones.

If you have SSH (command-line) access, then, depending on the
tools available to you, you may be able to, say, run a Perl
script to rename them on the server.
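
As a rough illustration of the rename-on-the-server idea, here is a minimal sketch in Python (rather than the Perl mentioned above). It assumes, and this is only a guess about what the new hoster did, that each name was mangled by reading its UTF-8 bytes as Latin-1. The names unmangle and rename_mangled are made up for this example, and the default is a dry run that only reports what would change.

```python
import os

def unmangle(name, wrong_codec="latin-1"):
    """Undo one round of 'UTF-8 bytes read as wrong_codec' mangling.

    The name is returned unchanged if it does not fit that pattern.
    """
    try:
        return name.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name

def rename_mangled(directory, dry_run=True):
    """List (old, new) pairs for mangled names; rename them unless dry_run."""
    changes = []
    for old in sorted(os.listdir(directory)):
        new = unmangle(old)
        if new == old:
            continue
        target = os.path.join(directory, new)
        if os.path.exists(target):
            continue  # never overwrite a file that is already there
        changes.append((old, new))
        if not dry_run:
            os.rename(os.path.join(directory, old), target)
    return changes
```

If the names were instead misread as Windows-1252, passing wrong_codec="cp1252" to unmangle may be the better guess; it is worth checking the dry-run output before renaming anything.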

>> P.S. I just realize this is probably not the right newsgroup for
>> this. Please refer me to the proper place.

I’m cross-posting to news:comp.infosystems.www.misc just in case.

As for HTML, the page seems to claim HTML4 compliance, but using
“unencoded” UTF-8 in ‘href’ is something that is only allowed in
HTML5. Moreover, even there, spaces need to be encoded as %20,
unless I am mistaken. Cf.:

<a href="https://ru.wikipedia.org/wiki/%D0%9E%D0%BC%D0%BE%D0%BD_%D0%A0%D0%B0"
>(strict HTML4)</a>
<a href="https://ru.wikipedia.org/wiki/Омон_Ра" >(allowed in HTML5)</a>

(Although the browsers seem to be rather forgiving in this regard.)
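
For anyone generating such links from file names, the encoded form can be produced mechanically. A small sketch using Python's standard urllib.parse (the helper name encode_path and the second example path are only for illustration):

```python
from urllib.parse import quote, unquote

def encode_path(path):
    """Percent-encode a URL path as UTF-8, leaving "/" as the separator."""
    return quote(path, safe="/")

print(encode_path("/wiki/Омон_Ра"))
# /wiki/%D0%9E%D0%BC%D0%BE%D0%BD_%D0%A0%D0%B0

# Spaces become %20 and a literal "?" in a file name becomes %3F.
print(encode_path("/Antroposofie/Waar gaan we heen?.pdf"))
# /Antroposofie/Waar%20gaan%20we%20heen%3F.pdf

# Decoding reverses the process.
print(unquote("/wiki/%D0%9E%D0%BC%D0%BE%D0%BD_%D0%A0%D0%B0"))
# /wiki/Омон_Ра
```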

> It does not appear to be a UTF-8 issue. This is how one of the URLs
> writes:

> http://hendrikmaryns.name/Antroposofie/Valentin%20Wember%20%E2%80%93%20Waar%20gaan%20we%20eigenlijk%20heen?.pdf

> In plain text: Antroposofie/Valentin Wember – Waar gaan we eigenlijk
> heen?.pdf

> Note the “?” at the end. I doubt that is what is supposed to be
> printed; it is a replacement character to some other value.

I’m unsure of what you mean by “replacement character” here, but
indeed, ‘?’ in a URI signifies the start of a ‘query’ portion,
so it has to be encoded as %3F. Cf.:

https://en.wikipedia.org/wiki/Main_page?
https://en.wikipedia.org/wiki/Main_page?action=history
https://en.wikipedia.org/wiki/Main_page%3F
https://en.wikipedia.org/wiki/Main_page%3Faction=history

(Then, it appears that the Wikimedia servers are slightly
misconfigured in that respect. Admittedly, this behavior may be
rather tricky to get right.)
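
The effect of the unencoded ‘?’ can be seen by splitting the URI from the page with a generic parser. A sketch with Python's urllib.parse (path_and_query is just an illustrative helper, and a server may of course treat the parts differently from this parser):

```python
from urllib.parse import urlsplit, unquote

def path_and_query(url):
    """Return the (still percent-encoded) path and the query of a URL."""
    parts = urlsplit(url)
    return parts.path, parts.query

base = "http://hendrikmaryns.name/Antroposofie/"
name = "Valentin%20Wember%20%E2%80%93%20Waar%20gaan%20we%20eigenlijk%20heen"

# Literal "?": the path stops at "heen" and ".pdf" becomes the query.
print(path_and_query(base + name + "?.pdf"))

# Encoded "%3F": the whole file name stays in the path, the query is empty.
path, query = path_and_query(base + name + "%3F.pdf")
print(unquote(path), repr(query))
```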

That said, replacing ? with %3F in the URI above results in a
surprising 301 “permanent” redirect:

HTTP/1.1 301 Moved Permanently
Date: Thu, 26 Oct 2017 02:53:33 GMT
Server: Apache/2
Location: http://hendrikmaryns.name/Antroposofie/Valentin%20Wember%20%e2%80%93%20Waar%20gaan%20we%20eigenlijk%20heen.shtml?.pdf
Content-Length: 325
Keep-Alive: timeout=2, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

Yet it still doesn’t explain why some other URIs may be
inaccessible; say:

http://hendrikmaryns.name/Antroposofie/Spirituele%20opgaven%20Belgi%C3%AB%20%E2%80%93%20Johan%20Steverlinck.pdf

HTTP/1.1 404 Not Found
Date: Thu, 26 Oct 2017 02:57:57 GMT
Server: Apache/2
Content-Length: 382
Keep-Alive: timeout=2, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

--
FSF associate member #7257 np. Unforgettable — Illya Leonov

James Moe

Oct 26, 2017, 7:43:41 PM
On 10/25/2017 08:24 PM, Ivan Shmakov wrote:
>
>> Note the “?” at the end. I doubt that is what is supposed to be
>> printed; it is a replacement character to some other value.
>
> I’m unsure of what you mean by “replacement character” here, but
> indeed, ‘?’ in a URI signifies the start of a ‘query’ portion,
> so it has to be encoded as %3F. Cf.:
>
The "?" replaces some other non-displayable character.

--
James Moe
jmm-list at sohnen-moe dot com
Think.

James Moe

Oct 26, 2017, 7:56:05 PM
On 10/25/2017 08:24 PM, Ivan Shmakov wrote:
>
> Yet it still doesn’t explain why some other URIs may be
> inaccessible; say:
>
> http://hendrikmaryns.name/Antroposofie/Spirituele%20opgaven%20Belgi%C3%AB%20%E2%80%93%20Johan%20Steverlinck.pdf
>
A couple of possibilities:
- The web server does not understand UTF-8. It decodes, say, "%C3%AB"
(the e umlaut) to binary characters, tests the string for valid ASCII
characters, and rejects the UTF-8 values.
- Or the underlying filesystem does not accept UTF-8 values.
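
For reference, the two escape sequences in that URL can be decoded with Python's urllib.parse. The sketch below also shows what a component that wrongly assumed Latin-1 would see; whether the server or filesystem in question does anything like that is only an assumption here.

```python
from urllib.parse import unquote

# %E2%80%93 is the en dash (U+2013); %C3%AB is the e with diaeresis (U+00EB).
print(unquote("%E2%80%93"))  # –
print(unquote("%C3%AB"))     # ë

# Decoded as UTF-8, part of the file name reads as intended.
as_utf8 = unquote("Belgi%C3%AB%20%E2%80%93%20Johan")
print(as_utf8)  # België – Johan

# Decoded as Latin-1, each UTF-8 byte turns into a separate character.
as_latin1 = unquote("Belgi%C3%AB", encoding="latin-1")
print(as_latin1)  # BelgiÃ«
```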

Thomas 'PointedEars' Lahn

Oct 26, 2017, 8:11:03 PM
Both are *very* unlikely.


PointedEars
--
realism: HTML 4.01 Strict
evangelism: XHTML 1.0 Strict
madness: XHTML 1.1 as application/xhtml+xml
-- Bjoern Hoehrmann

Thomas 'PointedEars' Lahn

Oct 26, 2017, 8:14:45 PM
James Moe wrote:

> On 10/25/2017 08:24 PM, Ivan Shmakov wrote:
>>> Note the “?” at the end. I doubt that is what is supposed to be
>>> printed; it is a replacement character to some other value.
>>
>> I’m unsure of what you mean by “replacement character” here, but
>> indeed, ‘?’ in a URI signifies the start of a ‘query’ portion,
>> so it has to be encoded as %3F. Cf.:
>>
> The "?" replaces some other non-displayable character.

No, it does not. The character you mean is “�”, which looks similar,
but has a different Unicode codepoint (U+FFFD, not U+003F).
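
The two characters, and the two situations in which a substitute shows up, can be told apart with a short Python sketch. The byte string and the en dash below are only sample inputs, not taken from the server.

```python
import unicodedata

QUESTION_MARK = "\u003f"  # the ordinary "?"
REPLACEMENT = "\ufffd"    # what decoders substitute for undecodable bytes

print(unicodedata.name(QUESTION_MARK))  # QUESTION MARK
print(unicodedata.name(REPLACEMENT))    # REPLACEMENT CHARACTER

# Decoding: a Latin-1 byte misread as UTF-8 yields U+FFFD, not "?".
decoded = b"Belgi\xeb".decode("utf-8", errors="replace")
print(decoded == "Belgi" + REPLACEMENT)  # True

# Encoding: forcing an en dash into ASCII yields a literal "?".
encoded = "\u2013".encode("ascii", errors="replace")
print(encoded)  # b'?'
```

So a literal U+003F can stand in for a lost character when text has been converted down to a charset such as ASCII, whereas U+FFFD marks bytes that could not be decoded.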

Please do not crosspost without Followup-To. F’up2 ciw.misc set.


PointedEars
--
When all you know is jQuery, every problem looks $(olvable)

James Moe

Oct 27, 2017, 3:26:37 PM
On 10/26/2017 05:14 PM, Thomas 'PointedEars' Lahn wrote:
>
>> The "?" replaces some other non-displayable character.
> No, it does not. The character you mean is “�”, which looks similar,
> but has a different Unicode codepoint (U+FFFD, not U+003F).
>
It does in ASCII text displays.

> Please do not crosspost without Followup-To. F’up2 ciw.misc set.
>
I am not crossposting. AFAIK this is the original topic. I am not
subscribed to ciw.misc.