URL encoding slash character ('/') and Apache web servers

Julien Lerouge

Sep 18, 2015, 10:27:32 AM
to IIIF Discuss
Hi,

I'm facing a problem with IIIF Image API 1.1 (it should be the same with API 2.0 though).

I'm using:
- digilib (http://digilib.sourceforge.net) on server side
- OpenSeaDragon (https://openseadragon.github.io/) on client side
- Apache 2 and Apache Tomcat 7 web servers to host my web application

The documentation (http://iiif.io/api/image/1.1/#url-encoding-and-decoding) says that the identifier part must be URL-encoded. I have images in subdirectories of the base directory that digilib searches, so in the identifier part the file separator ('/') is URL-encoded as '%2F'.
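
As a minimal illustration of the encoding described above (the identifier, server name and prefix are made up for the example, not taken from any real deployment):

```python
from urllib.parse import quote, unquote

# Hypothetical identifier: an image stored two directories below the base directory.
identifier = "manuscripts/ms123/page001.tif"

# safe="" makes quote() encode '/' as well; by default quote() leaves '/' untouched.
encoded = quote(identifier, safe="")
print(encoded)  # manuscripts%2Fms123%2Fpage001.tif

# Hypothetical server and prefix, for illustration only.
info_url = "http://example.org/iiif/" + encoded + "/info.json"
print(info_url)

# Decoding restores the original relative path.
assert unquote(encoded) == identifier
```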

The problem is that Apache2 and Tomcat (5-7) forbid this specific URL-encoded character, in order to prevent malicious use of the slash in unsecured CGI scripts. When it is encountered, Apache2 answers with an HTTP 404, while Tomcat answers with an HTTP 400. The "solution" is to set variables that deactivate this behaviour:
- http://httpd.apache.org/docs/2.2/en/mod/core.html#allowencodedslashes
- https://tomcat.apache.org/tomcat-7.0-doc/security-howto.html#System_Properties

I have some control over Tomcat, but not over Apache2, which forwards the requests to Tomcat.

Have you ever faced this problem in your IIIF implementations, and if so, how did you solve it? Is there any known workaround?

Thanks.

Julien

Robert Casties

Sep 18, 2015, 11:00:59 AM
to iiif-d...@googlegroups.com
Hi Julien,

On 18.09.15 16:13, Julien Lerouge wrote:
> The documentation (http://iiif.io/api/image/1.1/#url-encoding-and-decoding) says that the identifier part must be url-encoded. I have images in subdirectories of the base directory in which digilib searches in, so in the identifier part the file separator ('/') is url-encoded as '%2F'.
>
> The problem is that Apache2 and Tomcat(5-7) forbids the use of this specific url-encoded character, in order to prevent some malicious use of the slash in non-secured CGI scripts. When encountered, Apache2 answers with a HTTP 404, while Tomcat answers with a HTTP 400. The "solution" is to set variables to deactivate this behaviour :
> - http://httpd.apache.org/docs/2.2/en/mod/core.html#allowencodedslashes
> - https://tomcat.apache.org/tomcat-7.0-doc/security-howto.html#System_Properties
>
> I have some control over Tomcat, but not over Apache2, which forwards the requests to Tomcat.

I had the same problem. I am also using digilib (I'm the main developer
BTW ;-) but the issue is relevant to any server that can have slashes in
the identifier.

I have configured Tomcat and Apache accordingly to pass the encoded
slashes but it is a PITA.

I could add a hacky option to digilib to replace the slash in the
path/filename with a different character. Does anyone have a suggestion
of a replacement or a better idea?

Cheers
Robert

P.S. I never liked the use of a URL-path to transport multiple
parameters. The slash-issue is one problem caused by that design, not
being able to add or remove parameters is another. That's why they
invented query parameters.

Robert Sanderson

Sep 18, 2015, 12:00:50 PM
to iiif-d...@googlegroups.com

And for reference, the apache directive:

Python implementations also have this issue, even with AllowEncodedSlashes turned on, as the WSGI implementations automatically decode them :(
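
A small sketch of why the decoding matters, using only the standard library. The request path is hypothetical, and the helper names are made up for this example. Some servers expose the undecoded request URI under a non-standard environ key; whether yours does is something to check in its documentation.

```python
from urllib.parse import unquote

def split_decoded(path_info):
    """Split an already-decoded path, as a WSGI app typically sees PATH_INFO."""
    return path_info.strip("/").split("/")

def split_raw(raw_path):
    """Split the raw (still encoded) path first, then decode each segment."""
    return [unquote(seg) for seg in raw_path.strip("/").split("/")]

raw = "/iiif/dir%2Fimage.jp2/info.json"  # hypothetical request path

# What the application sees if the path was decoded before it was handed over:
print(split_decoded(unquote(raw)))  # ['iiif', 'dir', 'image.jp2', 'info.json']

# What it could recover if it had access to the raw path:
print(split_raw(raw))               # ['iiif', 'dir/image.jp2', 'info.json']
```

Once the path has been decoded, the encoded slash is indistinguishable from a real path separator.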

The best solution is, simply, not to use slashes in the identifier part of the Image API. You could expose them as some other character perhaps?

R



--
Rob Sanderson
Information Standards Advocate
Digital Library Systems and Services
Stanford, CA 94305

Justin Coyne

Sep 18, 2015, 1:22:46 PM
to iiif-d...@googlegroups.com
And why is it we can't just parse the URL from the right side? Then we could have slashes in the URI.

-Justin

Jon Stroop

Sep 18, 2015, 1:36:22 PM
to iiif-d...@googlegroups.com
Because image servers need to be able to distinguish between a malformed request (should respond w/ 400) and an unknown identifier (404). Whether you're counting slashes from the right or left it becomes almost impossible to know where a bad request ends and a good ID begins, or vice-versa.

At least that's where I wound up w/ Loris, and (as I've said elsewhere) the routing in Loris is an unholy mess of regexes because of it. And there are still edge cases where it could give the wrong response.

-Jon

Justin Coyne

Sep 18, 2015, 1:39:05 PM
to iiif-d...@googlegroups.com
If it has more than the minimum number of slashes, then it would be a valid request and (potentially) an invalid identifier. That's up to your resolver to determine.

-Justin

Robert Sanderson

Sep 18, 2015, 1:48:53 PM
to iiif-d...@googlegroups.com

This brings up an interesting point.  In the discussion around error conditions (http://iiif.io/api/image/2.0/#error-conditions), no requirements are actually stated, just a table that implies you MUST follow them, and the recommendation that the body be human readable.

We should be clearer that the individual error status code used is also SHOULD, not MUST.  And thus if a system can't distinguish between a badly formed request and an unknown identifier, it can simply issue either 400 or 404 as it wishes.

Would that solve the problem?

Rob

Jon Stroop

Sep 18, 2015, 2:16:31 PM
to iiif-d...@googlegroups.com
That's sort of this[1] right? I don't immediately see anything in the corresponding commit that fixes the issue.

1. https://github.com/IIIF/iiif.io/issues/93

Jon Stroop

Sep 18, 2015, 2:26:49 PM
to iiif-d...@googlegroups.com
prefix/identifier
prefix/identifier/info.json
prefix/identifier/region/size/rotation/quality.fmt

All need to work. If there are (4) slashes in my identifier, how can I know if it's a bad request or a bare identifier? I'm not saying it's impossible, just that simply being able to tokenize on unescaped '/'s would be much easier.
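
To make the "tokenize on unescaped slashes" idea concrete, here is a rough sketch. It is not how Loris does its routing; the function name, the fixed "/prefix" and the returned labels are assumptions for illustration, and it presumes the application can see the raw, undecoded path.

```python
from urllib.parse import unquote

def classify(raw_path, prefix="/prefix"):
    """Classify a raw (undecoded) request path by counting segments after the prefix.

    Returns a (kind, identifier) tuple; kind is 'base', 'info', 'image' or 'bad'.
    """
    if not raw_path.startswith(prefix + "/"):
        return ("bad", None)
    # Split BEFORE decoding, so that %2F inside the identifier is not a separator.
    segments = raw_path[len(prefix) + 1:].split("/")
    identifier = unquote(segments[0])
    if len(segments) == 1 and identifier:
        return ("base", identifier)
    if len(segments) == 2 and segments[1] == "info.json":
        return ("info", identifier)
    if len(segments) == 5:
        return ("image", identifier)
    return ("bad", identifier)  # wrong number of segments: candidate for a 400

print(classify("/prefix/a%2Fb"))                          # ('base', 'a/b')
print(classify("/prefix/a%2Fb/info.json"))                # ('info', 'a/b')
print(classify("/prefix/a%2Fb/full/full/0/default.jpg"))  # ('image', 'a/b')
print(classify("/prefix/a/b/info.json"))                  # ('bad', 'a')
```

With encoded slashes preserved, the segment count alone separates the three valid forms from a malformed request.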

Alex D.

Sep 18, 2015, 4:18:53 PM
to iiif-d...@googlegroups.com
Seems to me that what is being discussed is not so much a problem with the IIIF Image API as an issue with certain web application stacks' handling of the percent-encoded slash.

Fortunately, there's nothing in the Image API spec that says that paths on a filesystem have to be mapped to %2F in an identifier. A resolver in an Image API implementation, if that implementation finds %2F to be a problem, can solve it by, as Robert C. suggested earlier, simply mandating that a different character sequence than %2F be used as a path separator.

I think that percent-encoded slashes are the most elegant and standards-compliant way of representing paths in identifiers and appreciate that the Image API, as it is written, seems to agree.

Regarding different methods of URI parsing: a lot of web frameworks' routing systems will have problems dealing with an arbitrary number of path components, and won't allow control over e.g. in what direction they parse. I appreciate the simplicity of the API in this respect.

Alex

Justin Coyne

Sep 18, 2015, 4:35:46 PM
to iiif-d...@googlegroups.com
I think this comes down to who you want to make this easy for: the person who is building a IIIF application server, or the person who is deploying a IIIF server (possibly behind Apache or another proxy).  Currently the spec is written to favor the IIIF application authors over those who have to deploy the infrastructure.

-Justin

Alex D.

Sep 18, 2015, 4:56:07 PM
to iiif-d...@googlegroups.com
The spec doesn't favor anybody. All it says is that if you want to use a slash in the identifier you have to percent-encode it. It doesn't say that filesystem slashes have to map to percent-encoded slashes, nor whether using slashes in identifiers is a good idea, etc. That's up to implementers to decide. Implementers can simply express the path separator differently if they want.

Alex

Justin Coyne

Sep 18, 2015, 5:00:30 PM
to iiif-d...@googlegroups.com
I'm talking about implementers as software developers who build IIIF applications and have no control over what sort of identifiers their users would like to use.  So let's make the assumption that they want to support slashes as they are used in some places.  If the identifier of the object (not controlled by us) has slashes, it would be a good user experience for the user to be able to provide slashes to the IIIF application.

-Justin

Robert Sanderson

Sep 18, 2015, 5:26:49 PM
to iiif-d...@googlegroups.com

Hi Justin, all,

The identifier may also include ?, @, % or #, but those characters can't be used directly either. You just have to percent encode them.

The problem is not that you can't use a slash, it's that by the time the encoded slash gets to the IIIF Image server implementation, it has already been decoded, and hence the implementation cannot determine whether it's a bad request or an unknown identifier.
And that's not a specification problem, it's a web server / web framework implementation detail that affects some platforms. 

AllowEncodedSlashes is documented in the Apache implementation notes:

The misbehavior of web servers / web frameworks is not *our* problem. And in those situations where they do misbehave, there are workarounds: not using a slash, or giving a generic error when the path doesn't work.

Rob

Robert Sanderson

Sep 18, 2015, 5:44:06 PM
to iiif-d...@googlegroups.com

The commit fixed it by changing "must return 400" to "should return 400".  This means that the 404 that a level 0 implementation would return is not out of conformance, just not what is recommended.

Rob

Justin Coyne

Sep 18, 2015, 5:45:56 PM
to iiif-d...@googlegroups.com
While I would agree that it is not our problem that Apache behaves badly, we could certainly make it easier for people who work with misbehaving servers by just allowing slashes.  We are adding extra constraints in the spec where none are required.  We IIIF developers can cope with handling slashes without escaping.  It seems to me that the only reason it is in the spec is to make it easier to write a IIIF server.  This ease of development comes at the cost of being harder to deploy in some environments.

-Justin

Robert Sanderson

Sep 18, 2015, 6:01:40 PM
to iiif-d...@googlegroups.com

Agree completely that we should make systems easy to deploy, but there is a cost ... that of ambiguity in the specification.  And that's a pretty big cost, which we decided was not worth the gain when it can be worked around at the implementation level.

Here's the ambiguous situation if you allow slashes in the identifier:


Is that:
  prefix: a/b
  identifier: c
  region: d    
  size: e
  rotation: f
  quality: default
  format: jpg

(i.e. 400, you messed up region (and size, and rotation))

Or is it:

  prefix: a/b/c/d/e/f
  identifier: default.jpg

(i.e. 404, I don't know about that prefix or identifier)

It's impossible to know :(
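
The two readings can be reproduced mechanically. In this sketch the path string is an assumption, reconstructed from the two field listings above, and the variable names are illustrative only.

```python
# Assumed path, reconstructed from the field listings above.
path = "a/b/c/d/e/f/default.jpg"
parts = path.split("/")

# Reading 1: an image request. The last five segments are the request
# parameters, and whatever precedes the identifier is the prefix.
*prefix, identifier, region, size, rotation, last = parts
quality, fmt = last.rsplit(".", 1)
reading_1 = {
    "prefix": "/".join(prefix),
    "identifier": identifier,
    "region": region,
    "size": size,
    "rotation": rotation,
    "quality": quality,
    "format": fmt,
}

# Reading 2: a base URI. The last segment is the identifier and
# everything before it is the prefix.
reading_2 = {"prefix": "/".join(parts[:-1]), "identifier": parts[-1]}

print(reading_1)
print(reading_2)
```

Both dictionaries are derived from the same string, which is exactly why the server cannot tell which error applies.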

You can make a best guess what was intended but given that it's a workaround for misbehaving platforms, the decision was to document how to fix it in the most common platforms (apache, as per the note) and work around it at the implementation level where it's not possible.  In that case it just means not including encoded slashes in identifiers, but instead replacing with some other character.

We did consider query parameters, but the level 0 implementation is almost impossible. The query param model also failed to get traction in Djatoka's OpenURL pattern, and canonicalization and caching are much harder.

Rob

Justin Coyne

Sep 18, 2015, 6:08:07 PM
to iiif-d...@googlegroups.com

The IIIF Image API URI for requesting an image must conform to the following URI Template:

{scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}


The server is aware of scheme, server, prefix, so we can factor those right out.  Then, starting from the right, tokenize by . and /: the first token is format, the second is quality, the third rotation, the fourth size, the fifth region, and the rest of the tokens comprise the identifier.
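
A minimal sketch of that right-to-left tokenization, covering only the image request form of the URI template. The function name and the returned dictionary are assumptions for illustration; scheme, server and prefix are taken to be stripped already.

```python
def parse_from_right(path):
    """Parse '{identifier}/{region}/{size}/{rotation}/{quality}.{format}' from the right.

    Returns a dict of the parts, or None if there are too few tokens.
    Slashes that remain on the left all end up in the identifier.
    """
    tokens = path.rsplit("/", 4)
    if len(tokens) != 5 or "." not in tokens[4]:
        return None
    identifier, region, size, rotation, last = tokens
    quality, fmt = last.rsplit(".", 1)
    return {
        "identifier": identifier,
        "region": region,
        "size": size,
        "rotation": rotation,
        "quality": quality,
        "format": fmt,
    }

print(parse_from_right("dir/sub/image.jp2/full/full/0/default.jpg"))
print(parse_from_right("image.jp2/info.json"))  # None
```

Note that under this sketch a path such as a/b/c/d/info.json would also be read as an image request (quality "info", format "json"), so info.json and base URI requests would need separate handling.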

-Justin

Robert Sanderson

Sep 18, 2015, 6:15:44 PM
to iiif-d...@googlegroups.com

Right, but if you read just above that...

When the base URI is dereferenced, the interaction SHOULD result in the Image Information document.

So in the case I gave, if you interpret it as a request for a base URI (prefix+identifier) rather than an image request with an unknown prefix, you should return a 404.  If you interpret it as a very broken image request, then you should return 400.

Hence my suggestion that we can be more agnostic about the exact error conditions and get out of this problem :)

Rob

Justin Coyne

Sep 18, 2015, 6:20:16 PM
to iiif-d...@googlegroups.com
Ah! That is problematic.  I glossed over that part because it didn't seem very useful to me. I'm always hitting the Image Information Request URI, not the Base URI.  Does anyone use the Base URI?  Why?

-Justin

Robert Sanderson

Sep 18, 2015, 6:34:31 PM
to iiif-d...@googlegroups.com

It's currently not very useful, I agree :) Implementations should always request the info.json response with identifier/info.json (and do, as far as I know).  So why is it there?

The discussion (as far as I recall) was around two aspects:

1. REST based resource management

If you wanted to use a REST paradigm for managing a collection of images, you would probably POST to prefix/ and PUT/DELETE to prefix/identifier.  So what happens when you do GET on the identifier? We made the choice to go to the info.json rather than the full image based on the ...

2. Service pattern

If you have a service in the Presentation API (or in Image API 2.1 for a logo), and some system follows its nose and dereferences the URI, it would help that system more to get back the info.json than it would to have a set of bytes that happen to be an image, as the info.json describes the capabilities of the service that provides the image.


In the case where there are lots of slashes, a dot in the final path segment, and garbage in all of the tokens, we could simply specify that servers MUST return 404.  That would be compatible with the level 0 scenario.  Or we can say it doesn't matter which you return (400 or 404) as it's an error either way.  There's no situation (as far as I can tell) where you end up at a useful response, it's just the detail with which to tell the client that it did something wrong.

Justin Coyne

Sep 18, 2015, 6:56:03 PM
to iiif-d...@googlegroups.com
Yes, that was my conclusion. We can still tell the client, "nope" when we can't find a path or resolve an image and we can give them a resource when we have a resource that matches whatever ID they send along.

-Justin

Majewski Stefan

Sep 21, 2015, 4:50:35 AM
to iiif-d...@googlegroups.com
This might be a slightly provocative view, but the standard does not
state what precise purpose the prefix serves, nor that the prefix
MUST be the same in all cases for a server. Or have I missed a relevant
part? So indeed, as far as I can see there could be several prefixes for
the disambiguation of different identifier schemes. It might be
counter-intuitive, but a combination of prefix and identifier could be
the key to resolving the resource, even if compounding of both is used
for looking up the resource in a file-system. But then, there is
obviously still the issue that the URL has to be parsed from right to
left, which it already must as the prefix may contain slashes.

As it stands now the specs appear to be unambiguous, but it is important
to factor in that a number of path segments may occur within the prefix.
So given the example

http://example.org/a/b/c/d/e/f/g.jp2/1200,1200,300,300/pct:100/270/native.jpg

prefix: /a/b/c/d/e/f/
identifier: g.jp2

In Rob's example d does not follow the production rules for the region,
e does not follow the production rule for size, f does not follow the
rules for rotation and so on. So I don't see much ambiguity, except when
allowing slashes in both identifiers and prefixes. When allowing slashes in
only one of them, everything should be fine. Talking about a grammar for the
URL scheme, making an EBNF or similar for the URL scheme could be very
useful for implementers. What do you think?
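
As a rough idea of what checking tokens against production rules could look like, here is a simplified sketch. The patterns reflect one reading of the region, size and rotation syntax in Image API 2.0 and are not an official grammar; they do not check numeric ranges (for example, rotation between 0 and 360), and all names are made up for the example.

```python
import re

NUM = r"\d+(?:\.\d+)?"  # integer or decimal number

# Simplified patterns, assumed from the Image API 2.0 parameter syntax.
REGION = re.compile(rf"full|\d+,\d+,\d+,\d+|pct:{NUM},{NUM},{NUM},{NUM}")
SIZE = re.compile(rf"full|\d+,|,\d+|pct:{NUM}|!?\d+,\d+")
ROTATION = re.compile(rf"!?{NUM}")

def looks_like_image_params(region, size, rotation):
    """Return True if all three tokens match the simplified patterns."""
    checks = ((REGION, region), (SIZE, size), (ROTATION, rotation))
    return all(pattern.fullmatch(value) is not None for pattern, value in checks)

# Tokens from the example URL above:
print(looks_like_image_params("1200,1200,300,300", "pct:100", "270"))  # True

# Tokens d, e, f from Rob's example:
print(looks_like_image_params("d", "e", "f"))  # False
```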

cheers,
Stefan
--
Mag. Stefan Majewski
Projektmanager
Abteilung Forschung und Entwicklung
Österreichische Nationalbibliothek

Josefsplatz 1, 1015 Wien
Tel.: (+43 1) 534 10-434
E-Mail: stefan....@onb.ac.at
Skype: stefan.majewski.onb.ac.at

Robert Sanderson

Sep 21, 2015, 12:02:57 PM
to iiif-d...@googlegroups.com


Hi Stefan,

On Mon, Sep 21, 2015 at 3:50 AM, Majewski Stefan <stefan....@onb.ac.at> wrote:
> This might be a slightly provocative view, but the standard does not state what precise purpose the prefix serves, nor that the prefix MUST be the same in all cases for a server.

I don't think that's heretical or provocative :)  The bits of the identifier would indeed be treated as part of the prefix, without the slashes being encoded.  A system built to look for those path components in the URL would be able to do the right thing.


> In Rob's example d does not follow the production rules for the region, e does not follow the production rule for size, f does not follow the rules for rotation and so on. So I don't see much ambiguity, except when allowing slashes in both identifiers and prefixes. When allowing slashes in only one of them, everything should be fine. Talking about a grammar for the URL scheme, making an EBNF or similar for the URL scheme could be very useful for implementers. What do you think?

One thing that we've consciously resisted doing is saying that URL patterns other than the ones specified are somehow wrong.  For example, we make no claims about formats or qualities (or anything else) outside of the ones in the specification.  A formal grammar might be useful, but the positioning of it would need to be around implementation not around standardization, I think.

Rob


Julien Lerouge

Sep 25, 2015, 5:52:00 AM
to IIIF Discuss
Hi, thanks for all your useful answers.

I also think that parsing the query from right to left is not a solution, and I agree that the problem of resolving identifiers is left to the implementers of IIIF.
But since there is absolutely no guideline or advice on how the identifier part should be resolved, I fear that there will be some inconsistency between the various server-side implementations.
This may lead to a point where IIIF servers won't be *easily* interchangeable without rewriting the identifier part and/or modifying the system that stores the images.

Julien

Julien Lerouge

Sep 25, 2015, 5:56:06 AM
to IIIF Discuss
Hi Robert,

I would be very thankful if you could add some option in Digilib to replace the slash.
I suggest adding a parameter in digilib-config.xml, something like:

<parameter name="alternative-file-separator-char" value="place your char here" />

Please let me know if you plan to do it.

Julien

Robert Casties

Sep 25, 2015, 8:12:07 AM
to iiif-d...@googlegroups.com
Hi Julien,

On 25.09.15 11:56, Julien Lerouge wrote:
> I would be very thankful if you could add some option in Digilib to replace
> the slash.
> I suggest adding a parameter in digilib-config.xml, something like :
>
> <parameter name="alternative-file-separator-char" value="place your char
> here" />

I already started implementing this after the discussion on the mailing
list last Friday. Unfortunately I got stuck on a problem with the
initialisation order. I hope to get it done over the weekend.

The parameter will be called "iiif-slash-replacement" and the default
value will be the exclamation mark "!".
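
To illustrate the idea of such a replacement character, here is a generic sketch. This is not digilib's code (digilib is written in Java); the function names and the example path are made up, and only the "!" default comes from the description above.

```python
def to_iiif_identifier(file_path, replacement="!"):
    """Turn a relative file path into an identifier without slashes."""
    if replacement in file_path:
        # The mapping would not be reversible if the path already
        # contained the replacement character.
        raise ValueError("path already contains %r" % replacement)
    return file_path.replace("/", replacement)

def to_file_path(identifier, replacement="!"):
    """Turn an identifier back into a relative file path."""
    return identifier.replace(replacement, "/")

ident = to_iiif_identifier("manuscripts/ms123/page001.tif")
print(ident)                # manuscripts!ms123!page001.tif
print(to_file_path(ident))  # manuscripts/ms123/page001.tif
```

Because the identifier no longer contains a slash, encoded or not, it passes through Apache and Tomcat without any special configuration.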

Do you think "!" is a good choice?

I wanted to be extra sure not to have to deal with encodings and went
for a "safe character" (as per
<https://perishablepress.com/stop-using-unsafe-characters-in-urls/>)

Best
Robert
--
Dr. Robert Casties -- Information Technology Group
Max Planck Institute for the History of Science
Boltzmannstr. 22, D-14195 Berlin
Tel: +49/30/22667-342 Fax: -299

Robert Casties

Sep 26, 2015, 8:15:36 AM
to iiif-d...@googlegroups.com
On 25.09.15 14:12, Robert Casties wrote:
> On 25.09.15 11:56, Julien Lerouge wrote:
>> I would be very thankful if you could add some option in Digilib to replace
>> the slash.
>> I suggest adding a parameter in digilib-config.xml, something like :
>>
>> <parameter name="alternative-file-separator-char" value="place your char
>> here" />
>
> I already started implementing this after the discussion on the mailing
> list last Friday. Unfortunately I got stuck on a problem with the
> initialisation order. I hope to get it done over the weekend.
>
> The parameter will be called "iiif-slash-replacement" and the default
> value will be the exclamation mark "!".

I just checked the code for the slash replacement in the IIIF identifier
into the digilib SourceForge repository. Please try it if you like.

Best
Robert

Jason Ronallo

Sep 27, 2015, 9:08:11 PM
to iiif-d...@googlegroups.com
Is this slash issue relevant to this section in the specification? It would seem that percent encoding (or otherwise changing slashes to some other character) would potentially break the minimal implementation using pre-computed files.

"Both convey the request’s information in the path segments of the URI, rather than as query parameters. This makes responses easier to cache, either at the server or by standard web-caching infrastructure. It also permits a minimal implementation using pre-computed files in a matching directory structure." -- section 2

So let's say an implementer just wants to pre-compute their image files. If they have enough files they may want to break them up across directories. Maybe they want to use something like Pairtree to organize the images within (potentially deeply nested) directories. If the identifier for the image now involves directories like this, but the specification requires percent encoding the identifier, will it still work with any web server?

I'm new to the spec and the discussion around it, and I can't tell if this has been addressed here. Maybe I'm misunderstanding something important here?

Jason

Julien Lerouge

Sep 28, 2015, 4:15:39 AM
to IIIF Discuss
Thank you very much, I'll give it a try soon!

Julien