Percent-encoded colons in Islandora URLs -- do they have to be that way?

Peter Murray

Apr 10, 2013, 5:20:16 PM
to islandora@googlegroups.com List
I've been staring at Islandora URLs for a couple of months now, and there is one aspect of them that is quite, well, ugly: the percent-encoded colon ("%3A") that stands in for the colon between the Fedora Commons namespace prefix and the simple string identifier. The percent encoding makes URLs just a tad harder to write. For instance, it is the difference between:

http://www.example.org/islandora/object/myrepo%3Amyobject

...versus...

http://www.example.org/islandora/object/myrepo:myobject

Section 3.3 of RFC 3986 governing the structure of URIs (https://tools.ietf.org/html/rfc3986#section-3.3) makes some special cases for the use of colons in the path part of URIs -- "In addition, a URI reference may be a relative-path reference, in which case the first path segment cannot contain a colon (":") character" -- but the way Islandora URIs are structured we are guaranteed not to have a colon in the first segment of a URI.
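
To make the RFC 3986 point concrete, here is a minimal PHP sketch of the segment rules quoted above. The constant PCHAR and the function sketch_is_valid_segment() are hypothetical names made up for this illustration; the character classes are transcribed from the pchar and segment-nz-nc productions in section 3.3. This is not a check that Drupal or Islandora performs.

```php
<?php
// pchar = unreserved / pct-encoded / sub-delims / ":" / "@"  (RFC 3986, section 3.3)
const PCHAR = '(?:[A-Za-z0-9._~!$&\'()*+,;=:@-]|%[0-9A-Fa-f]{2})';

// Hypothetical helper: is $segment a legal path segment?
// When $first_of_relative_ref is true, apply the stricter segment-nz-nc
// rule (non-empty, no colon) that only governs the first segment of a
// relative-path reference.
function sketch_is_valid_segment($segment, $first_of_relative_ref = false) {
  if (preg_match('#^' . PCHAR . '*\z#', $segment) !== 1) {
    return false;
  }
  if ($first_of_relative_ref && ($segment === '' || strpos($segment, ':') !== false)) {
    return false;
  }
  return true;
}

// "myrepo:myobject" is never the first segment of an Islandora path, so the
// ordinary rule applies and the literal colon is allowed.
var_dump(sketch_is_valid_segment('myrepo:myobject'));
// The same string would only be a problem as the first segment of a
// relative-path reference.
var_dump(sketch_is_valid_segment('myrepo:myobject', true));
```

The first call reports true and the second false, which is the distinction the RFC passage draws.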

So my questions:

* Is the percent-encoded colon something that Drupal is doing on our behalf? If so, has anyone looked at forcing Drupal not to do that?

* Is the percent-encoded colon something that Islandora is doing? If so, are there design constraints that make that a requirement?

I'm pretty sure I could fix this with Apache HTTPD rewrite rules, but would like to avoid that path unless I have no other choices.


Peter
--
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
Peter....@lyrasis.org
+1 678-235-2955
800.999.8558 x2955

Adam Vessey

Apr 10, 2013, 6:15:19 PM
to isla...@googlegroups.com
The percent-encoding is automatic: it happens when the path is passed
through the url() function (which the l() function also uses). We pass it
something like "islandora/object/some:pid", url() hands that to
drupal_encode_path(), which runs it through rawurlencode() and then
replaces the percent-escaped forward slashes with "real" forward slashes
again... Hurray for tracing the calls? :P
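
The encoding step at the end of that call chain can be sketched in a few lines of plain PHP. The function name sketch_encode_path() is a hypothetical stand-in so the snippet runs outside Drupal; it only mirrors the rawurlencode()-then-restore-slashes behaviour described above and leaves out everything else url() does (clean URLs, language prefixes, aliases).

```php
<?php
// Hypothetical stand-in for the encoding step described above.
function sketch_encode_path($path) {
  // rawurlencode() escapes every character outside the RFC 3986
  // "unreserved" set, so "/" becomes %2F and ":" becomes %3A.
  $encoded = rawurlencode($path);
  // Only the slashes are turned back into literal characters; the
  // colon stays percent-encoded.
  return str_replace('%2F', '/', $encoded);
}

echo sketch_encode_path('islandora/object/some:pid'), "\n";
// prints: islandora/object/some%3Apid
```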

We want paths to go through the url() function, because it can deal with
clean URLs and language prefixes and the like, so without patching
Drupal or reimplementing the url() function such that we undo the
encoding of the colon as well and changing all the calls, you're pretty
much stuck with filtering the HTML(/output, as these links may be
contained in JavaScript or other formats as well)... I'm not sure that
mod_rewrite (if that's what's meant by "Apache [...] rewrite rules")
would handle this case. I was of the impression that it just transformed
the request, so if you requested
"http://www.example.org/islandora/object/myrepo%3Amyobject", it could
transform it to
"http://www.example.org/islandora/object/myrepo:myobject" behind the
scenes, with the user being none the wiser... Instead, you'd have to
change the URLs anywhere they might happen in the body of the
response... Something like the sed example here:
http://httpd.apache.org/docs/2.2/mod/mod_ext_filter.html

... I'm actually not sure I completely understand the problem... You
should be able to enter URLs with the unescaped colon in the PID, and
have it find the proper object, such that both of the example URLs you
provided should work the same (provided they pointed at a valid install,
of course), so as to "write" it either way.

- Adam

Peter Murray

Apr 10, 2013, 8:55:54 PM
to isla...@googlegroups.com
Thanks for the reply, Adam. I can indeed use a URL without the percent-encoded colon (http://www.example.org/islandora/object/myrepo:myobject) to access the object. The problem I'm trying to address is that this isn't what users see in the URL bar as they normally browse the website. For instance, if I go to the full screen view of that object, I'll get this URL:

http://www.example.org/islandora/object/myrepo%3Amyobject/datastream/OBJ/view

... and I'd like to avoid seeing the percent-encoded colon addresses entirely. (This is an aesthetic desire, not a functional one.)

If I were to go the Apache HTTPD route, I'd try to use something like mod_proxy_html (https://httpd.apache.org/docs/2.4/mod/mod_proxy_html.html) to rewrite the outgoing representations. (Or, heaven forbid, try to embed a mod_perl routine in the response chain.)

I appreciate the stack trace. I agree that it wouldn't be good to circumvent all of the good that the url() function does. Maybe there is a way to address this in Drupal.


Peter

Nick Ruest

Apr 10, 2013, 10:14:58 PM
to isla...@googlegroups.com
Somewhat related: before I left McMaster, I was working on rewriting all
of the URLs into something much more aesthetically pleasing as well.
Something like:

example.org/prefix:pid

I was able to resolve the above, but not rewrite without breaking
everything. I'm no regex master, so I probably have some silly errors
below. But, maybe it is a start?

RewriteBase /
RewriteRule ^(macrepo:)(.*)$ fedora/repository/$1$2
RewriteRule ^fedora/repository/(macrepo:)(.*)$ /$1$2$3 [R=301]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !=/favicon.ico
RewriteCond %{REQUEST_URI} !/^macrepo:-[0-9]+-.+/
RewriteRule (macrepo:)(.*)$ fedora/repository/$1$2
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]

-nruest

Aaron Coburn

Apr 11, 2013, 8:37:13 AM
to <islandora@googlegroups.com>
Just my two cents, but if all of the objects in your repository use the same namespace, why not simply exclude the namespace and colon altogether? This involves some code changes to Islandora, but the Drupal module can simply be configured to add the Fedora namespace behind the scenes, leaving the URLs much "cooler" for the public.

For example,

http://www.example.org/islandora/object/myrepo:myobject/datastream/OBJ/view

could simply be:

http://www.example.org/islandora/object/myobject/datastream/OBJ/view

In our repository, URLs look more like:

http://www.example.org/view/myobject

for the default object dissemination, or

http://www.example.org/view/myobject/DSID

for particular datastreams

-Aaron

Peter Murray

Apr 12, 2013, 3:00:52 PM
to isla...@googlegroups.com
In case anyone else is interested, there is a way to address the percent-encoded colons, but one has to modify core Drupal -- at least in Drupal 7. This appears to be addressed in Drupal 8:

http://api.drupal.org/api/drupal/core!vendor!symfony!routing!Symfony!Component!Routing!Generator!UrlGenerator.php/class/UrlGenerator/8

See "View Source" at these lines:

// the following chars are general delimiters in the URI specification but have only special meaning in the authority component
// so they can safely be used in the path in unencoded form
'%40' => '@',
'%3A' => ':',

For Drupal 7, however, one has to go into includes/common.inc and tweak drupal_encode_path() (line 639 of version 7.20):

function drupal_encode_path($path) {
  // Restore literal slashes (as stock Drupal does) and, with this tweak, literal colons as well.
  return str_replace('%3A', ':', str_replace('%2F', '/', rawurlencode($path)));
}

I find it interesting that there is a notation in the comment block of this function: "For aesthetic reasons slashes are not escaped." Unfortunately, I cannot find any hooks that modify the encoded path after the PHP rawurlencode() function has been called. (The Drupal hook_url_outbound_alter lets you alter the path before the rawurlencode(), and there isn't anything after that I can find.)

I have not fully contemplated the implications of this change -- particularly in the security context -- but it does solve the problem. (As an aside, fortunately I've decided to track Drupal in my Git repo, so I have some confidence that I'd be able to carry this change forward through versions of Drupal.)
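
As a quick way to see what the tweak does and does not change, here is a standalone sketch that can be run outside Drupal. The name sketch_encode_path_keep_colon() is hypothetical, and strtr() is swapped in for the nested str_replace() calls purely for readability; the result should be the same for these two replacements.

```php
<?php
// Hypothetical standalone version of the modified encoding step.
function sketch_encode_path_keep_colon($path) {
  // Percent-encode everything first, then restore only "/" and ":".
  return strtr(rawurlencode($path), array('%2F' => '/', '%3A' => ':'));
}

// The colon in the PID now survives ...
echo sketch_encode_path_keep_colon('islandora/object/myrepo:myobject/datastream/OBJ/view'), "\n";
// ... while other reserved characters are still escaped.
echo sketch_encode_path_keep_colon('islandora/object/my repo?x'), "\n";
```

Only the two listed escapes are reversed, so spaces, question marks and the like are still encoded. That narrows the change, but it does not settle the security question raised above.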


Peter

Mitchell MacKenzie

Apr 12, 2013, 8:34:19 PM
to isla...@googlegroups.com
If you don't care about what JavaScript-unaware bots/browsers see, you could try some JavaScript to pretty the URLs.


Mitch

Peter Murray

Apr 15, 2013, 10:50:40 AM
to isla...@googlegroups.com
Interesting idea, Mitch. That would be a good alternative. I'm also thinking about the linked data aspects of the URLs, which is also giving me pause about hacking the Drupal core, and the JavaScript solution wouldn't address that. That the underlying (pre-JavaScript-modified) URL is different from the one displayed in the browser could be a source of confusion.


Peter

On Apr 12, 2013, at 8:34 PM, Mitchell MacKenzie <mitc...@gmail.com> wrote:
>
> If you don't care about what javascript unaware bots/browsers see, you could try some javascript to pretty the URLs.
>
> Quick example here... https://gist.github.com/mitchmac/5376222
>
> Mitch

Rosemary Le Faive

Apr 17, 2013, 11:32:15 AM
to isla...@googlegroups.com
Hey everyone,

I'm glad I'm not the only one annoyed with the "ugly" islandora urls. I tried several of the suggestions above, but wanted to avoid modifying core. 

My concerns were to put SEO keywords in the url, get rid of the "%3A", as well as have the option to change the "islandora/object" part. After screwing around unsuccessfully with mod_rewrite, I went for aliases and pathauto. 

I *think* aliases are a good choice for making a "pretty" public-facing URL while keeping the canonical [right word?] /islandora/object/[pid] URLs accessible for linked data and such. You don't have to rewrite how Islandora generates the string it passes to url(), because the url() function replaces it with a Drupal alias when one exists. I'm hoping that because of this, the site will never expose a canonical URL to crawlers, so it won't cause duplication in search engines.

I made a module for Drupal 7 that is *alpha as heck* that uses Tokens and Pathauto. When installed, you can configure the desired URL patterns at http://[host]/admin/config/search/path/patterns, then run the batch update to create aliases for the objects in your repository. In part of Pathauto's token replacement code, it removes the colon entirely from the token [fedora:pid], so you get (for example) islandora123 - not exceptionally pretty, but it takes care of the %3A.
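
The effect of that colon removal can be sketched in plain PHP. Both function names and the "documents" prefix are hypothetical, chosen only for this illustration; the real cleaning is done by Pathauto's own token replacement and is more involved than a single str_replace().

```php
<?php
// Hypothetical helper: mimic the colon being dropped from the PID token.
function sketch_clean_pid_token($pid) {
  return str_replace(':', '', $pid);
}

// Hypothetical helper: build an alias from a pattern prefix plus the cleaned PID.
function sketch_alias($pattern_prefix, $pid) {
  return trim($pattern_prefix, '/') . '/' . sketch_clean_pid_token($pid);
}

echo sketch_alias('documents', 'islandora:123'), "\n";
// prints: documents/islandora123
```

Because the alias no longer contains a colon, nothing is left for rawurlencode() to turn into %3A.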

Peter Murray

Apr 17, 2013, 6:03:58 PM
to isla...@googlegroups.com
Ah! I'm glad to see someone shares my annoyance! SEO is actually something I hadn't considered for the URLs, and I like your solution. A quite common pattern that I've seen is to put the title of something (say a news article) in the URL. In our world that would look something like:

http://example.org/object/demo:5/data-object-coliseum-for-local-simple-image-demo

In reality, Drupal would ignore everything after the "demo:5" in the URL (so the title could change, if needed) and resolve to the correct object. I haven't thought through the SEO implications of this, but we might want to use a rel=canonical link in <head> that points to the shorter version:

<link rel="canonical" href="http://example.org/object/demo:5" />
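
A rough sketch of that idea in PHP: take the incoming path, keep only the part up to and including the PID, and build the canonical link from it. The function names are hypothetical, and the regular expression assumes a simplified PID shape of "namespace:identifier"; it is not Islandora's or Fedora's actual PID validation.

```php
<?php
// Hypothetical helper: reduce "object/<pid>/<anything>" to "object/<pid>".
// Returns null when the path does not look like an object path.
function sketch_canonical_path($path) {
  // Simplified PID shape (an assumption): namespace, colon, identifier.
  $pattern = '#^object/([A-Za-z0-9.-]+:[A-Za-z0-9._~-]+)(?:/.*)?\z#';
  if (preg_match($pattern, $path, $matches) !== 1) {
    return null;
  }
  return 'object/' . $matches[1];
}

// Hypothetical helper: render the rel=canonical element for a path.
function sketch_canonical_link($base_url, $path) {
  $canonical = sketch_canonical_path($path);
  if ($canonical === null) {
    return '';
  }
  $href = htmlspecialchars($base_url . '/' . $canonical, ENT_QUOTES, 'UTF-8');
  return '<link rel="canonical" href="' . $href . '" />';
}

echo sketch_canonical_link(
  'http://example.org',
  'object/demo:5/data-object-coliseum-for-local-simple-image-demo'
), "\n";
```

Since everything after the PID is discarded before the object is looked up, the title part of the URL can change without breaking old links.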

Thanks for sharing your code. Using Tokens and Pathauto seems like a good solution to this problem that avoids hacking core.


Peter

Nick Ruest

Apr 18, 2013, 1:10:32 AM
to isla...@googlegroups.com
I second Peter's annoyance, and want to express my appreciation for
whipping this together and sharing the code!

If you don't mind, I have a couple questions :-)

I have it installed, and I'm able to manually set up aliases for
Islandora/Fedora objects. But, I'm not able to bulk update. When I try
to do it, I get: 1) Initializing, 2) Completed two out of two, then 3) 0
object aliases were updated. No new URL aliases to generate.

I don't see anything in watchdog, or my apache log, and I can't seem to
replicate your sparql query in risearch (although my sparqlfu is
*weak*). Any suggestions on where to look to troubleshoot?

thanks!

-nruest

(**secretly hopes this work leads to an Islandora sitemap module to
really make the SEO gods happy**)

Rosemary Le Faive

Apr 18, 2013, 7:16:27 AM
to isla...@googlegroups.com
Glad you tried it! Indeed, it was the sparql query, because - oops! - I left it tailored to my config. Sorry. 

The easy option: change "pubs:collection" to the pid of the collection of documents that you want to alias. 

The hard option is also beyond my sparql-fu; I'm not sure how to "get all objects" but exclude system objects, content models, and possibly collections.
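
For the easy option, the collection PID could be passed in as a parameter instead of being hard-coded. The sketch below is not the module's actual query; the function name is hypothetical, and it assumes the objects are related to their collection through the Fedora RELS-EXT isMemberOfCollection predicate and that the query is sent to the resource index as SPARQL.

```php
<?php
// Hypothetical helper: build a SPARQL query for the members of one collection.
// Assumes membership is recorded with the RELS-EXT isMemberOfCollection predicate.
function sketch_collection_members_query($collection_pid) {
  $predicate = 'info:fedora/fedora-system:def/relations-external#isMemberOfCollection';
  return "SELECT ?object FROM <#ri> WHERE {\n"
    . "  ?object <$predicate> <info:fedora/$collection_pid> .\n"
    . "}";
}

// Swap in the PID of whichever collection should get aliases.
echo sketch_collection_members_query('pubs:collection'), "\n";
```

This only covers direct members of a single collection; excluding system objects or content models across the whole repository would need further filters.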

-rosie

Nick Ruest

Apr 18, 2013, 7:49:45 AM
to isla...@googlegroups.com
Awesome! I'll play around with the sparql query.

As for the hard option, that's where you say, "Patches welcome!" :-)

-nruest