clojure.java.io/Coercions doesn't do URL-escaping

155 views
Skip to first unread message

Colin Jones

unread,
Nov 27, 2011, 11:37:09 PM11/27/11
to cloju...@googlegroups.com
I noticed a problem today when trying to slurp a resource file (using clojure.java.io/resource to get ahold of the file): https://gist.github.com/1398972

The gist is that when the path to a resource contains (for instance) spaces, attempts to read that resource with clojure.java.io's make-reader function will fail, because we look up the URL-escaped resource URL directly as a File / FileInputStream, without decoding it. Same problem with writing to that resource using make-writer. Granted, since resources are often in jarfiles, writing is less important, but hey, it could happen? From looking at the code, these same issues must exist with URLs of other kinds, though I also haven't looked deeply into other cases for making a writer out of a URL.

The core problem here seems to me to be that the clojure.java.io/Coercions implementations for URL -> File and File -> URL don't take URL escaping into account, and I think they should.

Going in the other direction (File -> URL), I don't have a concrete error case, but it seems reasonable that when creating a URL from a File, URL-escaping should be used. Java 1.6 agrees and deprecates File#toURL in favor of using File#toURI and calling .toURL on the URI that comes back, in order to do the proper escaping. A similar recommendation has been there since 1.4, but the deprecation is new to 1.6.

The small changes here fix the behavior in both directions (File->URL and URL->File), for reading & writing: https://gist.github.com/1398988

I know I should be pointing out tradeoffs, so the one thing I can imagine here is that there's a performance penalty. I'm not sure exactly how big the hit is for the extra #toURI call or to call URLDecoder/decode, but by the time you hit this code, you're doing I/O anyway. So maybe the hit is less important.  The other thing is that you'd no longer be guaranteed that to have a string-equivalent thing on both sides of the coercion, or that this might complect as-file / as-url, but I'd argue that URL-escaping is pretty core to what a URL is, and that any conversion process between URLs and anything else needs to know about it.

I've created a JIRA ticket here: http://dev.clojure.org/jira/browse/CLJ-885

Thoughts? Apologies in advance if I'm missing something big here.
 
- Colin


Micah Martin

unread,
Nov 28, 2011, 11:53:54 AM11/28/11
to cloju...@googlegroups.com
I have run into this problem as well.  It would be good to see the patch included.

Micah

--
You received this message because you are subscribed to the Google Groups "Clojure Dev" group.
To view this discussion on the web visit https://groups.google.com/d/msg/clojure-dev/-/aOTRj5W8lbYJ.
To post to this group, send email to cloju...@googlegroups.com.
To unsubscribe from this group, send email to clojure-dev...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/clojure-dev?hl=en.

David Powell

unread,
Nov 28, 2011, 12:14:21 PM11/28/11
to cloju...@googlegroups.com
Yes, this is a bug and should be fixed.

The fix to File -> URL coercion is fine, but the fix to URL -> File
coercion isn't.

URLDecoder is actually a query-string decoder, not a URL path decoder,
so the patch converts +'s in URLs to spaces, which is incorrect. (I
generally avoid using the term url-encoded to avoid this confusion).

--
Dave

Chris Perkins

unread,
Nov 28, 2011, 12:25:39 PM11/28/11
to cloju...@googlegroups.com

David Powell

unread,
Nov 28, 2011, 12:56:12 PM11/28/11
to cloju...@googlegroups.com

Even the suggested Plexus Utilities gets it wrong too, by treating the
percent encoding as iso-8859-1 characters rather than utf-8 octets.
Commons IO gets that right.

--
Dave

Colin Jones

unread,
Nov 28, 2011, 1:35:54 PM11/28/11
to cloju...@googlegroups.com
Thanks a bunch for the feedback. So I see a couple of options to fix
the URL -> File problems that David and Chris point out:

0. <strikethrough>depend on Commons IO</strikethrough>
1. Re-implement the Commons IO toFile conversion, which is ~15 lines
and would get smaller in Clojure
(http://grepcode.com/file/repo1.maven.org/maven2/commons-io/commons-io/1.2/org/apache/commons/io/FileUtils.java#FileUtils.toFile%28java.net.URL%29)
2. Use (File. (URI. (.toExternalForm url))), like the last example
given at http://maven.apache.org/plugin-developers/common-bugs.html#Converting_between_URLs_and_Filesystem_Paths,
and just don't allow converting malformed URLs from pre-1.4 or Maven
2.x classloaders.

#1 is straightforward and seems the most helpful, and it seems like a
nice-to-have for clojure.java.io anyway. I'm assuming we don't need to
worry about licensing here.
#2 seems less good, though I don't have a strong opinion about whether
the ability to handle malformed URLs here is a requirement.

What do you think? Are there other options? If there's a consensus
around one of these (or another), I'm happy to do an updated patch.

- Colin

> --
> You received this message because you are subscribed to the Google Groups "Clojure Dev" group.

> To post to this group, send email to cloju...@googlegroups.com.
> To unsubscribe from this group, send email to clojure-dev...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/clojure-dev?hl=en.
>
>

--
Colin Jones

David Powell

unread,
Nov 28, 2011, 2:40:37 PM11/28/11
to cloju...@googlegroups.com
> 2. Use (File. (URI. (.toExternalForm url))), like the last example
> given at http://maven.apache.org/plugin-developers/common-bugs.html#Converting_between_URLs_and_Filesystem_Paths,
> and just don't allow converting malformed URLs from pre-1.4 or Maven
> 2.x classloaders.

Clojure requires Java 1.5 anyway, so it would only be an issue if
people call getResource on a Maven classloader for affected issues of
Maven. That doesn't sound too bad.

--
Dave

Colin Jones

unread,
Nov 29, 2011, 12:52:07 AM11/29/11
to cloju...@googlegroups.com
Looks like option #2 causes test failures, namely... malformed URLs :)
So that option would expand to add leading slashes on two URL-ish
things:

- (File. "baz") (URL. "file:baz")
- (File. "quux") (URI. "file:quux")
+ (File. "baz") (URL. "file:/baz")
+ (File. "quux") (URI. "file:/quux")

But after seeing the scary error, and not knowing all the intricacies
of RFC 2396, I worry that there might be places where these malformed
URLs might happen in the wild. Is this something we should worry
about? That Maven article names 2 groups of classloaders, and as you
mention, one is kind of moot, but could there be more? Or other ways,
like (URI. "file:foo.txt"), for malformed URI objects to get generated
that we should nevertheless support? FWIW, googling the exception
message, "uri is not hierarchical", gets quite a few hits, which feels
like evidence that it happens, but may all be due to those specific
classloaders...

I'm not convinced one way or another that the bad URLs should be
supported, but this would be a breaking change for them, as they
currently kind of work. So at this point I'm leaning towards the
CommonsIO string-translation approach.

Thoughts?

Colin Jones

unread,
Nov 29, 2011, 11:03:54 PM11/29/11
to cloju...@googlegroups.com
OK, one last call for advice before I put a revised patch up, using a
similar algorithm to CommonsIO: any opposition to requiring
clojure.string from clojure.java.io? I don't take adding dependencies
lightly, but I do like the simplicity of the algorithm when
clojure.string/replace is available, and clojure.string feels like a
reasonable thing to depend on for a string-replacement algorithm.

See the difference here:
With clojure.string: https://gist.github.com/e240ea21d93638df6c26
Without: https://gist.github.com/42584777aceaab6625e4

Granted, the second example tends a bit towards straw man territory;
there are surely some simplifications available there. But the core
algorithm isn't going to get much clearer without doing something
drastic to abstract away the loop-recur holdover from the more direct
translation from Java. So I haven't yet taken the time to really clean
that up seriously, since I so much prefer the clojure.string/replace
solution. Performance-wise, it's probably a bit of a hit to do regexp
matching, but I'm not sure whether it's more than instantiating a
bunch of new strings, as CommonsIO does.

Thoughts? I'll post an updated patch to JIRA tomorrow for the
clojure.string version if nobody sees problems with the basic idea
right away.

--
Colin Jones

Colin Jones

unread,
Nov 30, 2011, 1:38:28 PM11/30/11
to cloju...@googlegroups.com
I've updated the patch (http://dev.clojure.org/jira/browse/CLJ-885) as
described already.

I'm not sure what I need to do process-wise to get vetting/screening,
but I'm assuming I haven't done it yet. I also don't have permissions
to edit any fields on the ticket (not sure whether that's required).

If someone knows what the next step should be, could they point me in
the right direction?

Thanks!
- Colin

--
Colin Jones

Stuart Sierra

unread,
Nov 30, 2011, 3:05:05 PM11/30/11
to Clojure Dev
Patch vetted.
-S
Reply all
Reply to author
Forward
0 new messages