A URL is a string. If you don't do anything potentially dangerous
with a string, there is no security issue. Contrary to popular
opinion, there is no computer equivalent of the "killer joke" that
kills anyone who hears it: a string that the computer never processes
can't harm it.
You haven't said whether the URL points to YOUR servers.
Example: the vast majority of the URLs in Google's search
engine point to somewhere besides Google. That's the whole point
of a worldwide search engine, right? It's also not at all uncommon
for ads on a website to have URLs that point to the advertiser's
website, not the website you're viewing. A PHP page might be set up
to select a random ad, output a link to it, and log the page view,
so subsequent runs prefer ads with fewer views.
Some potentially dangerous things to do with a string in PHP, especially
if it contains user-supplied data, or comes from a database that might
have unvetted user-supplied data:
(1) Use parts of the string to reference the file system
(bypassing web server permission restrictions). Quoting doesn't
really work here - you need to reject attempts to reach outside
the section of the filesystem you mean to expose.
(2) Use parts of the string in a SQL query without proper quoting.
(SQL injection)
(3) Use the string in content passed to eval() without proper quoting.
(Executing arbitrary code)
(4) Use the string in content passed to a shell without proper quoting.
(Executing arbitrary code, `rm -rf /` being the standard bad example)
WARNING: DO NOT USE THIS POST AS A SHELL "HERE" DOCUMENT.
(5) Use the string as an email address passed to a mail transport service.
(Spamming, injecting extra destinations into email headers.)
(6) Output the string to a web page. (Possible XSS attack.)
(7) Validate the string using only JavaScript, which can be turned off,
or with HTML (such as input length limits), which can be bypassed,
say, by telnet to the HTTP port, with a URL manually typed in.
(Bots tend not to use real browsers anyway.) The string should also
be validated on the server side, although letting the user know of a
problem BEFORE pressing submit is more user-friendly.
(8) Use a non-constant string (subject to variable substitution) in
a filesystem (or, *MUCH WORSE*, URL) reference, such as include,
require, etc.
(9) Use even a constant string that refers to a website you don't
control in a reference that executes the output as code, such as
include, require, etc. DNS spoofing or playing games with ARP
could make even references to *YOUR* sites dangerous.
(10) It's generally a bad idea to alter data in response to an HTTP GET
request, such as transferring money, ordering merchandise, or
deleting records (think about what a webcrawler might do to your
database!). Use HTTP POST for that. Logging hits and page view
counts is an exception.
(11) The data fetched from a URL should be treated as user-supplied.
Lots of other stuff I forgot to mention.
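To make items (2) and (6) concrete, here is a minimal sketch. It is written in Python rather than PHP purely for illustration (PHP has its own equivalents), and the table name `links`, the column `url`, and both helper functions are made up for the example. It shows a bound query parameter instead of string splicing, and HTML escaping on output.

```python
import html
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (url TEXT)")

def store_url(conn, url):
    # Item (2): the driver binds the value; it is never spliced into the SQL text.
    conn.execute("INSERT INTO links (url) VALUES (?)", (url,))

def render_link(url):
    # Item (6): escape the string for use in an attribute and as element text.
    safe = html.escape(url, quote=True)
    return '<a href="{0}">{0}</a>'.format(safe)

hostile = "http://example.com/'); DROP TABLE links; --\"><script>alert(1)</script>"
store_url(conn, hostile)
stored = conn.execute("SELECT url FROM links").fetchone()[0]
link = render_link(stored)
```

The hostile string is stored and displayed as plain data. Note that escaping alone does not stop a `javascript:` URL in an href; checking the scheme would be a separate step.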
For example:
http://www.google.com/news/../../../../../etc/passwd
is not dangerous because:
(1) It does not refer to YOUR server, so it's not YOUR security problem.
(2) If it *DOES* refer to YOUR server, rejecting the string doesn't
improve the situation much if a direct request to your web server
processes it anyway. However, you need to avoid introducing new
problems of this type by copying a string that is part of a URL
reference to a filesystem reference.
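If you do copy part of a URL into a filesystem reference, one common check is to resolve the result and refuse anything that lands outside the directory you meant to serve. A minimal sketch in Python (not PHP), assuming a POSIX-style filesystem; `resolve_under` and the base directory `/srv/www` are hypothetical names for the example:

```python
import os

def resolve_under(base_dir, url_path):
    """Return the absolute path for url_path inside base_dir, or None if it escapes."""
    base = os.path.realpath(base_dir)
    # Strip leading slashes so join() is never handed an absolute path.
    candidate = os.path.realpath(os.path.join(base, url_path.lstrip("/")))
    # commonpath compares whole components, so /srv/www2 is not "inside" /srv/www.
    if os.path.commonpath([base, candidate]) != base:
        return None
    return candidate

# The path from the example above is rejected; an ordinary one is accepted.
blocked = resolve_under("/srv/www", "/news/../../../../../etc/passwd")
allowed = resolve_under("/srv/www", "/folder/subfolder/file")
```

Because the check runs on the resolved path, it also catches symlinks that point outside the base directory, which a simple search for "../" in the string would miss.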
> Sorry. I think I have been at fault in not being clear that although I
> am getting data from a URL my application is really using the URL path
> as a unique key.
You appear to be using it as part of a *reference into your
filesystem*, which is dangerous (but it can be made safe with
appropriate checking), and that should suggest what checks are
necessary. You may also have to deal with the "unique key" not
really being unique if your filesystem is case-insensitive
but the URL is case-sensitive.
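A rough way to spot that problem ahead of time is to group keys by their case-folded form. This is a sketch in Python, and `find_case_collisions` is a made-up helper. Case folding only approximates what a real case-insensitive filesystem does, so treat a hit as a warning rather than proof:

```python
def find_case_collisions(paths):
    """Group URL paths that differ only by case."""
    seen = {}
    for p in paths:
        # casefold() is an approximation of filesystem case-insensitivity.
        seen.setdefault(p.casefold(), []).append(p)
    return [group for group in seen.values() if len(group) > 1]

collisions = find_case_collisions(["/Folder/File", "/folder/file", "/other"])
```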
If the URL is supposed to refer to YOUR site, you can put arbitrary
restrictions on what is valid. For example, you might allow / (only
as the first character, and only one of them), plus 0, O, o, i, I, L,
l, 1, !, and |, and absolutely *NO* other characters. (Note: look
carefully - there are no repeat characters in that list, which
intentionally contains characters that are hard to distinguish from
each other.) This might not make a whole lot of sense, but it's
allowed. That ends the problem with ../ . It's no longer just a URL,
it's a
reference into your filesystem. Or database, or whatever.
../ may be a problem in a filesystem reference but not in a database reference.
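That deliberately silly rule can be written as a whitelist check. A sketch in Python, where `is_valid_key` is a made-up name and treating the leading slash as optional is my reading of the rule, not something stated above:

```python
import re

# Optional single leading slash, then one or more characters from the allowed set.
ALLOWED = re.compile(r"/?[0OoiILl1!|]+")

def is_valid_key(path):
    # fullmatch() requires the whole string to match, so nothing can trail after it.
    return ALLOWED.fullmatch(path) is not None

ok = is_valid_key("/0OoiILl1!|")
bad = is_valid_key("/../etc/passwd")
```

Since "." is not in the allowed set, ../ can never get through, and there is no need to test for it separately.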
> What the URL contains after the site name is taken as a
> hierarchical sequence of key values. In
Sequence of key values, or filesystem reference? There's a difference.
A sequence of key values doesn't have an issue with ../ other than that
it's probably an invalid key. It's the filesystem that gives ../ a special meaning.
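To show the difference, here is a sketch in Python that treats the path purely as a sequence of keys. The nested dict `tree` is a hypothetical stand-in for whatever database you key this way, and `lookup` is a made-up helper:

```python
from urllib.parse import urlsplit

def lookup(tree, url):
    """Walk nested dicts using each path segment as a key; None if any key is missing."""
    node = tree
    for key in urlsplit(url).path.split("/"):
        if key == "":
            continue  # skip empty segments from leading or doubled slashes
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

# Hypothetical data standing in for a store keyed by path segments.
tree = {"folder": {"subfolder": {"file": "record-1"}}}
found = lookup(tree, "http://site.com/folder/subfolder/file")
missing = lookup(tree, "http://site.com/folder/../../etc/passwd")
```

Here ".." is just another key that doesn't exist, so the lookup fails and nothing outside the data is ever touched.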
> http://site.com/folder/subfolder/file