#36131: URLValidator not correctly validating URLs
-------------------------------+----------------------------------------
Reporter: Ludwig Kraatz | Type: Bug
Status: new | Component: Core (Other)
Version: 5.1 | Severity: Normal
Keywords: URL Validator | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 1 | UI/UX: 0
-------------------------------+----------------------------------------
== Abstract
A URL is a way of describing a resource.
https://resource is a valid URL.
== Why I am raising this as an issue
A URL is constructed like this [RFC 3986, Section 3]:
{{{
  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
}}}
So: scheme, authority, and then the rest.
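As a quick illustration of that split, here is a minimal sketch using Python's standard `urllib.parse.urlsplit` (not Django's validator) on the RFC example and on the single-label URL from the abstract:

```python
from urllib.parse import urlsplit

# The example URI from RFC 3986, Section 3.
parts = urlsplit("foo://example.com:8042/over/there?name=ferret#nose")
print(parts.scheme)    # scheme
print(parts.netloc)    # authority (host plus port here)
print(parts.hostname)  # host component of the authority
print(parts.port)
print(parts.path)
print(parts.query)
print(parts.fragment)

# A single-label host parses into the same components.
single = urlsplit("https://resource")
print(single.scheme, single.hostname)
```

Parsing says nothing about validity by itself; it only shows which part of the URL the host rule discussed below applies to.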
The issue in Django's URLValidator that I want to address is an
over-specification and a 'selective circumvention of wrongful parsing' when
it comes to the host component of the authority part.
What Django's URLValidator currently does:
host_re = "( FQDN-REGEX | localhost )"
In other words, Django only accepts URLs whose host is an IP address, an
FQDN, or localhost.
This is the 'selective circumvention of wrongful parsing' I mentioned
earlier. Adding "| localhost" makes the URL field "feel" more okay, because
all the obvious localhost URLs now pass. But hosts can be named by much more
than "localhost" and the FQDNs used for "(global) DNS URLs".
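To make the effect of an "FQDN or localhost" rule concrete, here is a simplified, ASCII-only stand-in written for this ticket. It is not Django's actual pattern (the real one is more involved), and the names `LABEL`, `TLD`, `HOST_RE` and `accepts` are made up for illustration:

```python
import re

# One DNS-style label: starts and ends with an alphanumeric, at most 63 chars.
LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"
# Simplified top-level label: a dot, letters only, optional trailing dot.
TLD = r"\.[a-z]{2,63}\.?"

# "FQDN or localhost": at least one dot is required unless the host is
# literally "localhost".
HOST_RE = re.compile(
    "(?:" + LABEL + r"(?:\." + LABEL + ")*" + TLD + "|localhost)",
    re.IGNORECASE,
)


def accepts(host: str) -> bool:
    """Return True if the whole host string matches the simplified rule."""
    return HOST_RE.fullmatch(host) is not None


for host in ("example.com", "localhost", "printer.local", "resource", "intranet"):
    print(host, accepts(host))
```

Under this simplified rule, "localhost" is the only single-label host that passes; "resource" and "intranet" are rejected purely because they contain no ".".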
The RFC also acknowledges this. It recommends, rather than requires, a host
syntax that conforms to the DNS syntax.
[https://datatracker.ietf.org/doc/html/rfc3986#section-3.2.2]
{{{
A host identified by a registered name is a sequence of characters
usually intended for lookup within a locally defined host or service
name registry, though the URI's scheme-specific semantics may require
that a specific registry (or fixed name table) be used instead. The
most common name registry mechanism is the Domain Name System (DNS).
A registered name intended for lookup in the DNS uses the syntax
defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].
Such a name consists of a sequence of domain labels separated by ".",
each domain label starting and ending with an alphanumeric character
and possibly also containing "-" characters. The rightmost domain
label of a fully qualified domain name in DNS may be followed by a
single "." and should be if it is necessary to distinguish between
the complete domain name and some local domain.
reg-name = *( unreserved / pct-encoded / sub-delims )
If the URI scheme defines a default for host, then that default
applies when the host subcomponent is undefined or when the
registered name is empty (zero length). For example, the "file" URI
scheme is defined so that no authority, an empty host, and
"localhost" all mean the end-user's machine, whereas the "http"
scheme considers a missing authority or empty host invalid.
This specification does not mandate a particular registered name
lookup technology and therefore does not restrict the syntax of reg-
name beyond what is necessary for interoperability. Instead, it
delegates the issue of registered name syntax conformance to the
operating system of each application performing URI resolution, and
that operating system decides what it will allow for the purpose of
host identification. A URI resolution implementation might use DNS,
host tables, yellow pages, NetInfo, WINS, or any other system for
lookup of registered names. However, a globally scoped naming
system, such as DNS fully qualified domain names, is necessary for
URIs intended to have global scope. URI producers should use names
that conform to the DNS syntax, even when use of DNS is not
immediately apparent, and should limit these names to no more than
255 characters in length.
}}}
What this says, in several ways:
- Local host resolution is completely okay.
- No "." is required: a "sequence of domain labels" has no stated minimum
length, so it can consist of a single label, which has no "." separator.
- Host names that are compatible with that syntax are valid.
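The `reg-name` rule quoted above can be transcribed almost literally into a regular expression. This is a minimal sketch of the RFC 3986 grammar only (the constant and function names are mine), not a proposal for the final pattern:

```python
import re

# RFC 3986: reg-name = *( unreserved / pct-encoded / sub-delims )
UNRESERVED = r"[A-Za-z0-9\-._~]"
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
SUB_DELIMS = r"[!$&'()*+,;=]"

REG_NAME = re.compile(
    "(?:" + UNRESERVED + "|" + PCT_ENCODED + "|" + SUB_DELIMS + ")*"
)


def is_reg_name(host: str) -> bool:
    """Return True if host matches the generic reg-name grammar."""
    return REG_NAME.fullmatch(host) is not None


for host in ("resource", "example.com", "my%20host", "bad host", "a/b"):
    print(repr(host), is_reg_name(host))
```

The generic grammar places no requirement on dots, so a single label such as "resource" is a syntactically valid registered name. It also accepts the empty string, which the RFC leaves to the scheme to allow or reject.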
[RFC 6762 Multicast DNS # Section 3]
{{{
It is unimportant whether a name ending with ".local." occurred
because the user explicitly typed in a fully qualified domain name
ending in ".local.", or because the user entered an unqualified
domain name and the host software appended the suffix ".local."
because that suffix appears in the user's search list.
}}}
This clearly states that a user can describe a resource by an unqualified
name, with the implication that the .local suffix can be assumed when the
name is not fully qualified. A URL, which is what the user would be
referencing, should therefore be able to deal with more non-FQDN hosts than
just "localhost". This is in the context of Multicast DNS, which seems close
enough to be relevant when talking about URLs, since the URL RFC is written
so closely around DNS.
[RFC 3986 URI/URL # 1.1]
{{{
URIs that
identify in relation to the end-user's local context should only be
used when the context itself is a defining aspect of the resource,
such as when an on-line help manual refers to a file on the end-
user's file system (e.g., "file:///etc/hosts").
}}}
This clearly states that URIs are valid even if they only 'make sense' in
an end user's local context.
Restricting Django URLs to fully qualified domain names and IPs (with
localhost as the lone, inconsistent exception) therefore contradicts that
notion.
== What I am proposing:
Fully allow URLs as per RFC 3986, Section 3.2.2, with a regex solution that
covers localhost (and whatever else is possible) instead of a hardcoded
"magic number" that solves 80% of the cases.
This is to be committed to the Django repository via a pull request. My
earlier pull request is more of a starting point for discussion.
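As a starting point for that discussion, here is a rough sketch of a host check that follows the DNS label syntax quoted from RFC 3986 (labels start and end with an alphanumeric, may contain "-", optional trailing "."), without requiring a dot. It is ASCII-only, the names are hypothetical, and it is not the content of my pull request:

```python
import re

# One DNS label: alphanumeric at both ends, "-" allowed inside, max 63 chars.
LABEL = r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?"

# One or more labels separated by ".", with an optional trailing ".".
DNS_NAME = re.compile(LABEL + r"(?:\." + LABEL + r")*\.?", re.IGNORECASE)


def is_dns_style_host(host: str) -> bool:
    """Accept single-label and multi-label names of at most 255 characters."""
    return len(host) <= 255 and DNS_NAME.fullmatch(host) is not None


for host in ("resource", "localhost", "printer.local.", "example.com", "-bad", "a..b"):
    print(host, is_dns_style_host(host))
```

With a rule of this shape, "localhost" needs no special case: it passes for the same reason "resource" does.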
== Why this is necessary & useful:
Single-label URLs might be used
- in intranet situations
- for URLs that represent services / schemes that do not comply with FQDN
naming conventions
- for local testing (local DNS resolution that is not based on FQDNs)
- with mDNS [RFC 6762] solutions operating under the .local TLD (which, per
that RFC, can be omitted in a local context)
Besides, the Django validator is named URLValidator, not
FQDN_IP_LOCALHOST_URLValidator.
== Further notes:
I already submitted a pull request, which probably isn't mature enough,
given that I did not even check which tests would break.
But there was one test that should not have broken:
FAIL: test_urlfield_clean_invalid
(forms_tests.field_tests.test_urlfield.URLFieldTest.test_urlfield_clean_invalid)
[<object object at 0x000001C1038C1760>] (value='foo')
The URL "foo" should not be valid, even with my small change of replacing
'localhost' with hostname_re.
It feels like some (- -) are missing, but I did not check; I focused on
providing a more solid ticket first.
So, if I am not mistaken, there is another issue besides what I propose.
It seems that limiting hosts to FQDNs was what caused input with a missing
URI scheme to be rejected by the validator, not a correct validation of the
URI scheme itself.
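If that reading is right, the scheme check would have to stand on its own once single-label hosts are allowed. A minimal sketch of such an explicit check, using the standard library rather than Django's code (the function name is made up):

```python
from urllib.parse import urlsplit


def has_scheme_and_host(value: str) -> bool:
    """Require both a scheme and a host, independent of the host's shape."""
    parts = urlsplit(value)
    return bool(parts.scheme) and bool(parts.hostname)


for value in ("foo", "https://resource", "https://"):
    print(repr(value), has_scheme_and_host(value))
```

Here "foo" is rejected because it has no scheme, not because its host lacks a ".".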
== PS
It's kind of late; I might polish this ticket tomorrow. If you feel like
I'm drunk or disorganized, it's just my brain screaming for relief.
Sorry.
--
Ticket URL: <https://code.djangoproject.com/ticket/36131>