You've entered a special hell. It is dark and scary. You are likely to be
eaten by a grue.
The world is an awful place. Hostnames, doubly so. A big part of this is
due to how MSFT originally implemented their resolver code, although
arguably it can affect non-MSFT platforms as well, depending on the name
Recall that DNS labels are full 8-bit; however, for practical purposes
(read: compatibility), it's best to treat them as 7-bit ASCII. This is
somewhat touched upon in 1034 ("By convention, domain names can be stored
with arbitrary case ...") and in 1123 ("The DNS defines domain name syntax
very generally -- a string of labels each containing up to 63 8-bit
Terminology-wise, let's call those "labels". A series of labels,
terminated by the empty label, makes up a domain name. One type of domain
name is the host name (cf. 1034, "For hosts, the mapping depends on the
existing syntax for host names which is a subset of the usual text
representation for domain names"), which corresponds to the A domain
record type (or AAAA, as later modified by the IPv6 specs).
OK, so we're clear so far? Recap is:
label = 0-63 octets
domain name = a series of labels, terminated by an empty label, not to
exceed 255 octets (counting label lengths as well)
host name = a subset of domain names that, in DNS, corresponds to
A/AAAA records
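
To make the recap concrete, here is a minimal sketch of the two length rules. The helper names (split_labels, wire_length, is_domain_name) are made up for illustration, and the one-character-per-octet encoding is an assumption, not something the RFCs prescribe for text input.

```python
from typing import List

MAX_LABEL_OCTETS = 63    # per-label limit from the recap
MAX_NAME_OCTETS = 255    # whole-name limit, length octets included


def split_labels(name: str) -> List[bytes]:
    """Split a dotted name into labels; one trailing dot (the root) is dropped."""
    if name.endswith("."):
        name = name[:-1]
    if not name:
        return []  # the root name: only the empty label
    # Assumption: one character == one octet (latin-1), to model 8-bit labels.
    return [part.encode("latin-1") for part in name.split(".")]


def wire_length(labels: List[bytes]) -> int:
    """One length octet per label, plus its octets, plus the empty root label."""
    return sum(1 + len(label) for label in labels) + 1


def is_domain_name(name: str) -> bool:
    """Check only the length rules; no character restrictions are applied."""
    labels = split_labels(name)
    # The empty label may only appear as the terminator, never in the middle.
    if any(not 1 <= len(label) <= MAX_LABEL_OCTETS for label in labels):
        return False
    return wire_length(labels) <= MAX_NAME_OCTETS
```

Note that nothing here looks at which characters are used: at this layer, "_sip._tcp.example.com" is as good a domain name as any other.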
Now let's get messier still. 1034 introduces the "Preferred Name
Syntax", which is a recommendation for how to encode names. For example,
it suggests that every label start with a letter. This avoids ambiguity
when parsing IPs: if labels could be all numeric (10.0.0.1), it would be
unclear whether to parse the string as a host name or as an IP address.
However, 1123, Section 2.1, relaxed this to allow the first character to
be a digit, on the presumption that all TLDs would be alphabetic.
This latter point wasn't enshrined anywhere, as far as I've been able to
tell, but was practiced by the set of gTLDs at the time and continues to
be practiced by ICANN (thus far).
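
The difference between the two rules can be sketched with a pair of regular expressions. This is an illustration under assumptions, not a reference validator: the pattern and function names are invented here, and the final check only rejects an all-numeric last label, which is a looser reading of the TLD presumption than "alphabetic".

```python
import re

# 1034 preferred name syntax: letter first, letter or digit last,
# letters, digits, and hyphens in between.
RFC1034_LABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")
# 1123 relaxation: the first character may also be a digit.
RFC1123_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def _labels(name):
    """Split on dots, ignoring a single trailing dot (the root)."""
    if name.endswith("."):
        name = name[:-1]
    return name.split(".")


def labels_match(name, pattern):
    """True if every label is at most 63 characters and fully matches pattern."""
    return all(len(label) <= 63 and pattern.fullmatch(label)
               for label in _labels(name))


def is_rfc1123_host_name(name):
    """1123-style labels, plus the presumption that the last label is not numeric."""
    if not labels_match(name, RFC1123_LABEL):
        return False
    # Assumption: "not all digits" stands in for the alphabetic-TLD presumption.
    return not _labels(name)[-1].isdigit()
```

With these, "10.0.0.1" fails the 1034 rule outright, passes the relaxed 1123 label rule, and is only kept out of the host name space by the check on the last label. "3com.com" is the opposite case: rejected by 1034, fine under 1123.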
So, now, the question is, where do the '_' characters come from?
1) The URI spec (RFC 2396) permitted them because it didn't couple a URL
to the underlying name resolution system (DNS), but instead permitted a
variety of name and name resolution schemes. The ABNF from this spec
diverged from 1123, and 3986 tried to bring alignment again, but the
'damage' of permissiveness was done.
2) Microsoft's host resolution API supported a variety of name types
(DNS, NetBIOS, WINS, etc.) and looked the incoming string up against a
variety of name resolution services. Their DNS resolver
adhered to the '8 bit is good bit' and '7 bit ASCII is good', and thus let
Further, it's important to consider that names containing _ ARE valid
(domain) names, and ARE valid (URL host) names, even if they're not valid
(DNS host) names.
Consider, for example, SRV names.
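
A small sketch of that three-way split, using Python's urllib.parse as one example of a URL parser that passes underscores through. The two predicates are hypothetical helpers written for this illustration; they assume ASCII input (so characters equal octets) and use the 1123-style letter/digit/hyphen rule for host names.

```python
import re
from urllib.parse import urlsplit

# Letters, digits, hyphens; no hyphen at either end (1123-style label).
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def _labels(name):
    """Split on dots, ignoring a single trailing dot (the root)."""
    if name.endswith("."):
        name = name[:-1]
    return name.split(".")


def is_domain_name(name):
    """Length rules only: 1-63 octets per label, 255 for the whole name."""
    labels = _labels(name)
    fits = all(1 <= len(label) <= 63 for label in labels)
    return fits and sum(1 + len(label) for label in labels) + 1 <= 255


def is_dns_host_name(name):
    """A domain name whose labels also satisfy the letter/digit/hyphen rule."""
    return is_domain_name(name) and all(
        LDH_LABEL.fullmatch(label) for label in _labels(name))


srv = "_sip._tcp.example.com"  # SRV-style owner name
url_host = urlsplit("http://foo_bar.example.com/index.html").hostname

print(is_domain_name(srv), is_dns_host_name(srv))
print(url_host, is_dns_host_name(url_host))
```

The SRV name is a valid domain name but not a host name, and the URL parser hands back the underscore host without complaint even though the host name check rejects it.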
You hate everything yet? Because I sure do.
I captured some of these thoughts in
just because no
browser I've looked at 'does the right thing' and rejects underscores.
I mention all of this to say that I actually find it 'not clear cut' as to
what's expected, and have spent several day-long dives into specs and
other implementations to see if there's any common consistency, especially
. On a pragmatic level, I'd like to be a
hard-liner, with there being one clear interpretation, but in the real world, I
can't find anyone who consistently followed or implemented that guidance.