First, a little history: the provision of random numbers in
programming language standard libraries is typified by the rand() and
srand() functions present in C. rand() returns an unsigned integer from
0 to RAND_MAX and srand sets the global seed to a provided unsigned
integer. Traditionally these were implemented by a simple generator with
a single integer of internal state; an attacker who obtains a few (maybe
as few as one!) outputs from rand() can generally figure out the entire
state and, thus, predict the entire sequence. Users who don't want the
same sequence on every run of the program are encouraged to seed the RNG
from the system clock, as the initial state of the seed isn't
particularly specified, and users who do want a repeatable
sequence are able to specify a known seed - but as the algorithm used
isn't specified, the scope of that repeatability isn't very clearly
defined; repeated runs of the same compiled binary can probably be
assumed to produce the same results (although dynamic linking to libc
might even put that in jeopardy) but the sequences produced by different
C compilers or on different operating systems or architectures may vary.
Uses for random numbers are varied, and can include:
- Things where an attacker guessing future outputs from past outputs, or
being able to influence future outputs in some way, would be a disaster -
eg, picking short-term session keys, choosing hard-to-guess unique IDs,
or making long-term cryptographic keys.
- Statistical simulations, where you'll usually want random numbers with
good statistical properties (known distributions), and may or may not
need the ability to repeat a "run" with the same sequence.
- Testing applications such as property-based testing or fuzzing where
you want to apply a wide variety of inputs to a component under test,
and check that its output/behaviour are valid in all cases.
Repeatability may be desirable when a test run finds a problem, so that
a proposed fix for that problem can then be tested with exactly the same
sequence of inputs to confirm whether it worked or not, so it's not
uncommon to either always run with the same seed or to pick a random
seed and log it so that a run can be "re-run" by providing the logged seed.
- Computer games, where unpredictability is needed to model the game
world. Reproducibility might be desired if the random source is used
to drive a procedurally generated world that needs to look the same
every time the player returns to it, but also, unpredictability may be
desired when modelling the reactions of non-player characters and other
such events that are meant to surprise the player; multiplayer online
games have sometimes suffered from players being able to manipulate RNGs
to their benefit (eg, Minecraft - not just online gambling!). So
computer games may need either secure or repeatable random numbers,
depending on context!
- Toy programs, that demonstrate a principle but aren't necessarily
meant for "real-world" use, such as filling an array with random numbers
to demonstrate a sorting algorithm, making simple "guess the number!
Higher or lower!" games as a programming exercise, etc. I don't want to
underestimate this category - educational examples are important - but
it is a niche.
rand()/srand() provides a rather poor service to users - the sequences
are quite likely to be predictable from previous outputs, and the effect
of re-using a seed to repeat the sequence is also poorly specified.
Also, the fact that rand() returns a number from 0 to RAND_MAX has led
to the convention of using "rand() % max" to obtain an integer in the
range from 0..max-1, which won't produce a uniform distribution unless
max evenly divides RAND_MAX+1, the number of possible outputs (imagine
RAND_MAX is 2, so rand() can return 0, 1, or 2, and max is 2, so we want
0 or 1; the result will be 0 if rand() returns 0 or 2, and 1 if rand()
returns 1, so 0 is twice as likely as 1). However, this kind of
interface is
traditional and it's become what people expect from a default random
number source in a standard library - but I think this is unfortunate
and we should go back to the actual needs of random number users rather
than copying this pattern. I'm suspicious of a poorly specified "default
random number generator" - whose needs is it adequate for? Just the
writers of toy programs in education? But their needs are just as well
served by any specific RNG they could pick out of a hat.
The search for unpredictable "secure" random numbers is an interesting
one, because it presents a technical challenge in that computers are
largely deterministic devices. You can get dedicated hardware that
digitises quantum noise sources, but such things add cost and
potentially add vulnerabilities: such hardware is often hard to inspect
and makes an attractive target for attackers who can plant somebody
inside the chip design house or even fabrication plant. Another popular
source of nondeterminism is external events such as the contents or
arrival times of network packets, and detectable timing differences in
the behaviour of less-deterministic (eg, temperature-sensitive) hardware
in the system. Gaining reliable access to all this low-level hardware
timing information tends to be difficult outside of the kernel, so it
has become traditional for operating systems to offer a kernel-based
randomness source; this is usually a secure (ie, hard to predict) PRNG
whose internal state (known as the "pool") is constantly being "fed"
with nondeterministic information from hardware devices as it becomes
available. Outputs from this PRNG are made available via syscalls or
special devices like /dev/random.
Earlier implementations tried to track "how random" the pool was at any
point in time, increasing an entropy estimate as randomness was fed in
from hardware and reducing it as random bytes were pulled out by users.
/dev/random would block on the available entropy, only emitting a byte
every time the entropy counter exceeded eight bits (and reducing it by
eight each time), while /dev/urandom wouldn't block (but would still
reduce the counter). However, over time it was decided that this was a
meaningless distinction in practice, and possibly leaked information,
and opened an avenue for denial of service by using up all the entropy
so that things like TLS connection setups would then stall; so these
days /dev/random and /dev/urandom are either identical, or perhaps
/dev/random blocks only until the kernel entropy pool has first been
filled after boot; the manpages on my Linux system seem to record a
history of thinking on this changing over time, and the promises made by
random(7) and random(4) seem mildly contradictory if I read them correctly!
Which leads us onto the weaknesses of kernel entropy pools - any modern
OS will provide one, so they're largely ubiquitous, but they can't
always be relied upon to be safe. A freshly booted system will have the
pool initialised to some initial state. Modern Linux will, during shutdown,
save a poolful of randomness into a file that is read into the pool in
early boot to conserve entropy across reboots, but LiveCDs or virtual
machines whose disk state is reset will not have this luxury, and will
be reliant on entropy from some external source to provide
non-predictable randomness. Virtual machines (and a lot of software is
run in VMs now, eg in cloud providers) will often have limited access to
hardware-level timing randomness, and limited access to accurate clocks
to measure it with, so even kernel random pools may be predictable (or
manipulable) in some contexts. However, they're often the best we've
got, and as application developers, we have little control over (or
ability to detect) the circumstances under which they're not to be
trusted, so we tend to have to just use them and hope that whoever hosts
the application has made the correct security decisions
in doing so.
Access to in-kernel entropy pools can be slow, as even without the issue
of blocking on available entropy, it's still a syscall and can involve
gaining exclusive locks inside the kernel to access the entropy pool, a
limitation on highly multicore systems. As such, applications may wish
to operate their own entropy pools, perhaps even dedicated to each
thread to avoid locking and fighting over cache lines, initialising them
from the kernel randomness source at startup and periodically injecting
some entropy from the kernel. Between such kernel entropy injections,
this pool operates as a pseudorandom number generator - whose output is
repeatable if it's reset to a known seed! But it remains safe to use,
because it's only ever seeded from secure random sources, and the
algorithms chosen are deemed not to allow an attacker with access to
past outputs to predict future outputs - even an attacker who can
request outputs at will (imagine an algorithm with a small weakness such
that the next output has a known bias if the previous output falls into
a certain range; such an attacker could request outputs until they see
one in that range, then sit back and wait for the system they're
attacking to generate a biased output).
As such, there's a number of categories a random source might have, from
the perspective of a userland process:
1) Is it a fully deterministic algorithm, starting from a default or
provided seed?
2) Is it a hopefully-secure external entropy source, eg a kernel entropy
pool or a dedicated hardware randomness-source device accessed from
userland?
3) Is it a hybrid of the two - a fully deterministic type (1) algorithm,
but seeded from an external type (2) entropy source?
And then a separate property which is orthogonal to type (1) but implied
by types (2) and (3):
4) Is its output unpredictable, even by an adversarial user who can read
past outputs and stimulate the generation of outputs on demand?
A type (1) and type (4) algorithm is often known as a "CPRNG"
(cryptographic PRNG). If you just say "PRNG" you probably mean a type
(1) that isn't type (4).
Ideally, cryptographic applications need a type (4) source that's also
type (2) or (3). One might say that type (2) is better than type (3) as
the hybrid approach introduces the risk of vulnerabilities in the
underlying type (1) algorithm, but the in-kernel randomness source of
(2) is probably actually a type (3) system inside the kernel. A type (1)
system can still be suitable for cryptographic applications as long as
it is still type (4) and the attacker can't find the seed - a system
with no hardware randomness sources at all could still be provided an
initial random number by a user, or something like its own serial number
(as long as they're not predictable!), and generate fully deterministic
random numbers that an attacker can't exploit *as long as
property (4) holds reliably*.
Anyway, enough history and background. I have a few open questions for
the group to consider:
1) The source of random bits is distinct from converting those random
bits into useful things from the basic "evenly distributed integers in a
specified range" to exotic things like interesting distributions or
points on a sphere, and the algorithms for doing so aren't necessarily
trivial to get right, so I think it's uncontroversial that there need to
be low-level libraries for PRNGs and kernel or hardware entropy sources,
and high-level libraries to turn their outputs into useful values, with
perhaps only applications that literally require raw random bytes (for
UUIDs or cryptographic keys and so on) not needing any higher-level
processing. Is "I would like a random bytevector of a specified length"
the correct interface between the high and low level, so an RNG source
might be characterised as a function that accepts a natural number and
returns a bytevector of that length? (I suspect so).
1a) I presume we would want to make it easy for users to construct
compliant random sources of their own devising, eg their own RNG
algorithm, and use the same higher-level libraries with it; and also to
construct higher-level distribution libraries that can run from any
compliant random source.
1b) Converting raw bytes into useful distributions, even in such simple
cases as "integers or reals in a range", is surprisingly hard (see Zhu's
recent email) so users shouldn't be encouraged to try and roll their own
- we definitely need libraries. I think providing a rand()-style "Return
a random fixnum in some range" is too low-level and dangerous for users
- they should either get raw random bytes or use a library that provides
problem-domain random data. Which can include things like "pick N random
elements from this list/vector with/without replacement" and "randomly
permute this vector/list" as well as the obvious return-a-number procedures.
2) Applications that want secure randomness shouldn't be accidentally
given an insecure RNG, as that presents a possibly serious security
failure. Putting aside the issue of whether we can truly know that we
have secure randomness, given the system entropy pool initialisation
problems above (we can't do much about those), it seems like it would be
a good
idea if an RNG source can be asked if it's repeatable, or secure
(unpredictable, even to an attacker), or some combination of the two
(one can have an RNG that is entirely deterministic but whose outputs
can't be used to predict future outputs because they don't leak useful
information about the state, or an RNG that is insecure and also
unrepeatable because we don't have a way to manually seed it). How
should this information be exposed? Just predicates on an RNG source -
(random-source-repeatable?) (my type (1) above) and
(random-source-adversarially-unpredictable?) (my type (4) above) - and
users attempting to produce cryptographic keys should check they have an
unrepeatable and unpredictable source? As a convenience, should we
provide (assert-random-source-adversarially-unpredictable) and
(assert-random-source-unrepeatable) procedures that call (error) if the
source doesn't meet the requirements, to encourage users to check? Are
these distinctions too fine, or will stating them separately make users
more aware of the subtleties?
3) Going further, are we right to conflate cryptographic and statistical
uses of RNGs at all - should we have totally distinct libraries for the
two, each more highly specialised? The overlap between the two with a
type (3) hybrid approach might lead to some sharing of code *inside* the
implementation but that's not necessarily exposed to the users.
4) For applications that want repeatable randomness in a portable
manner, you really need to specify an algorithm as well as a seed. For
applications that want to do crypto things, you want to obtain a source
that has the appropriate cryptographic safety properties. Should we
still provide a weakly-specified "default random source", and who's it
for? If just toy programs, is there any benefit to leaving it weakly
specified, or do we expect toy program authors to import an xorshift
library and construct a random number source from (make-xorshift64-rng
(get-current-unix-time)) themselves? I'm nervous about the concept of a
"default" RNG and the implication that, as the default, it might be good
for any particular purpose. Perhaps I'd be happier if we provided a
means to obtain an *arbitrary* RNG rather than a *default* one - a user
who asks for "an arbitrary" RNG is making a more explicit statement
that they don't care what RNG they get, and the implementation is free
to provide whatever they feel is a generally good performance/quality
tradeoff.
5) How much do we want to provide in R7RS-large, and where? Access to
kernel or hardware RNGs is clearly an environments issue, but such
facilities are now so widespread that we might be able to assume they're
as available as memory that can be allocated or an ALU that can do
addition, with maybe a caveat that the RNG is reliably there but might
block arbitrarily or raise an error that needs to be handled at run
time. PRNG algorithms can be portably implemented but may benefit from
platform-specific implementations that use assembly language
bit-twiddling to go faster, and how much do we provide out of the box so
users can rely on them being present, versus letting a thousand portable
libraries bloom? Bearing in mind that cryptographic algorithms are a
continually evolving landscape, with once highly prized algorithms
falling into disrepute and being replaced by faster, better, more
trusted ones; but even despite all my pearl-clutching about default
RNGs, educators may curse us if we don't provide a simple reliable
one-line access to low-grade randomish numbers out of the box.
--
Alaric Snell-Pym (M0KTN née M7KIT)
http://www.snell-pym.org.uk/alaric/