High-Quality Random Numbers

We are concerned here with pseudorandom number generators (RNG’s), in particular those of the highest quality. It turns out to be difficult to find an operational definition of randomness that can be used to measure the quality of a RNG, that is, the degree of independence of the numbers in a given sequence, or to prove that they are indeed independent. The situation for traditional RNG’s (not based on Kolmogorov–Anosov mixing) is well described by Knuth in [1]. The book contains a wealth of information about random number generation, but nothing about where the randomness comes from, or how to measure the quality (randomness) of a generator. With hindsight, it is not surprising that all the widely-used generators described there were later found to have defects (failing tests of randomness and/or giving incorrect results in Monte Carlo (MC) calculations), with the notable exception of RANLUX, which Knuth does mention briefly in the third edition, but without describing its new theoretical basis.

The Need for High Quality

High-level scientific research, like many other domains, has become dependent on Monte Carlo calculations, both in theory and in all phases of experiments. The MC method is used primarily for calculations that are too difficult or even impossible to perform analytically, so our science now depends to a large extent on the random numbers used in extensive MC calculations. But how do we know if those numbers are random enough? In the early days (the 1960’s) the RNG’s were so poor that, even with the very slow computers of that time, their defects were sometimes obvious, and users would have to try a different RNG. When the result looked good, it was assumed to be correct, but we know now that all the generators of that period had serious defects which could give incorrect results that were not easily detected.

As computers got faster and RNG’s got longer periods, the situation evolved quantitatively, but unacceptable results were still occasionally obtained and of course were not published. This lasted until 1992, when the famous paper of Ferrenberg et al. [2] showed that the RNG considered at that time to be the best was giving the wrong answer to a problem in phase transitions, while older RNG’s known to be defective gave the right answer. Since most often we have no independent way to know the right answer, it became clear that empirical testing of RNG’s, at that time the only known way to verify their quality, was not good enough. The particular problem detected by Ferrenberg et al. was soon solved by Martin Lüscher (in [3]), but it became clear that if we were to have confidence in MC calculations, we would need a better way to ensure their quality. Fortunately, the theory of Mixing, outlined below, now offers this possibility.

The experience gained from developing, using and discovering defects in many RNG’s has taught us some lessons, which we summarise here (they are explained in detail in [1]):

  1. The period should be much longer than any sequence that will be used in any one calculation, but a long period is not sufficient to ensure lack of defects.

  2. Empirical testing can demonstrate that a RNG has defects (if it fails a test), but passing any number of empirical tests can never prove the absence of defects.

  3. Making an algorithm more complicated (in particular, combining two or more methods in the same algorithm) may make a better RNG, but it can also make one much worse than a simpler component method alone if the component methods are not statistically independent.

  4. It is better to use a RNG which has been studied, whose defects are known and understood, than one which looks good but whose defects are not understood.

  5. There is no general method to determine how good a RNG must be for a particular MC application. The best way to ensure that a RNG is good enough for a given application is to use one designed to be good enough for all applications.

The Theory of Mixing in Classical Mechanical Systems

It has been known, at least since the time of Poincaré, that classical dynamical systems of sufficient complexity can exhibit chaotic behaviour, and numerous attempts have been made to make use of this “dynamical chaos” to produce random numbers by numerical algorithms which simulate mechanical systems. It turns out to be very difficult to find an approach which produces a practical RNG, fast enough and accurate enough for general MC applications. To our knowledge, only two such attempts have been successful, both based on the same representation and theory of dynamical systems.

This theory grew out of the study of the asymptotic behaviour of classical mechanical systems that have no analytic solutions, developed largely in the Soviet Union around the middle of the twentieth century by Kolmogorov, Rokhlin, Anosov, Arnold, Sinai and others. See, for example, Arnold and Avez [4] for the theory. At the time, these mathematicians were certainly not thinking of RNG’s, but it turns out that their results can be used to produce sequences of random numbers that have some of the properties of the trajectories of the dynamical systems (See Savvidy [5] and Savvidy [6]). The property of interest here is called Mixing, and is usually associated with the names Kolmogorov and Anosov. Mixing is a well-defined concept in the theory, and will be seen to correspond quite exactly to what is usually called independence or randomness.

The representation of dynamical systems appropriate for our purposes is the following:

$$\begin{aligned} x(i+1) = A \times x(i) \mod 1, \end{aligned}$$
(1)

where x(i) is the N-vector of real numbers specifying completely the state of the system at time i, and A is a (constant) \(N \times N\) matrix which can be thought of as representing the numerical solution to the equations of motion. The N-dimensional state space is a unit hypercube which, because of the modulo operation, becomes topologically equivalent to an N-dimensional torus by identifying opposite faces. The vectors x represent points along the continuous trajectory of the abstract dynamical system in N-dimensional phase space.

All the elements of the matrix A are integers, and the determinant of A must be one. This ensures that A is invertible and the elements of \(A^{-1}\) are also integers. The theory is intended for high- (but finite) dimensional systems (\(1 \ll N < \infty\)). In practice N will be between 8 and a few hundred.
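
The update of Eq. (1) can be simulated directly. The following sketch uses the \(2\times 2\) “Arnold cat map” purely as a minimal example of an integer matrix with determinant one; it is not one of the production generators discussed below, which use much larger N.

```cpp
// Sketch: direct simulation of Eq. (1), x(i+1) = A * x(i) mod 1.
// The 2x2 Arnold cat map stands in for a realistic matrix (N >= 8 in practice).
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<std::vector<long>>;

// One iteration of the dynamical system: y = A*x reduced to the unit torus.
Vec step(const Mat& A, const Vec& x) {
    Vec y(x.size(), 0.0);
    for (size_t i = 0; i < x.size(); ++i) {
        double s = 0.0;
        for (size_t j = 0; j < x.size(); ++j) s += A[i][j] * x[j];
        y[i] = s - std::floor(s);          // the "mod 1" of Eq. (1)
    }
    return y;
}

int main() {
    Mat A = {{2, 1}, {1, 1}};              // integer entries, determinant 1
    Vec x = {0.3, 0.7};                    // the seed x(0)
    for (int i = 1; i <= 5; ++i) {
        x = step(A, x);
        std::printf("x(%d) = (%.6f, %.6f)\n", i, x[0], x[1]);
    }
    return 0;
}
```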

Mixing and the Ergodic Hierarchy

Let x(i) and x(j) represent the state of the dynamical system at two times i and j. Furthermore, let \(v_1\) and \(v_2\) be any two subspaces of the entire allowed space of points x, with measures (volumes relative to the total allowed volume), respectively \(\mu (v_1)\) and \(\mu (v_2)\). Then the dynamical system is said to be a 1-system (with 1-mixing) if

$$\begin{aligned} P(x(i) \in v_1 ) = \mu (v_1) \end{aligned}$$

and a 2-system (with 2-mixing) if

$$\begin{aligned} P(x(i) \in v_1 \;\text {and}\; x(j) \in v_2 ) = \mu (v_1)\, \mu (v_2), \end{aligned}$$

for all i and j sufficiently far apart, and for all subspaces \(v_i\). Similarly, an n-system can be defined for all positive integer values n. We define a zero-system as having the ergodic property (coverage), namely that the state of the system will asymptotically come arbitrarily close to any point in the state space.

Putting all this together, we have that asymptotically:

  • A system with zero-mixing covers the entire state space.

  • A system with one-mixing covers uniformly.

  • A system with two-mixing has \(2 \times 2\) independence of points.

  • A system with three-mixing has \(3 \times 3\) independence.

  • etc.

Finally, a system with n-mixing for arbitrarily large values of n is said to have K-mixing and is a K-system. It is a result of the theory that the degrees of mixing form a hierarchy [7]; that is, a system which has n-mixing for any value of n also has i-mixing for all \(i<n\). There are additional systems, in particular Anosov C-systems and Bernoulli B-systems, which are also K-systems, but K-systems are sufficient for our purposes.

Now the theory tells us that a dynamical system represented by Eq. (1) will be a K-system if the matrix A has determinant equal to one and eigenvalues \(\lambda _i\), all of which have moduli \(|\lambda _i | \ne 1\).

The Eigenvalues of A

We have seen that to obtain K-mixing, none of the eigenvalues of A should lie on the unit circle in the complex plane. In fact, in order to obtain sufficient mixing as early as possible (recall that complete mixing is only an asymptotic property) it is desirable to have the eigenvalues as far as possible from the unit circle.

An important measure of this distance is the Kolmogorov entropy h:

$$\begin{aligned} h = \sum _{k:|\lambda _k | > 1} \ln |\lambda _k|, \end{aligned}$$

where the sum is taken over all eigenvalues with absolute values greater than 1. (Since the determinant of A is one, the logarithms of the moduli of all the eigenvalues sum to zero, so h is also equal to minus the corresponding sum over the eigenvalues with absolute values less than 1.) As its name implies, h is analogous to thermodynamic entropy: it measures the disorder in the system, and it must be positive for an asymptotically chaotic system (this follows from the definition, given that all \(|\lambda _i | \ne 1\)).
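
As a concrete example, h can be evaluated in closed form for the toy \(2\times 2\) cat map used in the sketch above (again, not a production generator): its eigenvalues are \((3\pm \sqrt{5})/2\), giving \(h \approx 0.96\).

```cpp
// Sketch: Kolmogorov entropy of the toy 2x2 cat map A = [[2,1],[1,1]]
// (trace 3, determinant 1), whose eigenvalues are (3 +- sqrt(5))/2.
#include <cmath>
#include <cstdio>

int main() {
    double t = 3.0;                                       // trace of A
    double lam_max = (t + std::sqrt(t * t - 4.0)) / 2.0;  // ~2.618, |lambda| > 1
    double lam_min = 1.0 / lam_max;                       // ~0.382, |lambda| < 1
    double h = std::log(lam_max);      // sum of ln|lambda_k| over |lambda_k| > 1
    std::printf("h = %.4f\n", h);      // ~0.9624, positive as required
    // Since det A = 1, ln(lam_max) + ln(lam_min) = 0, so the sum over the
    // eigenvalues inside the unit circle gives -h, as noted in the text.
    std::printf("check: %.4f\n", -std::log(lam_min));
    return 0;
}
```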

Another important measure is the Lyapunov exponent, defined in the following section.

The Divergence of Nearby Trajectories

The mechanism by which mixing occurs in K-systems can be observed “experimentally” by noting the behaviour of two trajectories which start at nearby points in state space. Using the same matrix A, let us start the generator from two different nearby points a(0) and b(0), separated by a very small distance \(\delta (a(0),b(0))\). The distance \(\delta\) is defined by

$$\begin{aligned} \begin{aligned} \delta (a,b)&= \max _{\kappa }d_\kappa , \\ d_\kappa&=\min \left\{ |a_\kappa -b_\kappa | \, , \, 1-|a_\kappa -b_\kappa |\right\} , \end{aligned} \end{aligned}$$
(2)

where \(\kappa\) runs over the N components of the vectors indicated. Note that the distance defined in this way is a proper distance measure and has the property \(0\le \delta \le 1/2\). Now we use Eq. (1) on a(0) and b(0) to produce a(1) and b(1), and calculate \(\delta (a(1),b(1))\). We continue this process to produce two series of points a(i) and b(i) and a set of distances \(\delta (i)\), for \(i = 1,2,3, \ldots\), until the \(\delta (i)\) reach a plateau at the value expected for truly random points, which according to Lüscher is \(\delta = 12/25\) for RANLUX. It is a well-known result of the theory that, if A represents a K-system, the distances \(\delta (i)\) will diverge exponentially with i, so that the points \(\delta (i)\) vs. i should lie on a straight line when plotted on a logarithmic scale. The inevitable scatter of the points can be reduced by averaging the \(\delta (i)\) over different starting pairs (a(0), b(0)) (each b(0) must of course be at the same very small distance from its a(0), but in a different direction).

The rate of divergence of nearby trajectories (\(\nu\), the Lyapunov exponent) is equal to the logarithm of the modulus of the largest eigenvalue:

$$\begin{aligned} \nu = \ln |\lambda |_\mathrm{max}, \end{aligned}$$

which, for a K-system, should be the slope of the straight line described above.
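
This divergence is easy to observe numerically. The sketch below repeats the nearby-trajectory experiment with the toy cat map of the earlier sketches: \(\ln \delta (i)\) should grow linearly with slope \(\nu = \ln ((3+\sqrt{5})/2) \approx 0.96\) until \(\delta\) saturates. For N = 2 the expected plateau for independent points is \(\delta = 1/3\), the analogue of Lüscher’s 12/25 for N = 24 (the mean of the maximum of N independent wrapped distances is \(\tfrac{1}{2}\,N/(N+1)\)).

```cpp
// Sketch: divergence of nearby trajectories for the toy cat map,
// using the torus distance of Eq. (2).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// The distance of Eq. (2): max over components of min(|a-b|, 1-|a-b|).
double delta(const Vec& a, const Vec& b) {
    double d = 0.0;
    for (size_t k = 0; k < a.size(); ++k) {
        double dk = std::fabs(a[k] - b[k]);
        d = std::max(d, std::min(dk, 1.0 - dk));
    }
    return d;
}

Vec step(const Vec& x) {                       // cat-map update, mod 1
    Vec y = {2 * x[0] + x[1], x[0] + x[1]};
    for (double& v : y) v -= std::floor(v);
    return y;
}

int main() {
    Vec a = {0.3, 0.7};
    Vec b = {0.3 + 1e-9, 0.7};                 // very small initial separation
    for (int i = 1; i <= 25; ++i) {            // plateau reached after ~20 steps
        a = step(a);
        b = step(b);
        std::printf("%2d  %.3e\n", i, delta(a, b));
    }
    return 0;
}
```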

Decimation

The asymptotic independence of a and b is guaranteed by the mixing (if it is a K-system), but as long as \(\delta (i)\) remains small, a(i) and b(i) are clearly correlated. The point where the straight line of divergence of nearby trajectories reaches the plateau of constant \(\delta\) indicates the number of iterations m required to make the K-system “sufficiently asymptotic”, in the sense that the points a and b generated on the following iteration are on average as far apart as independent points would be. We may call this criterion the “2-mixing criterion”, since it apparently assures that 2\(\times\)2 correlations due to nearby trajectories will be negligible. The question of whether this criterion is sufficient to eliminate higher-order correlations is important and will be discussed below in connection with RANLUX++.

Some plots of divergence for real RNG’s are given below. When the K-system is used to generate random numbers, then in order to eliminate the correlations due to nearby trajectories, after one vector \(a_i\) is delivered to the user, the following m vectors \(a_{i+1},a_{i+2},\ldots, a_{i+m}\) must be discarded before the next vector \(a_{i+m+1}\) is delivered (decimation). It has become conventional to characterise the degree of decimation by the integer p, defined such that after a vector of N random numbers is delivered to the user, \(p-N\) numbers are skipped before the next N numbers are delivered [3, 8].
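
In code, decimation is just a skip loop around the basic update. A minimal sketch, assuming a generic update function of the form of Eq. (1) (the class name is ours, for illustration):

```cpp
// Sketch of decimation as a wrapper: deliver one state vector, then run the
// underlying update m more times, discarding the results.
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

using Vec = std::vector<double>;

template <typename Step>
class Decimated {
public:
    Decimated(Step step, Vec seed, int m)
        : step_(std::move(step)), x_(std::move(seed)), m_(m) {}

    // One user-visible call: advance once, deliver, then skip m vectors.
    Vec next() {
        x_ = step_(x_);
        Vec out = x_;                                 // delivered to the user
        for (int i = 0; i < m_; ++i) x_ = step_(x_);  // decimated (discarded)
        return out;
    }

private:
    Step step_;
    Vec  x_;
    int  m_;  // skipping m vectors of N numbers corresponds to p = (m+1)*N
};

int main() {
    auto cat = [](const Vec& x) {                     // toy cat-map update
        Vec y = {2 * x[0] + x[1], x[0] + x[1]};
        for (double& v : y) v -= std::floor(v);
        return y;
    };
    Decimated<decltype(cat)> gen(cat, Vec{0.3, 0.7}, 16);
    Vec v = gen.next();
    std::printf("%.6f %.6f\n", v[0], v[1]);
    return 0;
}
```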

From the Theory to the Discrete RNG

Equation (1) will be used directly to generate random numbers, where x(0) will be the N-dimensional seed, and each successive vector x(i) will produce N random numbers. However, in the theory, x is a vector of real numbers, continuous along the unit line. The computer implementation must approximate the real line by discrete rational numbers, which is valid provided the finite period is sufficiently long, and the rational numbers are sufficiently dense, so that the effects of the discreteness are not detectable. Thus the computer implementation has access only to a rational sublattice of the continuous state space, and we must confirm that the discrete approximation preserves the mixing properties of the continuous K-system. Fortunately, the divergence of nearby trajectories offers this possibility, since a theorem usually attributed to Pesin [7] states that a dynamical system has positive Kolmogorov entropy and is, therefore, K-mixing if and only if nearby trajectories diverge exponentially. Then we can expect the discrete system to be K-mixing only if the same condition is satisfied. This is discussed in more detail below in connection with Extended MIXMAX.

The Period

The most obvious difference between continuous and discrete systems is that continuous systems can have infinitely long trajectories, whereas a computer RNG must always have a finite period. This means that a trajectory ‘eats up’ state space as it proceeds, since it can never return to a state it has previously occupied without ending its period. The fact that the available state space becomes progressively smaller is necessarily a defect in the RNG, but this defect is easily seen to be undetectable if the period is long enough. According to Maclaren [9], when the period is P, using more than \(P^{2/3}\) numbers from the sequence “leads to excessive uniformity compared to a true random sequence.” This is compatible with the RNG folklore which sets the usable limit at \(\sqrt{P}\).
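
As an illustration (anticipating the RANLUX period quoted below), a period of \(P \approx 5 \times 10^{171}\) gives

$$\begin{aligned} P^{2/3} \approx 3 \times 10^{114}, \qquad \sqrt{P} \approx 7 \times 10^{85}, \end{aligned}$$

both of which are vastly more numbers than any conceivable calculation will ever consume.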

Summary of Criteria for Highest Quality

We will consider as candidates for highest quality RNG’s only those based on the theory of chaos in classical mechanical systems, for the simple reason that we know of no other class of systems which can offer (even in some limit which may or may not be attainable) the uniform distribution and lack of correlations guaranteed by K-mixing. To be precise, our criteria are the following:

  1. The matrix A must have eigenvalues away from the unit circle, and determinant equal to 1.

  2. The Kolmogorov entropy must be positive (follows from the above).

  3. The discrete algorithm must have points sufficiently dense to accurately represent the continuous system.

  4. The divergence of nearby trajectories must be exponential (follows from the above).

  5. The decimation must be sufficient to assure that the average distance between successive vectors is the expected distance between independent points.

  6. The period must be long enough, \({>}10^{100}\).

  7. Some practical criteria: double precision available, portable, repeatable, independent sequences possible.

The High-Quality RNG’s: 1. RANLUX

The first widely-used RNG to offer reliably random numbers was Martin Lüscher’s RANLUX, published in 1994. He considered the RNG proposed by Marsaglia and Zaman [10], installed many years ago at CERN with the name RCARRY, and now known variously as RCARRY, SWB (subtract with borrow) or AWC (add with carry). He discovered that, if the carry bit was neglected (see below), this RNG had a structure that could be represented by Eq. (1), and was, therefore, possibly related to a K-system.

The SWB (RCARRY) Algorithm

SWB operates on an internal vector of 24 numbers, each with a 24-bit mantissa. Each call to the generator produces one random number by a single arithmetic operation (addition or subtraction) on two of the numbers in the internal vector; this new random number then replaces one of the entries in the internal vector. Lüscher realised that if SWB is called 24 times in succession, it generates a 24-vector which is related to the starting 24-vector by Eq. (1), and he could determine the matrix A which would reproduce almost the same sequence as SWB. The only difference is due to the “carry bit”, which is necessary for attaining a long period but affects only the least significant bit, so it does not significantly alter the mixing properties of the generator. The carry bit is described in detail both in the paper of Marsaglia and Zaman [10] and in that of Lüscher [3].
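
A minimal integer-arithmetic sketch of the SWB step (base \(2^{24}\), lags 24 and 10, the parameters underlying RANLUX) may make this concrete. Real implementations add a careful seeding procedure; the crude fill below is for illustration only.

```cpp
// Sketch of the SWB (RCARRY) step: x(n) = x(n-10) - x(n-24) - c (mod 2^24),
// where the borrow bit c is 0 or 1. A circular buffer holds the 24 numbers.
#include <cstdint>
#include <cstdio>

struct Rcarry {
    uint32_t s[24];       // the internal vector of 24-bit numbers
    int      pos = 0;     // circular index of the oldest entry, x(n-24)
    uint32_t carry = 0;   // the borrow bit

    uint32_t next() {
        int i24 = pos;                    // x(n-24)
        int j24 = (pos + 14) % 24;        // x(n-10): 24 - 10 = 14 ahead of it
        int64_t d = (int64_t)s[j24] - (int64_t)s[i24] - (int64_t)carry;
        if (d < 0) { d += 1 << 24; carry = 1; } else { carry = 0; }
        s[i24] = (uint32_t)d;             // new number replaces the oldest
        pos = (pos + 1) % 24;
        return (uint32_t)d;               // a 24-bit random integer
    }
};

int main() {
    Rcarry g;
    for (uint32_t i = 0; i < 24; ++i)
        g.s[i] = (i * 2654435761u) & 0xFFFFFFu;   // arbitrary nonzero fill
    for (int i = 0; i < 5; ++i)
        std::printf("%06x\n", g.next());
    return 0;
}
```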

SWB: The Lyapunov Exponent and Decimation

The eigenvalues of this “equivalent matrix” were seen to satisfy the conditions for a K-system, but the RNG nevertheless failed several standard tests of randomness. Plotting the evolution of the separation of nearby trajectories immediately indicated the reason for the failure and how to fix it: the Lyapunov exponent was positive, as required, but not large, so considerable decimation would be needed to attain full mixing.

Note that the reason for needing decimation has nothing to do with the carry bit, or even with the discrete approximation to the continuous system. It would in general be needed even for an ideal K-system with continuous \(a_i\), simply because mixing is only an asymptotic property of K-systems.

Fig. 1 Divergence of nearby trajectories for RANLUX. The blue line is a straight-line fit to all points

The situation is shown in Fig. 1. It can be seen first of all that the first 15 points lie precisely on a straight line, indicating exponential divergence; RANLUX, therefore, has indeed the mixing properties of a K-system. However, we also see that, after generating 24 numbers, it is necessary to throw away (decimate) about \(16 \times 24\) numbers before delivering the next 24 numbers to the user, in order that the average separation between nearby trajectories equals the expected separation between independent points in phase space (the 2-mixing criterion). Fortunately, throwing away numbers is typically about twice as fast as delivering them, so the generation time is increased by only about a factor of 7, which is still affordable for many applications, since the basic algorithm is very fast. However, for those users who cannot afford the extra time, RANLUX offers different degrees of decimation, known as “luxury levels”. The lowest luxury levels provide very little or no decimation and are very fast; higher levels are slower but more reliable; and the highest level offers sufficient decimation to eliminate the obvious correlation due to nearby trajectories remaining too close. Whether this decimation is sufficient to guarantee the disappearance of all higher-order correlations will be discussed below.

RANLUX: The Spectral Test

The spectral test is not exactly an empirical test, since it cannot be applied to an arbitrary sequence of numbers but requires some knowledge of the RNG algorithm [1]. We could, therefore, call it a semi-empirical test; it can be used on any linear congruential generator (LCG) with a known multiplier. RNG’s in this class have the well-known property that if d-tuples of random numbers are used as coordinates of points in a d-dimensional hypercube, all the points lie on parallel hyperplanes, which are sometimes spaced much more widely than would be the case for independent points. Given the multiplier of the LCG and a value of d, the spectral test provides a figure of merit for the spacing of the hyperplanes. This test is a full-period test and is generally considered to be more powerful than the usual empirical tests.

When the first papers on RANLUX were published, Tezuka et al. [11] had already discovered that the algorithm of Marsaglia and Zaman [10] (the basis for RANLUX) was in fact equivalent to an LCG with an enormous multiplier. This fact was used by Lüscher [3] to apply the spectral test to RANLUX: he showed that RANLUX passes the test at high luxury levels but fails at the lower levels, as expected.

RANLUX: The Period and the Implementation

So RANLUX is simply SWB with decimation, where the amount of decimation is a parameter to be chosen by the user: the luxury level. The definitions of the luxury levels may depend on the implementation, so the user should consult local documentation.

The period of SWB is known exactly and is \(\approx 5 \times 10^{171}\). This is easily seen to be many orders of magnitude more than all the world’s computers could generate in the expected lifetime of the earth, so any defects arising from the finiteness of the state space should remain undetectable.

The original implementation was in Fortran [12], but RANLUX is now used mainly in the implementations in C, available in single precision for 24-bit mantissas, and in double precision for 53-bit mantissas, of which 48 are implemented and the last 5 are zeroes. These are available with documentation from Lüscher’s website ( http://luscher.web.cern.ch/luscher/) and are also installed in several program libraries, including the CERN library and the GNU Scientific Library. It is part of the C++ standard library. There is also a version [13] using integer arithmetic internally, which may be faster on some platforms. Because of the luxury levels, RANLUX is especially well suited for evaluating the packages that test RNG’s for randomness [14]: if RANLUX passes a test at low luxury levels, then the test is not very sensitive; if it fails at the highest luxury level, then it is likely that the test itself is defective.
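
For instance, the standard-library version requires only the <random> header: std::ranlux24 packages the 24-bit SWB engine together with built-in decimation (std::ranlux48 is the 48-bit variant).

```cpp
// Using RANLUX from the C++ standard library.
#include <iostream>
#include <random>

int main() {
    std::ranlux24 eng(12345);                            // seeded engine
    std::uniform_real_distribution<double> u(0.0, 1.0);  // map to [0, 1)
    for (int i = 0; i < 5; ++i)
        std::cout << u(eng) << '\n';
    return 0;
}
```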

The High-Quality RNG’s: 2. MIXMAX

A few years before the publication of RANLUX, George Savvidy et al. in Erevan [5] were working on a different approach to the same problem using the same theory of mixing. Their approach was to look for a family of matrices A which could be defined for any dimension N, having eigenvalues far from the unit circle and, therefore, large Lyapunov exponents for different values of N, to reduce or even eliminate the need for decimation.

MIXMAX: The Matrix, the Algorithm, and Decimation

The family of matrices which they found was (for \(N \ge 3\))

$$\begin{aligned} A = \begin{pmatrix} 1 &{} 1 &{} 1 &{} 1 &{} \cdots &{} 1 &{} 1 \\ 1 &{} 2 &{} 1 &{} 1 &{} \cdots &{} 1 &{} 1 \\ 1 &{} 3+s &{} 2 &{} 1 &{} \cdots &{} 1 &{} 1 \\ 1 &{} 4 &{} 3 &{} 2 &{} \cdots &{} 1 &{} 1 \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \ddots &{}\vdots &{} \vdots &{}\\ 1 &{} N-1 &{} N-2 &{} N-3 &{} \cdots &{} 2 &{} 1 \\ 1 &{} N &{} N-1 &{} N-2 &{} \cdots &{} 3 &{} 2 \\ \end{pmatrix} \end{aligned}$$
(3)

where the “magic” integer s is normally zero, but for some values of N, \(s=0\) would produce an eigenvalue \(|\lambda _i | = 1\), in which case a different small integer must be used. A sketch constructing this matrix is given below.
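
As a check of the pattern in Eq. (3), the matrix is easy to construct programmatically (the function name is ours, for illustration):

```cpp
// Sketch constructing the MIXMAX matrix of Eq. (3) for given N >= 3 and
// magic integer s (added to the entry in row 3, column 2).
#include <cstdio>
#include <vector>

std::vector<std::vector<long>> mixmax_matrix(int N, long s) {
    // Start from all ones (row 1 and everything above the diagonal band).
    std::vector<std::vector<long>> A(N, std::vector<long>(N, 1));
    for (int i = 2; i <= N; ++i)          // 1-based indices, as in Eq. (3)
        for (int j = 2; j <= i; ++j)
            A[i - 1][j - 1] = i - j + 2;  // 2 on the diagonal, i in column 2
    A[2][1] += s;                         // the magic entry: row 3, column 2
    return A;
}

int main() {
    for (auto& row : mixmax_matrix(6, 0)) {
        for (long v : row) std::printf("%3ld", v);
        std::printf("\n");
    }
    return 0;
}
```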

The straightforward evaluation of the matrix-vector product in Eq. (1) requires \(O(N^2)\) operations to produce N random numbers, making it hopelessly slow compared with other popular RNG’s, so a faster algorithm would be needed. After considerable effort, they reduced the time to \(O(N\ln N)\), much faster but still too slow.

George Savvidy’s son Konstantin, also a theoretical physicist, found the algorithm that reduces the cost to \(O(N)\) operations per vector of N random numbers.

Fig. 3 Divergence of nearby trajectories for MIXMAX 1.0, \(N = 10\), 16, 44, 88, 256, 1000

Table 1 Number of iterations of decimation required for full separation with original MIXMAX for different values of N

However, the developers of MIXMAX were already working on an extended MIXMAX which hopefully would not require any decimation.