Random Number Generator

Sometimes trying to convert older rand & modulo code to C++ can be fraught with monsters.

Old code in a Windows app drew shapes at random positions on the client area, with connecting "lines", and used rand() with modulo to generate locations that could be negative. Trying to recreate that using <random> didn't quite work.

Not that I really bothered to delve deeper into what went wrong. I was merely dinking around and gave up after a few attempts.
@againtry
There is nothing misleading about the calculation with the Pigeonhole Principle — that is exactly how it works.

No matter what RAND_MAX you choose, there will always be values of M that do not evenly divide the number of possible outputs. That is, (RAND_MAX + 1) mod M is not guaranteed to be 0 for every M in [1..RAND_MAX]. Hence, there is a (high) probability of bias. Such is the nature of modular arithmetic.

@Albatross.
Thanks for the link.
But you could simply re-draw if rand() gave you a number greater than or equal to RAND_MAX-RAND_MAX%M, so removing that bias (at the expense of a small loss of range).

If RAND_MAX were 19, and you were working modulo 6, then you would simply reject and re-draw if rand() were >= 18. No, it's not pretty ... but it's an easy fix.
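
For instance, a minimal sketch of that rejection idea (the helper name is mine, and it assumes rand() has already been seeded):

#include <cstdlib>

// Draw an unbiased value in [0, m) by throwing away the top sliver of rand()'s
// range, using the RAND_MAX - RAND_MAX % m cut-off described above.
int rand_mod(int m)
{
    const int limit = RAND_MAX - RAND_MAX % m;
    int r;
    do {
        r = std::rand();
    } while (r >= limit);
    return r % m;
}

With RAND_MAX == 19 and m == 6 the limit is 18, so 18 and 19 get re-drawn, and each residue 0..5 then occurs exactly 3 times across the 18 accepted values.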
...which I spent nearly 8000 characters explaining on the last page.
http://cplusplus.com/forum/general/282260/#msg1221435

But yes, thank you.
So, since it's easy to remove the bias, I don't have a problem with rand(). And if you want multiple intervals [a,b] in the same program then, with rand(), you don't have to define a new std::uniform_int_distribution<int> for each a and b.

But, personally, I prefer computer languages where the default random number is a floating-point value in [0,1). Then I can easily convert it to whatever I want.
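
A sketch of that style in C++ (the helper names are mine, and the scaling idiom carries a tiny bias of its own for very large ranges, but it shows the convenience):

#include <cstdlib>

// One uniform-ish double in [0, 1) from rand(), then scaled to whatever is needed.
double u01()
{
    return std::rand() / (RAND_MAX + 1.0);
}

int    die()  { return 1 + static_cast<int>(u01() * 6); }   // integer in [1, 6]
double temp() { return -10.0 + u01() * 50.0; }               // double in [-10, 40)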


But I've just read through one link earlier in this thread and I spotted the truly hilarious
auto main() -> int
Please don't use it to bash rand() when it contains that monstrosity!
Alas, I think a lot of those languages just use the equivalent of rand() underneath to produce that floating-point number... without even bothering to fill in the gaps.
since it's easy to remove the bias, I don't have a problem with rand().
The problem with rand() isn't really the bias. You can bias the output even of a TRNG if you remap it into your desired range incorrectly.
 
n = (true_random_bit() + 2 * true_random_bit()) % 3;
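// biased: the fair sum 0..3 is folded onto 0..2, so n == 0 comes out with
// probability 1/2 while 1 and 2 each come out with probability 1/4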
The problem is that linear congruential generators produce regularly patterned outputs. The easiest way to see these patterns is by generating points in 2D or 3D: https://en.wikipedia.org/wiki/File:Lcg_3d.gif
This non-uniformity may or may not be a problem for your particular application, but determining that incorrectly could be bad.
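
To make that concrete, here is a minimal sketch of that kind of experiment, using the infamous RANDU multiplier (a textbook example of a bad LCG, not anything quoted in this thread). Plotted in 3D, the emitted triples collapse onto just 15 planes:

#include <cstdint>
#include <cstdio>

// RANDU: x_{n+1} = 65539 * x_n mod 2^31 -- a classically bad LCG.
std::uint32_t randu(std::uint64_t& state)
{
    state = (65539u * state) % (1ull << 31);
    return static_cast<std::uint32_t>(state);
}

int main()
{
    std::uint64_t state = 1;
    for (int i = 0; i < 1000; ++i)
    {
        // Emit consecutive outputs as (x, y, z) points in [0, 1)^3.
        double x = randu(state) / 2147483648.0;
        double y = randu(state) / 2147483648.0;
        double z = randu(state) / 2147483648.0;
        std::printf("%f %f %f\n", x, y, z);
    }
}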
There is nothing misleading about the calculation with the Pigeonhole Principle — that is exactly how it works.


@Duthomas
We all know about the pigeon hole principle and its application. If the purpose is to dispense with rand() then its application here is misleading. The pigeons lead us to an incorrect conclusion, just as the Albatross article, yet another rand() hobby-horse article, and many others do.

The numbers you have chosen demonstrate the principle, but the trouble is neither rand() nor mt19937 suffers anything like this problem in the realm of non-specialized uses of random numbers. (I'm also keeping in mind @lastchance's comment about floats and @seeplus's comment about the ease of use of mt19937; all that's good, but it is not relevant to pigeon-rand-doom.)

Both distributions display similar and close uniformity in practice. Interestingly it can be shown that for a very low number of trials both exhibit the same supposedly deficient uniformity in their algorithms.

The fallacy, however, is that 19 is only thrown up in the mod 6 calculation once in RAND_MAX occasions. (I can't change RAND_MAX.) 19 doesn't get thrown up on every trial as you (and numerous others) would claim. At the low end of the RAND_MAX scale 19 will only occur on average once every 32,767 times - hardly a pit of doom. On my machine the library RAND_MAX value is 2,147,483,647. The pit of doom here is not even noticeable.

Here are a few runs. There's not much between them:

       RAND_MAX: 2147483647
    NO_OF_TESTS: 19
RANGE_OF_VALUES: 6

VALUE           rand()            mt19937
=========================================
  0        1  5.26316%        4  21.0526%
  1        2  10.5263%        2  10.5263%
  2        2  10.5263%        3  15.7895%
  3        3  15.7895%        1  5.26316%
  4        6  31.5789%        4  21.0526%
  5        1  5.26316%        3  15.7895%
  6        4  21.0526%        2  10.5263%
Program ended with exit code: 0


       RAND_MAX: 2147483647
    NO_OF_TESTS: 1000
RANGE_OF_VALUES: 6

VALUE           rand()            mt19937
=========================================
  0      142     14.2%      144     14.4%
  1      160       16%      152     15.2%
  2      141     14.1%      150       15%
  3      144     14.4%      122     12.2%
  4      133     13.3%      155     15.5%
  5      142     14.2%      137     13.7%
  6      138     13.8%      140       14%
Program ended with exit code: 0


       RAND_MAX: 2147483647
    NO_OF_TESTS: 1000000
RANGE_OF_VALUES: 6

VALUE           rand()            mt19937
=========================================
  0   142984  14.2984%   143057  14.3057%
  1   142613  14.2613%   142272  14.2272%
  2   142605  14.2605%   142793  14.2793%
  3   142652  14.2652%   143496  14.3496%
  4   142631  14.2631%   142226  14.2226%
  5   143696  14.3696%   143179  14.3179%
  6   142819  14.2819%   142977  14.2977%
Program ended with exit code: 0
The problem of using rand() % N tends to become larger when N is bigger.

If N is 2 * RAND_MAX / 3 then there would be a 67% chance of picking a number in the first half of the range compared to only 33% chance of picking a number in the second half.

If N exceeds RAND_MAX you won't get any numbers above RAND_MAX.

std::uniform_int_distribution handles these problems but it cannot help if the RNG generates random numbers of poor quality. One of the main problems with std::rand() is that it is implementation defined so the quality might vary between different implementations. What might be good enough with GCC on Linux might turn out to not be good enough with MinGW on Windows. std::default_random_engine has the same problem.
But the bias problem can be removed in the case of N being 2*RAND_MAX/3 simply by rejecting and redrawing when rand() returns a value greater than or equal to 2*RAND_MAX/3 (or, in general, RAND_MAX - RAND_MAX % N). Then every accepted value has the same probability.

The main problem is that RAND_MAX is so small on some systems - notably on Visual Studio. On most linux systems it seems to be substantially larger. No idea why the C++ standard shouldn't stipulate a very large minimum value, or tie it to the largest value of an unsigned int.

One way of getting numbers above RAND_MAX would be to combine multiple calls to rand():
(1ull+RAND_MAX)*rand()+rand().
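
As a rough sketch of how that combination might be wrapped up (the function name is mine; it also assumes successive rand() calls are independent enough for your purposes):

#include <cstdlib>

// Two rand() draws treated as two "digits" in base RAND_MAX + 1, giving a
// value in [0, (RAND_MAX + 1) * (RAND_MAX + 1) - 1].  A range reduction on
// top of this (modulo or rejection) works the same way as before.
unsigned long long big_rand()
{
    unsigned long long hi = std::rand();
    unsigned long long lo = std::rand();
    return (1ull + RAND_MAX) * hi + lo;
}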
Yes, both problems can be worked around, either manually or using std::uniform_int_distribution. You cannot use std::rand with std::uniform_int_distribution directly, but it's not difficult to write your own generator that wraps std::rand() and that you can use with std::uniform_int_distribution.

#include <iostream>
#include <random>
#include <ctime>
#include <cmath>   // std::round

using namespace std;

struct std_rand_generator
{
	using result_type = decltype(std::rand());
	
	static constexpr result_type min() { return 0; }
	static constexpr result_type max() { return RAND_MAX; }
	result_type operator()() const { return std::rand(); }
};

int main()
{
	std::srand(static_cast<unsigned>(std::time(nullptr)));
	
	std_rand_generator rng;
	
	constexpr int N = 2 * (rng.max() / 3);
	
	std::uniform_int_distribution<int> dist(0, N);
	
	int first_half = 0;
	int second_half = 0;
	
	constexpr int NO_OF_TESTS = 10'000'000;
	
	for (int i = 0; i < NO_OF_TESTS; ++i)
	{
		//int r = std::rand() % N;
		int r = dist(rng);
		
		if (r <= N / 2)
		{
			++first_half;
		}
		else
		{
			++second_half;
		}
	}
	
	
	std::cout << "First half:  " << first_half  << " (" << round(100.0 * first_half  / NO_OF_TESTS) << "%)\n";
	std::cout << "Second half: " << second_half << " (" << round(100.0 * second_half / NO_OF_TESTS) << "%)\n";
}

Sample output:
First half:  4997512 (50%)
Second half: 5002488 (50%)
lastchance wrote:
No idea why the C++ standard shouldn't stipulate a very large minimum value, or tie it to the largest value of an unsigned int.

Note that int (and unsigned int) is only required to be at least 16 bits by the standard. RAND_MAX is guaranteed to be at least 32767 which is the same as the maximum value of a signed 16-bit integer (using two's complement representation).

It's probably too late to change now anyway. The fact that Microsoft hasn't increased their value of RAND_MAX is probably a sign that they don't want to change it for one reason or another. Compatibility?

I think the people on the standard committees have accepted that rand() is a legacy feature and instead of trying to improve it they have started to discourage its use and provide/recommend alternatives.

A footnote in the C standard says:
There are no guarantees as to the quality of the random sequence produced and some implementations are known to produce sequences with distressingly non-random low-order bits. Applications with particular requirements should use a generator that is known to be sufficient for their needs

The C++ standard calls it Low-quality random number generation and has a note that says:
The other random number generation facilities in this document (26.6) are often preferable to rand, because rand’s underlying algorithm is unspecified. Use of rand therefore continues to be non-portable, with unpredictable and oft-questionable quality and performance.
Thank-you @Peter.

Another issue raised about the linear congruential PRNGs is their predictability (and hence their observable patterns). But that is only true if you kept every value in their sequence. If you discarded a random number of draws between each returned value then it would be much harder to predict the next in sequence.

I guess it comes down to my own uses of quasi-random numbers being probabilistic ones, requiring decent statistics and elimination of bias. A typical example would be simulating particle dispersion. (Sadly, I've had a go at Covid-19 aerosol dispersion predictions.) I'm not in a position to comment on their cryptographic uses.
I tried for a while to exploit LC engines for compression, to find a way to best-fit a random sequence to a data file to increase redundancy (e.g., xor with the random stream to make more zeros) and assist compression. I never managed it, though. Predictable, repetitive, and generally annoying as they are, it's not enough to turn that around and make it a useful feature in any way I ever figured out.
Peter87 wrote:
I think the people on the standard committees have accepted that rand() is a legacy feature and instead of trying to improve it they have started to discourage its use and provide/recommend alternatives.

Increasingly there are voices on the 'net, loud voices, insisting even use of the C++ <random> library is to be discouraged, never mentioned let alone used, and that beginners need to use some 3rd party library.

It's bad enough that using the C library is almost universally the way many instructors teach beginners random numbers, as the be-all and end-all.

Yes, even the C++ random engines are not "the best." But to ignore them because they are not designed for crypto work is foolish. The PRNGs in <random> are still far better than what C offers.
C++ beginners don't need to use a 3rd party library for random numbers. For beginners' work, IMO, <random> et al. are perfectly acceptable. What is missing, however, is a readable analysis of the pros and cons of the various generators, when one should be used over the others, their limitations, and when 3rd party ones might be needed for more advanced work (e.g. for crypto etc.). Wouldn't such an analysis be beneficial if produced by SG20 (Education)? Has this sub-group produced any educational materials yet? Its last update seems to be from over 2 years ago.



But to ignore them because they are not designed for crypto work is foolish.
I don't think anyone has said that. The reasons I remember off the top of my head to discourage using <random> were:
* std::default_random_engine has the exact same problem as rand(), in that no requirements are made on its implementation. An implementation could conceivably call rand() internally, or do something even worse.
* Even when std::random_device gives decent randomness, using it to seed std::mt19937* is tricky (see the sketch after this list). This is arguably a flaw in the design of the deterministic generators, as not providing a seed should simply cause the generator to seed itself from a good source, not to be left with a default constant seed.
* Statistically Mersenne Twister is one of the best non-cryptographic generators, but it's rather slow and has a rather large state. IIRC it was something like 600 (624, to be exact) 32-bit integers. There are faster and smaller generators that are also pretty good, and would have been a better choice for general purpose generators.
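
On that seeding point, a minimal sketch of the extra ceremony it takes to cover the whole Mersenne Twister state from std::random_device (the helper name is mine, not from the standard):

#include <algorithm>
#include <array>
#include <cstdint>
#include <functional>
#include <random>

// mt19937 carries 624 words of state; seeding it with a single 32-bit value
// leaves most of that state derived from very little entropy.  One common
// workaround is to fill a seed_seq from repeated random_device draws.
std::mt19937 make_seeded_mt19937()
{
    std::array<std::uint32_t, std::mt19937::state_size> seed_data;
    std::random_device rd;
    std::generate(seed_data.begin(), seed_data.end(), std::ref(rd));
    std::seed_seq seq(seed_data.begin(), seed_data.end());
    return std::mt19937(seq);
}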
1

Multiplying random numbers together generates bias. Don’t do it.

Your best hope (and the correct way) is to assume that the distribution of ones and zeros in the resulting number is random and simply chain the two with a bit-shift. (Alas, as already noted, rand() often fails this litmus.)
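
A sketch of that bit-shift chaining (the function name is mine; it assumes rand() yields 15 usable bits per call, i.e. RAND_MAX == 32767 as on MSVC, and that those bits are themselves reasonably random, which is exactly the caveat above):

#include <cstdlib>

// Chain two 15-bit rand() draws into one 30-bit value.
unsigned long rand30()
{
    return (static_cast<unsigned long>(std::rand()) << 15) | std::rand();
}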

2

Tossing random values requires an RNG, so you are left with the conundrum of using yet another RNG (or, just as bad, using the same RNG). And, as already noted, chaining LCGs is still predictable, even if more difficult to decipher. (In large part because the observer knows how it works! But even if he didn’t, it wouldn’t take much for modern computers to break such a system.)

Worse yet, you again introduce bias. Remember, bias in an LCG is eliminated by balancing the parameters between y-intercept, slope, and modulus (and proving it with a whole lot of testing). Tossing values adds bias by removing some of the possible results from an output cycle!

I believe I mentioned this earlier. By ignoring some pigeons we break the uniformity of the distribution. The reason that it worked before was that we knew we did not need the ignored pigeons — they were expressly ignored. Even so, they still appeared with the same likelihood as the pigeons we wanted.

But when we chain the LCGs we are suddenly ignoring pigeons without having specifically excluded them. Sure, it may be possible that those particular pigeons may appear in another cycle, or we could consider the cycle to be some longer combination of the two RNG cycles, but either way the distribution of outputs is no longer uniform — I can no longer expect every pigeon to appear with equal probability.

3

The C crowd seems to hate C++ with a frothing, animalistic passion. <random> provides a fabulous collection of basic PRNGs all easy and ready to use. But no, “you’ll pry rand() from my cold, dead fingers”, and “oh, and your fancy stuff isn’t good enough, because it doesn’t also do X, Y, and Z”.

It doesn’t help that the C++ committee, at least for a while, had too many tight neckties to do something like mandate a proper CSPRNG instead of saying, "We could, uh, if you like, I mean... if it isn’t too much of a bother... make a... hey, you can just make the interface available, so...”

...which led to the bullcrap GCC pulled with the
struct random_device
{
  using result_type = unsigned int;

  //int getRandomNumber()
  result_type operator () ()
  {
      return 4;  // chosen by fair dice roll
                 // guaranteed to be random
  }
};
Someone who thought they understood Randall Munroe’s joke was clearly behind the cueball of reality, both in terms of what Randall was expressing and in terms of what the consequence would be to people just wanting easy access to a CSPRNG.

4

As far as a readable choice of which PRNG to use...

it doesn’t matter.

Seriously, just pick one. Even very awesome PRNGs are typically only a few lines of code long. MT is kind of unique for its class because of its size and complexity.
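
For a sense of scale, here is a sketch of one such small generator, splitmix64 (a well-known public-domain algorithm; the struct layout and names here are mine):

#include <cstdint>

// splitmix64: a decent general-purpose 64-bit PRNG that fits in a handful of lines.
struct splitmix64
{
    std::uint64_t state;

    explicit splitmix64(std::uint64_t seed) : state(seed) {}

    std::uint64_t next()
    {
        std::uint64_t z = (state += 0x9E3779B97F4A7C15ull);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
        return z ^ (z >> 31);
    }
};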

If you are working in a specialized-enough context that it makes any difference, you should already know enough about that context and how PRNGs work to make a good choice. (Or you should learn, because working in that context requires that knowledge.)

5

AFAIK, std::random_device is supposed to work properly on all modern compilers, but I still hate the thing. Hence my shameless plug of a little CSPRNG library: https://github.com/Duthomhas/CSPRNG

I agree that beginners should be just given working <random> code. It really is as easy as:

#include <random>

int random( int min, int max )
{
  std::random_device rd;
  return std::uniform_int_distribution<int>( min, max )( rd );
}

Assuming your std::random_device works because you have a sufficiently modern compiler (droll shove: make sure you are using a sufficiently modern compiler), that will always work.
Multiplying random numbers together generates bias.

Nobody has suggested multiplying random numbers together.
Assuming your std::random_device works because you have a sufficiently modern compiler (droll shove: make sure you are using a sufficiently modern compiler), that will always work.

Three of my "go-to" compilers are (I hope) decently up-to-date. MSVC (19.29.30140 & 19.31.31104), TDM64-GCC (10.3.0) and MinGW [MSYS2] (Rev9, Built by MSYS2 project) 11.2.0.