Generating 100 million unique random uns

Forum

Forum
Beginners
Generating 100 million unique random uns

Generating 100 million unique random unsigned integers

Mar 21, 2017 at 2:16pm

Hello guys, i am thinking of building a program where i would take up data for people in the millions. I want to create a surrogate key to identify each person. I want a program that can generate 100( or more) million four-digit numbers.

Here is my first visionary try

#include <iostream>  // in/out-put stream
#include <random>  // random engine generator
#include <vector>  // for std::vector
#include <algorithm> // for std::remove
#include <iomanip> // for setw()

// definining constants. Change any of these the way you want 
#define LIMIT 1000	// number of runs and number needed
#define MIN_VAL 1000	// least number to generate
#define MAX_VAL 9999   // maximum number to generate
using namespace std;

vector<unsigned> randomIntegers();
// -------------------------------------------------------------------------------------------------

int main()
{
	vector<unsigned> lastDigits = randomIntegers();
	unsigned row = 1;
	//sort(lastDigits.begin(), lastDigits.end()); // uncomment if you want to sort the numbers 1-10
	for (auto num : lastDigits)
	{
		cout << left << setw(6) << num;
		if (row++ % 10 == 0)
			cout << endl;
	}
	cout << endl;
	return 0;
}

//------------------------------------------------------------------------------------------------
// function definition
vector<unsigned> randomIntegers()
{
	// Randomly generate numbers
	random_device rd;  // random seed
	mt19937 gen(rd()); // Initializing Mersnne Twister pseudo-random number generator
	uniform_int_distribution<> random(MIN_VAL, MAX_VAL);

	// container to store the numbers
	vector<unsigned> ranNumVec;

	// generate random numbers between minVal and maxVal
	for (size_t i = 0; i < LIMIT, ranNumVec.size() < LIMIT; i++)
	{
		unsigned ranNum = random(gen); // generate a number

		// check duplicate
		auto found = find(begin(ranNumVec), end(ranNumVec), ranNum);

		// if duplicate, delete one
		if (found != end(ranNumVec))
		{
			ranNumVec.erase(std::remove(ranNumVec.begin(), ranNumVec.end(), ranNum), ranNumVec.end());
		}
		ranNumVec.push_back(ranNum);  // only put numbers that are non-repeating
	}

	return ranNumVec;
}

Edit & run on cpp.sh

How can this be made faster? When i try to run say 10000 numbers, my computer seem to sleep. Maybe it is its speed?

Your suggestions??

Last edited on Mar 23, 2017 at 5:35pm

Mar 21, 2017 at 2:48pm

helios (17607)

I want a program that can generate 100( or more) million four-digit numbers.

How are you going to have 100 million 4-digit numbers when there's only 9000 4-digit numbers?
If you just want to ID your users, what's the problem with a sequentially increasing ID? Why do you want random IDs?

Mar 21, 2017 at 2:51pm

JLBorges (13770)

Well under a second for one million 1000x1000 unique random numbers
(even with clang++ which is well behind GNU and microsoft on this test):

#include <iostream>
#include <vector>
#include <numeric>
#include <algorithm>
#include <random>
#include <ctime>

constexpr std::size_t N = 1000 ; // #unique random numbers
constexpr std::size_t NRUNS = 1000 ; // #runs
constexpr int  MINV = 1000 ; // min_value
constexpr int  MAXV = 9999 ; // max value

std::vector<int> make_population() // return vector with all numbers in [MINV,MAXV]
{
    std::vector<int> pop( MAXV-MINV + 1 ) ;
    std::iota( std::begin(pop), std::end(pop), MINV ) ;
    return pop ;
}

std::vector<int> generate() // generate N unique random integers in [MINV,MAXV]
{
    static std::vector<int> population = make_population() ;
    static std::mt19937 rng( std::random_device{}() ) ;

    // return the first N numbers of the randomly shuffled vector
    std::shuffle( std::begin(population), std::end(population), rng ) ;
    return { std::begin(population), std::begin(population)+N } ;
}

int main()
{
    unsigned long r = 0 ;
    // time time taken to generate N randomn numbers NRUNS times
    const auto start = std::clock() ;

    for( std::size_t run = 0 ; run < NRUNS ; ++run )
    {
        r += generate().front() ;
    }

    const auto end = std::clock() ;
    std::cout << double(end-start) / CLOCKS_PER_SEC << " seconds\n" ;
    
    return r > MAXV * NRUNS ;
}

Edit & run on cpp.sh

http://coliru.stacked-crooked.com/a/21fe87d980dd724a
http://rextester.com/ZMAR19367

Last edited on Mar 21, 2017 at 2:54pm

Mar 21, 2017 at 2:59pm

jonnin (11491)

if you are hitting your data with C++, consider trying to unique hash the user's data to make the key so you can search in O(1) by indexing the database off the key. Something like whatever the latest SHA or similar does pretty well. You shouldnt have collisions for only a few million in a 4 byte int, if you do it right -- it holds over 4000 millions.

Mar 21, 2017 at 3:04pm

gunnerfunner (2127)

i am thinking of building a program where i would take up data for people in the millions

they've built up a database for a billion+ people over in India, see if you can get any further ideas for your project:
https://uidai.gov.in/

Mar 21, 2017 at 5:32pm

longberns (143)

@helios

If you just want to ID your users, what's the problem with a sequentially increasing ID?

because i don't want to do that. I just want it to be random.

@JLBorges wao! Tested (in VS Enterprise 2015) and i get

  
Number of values generated 9000
16.272 seconds

That was fast! Thanks for that.

@jonnin

to unique hash the user's data to make the key so you can search in O(1) by indexing the database off the key

. I will research on that and see how i can implement that. Thanks.

@gunnerfunner thanks for the referense.

Again, thank you guys for taking a look and contributing to my vision.

Last edited on Mar 21, 2017 at 5:33pm

Mar 21, 2017 at 6:14pm

jonnin (11491)

Number of values generated 9000
16.272 seconds
???
that seems terrible to me, unless you found a 386 somewhere to run it on.

Are you sure that wasnt MS or something?

Mar 21, 2017 at 6:20pm

longberns (143)

@jonnin what do you mean

that seems terrible to me, unless you found a 386 somewhere to run it on.

? Too slow? It could be my hardware. It is faster compared to mine where i waited for more than 60 seconds and didnt get anything except the console blinking.

Mar 21, 2017 at 7:01pm

jonnin (11491)

I mean generating 9000 values should take fractions of a second, not 10s of seconds.
I was wondering if you meant miliseconds where you said 16.x ?

If your machine is very, very old and slow, it could be the issue. Note above:

JLBorges (9142)
Well under a second for one million 1000x1000 unique random numbers...

the 386 comment was a joke, but 16 seconds for only 9k? That indicates a serious problem somewhere.

Mar 21, 2017 at 7:22pm

helios (17607)

That time is possible on a modern computer if the program runs from inside the debugger without optimizations, N=1000, and NRUNS=9000.

Mar 22, 2017 at 5:31am

TheIdeasMan (6845)

jonin wrote:
if you are hitting your data with C++, consider trying to unique hash the user's data to make the key so you can search in O(1) by indexing the database off the key.

A little while ago, I was thinking that a hash table might be more efficient than a std::vector . But JLBorges demonstrated to me, as to why that not might not be true. It was quicker to populate a vector with 1 million ints and sort it, then binary search, than it was to compute hash values for the same.

I have come to learn that a lot of the rules about containers and efficiency in terms big O notation for CS in general don't seem to always hold for C++. The reasons seem to be move semantics, concurrency and cache effects, making C++ more efficient than other languages that don't have these things.

http://www.cplusplus.com/forum/general/193311/#msg930382

Mar 22, 2017 at 7:04am

helios (17607)

Big O notation describes the asymptotic behavior of an algorithm. An O(f(n)) algorithm A is asymptotically faster than an O(g(n)) algorithm B if lim [n -> ∞] g(n) / f(n) = ∞. That doesn't mean that algorithm A is always faster than algorithm B, it just means there's a sufficiently large problem for which it is always faster from then on. Sometimes that sufficiently large problem is so large that it's irrelevant, and sometimes it's stupidly small.

Consider:

//(Note: this isn't C++ code, it's a decompiled representation of hypothetical executable code.)
unsigned gaussian_sum_1(unsigned n){
    unsigned ret;
    for (int i = 1000000; i--;)
        ret = (n * n + n) / 2;
    return ret;
}

unsigned gaussian_sum_n(unsigned n){
    unsigned ret = 0;
    for (unsigned i = 1; i <= n; i++)
        ret += i;
    return ret;
}

gaussian_sum_1() runs in O(1) and gaussian_sum_n runs in O(n), yet gaussian_sum_1() will run slower until n is at least 1000000.

It's difficult to make blanket statements about when a hash table is faster than an array. I would certainly not make them here, since OP's question doesn't even make sense.

Mar 22, 2017 at 7:57am

TheIdeasMan (6845)

helios wrote:
It's difficult to make blanket statements about when a hash table is faster than an array. I would certainly not make them here, since OP's question doesn't even make sense.

My post was about what jonin wrote, not the OP. He was contending that a hash might be better (search in O(1) ), I was relaying a real and timed example where it wasn't. More importantly, the point was that hash tables have been considered (in theory) to be efficient in CS, but that isn't necessarily the case with modern C++, and I relayed the reasons why.

Stroustrup has said that "it is only real C++ experts that can say about the real efficiency of STL containers". It's not enough to say that a hash function will run in O(1), so that is better than something else. It was JLBorges who demonstrated all this to me, along with another discussion with Cubbi and mbozzi

So this raises another question: Why would someone want to use an STL hash container like std::unordered_set , given the big difference in performance shown in the example above? I could guess: Maybe their data has duplicates; or they implemented a different strategy for the hashing; or some other reason. Cubbi mentioned that he has experienced several times where they rolled their own hash tables / functions. (No pun intended :+) )

Here is the link to the example again, save scrolling up to look for it:
http://www.cplusplus.com/forum/general/193311/#msg930382

Edit:

I guess a hash table would be handy for a DB table, where there are multiple different indexes. One wouldn't want to be resorting on different fields all the time. I guess the user would have to put up with the ~~embuggerance~~ embotherance if the container is growing rapidly.

I should also say that I don't have anything against jonin, I was relating what I had learnt - maybe that wasn't the entire picture, Regards :+)

Last edited on Mar 22, 2017 at 8:43am

Mar 22, 2017 at 12:05pm

jonnin (11491)

I am not offended, use the best thing for your application. I said to consider it. Just computing some of the high end string hashes is a little expensive as they require multiple ops across many bytes. It depends on what he is doing whether it makes sense to try something else. The OP led me to think this stuff was living in a database, so the actual search and get are done there, off a key. If his app is sitting off the side of that, and the user enters key identifying data for the record, you can hash that, get the key, and the database will find that faster than if you don't have the key and have to do string searches on the DB.

If everything lives in c++, then its a different problem to address.

Ill also note that your example runs slower until... the numbers reach the values asked for by the OP. Whether that applies here, or not, I don't know.

Mar 22, 2017 at 2:10pm

helios (17607)

More importantly, the point was that hash tables have been considered (in theory) to be efficient in CS, but that isn't necessarily the case with modern C++, and I relayed the reasons why.

Yes, hash tables are asymptotically faster than binary search. That means there's an n that's large enough that a hash table is faster than binary search.

Why would someone want to use an STL hash container like std::unordered_set , given the big difference in performance shown in the example above?

For example, a hash table has weaker constraints on its key than a sorted array. Binary search requires that the type have a weak ordering, while a hash table only requires an equality comparison.
EDIT: Also, the cache performance of binary search quickly degenerates if the elements are not stored contiguously (e.g. if you sort an array of pointers based on the value of the pointed-to objects), while a hash table of pointers can perform the initial lookup without dereferencing its keys.

Last edited on Mar 22, 2017 at 2:22pm

Mar 22, 2017 at 2:20pm

JLBorges (13770)

For this specific problem (limited range of ids, say, a million or so), look up by indexing into a sparsely populated vector would perhaps be the fastest option.

#include <iostream>
#include <vector>
#include <numeric>
#include <algorithm>
#include <random>
#include <ctime>

constexpr std::size_t N = 1000000 ; // #unique random numbers
constexpr std::size_t NRUNS = 112 ; // #runs
constexpr int  MINV = 1000000 ; // min_value
constexpr int  MAXV = 9999999 ; // max value

std::vector<int> make_population() // return vector with all numbers in [MINV,MAXV]
{
    std::vector<int> pop( MAXV-MINV + 1 ) ;
    std::iota( std::begin(pop), std::end(pop), MINV ) ;
    return pop ;
}

std::vector<int> generate() // generate N unique random integers in [MINV,MAXV]
{
    static std::vector<int> population = make_population() ;
    static std::mt19937 rng( std::random_device{}() ) ;

    // return the first N numbers of the randomly shuffled vector
    std::shuffle( std::begin(population), std::end(population), rng ) ;
    return { std::begin(population), std::begin(population)+N } ;
}

std::vector<unsigned char> create_lookup_table( const std::vector<int>& unique_ids )
{
    static_assert( MINV >= 0 && MINV <= MAXV, "written for non-negative values" ) ;
    std::vector<unsigned char> lookup_table( MAXV-MINV + 1 ) ;
    for( unsigned int v : unique_ids ) lookup_table[v-MINV] = true ;
    return lookup_table ;
}

bool look_up( const std::vector<unsigned char>& lookup_table, int id )
{
    return lookup_table[id-MINV] ;
}

int main()
{
    unsigned long r = 0 ;

    const auto lookup_table = create_lookup_table( generate() ) ;

    // time time taken to lookup all numbers in [MINV,MAXV] NRUNS times
    const auto start = std::clock() ;

    for( std::size_t run = 0 ; run < NRUNS ; ++run )
    {
        for( int id = MINV ; id <= MAXV ; ++id ) r += look_up( lookup_table, id ) ;
        // note: this produces excellent cache locality; 
        //      truly random look ups would be somewhat slower
    }

    const auto end = std::clock() ;
    std::cout << ( MAXV-MINV + 1 ) * NRUNS << " look ups in "
              << double(end-start) / CLOCKS_PER_SEC << " seconds\n" ;

    return r > N * NRUNS ;
}

Edit & run on cpp.sh

clang++ -O3 1008000000 look ups in 0.684365 seconds
http://coliru.stacked-crooked.com/a/4721b23e83760249

g++ -O3 1008000000 look ups in 0.289123 seconds (as before, the GNU compiler/library shines.)
http://coliru.stacked-crooked.com/a/34310c5f9f028e52

microsoft -O2 1008000000 look ups in 0.712 seconds (different machine, so merely indicative)
http://rextester.com/SSL67313

Last edited on Mar 22, 2017 at 2:24pm

Mar 23, 2017 at 5:34pm

longberns (143)

Thanks guys for putting up such an educative debate. Special thanks to @JLBorges for putting up this. It works and that solves my problem.

Topic archived. No new replies allowed.