Sorting a Random Vector

Hello everyone,

I am trying to write a program that sorts a vector of 100,000 random integers from least to greatest. I am using the C++ standard library function sort.

The code currently gives me an assertion failure at line 23, which is assert(v1[0] < v1[1]). For some reason, the code is making v1[0] equal to v1[1], so the assertion fails. It only fails on that line: if I change the '<' on line 23 to '<=', the program runs fine.

I would appreciate the feedback. Thanks in advance!

#include <iostream>
#include <cassert>
#include <vector>
#include <stdlib.h>
#include <time.h>
#include <algorithm>

// The code uses features in c++11. Compile it using "g++ -std=c++11".

using namespace std;

void RndmVector(vector<int> & v);

int main()
{
    srand(time(0));

    vector<int> v1;
    RndmVector(v1);
    sort(v1.begin(), v1.end());

    assert(v1.size() == 100000);
    assert(v1[0] < v1[1]);
    assert(v1[2] < v1[9]);
    assert(v1[88] < v1[99]);
    assert(v1[970] < v1[999]);
    assert(v1[9700] < v1[9999]);
    assert(v1[97000] < v1[99999]);

    cout << "All tests have successfully passed." << endl;
}
void RndmVector(vector<int> & v)
{
    for(int k = 0; k < 100000; k++)
    {
        v.push_back(rand());
    }
}
For some reason, the code is making v1[0] equal to v1[1]

It's very possible that rand() is returning a duplicate value. There is no guarantee that rand() will return unique values. sort(), of course, will group the duplicate values together. If v1[0] and v1[1] are indeed duplicate values, your assertion will fail. That also explains why <= works.
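
If you want to confirm that, std::adjacent_find on the sorted vector will tell you whether any neighbouring elements are equal. A minimal sketch (it mirrors your setup; the exact count is just a placeholder):

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

int main()
{
    std::srand( static_cast<unsigned>( std::time(0) ) );

    std::vector<int> v;
    for( int k = 0; k < 100000; ++k ) v.push_back( std::rand() );
    std::sort( v.begin(), v.end() );

    // after sorting, duplicates sit next to each other; adjacent_find
    // returns an iterator to the first such pair, or end() if there is none
    auto it = std::adjacent_find( v.begin(), v.end() );
    if( it != v.end() ) std::cout << "duplicate value found: " << *it << '\n';
    else                std::cout << "no duplicates\n";
}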


Thanks for the reply!

But isn't it mathematically improbable that out of 100,000 elements, there will only be two duplicate values, and that those two values just happen to be the smallest integers?

EDIT: Actually, I don't know that those are the only two duplicate values, since I don't have 100,000 assert statements!
On many systems, RAND_MAX may be 32767. If you try to generate 100000 random numbers using rand() on such a system, there will always be many duplicates.
http://www.cplusplus.com/reference/cstdlib/RAND_MAX/
http://www.cplusplus.com/reference/cstdlib/rand/

If you need 100000 unique values, don't use rand().
You'd be better off using the C++11 random facilities.
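
A quick way to see what you've got on your own system is to print the constant directly; a minimal check:

#include <cstdlib>
#include <iostream>

int main()
{
    // RAND_MAX is defined in <cstdlib>; the standard only guarantees it is at least 32767
    std::cout << "RAND_MAX on this system: " << RAND_MAX << '\n';
}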

Chervil, that's a good point. However, if that were the only problem, the following code should work, yet the assertion at line 40 (the assert(v[i] < v[j]) inside the nested loops) still fails. So there are still duplicates even when the vector size is under 32,767.

#include <iostream>
#include <cassert>
#include <vector>
#include <stdlib.h>
#include <time.h>
#include <algorithm>

// The code uses features in c++11. Compile it using "g++ -std=c++11".

using namespace std;

void RndmVector(vector<int> & v);
void assertalgorithm(vector<int> & v);

int main()
{
    srand(time(0));

    vector<int> v1;
    RndmVector(v1);
    sort(v1.begin(), v1.end());
    assertalgorithm(v1);

    cout << "All tests have successfully passed." << endl;
}
void RndmVector(vector<int> & v)
{
    for(int k = 0; k < 10000; ++k)
    {
        v.push_back(rand());
    }
}
void assertalgorithm(vector<int> & v)
{
    assert(v.size() == 10000);
    for(int i = 0; i < 10000 - 1; i++)
    {
        for(int j = i + 1; j < 10000; ++j)
        {
          assert(v[i] < v[j]);
        }
    }
}
#include <iostream>
#include <cstdlib>
#include <iomanip>

double probability_of_no_duplicates( std::size_t N )
{
    if( RAND_MAX < N ) return 0 ;

    double p = 1.0 ;
    for( std::size_t i = 0 ; i < N ; ++i ) p *= double( RAND_MAX - i ) / RAND_MAX ;
    return p * 100 ;
}

int main()
{
    std::cout << "RAND_MAX: " << RAND_MAX << " percent probability of no duplicates\n" ;
    for( std::size_t n = 1 ; n < 1'000'001 ; n *= 10 )
        std::cout << std::setw(7) << n << std::setw(20) << std::fixed << std::setprecision(15) << probability_of_no_duplicates(n) << "%\n" ;
}

RAND_MAX: 2147483647 percent probability of no duplicates
      1 100.000000000000000%
     10  99.999997904524221%
    100  99.999769497924888%
   1000  99.976742919957701%
  10000  97.698813414799019%
 100000   9.745941004097384%
1000000   0.000000000000000%

RAND_MAX: 32767 percent probability of no duplicates
      1 100.000000000000000%
     10  99.862747710512124%
    100  85.965875610794797%
   1000   0.000020476543314%
  10000   0.000000000000000%
 100000   0.000000000000000%
1000000   0.000000000000000%

http://coliru.stacked-crooked.com/a/a034d8fb55cd7ce3
http://rextester.com/LBJ4547
> Can I use your code to my school project?

Yes. Provided you understand it, and can explain the logic.
Huh?

I do not recall asking you if I may use your code for a school project...

That's nice code for getting the probabilities.

Is the RAND_MAX value for C++ usually 32767?
Don't worry, it was a spam message, now deleted.

The documentation says it is guaranteed to be at least 32767 - and that would be common on a 32-bit system. I'm not sure about 64-bit systems.

The best plan is to move on: rand() is limited in many ways, and the C++11 facilities are better.

You need a source of random numbers (a generator), and then you feed its output through the required distribution, integer or floating-point according to your needs.


Example - but you should choose the range as required.
#include <random>
#include <ctime> 


    std::mt19937 mt(time(0));                              // seed using current time
    std::uniform_int_distribution<int> dist(0, 1000000);   // specify range 0 to 1 million

    dist(mt); // use this instead of rand() 



http://www.cplusplus.com/reference/random/mt19937/
http://www.cplusplus.com/reference/random/uniform_int_distribution/
http://www.cplusplus.com/reference/random/uniform_real_distribution/
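
Putting those pieces together, a minimal sketch of filling and sorting a vector this way (the range 0 to 1,000,000 and the count of 100,000 are only placeholders):

#include <algorithm>
#include <ctime>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    std::mt19937 mt( static_cast<unsigned>( std::time(0) ) );  // generator, seeded with the current time
    std::uniform_int_distribution<int> dist(0, 1000000);       // distribution: values from 0 to 1 million

    std::vector<int> v;
    for( int k = 0; k < 100000; ++k ) v.push_back( dist(mt) ); // dist(mt) takes the place of rand()

    std::sort( v.begin(), v.end() );
    std::cout << "smallest: " << v.front() << "  largest: " << v.back() << '\n';
}

Note that a better generator on its own still doesn't guarantee unique values: whenever the range isn't much larger than the number of draws, duplicates remain likely.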
Hello Chervil,

I used the Mersenne Twister algorithm as per your suggestion, but the assertion algorithm still fails.

#include <iostream>
#include <cassert>
#include <vector>
#include <time.h>
#include <algorithm>
#include <random>

// The code uses features in c++11. Compile it using "g++ -std=c++11".

using namespace std;

void RndmVector(vector<int> & v);
void assertalgorithm(vector<int> & v);

int main()
{
    vector<int> v1;
    RndmVector(v1);
    sort(v1.begin(), v1.end());
    assertalgorithm(v1);

    cout << "All tests have successfully passed." << endl;
}
void RndmVector(vector<int> & v)
{
    mt19937 mt_rand(time(0));
    for(int k = 0; k < 100000; k++)
    {
        v.push_back(mt_rand());
    }
}
void assertalgorithm(vector<int> & v)
{
    assert(v.size() == 100000);
    for(int i = 0; i < 100000 - 1; i++)
    {
        for(int j = i + 1; j < 100000; ++j)
        {
          assert(v[i] < v[j]);
        }
    }
}
Two comments. First, you should use some sort of distribution as well as the generator. Choose the required range of the resulting values.

Second, if you generate any sequence of random numbers, there's no guarantee that each value will be unique.

If you throw a die six times, you would not expect to get each number 1 to 6 precisely once. Much more likely you will get some repeats and some not occurring at all.

You need to consider how to ensure you get unique numbers. Just using a 'better' source of numbers isn't sufficient on its own.

One way would be to fill the vector with consecutive values and then shuffle it. But in your case you would immediately sort it back again, so the shuffle would be a waste of time.

Another way is to check that the number is already present before inserting it. If it's already there, try again.

A better way is to use a std::set, which will only accept unique values, and keep inserting until the size of the set reaches your desired count; there is a sketch of this below. (Note: if you tried this with rand() you'd get an infinite loop, because after the first 32768 distinct values (typically all that rand() can produce), every number would already be present and each insert would fail.)
http://www.cplusplus.com/reference/set/set/insert/
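
A minimal sketch of the std::set approach, assuming a C++11 generator and an arbitrary (but sufficiently large) range:

#include <algorithm>
#include <ctime>
#include <iostream>
#include <random>
#include <set>
#include <vector>

int main()
{
    std::mt19937 mt( static_cast<unsigned>( std::time(0) ) );
    std::uniform_int_distribution<int> dist(0, 10000000);  // the range must hold at least as many values as you need

    std::set<int> unique_values;
    while( unique_values.size() < 100000 )
        unique_values.insert( dist(mt) );                  // inserting a duplicate simply does nothing

    // a std::set iterates in ascending order, so the copy is already sorted
    std::vector<int> v( unique_values.begin(), unique_values.end() );

    std::cout << "collected " << v.size() << " unique values; sorted: "
              << std::boolalpha << std::is_sorted( v.begin(), v.end() ) << '\n';
}

Because the set keeps its elements in ascending order, copying it into the vector already gives sorted data, so the separate call to sort() becomes unnecessary.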
Not really random, but more than random enough for our purposes (should pass most statistical tests for randomness.)

#include <iostream>
#include <random>
#include <algorithm>
#include <ctime>
#include <unordered_set>
#include <cassert>

unsigned int unique_rand( unsigned int n )
{
    // number theory:
    // where p is an odd prime number, the quadratic residue (mod p) of n
    //          is unique for every n when 2*n < p.
    // where p is a prime number, such that p (mod 4 ) == 3,
    //          p - the quadratic residue (mod p) of n is unique for every n when 2*n > p.

    static constexpr unsigned int p = 4294967291; // prime number, p%4 == 3
    static constexpr unsigned int half_p = p/2 ;

    if( n >= p ) return n ;

    const unsigned long long square = 1ULL * n * n ;
    const unsigned int quadratic_residue = square % p ;
    if( quadratic_residue <= half_p ) return quadratic_residue ;
    else return p - quadratic_residue ;
}

// EDIT: corrected
std::vector<unsigned int> vector_with_unique_random_values( std::size_t sz )
{
    std::vector<unsigned int> result ;
    result.reserve(sz) ; 
    
    unsigned int seed = std::time(nullptr) ;
    for( unsigned int i = 0 ; i < (sz+2) ; ++i ) result.push_back( seed = unique_rand(seed) ) ;

    // break the clustering of quadratic residues
    static constexpr unsigned int anti_cluster = 0x5bf03635 ; // empirically a good value
    for( unsigned int& v : result ) v ^= anti_cluster ;
    std::shuffle( std::begin(result), std::end(result), std::mt19937( std::time(nullptr) ) ) ;

    return result ;
}

bool has_unique_values( const std::vector<unsigned int>& seq )
{ return std::unordered_set<unsigned int>( std::begin(seq), std::end(seq) ).size() == seq.size() ; }

int main()
{
    const std::size_t N = 10'000'000 ; // ten million unique random numbers

    const auto start = std::clock() ;
    const auto vec = vector_with_unique_random_values(N) ;
    const auto end = std::clock() ;

    std::cout << "unique values? " << std::boolalpha << has_unique_values(vec) << '\n'
              << "generation took " << (end-start) * 1000.0 / CLOCKS_PER_SEC << " processor milliseconds\n" ;
}

// http://coliru.stacked-crooked.com/a/12caec17968f858a
corrected: http://coliru.stacked-crooked.com/a/b60a1a4356c3ceca

On first impressions, the GNU optimiser (GCC 6.1) appears to be well ahead of the LLVM offering.