sample without replacement - efficient way

Greetings!
I have an array of which I would like to draw n samples without replacement. The quality of the randomness as well as computational efficiency is important.

Is this naive approach of mine the right way to do it?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57

#include <iostream>
#include <vector>
#include <random>
#include <iomanip>

int main(int argc, char *argv[]){

int randomseed = std::atoi(argv[1]);  			 // pass a random seed
int n = 4;									// no of samples
std::vector <double> Xdata {1,2,3,4,5,6,7,8,9,10};  // how to sample n elements without replacement?

// seed random number generator and distribution
std::mt19937 twister;

std::seed_seq seq{1, 20, 3200, 403, 5*randomseed+1, 12000, 73667, 9474+randomseed, 19151-randomseed};
std::vector <std::uint32_t> seeds(1);
seq.generate(seeds.begin(), seeds.end());
twister.seed(seeds.at(0)); 

std::uniform_int_distribution <> distrib(0, Xdata.size()-1);  // uniform dist. over array indices

// naive way
int index;
std::vector <int> indices(0);	//	stores chosen indices
int i = 0; 					// counts no. if indices already found
bool match;				// alerts if sampled index was already used

while(i < n){

	match = false;
	index = distrib(twister);
	
	for(int j=0; j<indices.size(); ++j){

		if(index == indices[j]) {
			match = true;	
			break;
		}
	
	}
		
	if(match == false){			  // accept index
		indices.push_back(index);
		++i;
	} 	
	
}	
	
// print out my great results
for(int i=0; i<indices.size(); ++i){
	std::cout << Xdata[ indices[i] ] << std::endl;
}

return 0;

}


edit: not sure why the comments are so far off...
Last edited on
not sure why the comments are so far off...


Probably because you used tabs to align the comments which breaks down if a different tab width is used.
> edit: not sure why the comments are so far off...
It happens when you mix spaces and tabs for indentation.
Sooner or later, something somewhere will screw up your nicely arranged code because it doesn't understand tab stops properly.

The problem is, as n approaches the size of your input vector, you're going to spend more and more time guessing and searching for the elusive element not already in the list.

Consider https://www.cplusplus.com/reference/algorithm/shuffle/

Using test coverage to profile your code (https://gcc.gnu.org/onlinedocs/gcc/Gcov.html )
The first number on each line is the number of times that line was executed.

$ g++ -fprofile-arcs -ftest-coverage  foo.cpp
$ ./a.out 20
$ gcov foo.cpp
$ cat foo.cpp.gcov
        1:    6:int main(int argc, char *argv[]){
        -:    7:
        1:    8:int randomseed = std::atoi(argv[1]);  			 // pass a random seed
        1:    9:int n = 4;									// no of samples
        2:   10:std::vector <double> Xdata {1,2,3,4,5,6,7,8,9,10};  // how to sample n elements without replacement?
        -:   11:
        -:   12:// seed random number generator and distribution
        1:   13:std::mt19937 twister;
        -:   14:
        2:   15:std::seed_seq seq{1, 20, 3200, 403, 5*randomseed+1, 12000, 73667, 9474+randomseed, 19151-randomseed};
        2:   16:std::vector <std::uint32_t> seeds(1);
        1:   17:seq.generate(seeds.begin(), seeds.end());
        1:   18:twister.seed(seeds.at(0)); 
        -:   19:
        1:   20:std::uniform_int_distribution <> distrib(0, Xdata.size()-1);  // uniform dist. over array indices
        -:   21:
        -:   22:// naive way
        -:   23:int index;
        1:   24:std::vector <int> indices(0);	//	stores chosen indices
        1:   25:int i = 0; 					// counts no. if indices already found
        -:   26:bool match;				// alerts if sampled index was already used
        -:   27:
        5:   28:while(i < n){
        -:   29:
        4:   30:	match = false;
        4:   31:	index = distrib(twister);
        -:   32:	
       10:   33:	for(int j=0; j<indices.size(); ++j){
        -:   34:
        6:   35:		if(index == indices[j]) {
    #####:   36:			match = true;	
    #####:   37:			break;
        -:   38:		}
        -:   39:	
        -:   40:	}
        -:   41:		
        4:   42:	if(match == false){			  // accept index
        4:   43:		indices.push_back(index);
        4:   44:		++i;
        -:   45:	} 	
        -:   46:	
        -:   47:}	
        -:   48:	
        -:   49:// print out my great results
        5:   50:for(int i=0; i<indices.size(); ++i){
        4:   51:	std::cout << Xdata[ indices[i] ] << std::endl;
        -:   52:}
        -:   53:
        1:   54:return 0;
        -:   55:
        -:   56:}


And if I change n from 4 to 8, it starts to look pretty horrible.

       15:   28:while(i < n){
        -:   29:
       14:   30:	match = false;
       14:   31:	index = distrib(twister);
        -:   32:	
       54:   33:	for(int j=0; j<indices.size(); ++j){
        -:   34:
       46:   35:		if(index == indices[j]) {
        6:   36:			match = true;	
        6:   37:			break;
        -:   38:		}
        -:   39:	
        -:   40:	}
        -:   41:		
       14:   42:	if(match == false){			  // accept index
        8:   43:		indices.push_back(index);
        8:   44:		++i;
        -:   45:	} 	
Last edited on
not sure why the comments are so far off...

The formatting of more than your comments looks borked.

There are tools available to prettify code for a consistent indent look. Two IDEs I know of have that capability built-in. Visual Studio and Code::Blocks. Artistic Style, AKA AStyle, is one of said tools.

http://astyle.sourceforge.net/

I really like the VS capability, it can format on paste if that option is set. It has quite a few settings available to tweak code layout/formatting.

Copy'n'paste your code into a new VS project/solution, and here's what VS does for the formatting courtesy of my preset rules:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
#include <iostream>
#include <vector>
#include <random>
#include <iomanip>

int main(int argc, char* argv[])
{

   int randomseed = std::atoi(argv[1]);  			 // pass a random seed
   int n = 4;									// no of samples
   std::vector <double> Xdata { 1,2,3,4,5,6,7,8,9,10 };  // how to sample n elements without replacement?

   // seed random number generator and distribution
   std::mt19937 twister;

   std::seed_seq seq { 1, 20, 3200, 403, 5 * randomseed + 1, 12000, 73667, 9474 + randomseed, 19151 - randomseed };
   std::vector <std::uint32_t> seeds(1);
   seq.generate(seeds.begin(), seeds.end());
   twister.seed(seeds.at(0));

   std::uniform_int_distribution <> distrib(0, Xdata.size() - 1);  // uniform dist. over array indices

   // naive way
   int index;
   std::vector <int> indices(0);	//	stores chosen indices
   int i = 0; 					// counts no. if indices already found
   bool match;				// alerts if sampled index was already used

   while (i < n)
   {

      match = false;
      index = distrib(twister);

      for (int j = 0; j < indices.size(); ++j)
      {

         if (index == indices[j])
         {
            match = true;
            break;
         }

      }

      if (match == false)
      {			  // accept index
         indices.push_back(index);
         ++i;
      }

   }

   // print out my great results
   for (int i = 0; i < indices.size(); ++i)
   {
      std::cout << Xdata[indices[i]] << std::endl;
   }

   return 0;

}

Makes reading the code a whole lot easier IMO.

I have VS convert tabs to spaces so multiple indent code doesn't run way off to the right side, I try to stay within a right margin of 90 or so.

Now, one question about line 9.....what happens if someone starts the app and doesn't provide any command-line parameters?

An arguably better random seed setup requiring no user intervention might be:
16
17
18
19
std::seed_seq::result_type seeds[] { std::random_device {} (),
                                     std::seed_seq::result_type(std::chrono::system_clock::now().time_since_epoch().count()) };

std::seed_seq sseq(std::begin(seeds), std::end(seeds));
The quality of the randomness as well as computational efficiency is important.


Twister is one of the slower random generation available. It is also good quality, so here you have to make a choice. Slow is relative, but it adds up over billions.

searching is best avoided.
like 34/35 is a linear search for each and every go. Maybe you should store this stuff in an unordered map, or use some other storage scheme such as indices[index] is the data if it exists, or is a shortcut to the data (eg a second array set up this way to give the real index back in the real array). The idea here is to not search, but simply know, where the match is (if any). At the absolute least you should sort the data (or a copy of it, or a bunch of pointers to it) and binary search instead of hunt and peck, assuming the data is much, much larger than your examples here (?). If its really < 100 items, linear does not matter but then again, its not what you said (computationally efficient).

Note that map's [] automatically adds a value if it isn't already in the container. Also if you decide to continue to use a vector, you should have a reserve for how many expected items to reduce resize operations.

Iteration is the way to sample data IMHO.
If the data is small, make a copy of it, randomly re-order the data, and iterate it to take samples. If the data is large enough that copying is slow, then set up pointers to the read data, randomly reorder the pointers, and do the same idea.
Last edited on
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
#include <iostream>
#include <vector>
#include <random>
#include <ctime>
#include <algorithm>
#include <numeric>
using namespace std;

int main()
{
   const int SIZE = 10;
   const int n = 4;
   
   vector<double> Xdata( SIZE );
   iota( Xdata.begin(), Xdata.end(), 1 );

   mt19937 gen( time( 0 ) );
   shuffle( Xdata.begin(), Xdata.end(), gen );
   for ( int i = 0; i < n; i++ ) cout << Xdata[i] << " ";
}
Last edited on
@Salem.c, lastchance
Thanks for the shuffle recommendation, I'll give it a try!
edit: I just realized that shuffle can be much less efficient than my naive approach if n is much smaller than the size of the data.

@George P.
Did you accidentally repost my code instead of the result of your prettifying?

I was using an input random seed in order to get reproducible results. For now, I will be the only one using this, so I rely on me knowing how to operate the code x)

@jonnin
Thanks for the hints. Yes, the algorithm should also scale for (much) larger data arrays. I don't quite understand what is meant by "shortcut" to the data... but I do understand that ordering makes sense and that binary search makes it better already.
Last edited on
If n is fairly small compared to size (at least less than 1/2) then it might be more efficient to do something like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
#include <iostream>
#include <bitset>
#include <random>
using namespace std;

const int Size = 100, Choose = 30;

int main() {
    mt19937 rnd(random_device{}()); // not demonstrating great seeding here
    uniform_int_distribution<> dist(0, Size - 1);
    bitset<Size> b;
    vector<int> v;

    for (int i = 0; i < Choose; ++i) {
        int r;
        while (b[r = dist(rnd)]) ;
        b.set(r);
        v.push_back(r);
    }

    for (int n: v) cout << n << ' ';
    cout << '\n';
}

Last edited on
I've forgotten the name of the algorithm on which std::shuffle is based, but if you used the first n terms of that, a kind of "partial_shuffle", then you ought to be able to get n distinct indices straight off with no repeats. This is where I need Knuth's book ...
Could you be thinking of Fisher-Yates? It looks like this:
1
2
for (int i = 0; i < xs.size() - 1; ++i)
  swap(xs[i], xs[random_integer_in_halfopen_interval(i, xs.size())]);
https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle

However, there's already a reservoir sampling algorithm in the standard library called std::sample:
https://en.wikipedia.org/wiki/Reservoir_sampling
https://en.cppreference.com/w/cpp/algorithm/sample

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#include <iostream>
#include <random>
#include <vector>
#include <algorithm>

int main()
{
    std::mt19937 rbg { 42u }; 
    
    std::vector<int> population(1000);
    std::iota(population.begin(), population.end(), 0);
    std::vector<int> samples(10);
    std::sample(population.begin(), population.end(), samples.begin(), samples.size(), rbg);
    
    for (int sample: samples) std::cout << sample << ' '; 
    std::cout << '\n';
}
Last edited on
PhysicsIsFun, look at your code, and what I posted.

Yes, it IS your code, with some indentation formatting to make it a bit more readable.

While being better formatted doesn't alter the code one bit, it makes reading easier. You do want others to be able to read and understand your code, don'cha?
I didn't think about the shuffle results being used while the shuffle is proceeding and stopping the shuffle early.

Here's a serialized, truncated shuffle, now with super-seeding.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
#include <iostream>
#include <random>
#include <array>
#include <numeric>
using namespace std;

#if defined(__linux__) || defined(linux) || defined(__linux)

#include <fstream>
size_t sysrandom(void* dst, size_t dstlen) {
    char* buffer = reinterpret_cast<char*>(dst);
    std::ifstream stream("/dev/urandom", stream.binary);
    stream.read(buffer, dstlen);
    return dstlen;
}

#elif defined(_WIN32)

#include <wincrypt.h>

bool acquire_context(HCRYPTPROV *ctx) {
    if (!CryptAcquireContext(ctx, 0, 0, PROV_RSA_FULL, 0))
        return CryptAcquireContext(ctx, 0, 0, PROV_RSA_FULL, CRYPT_NEWKEYSET);
    return true;
}
size_t sysrandom(void* dst, size_t dstlen) {
    HCRYPTPROV ctx;
    if (!acquire_context(&ctx))
        throw std::runtime_error("Unable to initialize Win32 crypt library.");
    BYTE* buffer = reinterpret_cast<BYTE*>(dst);
    if(!CryptGenRandom(ctx, dstlen, buffer))
        throw std::runtime_error("Unable to generate random bytes.");
    if (!CryptReleaseContext(ctx, 0))
        throw std::runtime_error("Unable to release Win32 crypt library.");
    return dstlen;
}

#endif

template<typename T>  // can use for mt19937 or mt19937_64
void seed_mt19937(T& rnd) {
    array<typename T::result_type, T::state_size> state;
    sysrandom(state.begin(), state.size() * sizeof(typename T::result_type));
    seed_seq s(state.begin(), state.end());
    rnd.seed(s);
}

const int Size = 100, Choose = 30;

int main() {
    mt19937 rnd;
    seed_mt19937(rnd);

    vector<int> v(Size);
    iota(v.begin(), v.end(), 0);

    for (int i = Size, j = 0; j < 30 && --i > 0; ++j) {
        swap(v[i], v[uniform_int_distribution<>(0, i)(rnd)]);

        cout << v[i] << ' ';
    }
    cout << '\n';
}

Last edited on
mbozzi wrote:
Could you be thinking of Fisher-Yates?

I was! But I had a senior moment and couldn't remember the name. Thank-you to you and @DizzyDon for the slick implementation.

I was unaware of std::sample - that would be a more direct solution in C++17 and above.



My implementation of a "partial_shuffle", based on the posts of @mbozzi and @DizzyDon. (I'm filling from the opposite end of the array to @DizzyDon.)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
#include <iostream>
#include <vector>
#include <random>
#include <ctime>
#include <algorithm>
#include <numeric>
using namespace std;

mt19937 gen( time( 0 ) );

template<typename T>
void partial_shuffle( vector<T> &V, int n )          // n steps of Fisher-Yates algorithm (assumes n <= V.size())
{                                                    // (https://www.cplusplus.com/forum/beginner/282421/#msg1222335 and following)
   for ( int i = 0; i < n; i++ ) swap( V[i], V[uniform_int_distribution<>(i,V.size()-1)( gen )] );
}

int main()
{
   const int SIZE = 100;
   const int n = 4;
   
   vector<double> X( SIZE );
   iota( X.begin(), X.end(), 1 );

   partial_shuffle( X, n );
   for ( int i = 0; i < n; i++ ) cout << X[i] << " ";
}
Last edited on
Topic archived. No new replies allowed.