Missing Integer

Write a function:

int solution(vector<int> &A);

that, given an array A of N integers, returns the smallest positive integer (greater than 0) that does not occur in A.

For example, given A = [1, 3, 6, 4, 1, 2], the function should return 5.

Given A = [1, 2, 3], the function should return 4.

Given A = [−1, −3], the function should return 1.

Write an efficient algorithm for the following assumptions:

N is an integer within the range [1..100,000];
each element of array A is an integer within the range [−1,000,000..1,000,000].


I wrote this and guess it's O(N):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
int solution(std::vector<int>& A) {
	if (A.size() == 1)
		return A[0] == 1 ? 2 : 1;

	auto vecSize = std::max_element(A.begin(), A.end());

	if (*vecSize <= 0) return 1;

	std::vector<int> boolVals(*vecSize + 1, 0);

	for (std::size_t i = 0; i < A.size(); ++i)
		if (A[i] > 0)
			boolVals[A[i]] = 1;

	for (std::size_t i = 1; i < boolVals.size(); ++i)
		if (boolVals[i] == 0) 
			return i;

	return boolVals.size();
}

int main() {
	std::vector A{1,3,6,4,1,2};

	std::cout << solution(A) << '\n';

	return 0;
}
Last edited on
Yes, it's O(N) in time.

If you use a vector<bool> for boolVals[] you will use about 1/32 the amount of space.

Since the largest possible value of an element of A is 1000000 you might want to cap the size of your bool array at A.size()+2.


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

int solution( const vector<int>& A )
{
   int SIZE = min( int( A.size() ), *max_element( A.begin(), A.end() ) );
   if ( SIZE <= 0 ) return 1;
   else             SIZE += 2;      // allow for 0 and "one element after"
   
   vector<bool> boolVals( SIZE, false );
   for ( int e : A ) if ( e > 0 ) boolVals[e] = true;

   for ( int i = 1; i < SIZE; i++ ) if ( !boolVals[i] ) return i;
}

int main()
{
   vector<int> A = { 1, 3, 6, 4, 1, 2 };
   cout << solution( A ) << '\n';
}
Last edited on
If you use a vector<bool> for boolVals[] you will use about 1/32 the amount of space.
It's been a typo. Please look at the name of the array there, "boolVals" (meaning Boolean values), I meant: std::vector<bool> boolVals(*vecSize + 1, 0); :)

Rather simplified:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
int solution(std::vector<int>& A) {
	if (A.size() == 1)
		return A[0] == 1 ? 2 : 1;

	std::vector<bool> boolVals(A.size() + 1, 0);    // Solution #3

	for (std::size_t i = 0; i < A.size(); ++i)
		if (A[i] > 0 && A[i] <= A.size())
			boolVals[A[i]] = 1;

	for (std::size_t i = 1; i < boolVals.size(); ++i)
		if (boolVals[i] == 0) 
			return i;

	return boolVals.size();
}

int main() {
	std::vector A{1,3,6,4,1,2};
	std::cout << solution(A) << '\n';

	return 0;
}

Since the largest possible value of an element of A is 1000000 you might want to cap the size of your bool array at A.size()+2.
I didn't get that part. The bool array is defined based on the size of the vector A.

I should thank you for the algorithms you used in the previous tasks since I learnt them and they're very effective.
Last edited on
The bool array is defined based on the size of the vector A.


Not entirely. It depends on BOTH the size of the array AND on the maximum value of elements in the array.

If A has size 3 and maximum element 10 then the first missing integer must be less than 4
If A has size 10 and maximum element 2 then the first missing element can be at most 3
Last edited on
If A has size 3 and maximum element 10 then the first missing integer must be less than 4
Take a look at line 15 of the code above, please.

If A has size 10 and maximum element 2 then the first missing element can be at most 3
Take a look at the way boolVals is filled (lines 8-9).

If we still don't understand each other well, please give me a or two examples to test them through the last algorithm
@Frek - you are completely missing the point.

The size of the boolVals array depends on the lesser of TWO things; the size of the original array and the maximum value it holds.

Here are my examples again:
- If A has size 3 and maximum element 10 then the first missing integer must be less than 4 (in this example the size of the boolVals array is determined by the size of the original array).
- If A has size 10 and maximum element 2 then the first missing element can be at most 3 (in this example the size of the boolVals array is determined by the maximum value that the original array holds)

Your line 15 is not pertinent to the issue - you can choose to exit after the loop if nothing satisfies it (your choice) or guarantee to exit during it by having an array one element longer (my choice). It makes a difference of just 1 in the size of the array and is irrelevant to the problem.
If A has size 3 and maximum element 10
Something like {1,4,10}
then the first missing integer must be less than 4
Right, here 2.
(in this example the size of the boolVals array is determined by the size of the original array).
Yes, the bool array will first be {0, 0, 0, 0} then only the position one will be set to 1 (line 8). So staring from the beginning, our result will be 2, as expected.

If A has size 10 and maximum element 2
Something like {-4, 1, -3, -3, 0, -5, 1, 2, -1, -6}
then the first missing element can be at most 3
Right. As mentioned above we define the bool array: {0,0,0,0,0,0,0,0,0,0, 0}, and set positions that are greater than 0 and less than or equal to 10: {0,1,2,0,0,0,0,0,0,0, 0}. So again, starting from the beginning we will have 3 as the result.

(in this example the size of the boolVals array is determined by the maximum value that the original array holds)
It was the case in the penultimate algorithm. In the last one that issue is not taken into account.

Likely I'm too tried that can't get your point that's why I asked you for an example. I meant an array example. For instance, give me an array given to the project it will send out a not completely right answer.
A side question:

Suppose we have three two-dimensional arrays with the same size, M, named a, b, c. Then we have the following two snipped codes:
1
2
3
for(int i=0; i<M; ++i)
 for(int j=0; j<M; ++j)
   c[i][j] = a[i][j] + b[i][j]


1
2
3
for(int j=0; j<M; ++j)
 for(int i=0; i<M; ++i)
   c[i][j] = a[i][j] + b[i][j]

Is there any performance difference between these two, please? And the more important question is the reason for that difference, please?
Last edited on
> Is there any performance difference between these two?

assuming that M is not too small, there may or may not be a difference;
depends on the architecture and the optimiser.


> reason for that difference

More cache misses in the worse case


> reason for that non-difference

The optimiser may exploit the "as-if" rule: https://en.cppreference.com/w/cpp/language/as_if
For example: https://gcc.godbolt.org/z/9ajqcqsM9
>assuming that M is not too small, there may or may not be a difference;
depends on the architecture and the optimiser.

1) Consider M is not too small and according the architecture and optimiser they're different. What's is that difference in action, please?

2) As for the as-if rule, I tried to study it through cppreference. But it's too complex/advanced for me this time. I took advantages of this page instead: https://en.wikipedia.org/wiki/As-if_rule
Fine to you too, please?

>For example: https://gcc.godbolt.org/z/9ajqcqsM9

3) You used some optimising flag for the compiler there, right?
> What's is that difference in action, please?

If the optimiser recognises that
1. the 'observable behaviour' of the program would be the same for both the loop constructs
2. one loop construct gives better performance

it is allowed, but not required, to perform a change in the program code, so that it generates the code that gives better performance. The difference lies in whether the optimisation is applied or not.


> Fine to you too, please?

Yes, it should be good enough for a starter.


> You used some optimising flag for the compiler there

Yes, -O3 -march=core-avx2
Whether some optimizing flag is exploited or not the gofbolt shows the same set of Assembly code for both loop constructs each time: https://gcc.godbolt.org/z/6G7866868

> one loop construct gives better performance

Which loop and why, please?
I mean what happens internally or how does the compiler execute the instruction (on memory) for each loop differently? I guess the difference is in accessing the elements of the loops but not sure how, or what that different accessing in on the hardware.
> Whether some optimizing flag is exploited or not the gofbolt shows the same set of Assembly code

The "as-if" rule can always kick in; the language allows it.


> Which loop and why, please?

In general, a loop that traverses the elements of an array in the order in which it is laid out in memory (for example, traverses a row-major array row by row) gives better performance. Successive elements of the array can be accessed from the same cache line (cache hit), and if arithmetic operations are involved, vectorisation would be a possibility.
>In general, a loop that traverses the elements of an array in the order in which it is laid out in memory

How to recognize it from the given code snippets below, if the whole thing offered to us is them, please?

1)
1
2
3
for(int i=0; i<M; ++i)
 for(int j=0; j<M; ++j)
   c[i][j] = a[i][j] + b[i][j]

2)
1
2
3
for(int j=0; j<M; ++j)
 for(int i=0; i<M; ++i)
   c[i][j] = a[i][j] + b[i][j]


Here, both instructions (of the loops) start by elements of array a then adding them to those of b and finally store in c.
Only the order of for loops in each snipped differs from the other.
Last edited on
> How to recognize it from the given code snippets below?

It ought to be quite obvious.

There is a picture of access patterns on this page:
https://onestepcode.com/array-reference-performance-row-major-vs-column-major-order/
Very good info I learnt from you, thanks.
Row-major order is used in C/C++/Objective-C (for C-style arrays), PL/I, Pascal, Speakeasy, SAS, and Rasdaman. So the first version is better in terms of performance since it's accessing data in row-major order.

>The "as-if" rule can always kick in; the language allows it.

You mean, whether we use an installed compiler or an online one, compiler optimizations take action in the code even if we disable them manually!?
> compiler optimizations take action in the code even if we disable them manually!?

With appropriate compiler options, we can disable some (non-mandatory) optimisations.

For instance, with GNU or compatible compilers, -Og is used to generate code to support debugging:

-Og
Optimize debugging experience....

Like -O0, -Og completely disables a number of optimization passes... Otherwise -Og enables all -O1 optimization flags except for those that may interfere with debugging.
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html


Or we could use very specific compiler options. For example:

-floop-interchange
Perform loop interchange outside of graphite. This flag can improve cache performance on loop nest and allow further loop optimizations, like vectorization, to take place.

(Use -fno-loop-interchange to disable this optimisation)
There is a picture of access patterns on this page

Thank you for that, JLBorges! A while back I interacted with someone who kept insisting there was no performance hit/different with row-major vs. column-major 2D access. If there was a difference column-major was superior performance with normal, default, compiler options.

*saved*
Topic archived. No new replies allowed.