Genomic Range Query

Forum

Forum
General C++ Programming
Genomic Range Query

Genomic Range Query

A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T have impact factors of 1, 2, 3 and 4, respectively. You are going to answer several queries of the form: What is the minimal impact factor of nucleotides contained in a particular part of the given DNA sequence?

The DNA sequence is given as a non-empty string S = S[0]S[1]...S[N-1] consisting of N characters. There are M queries, which are given in non-empty arrays P and Q, each consisting of M integers. The K-th query (0 ≤ K < M) requires you to find the minimal impact factor of nucleotides contained in the DNA sequence between positions P[K] and Q[K] (inclusive).

For example, consider string S = CAGCCTA and arrays P, Q such that:

P[0] = 2 Q[0] = 4
P[1] = 5 Q[1] = 5
P[2] = 0 Q[2] = 6
The answers to these M = 3 queries are as follows:

The part of the DNA between positions 2 and 4 contains nucleotides G and C (twice), whose impact factors are 3 and 2 respectively, so the answer is 2.
The part between positions 5 and 5 contains a single nucleotide T, whose impact factor is 4, so the answer is 4.
The part between positions 0 and 6 (the whole string) contains all nucleotides, in particular nucleotide A whose impact factor is 1, so the answer is 1.
Write a function: vector<int> solution(string &S, vector<int> &P, vector<int> &Q);

that, given a non-empty string S consisting of N characters and two non-empty arrays P and Q consisting of M integers, returns an array consisting of M integers specifying the consecutive answers to all queries.

Result array should be returned as a vector of integers.

For example, given the string S = CAGCCTA and arrays P, Q such that:

P[0] = 2 Q[0] = 4
P[1] = 5 Q[1] = 5
P[2] = 0 Q[2] = 6
the function should return the values [2, 4, 1], as explained above.

Write an efficient algorithm for the following assumptions:

N is an integer within the range [1..100,000];
M is an integer within the range [1..50,000];
each element of arrays P and Q is an integer within the range [0..N - 1];
P[K] ≤ Q[K], where 0 ≤ K < M;
string S consists only of upper-case English letters A, C, G, T.

I wrote this correct but inefficient algorithm for that:

std::vector<int> solution(std::string& S, std::vector<int>& P, std::vector<int>& Q) {

	std::vector<int> secVec, minSeq;

	for (const auto& s : S)
		switch (s) {
		case 'A': secVec.push_back(1); break;
		case 'C': secVec.push_back(2); break;
		case 'G': secVec.push_back(3); break;
		case 'T': secVec.push_back(4); break;
		}

	for (size_t i = 0; i < P.size(); ++i) {
		int min_val{ secVec[P[i]] };
		for (size_t j = P[i]; j <= Q[i]; ++j)
			min_val = secVec[j] < min_val ? secVec[j] : min_val;
		minSeq.push_back(min_val);
	}
	return minSeq;
}

int main() {
	std::string str{ "CAGCCTA" };
	std::vector P{ 2,5,0 }, Q{ 4,5,6 };
	
	for (int s : solution(str, P, Q))
		std::cout << s << ' ';
	std::cout << '\n';

	return 0;
}

Do you have an idea how to create a more efficient algorithm for the exercise, please?

Last edited on

keskiverto (10365)

One trivial detail in that current algorithm is that your push_backs will probably invoke many vector reallocations several times.
That is unnecessary since you have enough data to preallocate.

againtry (2313)

https://massivealgorithms.blogspot.com/2015/07/rafals-blog-codility-genomic-range-query.html

lastchance (6980)

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
using namespace std;

vector<int> solution( const string &S, const vector<int> &P, const vector<int> &Q )
{
   string SI = S;
   replace( SI.begin(), SI.end(), 'A', '0' );
   replace( SI.begin(), SI.end(), 'C', '1' );
   replace( SI.begin(), SI.end(), 'G', '2' );
   replace( SI.begin(), SI.end(), 'T', '3' );

   int N = S.size();
   vector< vector<int> > bases(N,vector<int>(4,-1));       // will hold the LATEST index with these bases
   bases[0][SI[0]-'0'] = 0;
   for ( int i = 1; i < N; i++ )
   {
      bases[i] = bases[i-1];
      bases[i][SI[i]-'0'] = i;
   }

   int M = Q.size();
   vector<int> result( M );
   for ( int k = 0; k < M; k++ )
   {
      int p = P[k], q = Q[k];
      for ( int j = 0; j < 4; j++ )
      {
         if ( bases[q][j] >= p )
         {
            result[k] = j + 1;
            break;
         }
      }
   }
   return result;
}


int main()
{
   string S = "CAGCCTA";
   vector<int> P = { 2, 5, 0 }, Q = { 4, 5, 6 };
   for ( int e : solution( S, P, Q ) ) cout << e << ' ';
}

2 4 1

Time complexity O(max(N,M)), or O(N+M) if you must.
Space complexity O(N).

Last edited on

mbozzi (3911)

S.A.:
https://en.wikipedia.org/wiki/Range_minimum_query

againtry (2313)

#include <iostream>
#include <string>
#include <vector>

std::vector<int> solution(std::string&, std::vector<int>&, std::vector<int>&);

int main()
{
    std::string S = "CAGCCTA";
    std::vector<int> P{2,5,0};
    std::vector<int> Q{4,5,6};
    
    for( auto i:solution(S, P, Q) )
    {
        std::cout << i << ' ';
    }
    std::cout << '\n';
    
    return 0;
}

std::vector<int> solution(std::string& S, std::vector<int>& P, std::vector<int>& Q)
{
    int value{0};
    std::vector<int> result;
    
    for(int i = 0; i < P.size(); i++)
    {
        char MIN{'Z'};
        for( char ch: S.substr(P[i], Q[i] - P[i] + 1) )
        {
            if(ch < MIN)
                MIN = ch;
        }
        
        switch(MIN)
        {
            case 'A':
                value = 1;
                break;
            case 'C':
                value = 2;
                break;
            case 'G':
                value = 3;
                break;
            case 'T':
                value = 4;
                break;
            default:
                std::cout << "Error\n";
        }
        result.push_back(value);
    }
    return result;
}

2 4 1 
Program ended with exit code: 0

frek (576)

@mbozzi, I read that page but since most of it is formulated and not code I didn't understand a good amount of that, but guess that lastchance's method is the only way to create a non-recursive algorithm with O(N) time complexity for the exercise. I tested his algorithm on paper as well and it works but couldn't get the idea behind that! Therefore I suppose I need just to memorize his code for situations where a Range minimum/maximum query is questioned in the exercise. Agree?

Last edited on

keskiverto (10365)

No. Copy-paste without comprehension ... one does not know whether there is a need to adapt rather than use verbatim, if one has no clue about what and why some code is,

frek (576)

OK, thank you all for your answers. One thing keskiverto, I don't think the vector allocates new memory (from heap) each time a push_back is performed in my code, because the vector firstly allocates 8 (size of vector's type) and when it's full it then reallocates double the size of vector (or another 8) if my recollection on creating a user-defined vector from scratch is accurate.

@lastchance, did you write the above algorithm from scratch, or is it something you'd already studied, please?

againtry (2313)

https://www.youtube.com/watch?v=4SyckIAmYXk

lastchance (6980)

@frek,
No, I wrote it from scratch. I wasn't aware of formal methods for solving it before and it simply reflects how I thought about the problem after trying a few examples with pen and paper. I originally had separate A, C, G, T arrays, but there was so much repetition that I eventually settled on a single 2-d array.

My take is that if you want to reduce the time-complexity from quadratic (M*N or N^2) to linear (M+N or N) then you are probably going to have to do some pre-computation. The fact that you are going to do multiple queries about a single object also suggests the same.

keskiverto (10365)

frek wrote:
I don't think the vector allocates new memory (from heap) each time a push_back is performed in my code, because the vector firstly allocates 8 (size of vector's type) and when it's full it then reallocates double the size of vector (or another 8)

All C++ Standard Library implementations do not use exact same reallocation policy. Even if they did, if N can be up to 100'000, then reallocations are likely.

However, if we first call vector::reserve() with the N that we do know already at start, then we know for certain that no vector::push_back() will invoke reallocation. Certainty is way better than "maybe".

frek (576)

@againtry, thanks for the video.

This mixture of the two algorithms has the same efficiency and is shorter.

#include <iostream>
#include <vector>
#include <string>

std::vector<int> solution(std::string& S, std::vector<int>& P, std::vector<int>& Q) {

	std::vector<int> result;
	std::vector< std::vector<int> > bases(4, std::vector<int>(S.size(), 0));
	int a{ 0 }, c{ 0 }, g{ 0 }, t{ 0 };

	for (size_t i = 0; i < 4; ++i)
		for (size_t j = 0; j < S.size(); ++j)
			switch (i) {
			case 0: if (S[j] == 'A') ++a; bases[i][j] = a; break; 
			case 1: if (S[j] == 'C') ++c; bases[i][j] = c; break; 
			case 2: if (S[j] == 'G') ++g; bases[i][j] = g; break;
			case 3: if (S[j] == 'T') ++t; bases[i][j] = t; break;
			}

	for (size_t i = 0; i < P.size(); ++i)
		     if (bases[0][P[i]] < bases[0][Q[i]] || S[P[i]] == 'A') result.push_back(1);
		else if (bases[1][P[i]] < bases[1][Q[i]] || S[P[i]] == 'C') result.push_back(2);
		else if (bases[2][P[i]] < bases[2][Q[i]] || S[P[i]] == 'G') result.push_back(3);
		else if (bases[3][P[i]] < bases[3][Q[i]] || S[P[i]] == 'T') result.push_back(4);
	return result;
}

int main() {
	std::string str{ "CAGCCTA" };
	std::vector P{2,5,0}, Q{4,5,6};

	for (int s : solution(str, P, Q))
  	 std::cout << s << ' ';
	std::cout << '\n'; 

	return 0;
}

He says: notice that this method allows me to check the variance of the letter occurrences without visiting all the array. The question is, how or why the variance of letter occurrences can be the idea of solving the exercise!? Could you explain it?

Last edited on

againtry (2313)

@frek You might need to play the video a couple of times. As you probably know yours gets a 100% rating at Codility the same as his.

(Mine gets a dismal 62%, @lastchances will probably get 100% except for the replace function not being acceptable to the Codility C++ version limitation).

As far as the comment about not visiting all the array, I interpret it that the string only has to be visited once, i.e. to prepare the initial cumulative numbers. Once that's done only the end points P,Q have to be checked, not what happens between them, hence 'without visiting the whole array'.

frek (576)

Assessing what was already written using paper and pen it turned out to be clear at last! It was my first experience of such exercise so some time was needed. We record the variance of each letter in a row (ascendingly for minimum range/descending for maximum range) to be aware each occurrence of that.
Thank you all.

Topic archived. No new replies allowed.