fast comparisons of strings

I have to compare several times two vectors of strings. Each string contains "0", "1" or "2".

Ex: vector1 = "0001002001" "0200011001" and vector2 = "0010202011" "0201001001"

I want my function to return a third vector which contains the highest number for each character, here: vector 3 ="0011202011" "0201011001"

By construction, for a given position it is not possible to have "1" for vector1 and "2" for vector2 (or the converse situation). So a "1" always faces "0" or "1" and "2" always faces "0" or "2",

What I did is:

if(vector1 == vector2) return(vector1);
for (i=0; i<vector1.size(); i++) {
element1=vector1[i];
element2=vector2[i];
if (element1.compare(element2) != 0) {
for (j=0; j<element1.length(); j++) {
if(element1.at(j)<=element2.at(j))
element3.push_back(element2.at(j)); //note that comparing strings works, so I give up on fighting to convert str to int and int to str
else element3.push_back(element1.at(j));
}
vector3.push_back (element3);
element3.clear();
}
else vector3.push_back (element1);
}
return(vector3);

It works fine, but my problem is that my vectors have all 1000 elements composed each of 500 characters (so they are 1000*500 chars long) and comparisons are far too slow,,,

Note that I would appreciate solutions not involving very special library since the C++ code is called by another language (R in my case), Thanks a lot!
Last edited on
Start by not making useless copies. element1 and element2 don't need to exist. Or at least not as objects.
at() performs bounds checking, so that can slow down code a lot if used inside a loop.

Your inner for looks like just a wheel reinvention of std::string::operator<=(). So why not use that, instead?
I wonder... Would it be faster if he made a global mapping like this...

1
2
3
4
5
const char result[3][3]={
    '0','1','2',
    '1','1','2',
    '2','2','2'
};

(i.e. (0,0) -> '0', (0,1) -> '1', (0,2) -> '2' etc...)

...and then did something like this:

element3.push_back( result[element1[j]-'0'][element2[j]-'0'] );

??? I think it would save him some time...
(since it eliminates the need for comparisons...)
Last edited on
Thanks hellios, removing "element" objects and replacing .at() by [] saved me 22% of computation time. Sorry for trying to reinvent the wheel, I rekon it's a usual newbies fault, still I don't understand how to compare strings as a whole because it returns which string is the smaller and does not return results of each char by char comparison within strings (needed in my case)... could you be more explicit, please?

Also, do you think that converting everything into ternaries could help to save lot's of computation time?
Don't think this will improve much the speed of your code but instead of making element3 a vector make it a char since that is all what you are copying.
Thanks m4ster r0shi, but global mapping does not save time :-(
Thanks RedX, but element3 is not a vector in my code, it's a string and it's what I need since it is composed of several chars.
Mmm... since element3 is a string, wouldn't it be faster to initialize its size to 10 characters and then modify each character using the [] operator? Using push_back in general takes more time than the [] operator because push_back modifies the size of the string and thus the cstr representation must be resized (that is be deleted and then newed again). But that is unless the cstr representation is calculated only when you ask for it (and not every time you change your string). Anyway, try doing it and see if it's faster.
Last edited on
Thanks again m4ster r0shi, declaring element3 as a char [] instead of a std::string save me around half of the time!

Based on all previous comments, my code is now:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
  std::vector<std::string> my_vector1 = Rcpp::as< std::vector<std::string> >(x); 
  std::vector<std::string> my_vector2 = Rcpp::as< std::vector<std::string> >(y);
  std::vector<std::string> my_vector3; 
  size_t i, j;
  char element3 [my_vector1[1].length()];
  const char result [3] [3] = {
    '0','1','2',
    '1','1','2',
    '2','2','2'};
  if(my_vector1 == my_vector2) return(Rcpp::wrap( my_vector1 ));
  for (i=0; i<my_vector1.size(); i++) {
    if (my_vector1[i].compare(my_vector2[i]) != 0) {
      for (j=0; j<my_vector1[i].length(); j++) element3[j] = result[my_vector1[i][j]-'0'][my_vector2[i][j]-'0'];
      my_vector3.push_back (element3);
      }
    else my_vector3.push_back(my_vector1[i]);
    }
  return(Rcpp::wrap( my_vector3 ));


Now, 5000 comparisons of 2 vectors of 1000 x 100 chars each takes me around 5.9 seconds on my laptop, it's much better than the 11 seconds before your answers. Still, I have billions of pairwise comparisons to do, so if somebody has still any idea, I am ready to try!

Thanks a lot!!!
Last edited on
That's not what he said!
std::string element3(my_vector1[i].length());
Creating variable-sized arrays (line 5) is illegal in C++.

Are you using optimizations? Since that compiled, I'm guessing you're using GCC. You can make the compiler optimize by adding -O3 to the command line.
Last edited on
Sorry Helios for this misunderstanding, I corrected my post, I indeed had not made a variable array... I am going to try to play with GCC, it was compiled dynamically by my other program, but I will try to compile it properly. Thanks a lot!
By construction, for a given position it is not possible to have "1" for vector1 and "2" for vector2 (or the converse situation). So a "1" always faces "0" or "1" and "2" always faces "0" or "2"

This can be used to make the calculation of the maximum element even faster! Take a look at this, I've gathered a couple of ways to do what you want and compared them to each other:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
#include <iostream>
#include <cstdlib>
#include <vector>
#include <string>
using namespace std;

void random_sv1(vector<string> & sv, int size);
//randomly generate a string vector

void random_sv2(vector<string> & sv, const vector<string> & sv1);
//randomly generate a string vector, given another one, so that
//no '1's are paired with '2's

void calculate(const vector<string> & sv1, const vector<string> & sv2, vector<string> & sv3);
//classic way -> character comparison

void calculate2(const vector<string> & sv1, const vector<string> & sv2, vector<string> & sv3);
//smarter way -> mapping (comparison free)

void calculate3(const vector<string> & sv1, const vector<string> & sv2, vector<string> & sv3);
//yet smarter way -> take advantage of the special condition that a '1' cannot be paired with a '2'
//you see, since you have this condition, you can make the comparison simpler, like this:
//if (c1=='0') (c3=c2) //since c2 >= '0'
//else c3=c1; //because if c1!='0' then c2 is EITHER '0' OR c1, both of which are <= than c1 ;)
//(simply put, the maximum element is the one that is not '0')

void calculate4(const vector<string> & sv1, const vector<string> & sv2, vector<string> & sv3);
//I thought this here would be faster but it's not... :/
//I took advantage of both the special condition and the fact that the binary representations
//of '0', '1', '2' are 110000, 110001 and 110010, respectively,
//to reduce the maximum element calculation to c3=c1|c2.

void print(const vector<string> & sv);

typedef void (*calc_func)(const vector<string> &,const vector<string> &,vector<string> &);

const calc_func calc_ftable[]={calculate,calculate2,calculate3,calculate4};

int main()
{
    srand(time(0));

    const int size=500000;
    bool print_results=false;

    vector<string> sv1, sv2, sv3;

    cout << "initializing string vectors...\nplease wait...\n" << endl;
    random_sv1(sv1,size);
    random_sv2(sv2,sv1);

    int start;
    int stop;

    int i;

    for (i=0; i<4; i++)
    {
        cout << "calculating with method no" << i+1 << endl;

        start=clock();
        cout << "start: " << start << endl;

        calc_ftable[i](sv1,sv2,sv3);

        stop=clock();
        cout << "stop: " << stop << endl;

        cout << "time taken: " << (stop-start)*1000/double(CLOCKS_PER_SEC) << " ms\n" << endl;

            if (print_results)
            {
                cout << "sv1:\n";
                print(sv1);
                cout << "\nsv2:\n";
                print(sv2);
                cout << "\nsv3:\n";
                print(sv3);
                cout << endl;
            }
    }

    cout << "hit enter to quit\n";
    cin.get();

    return 0;
}

void random_sv1(vector<string> & sv, int size)
{
    int i;
    int j;
    string str;

    sv.resize(size);
    str.resize(10);

    for (i=0; i<size; i++)
    {
        for (j=0; j<10; j++)
        {
            str[j]=rand()%3+'0';
        }
        sv[i]=str;
    }
}

void random_sv2(vector<string> & sv, const vector<string> & sv1)
{
    int i;
    int j;
    string str;
    char c;
    int temp;

    int size=sv1.size();

    sv.resize(size);
    str.resize(10);

    for (i=0; i<size; i++)
    {
        for (j=0; j<10; j++)
        {
            c=sv1[i][j];

            if (c=='0') str[j]=rand()%3+'0';
            else
            {
                temp=rand()%2;
                if (c=='2') temp*=2;

                str[j]=temp+'0';
            }
        }
        sv[i]=str;
    }
}

void calculate(const vector<string> & sv1, const vector<string> & sv2, vector<string> & sv3)
{
    int i;
    int j;
    string str;
    char c1,c2;

    int size=sv1.size();

    sv3.resize(size);
    str.resize(10);

    for (i=0; i<size; i++)
    {
        for (j=0; j<10; j++)
        {
            c1=sv1[i][j];
            c2=sv2[i][j];

            if (c1>=c2) str[j]=c1;
            else str[j]=c2;
        }
        sv3[i]=str;
    }
}

void calculate2(const vector<string> & sv1, const vector<string> & sv2, vector<string> & sv3)
{
    static const char maximum[3][3]=
    {
        '0','1','2',
        '1','1','2',
        '2','2','2'
    };

    int i;
    int j;
    string str;

    int size=sv1.size();

    sv3.resize(size);
    str.resize(10);

    for (i=0; i<size; i++)
    {
        for (j=0; j<10; j++)
        {
            str[j]=maximum[sv1[i][j]-'0'][sv2[i][j]-'0'];
        }
        sv3[i]=str;
    }

}

void calculate3(const vector<string> & sv1, const vector<string> & sv2, vector<string> & sv3)
{
    int i;
    int j;
    string str;
    char c;

    int size=sv1.size();

    sv3.resize(size);
    str.resize(10);

    for (i=0; i<size; i++)
    {
        for (j=0; j<10; j++)
        {
            c=sv1[i][j];
            if (c=='0') str[j]=sv2[i][j];
            else str[j]=c;
        }
        sv3[i]=str;
    }
}

void calculate4(const vector<string> & sv1, const vector<string> & sv2, vector<string> & sv3)
{
    int i;
    int j;
    string str;
    char c;

    int size=sv1.size();

    sv3.resize(size);
    str.resize(10);

    for (i=0; i<size; i++)
    {
        for (j=0; j<10; j++)
        {
            str[j]=(sv1[i][j])|(sv2[i][j]);
        }
        sv3[i]=str;
    }
}

void print(const vector<string> & sv)
{
    int i;
    int size=sv.size();

    for (i=0; i<size; i++)
    {
        cout << "str_vector[" << i << "]=" << sv[i] << endl;
    }
}


EDIT: I would also suggest that you get rid of these comparisons:

if(my_vector1 == my_vector2) //...

if (my_vector1[i].compare(my_vector2[i]) != 0) //...

They really slow you down and they're practically unnecessary.
Last edited on
Thank you so much m4ster r0shi!!!!!!!!!! Indeed, the idea being calculate3 is simple and brilliant, I am now stuck in a meeting but I will try all that ASAP. I might keep the comparisons you suggest to remove since in my case the elements of vectors are not 10 but potentially up to 1000 chars long (I will try both).
Topic archived. No new replies allowed.