fastest find and replace for a string

Pages: 123
This is how I'm doing it; is there a faster way?
The program could do it millions of times, so I expect any time savings would improve it.

1
2
std::regex double_quote_find {R"<("")<"};
str = std::regex_replace (str, double_quote_find, "''");
I don't know regex - can you explain what you are trying to do please.
I have a string that says something like this.
 
"Item exp, 2"" B/M 90deg widget (As Per Quote AABBITU26.056)"

I want it to say this:
 
"Item exp, 2'' B/M 90deg widget (As Per Quote AABBITU26.056)"


The 2 would now be followed by 2 single quotes.

I need it to find all of the occurrences of double quotes; it is possible there are many in a string.

My searches seem to suggest this:
Boost::replace_all

I can go to the trouble getting Boost and using the header, but I was wondering if anyone knew if it was worth the trouble. Will it be faster?

or I can try use this:
1
2
3
4
5
6
7
8
9
void replace_all(std::string& data, std::string to_search, std::string replace_str){

    size_t pos = data.find(to_search);

    while( pos != std::string::npos){
        data.replace(pos, to_search.size(), replace_str);
        pos =data.find(to_search, pos + replace_str.size());
    }
}


I'm not capable of deciding if something is worth the effort; will it be faster?
Last edited on
Possibly something like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
#include <string>
#include <iostream>

void strconv(std::string& str) {
	for (auto& ch : str)
		if (ch == '\"')
			ch = '\'';
}

int main() {
	std::string str { R"!(Item exp, 2"" B/M 90deg widget (As Per Quote AABBITU26.056))!" };

	strconv(str);

	std::cout << str << '\n';
}



Item exp, 2'' B/M 90deg widget (As Per Quote AABBITU26.056)

Last edited on
Similar to @seeplus, but only converting 2 consecutive double quotes...

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#include <string>
#include <iostream>

void double_quote_conv(std::string& str)
{
    static const std::string target = R"("")";
    static const std::string sub = R"('')";
    int index = 0;
    while (true)
    {
        index = str.find(target, index);
        if (index == std::string::npos)
        {
            break;
        }
        else
        {
            str.replace(index, sub.length(), sub);
        }
    }
}

int main()
{
	std::string str { R"!("Item exp, 2"" B/M 90deg widget (As Per Quote AABBITU26.056)")!" };

	double_quote_conv(str);

	std::cout << str << '\n';
}


As far as whether it's faster, you need to do some real-world measurements. I suspect the Regex developers wrote optimized code, so it's probably pretty fast. But it's always a good idea to measure things in your environment to see how things shake out.
I'm leaning towards this:
1
2
3
4
5
6
7
8
9
inline void replace_all(std::string& data, std::string to_search, std::string replace_str){

    size_t pos = data.find(to_search);

    while( pos != std::string::npos){
        data.replace(pos, to_search.size(), replace_str);
        pos =data.find(to_search, pos + replace_str.size());
    }
}


Is inline worthwhile or is it nonsense?

How do you tell if one function is faster? Does anyone have a standard method?
Last edited on
I'm not capable of deciding if something is worth the effort; will it be faster?
It's likely because regex will use the string function but adds some overhead.

Whether it's worth the effort is hard to tell. It would require some [extensive] testing.

I guess the simpliest solution would be that you keep the approach you prefer. When it turns out to be a bottle neck you may try something else. Often you don't need the fastes solution, just the solution which is fast enough.
That will probably be slightly slower than my version above. If all you want to do is to replace all occurrences of one char with another char, then:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
#include <string>
#include <iostream>

void strconv(std::string& str, char fnd, char rep) {
	for (auto& ch : str)
		if (ch == fnd)
			ch = rep;
}

int main() {
	std::string str{ R"!(Item exp, 2"" B/M 90deg widget (As Per Quote AABBITU26.056))!" };

	strconv(str, '\"', '\'');

	std::cout << str << '\n';
}


If you want to replace more than one sequential chars (string) with another set of char(s), then that is different and will be slightly slower. As you say that performance is critical then unless you need the string replacement, I'd stick with the version that just iterates the string and replaces one char with it's alternative. Anything that causes the string to shrink or expand is going to decrease performance as the string has to shuffled and potentially new memory allocated and copied (unless the string capacity has already been reserved to the potential maximum needed).

There is no harm in using inline - it's only a hint to the compiler to in-line this code where the compiler thinks it could be appropriate.
Last edited on
how are you reading the data? Is it one giant string, or many small ones?
would it be possible to use " as the delimiter to a read and replace it as it comes in? Not sure that would be faster, but maybe?
is there any pattern in the data you can rely upon and exploit?
is the data static (same every time you run) such that you can just do this once an save the new file (where do the offending symbols come from, can you prevent it at that level??)
Last edited on
Thank you for your response.

My present problem needs to ensure that it is always two quotes together R"("")". An alone double quote needs to be retained.

My present program ran too slow, with approximately a 1/2 a million records it took close an hour. Figuring out where the bottle-neck is, is not something I know how to do.

My goal is 10 minutes, which for me is more than fast enough.

I'm just guessing that it is my use of regular expressions that is making it run so slow.

Stick in your replace_all function you posted a few posts up. Then run your program and see how long int takes. If it takes close to an hour, you didn't make any gains. If it takes around 10 minutes, you're golden. Sometimes development is that simple.

If you want to get down to the nitty gritty, you need a profiler. That will show you where your CPU cycles are being spent.
Another option:
1
2
3
4
5
6
7
8
9
10
11
12
13
void convert(std::string& str)
{
    
    for(size_t i = 0; i < str.size() -1; i++ )
    {
        if (str[i] == '\"' && str[i+1] == '\"')
        {
            str[i] = '\'';
            str[i+1] = '\'';
        }
    }
    
}


A quick benchmark site: https://quick-bench.com
I'm reading the data a line at a time, using std::getline.
I've posted this program before in another thread, but I've made changes based on comments others suggested.

I'm going to repost it just in case someone is willing to look. I'm not expecting someone to look it over I understand it is a bit much.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
#include <iostream>
#include <algorithm>
#include <stdexcept>
#include <vector>
#include <fstream>
#include <sstream>
#include <regex>

std::vector<std::string> get_input();
std::vector<std::string> process_record(std::string& r);
inline void print_record(std::regex& r, std::vector<std::string> vs, std::fstream& ofh);
inline void replace_all(std::string& data, std::string to_search, std::string replace_str);

int main() try {

    auto vs = get_input();

    std::fstream ifh; 
    ifh.open(vs[0], std::ios::in); 
    if (!ifh) 
        throw std::runtime_error("File not opened: " + vs[0]);

    std::fstream ofh; 
    ofh.open(vs[1], std::ios::out); 
    if (!ofh) 
        throw std::runtime_error("File not opened: " + vs[1]); 
    
    //here are my precompiled regex
    std::regex leading_and_trailing {R"(^ +| +$|( ) +)"};
    std::regex number { R"(\$?\d\d?\d?(,\d{3})*(\.\d*)?)" };
    std::regex trailing_comma {R"(,+$)"};

    for(std::string str; std::getline(ifh,str);) { 

        for (size_t n{}; (n = std::count(str.begin(), str.end(), '"') % 2);){ //ensuring all double quotes are closed
            //If they're not closed I'm grabing the the next record and replacing
            //the return character with a literal \n.
            std::string next_line;

            std::getline(ifh, next_line);
            str += R"(\n)" + next_line;
        }

        //str = std::regex_replace (str, double_quote_find, "''"); //removing double quotes inside of quotes
        replace_all(str, R"("")", R"('')"); //removing double quotes inside of quotes

        auto record = process_record(str); //putting the individual fields in a vector.

        for(auto& s : record) { 
            s = std::regex_replace(s , leading_and_trailing, "$1"); //remove leading and trailing spaces
            
            std::smatch sm;
            if (std::regex_match(s, sm, number)) { 
                const auto e1 {std::remove(s.begin(), s.end(), ',')};	//remove commas in numbers
                s.erase(std::remove(s.begin(), e1, '$'), s.end());		//remove dollar sign
            }

            std::replace(s.begin(), s.end(), ',', ';'); //replacing commas with semi-colons
            
        }

        print_record(trailing_comma, record, ofh);

    }

    return 0;
}
catch (std::exception& e) {
    std::cerr << e.what() << '\n';
    return 1;
}
catch (...) {
    std::cerr << "uncaught" << '\n';
    return 1;
}

inline void print_record(std::regex& r, std::vector<std::string> vs, std::fstream& ofh) {
    std::string str;
    for(auto s : vs) {
        str += s;
        str += ',';
    }

    ofh << std::regex_replace (str, r, "") << std::endl;
}

std::vector<std::string> process_record(std::string& r) {
    std::stringstream ss{r};
    std::vector<std::string> rv;
    char ch;

    std::string field{};
    while (ss.get(ch)) { 

        if(ch == '"') { 
            while (ss.get(ch)) {

                if (ch != '"') {
                    field += ch; 
                }
                else { 
                    ss.get(ch); 
                    break;
                }
            }
        }

        if(ch == ',') {
            rv.push_back(field);
            field.clear();
            continue;
        }
        field += ch;
        
    }

    return rv;
}

std::vector<std::string> get_input() {
    std::vector<std::string> return_vstr;
    std::fstream ifh;

    ifh.open(R"(input_info.xml)", std::ios::in);
	if (!ifh) 
        throw std::runtime_error("File not opened--check to see if input_info.xml is correct");

    std::string info;
    char ch;

    while (ifh.get(ch)) { 
        if (ch == '\n') ch = ' '; 
        info.push_back(ch); 
    } 

    std::smatch sm;  
    std::regex in(R"(.*?<input1>(.*?)</input1>.*)");
    std::regex out(R"(.*?<output1>(.*?)</output1>.*)");

    std::regex_match (info, sm, in); 
    return_vstr.push_back(sm[1]);

    std::regex_match (info, sm, out); 
    return_vstr.push_back(sm[1]);

    return return_vstr;

}

inline void replace_all(std::string& data, std::string to_search, std::string replace_str){

    size_t pos = data.find(to_search);

    while( pos != std::string::npos){
        data.replace(pos, to_search.size(), replace_str);
        pos =data.find(to_search, pos + replace_str.size());
    }
}


I do need to figure out a way to get it faster. This present iteration took maybe 40 minutes which is about 4 times slower than I need.
Possibly:

1
2
3
4
5
void strconv(std::string& str, char fnd, char rep) {
	for (auto it {str.begin()}, ite {str.end() - 1}; it != ite; ++it)
		if (*it == fnd && *(it + 1) == fnd)
			*it = rep, *(it + 1) = rep;
}

I'm reading the data a line at a time, using std::getline.

ok, so, you can read these lines into N containers and send the containers off to N threads that each crunch part of the data in parallel. This would divide your time by N, so if you did 5 threads and then reassembled (if necessary) the containers into one 'fixed' block of records (you may NOT want to do that, it may be better to keep many blocks and thread other parts of the code downstream as well, and just iterate the containers one by one at the end to output it) it would turn 40 min into 10 without much fuss.

of course, I think you can do better still, but that is the 10 cent solution that keeps on giving.
if you can double the speed of the function, it will do it in 5 with threading, and so on.

I think seeplus' original is really the way to go here. pseudocode for 2 at a time:
1
2
3
4
5
for(all the letters except the last one) //traditional for loop, range based won't cut it for 2 at at time
{
    if( string[letter] == " and string[letter+1] == ")
         string[letter] = string[letter+1] = ' 
} 

run that sucker over 5 threads and call it a day. (maybe, check the hardware, how many CPU do you have? I have 20 on my I9 ... most newer machines have 8 or more)

just in case: be sure to compile in release / optimized flags mode, not debug mode, when you test how long it takes.
Last edited on
IMO, using regex() is slowwww.

Also your code is repeatedly iterating the same string. For performance purposes, you really just want to iterate over it once. Things like removing $ sign can be done in process_record(). The same with removing leading and trailing spaces.

As process_record() is splitting on , (except in quotes), how are you dealing with , in numbers - as 123,456 will be split as 123 456 (unless in quotes)? Do you want the comma removed if the number is in quotes?

The same with replacing , with ; - after the split this will only replace , with ; when the original was enclosed in quotes. So why not do the replace in process_record?

L45 and L49-60 could be done within the loop in process_record() without using regex etc. This will speed up the code - probably quite considerably.

Last edited on
Thank you jonnin.

Since I'm just a hack, I believe the thread suggestion is probably the best for me. I suspect I can use this regularly if I can wrap me mind around it.

I think at line 47 I can create 5 threads without any problems.
Would I need stop each thread before I print on line 62?

Any suggestions for a tutorial on threads?
the threads will stop on their own, but you need to wait for all of them to finish before line 62, yes.

https://www.geeksforgeeks.org/multithreading-in-cpp/

these kinds of threads are simple: they do not have to take care about accessing the same data (mutex / critical sections) or order of operations (waiting on each other) or any of the advanced stuff. Each thread is just doing its own thing to its own data and nothing cares what anything else is doing.
File operations are costly. Stream operations are generally rather slow.

To find out the bottle neck I would suggest to test the first getline(...) loop only. Then add loop by loop and line by line. This should tell you were the problem really is.
Thank you all. I appreciate your suggestion and will read and consider them all, but I do need to get this working in the next couple of hours.So I'm going to put off really optimizing the code for a couple of days.

Question about threads:

How do I ensure that if I create 5 thread: t1,t2,t3,t4, and t5; that when I write them to a file on line 62 they write in the same order?

Pages: 123