fastest find and replace for a string

Pages: 123
doesn't matter.
you have chunk 1, the first N records in the data, in a container. you have chunk 2, the N+1 to 2N block of records... etc.
when all the threads are done (waiting as we discussed) you just print chunk 1 then chunk 2... they are still in order.
so chunk them in order into an array of vectors or something along those lines.
alternately if you have some reason to do it some other way, you can stuff a line number onto the lines and sort at the end (eg a tuple or struct with a number and a string). I see no reason to do that here, but its a way to handle it if you need that idea for something else later.

you can also just split them with pointers. Eg if you read all the lines into one big vector, you can take &vec[index] and operate on chunksize (vec.size()/N) worth of them, taking care not to run off the end of the vector in the 'remainder' thread. work out the math as to how you split it and pass the size into each thread as a parameter... This is probably more efficient but more difficult. its easier to read into an array of vectors, but even there, you need a count of # of recs to split it up correctly.
Last edited on
How big is the file? If two will fit into memory, I'd be tempted to read the whole of the input file into a string (use read() and set the string capacity to the size of the file), process the string data to a new string as required (set it's capacity to be the same size as the read string) and then write the whole new string to the new file (use write).
if you did it that way, you would have to use the pointer idea to split it for threads. which is fine. it would improve the disk IO part for sure.
Last edited on
Some of the advice is going above me head. I'll have to go back and reread some of your response when I have more time.

I wrote this function:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
void task(std::regex& number, std::regex& leading_and_trailing, std::string str, std::vector<std::string>& record){

        //str = std::regex_replace (str, double_quote_find, "''"); //removing double quotes inside of quotes
        replace_all(str, R"("")", R"('')"); //removing double quotes inside of quotes

        record = process_record(str); //putting the individual fields in a vector.

        for(auto& s : record) { 
            s = std::regex_replace(s , leading_and_trailing, "$1"); //remove leading and trailing spaces
            
            std::smatch sm;
            if (std::regex_match(s, sm, number)) { 
                const auto e1 {std::remove(s.begin(), s.end(), ',')};	//remove commas in numbers
                s.erase(std::remove(s.begin(), e1, '$'), s.end());		//remove dollar sign
            }

            std::replace(s.begin(), s.end(), ',', ';'); //replacing commas with semi-colons
            
        }
}


How do I rewrite it so I can pass it to a thread?

Right now my compiler will let me do this:
1
2
3
4
5
std::vector<std::string> record1;
std::thread t1(task, number, leading_and_trailing, str, record1);
t1.join();

print_record(trailing_comma, record1, ofh);


I get warnings about record1. It also doesn't work.
Last edited on
no, sorry, that is old school. Standard thread allows multiple parameters. what is your warning?
did you populate record1?
Last edited on
 
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include\type_traits(1593): warning C4239: nonstandard extension used: 'argument': conversion from 'std::vector<std::string,std::allocator<std::string>>' to 'std::vector<std::string,std::allocator<std::string>> &'


No I didn't populate it. It is to be populated in the function.
Could you upload the input file somewhere. Would be much easier if we could run and benchmark the code.
It thinks there is a type mismatch because of the reference but I don't see the error yet.

ah.. read this:

https://riptutorial.com/cplusplus/example/2329/passing-a-reference-to-a-thread

or just get rid of the & (since its all empty, I don't think copying the vector will hurt you much)
Last edited on
how would I return the vector and use a thread?
you are right, you need it so remove & is no good.
use a pointer or that standard reference link (above).

a pointer looks like
void task(std::regex& number, std::regex& leading_and_trailing, std::string str, std::vector<std::string> *record)
std::thread t1(task, number, leading_and_trailing, str, &record1);
Last edited on
We don't use much std::thread as we use Windows threading. If you use Windows, then this is one way to parallel read a file using overlapped i/o. From our timings this is the fastest way to read a file into std::string when the file is at least several hundred megabytes. We use a chunk size of 10Mb.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
#include <windows.h>
#include <string>
#include <iostream>
#include <vector>

constexpr size_t defchunk {10'000'000};

// fn - file name
// buff - where to read to file to
// returns - true if read succeeded, false otherwise
bool WOreadFile(const std::string& fn, std::string& buff, size_t chunk = defchunk) {
	bool ret {};	// Problems

	if (const auto fh {CreateFile(fn.c_str(), FILE_READ_DATA, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED, NULL) }; fh != INVALID_HANDLE_VALUE) {
		if (const auto sz {GetFileSize(fh, NULL) }; sz != INVALID_FILE_SIZE) {
			constexpr size_t evtsz {50};
			const size_t nochk {sz / chunk + 1};
			const auto thid{ GetThreadId(GetCurrentThread()) };	// Sort of unique event name

			std::vector<OVERLAPPED> vo(nochk);
			std::vector<HANDLE> vh(nochk);
			char evt[evtsz] {};

			buff.resize(sz);

			for (size_t i = 0, done = 0; i < nochk; ++i, done += chunk) {
				snprintf(evt, evtsz, "oevt%u-%zu", thid, i);
				vo[i].hEvent = vh[i] = CreateEvent(NULL, FALSE, FALSE, evt);
				vo[i].OffsetHigh = 0;
				vo[i].Offset = done;
				ReadFile(fh, buff.data() + done, std::min(chunk, (size_t)(sz - done)), NULL, &vo[i]);
			}

			WaitForMultipleObjects(vh.size(), (HANDLE*)vh.data(), TRUE, INFINITE);

			ret = true;

			for (const auto& e : vo) {
				ret &= ((DWORD)e.Internal == ERROR_SUCCESS);	// Have all reads succeeded
				CloseHandle(e.hEvent);							// Close events
			}
		}

		CloseHandle(fh);
	}

	return ret;
}

Are these the rules you want for the result string:


Remove spaces before/after , if not in quotes
Remove $
Replace "" with ''
Remove , if in a number inside quotes
Replace , with ; if in quotes and not part of a number

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
for(std::string str; std::getline(ifh,str);) { 

    for (size_t n{}; (n = std::count(str.begin(), str.end(), '"') % 2);){ //ensuring all double quotes are closed
        //If they're not closed I'm grabing the the next record and replacing
        //the return character with a literal \n.
        std::string next_line;

        std::getline(ifh, next_line);
        str += R"(\n)" + next_line;
    }

    std::vector<std::string> record1;
    std::thread t1(task, number, leading_and_trailing, str, &record1);
    t1.join();
    print_record(trailing_comma, record1, ofh);

}


This is what I did. It does run considerably faster, but I can't figure out how I can possible run another thread.

I think this might be above my capabilities.

Thank you all.
Last edited on
@seeplus

These are the rules and I believe these rules need to be followed in order:

1.) Replace all R"("")" with R"('')"
2.) If there is a return inside of closed quotes replace the return character
with R"(\n)".
3.) If it looks like a number inside of closed quotes remove ',' and '$'.
4.) Inside of closed quotes replace all commas with ';' and remove all leading and trailing spaces ( it is important that step 3 is done before this step).
Last edited on
it may be beyond you to finish it in the next couple of hours, but from what you have shown so far, I think you can do it, and the technique will greatly improve your ability as modern software is all about the threads since 4+ core cpus were common..

I believe you will have more luck if you do this:
read and split the file into your vector of strings up front.
call a thread on each vector.
trying to read the data inside the thread is why its hard to make a second one.
the actual work function can be massively improved too. I bet if you do both its going to end up < 2 min total.

Last edited on
The present program runs on a 400MB file took 12:23.
The original took over an hour.
Thank you anyone who took time to help me. I didn't expect it and it is appreciated.

I can't understand why adding just one thread increased the speed over 4 times. I will have to play with this more in the future when I have more time.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
#include <iostream>
#include <algorithm>
#include <stdexcept>
#include <vector>
#include <fstream>
#include <sstream>
#include <regex>
#include <thread>

std::vector<std::string> get_input();
std::vector<std::string> process_record(std::string& r);
inline void print_record(const std::regex& r, std::vector<std::string> vs, std::fstream& ofh);
inline void replace_all(std::string& data, std::string to_search, std::string replace_str);
void task(const std::regex& number, const std::regex& leading_and_trailing, std::string str, std::vector<std::string>& record);

int main() try {

    auto vs = get_input();

    std::fstream ifh; 
    ifh.open(vs[0], std::ios::in); 
    if (!ifh) 
        throw std::runtime_error("File not opened: " + vs[0]);

    std::fstream ofh; 
    ofh.open(vs[1], std::ios::out); 
    if (!ofh) 
        throw std::runtime_error("File not opened: " + vs[1]); 
    
    //here are my precompiled regex
    const std::regex leading_and_trailing {R"(^ +| +$|( ) +)"};
    const std::regex number { R"(\$?\d\d?\d?(,\d{3})*(\.\d*)?)" };
    const std::regex trailing_comma {R"(,+$)"};

    for(std::string str; std::getline(ifh,str);) { 

        for (size_t n{}; (n = std::count(str.begin(), str.end(), '"') % 2);){ //ensuring all double quotes are closed
            //If they're not closed I'm grabing the the next record and replacing
            //the return character with a literal \n.
            std::string next_line;

            std::getline(ifh, next_line);
            str += R"(\n)" + next_line;
        }

        std::vector<std::string> record;
        std::thread t1(task, std::cref(number), std::cref(leading_and_trailing), str, std::ref(record));
        t1.join();
        print_record(trailing_comma, record, ofh);

    }

    return 0;
}
catch (std::exception& e) {
    std::cerr << e.what() << '\n';
    return 1;
}
catch (...) {
    std::cerr << "uncaught" << '\n';
    return 1;
}

inline void print_record(const std::regex& r, std::vector<std::string> vs, std::fstream& ofh) {
    std::string str;
    for(auto s : vs) {
        str += s;
        str += ',';
    }

    ofh << std::regex_replace (str, r, "") << std::endl;
}

std::vector<std::string> process_record(std::string& r) {
    std::stringstream ss{r};
    std::vector<std::string> rv;
    char ch;

    std::string field{};
    while (ss.get(ch)) { 

        if(ch == '"') { 
            while (ss.get(ch)) {

                if (ch != '"') {
                    field += ch; 
                }
                else { 
                    ss.get(ch); 
                    break;
                }
            }
        }

        if(ch == ',') {
            rv.push_back(field);
            field.clear();
            continue;
        }
        field += ch;
        
    }

    return rv;
}

std::vector<std::string> get_input() {
    std::vector<std::string> return_vstr;
    std::fstream ifh;

    ifh.open(R"(input_info.xml)", std::ios::in);
	if (!ifh) 
        throw std::runtime_error("File not opened--check to see if input_info.xml is correct");

    std::string info;
    char ch;

    while (ifh.get(ch)) { 
        if (ch == '\n') ch = ' '; 
        info.push_back(ch); 
    } 

    std::smatch sm;  
    std::regex in(R"(.*?<input1>(.*?)</input1>.*)");
    std::regex out(R"(.*?<output1>(.*?)</output1>.*)");

    std::regex_match (info, sm, in); 
    return_vstr.push_back(sm[1]);

    std::regex_match (info, sm, out); 
    return_vstr.push_back(sm[1]);

    return return_vstr;

}

inline void replace_all(std::string& data, std::string to_search, std::string replace_str){

    size_t pos = data.find(to_search);

    while( pos != std::string::npos){
        data.replace(pos, to_search.size(), replace_str);
        pos =data.find(to_search, pos + replace_str.size());
    }
}


void task(const std::regex& number, const std::regex& leading_and_trailing, std::string str, std::vector<std::string>& record){

        replace_all(str, R"("")", R"('')"); //removing double quotes inside of quotes

        record = process_record(str); //putting the individual fields in a vector.

        for(auto& s : record) { 
            s = std::regex_replace(s , leading_and_trailing, "$1"); //remove leading and trailing spaces
            
            std::smatch sm;
            if (std::regex_match(s, sm, number)) { 
                const auto e1 {std::remove(s.begin(), s.end(), ',')};	//remove commas in numbers
                s.erase(std::remove(s.begin(), e1, '$'), s.end());		//remove dollar sign
            }

            std::replace(s.begin(), s.end(), ',', ';'); //replacing commas with semi-colons
            
        }
}
Last edited on
If you read the whole file into a single std::string, then this is an example of a possible way to process this by iterating the string once char by char. Not fully tested so may need some tweaks but could be the basis of an alternative to using regex etc which could further improve the performance.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
#include <iostream>
#include <string>
#include <cctype>

// Remove spaces before/after , if not in quotes
// Remove $
// Replace "" with ''
// Remove , if in a number inside quotes
// Replace , with ; if in quotes
// Ignore /n /r inside quotes
std::string process(const std::string& str) {
	std::string outstr(str.size(), 0);

	bool inelem{};
	bool inquote{};

	const char* ch {str.c_str()};
	char* outch {outstr.data()};

	for (; *ch; ++ch) {
		// Remove $
		if (*ch == '$')
			continue;

		// Ignore space if not in an element
		if (*ch == ' ') {
			if (inelem)
				*outch++ = *ch;
			continue;
		}

		// Remove previous spaces before a , or \n if not in quote
		if (!inquote && (*ch == ',' || *ch == '\n')) {
			while (outch > outstr.data() && *(outch - 1) == ' ')
				--outch;

			*outch++ = *ch;
			inelem = false;
			continue;
		}

		// Ignore \n or \r if in quote
		if (inquote && (*ch == '\n' || *ch == '\r'))
			continue;

		// Replace "" with ''
		if (*ch == '\"' && *(ch + 1) == '\"') {
			*outch++ = '\'';
			*outch++ = '\'';
			++ch;
			continue;
		}

		// Deal with quote
		if (*ch == '"') {
			inquote = !inquote;
			inelem = true;
			continue;
		}

		// Is , in a number in quotes
		if (*ch == ',') {
			// Ignore if part of a number otherwise replace with ;
			if (!std::isdigit(static_cast<unsigned char>(*(ch + 1))))
				*outch++ = ';';

			continue;
		}

		inelem = true;
		*outch++ = *ch;
	}

	outstr.resize(outch - outstr.data());
	return outstr;
}

int main() {
	//std::string str{ R"!(Item exp, 2"" B/M 90deg widget (As Per Quote AABBITU26.056))!" };
	const std::string str{ R"!(   qwe  ,  pou ,  "4,5 qw,po "" (8)", 987)!" };

	std::cout << process(str) << '\n';
}



qwe,pou,45 qw;po '' (8),987

@seeplus

Thank you. I spent way too much time on this yesterday. I won't be able to get back to this for awhile. I was able to get my program down to 8 minutes by adding 4 threads.

Interestingly to me, adding one thread increased the speed from 1 hour down to 12 minutes, but adding three additional threads increased the speed from 12 minutes to 8.

Eight minutes is tolerable, but now that jonnin told me he thinks I should be able to get it to under 2 minutes I'm going to give it a try. It'll just have to wait awhile.
You could consider using Boyer-Moore to find the match, that's faster than linear and regex.
I'll need to read up on Boyer-Moore. It looks like math. Since I've have no advance math skills, other than college level algebra, I'm wondering if it is something I'll be able to understand.

Any recommended reading for a math dummy?
Pages: 123