fastest find and replace for a string

How big is the file? If two will fit into memory, I'd be tempted to read the whole of the input file into a string (use read() and set the string capacity to the size of the file), process the string data to a new string as required (set it's capacity to be the same size as the read string) and then write the whole new string to the new file (use write).

if you did it that way, you would have to use the pointer idea to split it for threads. which is fine. it would improve the disk IO part for sure.

Last edited on

Some of the advice is going above me head. I'll have to go back and reread some of your response when I have more time.

I wrote this function:

void task(std::regex& number, std::regex& leading_and_trailing, std::string str, std::vector<std::string>& record){

        //str = std::regex_replace (str, double_quote_find, "''"); //removing double quotes inside of quotes
        replace_all(str, R"("")", R"('')"); //removing double quotes inside of quotes

        record = process_record(str); //putting the individual fields in a vector.

        for(auto& s : record) { 
            s = std::regex_replace(s , leading_and_trailing, "$1"); //remove leading and trailing spaces
            
            std::smatch sm;
            if (std::regex_match(s, sm, number)) { 
                const auto e1 {std::remove(s.begin(), s.end(), ',')};	//remove commas in numbers
                s.erase(std::remove(s.begin(), e1, '$'), s.end());		//remove dollar sign
            }

            std::replace(s.begin(), s.end(), ',', ';'); //replacing commas with semi-colons
            
        }
}

How do I rewrite it so I can pass it to a thread?

Right now my compiler will let me do this:

std::vector<std::string> record1;
std::thread t1(task, number, leading_and_trailing, str, record1);
t1.join();

print_record(trailing_comma, record1, ofh);

I get warnings about record1. It also doesn't work.

Last edited on

no, sorry, that is old school. Standard thread allows multiple parameters. what is your warning?
did you populate record1?

Last edited on

C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include\type_traits(1593): warning C4239: nonstandard extension used: 'argument': conversion from 'std::vector<std::string,std::allocator<std::string>>' to 'std::vector<std::string,std::allocator<std::string>> &'

No I didn't populate it. It is to be populated in the function.

thmm (703)

Could you upload the input file somewhere. Would be much easier if we could run and benchmark the code.

It thinks there is a type mismatch because of the reference but I don't see the error yet.

ah.. read this:

https://riptutorial.com/cplusplus/example/2329/passing-a-reference-to-a-thread

or just get rid of the & (since its all empty, I don't think copying the vector will hurt you much)

Last edited on

how would I return the vector and use a thread?

you are right, you need it so remove & is no good.
use a pointer or that standard reference link (above).

a pointer looks like
void task(std::regex& number, std::regex& leading_and_trailing, std::string str, std::vector<std::string> *record)
std::thread t1(task, number, leading_and_trailing, str, &record1);

Last edited on

We don't use much std::thread as we use Windows threading. If you use Windows, then this is one way to parallel read a file using overlapped i/o. From our timings this is the fastest way to read a file into std::string when the file is at least several hundred megabytes. We use a chunk size of 10Mb.

#include <windows.h>
#include <string>
#include <iostream>
#include <vector>

constexpr size_t defchunk {10'000'000};

// fn - file name
// buff - where to read to file to
// returns - true if read succeeded, false otherwise
bool WOreadFile(const std::string& fn, std::string& buff, size_t chunk = defchunk) {
	bool ret {};	// Problems

	if (const auto fh {CreateFile(fn.c_str(), FILE_READ_DATA, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED, NULL) }; fh != INVALID_HANDLE_VALUE) {
		if (const auto sz {GetFileSize(fh, NULL) }; sz != INVALID_FILE_SIZE) {
			constexpr size_t evtsz {50};
			const size_t nochk {sz / chunk + 1};
			const auto thid{ GetThreadId(GetCurrentThread()) };	// Sort of unique event name

			std::vector<OVERLAPPED> vo(nochk);
			std::vector<HANDLE> vh(nochk);
			char evt[evtsz] {};

			buff.resize(sz);

			for (size_t i = 0, done = 0; i < nochk; ++i, done += chunk) {
				snprintf(evt, evtsz, "oevt%u-%zu", thid, i);
				vo[i].hEvent = vh[i] = CreateEvent(NULL, FALSE, FALSE, evt);
				vo[i].OffsetHigh = 0;
				vo[i].Offset = done;
				ReadFile(fh, buff.data() + done, std::min(chunk, (size_t)(sz - done)), NULL, &vo[i]);
			}

			WaitForMultipleObjects(vh.size(), (HANDLE*)vh.data(), TRUE, INFINITE);

			ret = true;

			for (const auto& e : vo) {
				ret &= ((DWORD)e.Internal == ERROR_SUCCESS);	// Have all reads succeeded
				CloseHandle(e.hEvent);							// Close events
			}
		}

		CloseHandle(fh);
	}

	return ret;
}

Are these the rules you want for the result string:


Remove spaces before/after , if not in quotes
Remove $
Replace "" with ''
Remove , if in a number inside quotes
Replace , with ; if in quotes and not part of a number

for(std::string str; std::getline(ifh,str);) { 

    for (size_t n{}; (n = std::count(str.begin(), str.end(), '"') % 2);){ //ensuring all double quotes are closed
        //If they're not closed I'm grabing the the next record and replacing
        //the return character with a literal \n.
        std::string next_line;

        std::getline(ifh, next_line);
        str += R"(\n)" + next_line;
    }

    std::vector<std::string> record1;
    std::thread t1(task, number, leading_and_trailing, str, &record1);
    t1.join();
    print_record(trailing_comma, record1, ofh);

}

This is what I did. It does run considerably faster, but I can't figure out how I can possible run another thread.

I think this might be above my capabilities.

Thank you all.

Last edited on

@seeplus

These are the rules and I believe these rules need to be followed in order:

1.) Replace all R"("")" with R"('')"
2.) If there is a return inside of closed quotes replace the return character
with R"(\n)".
3.) If it looks like a number inside of closed quotes remove ',' and '$'.
4.) Inside of closed quotes replace all commas with ';' and remove all leading and trailing spaces ( it is important that step 3 is done before this step).

Last edited on

it may be beyond you to finish it in the next couple of hours, but from what you have shown so far, I think you can do it, and the technique will greatly improve your ability as modern software is all about the threads since 4+ core cpus were common..

I believe you will have more luck if you do this:
read and split the file into your vector of strings up front.
call a thread on each vector.
trying to read the data inside the thread is why its hard to make a second one.
the actual work function can be massively improved too. I bet if you do both its going to end up < 2 min total.

Last edited on

The present program runs on a 400MB file took 12:23.
The original took over an hour.
Thank you anyone who took time to help me. I didn't expect it and it is appreciated.

I can't understand why adding just one thread increased the speed over 4 times. I will have to play with this more in the future when I have more time.

#include <iostream>
#include <algorithm>
#include <stdexcept>
#include <vector>
#include <fstream>
#include <sstream>
#include <regex>
#include <thread>

std::vector<std::string> get_input();
std::vector<std::string> process_record(std::string& r);
inline void print_record(const std::regex& r, std::vector<std::string> vs, std::fstream& ofh);
inline void replace_all(std::string& data, std::string to_search, std::string replace_str);
void task(const std::regex& number, const std::regex& leading_and_trailing, std::string str, std::vector<std::string>& record);

int main() try {

    auto vs = get_input();

    std::fstream ifh; 
    ifh.open(vs[0], std::ios::in); 
    if (!ifh) 
        throw std::runtime_error("File not opened: " + vs[0]);

    std::fstream ofh; 
    ofh.open(vs[1], std::ios::out); 
    if (!ofh) 
        throw std::runtime_error("File not opened: " + vs[1]); 
    
    //here are my precompiled regex
    const std::regex leading_and_trailing {R"(^ +| +$|( ) +)"};
    const std::regex number { R"(\$?\d\d?\d?(,\d{3})*(\.\d*)?)" };
    const std::regex trailing_comma {R"(,+$)"};

    for(std::string str; std::getline(ifh,str);) { 

        for (size_t n{}; (n = std::count(str.begin(), str.end(), '"') % 2);){ //ensuring all double quotes are closed
            //If they're not closed I'm grabing the the next record and replacing
            //the return character with a literal \n.
            std::string next_line;

            std::getline(ifh, next_line);
            str += R"(\n)" + next_line;
        }

        std::vector<std::string> record;
        std::thread t1(task, std::cref(number), std::cref(leading_and_trailing), str, std::ref(record));
        t1.join();
        print_record(trailing_comma, record, ofh);

    }

    return 0;
}
catch (std::exception& e) {
    std::cerr << e.what() << '\n';
    return 1;
}
catch (...) {
    std::cerr << "uncaught" << '\n';
    return 1;
}

inline void print_record(const std::regex& r, std::vector<std::string> vs, std::fstream& ofh) {
    std::string str;
    for(auto s : vs) {
        str += s;
        str += ',';
    }

    ofh << std::regex_replace (str, r, "") << std::endl;
}

std::vector<std::string> process_record(std::string& r) {
    std::stringstream ss{r};
    std::vector<std::string> rv;
    char ch;

    std::string field{};
    while (ss.get(ch)) { 

        if(ch == '"') { 
            while (ss.get(ch)) {

                if (ch != '"') {
                    field += ch; 
                }
                else { 
                    ss.get(ch); 
                    break;
                }
            }
        }

        if(ch == ',') {
            rv.push_back(field);
            field.clear();
            continue;
        }
        field += ch;
        
    }

    return rv;
}

std::vector<std::string> get_input() {
    std::vector<std::string> return_vstr;
    std::fstream ifh;

    ifh.open(R"(input_info.xml)", std::ios::in);
	if (!ifh) 
        throw std::runtime_error("File not opened--check to see if input_info.xml is correct");

    std::string info;
    char ch;

    while (ifh.get(ch)) { 
        if (ch == '\n') ch = ' '; 
        info.push_back(ch); 
    } 

    std::smatch sm;  
    std::regex in(R"(.*?<input1>(.*?)</input1>.*)");
    std::regex out(R"(.*?<output1>(.*?)</output1>.*)");

    std::regex_match (info, sm, in); 
    return_vstr.push_back(sm[1]);

    std::regex_match (info, sm, out); 
    return_vstr.push_back(sm[1]);

    return return_vstr;

}

inline void replace_all(std::string& data, std::string to_search, std::string replace_str){

    size_t pos = data.find(to_search);

    while( pos != std::string::npos){
        data.replace(pos, to_search.size(), replace_str);
        pos =data.find(to_search, pos + replace_str.size());
    }
}


void task(const std::regex& number, const std::regex& leading_and_trailing, std::string str, std::vector<std::string>& record){

        replace_all(str, R"("")", R"('')"); //removing double quotes inside of quotes

        record = process_record(str); //putting the individual fields in a vector.

        for(auto& s : record) { 
            s = std::regex_replace(s , leading_and_trailing, "$1"); //remove leading and trailing spaces
            
            std::smatch sm;
            if (std::regex_match(s, sm, number)) { 
                const auto e1 {std::remove(s.begin(), s.end(), ',')};	//remove commas in numbers
                s.erase(std::remove(s.begin(), e1, '$'), s.end());		//remove dollar sign
            }

            std::replace(s.begin(), s.end(), ',', ';'); //replacing commas with semi-colons
            
        }
}

Last edited on

If you read the whole file into a single std::string, then this is an example of a possible way to process this by iterating the string once char by char. Not fully tested so may need some tweaks but could be the basis of an alternative to using regex etc which could further improve the performance.

#include <iostream>
#include <string>
#include <cctype>

// Remove spaces before/after , if not in quotes
// Remove $
// Replace "" with ''
// Remove , if in a number inside quotes
// Replace , with ; if in quotes
// Ignore /n /r inside quotes
std::string process(const std::string& str) {
	std::string outstr(str.size(), 0);

	bool inelem{};
	bool inquote{};

	const char* ch {str.c_str()};
	char* outch {outstr.data()};

	for (; *ch; ++ch) {
		// Remove $
		if (*ch == '$')
			continue;

		// Ignore space if not in an element
		if (*ch == ' ') {
			if (inelem)
				*outch++ = *ch;
			continue;
		}

		// Remove previous spaces before a , or \n if not in quote
		if (!inquote && (*ch == ',' || *ch == '\n')) {
			while (outch > outstr.data() && *(outch - 1) == ' ')
				--outch;

			*outch++ = *ch;
			inelem = false;
			continue;
		}

		// Ignore \n or \r if in quote
		if (inquote && (*ch == '\n' || *ch == '\r'))
			continue;

		// Replace "" with ''
		if (*ch == '\"' && *(ch + 1) == '\"') {
			*outch++ = '\'';
			*outch++ = '\'';
			++ch;
			continue;
		}

		// Deal with quote
		if (*ch == '"') {
			inquote = !inquote;
			inelem = true;
			continue;
		}

		// Is , in a number in quotes
		if (*ch == ',') {
			// Ignore if part of a number otherwise replace with ;
			if (!std::isdigit(static_cast<unsigned char>(*(ch + 1))))
				*outch++ = ';';

			continue;
		}

		inelem = true;
		*outch++ = *ch;
	}

	outstr.resize(outch - outstr.data());
	return outstr;
}

int main() {
	//std::string str{ R"!(Item exp, 2"" B/M 90deg widget (As Per Quote AABBITU26.056))!" };
	const std::string str{ R"!(   qwe  ,  pou ,  "4,5 qw,po "" (8)", 987)!" };

	std::cout << process(str) << '\n';
}


qwe,pou,45 qw;po '' (8),987

@seeplus

Thank you. I spent way too much time on this yesterday. I won't be able to get back to this for awhile. I was able to get my program down to 8 minutes by adding 4 threads.

Interestingly to me, adding one thread increased the speed from 1 hour down to 12 minutes, but adding three additional threads increased the speed from 12 minutes to 8.

Eight minutes is tolerable, but now that jonnin told me he thinks I should be able to get it to under 2 minutes I'm going to give it a try. It'll just have to wait awhile.

kbw (9488)

You could consider using Boyer-Moore to find the match, that's faster than linear and regex.