What is a regular expression to parse csv files?

Mar 20, 2021 at 6:55pm
MSVS 16.9.2

I am using regex to parse input lines in a Comma Separated Value (csv) file and can't seem to get a correct RE. I am using the ECMAScript grammar (https://docs.microsoft.com/en-us/cpp/standard-library/regular-expressions-cpp?view=msvc-160).

I am currently trying (the field separator is ':'):

static const regex pattern(".*[^[:$]]");

Unfortunately, Microsoft Visual Studio crashes with this. Using

static const regex pattern(".*[^:]");
regex_search(line, matches, pattern);

doesn't crash, but with an input line like "Name:0" it returns matches[0] = "Name:0".

I've tried other REs but I can't seem to get it right.
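
A note on why the second pattern returns the whole line: `.*` is greedy, so it consumes as much of the line as it can and `[^:]` only has to match the final character. If a regex is still wanted, one possible sketch is to match the separator rather than the fields and let std::sregex_token_iterator hand back the text between matches. This assumes fields never contain ':' themselves, and the function name regex_split is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split a line on ':' using the standard regex library.
// Assumes the separator never appears inside a field.
std::vector<std::string> regex_split(const std::string &line) {
    static const std::regex separator(":");
    std::vector<std::string> fields;
    // Submatch index -1 selects the text *between* separator matches.
    std::sregex_token_iterator it(line.begin(), line.end(), separator, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it)
        fields.push_back(it->str());
    return fields;
}
```

With this sketch, regex_split("Name:0") yields the two fields "Name" and "0" instead of one match covering the whole line.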
Mar 20, 2021 at 7:16pm
If all you need is to split a string by a character, a regex is overkill.
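#include <string>
#include <utility>
#include <vector>

// Split s at every occurrence of separator; empty fields are preserved.
std::vector<std::string> split(const std::string &s, char separator){
    std::vector<std::string> ret;
    std::string accum;
    for (auto c : s){
        if (c == separator){
            // End of a field: store it and start accumulating the next one.
            ret.emplace_back(std::move(accum));
            accum.clear(); // a moved-from string is only guaranteed to be valid, not empty
            continue;
        }
        accum += c;
    }
    // The last field has no trailing separator.
    ret.emplace_back(std::move(accum));
    return ret;
}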
Mar 21, 2021 at 5:03am
CSV files are actually well and truly evil. They look simple, but there are some pretty hard caveats in there that’ll mess up your algorithm.

You said : was the separator.
• What is your data?
• Can your data include the : character? If yes, how?
• Can your data span lines? If yes, how is the newline embedded?
• Any special weirdness in formatting numbers or anything? (For example, did the data come from Excel?)
• How big is the data file? How big do you expect it to grow?

The best way to read a CSV is, sadly, a DFA tailored to your expected input.
If the answers to all of my questions above are 'no' and 'small', then helios’s solution will suffice.
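
To illustrate what a small tailored DFA might look like, here is a minimal sketch under assumptions that are not taken from the original poster's data: fields may optionally be wrapped in double quotes, a doubled quote inside a quoted field stands for a literal quote, and records never span lines. The name parse_record and the state names are hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical three-state DFA for one record.
// Assumed format: ':' separator, optional "quoted" fields, "" as an escaped quote.
std::vector<std::string> parse_record(const std::string &line, char separator = ':') {
    enum class State { Unquoted, Quoted, QuoteInQuoted };
    State state = State::Unquoted;
    std::vector<std::string> fields;
    std::string field;
    for (char c : line) {
        switch (state) {
        case State::Unquoted:
            if (c == separator) { fields.push_back(field); field.clear(); }
            else if (c == '"' && field.empty()) state = State::Quoted; // opening quote
            else field += c;
            break;
        case State::Quoted:
            if (c == '"') state = State::QuoteInQuoted; // closing quote or first half of ""
            else field += c;                            // separators are literal here
            break;
        case State::QuoteInQuoted:
            if (c == '"') { field += '"'; state = State::Quoted; } // "" -> literal quote
            else if (c == separator) {
                fields.push_back(field); field.clear(); state = State::Unquoted;
            } else { field += c; state = State::Unquoted; }
            break;
        }
    }
    // Whatever has accumulated is the last field; an unterminated quote is not reported.
    fields.push_back(field);
    return fields;
}
```

Under these assumptions, the input a:"b:c":d produces the three fields a, b:c and d. A real file may need more states (embedded newlines, error reporting), which is exactly why the DFA has to be tailored to the input.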
Mar 21, 2021 at 10:41am
And what about ignoring any white space on either side of the delimiter? Can you provide a sample from the file?
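
If surrounding white space does need to be ignored, one simple option is to trim each field after splitting. The sketch below assumes only spaces and tabs count as white space; the name trim is illustrative.

```cpp
#include <string>

// Hypothetical helper: strip leading and trailing spaces/tabs from one field.
std::string trim(const std::string &s) {
    const char *whitespace = " \t";
    const auto first = s.find_first_not_of(whitespace);
    if (first == std::string::npos)
        return ""; // the field is empty or all white space
    const auto last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}
```

Applied to each field of a line such as "Name : 0", this would turn "Name " and " 0" into "Name" and "0".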
Mar 21, 2021 at 12:14pm
I don't think that regex is the right tool.
Consider using a library.
https://github.com/d99kris/rapidcsv/
Mar 21, 2021 at 1:03pm
If the actual format of the csv is known, is simple, doesn't change and can't have variants, then something like helios's code above will be simpler. However, if you are to parse a file that is 'csv format' and its content is not under your control, then don't try to parse it yourself. It's very tricky to parse a general csv file correctly. Use a 3rd party library. Even if you write code to correctly parse a 3rd party csv file now, if the code isn't general and the 3rd party changes their format slightly (e.g. spaces around the delimiter), then you'll probably have to change your code.

Splitting a string on a delimiter is easy - fully parsing a general csv file is not.
Mar 22, 2021 at 12:12am
It turns out that my csv output is dead simple. It contains only a ':' (wisely chosen, I might add) delimiter, and each line ends with nothing extra (or with \r if it's a DOS output). I have been convinced that the best way to proceed is to just write a simple piece of code to 'split' the input. To the convincers' credit, it worked. And so I proceed gracefully into my future.

I do understand the uncertain world of csv input, and am glad to avoid it.

I might also add that I'm using Visual Studio (because NetBeans has a lack of manpower to provide C/C++ support). I find VS riddled with IDE and compiler errors, and VS seems to ignore some language constructs. Pity. I really like NetBeans.

Thanks to all.
Last edited on Mar 22, 2021 at 12:14am