You can use std::getline to read an entire line and std::stringstream to break that line into individual tokens. That assumes the two units of text you want to compare are on adjacent lines. If each unit of text is in a different file, you can do something like this instead:
|
std::ifstream my_file("my-file.txt");
std::vector<std::string> text;
// Extract whitespace-separated tokens until extraction fails (normally at end of file).
for (std::string tok; my_file >> tok; ) text.push_back(tok);
|
After the loop, each element of text holds one whitespace-separated token from the file. The loop ends when a stream extraction fails, which normally means end of file; you can query the state of the std::ifstream object after the loop to check whether something else went wrong.
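For the adjacent-lines case, and for the state check mentioned above, here is a rough sketch. The function names tokenize_line and read_all_tokens are made up for illustration, and treating "eofbit set and badbit clear" as a clean finish is my assumption about what you would want to check.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split a single line into whitespace-separated tokens.
std::vector<std::string> tokenize_line(std::string const& line) {
    std::istringstream iss(line);
    std::vector<std::string> tokens;
    for (std::string tok; iss >> tok; ) tokens.push_back(tok);
    return tokens;
}

// Read every line of `in` with std::getline and tokenize each line separately.
std::vector<std::vector<std::string>> read_token_lines(std::istream& in) {
    std::vector<std::vector<std::string>> lines;
    for (std::string line; std::getline(in, line); )
        lines.push_back(tokenize_line(line));
    return lines;
}

// Append every token in `in` to `out`.
// Returns true only if the loop stopped because the input ran out,
// not because the stream was unusable (e.g. the file never opened).
bool read_all_tokens(std::istream& in, std::vector<std::string>& out) {
    for (std::string tok; in >> tok; ) out.push_back(tok);
    return in.eof() && !in.bad();
}
```

With read_token_lines, elements k and k + 1 of the result are the two adjacent lines you would compare. Both functions take a std::istream&, so a std::ifstream or a std::istringstream works equally well.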
Here is a character-based implementation of the Wagner-Fischer edit-distance algorithm I had lying around. I don't have a templated or token-based version, but it should be easy to modify it to accept a vector of strings (tokens).
|
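#include <algorithm>  // std::min
#include <cstddef>    // std::size_t
#include <numeric>    // std::iota
#include <string>
#include <vector>

std::size_t edit_dist(std::string const& a, std::string const& b) {
    /* A substitution costs the same as one deletion plus one insertion. */
    constexpr std::size_t subst_cost = 2;
    /* r0 is the previous row of the table, r1 the row being filled. */
    std::vector<std::size_t> r0(b.size() + 1, 0);
    std::vector<std::size_t> r1(r0);
    /* Fill the row-based edit distance relative to the empty string. */
    std::iota(r0.begin(), r0.end(), std::size_t{0});
    for (std::size_t i = 0; i < a.size(); i++) {
        /* Distance from the first i + 1 characters of a to the empty string.
           Set outside the inner loop so it also holds when b is empty. */
        r1[0] = i + 1;
        for (std::size_t j = 0; j < b.size(); j++) {
            bool const subst_needed = a[i] != b[j];
            /* dynamically optimize: delete, substitute (or match), insert */
            r1[j + 1] = std::min({r0[j + 1] + 1,
                                  r0[j] + (subst_needed ? subst_cost : 0),
                                  r1[j] + 1});
        }
        r0.swap(r1);
    }
    return r0[b.size()];
}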
|
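For what it's worth, here is roughly how the token-based modification could look. This is only a sketch: the name edit_dist_seq and the subst_cost parameter are my own additions, and it assumes the sequence type provides size() and operator[] and that its elements can be compared with !=.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Two-row Wagner-Fischer over any indexable sequence: a std::string
// (elements are characters) or a std::vector<std::string> (elements are tokens).
template <typename Seq>
std::size_t edit_dist_seq(Seq const& a, Seq const& b,
                          std::size_t subst_cost = 2) {
    std::vector<std::size_t> r0(b.size() + 1);  // previous row
    std::vector<std::size_t> r1(b.size() + 1);  // row being filled
    // Distance from the empty sequence to each prefix of b.
    std::iota(r0.begin(), r0.end(), std::size_t{0});
    for (std::size_t i = 0; i < a.size(); ++i) {
        r1[0] = i + 1;  // distance from a prefix of a to the empty sequence
        for (std::size_t j = 0; j < b.size(); ++j) {
            std::size_t const subst =
                r0[j] + (a[i] != b[j] ? subst_cost : 0);
            r1[j + 1] = std::min({r0[j + 1] + 1,   // delete a[i]
                                  subst,           // substitute or match
                                  r1[j] + 1});     // insert b[j]
        }
        r0.swap(r1);
    }
    return r0[b.size()];
}
```

Calling it with two std::vector<std::string> arguments compares token by token; passing 1 as the third argument gives the classic Levenshtein distance, while the default of 2 keeps the cost used in the character-based version.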
What is char(32)? Yeah, I know that it's ' ', but why not just write that? Magic numbers aren't good. Character literals are in the language for a reason! Use them! If you don't, you're assuming an ASCII-compatible character encoding by using the character codes directly.
(At the very least, the functional-style cast should probably be replaced with a list-initialization expression, char{32}, or a static_cast, to silence compiler warnings.)
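To make the point concrete, a small sketch (the names space and count_spaces are just placeholders of mine): the literal says what you mean, and if some code really does depend on the numeric value, a static_assert at least states the ASCII assumption out loud.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// The literal means "space" in whatever execution character set is in use.
constexpr char space = ' ';

// Only needed if other code relies on the numeric code: this stops the build
// on a platform where the space character is not 32.
static_assert(' ' == 32,
              "assumes an ASCII-compatible execution character set");

// Counts the space characters in s, with no magic number in sight.
std::size_t count_spaces(std::string const& s) {
    return static_cast<std::size_t>(std::count(s.begin(), s.end(), space));
}
```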