An article on Input Correction.

I recently wrote an article about a program I wrote which tries to suggest corrections to input. The idea was for command-line parameters, although it could, in theory, be used for any kind of input (even auto-correction in a text editor, although it would be really slow to operate on an exhaustive dictionary).

I got the idea from git(3). When you commit to the repository in git(3) you type git commit. If you mistype "commit" as "comit" git will produce the following output:
$ git comit
git: 'comit' is not a git command. See 'git --help'.

Did you mean this?
	commit

That's what I'm trying to achieve with my 'correct' program.

The thing is, I'm not entirely sure how "correct" the program is (excuse the pun :). Could I get some feedback on it? (i.e., does it work? how valid is the algorithm in the similarity() function? how could the program in general be improved? is there a simpler way that I'm missing?). One feature I'd like to add would be for the program to express its confidence that the suggestion is correct. Another would be to have a threshold for how close the strings have to be to qualify as valid suggestions.

Article URL: http://cplusplus.com/articles/iN3hAqkS/

Thanks! :)
1
2
int x = 0 - s1[i];
score += ABS(x);
I think this is a little harsh. You'll end comparing the lengths.
Try this metric http://en.wikipedia.org/wiki/Levenshtein_distance
Thanks! I've heard of that algorithm before but I didn't know it was related to this.
There's a more advanced version, called the Damerau-Levenshtein algorithm. It's a bit softer on 'common mistypes' such as two letters in the wrong sequence ('wrong' <-> 'worng'). The Levenshtein distance would be 2 (substitute o for r, and r for o), while the DL distance would be 1 (swap o and r).
That sounds even better. I'll try implementing that instead.

Edit: Does this look about right?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
/** Computes the Damerau-Levenshtein distance between two strings s and t */
static int damerau_levenshtein(const char* s, const char* t)
{
	size_t i, j, m = strlen(s), n = strlen(t);
	int d[m + 1][n + 1];
	/* Initialise 'd'
	 */
	for (i = 0; i < m; ++i)
		d[i][0] = i;
	for (j = 0; j < n; ++j)
		d[0][j] = j;
	/* Compute distances
	 */
	for (i =  1; i < m; ++i) {
		for (j = 1; j < n; ++j) {
			int	cost = (s[i] == t[j]) ? 0 : 1,
				a = d[i - 1][j    ] + 1,
				b = d[i    ][j - 1] + 1,
				c = d[i - 1][j - 1] + cost;
			d[i][j] = MIN3(a, b, c);
			if (i > 1 && j > 1 && s[i] == t[j - 1] && s[i - 1] == t[j - 1])
				d[i][j] = MIN(d[i][j], d[i - 2][j - 2] + cost);
		}
	}
	/* Return the distance
	 */
	return d[m][m];
}

I have a feeling I might have gotten the array subscripts wrong, I'm not sure if the way I converted d[i, j] into d[i][j] is right.
Last edited on
Topic archived. No new replies allowed.