1) Open both files.
2) Iterate over them both at the same time.
3) If the char is the same, add to an are_similar counter.
4) If the char is different, add to an are_different counter.
5) similarity_percentage = are_similar * 100 / (are_similar + are_different)
And here's some sample code to get you started. It wasn't tested.
@OP: you need to define a "distance".
Using Catfish3's definition you would say that
_ abcdefghijklmnopqrstuvwxyz
_ zabcdefghijklmnopqrstuvwxy
are completely different
Do you expect files to be identical and just want to count how many characters differ between them? Then go for something like Catfish3's suggestion.
If not, you need a far more sophisticated approach as ne555 suggested.
1) You will need a similarity measure and score for characters. In your case this may be easy, e.g., give each pair of equivalent characters in the two files a score of +1 if both chars are equal and 0 if they differ. Depending on the task, this similarity measure may be too simple though (Is an "E" as different from an "e" as from a "Y"?).
2) Because of possible insertions and deletions (the example ne555 gave contains both, or 1 "move"), the problem gets far more complicated because it is not at all obvious which characters in the two files are equivalent.
You may want to look at the UNIX diff command and the algorithms used in tools like it. Many version control systems, such as Git, need to perform this task a lot.
I suggest you look up "edit distance" on Google.
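To make the idea concrete, here is a rough sketch of the classic Levenshtein edit distance (insertions, deletions and substitutions each cost 1), computed with two rows of a dynamic programming table. The names `edit_distance` and `edit_similarity_percentage` are invented for this example, and dividing by the longer length to get a percentage is just one possible convention.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance: minimum number of single-character insertions,
// deletions and substitutions needed to turn s into t.
std::size_t edit_distance(const std::string& s, const std::string& t) {
    std::vector<std::size_t> prev(t.size() + 1), cur(t.size() + 1);
    for (std::size_t j = 0; j <= t.size(); ++j) prev[j] = j;  // "" -> t[0..j)
    for (std::size_t i = 1; i <= s.size(); ++i) {
        cur[0] = i;                                           // s[0..i) -> ""
        for (std::size_t j = 1; j <= t.size(); ++j) {
            std::size_t subst = prev[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);
            cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[t.size()];
}

// Assumed convention: similarity = 1 - distance / length of the longer string.
double edit_similarity_percentage(const std::string& s, const std::string& t) {
    std::size_t longest = std::max(s.size(), t.size());
    if (longest == 0) return 100.0;
    return 100.0 * (longest - edit_distance(s, t)) / longest;
}
```

With this measure the shifted alphabet example above is only 2 edits apart (one insertion, one deletion) instead of being "completely different". Note that the table takes time proportional to the product of the two lengths, so this is only practical for fairly small files.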
In Bioinformatics, a special version of your problem (where the 2 files being compared are actually DNA strands) is usually solved by dynamic programming, more specifically, the Needleman-Wunsch Algorithm.
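For illustration, a bare-bones version of the Needleman-Wunsch scoring step might look like this. It only computes the optimal global alignment score, not the alignment itself, and the scoring values (match +1, mismatch -1, gap -1) as well as the function name `needleman_wunsch_score` are assumptions chosen for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Optimal global alignment score of a and b under a simple linear gap penalty.
// The default scores are illustrative assumptions, not fixed by the algorithm.
int needleman_wunsch_score(const std::string& a, const std::string& b,
                           int match = 1, int mismatch = -1, int gap = -1) {
    std::vector<std::vector<int>> score(
        a.size() + 1, std::vector<int>(b.size() + 1, 0));
    // Aligning a prefix against the empty string costs one gap per character.
    for (std::size_t i = 1; i <= a.size(); ++i)
        score[i][0] = static_cast<int>(i) * gap;
    for (std::size_t j = 1; j <= b.size(); ++j)
        score[0][j] = static_cast<int>(j) * gap;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int diag = score[i - 1][j - 1] +
                       (a[i - 1] == b[j - 1] ? match : mismatch);
            int up   = score[i - 1][j] + gap;   // gap in b
            int left = score[i][j - 1] + gap;   // gap in a
            score[i][j] = std::max({diag, up, left});
        }
    }
    return score[a.size()][b.size()];
}
```

Changing the `match`/`mismatch` values per character pair is where a finer similarity measure (such as treating "E" and "e" as nearly equal) would plug in.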
Finally, in the case of XML, you may be better off using an XML parser first (unless you expect one of the files to be broken XML due to changes). I suggest you describe in more detail why you want to compare the files.
Okay. Well, I need a program like a plagiarism checker, but for only two files (either Excel or any text documents). I need to compare the two files and display their similarity percentage. Suppose the first word of file1 is "Is" and the first word of file2 is "is": the program should treat them as the same, count the total number of matching words, and display the PERCENTAGE.
I would appreciate any help :)
Well, by using getline I would compare the whole sentence rather than each and every word. And one more thing: I'm completely inserting some random files.
Excel, Word, PDF documents (to name a few) are not text documents!
Rule of thumb: if when you open it with Notepad, you see all kinds of crazy symbols, it's not a text document.
Well, by using getline I would compare the whole sentence rather than each and every word. And one more thing: I'm completely inserting some random files.
Not necessarily. You can pass a space as the delimiter:
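#include <fstream>
#include <string>
std::ifstream file("input.txt");
std::string str;
std::getline(file, str, ' ');  // reads up to the next space into str
// or, a bit worse: the member getline needs a buffer that already has room
char buf[256];
file.getline(buf, sizeof buf, ' ');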
I would appreciate any help :)
It is not my purpose here to bring you down, but it has to be said: to me it's clear you don't know what you're doing. You need to gain more knowledge before you can create this program.
And make no mistake, this program you want to create is not simple, which is why I doubt anyone here will give you a full solution.