Quick comparison of two files.

Feb 27, 2013 at 5:50pm

Hi all. Please advise on the fastest way to compare two files byte by byte.
File sizes vary from 1 byte to 2 GB.
Here is my current code:


//BUFFER_SIZE = 1 mb

bool FindDublicateFiles::isFilesEqual(const std::string& lFilePath, const std::string& rFilePath) const
{
    std::ifstream lFile(lFilePath.c_str(), std::ifstream::in | std::ifstream::binary);
    std::ifstream rFile(rFilePath.c_str(), std::ifstream::in | std::ifstream::binary);

    if(!lFile.is_open() || !rFile.is_open())
    {
        return false;
    }

    char *lBuffer = new char[BUFFER_SIZE]();
    char *rBuffer = new char[BUFFER_SIZE]();

    do {
        lFile.read(lBuffer, BUFFER_SIZE);
        rFile.read(rBuffer, BUFFER_SIZE);

        if (std::memcmp(lBuffer, rBuffer, BUFFER_SIZE) != 0)
        {
            delete[] lBuffer;
            delete[] rBuffer;
            return false;
        }
    } while (lFile.good() || rFile.good());

    delete[] lBuffer;
    delete[] rBuffer;
    return true;
}

Feb 27, 2013 at 8:43pm

toum (353)

As a quick test you could check if the 2 files have the same size.

On the two lines that allocate the buffers, the trailing () in new char[BUFFER_SIZE]() value-initializes the arrays, which sets every char to 0. That is unnecessary here and costs time for nothing.

You could use a buffer bigger than 1MB.

If you call this function a lot, you could also allocate the buffers elsewhere and pass them to the function. This way you'd avoid a lot of allocation/deallocation cycles.
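
The size check suggested above could look roughly like the sketch below. It is only an illustration: fileSize and sameSize are hypothetical helper names, not part of the original code, and the size is obtained by opening each file positioned at its end and asking tellg() for the offset.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: size of the file in bytes, or -1 if it cannot be opened.
std::streamoff fileSize(const std::string& path)
{
    // std::ios::ate positions the stream at the end of the file on opening.
    std::ifstream file(path.c_str(), std::ios::binary | std::ios::ate);
    if (!file.is_open())
        return -1;
    return file.tellg();
}

// Hypothetical helper: cheap pre-check to run before any byte-by-byte comparison.
bool sameSize(const std::string& lPath, const std::string& rPath)
{
    const std::streamoff lSize = fileSize(lPath);
    const std::streamoff rSize = fileSize(rPath);
    return lSize >= 0 && lSize == rSize;
}
```

Calling sameSize before the read loop lets files of different lengths be rejected without reading their contents.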

Feb 27, 2013 at 9:33pm

seftoner (4)

Thank you for responding.

On the two lines that allocate the buffers, the trailing () in new char[BUFFER_SIZE]() value-initializes the arrays, which sets every char to 0. That is unnecessary here and costs time for nothing.

If the file is smaller than the buffer, the rest of the buffer is filled with "garbage", which affects the result, so the buffer needs to be cleared.

do {
    lFile.read(lBuffer, BUFFER_SIZE);
    rFile.read(rBuffer, BUFFER_SIZE);
    numberOfRead = lFile.gcount(); // I only compare files of the same size

    if (std::memcmp(lBuffer, rBuffer, numberOfRead) != 0)
    {
        memset(lBuffer, 0, numberOfRead);
        memset(rBuffer, 0, numberOfRead);
        return false;
    }
} while (lFile.good() || rFile.good());

You could use a buffer bigger than 1MB.

Increasing the buffer size makes the comparison slower for small files.
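
For reference, a self-contained version of the chunked comparison that compares only the bytes actually read might look like this. It is a sketch under assumptions: filesEqualBuffered and its bufferSize parameter are illustrative names, the buffers are allocated without the trailing () so they are not zeroed, and a short read is treated as end of file without distinguishing it from a read error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <memory>
#include <string>

// Illustrative sketch: compares two files chunk by chunk.
bool filesEqualBuffered(const std::string& lPath, const std::string& rPath,
                        std::size_t bufferSize = 1 << 20)
{
    assert(bufferSize > 0);

    std::ifstream lFile(lPath.c_str(), std::ios::binary);
    std::ifstream rFile(rPath.c_str(), std::ios::binary);
    if (!lFile.is_open() || !rFile.is_open())
        return false;

    // No trailing (): the arrays are left uninitialized, which is fine
    // because only the bytes reported by gcount() are ever compared.
    std::unique_ptr<char[]> lBuffer(new char[bufferSize]);
    std::unique_ptr<char[]> rBuffer(new char[bufferSize]);
    const std::streamsize chunk = static_cast<std::streamsize>(bufferSize);

    for (;;) {
        lFile.read(lBuffer.get(), chunk);
        rFile.read(rBuffer.get(), chunk);
        const std::streamsize lCount = lFile.gcount();
        const std::streamsize rCount = rFile.gcount();

        // Different counts mean different lengths; otherwise compare the bytes read.
        if (lCount != rCount ||
            std::memcmp(lBuffer.get(), rBuffer.get(),
                        static_cast<std::size_t>(lCount)) != 0)
            return false;

        // A short (possibly empty) read on both streams: nothing left to compare.
        if (lCount < chunk)
            return true;
    }
}
```

Because the loop stops on the first short read, it does not depend on the state of good() after the final chunk.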


Feb 27, 2013 at 11:35pm

Cubbi (4774)

If you want to be efficient with file I/O, try memory mapping:

#include <iostream>
#include <algorithm>
#include <boost/iostreams/device/mapped_file.hpp>
namespace io = boost::iostreams;
int main()
{
    io::mapped_file_source f1("test.1");
    io::mapped_file_source f2("test.2");

    if(    f1.size() == f2.size()
        && std::equal(f1.data(), f1.data() + f1.size(), f2.data())
       )
        std::cout << "The files are equal\n";
    else
        std::cout << "The files are not equal\n";
}


Feb 28, 2013 at 4:07am

Cubbi (4774)

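Just for fun, I ran this on a few boxes. On Linux, I was comparing two copies of the Intel Parallel Studio distro (size 2,152,945,149 bytes); on Sun and IBM, two copies of some binary of size 898,215,121 bytes.

memory-mapped version (as posted by me)

gcc(linux)     0.92 s
intel(linux)   0.97 s
sun(sun)       4.04 s
xlc(ibm)       3.67 s

ifstream.read() version into a 1M buffer (as posted by seftoner)

gcc(linux)     1.80 s
intel(linux)   1.89 s
sun(sun)      14.5 s
xlc(ibm)       2.43 s

trivial I/O stream-based version

   if(std::equal(std::istreambuf_iterator<char>(f1),
                 std::istreambuf_iterator<char>(),
                 std::istreambuf_iterator<char>(f2)))

gcc(linux)    14.1 s
intel(linux)  35.2 s
sun(sun)      29.3 s
xlc(ibm)      27.1 s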
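
For completeness, the trivial stream-based version could be wrapped up as below. This is a sketch, not the exact program that was timed: filesEqualStream is a made-up name, and it uses the four-iterator overload of std::equal (C++14) so that files of different lengths compare unequal instead of reading past the end of the shorter one.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Illustrative sketch: character-by-character comparison through stream buffers.
bool filesEqualStream(const std::string& lPath, const std::string& rPath)
{
    std::ifstream f1(lPath.c_str(), std::ios::binary);
    std::ifstream f2(rPath.c_str(), std::ios::binary);
    if (!f1.is_open() || !f2.is_open())
        return false;

    typedef std::istreambuf_iterator<char> Iter;
    // A default-constructed istreambuf_iterator is the end-of-stream marker.
    // Passing an end for both ranges makes a length mismatch return false.
    return std::equal(Iter(f1), Iter(), Iter(f2), Iter());
}
```

As the timings above suggest, this form is convenient but much slower than block reads or memory mapping on large files.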

Mar 2, 2013 at 4:19pm

seftoner (4)

Just for fun, I ran this on a few boxes. On Linux, I was comparing two copies of the Intel Parallel Studio distro (size 2,152,945,149 bytes); on Sun and IBM, two copies of some binary of size 898,215,121 bytes.

memory-mapped version (as posted by me)

gcc(linux)     0.92 s
intel(linux)   0.97 s
sun(sun)       4.04 s
xlc(ibm)       3.67 s

ifstream.read() version into a 1M buffer (as posted by seftoner)

gcc(linux)     1.80 s
intel(linux)   1.89 s
sun(sun)      14.5 s
xlc(ibm)       2.43 s

trivial I/O stream-based version

   if(std::equal(std::istreambuf_iterator<char>(f1),
                 std::istreambuf_iterator<char>(),
                 std::istreambuf_iterator<char>(f2)))

gcc(linux)    14.1 s
intel(linux)  35.2 s
sun(sun)      29.3 s
xlc(ibm)      27.1 s

Wow, that is very fast! Why is my code so slow for me? For example, comparing two 650 MB files takes 40 seconds.

//bufferSize = 8 mb
{
    std::ifstream lFile(lFilePath.c_str(), std::ios::in | std::ios::binary);
    std::ifstream rFile(rFilePath.c_str(), std::ios::in | std::ios::binary);


    if(!lFile.good() || !rFile.good())
    {
        return false;
    }

    std::streamsize lReadBytesCount = 0;
    std::streamsize rReadBytesCount = 0;

    do {
        lFile.read(p_lBuffer, *bufferSize);
        rFile.read(p_rBuffer, *bufferSize);
        lReadBytesCount = lFile.gcount();
        rReadBytesCount = rFile.gcount();

        if (lReadBytesCount != rReadBytesCount || std::memcmp(p_lBuffer, p_rBuffer, lReadBytesCount) != 0)
        {
            return false;
        }
    } while (lFile.good() || rFile.good());

    return true;
}

Mar 2, 2013 at 5:08pm

JLBorges (13770)

bool FindDublicateFiles::isFilesEqual(const std::string& lFilePath,
                                      const std::string& rFilePath) const

If this is to be done many times for the same files:

1. Pre-compute and store a checksum (say MD5) for each large file (along with a timestamp).

2. If the file was not modified after the timestamp, compare the checksums first. Compare byte by byte only if the checksums and the file sizes match.
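
A rough sketch of the two steps above follows. Everything named here is hypothetical (FileDigest, DigestCache, fnv1a, mightBeEqual), it assumes C++17 for std::filesystem, and because the standard library has no MD5, a 64-bit FNV-1a hash stands in for the checksum.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <map>
#include <string>

namespace fs = std::filesystem;

// Hypothetical cache entry: what was known about a file when it was last hashed.
struct FileDigest {
    std::uintmax_t size;
    fs::file_time_type modified;
    std::uint64_t checksum;
};

// FNV-1a (64-bit), used here only as a stand-in for MD5.
std::uint64_t fnv1a(const std::string& path)
{
    std::ifstream file(path, std::ios::binary);
    std::uint64_t hash = 14695981039346656037ULL;   // FNV offset basis
    char buffer[4096];
    while (file.read(buffer, sizeof buffer) || file.gcount() > 0) {
        const std::streamsize count = file.gcount();
        for (std::streamsize i = 0; i < count; ++i) {
            hash ^= static_cast<unsigned char>(buffer[i]);
            hash *= 1099511628211ULL;               // FNV prime
        }
    }
    return hash;
}

class DigestCache {
public:
    // Step 1: return the stored digest, re-hashing only when the file's size
    // or modification time differs from what was recorded.
    // Note: fs::file_size and fs::last_write_time throw if the file is missing.
    const FileDigest& get(const std::string& path)
    {
        const std::uintmax_t size = fs::file_size(path);
        const fs::file_time_type modified = fs::last_write_time(path);
        auto it = cache_.find(path);
        if (it == cache_.end() || it->second.size != size ||
            it->second.modified != modified) {
            const FileDigest fresh{size, modified, fnv1a(path)};
            it = cache_.insert_or_assign(path, fresh).first;
        }
        return it->second;
    }

private:
    std::map<std::string, FileDigest> cache_;
};

// Step 2: false means the files definitely differ; true means sizes and
// checksums match, so a byte-by-byte comparison is still needed to confirm.
bool mightBeEqual(DigestCache& cache, const std::string& lPath,
                  const std::string& rPath)
{
    const FileDigest l = cache.get(lPath);
    const FileDigest r = cache.get(rPath);
    return l.size == r.size && l.checksum == r.checksum;
}
```

With a cache like this, each file is hashed once and only re-hashed after it changes, so repeated comparisons mostly reduce to comparing stored values.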

Mar 2, 2013 at 5:24pm

seftoner (4)

1. Pre-compute and store a checksum (say MD5) for each large file (along with a timestamp).

2. If the file was not modified after the timestamp, compare the checksums first. Compare byte by byte only if the checksums and the file sizes match.

That's what I do

Topic archived. No new replies allowed.
