Finding a char array in a large file

Feb 21, 2015 at 5:41pm
I have a specific byte (that is unsigned char) array, that I need to find in a very big file (2GB or so), currently I'm using:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
size_t fsFind(FILE* device, byte* data, size_t size)
{
        int SIZE = (1024 > size) ? 1024 : size;
        byte buffer[SIZE];
 
        int pos = 0;
        int loc = ftell(device);
 
retry:
        while(fread(buffer, sizeof(byte), SIZE, device) == SIZE)
        {
                for(int i = 0; i < SIZE; i++)
                {
                        if(buffer[i] == data[0])
                        {
                                for(int j = 0; j < size; j++)
                                {
                                        if(i + j < 1024)
                                        {
                                                if(buffer[i + j] != data[j])
                                                        goto nofound;
                                        }
                                        else
                                        {
                                                int pos = ftell(device);
                                                pos -= j;
                                                fseek(device, pos, SEEK_SET);
                                                goto retry;
                                        }
                                }
                                fseek(device, loc, SEEK_SET);
                                return pos + i;
                        }
nofound:;
                }
                pos += SIZE;
        }
        fseek(device, loc, SEEK_SET);
        return -1;
}


Which seems to find proper result on first use, but on subsequent searches it seems to find invalid positions, is there something I'm doing wrong, or is there a better way?
Last edited on Feb 21, 2015 at 5:41pm
Feb 22, 2015 at 12:38am
Have you tried rewriting it in C++ instead of C?
Feb 22, 2015 at 1:58am
Does your OS have the grep command? Use it if possible.
Feb 22, 2015 at 10:57am
@LB:
I don't see how using ifstream::read would make any difference, I thought these were merely wrappers

@dhayden
wouldn't grep be slower?
Feb 22, 2015 at 2:48pm
wouldn't grep be slower?

No, grep (or fgrep) would probably be faster. There are incredibly clever algorithms for searching for one string inside another. The larger the search string, the faster they go.

You should be limited only by the speed of your disks. In other words. you should be able to search the data as fast as you can read it from the disk drive.
Feb 22, 2015 at 3:46pm
interesing, but I can't seem to find any docs on fgrep, except for the shell, also does it work with binary files?
Last edited on Feb 22, 2015 at 3:47pm
Feb 22, 2015 at 4:58pm
Use a string search algorithm (instead of a pattern-matching algorithm) for this.

Boyer–Moore: https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

http://www.boost.org/doc/libs/1_57_0/libs/algorithm/doc/html/algorithm/Searching.html

Something along these lines (off the cuff, untested):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#include <iostream>
#include <vector>
#include <string>
#include <boost/algorithm/searching/boyer_moore.hpp>

std::streamoff find( std::string substr, std::istream& stm /* opened in binary mode */ )
{
    std::size_t str_size = substr.size() ;
    if( str_size > 1024 ) str_size = 1024 ;
    boost::algorithm::boyer_moore<std::string::iterator> boyer_moore( substr.begin(), substr.begin() + str_size ) ;

    std::size_t buff_size = 1024 * 1024 * 32 ; // 32 MB
    std::vector<char> corpus(buff_size) ;
    char* begin = std::addressof( corpus.front() ) ;

    std::streamoff offset = stm.tellg() ;
    std::size_t nread = buff_size ;

    while( nread == buff_size )
    {
        stm.seekg(offset) ;
        stm.read( begin, buff_size );
        nread = stm.gcount() ;
        if( nread == 0 ) break ;

        char* end = begin + nread ;
        char* found = boyer_moore( begin, end ) ;
        if( found != end ) return offset + (found-begin) ;

        offset += (buff_size-str_size) ;
    }

    return std::streamoff(-1) ;
}
Topic archived. No new replies allowed.