Problems with reading from File

Hi, I want to read the entire content of a file and store it in a buffer so I can process it further.

However, the buffer ends up containing a bit more than I want. The strange thing is that half of the extra content in the buffer is actually the last part of the file repeated; the other half is just empty/nonsense (Í).

Here's the code:
FILE* stats = fopen(".\\folder\\file.txt", "r");
//get the file size
fseek(stats, 0, SEEK_END);
int size = ftell(stats);	// size is now 95211
rewind(stats);

char* buffer = new char[size];
fread(buffer, size, 1, stats);	// now all the content of the file is in buffer, but as I described, its actually more than I want
fclose(stats);

Does anybody know what I am doing wrong here?


Edit: I also wanted to try this with the Windows API functions CreateFile and GetFileSizeEx, but I don't know how to create the array from the LARGE_INTEGER value that GetFileSizeEx returns. Could someone answer this bonus question? ^^
OK, I figured it out. The file contained word wraps, so it seems that fread cuts them off. Is that right?

The file contained exactly 95211 symbols, so the "get the file size" code was obviously correct. So I tried to figure out by how much the buffer was too big: 3883, which is exactly the number of lines in the file (well, 3884 - 1).

But now I would like to know: how does fwrite() know where to put the word wraps if they are not included in the saved buffer?
I always wrote the buffer out to another file so I could check that it had the correct content, and the word wraps were all set correctly.

And how can I count the lines of the file so I can set the buffer size correctly?
closed account (S6k9GNh0)
Word wraps don't exist in the file itself; they are produced by the text editor.
Then how does the text editor know when to do word wraps?
OK, if it depends on the text editor, the answer will probably vary, but I would be satisfied with one example.

I mean, I practically take only the bare content of the file, without word wraps and nothing more. Then I put this content into a brand-new file, but the text editor still knows where to place the word wraps? How can this be, I ask you :p

It would also be helpful if someone could answer these questions I already posted:
> I also wanted to try this with the Windows API functions CreateFile and GetFileSizeEx but I dont know how I can create the array with the Long Integer value that GetFileSizeEx returns
How do I do that?

How can I count the lines of the file so I can set the buffer size correctly?
closed account (S6k9GNh0)
Define "word wraps". A new line character or a carriage return character is not a "word wrap"; those characters are stored in the text.
Sorry, my English is not that good, so I wasn't sure if "word wrap" was the right expression.

All I know is that nothing like a "new line" sign (such as \n) appeared in the char* buffer = new char[size]; where I stored the text.

I also know that although the buffer had the same size as the file, it was still too big. And as I already said, the part that was too big had exactly the same size as the number of lines in the file, so these "new line" signs were obviously not stored in the buffer by fread.

Still, it knew where to put the new lines after I wrote the buffer to a completely new file with fwrite.
I don't know how to explain it better. If you can answer this question, I'm happy; if not, it's no big deal ;)
closed account (S6k9GNh0)
Word wrapping (or rather, *soft* word wrapping) simply emulates a newline to remove the hassle of scrolling over a mile to read a really long sentence. It doesn't place 0x0A or 0x0D into the file.

A new line and a carriage return are represented by the bytes 0x0A (new line) and 0x0D (carriage return). Without actually seeing the text file, I can't really explain any further what's going on.
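
One way to check what is actually stored is to open the file in binary mode and tally those two bytes. This is only a minimal sketch: LineEndingCounts and count_line_endings are names made up for illustration, not anything from the posts above. The number of 0x0A bytes is also a rough answer to the line-count question (a last line without a terminating new line is not counted).

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper type: tallies of the line-ending bytes found in a file.
struct LineEndingCounts
{
    std::size_t cr = 0 ;    // number of 0x0D (carriage return) bytes
    std::size_t lf = 0 ;    // number of 0x0A (new line) bytes
    std::size_t total = 0 ; // total number of bytes read
};

// Hypothetical helper: counts CR and LF bytes exactly as stored on disk.
LineEndingCounts count_line_endings( const std::string& path )
{
    // Binary mode: the bytes arrive as stored, with no new-line translation.
    std::ifstream file( path, std::ios::binary ) ;
    LineEndingCounts counts ;
    char c ;
    while( file.get(c) )
    {
        ++counts.total ;
        if( c == '\r' ) ++counts.cr ;      // 0x0D in ASCII
        else if( c == '\n' ) ++counts.lf ; // 0x0A in ASCII
    }
    return counts ;
}
```

If cr and lf come out equal and non-zero, the file most likely uses Windows-style CR+LF line endings.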
fseek(stats, 0, SEEK_END);
int size = ftell(stats);	// size is now 95211 

This will not reliably give you the size of the file (the number of chars in the file).
See: https://www.securecoding.cert.org/confluence/display/seccode/FIO19-C.+Do+not+use+fseek()+and+ftell()+to+compute+the+size+of+a+file


> so I tried to figure out how much the buffer was too big, which was 3883 which was the exact same number of the lines of the file (well 3884-1).
>
> but now I would like to know how does fwrite() know where to set the word wraps if they are not included in the saved buffer?
> because I always wrote the buffer into an external file again so I could take a look at it if it had the correct content and the word wraps were all correctly set.

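That is the second problem with the code. The file is opened in text mode ("r"), so on Windows each carriage return + new line pair (0x0D 0x0A) stored in the file is delivered to the program as a single '\n'. ftell() counted the bytes as stored, so fread() delivers one char fewer per line than size, and the tail of the buffer is never written. That explains why the count is off by the number of new-lines in the file, and why fwrite() on a text-mode stream puts the line endings back. See: http://en.wikipedia.org/wiki/Newline#In_programming_languages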
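
A sketch of one way around the mismatch, assuming plain C stdio is still wanted: skip the up-front size and rely on the return value of fread(), which is the number of items actually delivered. With an item size of 1, that is the number of chars after any new-line translation. The name read_text_file and the 4096-byte chunk size are assumptions for illustration only.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole text file and keeps only the chars that
// fread() actually delivered. Returns an empty vector if the file can't be opened.
std::vector<char> read_text_file( const std::string& path )
{
    std::vector<char> buffer ;
    std::FILE* f = std::fopen( path.c_str(), "r" ) ; // text mode, as in the original post
    if( !f ) return buffer ;

    char chunk[4096] ; // arbitrary chunk size
    std::size_t got ;
    // Item size 1, so the return value is the number of chars read this round.
    while( ( got = std::fread( chunk, 1, sizeof chunk, f ) ) > 0 )
        buffer.insert( buffer.end(), chunk, chunk + got ) ;

    std::fclose( f ) ;
    return buffer ;
}
```

The vector's size() is then the real character count, so nothing beyond the file's content ends up in the buffer.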

For a text file, the portable way to do this is to read in text mode (so that new-line translation is applied) until end of file is reached, resizing the buffer as required; reading char by char is the simplest form. Doing this in C++ is painless. For example:
        std::ifstream file( ".\\folder\\file.txt" ) ;
        std::vector<char> seq ;
        char c ;
        while( file.get(c) ) seq.push_back(c) ; // text mode: new-lines arrive as '\n'
        // if you want it as a c-style array (only valid if seq is not empty)
        char* buffer = &seq.front() ;

Another way of doing the same thing:
        std::ifstream file( ".\\folder\\file.txt" ) ;
        file >> std::noskipws ; // do not skip white space, including new-lines
        std::istream_iterator<char> bof(file), eof ;
        std::vector<char> seq( bof, eof ) ;
        // if you want it as a c-style array (only valid if seq is not empty)
        char* buffer = &seq.front() ;



Using fseek()/ftell() to determine the size will not work portably even for streams opened in binary mode; fseek(file, 0, SEEK_END) has undefined behaviour for a binary stream. By the way, the code example at http://www.cplusplus.com/reference/clibrary/cstdio/fread/ is one that leads to undefined behaviour.
closed account (S6k9GNh0)
1. The above method is *incredibly* slow and is exactly why C programmers tend to hate C++ programmers.

2. He's not opening the file in binary mode.
Okay, I'll read through the links when I have time. Thanks for the explanations, guys ;)
@JLBorges:
instead of
file >> std::noskipws ;
std::istream_iterator<char> bof(file), eof ;
do
std::istreambuf_iterator<char> bof(file), eof ;
Otherwise you're needlessly constructing and destroying the sentries on every character.
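
Put together, the suggestion might look like the sketch below. The name slurp is a hypothetical helper, and returning a std::vector<char> is just one possible choice.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole text file into a vector of chars.
std::vector<char> slurp( const std::string& path )
{
    std::ifstream file( path ) ; // text mode, as in the posts above
    // istreambuf_iterator reads straight from the stream buffer: no sentry is
    // constructed per character and white space is never skipped.
    std::istreambuf_iterator<char> bof(file), eof ;
    return std::vector<char>( bof, eof ) ;
}
```

If a raw pointer is needed, seq.data() is safe to call even when the result is empty (an empty or unopenable file gives an empty vector here), whereas &seq.front() is not.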
> Otherwise you're needlessly constructing and destroying the sentries on every character.

Yes, Cubbi, it would be faster (about 8% or so faster with the GNU implementation). Thanks.

The "*incredibly* slow" comment prompted me to do a performance measurement on the slowest machine I could lay my hands on (a netbook with a 1.6 GHz Atom processor). I didn't have access to Zapeth's file - it would have been too small (95 KB with about 4K new-lines) anyway, so I used a 2,493 KB file with about 236K new lines.

#include <iostream>
#include <fstream>
#include <ctime>

std::clock_t elapsed( const char* path )
{
    std::ifstream file(path) ;
    char c ;

    std::clock_t start = std::clock() ;

    while( file.get(c) ) ;

    return std::clock() - start ;
}

int main()
{
    const char* const path = R"(D:\usr\local\dict\words.txt)" ;

    std::cout << elapsed(path) / double(CLOCKS_PER_SEC) << " secs\n" ;
}


This took between 0.547 and 0.640 seconds with g++ 4.7 on MinGW. Equivalent code with Microsoft C++ 2010 took a bit longer: a steady 0.813 seconds.

Incredibly slow? I suppose it is a matter of opinion, but to me it isn't. Perhaps because I tend to treat performance as a design constraint rather than a design goal, and because I believe that programmer time is way more expensive than machine time.
Here's another one:

#include <iostream>
#include <fstream>
#include <sstream>
#include <cstdlib>
#include <ctime>

using namespace std;

void generate_test_file()
{
    ofstream fout("my_test_file.test");

    for (int i = 1; i <= 3000000; ++i)
    {
        fout << char('a' + rand() % 26);

        if (i % 10 == 0) fout << '\n';
    }
}

int main()
{
    generate_test_file();

    stringstream buffer;

    clock_t start = clock();

    buffer << ifstream("my_test_file.test").rdbuf();

    cout << (clock() - start) * 1.0 / CLOCKS_PER_SEC << endl;
}

It's cleaner, faster (0.063 seconds at most) and it actually creates the buffer.
(EDIT: I forgot to mention that the other test takes at most 0.141 seconds)