I am currently outlining a program I'd like to make, but before I can even really start to think about it, I need to know if there is a maximum file size that can be handled by fopen(). I have scoured a few other websites and they all mention an off_t internal variable that works as a physical limit to the file's size. However, I can't find a definite answer about off_t's underlying type. Some say it's a long int, some say it's a long long int... There is some implication that it might be system-dependent, which'd be a problem, given that my program would have to run on win32 and win64.
The files I've seen so far are 1-4GB, but I've been told there have been a few that were >10GB. Can fopen() (or some other standard-library function) deal with such files?
off_t does not itself limit the size of the file that can be created. It limits the locations to which you can seek, because
seek takes an off_t as either the offset from the beginning of the file, end of the file, or from the current file pointer.
(I am speaking here of POSIX APIs, as I am familiar with off_t's use there; I have no idea about Windows).
If you want to get the maximum value that an off_t can hold in a platform independent way, you can:
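Something along these lines, for instance. This is only a minimal C++ sketch, assuming std::numeric_limits and a platform whose <sys/types.h> provides off_t; the helper names max_of and off_t_can_address are made up for illustration:

```cpp
#include <sys/types.h>  // off_t (POSIX; assumed to be available on your toolchain)
#include <cassert>
#include <climits>
#include <limits>

// Largest value a signed integer type T can hold on this platform,
// widened to long long so the result can be compared or printed uniformly.
template <typename T>
long long max_of() {
    static_assert(std::numeric_limits<T>::is_integer, "integer types only");
    return static_cast<long long>(std::numeric_limits<T>::max());
}

// True if every byte position in a file of `size_bytes` bytes
// can be expressed as an off_t.
bool off_t_can_address(long long size_bytes) {
    assert(size_bytes >= 0);
    return size_bytes <= max_of<off_t>();
}
```

Printing max_of<int>(), max_of<long>() and max_of<off_t>() side by side shows which of them top out at 2^31-1 on a given build.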
I was about to edit my post because I've discovered a few places that say that off_t wouldn't actually be the problem, but rather, to quote a reply to a similar question in another forum:
It isn't usually a problem to *open* an existing large file: the problem is usually in *reading* the large file. Some systems are only able to read to about the 2 gigabyte mark, or are allowed to read indefinitely but cannot position (ftell/fseek) beyond 2 gigabytes without using system-specific calls.
And obviously, fseek'd probably be the most important function in my program, since I'd have to skip literally a few dozen thousand lines at a time to get to the part of the file that actually interests me.
And apparently the maximum value of off_t is 2147483647, aka INT_MAX. So it's not even a longint. But I still don't actually know what that means in terms of file-size. >.>
EDIT: And I just had the craziest idea. Obviously, you can't just go #define off_t long long off_t , right? Your own definitions don't actually go into the source code of the functions you call, right? Hell, can you even do that? Define one variable type as another?
There is a little trick that may or may not be useful to you. Say off_t has max value 2^32 - 1 (2 GB, as the
value you've discovered). If you want to seek to, say, the 2 * ( 2 ^32 - 1 ) byte, you can, with the file
pointer at the beginning of the file, do two consecutive relative seeks of 2^32 - 1 bytes.
I don't know what compiler you are using, but I had this problem with gcc/g++ a while back, and it turns out
there is a compiler switch: -D_FILE_OFFSET_BITS=64 that automagically causes off_t to be 64 bits instead
of 32 (on my platform).
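If you'd rather not depend on the command line, here is a rough sketch of the same idea inside the source file. It assumes a glibc-style toolchain that honors _FILE_OFFSET_BITS, and the function names are hypothetical. Note that the type is still called off_t afterwards, and that the macro should be defined before the very first #include, otherwise it may have no effect:

```cpp
// Must come before any #include, or be passed as -D_FILE_OFFSET_BITS=64.
#define _FILE_OFFSET_BITS 64
#include <sys/types.h>  // off_t
#include <cassert>
#include <climits>      // CHAR_BIT
#include <cstddef>

// Width of off_t, in bits, as seen by this translation unit.
std::size_t off_t_bits() {
    return sizeof(off_t) * CHAR_BIT;
}

// True when offsets past the 2 GB mark fit in an off_t.
bool has_large_file_offsets() {
    return off_t_bits() >= 64;
}

// Fail fast at startup if the build did not pick up 64-bit offsets.
void require_large_file_offsets() {
    assert(has_large_file_offsets() && "off_t is narrower than 64 bits");
}
```

If off_t_bits() still reports 32, the usual suspects are a header included ahead of the #define, or a toolchain that ignores the macro altogether.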
Also, if you are using a POSIX-compliant platform, you can consider using the system call open() instead of
fopen(), as yes, FILE* may very well have a limitation due to an internal off_t variable or two in the struct.
But open() will not have that exact limitation since it returns just an integer file descriptor (there is no
user-land data structure).
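To make the file-descriptor route concrete, here is a small POSIX-only sketch (it will not build with a plain Windows toolchain). read_at is a hypothetical helper name, and the code assumes off_t is 64 bits wide, either natively or via _FILE_OFFSET_BITS:

```cpp
// Request a 64-bit off_t where the toolchain supports it; must precede the includes.
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>      // open, O_RDONLY
#include <unistd.h>     // lseek, read, close
#include <sys/types.h>  // off_t
#include <cassert>
#include <cstddef>

// Read up to `count` bytes starting at absolute byte `offset` of `path`.
// Returns the number of bytes read (0 at or past end of file), or -1 on error.
long long read_at(const char* path, long long offset, char* buf, std::size_t count) {
    assert(offset >= 0);
    static_assert(sizeof(off_t) >= 8, "this sketch assumes a 64-bit off_t");

    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    long long result = -1;
    // lseek returns (off_t)-1 on failure, otherwise the new absolute position.
    if (lseek(fd, static_cast<off_t>(offset), SEEK_SET) != static_cast<off_t>(-1)) {
        result = static_cast<long long>(read(fd, buf, count));
    }
    close(fd);
    return result;
}
```

Seeking past the end of the file is not an error by itself; the following read() simply returns 0.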
And would that trick work? Doesn't fseek simply work on the fpos_t variable contained in FILE*?
And using your little code, I found out
a) for whatever reason long_int::max is also INT_MAX. You'd think it'd be bigger, right?
b) INT_MAX (int::max and long_int::max) is actually 2^31-1
c) fpos_t is a long long int, so fpos_t::max = 2^63-1
All this is very strange. Or, well, not really. One bit needs to be dedicated to the sign, after all. But why oh why is fpos_t signed? What use is there for a negative value in file positions?
Anyways, back to the question at hand, can you simply use fseek repeatedly even if the target is beyond its own internal range? Doesn't fpos_t still set a limit at 4GB?
The trick probably wouldn't work on a FILE*, only on a file descriptor.
a) IIRC the standard mandates that sizeof( long int ) >= sizeof( int ) -- note the equals.
Also IIRC "int" in the C standard was supposed to be machine word size. I would venture
a guess that "long int" would be the largest integral type that can be manipulated in a single
arithmetic instruction. The two could be equal, and on Intel they are.
b) Oops, my bad.
Why is fpos_t signed? Good question. My cynical side: probably because
printf() returns an int and strlen() returns an int. Don't you like strings that have
length -42? I figure that having a string of negative length means it gives me
memory, so I never need to buy more RAM. I just have to instantiate enough of
those :)
I have not tried with fseek(); because I'm on linux, I always use open(), close(), read(),
and write(). But fpos_t, being a 64-bit value, actually gives you 2 billion * ( 4 GB ), or
about 8 billion gigabytes, not 4GB.
So I'm guessing that's not really much of a problem then, huh?
And I'm not entirely sure I understood what you meant in the first part. FILE* contains a file position descriptor (fpos_t). Sure, you use fseek(FILE* fp), but (in my oh-so-limited understanding of it) all it really does is change the value of fp.(fpos_t), right?
Or do I just not get the difference between FILE* (and its fpos_t variable) and a file descriptor?
And I've never had a string return a negative value. That sounds fun.
fopen() returns a FILE*; open() returns an int. This int, in POSIX-land, is called a file descriptor, and
is different from a FILE*.
Me neither on the strlen() thing; it's just the idea that it could return a negative value thanks
to the data type. Actually, one of my pet peeves of C/C++ is that "int" is too easy to type, so a lot
of people tend to use it for everything integral -- lengths of strings, number of elements in a
container, etc -- even though all negative values are nonsensical (unsigned int would be better,
but is also much more to type).
Well, I'm going to give my program a try. Should be easy enough to see the feasibility of it. I'll just get that 11GB file and test fopen() and fseek(fp,ABSURDLY_LARGE_NUMBER,SEEK_CUR) and see what happens.
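One thing to watch in that test: plain fseek() takes a long, which is 32 bits on win32 and win64 alike, so the offset argument itself may be the bottleneck. A hedged sketch of a wrapper follows, assuming the Microsoft CRT's _fseeki64/_ftelli64 on Windows and POSIX fseeko/ftello elsewhere; seek64 and tell64 are made-up names:

```cpp
// Request a 64-bit off_t on glibc-style toolchains; must precede the includes.
#define _FILE_OFFSET_BITS 64
#include <cstdio>
#include <cstdint>
#include <cassert>
#if !defined(_WIN32)
#include <sys/types.h>  // off_t
#endif

// Seek to a 64-bit offset. Returns 0 on success, nonzero on failure.
int seek64(std::FILE* fp, std::int64_t offset, int whence) {
    assert(fp != nullptr);
#if defined(_WIN32)
    return _fseeki64(fp, offset, whence);  // Microsoft CRT extension
#else
    static_assert(sizeof(off_t) >= 8, "this sketch assumes a 64-bit off_t");
    return fseeko(fp, static_cast<off_t>(offset), whence);  // POSIX
#endif
}

// Current position as a 64-bit value, or -1 on failure.
std::int64_t tell64(std::FILE* fp) {
    assert(fp != nullptr);
#if defined(_WIN32)
    return _ftelli64(fp);
#else
    return static_cast<std::int64_t>(ftello(fp));
#endif
}
```

With the 11GB file, calling seek64(fp, offset, SEEK_SET) and then comparing tell64(fp) against the same offset is a quick way to see whether the position actually got past the 2GB mark.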
I'm dealing with the same problem here. I want to be able to seek beyond the 2GB limit. In order to do that I added the directive "_FILE_OFFSET_BITS 64", but sizeof(off_t) still returns 4. I thought it should be 8 now. Also if I try to define a variable of type off_t64 the compiler tells me that "‘off_t64’ was not declared in this scope".
Btw I'm using the POSIX APIs (lseek()).
I'm looking forward to seeing the results of your experiments, keep us informed please! Many thanks for taking the time to read this.