Character Count != Character Count

I had to write a program to count characters in a file. Here's the working bit of that:

while(infile.good())
{
  inchar = infile.get();
  ++counter;
}


To check my work, I ran the program against a reasonably large text file I downloaded from Project Gutenberg. To make sure my count was right, I opened the same file in UltraEdit... and the counts were not the same. Check this out...

Character Counts:
-----------------
UltraEdit : 770577
notepad++ : 770577
OpenOffice Writer : 737415
C++ : 753996

Convinced I botched it, I wrote a text file of a precise length and ran it again, and sure enough, my count was different. Here is what I think is happening.

C++ reads the newline character as a single character. Looking at the actual value of inchar, I can see that it's 10. But DOS ANSI / ASCII says the newline is actually 2 characters -- carriage return and line feed -- 13 and 10 respectively. Similarly, Unix says \n = 10, and classic Mac OS says \n = 13.

Sure enough, when I changed my code to this:
while(infile.good())
{
  inchar = infile.get();
  ++counter;
  if(inchar==10)
    ++counter;
}


the count came out to 770577, matching UltraEdit and Notepad++. However, this is not a proper approach, because the file actually was DOS-encoded us-ascii when downloaded from Project Gutenberg, meaning the line terminators really were two characters: CR (13) and LF (10). So strictly speaking, UE and Notepad++ both had it right, and C++ had it wrong. UE and Notepad++ were both consistent with the number of bytes on disk for that text file.

Using the feature in UltraEdit, I changed the line terminators in the file to Unix (ascii 10). This changed the actual size of the file on disk from 753K to 737K. I ran the C++ program again, this time using the first version of the C++ program. The C++ program reported the same character count (753996) as before, and this time both UE and notepad++ reported 753996 characters -- same as the C++ program. This also made the file practically unreadable in Windows notepad.exe.

What's interesting about this is that even though the byte count of the file was physically reduced, as evidenced by the difference in file size, C++ reported it the same. So as far as C++ is concerned \n might be 10, or 13, or 10 and 13. To C++, it's still one character, even if it's physically two bytes.

OpenOffice.org Writer, meanwhile, doesn't count the line terminators at all, and so came in 16581 characters below the C++ count and 33162 below the actual byte count of the original file on disk. In fact, when I tried to open the Unix-formatted file, OpenOffice.org Writer didn't know what to do with it, and decided to import it as a delimited spreadsheet! (UE and Notepad++ both handled the file without a problem.)
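The three counts above are consistent with a file holding 16581 CRLF line endings: 770577 bytes in total, 753996 once each CR is left out, and 737415 once each LF is left out as well. One way to check that on any file is to tally the bytes directly. This is only a minimal sketch; ByteCounts and tally_bytes are names made up for illustration, and it assumes an ASCII-compatible character set where '\r' is 13 and '\n' is 10.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper type: raw byte tallies for one file.
struct ByteCounts {
    long long total = 0;  // every byte on disk
    long long cr = 0;     // carriage returns (13)
    long long lf = 0;     // line feeds (10)
};

// Reads the file in binary mode so no CRLF -> '\n' translation happens.
// Returns all zeros if the file cannot be opened.
ByteCounts tally_bytes(const std::string& path)
{
    ByteCounts counts;
    std::ifstream infile(path, std::ios::binary);
    char c;
    // get(char&) signals end-of-file through the stream state,
    // so every byte value (including 0xFF) is counted.
    while (infile.get(c)) {
        ++counts.total;
        if (c == '\r')
            ++counts.cr;
        else if (c == '\n')
            ++counts.lf;
    }
    return counts;
}
```

For a DOS-format file, total should match what UltraEdit and Notepad++ show, total - cr should match a text-mode count on Windows, and total - cr - lf should match a tool that ignores line terminators entirely.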

This is not an earthshaking discovery, but I thought it was interesting. Maybe some other newbies will too when they attempt to write their own character-counting program.
That's because you opened the file as text. The library does an automatic newline translation depending on the platform, both when it writes and when it reads text files.
Since CRLF happens to be Windows' standard newline, it is translated to '\n' on reading. If you had run the program on Linux or MacOS, the result would have been the one you were expecting (IINM).

To get the results you need, you should open the file as binary, but keep in mind that get() is only reliable there if you keep its return value in an int: stored in a char, the byte 0xFF can become indistinguishable from EOF, and binary files can feasibly contain any byte value. You can use read(), or a combination of seekg() and tellg(), instead.
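A minimal sketch of the binary-mode approach, reading in chunks with read() and adding up gcount(). The function name count_bytes and the 4096-byte buffer are arbitrary choices for illustration, not anything from the original program.

```cpp
#include <fstream>
#include <string>

// Counts every byte in the file, CR and LF included.
// Returns -1 if the file cannot be opened.
long long count_bytes(const std::string& path)
{
    // Binary mode: bytes are delivered as stored, with no newline translation.
    std::ifstream infile(path, std::ios::binary);
    if (!infile)
        return -1;

    char buffer[4096];
    long long counter = 0;
    // The last read() may fill only part of the buffer and then fail;
    // gcount() reports how many bytes that call actually extracted.
    while (infile.read(buffer, sizeof buffer) || infile.gcount() > 0)
        counter += infile.gcount();
    return counter;
}
```

Because nothing is compared against EOF here, a 0xFF byte in the file is counted like any other.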

And next time you need to get the size of a file, just right-click->properties; or, if you like the command line, dir <file name>.
Isn't there a C library that operates on directories and such that could be used to determine the size without actually counting the bytes manually?
Actually, no. Although there should be one.
You can use seekg() and tellg() to get the size of an open file, though.
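Something like the following should do it. Treat it as a sketch: file_size_bytes is a made-up name, and it assumes the file is opened in binary mode so that the reported position corresponds to a byte offset.

```cpp
#include <fstream>
#include <string>

// Returns the size of the file in bytes, or -1 if it cannot be opened.
long long file_size_bytes(const std::string& path)
{
    std::ifstream infile(path, std::ios::binary);
    if (!infile)
        return -1;
    infile.seekg(0, std::ios::end);                 // jump to the end of the file
    return static_cast<long long>(infile.tellg());  // position at the end = size
}
```

On a compiler with C++17 support, std::filesystem::file_size gives the same number without opening the file.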

(I edited my previous post when I realized my mistake.)
Ok, cool. Thanks for the clarification!