I had to write a program to count characters in a file. Here's the working bit of that:
    // inchar is assumed to be declared as char
    while (infile.get(inchar))   // false at end of file, so the failed read is not counted
    {
        ++counter;
    }
To check my work, I ran the program against a reasonably large text file I downloaded from Project Gutenberg. To make sure my count was right, I opened the same file in UltraEdit... and the counts were not the same. Check this out...
Character Counts:
-----------------
UltraEdit         : 770577
Notepad++         : 770577
OpenOffice Writer : 737415
C++               : 753996
Convinced I botched it, I wrote a text file of a precise length and ran it again, and sure enough, my count was different. Here is what I think is happening.
C++ reads the newline as a single character. Looking at the actual value of inchar, I can see that it's 10. But a DOS ANSI / ASCII text file ends each line with two characters, carriage return and line feed, 13 and 10 respectively. By contrast, Unix uses a lone line feed (\n = 10), and classic Mac OS uses a lone carriage return (13).
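To see which convention a file really uses, the raw bytes can be classified directly. The following is only a rough sketch: `LineEndingCounts` and `count_line_endings` are names made up for illustration, and it assumes the bytes were read without any newline translation (for example, from a stream opened in binary mode).

```cpp
#include <cstddef>
#include <string>

// Tallies of each line-terminator style found in a buffer of raw bytes.
struct LineEndingCounts {
    std::size_t crlf = 0;     // CR (13) followed by LF (10): DOS/Windows
    std::size_t lone_lf = 0;  // LF (10) on its own: Unix
    std::size_t lone_cr = 0;  // CR (13) on its own: classic Mac OS
};

// Hypothetical helper: scans untranslated bytes and classifies each terminator.
LineEndingCounts count_line_endings(const std::string& bytes) {
    LineEndingCounts counts;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == '\r') {
            if (i + 1 < bytes.size() && bytes[i + 1] == '\n') {
                ++counts.crlf;
                ++i;  // skip the LF that belongs to this pair
            } else {
                ++counts.lone_cr;
            }
        } else if (bytes[i] == '\n') {
            ++counts.lone_lf;
        }
    }
    return counts;
}
```

For a file like mine, I would expect every terminator to land in `crlf` before the conversion to Unix format and in `lone_lf` afterwards.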
Sure enough, when I changed my code to this:
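    // inchar is assumed to be declared as char
    while (infile.get(inchar))   // false at end of file, so the failed read is not counted
    {
        ++counter;
        if (inchar == 10)        // line feed: add one for the carriage return the program never saw
            ++counter;
    }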
the count came out to 770577, matching UltraEdit and Notepad++. However, this is not a proper approach. The file really was DOS-encoded US-ASCII when downloaded from Project Gutenberg, meaning the line terminators really were two characters: CR (13) and LF (10). So strictly speaking, UE and Notepad++ both had it right, and my C++ program had it wrong. UE and Notepad++ were both consistent with the number of bytes on disk for that text file.
Using UltraEdit's line-terminator conversion feature, I changed the line terminators in the file to Unix (ASCII 10). This reduced the size of the file on disk from 753K to 737K. I then ran the first version of the C++ program again. It reported the same character count (753996) as before, and this time both UE and Notepad++ also reported 753996 characters, the same as the C++ program. The conversion also made the file practically unreadable in Windows notepad.exe.
What's interesting about this is that even though the file physically shrank, as the difference in file size shows, C++ reported the same count. So as far as C++ is concerned, the newline on disk might be 10, or 13, or 13 followed by 10, depending on the platform: a stream opened in text mode, which is the default, translates the platform's line terminator into a single \n. To C++, it's still one character, even if it's physically two bytes.
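If the goal is the byte count that UltraEdit and Notepad++ report, one option is to open the stream in binary mode so that no newline translation happens. This is a minimal sketch, not the program I actually ran: `count_chars`, `count_text_chars` and `count_bytes` are hypothetical names, and the text-mode result depends on the platform (on Windows each CR+LF pair arrives as one character, while on Unix-like systems both modes should give the same number).

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Counts the characters delivered by a file stream opened with the given mode.
std::size_t count_chars(const std::string& path, std::ios_base::openmode mode) {
    std::ifstream infile(path, mode);
    std::size_t counter = 0;
    char inchar;
    while (infile.get(inchar)) {  // the failed read at end of file is not counted
        ++counter;
    }
    return counter;
}

// Text mode (the default): line terminators may be translated to a single '\n'.
std::size_t count_text_chars(const std::string& path) {
    return count_chars(path, std::ios_base::in);
}

// Binary mode: every byte on disk is delivered unchanged.
std::size_t count_bytes(const std::string& path) {
    return count_chars(path, std::ios_base::in | std::ios_base::binary);
}
```

Calling `count_bytes` on the original DOS-format file should give the on-disk size, without the add-one-per-line-feed workaround.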
OpenOffice.org Writer, meanwhile, doesn't count the newline at all, and so came in 16581 characters below the C++ count, and 33162 below the byte count of the original DOS file on disk. In fact, when I tried to open the Unix-formatted file, OpenOffice.org Writer didn't know what to do with it, and decided to import it as a delimited spreadsheet! (UE and Notepad++ both handled the file without a problem.)
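The three different counts reconcile if the file has 16581 line endings. The check below just restates that arithmetic with the numbers from the table above. The constant names are my own, and the reading that Writer skips terminators entirely is an inference from the numbers, not something I confirmed in its documentation.

```cpp
// Counts copied from the table of results.
constexpr long bytes_on_disk   = 770577;  // UltraEdit and Notepad++
constexpr long text_mode_count = 753996;  // my C++ program
constexpr long writer_count    = 737415;  // OpenOffice.org Writer

// Each CR+LF pair is one byte longer than the single '\n' the program saw,
// so the difference is the number of line endings in the file.
constexpr long line_endings = bytes_on_disk - text_mode_count;

static_assert(line_endings == 16581,
              "one extra byte per line ending");
static_assert(text_mode_count - line_endings == writer_count,
              "dropping the terminators entirely gives Writer's count");
static_assert(bytes_on_disk - writer_count == 2 * line_endings,
              "Writer is two bytes per line short of the DOS file size");
```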
This is not an earthshaking discovery, but I thought it was interesting. Maybe some other newbies will too when they write their own character-counting program.