New line character

Consider the following code.
#include <iostream>     
#include <fstream>      
int main () {

  std::ifstream is ("example.txt");
  if (is) {
    // get length of file:
    is.seekg (0, is.end);
    int length = is.tellg();
    is.seekg (0, is.beg);

    char * buffer = new char [length];

    std::cout << "Reading " << length << " characters... ";
    // read data as a block:
    is.read (buffer,length);

    if (is)
      std::cout << "all characters read successfully.";
    else
      std::cout << "error: only " << is.gcount() << " could be read";
    is.close();
    // ...buffer contains the entire file...
    std::cout<<std::endl;
    for(int i = 0; i < length; i++)
    {
        std::cout<<buffer[i];
    }

    delete[] buffer;
  }
  return 0;
}

Say I have a file with only a newline in it.
My question is: how is the newline character stored in buffer, why is the newline counted as 2 characters, and why is only 1 of those 2 characters extracted?
closed account (2AoiNwbp)
Hi Void life,

How did you create your example.txt? Did you use Unicode characters? They are 2 bytes long (in UTF-16, which Notepad calls "Unicode"). If so, bytes are inverted in the buffer, that means that the LSB is at left, and MSB at right. Thus, '\n' character is 0x000A, but in buffer is stored as 0x0A00, and you are using char*.
That's what I think is happening..

regards,
Alejandro
Hey,
Thanks for the reply alejandro

Um, I just created example.txt in Notepad on Windows. I'm not sure, but I think it uses Unicode characters.

If so, bytes are inverted in the buffer, that means that the LSB is at left, and MSB at right. Thus, '\n' character is 0x000A, but in buffer is stored as 0x0A00, and you are using char*.


I don't exactly understand what you are saying here could you elaborate a little please? If I use is.read() to extract data from the file into the char pointer called buffer, do the bytes of the extracted data become inverted? In any case, even if the newline character is 2 bytes long, it's still only "1" character, right? So how come tellg(), which returns the position of the current character in the input stream, returns 2 instead of 1?
closed account (2AoiNwbp)
I don't exactly understand what you are saying here could you elaborate a little please?
Yes, sure. If we take a single character like 'H', its ASCII code is 72, and its Unicode value is also 72, but occupying two bytes instead of one (when stored as UTF-16).
So you can see it as:

0(MSB) 72(LSB),

but in memory they are inverted, so you are going to see them as

72(LSB) 0(MSB).

I put together this little program for you to see how Unicode characters are stored in memory.
#include <cstdlib>   // system()
#include <iostream>

using namespace std;

int main()
{
	char cLetter = 'H';
	wchar_t wcLetter = L'H';
	char* pbuf = &cLetter;

	cout << "Sizeof(" << cLetter << ") = " << sizeof(cLetter) 
		 << " byte" << (sizeof(cLetter)>1 ? "s" : "") << endl;
	cout << cLetter << " = " << (int)*pbuf << endl;
	
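	// Narrow cout cannot print a wchar_t as a letter, so name it literally.
	// sizeof(wchar_t) is 2 on Windows; it may be 4 on other platforms.
	cout << "Sizeof(L'H') = " << sizeof(wcLetter) 
		 << " byte" << (sizeof(wcLetter)>1 ? "s" : "") << endl;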
	pbuf = (char*)&wcLetter;
	cout << "*(pbuf + 0) = " << (int)*pbuf << "\tLSB" << endl;
	cout << "*(pbuf + 1) = " << (int)*(pbuf + 1) << "\t\tMSB" << endl;

	system("pause");
	return 0;
}

closed account (2AoiNwbp)
On the other hand, if your file example.txt is not saved with Unicode characters, I don't know what else could be happening... maybe Notepad adds extra characters? I don't know.

regards,
Alejandro
It is due to line endings. On most Unix-like systems, a line ends with LF (line feed, '\n'). On Windows, however, the default line ending is CR+LF (carriage return + line feed, "\r\n"). You normally only notice the difference when you open your file in binary mode (which stops the line endings from being translated), or when you measure the file's size in bytes, as your code does with seekg() and tellg().
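To see which bytes are really in the file, one option is to dump them in binary mode. This is only a minimal sketch: the helper names read_raw_bytes and to_hex are made up for illustration and are not part of the original code.

```cpp
#include <cstddef>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read every byte of a file exactly as stored on disk.
// std::ios::binary turns off newline translation.
std::vector<unsigned char> read_raw_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> bytes;
    char c;
    while (in.get(c)) {
        bytes.push_back(static_cast<unsigned char>(c));
    }
    return bytes;
}

// Hypothetical helper: format bytes as space-separated two-digit hex.
std::string to_hex(const std::vector<unsigned char>& bytes) {
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (i != 0) out << ' ';
        out << std::setw(2) << static_cast<int>(bytes[i]);
    }
    return out.str();
}
```

Printing to_hex(read_raw_bytes("example.txt")) for a file saved by Notepad with a single empty line should show a 0d byte followed by a 0a byte, assuming the file was saved as plain ANSI or UTF-8 without a byte order mark.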
Thanks Alejandro. I kind of see what you were saying now.

NT3
Thanks,
Ahh, so that explains why tellg() was returning 2. But any ideas why read() reads only the newline character?
When a file is opened in text mode (the default), \r\n newlines will automatically be read as \n on Windows. This is useful because it makes it possible to handle text files the same way on Windows, Linux and everywhere else, despite the difference in how line endings are marked. If you don't want this behaviour, open the file in binary mode.

 
std::ifstream is ("example.txt", std::ios::binary);

Thanks Peter. So if I'm understanding right, this is what happens when example.txt has only a newline:
is.tellg() returns 2, as the default line ending is "\r\n", which is "2 characters". Then is.read(buffer, length) reads the "\r\n" as 1 character, and there aren't any more characters, but you told it to read 2 characters, which is why you get
error: only 1 could be read
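
If the file has to stay in text mode, one way to cope with the short read summarized above is to treat the size from tellg() as an upper bound only and keep what gcount() reports. This is a rough sketch, not the only approach: read_text_file is a made-up name, and it assumes tellg() at the end of the file gives the size in bytes, which holds in practice on common platforms.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: read a whole file in text mode.
// The size from tellg() is only an upper bound, because each "\r\n"
// on disk may be translated to a single '\n' while reading.
std::string read_text_file(const std::string& path) {
    std::ifstream in(path);  // text mode (the default)
    if (!in) return std::string();

    in.seekg(0, std::ios::end);
    std::streamoff size = in.tellg();  // size on disk, "\r\n" counts as 2
    in.seekg(0, std::ios::beg);
    if (size <= 0) return std::string();

    std::string buffer(static_cast<std::size_t>(size), '\0');
    in.read(&buffer[0], size);  // may stop early at end of file
    // Keep only the characters that were actually extracted.
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}
```

Here a short read is treated as normal instead of an error, so a file containing only a Windows newline would come back as a one-character string.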