New line character

May 2, 2014 at 11:09pm

Consider the following code.

#include <iostream>     
#include <fstream>      
int main () {

  std::ifstream is ("example.txt");
  if (is) {
    // get length of file:
    is.seekg (0, is.end);
    int length = is.tellg();
    is.seekg (0, is.beg);

    char * buffer = new char [length];

    std::cout << "Reading " << length << " characters... ";
    // read data as a block:
    is.read (buffer,length);

    if (is)
      std::cout << "all characters read successfully.";
    else
      std::cout << "error: only " << is.gcount() << " could be read";
    is.close();
    // ...buffer contains the entire file...
    std::cout<<std::endl;
    for(int i = 0; i < length; i++)
    {
        std::cout<<buffer[i];
    }

    delete[] buffer;
  }
  return 0;
}


Say I have a file with only a newline in it.
My question is: how is the newline character stored in buffer, why is the newline counted as 2 characters, and why is only 1 of those 2 characters extracted?

May 2, 2014 at 11:49pm

closed account (2AoiNwbp)

Hi Void life,

How did you create your example.txt? Did you use Unicode characters? They are 2 bytes long. If so, the bytes are inverted in the buffer, meaning that the LSB is at the left and the MSB at the right. Thus, the '\n' character is 0x000A, but in the buffer it is stored as 0x0A00, and you are using char*.
That's what I think is happening..

regards,
Alejandro

Last edited on May 2, 2014 at 11:50pm

May 3, 2014 at 12:27am

Void life (71)

Hey,
Thanks for the reply, Alejandro.

Um, I just created example.txt in Notepad on Windows. I'm not sure, but I think it uses Unicode characters.

If so, the bytes are inverted in the buffer, meaning that the LSB is at the left and the MSB at the right. Thus, the '\n' character is 0x000A, but in the buffer it is stored as 0x0A00, and you are using char*.

I don't exactly understand what you are saying here. Could you elaborate a little, please? If I use is.read() to extract data from the file into the char pointer called buffer, do the bytes of the extracted data become inverted? In any case, even if the newline character is 2 bytes long, it's still only 1 character, right? So how come tellg(), which returns the position of the current character in the input stream, returns 2 instead of 1?

May 3, 2014 at 1:35am

closed account (2AoiNwbp)

I don't exactly understand what you are saying here could you elaborate a little please?

Yes, sure. If we take a single character like 'H', its ASCII code is 72, and its Unicode value is also 72, but it occupies two bytes instead of one (in a 16-bit encoding such as UTF-16).
So you can see it as:

0(MSB) 72(LSB),

but in memory they are inverted, so you are going to see them as

72(LSB) 0(MSB).

I put together this little program so you can see how Unicode characters are stored in memory.

#include <cstdlib>   // for system()
#include <iostream>

using namespace std;

int main()
{
	char cLetter = 'H';
	wchar_t wcLetter = L'H';
	char* pbuf = &cLetter;

	cout << "Sizeof(" << cLetter << ") = " << sizeof(cLetter) 
		 << " byte" << (sizeof(cLetter)>1 ? "s" : "") << endl;
	cout << cLetter << " = " << (int)*pbuf << endl;
	
	cout << "Sizeof(" << wcLetter << ") = " << sizeof(wcLetter) 
		 << " byte" << (sizeof(wcLetter)>1 ? "s" : "") << endl;
	pbuf = (char*)&wcLetter;
	cout << "*(pbuf + 0) = " << (int)*pbuf << "\tLSB" << endl;
	cout << "*(pbuf + 1) = " << (int)*(pbuf + 1) << "\t\tMSB" << endl;

	system("pause");
	return 0;
}


May 3, 2014 at 2:19am

closed account (2AoiNwbp)

On the other hand, if your example.txt file was not saved with Unicode characters, I don't know what else could be happening... Maybe Notepad adds extra characters? I don't know.

regards,
Alejandro
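One way to settle what is actually in the file is to look at its raw bytes. The sketch below is only an illustration: hex_dump is a hypothetical helper name, not something from the posts above. It opens the file in binary mode so that no newline translation is applied and returns the bytes as hex pairs.

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not from the thread): returns the raw bytes of a file
// as space-separated uppercase hex pairs, e.g. "0D 0A".
std::string hex_dump(const std::string& path) {
    // Binary mode: the bytes are delivered exactly as stored on disk.
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());

    std::string out;
    char pair[4];
    for (unsigned char b : bytes) {
        std::snprintf(pair, sizeof pair, "%02X", static_cast<unsigned>(b));
        if (!out.empty()) out += ' ';
        out += pair;
    }
    return out;
}
```

For a file that holds a single Windows newline in a single-byte encoding, this should give 0D 0A. If Notepad had saved it with its "Unicode" option (UTF-16 little-endian with a byte order mark), the dump would be FF FE 0D 00 0A 00, which is 6 bytes rather than 2.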

May 3, 2014 at 6:06am

TwilightSpectre (1392)

It is due to line endings. On most Unix-like systems, the line ending is LF (line feed, '\n'). However, for some reason or another, Windows uses CR+LF (carriage return + line feed, "\r\n") as its default line ending. Usually the only time you will notice is when you open your file in binary mode (which turns off the automatic translation of line endings).

Last edited on May 3, 2014 at 6:06am
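To check which convention a given file uses, the line endings can be counted directly. This is a minimal sketch under the assumption that the file uses a single-byte encoding; LineEndingCounts and count_line_endings are made-up names for illustration only.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical result type (not from the thread).
struct LineEndingCounts {
    std::size_t crlf = 0;  // "\r\n" pairs (Windows style)
    std::size_t lf = 0;    // bare "\n" (Unix style)
};

// Counts line endings in the raw bytes of a file.
// Binary mode is required; in text mode on Windows every "\r\n"
// would already have been turned into "\n" before we see it.
LineEndingCounts count_line_endings(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());

    LineEndingCounts counts;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] != '\n') continue;
        if (i > 0 && bytes[i - 1] == '\r')
            ++counts.crlf;
        else
            ++counts.lf;
    }
    return counts;
}
```

A file created in Notepad with only a newline in it would be expected to report one CR+LF pair and no bare LF, which accounts for the length of 2.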

May 3, 2014 at 11:55am

Void life (71)

Thanks Alejandro. I kind of see what you were saying now.

NT3
Thanks,
Ahh, so that explains why tellg() was returning 2. But any idea why read() reads only the newline character?

May 3, 2014 at 12:08pm

Peter87 (11251)

\r\n newlines will automatically be read as \n on Windows. This is useful because it makes it possible to handle text files the same way on Windows, Linux and everywhere else, despite the difference in how line endings are marked. If you don't want this behaviour open the file in binary mode.

std::ifstream is ("example.txt", std::ios::binary);
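Applied to the whole-file read from the first post, the binary-mode approach might look like the sketch below. read_file_binary is a hypothetical name, and returning a std::string instead of a new[] buffer is a choice made here for brevity, not something from the original program.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper (not from the thread): reads a whole file in binary
// mode, so the size computed from tellg() matches what read() delivers.
std::string read_file_binary(const std::string& path) {
    std::ifstream is(path, std::ios::binary);
    if (!is) return std::string();

    // Same seekg/tellg technique as the original program.
    is.seekg(0, std::ios::end);
    const std::streamoff length = is.tellg();
    is.seekg(0, std::ios::beg);
    if (length <= 0) return std::string();

    std::string buffer(static_cast<std::size_t>(length), '\0');
    is.read(&buffer[0], length);

    // Keep only what was actually read, in case the read came up short.
    buffer.resize(static_cast<std::size_t>(is.gcount()));
    return buffer;
}
```

With a file containing a single Windows newline, the returned string holds both bytes, '\r' followed by '\n', and no error is reported.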

May 3, 2014 at 1:27pm

Void life (71)

Thanks Peter. So if I'm understanding right, this is what happens when example.txt has only a newline:
is.tellg() returns 2 because the default line ending is "\r\n", which is 2 characters. Then is.read(buffer, length) reads the "\r\n" as 1 character, and there aren't any more characters, but you told it to read 2 characters, which is why you get

error: only 1 character could be read
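If the file has to stay in text mode, the short read does not have to be treated as an error. The sketch below assumes that gcount() is the number to trust after read(), since in text mode the size derived from tellg() is not guaranteed to equal the number of characters delivered. TextReadResult and read_file_text are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical result type (not from the thread).
struct TextReadResult {
    std::string text;             // characters actually delivered by read()
    std::streamoff reported = 0;  // size computed from tellg()
    bool io_error = false;        // true only for a real failure, not a short read
};

TextReadResult read_file_text(const std::string& path) {
    TextReadResult result;
    std::ifstream is(path);  // text mode: "\r\n" may be translated to "\n"
    if (!is) {
        result.io_error = true;
        return result;
    }

    is.seekg(0, std::ios::end);
    result.reported = is.tellg();
    is.seekg(0, std::ios::beg);
    if (result.reported <= 0) return result;

    result.text.resize(static_cast<std::size_t>(result.reported));
    is.read(&result.text[0], result.reported);

    // After translation fewer characters may arrive than requested;
    // gcount() says how many did, so trim the string to that length.
    result.text.resize(static_cast<std::size_t>(is.gcount()));

    // Hitting end-of-file early sets eofbit and failbit but not badbit.
    result.io_error = is.bad();
    return result;
}
```

On Windows, for the one-newline file, reported would be 2 while text holds the single character '\n'. On systems that do no translation the two numbers should agree.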
