Text files decoding in text editors

May 11, 2012 at 9:44am
Hello guys,

I have a question about the way text editors read text files.

How does a text editor detect that a text file is binary, and not readable as a text file?

And how does it know whether a text file is written in Unicode (2-byte encoding) or any other encoding?

Such files normally have no headers at all!! If a file is opened with fstream, what the fstream object sees is just a series of bytes... how can we know if this file is readable as text?
May 11, 2012 at 10:32am
how can we know if this file is readable as text?
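There are several ways.

You can check with isprint() whether the characters are printable ASCII (line breaks and tabs need a separate check, e.g. isspace(), because isprint() rejects them). If it is not ASCII, then several encodings have certain characteristics that you can recognize.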
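
A minimal sketch of that ASCII check, assuming the bytes are already in memory. The function name looksLikeAsciiText is made up for illustration, and treating every whitespace character as acceptable is my assumption, not a fixed rule:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true if every byte is printable ASCII or ASCII
// whitespace (space, tab, newline, carriage return, ...).
bool looksLikeAsciiText(const std::string& data) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Cast first: passing a negative char to isprint() is undefined.
        unsigned char c = static_cast<unsigned char>(data[i]);
        if (c > 127) return false;  // outside the ASCII range
        if (!std::isprint(c) && !std::isspace(c)) return false;  // control byte such as 0x00
    }
    return true;
}
```

A buffer that fails this test is not necessarily binary. It may just be text in another encoding.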

Unicode isn't necessarily 2 bytes per character. Microsoft keeps calling its 2-byte encoding "Unicode", but it is actually UCS-2 (UTF-16 in current Windows versions):

http://en.wikipedia.org/wiki/Unicode

For instance, in that encoding the majority of plain alphanumeric characters have a leading or trailing 0 byte, depending on the byte order.
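
That zero-byte pattern can be turned into a rough guess, sketched below. The names Utf16Guess and guessUtf16 and the thresholds (more than half of the pairs with a 0 on one side, fewer than a tenth on the other) are assumptions picked for illustration. It only works for text that is mostly plain Latin letters and digits:

```cpp
#include <cstddef>
#include <string>

enum class Utf16Guess { None, LittleEndian, BigEndian };

// Hypothetical heuristic: count zero bytes at even and odd offsets.
// "A" is 0x41 0x00 in little-endian order (trailing 0) and
// 0x00 0x41 in big-endian order (leading 0).
Utf16Guess guessUtf16(const std::string& data) {
    const std::size_t pairs = data.size() / 2;
    if (pairs == 0) return Utf16Guess::None;

    std::size_t zeroEven = 0;  // zero in the first byte of a pair
    std::size_t zeroOdd = 0;   // zero in the second byte of a pair
    for (std::size_t i = 0; i + 1 < data.size(); i += 2) {
        if (data[i] == '\0') ++zeroEven;
        if (data[i + 1] == '\0') ++zeroOdd;
    }

    // Arbitrary thresholds: mostly zeros on one side, almost none on the other.
    if (zeroOdd * 2 > pairs && zeroEven * 10 < pairs) return Utf16Guess::LittleEndian;
    if (zeroEven * 2 > pairs && zeroOdd * 10 < pairs) return Utf16Guess::BigEndian;
    return Utf16Guess::None;
}
```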

May 11, 2012 at 10:57am
Thank you for your reply.

So this check has to be done for EVERY character? Sounds pretty expensive!!!
May 11, 2012 at 11:30am
But you need to read each and every char anyway, so why not do the check while you're at it? You can also be satisfied with checking only the first 1000 or so characters. It's only done once, and reading from the hard drive is likely to be much slower than the check itself.
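
Reading only such a sample could look roughly like this. readSample is a hypothetical helper name, and the default of 1000 bytes is simply the figure mentioned above, not a recommended value:

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: returns at most maxBytes from the start of the file,
// or an empty string if the file cannot be opened.
std::string readSample(const std::string& path, std::size_t maxBytes = 1000) {
    // Binary mode so the stream hands over the raw bytes untranslated.
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return std::string();

    std::string sample(maxBytes, '\0');
    in.read(&sample[0], static_cast<std::streamsize>(maxBytes));
    // gcount() is the number of bytes actually read (less for short files).
    sample.resize(static_cast<std::size_t>(in.gcount()));
    return sample;
}
```

The returned sample can then be run through whatever per-byte check you like before deciding how to open the whole file.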
Last edited on May 11, 2012 at 11:31am
May 11, 2012 at 1:35pm
OK. Thanks a lot for the info :)

Have a nice weekend!