How to detect a Unicode file?

Nov 22, 2012 at 10:38pm
I've tried opening a Unicode file with several methods. If I knew whether the file was ANSI or Unicode, I could pick the right method to read it. But now I have a strange text file (ANSI? Unicode? Nobody knows until someone opens it), and I can't find a way to detect which type of text file it is. That's a big problem: if detection fails, I can't open any text file properly and correctly.

Does anyone know? Any help would be greatly appreciated. :)
Last edited on Nov 23, 2012 at 6:30am
Nov 23, 2012 at 12:36am
You open the file in binary mode and check the BOM (Byte Order Mark), the first few bytes of the file.

From:

Byte order mark
http://en.wikipedia.org/wiki/Byte_order_mark

Encoding     BOM (hex) BOM (dec)
----------------------------------
UTF-8        EF BB BF  239 187 191
UTF-16 (BE)  FE FF     254 255
UTF-16 (LE)  FF FE     255 254


(see Wikipedia for more)
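
A rough C++ sketch of that check, covering only the three BOMs in the table. The names Bom, detect_bom and detect_bom_in_file are made up for this example and are not from any library. It assumes a file that can't be opened should simply be reported as having no BOM.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// The three BOMs from the table, plus "no BOM found".
enum class Bom { None, Utf8, Utf16BE, Utf16LE };

// Hypothetical helper: classify the leading bytes of a buffer.
Bom detect_bom(const unsigned char* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Bom::Utf8;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Bom::Utf16BE;
    // Caveat: the UTF-32 (LE) BOM is FF FE 00 00, so it also matches here.
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Bom::Utf16LE;
    return Bom::None;
}

// Hypothetical helper: open the file in binary mode and look at its
// first bytes. An unreadable or empty file yields Bom::None.
Bom detect_bom_in_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    unsigned char head[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(head), sizeof head);
    return detect_bom(head, static_cast<std::size_t>(in.gcount()));
}
```

You would call detect_bom_in_file("some.txt") and switch on the result. Bom::None only means no BOM was found; it does not tell you the file is ANSI.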

But note that not all UTF-8 files have a BOM, and the same goes for other encodings. Some editors write a BOM when they save a file, but many do not, so a missing BOM proves nothing.

Without a BOM, you'd need to use some sort of statistical approach, like these guys:

A composite approach to language/encoding detection
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
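
A full statistical detector is a lot of work. As a much cruder first step (this is NOT the method from the Mozilla paper, just a structural check), you can test whether the bytes are well-formed UTF-8. Text in a single-byte "ANSI" code page that contains non-ASCII characters usually fails this test. The function name looks_like_utf8 is made up for this sketch, and it assumes the whole file has already been read into a std::string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if every byte sequence in `bytes`
// is well-formed UTF-8. Pure ASCII also returns true, because ASCII is
// valid UTF-8 (and valid in most ANSI code pages, so it stays ambiguous).
bool looks_like_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c < 0x80) { ++i; continue; }       // plain ASCII byte

        std::size_t extra = 0;                 // continuation bytes expected
        unsigned long min_cp = 0;              // smallest code point for this length
        unsigned long cp = 0;                  // decoded code point
        if ((c & 0xE0) == 0xC0)      { extra = 1; min_cp = 0x80;    cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; min_cp = 0x800;   cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; min_cp = 0x10000; cp = c & 0x07; }
        else return false;                     // stray continuation or invalid lead byte

        if (i + extra >= n) return false;      // sequence cut off at end of data
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(bytes[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += extra + 1;
    }
    return true;
}
```

If this returns false, the data is not UTF-8, so an ANSI code page (or UTF-16 without a BOM) becomes more likely. If it returns true, UTF-8 is a reasonable guess, but it is still a guess.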

Andy
Last edited on Nov 23, 2012 at 12:41am