I am currently playing around with Microsoft's RTF standard and trying to figure out how to interpret the data it stores when using foreign characters.
I have the character '日' (Japanese) saved in an RTF file, which produces the following raw file data...
Now I know that this character ('日') has the raw data value (in hex)...
0xE6 0x97 0xA5
I know this because I assigned it to an std::string using the following...
std::string myString = "日";
...and viewing the hex values of the string's data. I also know that 0xE6 0x97 0xA5 is this character's UTF-8 encoding; 日 is code point U+65E5, and these are its UTF-8 bytes (Google for a UTF-8 character map).
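For reference, here is roughly how I viewed those hex values (a minimal sketch; it assumes the source file itself is saved as UTF-8, otherwise the literal may not contain these bytes):

    #include <cstdio>
    #include <string>

    int main() {
        std::string myString = "日";
        // Print each byte of the string's underlying data in hex.
        for (unsigned char c : myString)
            std::printf("%02X ", c);   // prints: E6 97 A5 when the source is UTF-8
    }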
Now, looking at the contents of the RTF file, I can see that it is mapping this character to the character code...
0xC8 0xD5
...which I know because, according to the RTF documentation, Unicode characters are stored as a backslash followed by an apostrophe followed by a hex value; in other words, this character appears in the file as \'c8\'d5. But I cannot find this character code mapped to the character '日' anywhere.
Does anyone have any words of wisdom/advice for me? Or does anyone have experience with RTF files that might help me figure out how to get the character code 0xE6 0x97 0xA5 from 0xC8 0xD5?
The code mapping is generally saved in font style files, if I am correct; i.e., if you got the font that created the Japanese characters and used it to create the 日 symbol you want, it would automatically have the character code referenced via C++.
The problem is that you think it is storing the sun character (日) as a Unicode (UTF) encoding... except that it is not.
Microsoft uses "code pages" to specify how a specific character is encoded.
The very first line of the RTF tells you what code page is in use. In this case it is \ansi\ansicpg1252, which basically means "standard US Windows character set" (Windows-1252). Additional information about it follows (like default language 2057, which I haven't looked up).
I haven't looked up the '日' character either, but it is presumably in the standard code page as 0xC8 0xD5.
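For what it's worth, Windows exposes these code pages through the Win32 conversion APIs. Here is a sketch of decoding those two bytes and re-encoding them as UTF-8; note that code page 936 (Simplified Chinese/GBK, where 0xC8 0xD5 happens to map to 日) is my assumption for illustration, and the real value has to be read from the RTF's \ansicpg and \fcharset keywords:

    #include <windows.h>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        // The two bytes stored in the RTF as \'c8\'d5.
        const char cpBytes[] = "\xC8\xD5";

        // Code page 936 (GBK) is an assumption for illustration; read the
        // real value from the RTF's \ansicpg / \fcharset keywords.
        const UINT codePage = 936;

        // Code page bytes -> UTF-16.
        int wlen = MultiByteToWideChar(codePage, 0, cpBytes, 2, nullptr, 0);
        std::vector<wchar_t> wide(wlen);
        MultiByteToWideChar(codePage, 0, cpBytes, 2, wide.data(), wlen);

        // UTF-16 -> UTF-8.
        int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen,
                                       nullptr, 0, nullptr, nullptr);
        std::string utf8(ulen, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen,
                            &utf8[0], ulen, nullptr, nullptr);

        for (unsigned char c : utf8)
            std::printf("%02X ", c);   // expect: E6 97 A5
    }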
So really what I need to do is separate the RTF file into blocks of text encoded with different code pages, then decode each of these blocks using its respective code page, and this should give me the text in a form I can convert to UTF-8?
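Something like this, I guess, for the first step of collecting the escaped bytes (a rough sketch; a real parser would also have to track groups, font changes, and \uN escapes):

    #include <cstdio>
    #include <string>
    #include <vector>

    // Collect the raw bytes from \'xx escapes in a run of RTF text.
    std::vector<unsigned char> extractHexEscapes(const std::string& rtf) {
        std::vector<unsigned char> bytes;
        for (std::size_t i = 0; i + 3 < rtf.size(); ++i) {
            if (rtf[i] == '\\' && rtf[i + 1] == '\'') {
                bytes.push_back(static_cast<unsigned char>(
                    std::stoi(rtf.substr(i + 2, 2), nullptr, 16)));
                i += 3;   // skip the escape we just consumed
            }
        }
        return bytes;
    }

    int main() {
        for (unsigned char b : extractHexEscapes("\\'c8\\'d5"))
            std::printf("%02X ", b);   // prints: C8 D5
    }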
Are these code pages stored as files local to the operating system? Or do the various operating systems provide APIs to access the code pages via platform-specific libraries?
Or, even better, is anyone aware of some open-source goodness that can convert text data in code page form to UTF-8?
Perhaps I should not be using RTF files. If they are using code pages then they are clearly living in the past. Code pages are a little dated, aren't they?
"Perhaps I should not be using RTF files." It depends what you're using it for. RTF files represent a formatted document; they're not holding flat, unformatted text.
Well, currently it is being used to store formatted text and images. Otherwise I would have used something else. The RTF file format is certainly not the easiest file format to work with!
Hey guys, I found out some more info that might be helpful to anyone finding themselves in a similar situation to mine above...
Firstly, there is a cross-platform code page to Unicode translator written in C++ (the ICU project) available at the following website... http://site.icu-project.org/repository
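For example, converting the code page bytes from my original question to UTF-8 with ICU might look something like this (a sketch; I am assuming the bytes are code page 936/GBK and that "windows-936" is the converter name to use):

    #include <unicode/unistr.h>
    #include <cstdio>
    #include <string>

    int main() {
        // The two bytes stored in the RTF as \'c8\'d5.
        const char cpBytes[] = "\xC8\xD5";

        // Decode them with ICU's named code page converter
        // ("windows-936" is my assumption; GBK/cp936 are common aliases).
        icu::UnicodeString ustr(cpBytes, 2, "windows-936");

        // Re-encode ICU's internal UTF-16 representation as UTF-8.
        std::string utf8;
        ustr.toUTF8String(utf8);

        for (unsigned char c : utf8)
            std::printf("%02X ", c);   // expect: E6 97 A5
    }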
Secondly, if you wanted to write your own instead (which I would advise against due to the size of the project), you can find a full list of names for all the code pages in existence here: http://www.iana.org/assignments/character-sets
This page provides all the names and common aliases for each code page, which is useful. You can also FTP into the unicode.org site and find translation documents for every code page they list by copying this link into a browser: ftp://ftp.unicode.org/Public/MAPPINGS
Since there are a lot of files, I would recommend using an FTP client (e.g. FileZilla) to connect to the site and download the 'MAPPINGS' folder! You should then have a ton of documents containing code page mappings (left column in the text files) to Unicode code points (right column).
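If you do go the do-it-yourself route, those mapping files are simple to parse. A sketch, assuming the usual two-column layout (code page value, whitespace, Unicode code point, with '#' starting a comment):

    #include <cstdio>
    #include <fstream>
    #include <map>
    #include <sstream>
    #include <string>

    // Load one mapping file into a code-page-value -> code-point table.
    std::map<unsigned, unsigned> loadMapping(const std::string& path) {
        std::map<unsigned, unsigned> table;
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line)) {
            if (line.empty() || line[0] == '#') continue;   // skip comments
            std::istringstream fields(line);
            unsigned cpValue = 0, codePoint = 0;
            if (fields >> std::hex >> cpValue >> codePoint)
                table[cpValue] = codePoint;
        }
        return table;
    }

    int main() {
        // "CP936.TXT" is just an example file name from the MAPPINGS folder.
        auto table = loadMapping("CP936.TXT");
        std::printf("0xC8D5 -> U+%04X\n", table[0xC8D5]);   // expect U+65E5 (日)
    }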