How to interpret RTF Unicode characters

Mar 29, 2012 at 4:52pm
Hey guys.

I am currently playing around with Microsoft's RTF standard and I am trying to figure out how to interpret the data it stores when using foreign characters.

I have the character '日' (Japanese) saved in an RTF file, which produces the following raw file data...

{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset134 SimSun;}{\f1\fnil\fcharset0 Calibri;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22\'c8\'d5\f1\par
} 


Now I know that this character ('日') has the raw data value (in hex)...
 
0xE6 0x97 0xA5


I know this because I assigned it to a std::string using the following...
 
std::string myString = "日";

...and viewing the hex values of the string's data. I also know that 0xE6 0x97 0xA5 is this character's UTF-8 encoding (search for a UTF-8 character table).

Now, looking at the contents of the RTF file, I can see that it stores this character as the byte values...
 
0xC8 0xD5


...which I know because, according to the RTF documentation, escaped characters are stored as a backslash followed by an apostrophe followed by a hex value. But I cannot find this character code mapped to the character '日' anywhere.

Does anyone have any words of wisdom/advice for me? Or does anyone have any experience with RTF files that might help me figure out how to get the character code 0xE6 0x97 0xA5 from 0xC8 0xD5?

Regards!
Mar 29, 2012 at 5:10pm
The code mapping is generally saved in font files, if I am correct. I.e., if you got the font that created the Japanese characters and used it to create the 日 symbol you want, it would automatically have the character code referenced via C++.
Mar 29, 2012 at 5:16pm
The problem is that you think it is storing the sun ('日') as a Unicode (UTF) encoding... except that it is not.

Microsoft uses "code pages" to specify how a specific character is encoded.

The very first line of the RTF tells you what code page is in use. In this case, \ansicpg1252 means ANSI code page 1252, which is basically the standard US Windows character set. Additional information about it follows (like the default language, 2057, which I haven't looked up).

I haven't looked up the '日' character either, but the font it is written with is declared with \fcharset134 (the Simplified Chinese GB2312 character set), so 0xC8 0xD5 is presumably its encoding in that code page.

Hope this helps.
Mar 30, 2012 at 9:33am
OK, I see a bit more clearly now, thank you guys.

So really what I need to do is separate the RTF file into blocks of text encoded with different code pages, then decode each block using its respective code page, and this should give me the text as Unicode code points, which I can then encode as UTF-8?

Are these code pages stored as files local to the operating system? Or do the various operating systems provide APIs to access the various code pages via platform specific libraries?

Or even better, is anyone aware of some open source goodness that can convert text data in code page form to UTF8?

Perhaps I should not be using RTF files. If they are using code pages then they are clearly living in the past. Code pages are a little dated, aren't they?

Mar 30, 2012 at 10:25am
Perhaps I should not be using rtf files.
It depends what you're using it for. RTF files represent a formatted document; they're not holding flat unformatted text.
Mar 30, 2012 at 10:43am
Well, currently it is being used to store formatted text and images. Otherwise I would have used something else. The RTF file format is certainly not the easiest file format to work with!
Last edited on Mar 30, 2012 at 10:44am
Mar 30, 2012 at 12:07pm
RTF file format is certainly not the easiest file format to work with!
You can say that again. Clearly Microsoft has some model that serialises/deserialises RTF conveniently. For the rest of us, it's a pain.
Apr 5, 2012 at 2:18pm
Hey guys. I found out some more info that might be helpful to anyone finding themselves in a similar situation as I did above...

Firstly, there is a cross-platform code page to Unicode converter written in C++ (ICU) available at the following website... http://site.icu-project.org/repository

Secondly, if you wanted to write your own instead (which I would advise against due to the size of the project), you can find a full list of names for all the code pages in existence here: http://www.iana.org/assignments/character-sets

This page provides all the names and common aliases for each code page, which is useful. You can also find translation tables for the code pages on the Unicode FTP site by copying this link into a browser: ftp://ftp.unicode.org/Public/MAPPINGS

Since there are a lot of files, I would recommend using an FTP client (e.g. FileZilla) to connect to the site and download the 'MAPPINGS' folder! You should then have a ton of documents mapping code page values (left column in the text files) to Unicode code points (right column).

Hope this helps someone!
Topic archived. No new replies allowed.