Unicode characters / UTF-8 hexadecimal / printf %x

Jul 14, 2012 at 12:00pm
Hello and thank you for reading my post.

I am simply using the following code to print UTF-8 encoded Unicode characters on the console in hexadecimal format:
int n_i;
int n_posCursorInBuffer;
char bufferChars[100];
[...]
for(n_i=0 ; n_i<n_posCursorInBuffer ; n_i++)
{
    printf("%x ", bufferChars[n_i]);
}


The original (UTF-8 encoded Unicode) character being 志, that is to say "E5 BF 97" in hexadecimal format, here is what is printed on the console:
ffffffe5 ffffffbf ffffff97


Do you know what these 6 leading "f"s in front of "e5", "bf" and "97" are?

Note that, if the original (UTF-8 encoded Unicode) character is "F" for instance (46 in hexadecimal format), what is printed on the console is 46 (and not ffffff46).

Can you explain this behaviour?
Is it normal?
Can these "f"s be removed and how?

Thank you for helping.
Best regards.
--
Léa Massiot
Jul 14, 2012 at 12:56pm
%x is expecting an unsigned int and writes lower case.
http://pubs.opengroup.org/onlinepubs/007908799/xsh/fprintf.html

The char argument you're passing in is promoted to a signed int and interpreted as an unsigned int. As the high bit is set on E5, and you're on a 32-bit platform (apparently), it's being sign-extended to 32 bits as FFFFFFE5, FFFFFFBF and so on.
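
The promotion described above can be sketched in isolation. This is only an illustration: the helper name as_seen_by_percent_x is hypothetical, and signed char is used instead of plain char so that the result does not depend on whether char is signed on a given compiler.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: returns the value that %x ends up displaying
   when it is handed a signed 8-bit character. */
static unsigned int as_seen_by_percent_x(signed char c)
{
    int promoted = c;              /* char -> int promotion keeps the sign */

    assert(CHAR_BIT == 8);         /* assumption: 8-bit bytes */
    return (unsigned int)promoted; /* -27 wraps around to UINT_MAX - 26 */
}
```

The byte E5 held in a signed char is -27, so the helper returns UINT_MAX - 26, which is ffffffe5 when int is 32 bits wide. The byte 46 is positive, so it comes back unchanged as 46.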

You could try casting to unsigned int so you don't get the sign extension. I can't test it at the moment.
for(n_i=0 ; n_i<n_posCursorInBuffer ; n_i++)
    printf("%x ", (unsigned)bufferChars[n_i]);
Jul 14, 2012 at 1:42pm
If you use C++11, you can use the new string literals to make UTF-8 characters: u8"UTF-8 string"
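
Since the code in this thread is C, it may help to note that C11 offers the same u8 prefix. The following is a minimal sketch, assuming a C11 compiler; the names sample and utf8_to_hex are hypothetical. It stores the character from the question in an unsigned char array and formats its bytes as hexadecimal.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* C11 u8 literal: the compiler stores U+5FD7 as the UTF-8 bytes E5 BF 97. */
static const unsigned char sample[] = u8"\u5FD7";

/* Hypothetical helper: writes the bytes of s into out as space-separated hex. */
static void utf8_to_hex(const unsigned char *s, char *out, size_t out_size)
{
    size_t len = strlen((const char *)s);
    size_t used = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* s[i] is unsigned, so no sign extension can occur here. */
        int written = snprintf(out + used, out_size - used, "%s%02x",
                               i > 0 ? " " : "", (unsigned int)s[i]);
        if (written < 0 || (size_t)written >= out_size - used)
            break; /* out is full: stop (the last entry may be cut short) */
        used += (size_t)written;
    }
}
```

Calling utf8_to_hex(sample, buf, sizeof buf) with a 32-byte buf leaves "e5 bf 97" in buf. Keeping the bytes in an unsigned char array avoids the leading "f"s from the start.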
Jul 14, 2012 at 4:04pm
Hello and thank you for your answers.

@kbw
Thank you for the link.
You pointed me in the right direction.
Yet, note that the casting you proposed doesn't change the result and that my OS is a 64 bit OS.
Here is what I did instead:
int n_i;
int n_posCursorInBuffer;
char bufferChars[100];
int n;
[...]
for(n_i=0 ; n_i<n_posCursorInBuffer ; n_i++)
{
    n = bufferChars[n_i]&0xff;
    printf("%x ", n);
}


This actually solves the problem.
It's important to do this because, otherwise, the program may access some uninitialized memory...
(Cf. post #4 in http://cboard.cprogramming.com/c-programming/75761-unwanted-sign-extension.html )
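
For readers comparing the approaches in this thread, here is a hedged sketch of why the (unsigned) cast leaves the output unchanged while masking works. All helper names are hypothetical. Converting a negative char straight to unsigned int wraps the negative value around, so the high bits stay set; converting to unsigned char first, or masking with 0xff, keeps only the low 8 bits.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Keeps only the low byte: the conversion to unsigned char is modulo 256. */
static unsigned int byte_via_cast(char c) { return (unsigned char)c; }

/* Same result by masking, as in the post above. */
static unsigned int byte_via_mask(char c) { return (unsigned int)(c & 0xff); }

/* Does NOT help: a negative value wraps to a large unsigned int.
   signed char is used so the demonstration does not depend on the compiler. */
static unsigned int wrong_via_unsigned(signed char c) { return (unsigned int)c; }

/* Hypothetical helper: formats one byte as two lowercase hex digits. */
static void format_byte(char c, char *out, size_t out_size)
{
    assert(out != NULL && out_size >= 3);
    snprintf(out, out_size, "%02x", byte_via_cast(c));
}
```

For the byte E5, byte_via_cast and byte_via_mask both give e5, whereas wrong_via_unsigned gives UINT_MAX - 26 (ffffffe5 with a 32-bit int). Standard C99 printf also accepts the hh length modifier, so printf("%hhx ", bufferChars[n_i]) is another option.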

Thank you and best regards.
--
Léa