Char * encoding

Mar 16, 2012 at 2:24pm
Hi,

If I write the statement below in C++ under Visual Studio, what will the encoding be here?
 
const char *c = "£";


Under the Visual Studio project settings I have set the Character Set to "Not Set".

And what is actually inside the char* buffer?


Thanks
Last edited on Mar 16, 2012 at 2:27pm
Mar 16, 2012 at 2:53pm
Interesting question. Not sure. UTF-8?? I think Visual Studio knows when a Unicode character is present in a source file and then asks the user to switch the source file encoding.

But I guess that'll just be for saving the source file. For the actual program execution, I really don't know. Will the compiler respect the UTF-8 nature of the source file and create the extra bytes for the character in order to encode it in RAM? I recommend that you do a small test and find out.
Mar 16, 2012 at 3:02pm
closed account (o1vk4iN6)
I believe all that option does is make the Windows functions and typedefs use either char or wchar_t, e.g. _T("some text"). I forget the exact name Windows uses; it's TCHAR or something like that. The same goes for functions: MessageBox(), for example, resolves to MessageBoxA() or MessageBoxW() depending on the setting, or is undefined otherwise.
Last edited on Mar 16, 2012 at 3:05pm
Mar 16, 2012 at 3:08pm
That's a different story, xerzi. The macro name is UNICODE and it is defined through the project properties, never in code in the case of Visual Studio.

The UNICODE setting will never change the meaning of a char. It will change TCHAR, but never char. Since the question uses char and not TCHAR, I assume the OP really wants to know whether the read-only string will be stored in a particular encoding. I suggested that he/she try it out. It is simple: create a new project that declares that string, then examine each byte pointed to by 'c'. Is it one byte? Is it two or three bytes? Ideally, he/she should find out beforehand the actual representation of the pound sterling symbol in at least UTF-8 and UTF-16 to have some comparison criteria.
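A minimal sketch of that test could look like this (assuming a plain console project; the expected patterns for the pound sign are a single byte 0xA3 for Windows-1252/Latin-1 and the pair 0xC2 0xA3 for UTF-8):

#include <cstdio>
#include <cstring>

int main()
{
    const char *c = "£";

    // Print each byte as an unsigned hex value, so a signed char such as
    // -93 shows up as A3 instead of a confusing negative number.
    std::printf("strlen = %u, bytes:", (unsigned)std::strlen(c));
    for (const char *p = c; *p != '\0'; ++p)
        std::printf(" %02X", (unsigned)(unsigned char)*p);
    std::printf("\n");
    return 0;
}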
Mar 16, 2012 at 3:16pm
closed account (o1vk4iN6)
Which is why that option does nothing here. If you want to use wide characters, then you need to declare them as such:

// wchar_t is built into C++ (in C you would need <wchar.h>)

const wchar_t* c = L"£";


Depending on the compiler, wchar_t can be a different size; Windows uses 16 bits. You might be better off finding a library that supports different encodings, since this differs from compiler to compiler.
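For what it's worth, a quick check makes the size difference visible (a minimal sketch; the value is implementation-defined):

#include <cstdio>

int main()
{
    // 2 on the Microsoft compiler, typically 4 with GCC on Linux.
    std::printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
    return 0;
}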
Mar 16, 2012 at 3:30pm
I understand, but I don't think that's the point. I think the point here is: What will the final representation in RAM be? Or at least that is what I think the point is. I could be wrong and you could end up being right.

I guess I'll just wait to hear from the OP. :-)
Mar 16, 2012 at 3:33pm
I am in a Visual Studio 2005 / Windows XP environment.

In fact, in my situation I have an API giving me a char* pointer, and this API does not give any information about the encoding of this char*.

While debugging, the debugger view shows the character "£" correctly, but char[0] shows -65 and the rest are 0 (I am giving it a char[51] buffer).

Second, I am passing this value to another API which takes a char*; this API also does not mention anything about its encoding.

For testing purposes I have passed the "£" string directly to the second API.

I am not sure how to deal with a situation like this; I have not been able to understand or conclude anything from this behavior.

There may be many possibilities, but in my current situation I cannot go for wchar_t*.

What should I try next?

Thanks for all the replies above!


Mar 16, 2012 at 3:36pm
Simply making it a wide char won't help, necessarily.

It comes down to 2 things:

1) How is the IDE saving the file?
2) How is the compiler interpreting the file?

Ideally, in this case, the answer to both questions would be "UTF-8" but that's not really an assumption you can make.

It's likely that both IDE and compiler options are configurable. But I'm too lazy to check how to do that right now.

The easiest way to test this is to do something like this:

const char* c = "£";  // the £ sign is U+00A3, which in UTF-8 is stored as 0xC2 0xA3

// so an easy way to check to see if it's really UTF-8
// (cast to unsigned char first: char is signed here, so comparing
// c[0] directly against 0xC2 could never succeed):
if( (unsigned char)c[0] == 0xC2 && (unsigned char)c[1] == 0xA3 )
{
  // yes, it's UTF-8
}
else
{
  // no, it's some other encoding
}


This would probably be better done with an assert or something.
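For example (a minimal sketch; the cast to unsigned char matters because char is signed with this compiler, so comparing directly against 0xC2 would never succeed):

#include <cassert>

int main()
{
    const char* c = "£";
    assert( (unsigned char)c[0] == 0xC2 &&
            (unsigned char)c[1] == 0xA3 );   // trips if the literal is not UTF-8
    return 0;
}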

If you want to ensure that it's UTF-8 all around, the safe bet is this:

 
const char* c = "\xC2\xA3";


but that's hardly intuitive...
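If you go the escaped route, hiding the bytes behind a named constant at least keeps the intent visible (the name here is just an example):

// UTF-8 bytes of U+00A3 POUND SIGN, spelled out so the result does not
// depend on how the source file happens to be saved.
const char POUND_UTF8[] = "\xC2\xA3";

const char* c = POUND_UTF8;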
Mar 16, 2012 at 3:50pm
First of all, thanks a lot for your post, Disch.

const char *c = "£";

if( (unsigned char)c[0] == 0xC2 && (unsigned char)c[1] == 0xA3 )
{
    printf("%c\n", c[0]);
    printf("%c\n", c[1]);
}
else
{
    printf("%c\n", c[0]);
    printf("%c\n", c[1]);
}


The if test fails and control goes into the else part.
c[0] is some weird character
c[1] is 0
Last edited on Mar 16, 2012 at 3:51pm
Mar 16, 2012 at 4:14pm
Well, it doesn't appear to be Windows-1252 (a superset of ISO 8859-1), because there the pound sign is 0xA3, which would show up as -93 in the debugger, not -65. What's your default non-Unicode charset in Control Panel? Maybe that's the one used by the compiler?
Mar 17, 2012 at 1:40pm
@webJose
Can you please elaborate on how I can access the "non-Unicode charset" setting in Control Panel?
Mar 17, 2012 at 3:00pm
The setting is in Region and Language, Administrative tab, but it seems that the setting has changed now. You now select a locale, and I guess that selects the charset to use. See if the actual value stored in memory is altered when you change this setting.
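If you want to see which ANSI code page the program actually ends up with after changing that setting, GetACP() reports it (a minimal Windows-only sketch; 1252 is the Western European code page where the pound sign is 0xA3):

#include <windows.h>
#include <cstdio>

int main()
{
    std::printf("Active ANSI code page: %u\n", GetACP());
    return 0;
}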
Mar 17, 2012 at 3:30pm
> If I write the statement below in C++ under visual studio, what will be encoding here.
> const char *c = "£";

The encodings of both the source character set and the execution character set are specified as 'implementation-defined'.
For the Microsoft C/C++ compiler, the source and execution character sets are both ASCII. - http://msdn.microsoft.com/en-us/library/6aw8xdf2.aspx


To specify an encoding, use the u8, u or U encoding prefix.
     const char* const narrow = "abcd" ;
     const wchar_t* const wide = L"abcd" ;

     const char* const utf8 = u8"abcd" ;
     const char16_t* const utf16 = u"abcd" ;
     const char32_t* const utf32 = U"abcd" ;
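To see what the u8 prefix actually buys you, dump the bytes of the literal; for the pound sign they should come out as C2 A3 (a minimal sketch; needs a C++11 compiler):

#include <cstdio>

int main()
{
    const char* const utf8 = u8"£" ;
    for( const char* p = utf8 ; *p != '\0' ; ++p )
        std::printf( "%02X ", (unsigned)(unsigned char)*p ) ; // expect: C2 A3
    std::printf( "\n" ) ;
    return 0 ;
}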




Mar 18, 2012 at 12:20pm
@JLBorges

Thanks for the information.

Just for clarification, I want to know whether u8, u and U are standard C/C++ or a Microsoft extension.

Mar 18, 2012 at 12:28pm
const char* const utf8 = u8"abcd" ;
const char16_t* const utf16 = u"abcd" ;
const char32_t* const utf32 = U"abcd";



In Visual Studio 2010 I am getting an error if I use u8, u or U.

Can you please explain this?
Mar 18, 2012 at 1:44pm
It's in the C++11 standard. I'm not sure whether Visual Studio 2010 supports this feature or not.
Mar 18, 2012 at 2:11pm
Thanks Peter!!

Nice to know C++ comes with this!
Topic archived. No new replies allowed.