Char * encoding

Mar 16, 2012 at 2:24pm
Hi,

If I write the statement below in C++ under Visual Studio, what will the encoding be here?
 
const char *c = "£";


Under the Visual Studio project settings I have set the Character Set to "Not Set".

And what is actually inside the char* buffer?


Thanks
Last edited on Mar 16, 2012 at 2:27pm
Mar 16, 2012 at 2:53pm
Interesting question. Not sure. UTF-8?? I think Visual Studio knows when a Unicode character is present in a source file and then asks the user to switch the source file encoding.

But I guess that'll just be for saving the source file. For the actual program execution, I really don't know. Will the compiler respect the UTF-8 nature of the source file and create the extra bytes for the character in order to encode it in RAM? I recommend that you do a small test and find out.
Mar 16, 2012 at 3:02pm
closed account (o1vk4iN6)
I believe all that option does is make the Windows functions and typedefs use either char or wchar_t, e.g. _T("some text"). I forget the exact name Windows uses; it's TCHAR or something like that. The same goes for functions: MessageBox(), for example, resolves to MessageBoxA() or MessageBoxW() depending on the setting, or is undefined otherwise.
Last edited on Mar 16, 2012 at 3:05pm
Mar 16, 2012 at 3:08pm
That's a different story, xerzi. The macro name is UNICODE and it is defined through the project properties, never in code in the case of Visual Studio.

The UNICODE setting will never change the meaning of a char. It will change TCHAR, but never char. Since the question uses char and not TCHAR, I assume the OP really wants to know whether the read-only string will be stored in a particular encoding. I suggested that he/she try it out. It is simple: create a new project that declares that string, then examine each byte pointed to by 'c'. Is it one byte? Is it two or three bytes? Ideally, he/she should find out beforehand the actual representation of the pound sterling symbol in at least UTF-8 and UTF-16 to have some comparison criteria.
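A minimal sketch of that test could look like this (assuming a plain console project; the expected patterns for the pound sign are a single byte 0xA3 for Windows-1252/Latin-1 and the pair 0xC2 0xA3 for UTF-8):

#include <cstdio>
#include <cstring>

int main()
{
    const char *c = "£";

    // Print each byte as an unsigned hex value, so a signed char such as
    // -93 shows up as A3 instead of a confusing negative number.
    std::printf("strlen = %u, bytes:", (unsigned)std::strlen(c));
    for (const char *p = c; *p != '\0'; ++p)
        std::printf(" %02X", (unsigned)(unsigned char)*p);
    std::printf("\n");
    return 0;
}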
Mar 16, 2012 at 3:16pm
closed account (o1vk4iN6)
Which is why that option does nothing here. If you want to use wide characters, then you need to declare them as such:

// wchar_t is built into C++ (in C you would need <wchar.h>)

const wchar_t* c = L"£";


Depending on the compiler, wchar_t can be a different size; Windows uses 16 bits. You might be better off finding a library that supports different encodings, since this differs from compiler to compiler.
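For what it's worth, a quick check makes the size difference visible (a minimal sketch; the value is implementation-defined):

#include <cstdio>

int main()
{
    // 2 on the Microsoft compiler, typically 4 with GCC on Linux.
    std::printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
    return 0;
}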
Mar 16, 2012 at 3:30pm
I understand, but I don't think that's the point. I think the point here is: What will the final representation in RAM be? Or at least that is what I think the point is. I could be wrong and you could end up being right.

I guess I'll just wait to hear from the OP. :-)
Mar 16, 2012 at 3:33pm
I am in a Visual Studio 2005 / Windows XP environment.

In fact, in my situation I have an API giving me a char* pointer, and this API does not give any information about the encoding of this char*.

While debugging, the debugger view shows the character "£" correctly, but char[0] shows -65 and the rest are 0 (I am giving it a char[51] buffer).

Second, I am passing this value to another API which takes a char*; this API also does not mention anything about its encoding.

For testing purposes I have passed the "£" string directly to the second API.

I am not sure how to deal with a situation like this; I have not been able to understand or conclude anything from this behavior.

There may be many possibilities, but in my current situation I cannot go for wchar_t*.

What should I try next?

Thanks for all the replies above!


Mar 16, 2012 at 3:36pm
Simply making it a wide char won't help, necessarily.

It comes down to 2 things:

1) How is the IDE saving the file?
2) How is the compiler interpreting the file?

Ideally, in this case, the answer to both questions would be "UTF-8" but that's not really an assumption you can make.

It's likely that both IDE and compiler options are configurable. But I'm too lazy to check how to do that right now.

The easiest way to test this is to do something like this:

const char* c = "£";  // the £ sign is U+00A3, which in UTF-8 is stored as 0xC2 0xA3

// so an easy way to check to see if it's really UTF-8
// (cast to unsigned char first: char is signed here, so comparing
// c[0] directly against 0xC2 could never succeed):
if( (unsigned char)c[0] == 0xC2 && (unsigned char)c[1] == 0xA3 )
{
  // yes, it's UTF-8
}
else
{
  // no, it's some other encoding
}


This would probably be better done with an assert or something.
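For example (a minimal sketch; the cast to unsigned char matters because char is signed with this compiler, so comparing directly against 0xC2 would never succeed):

#include <cassert>

int main()
{
    const char* c = "£";
    assert( (unsigned char)c[0] == 0xC2 &&
            (unsigned char)c[1] == 0xA3 );   // trips if the literal is not UTF-8
    return 0;
}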

If you want to ensure that it's UTF-8 all around, the safe bet is this:

 
const char* c = "\xC2\xA3";


but that's hardly intuitive...
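If you go the escaped route, hiding the bytes behind a named constant at least keeps the intent visible (the name here is just an example):

// UTF-8 bytes of U+00A3 POUND SIGN, spelled out so the result does not
// depend on how the source file happens to be saved.
const char POUND_UTF8[] = "\xC2\xA3";

const char* c = POUND_UTF8;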
Mar 16, 2012 at 3:50pm
First of all, thanks a lot for your post, Disch.

const char *c = "£";

if( (unsigned char)c[0] == 0xC2 && (unsigned char)c[1] == 0xA3 )
{
    printf("%c\n", c[0]);
    printf("%c\n", c[1]);
}
else
{
    printf("%c\n", c[0]);
    printf("%c\n", c[1]);
}


The if test fails and control goes into the else part.
c[0] is some weird character
c[1] is 0
Last edited on Mar 16, 2012 at 3:51pm
Mar 16, 2012 at 4:14pm
Well, it doesn't appear to be Windows-1252 (a superset of ISO 8859-1), because there the pound sign is 0xA3, which would show up as -93 in the debugger, not -65. What's your default non-Unicode charset in Control Panel? Maybe that's the one used by the compiler?
Mar 17, 2012 at 1:40pm
@webJose
Can you please elaborate on how I can access the "non-Unicode charset" setting in Control Panel?
Mar 17, 2012 at 3:00pm
The setting is in Region and Language, Administrative tab, but it seems that the setting has changed now. You now select a locale, and I guess that selects the charset to use. See if the actual value stored in memory is altered when you change this setting.
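If you want to see which ANSI code page the program actually ends up with after changing that setting, GetACP() reports it (a minimal Windows-only sketch; 1252 is the Western European code page where the pound sign is 0xA3):

#include <windows.h>
#include <cstdio>

int main()
{
    std::printf("Active ANSI code page: %u\n", GetACP());
    return 0;
}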
Mar 17, 2012 at 3:30pm
> If I write the statement below in C++ under visual studio, what will be encoding here.
> const char *c = "£";

The encodings of both the source character set and the execution character set are specified as 'implementation-defined'.
For the Microsoft C/C++ compiler, the source and execution character sets are both ASCII. - http://msdn.microsoft.com/en-us/library/6aw8xdf2.aspx


To specify an encoding, use the u8, u or U encoding prefix.
     const char* const narrow = "abcd" ;
     const wchar_t* const wide = L"abcd" ;

     const char* const utf8 = u8"abcd" ;
     const char16_t* const utf16 = u"abcd" ;
     const char32_t* const utf32 = U"abcd" ;
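To see what the u8 prefix actually buys you, dump the bytes of the literal; for the pound sign they should come out as C2 A3 (a minimal sketch; needs a C++11 compiler):

#include <cstdio>

int main()
{
    const char* const utf8 = u8"£" ;
    for( const char* p = utf8 ; *p != '\0' ; ++p )
        std::printf( "%02X ", (unsigned)(unsigned char)*p ) ; // expect: C2 A3
    std::printf( "\n" ) ;
    return 0 ;
}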




Mar 18, 2012 at 12:20pm
@JLBorges

Thanks for the information.

Just for clarification, I want to know whether u8, u and U are standard C/C++ or a Microsoft extension.

Mar 18, 2012 at 12:28pm
const char* const utf8 = u8"abcd" ;
const char16_t* const utf16 = u"abcd" ;
const char32_t* const utf32 = U"abcd";



In Visual Studio 2010 I am getting an error if I use u8, u or U.

Can you please explain this?
Mar 18, 2012 at 1:44pm
It's in the C++11 standard. I'm not sure whether Visual Studio 2010 supports this feature or not.
Mar 18, 2012 at 2:11pm
Thanks Peter!!

Nice to know C++ comes with this!
Topic archived. No new replies allowed.