I'm new to C++ but not to programming, so I'm posting my question in the beginners forum as a best guess.
My question is about a piece of code for a Windows registry wrapper I'm writing, but it mostly concerns how VC++ behaves when dereferencing string pointers.
In the code below, I would like to know what rules determine how many bytes are read by a dereference.
For example, if lpData is an LPWSTR pointer (a pointer to WCHAR) and I use *(lpData + dwDataSize), it reads 2 bytes, but *(unsigned*)(lpData + dwDataSize) reads 4 bytes.
1) What should I know about it
2) Is it a behavior I can trust to be the same all the time
BOOL HlpRegSetValueSzW(HKEY PredefinedKey, LPCWSTR lpSubKey, LPCWSTR lpValueName, LPCWSTR lpData, DWORD dwType)
{
    HKEY hKey = NULL;
    DWORD dwDataSize = 0; // very important initialization = 0
    // Check that dwType is a supported string value type and, while we're at it, compute the byte size
    // of lpData, including the string null terminator(s), as needed for the cbData parameter of RegSetValueEx
    if ((dwType == REG_SZ) || (dwType == REG_EXPAND_SZ)) // strings with one null-char terminator
    {
        while (*(lpData + dwDataSize)) // read 2 bytes (if 0 then we've found the null char)
        {
            dwDataSize++; // at loop exit dwDataSize holds the string length
        }
        dwDataSize = (dwDataSize + 1) * sizeof(WCHAR); // add 1 null char and multiply by 2 for the byte count
    }
    else if (dwType == REG_MULTI_SZ) // strings with two null-char terminators
    {
        while (*(unsigned*)(lpData + dwDataSize)) // read 4 bytes (if 0 then we've found the 2 null chars)
        {
            dwDataSize++;
        }
        dwDataSize = (dwDataSize + 2) * sizeof(WCHAR); // add 2 null chars and multiply by 2 for the byte count
    }
    else
    {
        return FALSE;
    }
...
From a clarity standpoint... foo[index] is much more clear than *(foo + index). Every time I see the latter, I die a little inside.
That said...
1) What should I know about it
2) Is it a behavior I can trust to be the same all the time
You are assuming sizeof(WCHAR)*2 == sizeof(unsigned), which is probably true for whatever compiler you're using, but is not guaranteed. So this code will probably work fine for now, but might break if compiled under a different configuration.
As for whether or not you can "trust" it... I guess it depends. Realistically it'll likely always work, but that isn't guaranteed by the language... so it's conceivable that it might not work, even if that's extremely unlikely. Personally, I would say "no, don't trust it", but it's a judgement call.
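One way to stop relying on luck is to turn the size assumption into a compile-time check. This is only a sketch: `Wide` is a hypothetical alias standing in for WCHAR (char16_t, so it compiles without <windows.h>), std::uint32_t stands in for unsigned, and `bytesReadByDeref` and `demoDerefSizes` are made-up helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for WCHAR so the sketch builds without <windows.h>.
using Wide = char16_t;

// The 4-byte read only makes sense if one 32-bit word covers exactly two
// characters. If that stops being true, the build fails here instead of the
// loop silently misbehaving.
static_assert(sizeof(std::uint32_t) == 2 * sizeof(Wide),
              "one 32-bit read must cover exactly two wide characters");

// A dereference reads as many bytes as the pointed-to type occupies.
template <typename T>
constexpr std::size_t bytesReadByDeref(const T*) { return sizeof(T); }

// Runtime illustration of the same facts.
inline void demoDerefSizes() {
    const Wide text[] = u"ab";
    const std::uint32_t word = 0;
    assert(bytesReadByDeref(text) == sizeof(Wide));            // *(text + i)
    assert(bytesReadByDeref(&word) == sizeof(std::uint32_t));  // *(uint32_t*)...
}
```

The number of bytes read is decided entirely by the type of the pointer being dereferenced, which is why the cast changes it.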
I would probably change this to use [array indexes] rather than pointer math, and avoid casts completely:
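A minimal sketch of such an index-based loop, assuming a hypothetical helper named `multiSzByteSize` and a `Wide` alias standing in for WCHAR (char16_t here, so it compiles outside Windows):

```cpp
#include <cassert>
#include <cstddef>

using Wide = char16_t;  // hypothetical stand-in for WCHAR

// Byte size of a double-null-terminated (REG_MULTI_SZ style) buffer,
// including both terminators. Assumes the buffer really ends in two nulls.
std::size_t multiSzByteSize(const Wide* data) {
    assert(data != nullptr);
    std::size_t n = 0;
    // Keep going until two consecutive characters are both zero.
    while (data[n] || data[n + 1]) {
        ++n;
    }
    return (n + 2) * sizeof(Wide);
}
```

No cast is involved, so nothing here depends on how wide unsigned happens to be.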
You're probably right about readability, but I found no other way to get what I want in the generated assembly. My function doesn't need optimizing, of course, but for later coding I want to have an idea of what ends up in the assembly.
Does that matter? If you prefer control over the assembly more than readable code, then why are you using C++? Why not use assembly? ;P
IMO readable code is the most important. You should not sacrifice readability for performance unless it's a significant difference. I highly doubt the extra cmp will make any impact on the resulting program.
But like I say... it's a preference. That's my preference. Maybe yours is different. And that's fine. There's a tradeoff here and it's your decision as to which route you feel best suits your needs.
So to recap the situation as I see it:
option A: while (*(unsigned*)(lpData + dwDataSize))
Advantages to option A:
- might run ever-so-slightly faster.
option B: while ( lpData[dwDataSize] || lpData[dwDataSize+1] )
Advantages to option B:
- Easier to read/understand. More clearly represents the logic of what you're actually trying to do (that is: look for 2 consecutive nulls)
- Does not make assumptions about the size of WCHAR or the size of unsigned and therefore is ever-so-slightly more portable.
Does that matter? If you prefer control over the assembly more than readable code, then why are you using C++? Why not use assembly? ;P
Once you're used to it, Microsoft's 32-bit macro assembler was, for me, a great dev tool, but the 64-bit version of MASM (and the 64-bit architecture as a whole) has made low-level assembly development less easy than it was. C++ is a powerful language and I'm making the switch. Maybe I should go easy with C++, but I can't erase many years with a snap of the fingers.
I made a little change for readability and not assuming anything on unsigned : while (*(DWORD*)(lpData + dwDataSize))
Recasting is not that bad, but one should know what he's doing with it. In the end, there's a limit to writing idiot-proof code :-)
When I posted that question, I thought a reply might appear saying DON'T EVER DO THAT!!! BECAUSE...
I'm learning C++ slowly in my spare time, with a lot more learning to come.
I made a little change for readability and not assuming anything on unsigned :
while (*(DWORD*)(lpData + dwDataSize))
You're still making an assumption about WCHAR (that it is 2 bytes).
Again, that is a reasonably "safe" assumption on Windows, though it certainly is not true on other systems (for example, wchar_t is typically 4 bytes wide on *nix).
If lpData is not aligned on an alignof(DWORD) boundary, the result of the cast is unspecified; we would hopefully get a pointer to a mis-aligned object which would work, albeit with a performance penalty.
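If the 4-byte comparison is still wanted, one possible way around the alignment question is to copy the bytes out with std::memcpy instead of casting the pointer. This is a sketch under assumptions: `Wide` is a hypothetical stand-in for WCHAR (char16_t), std::uint32_t stands in for DWORD, and `readPair` and `multiSzByteSizeMemcpy` are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Wide = char16_t;  // hypothetical stand-in for WCHAR

// Reads two consecutive characters as one 32-bit value without casting the
// pointer, so the alignment of `p` does not matter.
std::uint32_t readPair(const Wide* p) {
    assert(p != nullptr);
    static_assert(sizeof(std::uint32_t) == 2 * sizeof(Wide),
                  "a pair of characters must fill exactly one 32-bit word");
    std::uint32_t pair = 0;
    std::memcpy(&pair, p, sizeof pair);
    return pair;
}

// Same result as the cast-based loop: byte size including both terminators.
std::size_t multiSzByteSizeMemcpy(const Wide* data) {
    std::size_t n = 0;
    while (readPair(data + n) != 0) {
        ++n;
    }
    return (n + 2) * sizeof(Wide);
}
```

Compilers commonly turn a small fixed-size memcpy like this into a single load, but that is an optimization and not something the language promises.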
In the spirit of C (and assembly):
    // compute the byte size including the string null terminator(s) of lpData
    auto p = lpData ;
    switch(dwType)
    {
        case REG_SZ:
        case REG_EXPAND_SZ:
            while( *p++ ) ;
            break ;
        case REG_MULTI_SZ:
            while( *p++ || *p++ ) ;
            break ;
        default: return false ;
    }
    const std::ptrdiff_t dwDataSize = ( p - lpData ) * sizeof(*lpData) ;
a. The value computation of the built-in postincrement and postdecrement operators is sequenced before its side-effect.
b. Every value computation and side effect of the first (left) argument of the built-in logical AND operator && and the built-in logical OR operator || is sequenced before every value computation and side effect of the second (right) argument.
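A small sketch of what rules (a) and (b) mean for `*p++ || *p++`: it evaluates the expression once and reports how far the pointer moved. `Wide` (char16_t) is a hypothetical stand-in for WCHAR and `stepsTaken` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>

using Wide = char16_t;  // hypothetical stand-in for WCHAR

// Evaluates `*p++ || *p++` exactly once and returns how many characters
// the pointer advanced.
std::ptrdiff_t stepsTaken(const Wide* start) {
    assert(start != nullptr);
    const Wide* p = start;
    // Rule (b): the left operand, including its increment, is finished before
    // the right operand starts, and the right operand is skipped entirely when
    // the left one is non-zero.
    // Rule (a): each *p++ reads the character at the old position.
    const bool nonZeroSeen = (*p++ || *p++);
    (void)nonZeroSeen;
    return p - start;
}
```

A non-zero character advances the pointer by one. A zero advances it by two, because the second operand is then evaluated as well.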
When subtracting the first address from last+1, I expected to get the byte count, but here we get the character count. When they say C++ is strongly typed, I understand :-)
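The difference can be seen side by side in a short sketch, again assuming a hypothetical `Wide` alias (char16_t) in place of WCHAR and made-up helper names:

```cpp
#include <cassert>
#include <cstddef>

using Wide = char16_t;  // hypothetical stand-in for WCHAR

// Subtracting two pointers of the same type yields a count of elements.
std::ptrdiff_t elementDistance(const Wide* first, const Wide* last) {
    assert(first != nullptr && last != nullptr);
    return last - first;
}

// Viewing the same addresses as char pointers yields a count of bytes.
std::ptrdiff_t byteDistance(const Wide* first, const Wide* last) {
    return reinterpret_cast<const char*>(last) -
           reinterpret_cast<const char*>(first);
}
```

This is why the code above multiplies ( p - lpData ) by sizeof(*lpData) to get bytes.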