Un-friggin-believable

I just fixed a bug where pointers were mysteriously changing values (e.g. 0x0000020E0B4E3160 turned into 0x000002000B4E3160) by replacing this:
std::list<std::unique_ptr<SomeClass>> a;
std::list<std::unique_ptr<SomeClass>> b;

//...

a.splice(a.end(), b);

//...

b.emplace_back(std::move(something));

//...

auto first = a.begin();
auto last = a.end();
for (auto i = first; i != last;){
    if (!(*i)->foo()){
        auto copy = i++;
        a.erase(copy);
    }else
        ++i;
}

with this:
std::vector<std::unique_ptr<SomeClass>> a;
std::vector<std::unique_ptr<SomeClass>> b;

//...

for (auto &c : b)
    a.emplace_back(std::move(c));
b.clear();

//...

b.emplace_back(std::move(something));

//...

for (size_t i = 0; i < a.size();){
    if (!a[i]->foo())
        a.erase(a.begin() + i);
    else
        i++;
}

I know it doesn't make any sense, but the former was crashing after a few minutes and the latter doesn't crash. What do you want me to say?

I'm pretty pissed off about it.
Does it mess up on multiple compilers?
Or is the top code wrong? (I don't see the issue.)
In theory there's no issue; the code should be correct. It would appear to be a bug in the std::list implementation, hence my incredulity. I find it very hard to believe that std::list would just arbitrarily zero some bits of the data, but that appears to be what's happening.

I've only tried it on MSVC 2015.
Try it on Clang 9 and MSVC 2019. (And whatever the latest GCC is, too, I guess.)

If it only fails on MSVC it might be worth a bug report.
Are you sure you don't have UB somewhere else?
It's not impossible, but the thing is, it crashes fairly predictably. Not instantly, mind you, but every once in a while the system runs some process or series of processes and it crashes (process creation triggers allocations of SomeClass). It always crashes for the same reason: part of the pointer being zeroed and then an invalid pointer being dereferenced. If it was simply a buffer overflow in my code I'd expect the program to crash in a different place every time, not always in the same instruction. That means that the cause of the crash must have direct or indirect access to the region being modified. There's only like three functions that touch that std::list, not counting the implementation, and those are correct.
I myself find it pretty bizarre, but I've been looking into this problem for days and I see no other explanation.

It's a shame vmware no longer has that reverse debugging feature. That would really help in this situation.
Have you run a RAM test on your machine?

Can you run the same code on another machine and have it crash in the same way?

Tweaking a nibble within a long word seems an awfully specific sequence of events. Something has to do rather more than assignment to make that kind of detailed corruption.

Are you able to predict (with some percentage of success) which pointer will be clobbered before it actually happens?
If so (and assuming it's the result of some machine instruction), then perhaps a data breakpoint.
https://devblogs.microsoft.com/cppblog/data-breakpoints-15-8-update/

Can you run the same code on another machine and have it crash in the same way?
Yes. Plus, again, if it was simply a memory error the program should not crash in the same place every time. Virtual pages should be getting assigned to physical regions non-deterministically. Not to mention that if memory errors were happening this frequently my computer would just crash constantly.

Tweaking a nibble within a long word seems an awfully specific sequence of events. Something has to do rather more than assignment to make that kind of detailed corruption.
Not necessarily. I haven't looked into it enough to find exactly how the pointers change values, but in the example in OP the change can be accomplished by setting a whole byte.
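
To make the byte-versus-nibble point concrete, here is a small sketch that XORs the two pointer values quoted in the OP and reports which bytes differ. The helper name `differing_bytes` is made up for this illustration, and the `constexpr` loop assumes C++14 or later.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper: returns a bitmask with bit k set when byte k
// (k = 0 is the least significant byte) differs between the two values.
constexpr unsigned differing_bytes(std::uint64_t before, std::uint64_t after) {
    unsigned mask = 0;
    const std::uint64_t diff = before ^ after;
    for (unsigned k = 0; k < 8; ++k)
        if ((diff >> (8 * k)) & 0xFF)
            mask |= 1u << k;
    return mask;
}

// The two pointer values from the original post.
constexpr std::uint64_t good = 0x0000020E0B4E3160;
constexpr std::uint64_t bad  = 0x000002000B4E3160;

// Prints the XOR of the two values and the mask of differing bytes.
void report() {
    std::printf("xor   = 0x%016llX\n",
                static_cast<unsigned long long>(good ^ bad));
    std::printf("bytes = 0x%02X\n", differing_bytes(good, bad));
}
```

For these two values the XOR is 0x0000000E00000000, so only byte 4 differs: it went from 0x0E to 0x00. That is consistent with a single one-byte store of zero, although clearing just those three bits would produce the same result.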

Are you able to predict (with some percentage of success) which pointer will be clobbered before it actually happens?
Sadly, no. And yes, I know about data breakpoints. They're a godsend when your program behaves deterministically, but otherwise not much help. Reverse debugging would be so sweet, though.

Another very useful tool that I used when debugging my emulator that I'm surprised hasn't been implemented in any debugger/CPU (even though it would likely have a significant performance cost) is the ability to track what instruction pointer last modified the value of an address or register. Of course, that time I was also recording the clock-by-clock states of the IO registers, thus I was removing the non-determinism and doing poor man's reverse debugging.
* Run.
* Detect crash.
* Write down guilty address.
* Set breakpoint.
* Run again.
* Hit breakpoint.
* Write down guilty address.
* Set breakpoint.
* Run again.
Et cetera.
Reminds me of an assembly assignment I coded not long ago. We used OpenGL to pixel-plot a Mandelbrot set, and I was debugging it for days. It wouldn't draw more than a few pixels on the screen. I was using a couple of r (general-purpose) registers as counters for the nested for loop. I debugged high and low, and after a couple of days of feeling defeated I started taking chunks of code and writing them differently with the same logic. I figured I might as well change the registers I used for the counters, and bam... it suddenly worked like magic. I was about ready to murder.

I had checked those very registers in the debugger, and they always held the correct values. I still can't explain why switching them out fixed things.

Definitely feel your pain.
For giggles, try replacing this:
        auto copy = i++;
        a.erase(copy);

with this
i = a.erase(i);
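
As a rough, self-contained sketch of that suggestion: the loop below uses the iterator returned by `erase`, and a second function does the same job with `std::list::remove_if`. The thread never shows `SomeClass`, so the stand-in here (its `alive` member and what `foo()` means) is purely an assumption, as are the function names `prune` and `prune_with_remove_if`.

```cpp
#include <list>
#include <memory>

// Stand-in for the thread's SomeClass; the real class is not shown,
// so this member and the meaning of foo() are assumptions.
struct SomeClass {
    bool alive;
    explicit SomeClass(bool a) : alive(a) {}
    bool foo() const { return alive; }
};

using List = std::list<std::unique_ptr<SomeClass>>;

// Removes every element whose foo() returns false.
// list::erase returns the iterator following the erased element,
// so no separate copy of the iterator is needed.
void prune(List &a) {
    for (auto i = a.begin(); i != a.end();) {
        if (!(*i)->foo())
            i = a.erase(i);
        else
            ++i;
    }
}

// Same effect using the list's own member algorithm.
void prune_with_remove_if(List &a) {
    a.remove_if([](const std::unique_ptr<SomeClass> &p) { return !p->foo(); });
}
```

Note that for std::list both the post-increment form and this form are valid, since erasing only invalidates iterators to the erased element. If swapping one for the other changes the behavior, that would point at corruption coming from somewhere else rather than at the loop itself.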

The most likely problem is heap corruption somewhere else in your code. I understand that the symptom always shows up here, but heap management works in mysterious ways.

Can you use a heap checker?

Is this single-threaded?
Can you use a heap checker?
Are there any new memory debuggers for Windows that are worth a damn? The last time I checked the only contender was Intel's.
Anyway, it's a service, so who knows if that would work.

Is this single-threaded?
No.
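
Since the program is multi-threaded, one thing worth ruling out is unsynchronized access: std::list offers no protection when one thread modifies it while another thread reads or modifies it, and such a data race is undefined behavior. The thread doesn't say how the lists are shared between threads, so the following is only a sketch under the assumption that more than one thread touches the same list. `GuardedList`, `add`, `prune` and the stand-in `SomeClass` are all invented for the illustration.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <mutex>
#include <utility>

// Stand-in for the thread's SomeClass (assumption, the real one is not shown).
struct SomeClass {
    bool alive;
    explicit SomeClass(bool a) : alive(a) {}
    bool foo() const { return alive; }
};

// Hypothetical wrapper: every access to the list takes the same mutex,
// so insertions and removals from different threads cannot overlap.
class GuardedList {
public:
    // Appends an element while holding the lock.
    void add(std::unique_ptr<SomeClass> p) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.emplace_back(std::move(p));
    }

    // Removes every element whose foo() returns false, under the lock.
    void prune() {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.remove_if(
            [](const std::unique_ptr<SomeClass> &p) { return !p->foo(); });
    }

    // Returns the current number of elements, under the lock.
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::list<std::unique_ptr<SomeClass>> items_;
};
```

If the real code already locks around every access, this rules nothing in or out. If it doesn't, a race on the list could explain corruption that shows up at the same instruction every time.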