As the title says, can a delay occur when there is a data race on a non-atomic variable (a normal variable)?
My code seems to show such a delay; could it be caused by the data race?
As an example:
bool is_check = false;
// std::atomic<bool> is_atomic_checked = false;
tbb::parallel_for(tbb::blocked_range<std::size_t>(0, N), [&](const tbb::blocked_range<std::size_t>& r) {
    for (std::size_t i = r.begin(); i != r.end(); ++i) {
        bool will_be_checked = false;
        // some treatment; if the check passes, will_be_checked is set to true
        if (will_be_checked)
            is_check = true;
    }
});
A data race implies that there is no action taken to STOP the data race. In that case, I don't see why there would be a delay. The threads aren't waiting for each other, they're accessing as soon as they can.
A delay is incurred if you actually STOP the data race by forcing threads to wait for each other before executing certain parts of code.
I apologize for not giving a more detailed situation or code.
This is a general question, and I would appreciate it if you could list some candidates for the delay.
In simple parallelization (it is fine to limit this to data parallelism),
is there any possibility that a delay occurs, or more precisely that the parallel code runs slower than serial processing, even when no atomic variables or mutex statements (locks) are used?
Delays in this sense are caused when a thread needs to wait, either because a required resource is not yet available, or because the operating system wants to use the core it's running on for a different thread. For parallel code to run slower than serial code, the parallel code would have to be very poorly designed, or it would have to make very poor use of the hardware.
For example, imagine that the serial code takes 1 unit of time, and the parallel code takes 1/n + 0.1 units of time per thread, where n is the thread count. The 0.1 is an additional cost incurred from scheduling the thread. Suppose you have 4 processing cores. If you split the problem into 4 threads, the total time will be
((1/4 + 0.1)*4)/4 = 0.35.
A 65% improvement. Now, if you split it into 32 threads, the run time will be
((1/32 + 0.1)*32)/4 = 1.05
You didn't have any more cores to distribute the extra threads into, and since each thread required more work, the total time ended up being longer than without any parallelism.
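To make that arithmetic concrete, here is a minimal sketch that just evaluates the same model; the 0.1 per-thread scheduling cost and the 4 cores are the same assumed numbers as above, and the formula only makes sense once the thread count is at least the core count:

#include <iostream>

int main() {
    const double serial_time = 1.0;          // whole problem, in arbitrary units
    const double per_thread_overhead = 0.1;  // assumed scheduling cost per thread
    const int cores = 4;

    for (int n : {4, 8, 16, 32}) {
        // each thread does serial_time / n of the work plus the fixed overhead,
        // and the n threads share the available cores
        double parallel_time = (serial_time / n + per_thread_overhead) * n / cores;
        std::cout << n << " threads: " << parallel_time << " units\n";
    }
}

Running it reproduces the 0.35 and 1.05 figures for 4 and 32 threads.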
Another example could be if you have multiple threads reading from multiple locations of a hard drive simultaneously. Since an HDD physically moves a mechanical arm, this kind of access pattern is very inefficient. When using mechanical drives in parallel code, it's quite likely to be more efficient to use a mutex around the disk accesses, as sketched below, than to allow the threads to use the disk without restriction.
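A minimal sketch of that approach, assuming a hypothetical input file data.bin and fixed 1 MiB chunks, where a single std::mutex keeps the drive head from ping-ponging between threads:

#include <cstddef>
#include <fstream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex disk_mutex;  // serializes all access to the mechanical drive

void read_chunk(const char* path, std::streamoff offset, std::vector<char>& out) {
    // only one thread touches the disk at a time, so the head is not
    // forced to seek back and forth between distant locations
    std::lock_guard<std::mutex> lock(disk_mutex);
    std::ifstream file(path, std::ios::binary);
    file.seekg(offset);
    file.read(out.data(), static_cast<std::streamsize>(out.size()));
}

int main() {
    const char* path = "data.bin";  // hypothetical file, for illustration only
    const std::size_t chunk_size = 1 << 20;
    std::vector<std::vector<char>> chunks(4, std::vector<char>(chunk_size));

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks.size(); ++i)
        workers.emplace_back(read_chunk, path,
                             static_cast<std::streamoff>(i * chunk_size),
                             std::ref(chunks[i]));
    for (auto& t : workers)
        t.join();
}

Whether the mutex actually wins depends on the drive and the access pattern; on an SSD the unrestricted version may well be faster.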
The short version: it isn't the atomic variable itself.
An atomic variable as implemented on most modern CPUs invokes a version of an instruction which locks a memory location (wherever that variable is stored), adding perhaps 20 clock ticks to the operation (give or take a wide range depending on the model). Some older CPUs might add as many as 50 clock ticks.
If the instruction is merely an assignment to true or false, that would take one clock tick plus whatever memory delay is involved (too varied to account for here) as a non-atomic operation. The optimizer is free, however, to perform that within a register, temporarily ignoring that the variable is located somewhere in RAM.
Adding some 20 to 40 clock ticks to such an operation (when it is made atomic) is not actually much time, especially if that isn't the interior of a loop and there is more being calculated in the mix, as it appears in your code example. However, the optimizer is prohibited from performing this within a register, as it must lock the memory location where it resides and assume that memory is updated when the lock is released so all other threads see that as the one and only value.
Technically, one could say that an atomic bool assignment is some 20 to 50 times slower than a non-atomic assignment, but that is to say the range is 1 to 50 clock ticks on a machine performing several billion ticks per second. In order for that to be the performance issue, you'd have to set this bool so frequently that the difference of 40 or so clock ticks is measurable (as in hundreds of millions of assignments on this bool).
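If you want a rough feel for that cost, a sketch along these lines can compare plain and atomic stores; the iteration count is arbitrary, and the compiler may well collapse the plain loop into a single store, which is exactly the register optimization described above:

#include <atomic>
#include <chrono>
#include <iostream>

int main() {
    constexpr long iterations = 100000000;  // enough stores for any gap to show

    bool plain_check = false;
    std::atomic<bool> atomic_check{false};

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i)
        plain_check = true;         // free to stay in a register
    auto t1 = std::chrono::steady_clock::now();

    for (long i = 0; i < iterations; ++i)
        atomic_check.store(true);   // a real store to memory every time
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "plain:  " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "atomic: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
    // print the flags so the loops are not removed as dead code entirely
    std::cout << plain_check << " " << atomic_check << "\n";
}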
We have no visibility into the code you've left as a comment, so we can only guess.
Everything in an answer depends on what that code does. Asking for a generalization is, therefore, misleading.
That said, one thing occurs to me, but this is theory, I haven't checked it.
There is an implication when an atomic bool is declared: it is also treated as "volatile" in the sense that it can't be read from an optimized cache or register; it must be read from RAM directly at every access, and written back similarly.
Your global bool isn't even declared volatile, which is to say the parallel code may not be treating it as such, and there could be some issue at the join of the multiple threads implied by the end of the loop, where the is_check bool is consumed.
To expand, you probably realize the "will_be_checked" bool is private to each presumed thread spun off to run the loop in parallel. This makes it quick because the optimizer is free to use a register inside the loop.
When "is_check" is assigned, however, it must exist as a global. However, multiple threads must access it, and if it is atomic they'll be coordinated in time where collisions occur.
How often this is likely to happen is opaque to us because we can't see what is happening, but if I assumed there is nothing happening but some random change to "will_be_checked", this bool could be assigned many, many millions of times.
Without declaring atomic, and without declaring it volatile, the optimizer may be free to consider the "is_check" a register within the loop (making it fast), for summary output to the "real" memory location later.
That is what volatile should prevent. It is a "third" style of declaration not in your example. It is no magic bullet, and does not imply atomic synchronization. It merely accompanies most variables protected by mutexes or critical sections so the compiler can't optimize away their accesses, ensuring the memory is "publicly modified" once the locks are released.
atomic goes even further in preventing unsynchronized access, but I note that strict atomic coordination isn't actually required for this particular case.
That is to say, "is_check" is set to false at the initial start of the code, and if anything sets it to true in the many parallel loops, then as long as that is "publicly visible" to all threads after the join, atomic coordination doesn't matter here.
There is nothing that will return it to false within the loop.
That's why it doesn't really require atomic synchronization.
It merely requires volatile declaration, so the optimizer "knows" it can only be used after the join from all of the threads at conclusion as read from RAM (not an optimized register privately held inside a loop in a thread, for example).
If this were an incremented value, then of course, none of that applies. The increment would have to be synchronized as atomic (or protected under a mutex) to work.
This isn't an accumulated value, though. It is merely a switch to be triggered, and any one trigger is all that is being "sensed" here.
Put another way, there can't be a negative side effect if two or more threads set "is_check" to true outside of synchronization (that is, they overlap in time), because there is no action to set it to "false" anywhere that could possibly conflict. All that matters is that the value is "publicly visible" at all times, which is to say written to and read from RAM, which is in part what "volatile" is about.
If you witness the same performance change by declaring is_check as a volatile bool as you do when set as an atomic bool, then perhaps you are writing to is_check so often that it's a performance drain.
You may even be able to measure the difference between an optimized bool inside a thread, a volatile value accessed by the thread and the overhead of atomic assignment.
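If you want to try that measurement in the shape of the original loop, a sketch under stated assumptions (N is a placeholder size, the "treatment" is replaced by a trivial stand-in, and you would swap in one flag declaration at a time) could look like this:

#include <atomic>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

int main() {
    const std::size_t N = 100000000;  // placeholder problem size

    // pick exactly one of these per run and compare the timings:
    bool is_check = false;                  // plain bool (the data-race version)
    // volatile bool is_check = false;      // every access goes to memory
    // std::atomic<bool> is_check{false};   // an atomic store on every assignment

    auto start = std::chrono::steady_clock::now();
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, N),
                      [&](const tbb::blocked_range<std::size_t>& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
            // stand-in for the real treatment: pretend half the iterations pass
            bool will_be_checked = (i % 2 == 0);
            if (will_be_checked)
                is_check = true;
        }
    });
    auto stop = std::chrono::steady_clock::now();

    std::cout << "is_check = " << is_check << ", took "
              << std::chrono::duration<double>(stop - start).count() << " s\n";
}

Comparing the three runs should show how much of the slowdown, if any, comes from the flag itself rather than from the rest of the loop.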