Some people on this forum say that programmers must never use "new" and "delete".
To me that sounds like saying "never drive a car yourself".
Sure, newbies should not use them on their first day of programming. But they are an official part of C++, so forbidding "new" and "delete" sounds like forbidding people to drive cars. If I used malloc() and free(), these people would have a heart attack and say that I have no right to repair my car myself.
What about allocating memory directly from the system, with HeapAlloc() on Windows or brk() on Linux?
My computer is my own; I have the freedom to use it as I want. "Avoid new and delete" is a recommendation for newbies, not the law.
An example: I have to compare eight-byte strings.
The recommended way is to use the string class and operator==.
I walked through the STL source and found that it calls something like __builtin_memcmp(). I stepped through the binary in a debugger and found many hundreds of CPU instructions. Why should my program execute those hundreds? Eight bytes can easily be compared by the CPU with a couple of instructions.
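For what it's worth, here is a minimal sketch of the kind of comparison I mean (my own illustration, not the library's code). Going through std::memcpy into a fixed-width integer avoids aliasing problems, and mainstream compilers fold it into a single load:

#include <cstdint>
#include <cstring>

// Sketch: compare exactly eight bytes as one 64-bit value.
// The memcpy calls are optimized away, leaving two loads
// and one compare instruction on common targets.
bool equal8(const char* a, const char* b)
{
    std::uint64_t x, y;
    std::memcpy(&x, a, sizeof x);
    std::memcpy(&y, b, sizeof y);
    return x == y;
}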
Questions:
1. Is this a forum for elementary-school programmers or for researchers and forensics?
2. If you make everyone use the STL, why do you think my program in this example must execute hundreds of instructions instead of a couple?
3. If you think that programmers on this forum should never use "new" and "delete", do you think people should be arrested for using them? And do you think people should be arrested for driving and repairing cars themselves?
May I suggest that you calm down a little? I don't wish to antagonise you any further or anything, but I do think that is a huge reaction to what I said. Essentially I missed one word, IMO.
See my post in the other topic, where I suspect this all came from:
If you set up your GPS to a location but you want to use a different road, do you still follow your GPS' path?
We give you the advice, you do what you want to (and, personally, I couldn't care less if you used operator== or your own function which will probably (not) speed up your program by 32x (64x on a 64bit machine; kudos to whoever gets the reference)).
Why should my program execute those hundreds? Eight bytes can easily be compared by the CPU with a couple of instructions.
Why do you care? Is it a bottleneck? Are you sure? Isn't this a case of premature optimisation, which is the root of all evil? Won't a naive approach hurt performance even more? 14 unaligned, uncached accesses in a row might kill performance far more than "many hundreds of CPU instructions" ever will.
The usual way is to minimise memory access time (as it is a common bottleneck) by reading aligned blocks of memory the same size as a register, or even coercing the CPU to cache them.
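To illustrate the alignment point (a sketch of my own, not any particular library's code): an implementation typically tests whether a pointer is aligned before switching from a byte loop to word-sized loads:

#include <cstdint>

// Sketch: a pointer is safe for word-sized loads only when it
// is aligned to the word size; implementations check this
// before using block loads.
inline bool word_aligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % sizeof(std::uintptr_t) == 0;
}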
Do you think the writers of the C library, which has been polished for decades, are so dumb that they would produce a subpar implementation that even a naive approach would outperform?
Yes. In my database analyzer I sped up my program by 10% by using my own string comparison, even though I haven't optimized it yet.
Isn't this a case of premature optimisation, which is the root of all evil?
Why should I use slow implementations?
Something is slower here, something is slower there, and as a result modern computers run their programs much slower than old computers ran their old programs. Earlier programmers used hints and hacks to speed up their programs, and they told others how they did it.
Have you ever seen a string implementation where the data is not aligned properly?
Do you think the writers of the C library, which has been polished for decades, are so dumb that they would produce a subpar implementation that even a naive approach would outperform?
Yes. In my database analyzer I sped up my program by 10% by using my own string comparison.
If you profiled, tested, and your implementation speeds up the whole program, then yes, it is a good candidate for optimisation.
Why should I use slow implementations?
Because portability and readability > speed. If you have ever maintained a large project, you will have experienced that already.
Your code is not portable, as it contains undefined behavior, and an attempt to dereference past the end might easily result in a crash on a RISC CPU (or on ARM processors, depending on the compiler).
Have you ever seen a string implementation where the data is not aligned properly?
The 32-bit libstdc++ bundled with GCC 4.3, I believe, on a 64-bit platform. It takes the result of the allocator (which is aligned to maxalign) and places the first character 5*4 bytes ahead, which is out of alignment for 8-byte access.
If you profiled, tested, and your implementation speeds up the whole program
Yes. I ran it with rdtsc. I used a profiler. I checked its real performance with a timer (on a mobile phone :-).
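For reference, a minimal sketch of that kind of measurement, assuming x86 and GCC/Clang (use <intrin.h> on MSVC); do_work() is a hypothetical stand-in for the code under test:

#include <cstdint>
#include <x86intrin.h>  // __rdtsc

void do_work();  // hypothetical function under test

// Sketch: bracket a call with rdtsc to get raw TSC ticks.
// Subject to the caveats discussed below (out-of-order
// execution, thread migration, interrupts).
std::uint64_t time_call()
{
    std::uint64_t start = __rdtsc();
    do_work();
    return __rdtsc() - start;
}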
Because portability and readability > speed.
Readability is achieved by well-written documentation.
Portability: fast low-level blocks for every target are better than one slow universal one.
Your code is not portable, as it contains undefined behavior...
Please point me to it. Tell me the line number and explain.
and an attempt to dereference past the end
Where?
The 32-bit libstdc++ bundled with GCC 4.3
Yes, the 32-bit version of my function will compare by four bytes only. Using something like SSE or AVX would be better. So should I have different implementations and check the speed of each of them on different targets? It is still better than the usual slow universal implementation.
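To illustrate what that could look like (a sketch of my own, not tested on your data): with SSE2, sixteen bytes can be compared in a handful of instructions:

#include <emmintrin.h>  // SSE2 intrinsics

// Sketch: compare exactly sixteen bytes with SSE2.
// _mm_loadu_si128 tolerates unaligned pointers.
bool equal16(const char* a, const char* b)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i eq = _mm_cmpeq_epi8(va, vb);       // 0xFF per equal byte
    return _mm_movemask_epi8(eq) == 0xFFFF;    // all sixteen bytes equal
}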
There was a lot of variance, though. In any case, what MiiNiPaa said: if you found a bottleneck, profiled, and improved it, good job. It's what I do for a living. But I still don't recommend new/delete in C++ code, thanks to make_unique.
Please point me to it. Tell me the line number and explain.
reinterpret_cast<const size_t*&>: the result of a reinterpret_cast to anything other than its original type (aside from char* and void*) is unspecified (not undefined, actually), so the results are unknown.
I have encountered an implementation which automatically aligns the resulting pointer, which might make the comparison not work properly (however, I doubt that you will ever need to analyse databases on industrial machines).
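Concretely, this is the shape of the pattern in question (a sketch, not the OP's exact code); the memcpy form shown earlier in the thread is the well-defined replacement:

#include <cstddef>

// Anti-pattern sketch: nothing guarantees the character data is
// aligned for size_t access, and the result of the cast itself
// is unspecified for types other than the original one. May
// trap on strict-alignment CPUs.
bool risky_equal(const char* a, const char* b)
{
    return *reinterpret_cast<const std::size_t*>(a)
        == *reinterpret_cast<const std::size_t*>(b);
}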
Fast low-level blocks for every target are better than one slow universal one.
I doubt that anyone can cover all potential targets. There are hundreds of them.
I ran it with rdtsc.
I am not trying to discredit your results here, but rdtsc() is not a silver bullet: 800 cycles of actual work can execute in less time than 100 cycles spent mostly stalling on some IO operation. It is also mostly useless in multithreaded applications, as another thread can interrupt your execution and you will get exaggerated results, or you can even get the start and end readings from different cores.
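One common mitigation (a sketch, not a prescription) is to time many repetitions with a monotonic clock and average, which smooths out interrupts and stalls:

#include <chrono>

// Sketch: average the cost of a callable over many runs.
template <class F>
double average_ns(F work, int runs = 10000)
{
    using clock = std::chrono::steady_clock;
    auto start = clock::now();
    for (int i = 0; i < runs; ++i)
        work();
    auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(
        clock::now() - start);
    return static_cast<double>(total.count()) / runs;
}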
I say it again: If it is a proven problem, fix it. Optimise, refactor, etc.
Just do not do that before it is a problem. Do not start juggling pointers to GUI elements instead of packing them into a neat managing class at the very beginning because it will be more efficient. I have seen projects descend into support hell, where it turned out that rewriting a module from scratch would cost less than letting a new member figure out how it works.
What are you doing differently that is causing such a speedup? I'm guessing the numbers you have shown are the number of instructions executed to do the comparison?
What are you doing differently that is causing such a speedup?
The basic idea is that instead of the traditional comparison of one character at a time, he's accessing the character memory in size_t-sized blocks and comparing those (effectively dividing the number of comparisons done by 4, assuming sizeof(size_t)==4)
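A sketch of the general shape of that technique (illustrative only; real implementations also deal with alignment and ordering):

#include <cstddef>
#include <cstring>

// Sketch: compare n bytes in size_t-sized blocks, then finish
// the remaining tail byte by byte.
bool block_equal(const char* a, const char* b, std::size_t n)
{
    for (; n >= sizeof(std::size_t); n -= sizeof(std::size_t))
    {
        std::size_t x, y;
        std::memcpy(&x, a, sizeof x);
        std::memcpy(&y, b, sizeof y);
        if (x != y) return false;
        a += sizeof x;
        b += sizeof y;
    }
    for (; n != 0; --n)
        if (*a++ != *b++) return false;
    return true;
}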
I'm guessing the numbers you have shown are the number of instructions executed to do the comparison?
It's the number of cycles. rdtsc returns the number of cycles elapsed since system reset.
Cubbi (3720):
Thank you for those numbers. There are many implementations of the STL; some are quick enough, some are slow. I had better report a bug to the authors of my implementation.
make_unique is good. By the way, how can I use it with my own allocators?
I doubt that anyone can cover all potential targets. There are hundreds of them.
So let's have one universal implementation and many specific ones. If we have a specific, quicker implementation, let's use it.
rdtsc() is not a silver bullet
Average values over many runs give the general picture. OK, a slow implementation is not a problem I should discuss here; I should report a bug to the authors.
Good idea. Did you try writing your own char_traits implementation and providing it as a string template parameter instead of relying on an external function? That would let the whole standard library use your implementation instead of the standard one everywhere.
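A minimal sketch of what that could look like (the names fast_traits, fast_string, and my_fast_compare are mine; everything except compare is inherited from std::char_traits<char>):

#include <cstring>
#include <string>

// Placeholder for your own comparison routine; swap in the
// word-at-a-time version here.
inline int my_fast_compare(const char* a, const char* b, std::size_t n)
{
    return std::memcmp(a, b, n);
}

// Sketch: derive from the standard traits and hide compare.
struct fast_traits : std::char_traits<char>
{
    static int compare(const char* a, const char* b, std::size_t n)
    {
        return my_fast_compare(a, b, n);
    }
};

// Note: a distinct type from std::string, with no implicit
// conversion between the two.
using fast_string = std::basic_string<char, fast_traits>;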
The basic idea is that instead of the traditional comparison of one character at a time, he's accessing the character memory in size_t-sized blocks and comparing those (effectively dividing the number of comparisons done by 4, assuming sizeof(size_t)==4)
Got it! I remember doing this before, but not in C. It also reminds me of something I saw earlier: http://qr.ae/Em78U
make_unique is good. Btw, how can i use it with my own allocators?
As the original proposal http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3588.txt says, "Expert users with custom allocators can easily write their own wrappers with corresponding custom deleters.". It kind of annoys me that allocators are regarded as expert material, because it leads to libraries that go after the system heap when I don't want them to (my old job viciously gutted and rewrote all such libraries; the system heap was off-limits).
There is basic_string, and the allocator is its third parameter. That part is fine.
When I use make_unique, the function allocates memory. I want that memory to be allocated with my own allocator. Where do I put my allocator when using make_unique?
If you want all objects of your own type to be allocated with said allocator, you can overload the member operator new.
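For illustration, a sketch of class-scope operator new/delete forwarding to a custom allocator (my_alloc and my_free are hypothetical stand-ins for your allocator's entry points):

#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical entry points of your own allocator.
void* my_alloc(std::size_t n) { return std::malloc(n); }
void  my_free(void* p)        { std::free(p); }

struct Widget
{
    // Every `new Widget` now goes through my_alloc.
    static void* operator new(std::size_t n)
    {
        if (void* p = my_alloc(n)) return p;
        throw std::bad_alloc{};
    }
    static void operator delete(void* p) noexcept { my_free(p); }
};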
Otherwise you will have to write your own wrapper, something like make_allocated_unique. make_unique gets no love from the standard committee here, as it is deemed trivial to write :(.
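A sketch of such a wrapper (make_allocated_unique is just the hypothetical name from above; Alloc must be an allocator for T). The deleter destroys the object and returns its storage through the same allocator:

#include <memory>
#include <utility>

// Deleter that destroys the object and deallocates through
// the allocator it was created with.
template <class T, class Alloc>
struct alloc_deleter
{
    Alloc alloc;
    void operator()(T* p)
    {
        std::allocator_traits<Alloc>::destroy(alloc, p);
        std::allocator_traits<Alloc>::deallocate(alloc, p, 1);
    }
};

// Sketch: like make_unique, but the memory comes from 'alloc'.
template <class T, class Alloc, class... Args>
std::unique_ptr<T, alloc_deleter<T, Alloc>>
make_allocated_unique(Alloc alloc, Args&&... args)
{
    using traits = std::allocator_traits<Alloc>;
    T* p = traits::allocate(alloc, 1);
    try {
        traits::construct(alloc, p, std::forward<Args>(args)...);
    } catch (...) {
        traits::deallocate(alloc, p, 1);
        throw;
    }
    return std::unique_ptr<T, alloc_deleter<T, Alloc>>(
        p, alloc_deleter<T, Alloc>{std::move(alloc)});
}

// Usage: auto q = make_allocated_unique<int>(std::allocator<int>{}, 42);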