Optimization: why is my Eigen code slow?

I recently finished rewriting some code of mine that used plain C++ arrays so that it uses Eigen matrices instead. Unfortunately the Eigen code is slower. I also read that Eigen is sensitive to optimization, so I added more flags to my compile command, which helped with the speed but not significantly. My old code returns the following time for about 64 iterations:
Iteration = 64    t = 0.6381360078   phi_iter = 1
^C
real    0m8.911s
user    0m8.819s
sys     0m0.020s

Whereas my Eigen code returns the following for the same number of iterations:
Iteration = 64    t = 0.6381360078   phi_iter = 1
^C

real    1m10.344s
user    1m5.873s
sys     0m4.440s

The flags I am using to compile the Eigen code:
 
 -O3 -DNDEBUG -march=native -Ofast

I am eventually going to use OpenMP, but first I would like the two codes to run at similar speeds. I will post my full code here for anyone to add suggestions/recommendations to speed up the code. Thanks.

Full code:
https://www.zipshare.com/download/eyJhcmNoaXZlSWQiOiJmNDIzMDgxNS01NDc4LTRjN2ItYWVmZi04MTFhNTFkN2U5MjciLCJlbWFpbCI6ImFzdHJvbHVqeUBnbWFpbC5jb20ifQ==
C arrays are going to be faster, at least a little. They go on the stack, so there is no memory allocation and free, no error/bounds checking, or whatever else goes on in there.

But you need to profile that code and see where the time is being spent. I suspect a great deal of it is being spent creating and destroying objects in your N*N for loops, but that is just a guess. pow is also slow; if your powers are integers, write your own, as the extra steps used to do a floating-point power take time, especially for simple stuff like N*N. Trig is also slow; you have to have it (there really isn't anything else), but reduce and reuse values where you know the angles are the same. It's not likely that pow and trig are the real problem given that you are doing matrices, but in a loop like that, wasted time that is easy to fix should still be poked at.

I hate to say it, but move every single matrix out of the loops, and inside the loops just re-zero them if necessary.
I have some time to take a look, but to optimize the program effectively I need to be able to execute the code with a representative test case.

The code in the zip file doesn't compile. It looks like some matrix dimensions are wrong, and more importantly some variable named all is not declared.

Can you provide code that compiles and some "realistic" test input data?
Last edited on
@mbozzi
Sure! I thought I provided my exact code, but I could be wrong. I will post the full code that runs with a real output for you.
and more importantly some variable named all is not declared.

Oh, all is a slicing and indexing placeholder from Eigen; check out the documentation:
https://eigen.tuxfamily.org/dox/group__TutorialSlicingIndexing.html
I included all the needed headers for that, but you could try qualifying it as:
 
Eigen::all

These are the exact files I compile, and I just verified that they run okay for me, so the matrix dimension issue concerns me a bit. I use the following command line:
 
g++ -I /mnt/c/Users/J/eigen-3.4.0/eigen-3.4.0/ 2DPACMAN.cpp spectralFunctions3D.cpp arithmeticFunctions3D.cpp  -lfftw3 -lm -O3 -DNDEBUG -march=native -Ofast -o test.out

Code:
https://www.zipshare.com/download/eyJhcmNoaXZlSWQiOiJmOWE5OThiMS1kNDAxLTQwNjAtYTVkYS0wODc4NzI5YTBiYzgiLCJlbWFpbCI6ImFzdHJvbHVqeUBnbWFpbC5jb20ifQ==

NOTE: change the size nx,ny to 128 (in spectralFunctions3D.h) in order to see how SLOW the code is
The code compiles fine, unchanged, but it seems to use up a lot of memory. I ran out of RAM around iteration 18.
Oh, I was trying to build against Eigen 3.3.7-2.

I had the same issue the last time I built your program but I didn't remember the solution. I don't know the lowest supported version but it does compile against v3.5.0.
One idea is to run both programs through a profiler and compare, to see which parts seem to be slower.

You might also want to look at intermediate results to ensure both programs are doing the same amount of work. It seems like you're using a do-while loop that runs until the error is small enough, so if there is some problem there that causes many more loop iterations, that could very well explain why it's slower.

I also happened to stumble upon line 175 in spectralFunctions3D.cpp:
 
1.5 * ph; //correct ph = 1.5 * ph; 
I could be wrong but to me it looks like it doesn't do anything.
@JamieAl. As you haven't posted code I can't comment. I don't normally open unknown site links. However I did try to open your posted link on a test computer - and McAfee complained about the site being potentially unsafe!

Although without seeing the code, in general I agree with Jonin's statement about creating/destroying non-stack objects (or those with a non-trivial constructor) within a loop: for best performance, avoid it if it turns out to be an issue. For pow/trig functions, if you know the values to be computed (or they can be found), pre-compute them and use a look-up table (use constexpr where possible so the calculations are done at compile time rather than run time). Any variable whose value doesn't change should be marked const (or better, constexpr if feasible).
seeplus wrote:
McAfee complained about the site being potentially unsafe!

Probably because anyone can upload anything there.
The code compiles fine, unchanged, but it seems to use up a lot of memory. I ran out of RAM around iteration 18.

Oh no! Hmm, do you mean a memory leak, or just that it uses a lot of memory? I might run Valgrind on this, and I will check a profiler as well.


1.5 * ph; //correct ph = 1.5 * ph;
I could be wrong but to me it looks like it doesn't do anything.

I am comparing my code to an original one I wrote in MATLAB with correct results/output. What is interesting is that this line of code was returning correct results matching MATLAB even though it made little sense to me from a syntax point of view, so I will leave it for now. I guess, as @Jonin mentioned, I will get rid of pow and other time-consuming operations.

@seeplus wrote:
use constexpr so that the calculations are done at compile time and not run-time

Yes, I will do that as well. Thanks for the suggestions and sorry I couldn't share the code in a safer website/link.
Also, one last thing. I vaguely remember reading something here by @mbozzi a long time ago about Eigen being more efficient, or faster, than C++ arrays. But I could be wrong or have misunderstood what he said.
This is like the complaints about vector vs. C array. The vector uses dynamic memory, it can reallocate, it tracks its own size, and so on, so it's slower than the C array. Microscopically: you almost need to count CPU instructions to see it; it won't even register on any timer. No one cares: vector is a powerful tool with a TON of friendly operations and features, while a C array is little more than a pointer where you have to do it all yourself.

Eigen does a lot more than a C array or a vector, and it's been gone over for good performance. You would be hard pressed to reproduce even simple parts of it with faster results, even if you used C arrays (which you can't, since general-purpose sizes can't be fixed). And it would take months, even years, to reproduce just the pieces you need, a fraction of the whole. Speed isn't everything; you can't humanly recreate every library you need by hand to get a few ms of speed here and there at the cost of decades of labor.

Worry about the variable allocations before pow etc. Low hanging fruit.
oh no! Mmm do you mean like a memory leak or just uses lot of memory?

I don't know. But it seems to use more and more memory over time. I only have 8 GB of RAM.
JamieAl wrote:
I vaguely remember reading something here by @mbozzi long time ago about Eigen being more efficient or is faster than C++ array.

A C++ array or an Eigen matrix doesn't have a speed. It all depends on how you use it.

A C++ array is a very low level thing with essentially no overhead at all. It's hard to beat. If you manually loop over an Eigen matrix and do something with each element you cannot expect it to be faster compared to doing the same with a "2D array" (array of arrays).

But if you're using Eigen to do matrix operations (matrix multiplication, dot product, etc.) then those operations have probably been optimized and will likely be faster than if you tried to implement them yourself.

Make sure you don't compare apples and oranges. You cannot really compare an Eigen::MatrixXd with a double[10][10] because one is dynamic size and the other is fixed size. Instead you might want to compare Eigen::Matrix<double, 10, 10> with double[10][10], and Eigen::MatrixXd with ... a dynamically allocated array of doubles (plus variables storing the width and height)?
Thanks everyone! I am slowly, but surely working my way through everyone's suggestions here.

About the memory thing: I ran Valgrind and it seems there might be a memory leak issue. This sort of error usually makes sense to me when working with C++ arrays, but with Eigen matrices I am not so sure.

==7593== Conditional jump or move depends on uninitialised value(s)
==7593==     by 0x12FC0F: potentialk(Eigen::Ref<Eigen::Matrix<std::complex<double>, -1, -1, 0, -1, -1> const, 0, Eigen::OuterStride<-1> > const&, Eigen::Ref<Eigen::Matrix<std::complex<double>, -1, -1, 0, -1, -1> const, 0, Eigen::OuterStride<-1> > const&, Eigen::Ref<Eigen::Matrix<std::complex<double>, -1, -1, 0, -1, -1> const, 0, Eigen::OuterStride<-1> > const&, Eigen::Ref<Eigen::Matrix<std::complex<double>, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1> >, Eigen::Ref<Eigen::Matrix<std::complex<double>, -1, -1, 0, -1, -1> const, 0, Eigen::OuterStride<-1> > const&, Eigen::Ref<Eigen::Matrix<double, -1, 1, 0, -1, 1> const, 0, Eigen::InnerStride<1> > const&, Eigen::Ref<Eigen::Matrix<double, -1, 1, 0, -1, 1> const, 0, Eigen::InnerStride<1> > const&, Eigen::Ref<Eigen::Matrix<double, -1, -1, 0, -1, -1> const, 0, Eigen::OuterStride<-1> > const&, double, int) 

Usually this means something isn't initialized well, but I am not really sure what the issue is. There are similar errors that I have to sort of go through carefully. Not sure if the type or the initialization is the issue here.
Ok, I finally had a chance to take a look. I was able to get a 25x speedup over baseline with only small changes. Because of how the program is structured I think there is plenty more room for improvement, but I think there are correctness issues that need to be addressed first. I used ny = nx = 128 the whole time.

First I looked at measured problem areas. First I replaced r2cfft1d and c2rfft1d to avoid resource leaks, nonessential memory allocations, and redundant work. Another resource leak in Ecalc_dt was removed by moving the free call above the return statement. After that, measurement showed that the program spent 90% of execution time within potentialk's inner loop, particularly within the call to Eigen's colPivHouseholderQr. So that was changed to the marginally-faster householderQr.

The iterations of the inner loop are independent, so I parallelized it. For simplicity, a new batch of worker threads are created each time potentialk is called, which isn't ideal but it should be alright for now. There may still be a big potential for improvement here because the processor spends 90% of its time inside this loop waiting on I/O. For example it might be beneficial to switch to float if you can afford the reduced precision. Additionally, the elements of L are linear functions of the loop variable l. Maybe this can be exploited to avoid paying full-cost to factor the matrix every time. Finally, Eigen has faster solvers than householderQr, but most of them have constraints on the types of matrices that can be solved, and I don't know enough linear algebra to know whether they're suitable. See:
https://eigen.tuxfamily.org/dox/group__TutorialLinearAlgebra.html

At this point the computer spends roughly 1/2 of wall-clock time inside potentialk. This corresponds to about 12ms per call or 50ms per iteration of the main loop, and the processors are severely starved for data during this period. The rest of the main loop consumes the other 50ms per iteration, of which a good chunk of that time is spent in memset, probably because setZero is frequently called for no reason.

Then I looked at the rest of the program. As a whole, the program does a lot of redundant work. For example this snippet from EcalcPotSourcek_inertia:
Eigen::MatrixXd D2((ny + 1), nx);
D2.setZero();
D2 = D * D;

Could be written
Eigen::MatrixXd D2 = D * D;
Note this does incur a nonessential memory allocation, which can be eliminated if the matrix is allocated just once at the start of the program:
D2 = D * D;

In addition EcalcPotSourcek_inertia doesn't do anything as far as I can tell. In particular it never appears to modify any of its arguments. Besides it does a whole bunch of (unrelated?) work and throws away the results. Same deal with Ecalc_residualt, which overall seems very broken. I'd strongly recommend that you spend some time to write automated tests for each of your functions and do a thorough review of the entire program, focusing on correctness.

I think you're well on track to outperform your original program. A performance profile shows that 18% of program time is spent inside memset, which is mostly unnecessary, another 11% inside memcpy, which is also mostly unnecessary. 32% inside some implementation of the BLAS routine called gemm. 5% inside pow taking integer powers, which is trivial to remove. You'll also be interested to know that when the call to EcalcPotSourcek_inertia was removed from the main loop, the subsequent (slow) call to potentialk sped up by 2x, likely because of memory effects.

As far as how to proceed, I would suggest doing a thorough code review & writing some tests. In case you've never done testing before, the most basic way to do it is:
1. Write a new C++ file with an alternate main function.
2. Call each function-under-test individually from this alternate main function, and compare each function's results against expected answers. Print out an error message if something's screwed up.
3. Compile against the "test" main function when you want to do tests; compile against the "real" main function when you want to run the program.
You would be able to run these tests for some assurance that your code is correct.

As far as performance improvements are concerned, try to avoid unnecessary work. Let Eigen do your matrix math and get rid of the hand-written for loops. Stop reallocating matrices. Stop calling setZero by rote. Avoid computing stuff that's not used later.

Then you can look at changing the program to better use the machine's full capabilities, parallelization and so on.
If you have clang++ available, try that. For me it offered a consistent 1.15x speedup over g++.

The exact compiler command I used is
clang++ -std=c++17 -D_CRT_SECURE_NO_DEPRECATE -Wall -Wextra -pedantic -Wno-comment -Wno-unused-parameter -Wno-unused-variable arithmeticFunctions3D.cpp spectralFunctions3D.cpp 2DPACMAN.cpp -I D:\opt\vcpkg\installed\x64-windows\include -lfftw3 -pthread -Ofast -march=native -ffast-math -flto -fno-color-diagnostics -isystem d:\jamieal\eigen -o bin/main.exe -LD:\opt\vcpkg\installed\x64-windows\lib -fuse-ld=lld

Hope this helps, the modified code is at
https://mbozzi.com/t1G8gXj.zip
along with this commentary & the original code & patches.
@mbozzi
WOW! First, thanks for taking the time to write such incredible feedback; I will take a look at everything now. Before looking at the code, I'll mention that I was running into memory leak issues, specifically in potentialk, in particular the inner loop where I use colPivHouseholderQr. It seems like an indexing issue, as things end up not initialized correctly (not enough space or an incorrect size). I am using Valgrind for this, but I will take a look at your modifications first. Thanks again, this is incredibly insightful.
In addition EcalcPotSourcek_inertia doesn't do anything as far as I can tell. In particular it never appears to modify any of its arguments

This is mainly my fault for writing the code in such a complex way, but this function is important in the sense that it calculates a term, "potSourcek_inertia", that gets passed to potentialk. The term is very complicated, unfortunately, but so far this was the best way to calculate it.
Topic archived. No new replies allowed.