Hi there, I was just trying my matrix class on a 1000x1000 matrix multiplication, and I observed that the statement C=A*B runs TREMENDOUSLY slowly compared to Matlab. Here is the algorithm I am using, which is the simplest one, but I don't think the difference should be this big, right?
What is temp.fill doing? Why not temp[some row][some col] = kk?
Your = operator should do a single memcpy (or similar) instead of copying element by element when it copies the actual matrix data (there may be other members to set, like rows and cols). The copy constructor should do the same, possibly reusing the assignment operator internally if possible.
I can post my multiply, but it's very old code and not really up to modern design or standards, if you want to see it.
A couple of tweak ideas:
The fastest I was able to do it was by transposing the second matrix so that BOTH matrices are indexed in row-major order. This adds an up-front cost, but the multiply itself is significantly faster because it greatly reduces the page faults.
You can also short-circuit zeros, so that rather than doing 1000 multiplies for a zero element, you just skip it.
There is a faster multiply algorithm beyond brute force (Strassen's), but it is complex; you can google it.
1000x1000 is not that big; it should be doable in under a second.
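For concreteness, the brute-force multiply everyone is referring to looks roughly like this. A minimal sketch assuming a std::vector-of-vectors layout; `Matrix` and `multiply` are placeholder names, not the poster's actual class:

```cpp
#include <vector>
#include <cstddef>

// Placeholder alias standing in for the poster's matrix class.
using Matrix = std::vector<std::vector<double>>;

Matrix multiply(const Matrix& a, const Matrix& b) {
    std::size_t n = a.size();     // rows of A
    std::size_t m = b[0].size();  // cols of B
    std::size_t p = b.size();     // cols of A == rows of B
    Matrix c(n, std::vector<double>(m, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            double sum = 0.0;  // accumulate in a local, not in c[i][j]
            for (std::size_t k = 0; k < p; ++k)
                sum += a[i][k] * b[k][j];  // b[k][j] strides down a column: cache-unfriendly
            c[i][j] = sum;
        }
    return c;
}
```

The `b[k][j]` access in the inner loop is the part the transpose tweak targets.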
Because I couldn't think of another way of changing the the_matrix[i][j] element from external data... So the "=" operator might be the big problem then, or the copy constructor? I don't understand very well what you mean about making the copies.
About the ideas, thank you for them; I believe they can really make a difference, and I guess Matlab fully implements them. However, when I say my code is slow compared to Matlab, I mean that a 1000x1000 matrix product in Matlab takes on the order of 0.04 seconds after allocation, whereas my code takes about 16 seconds... The difference shouldn't be that big, I believe, even when more sophisticated algorithms like the one you are suggesting are used; that only changes O(n^3) to O(n^{log2 7}) ≈ O(n^{2.81}).
EDIT 1: OK, I just discovered that the problem lies in the matrix multiplication itself. When I perform it without the "=" operator, that is, doing only A*B but not C=A*B, it still consumes a lot of time. Any hints?
The alternative algorithm shouldn't be necessary unless you need it VERY fast. Again, I think you should be able to run a 1k by 1k multiply in under a second with the basic brute force shown here. You HAVE to be doing a copy or something inside that is bogging it down.
While I look, can you time ONLY the multiply and then time ONLY an assignment operation?
Edit: it has to be that.
Let me guess ... in main you have
R = A*B;
My code was 1) realtime and 2) running on 1990s-era hardware, so it did its own memory management and kept a huge pool of memory for temporary intermediate results at all times. So I would have dumped the results into the pool and returned a pointer to them, eliminating the creation of temp and the copy back out of temp into the answer matrix. Terrible design, but it was fast.
Time just
A*B;
instead of
R = A*B;
That will answer where the problem lies.
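One way to time the two cases separately is std::chrono. A sketch with a stand-in workload; busy_work is a placeholder for the poster's matrix expressions (A*B versus C = A*B):

```cpp
#include <chrono>

// Placeholder workload standing in for the expression being timed.
double busy_work(int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += i * 0.5;
    return s;
}

// Time one expression in seconds using a monotonic clock.
double time_seconds(int n) {
    auto t0 = std::chrono::steady_clock::now();
    volatile double sink = busy_work(n);  // volatile: keep the optimizer from deleting the work
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Wrapping `A*B;` alone and then `R = A*B;` in the same pattern gives the two numbers being asked for.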
OK, when I perform A*B the total time taken is 15.7 seconds, whereas with C=A*B it is 15.95 seconds. I would say the problem really lies in the overloading of the operator *. MAYBE THE IF CONDITION INSIDE THE BODY?
// main function
#include <iostream>
#include <fstream>
#include <cmath>
#include <time.h>
#include <stdlib.h>
#include <iomanip>
#include "matrix.h"
#include "EIGENproblem.h"
#define pi 3.1415926535
using namespace std;

int main() {
    matrix A(1000,1000), B(1000,1000), C(1000,1000);
    A.ones();
    B.randomS();
    cout << "done" << endl;
    C = A*B;
    cout << "done" << endl;
    return 0;
}
I just discovered the problem lies in the line kk += (the_matrix[i][k])*(M.get(k,j)) in the overloaded function. Why can that be?? Maybe the brackets, and the fact that I am also using the "*" operator inside an overload of the "*" operator (although one is for matrices and the other for doubles)? Apart from that, what else can it be?
If I comment out the line kk += (the_matrix[i][k])*(M.get(k,j)), the issue disappears. Why is that happening?? When I comment out that line, the time is 2.54 seconds, which is not "fast" I would say, but at least roughly 1/8 of before.
> Time just A*B; instead of R = A*B; That will answer where the problem lies.
I did: in the first case it is 15.7 seconds, in the second 15.95.
UPDATE: It is the "get" function that is making the trouble, but seriously, I don't know why this could be, as get has the following simple form:
Great. Let's look at those... give me a few minutes, I'm multi-tasking.
In the meantime, remove the call to .get and time it again. Access M's data directly instead (I can't recall whether you need to make the data public for testing purposes only, or whether it is already accessible to you because you are inside the matrix class). Whatever it takes to test it for a moment; you can undo it after.
If you comment out that line, the routine isn't doing much work :)
That may or may not be the problem; checking the get function will tell you.
Also pay attention to your values. For n = 1000, brute force does n^3 = 10^9 multiply-adds, which is substantial. n raised to a smaller power is much less than n^3. You can plug n = 1000 into the O() of each approach and compare the counts to decide whether you need the other algorithm.
Also, the obvious approach of shipping each 1-row-by-1-column product off to worker threads on an 8+ core CPU would help. You have to watch these: thread creation has a cost, but done correctly it cuts the time roughly linearly by the number of CPUs.
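The threading idea can be sketched with std::thread, partitioning the result rows into contiguous bands so each thread writes disjoint memory. Names and the vector-of-vectors layout are illustrative, not the poster's class:

```cpp
#include <thread>
#include <vector>
#include <algorithm>
#include <cstddef>

using Matrix = std::vector<std::vector<double>>;

// Each thread computes a contiguous band of result rows; bands don't
// overlap, so no locking is needed.
Matrix multiply_threaded(const Matrix& a, const Matrix& b, std::size_t nthreads = 4) {
    std::size_t n = a.size(), m = b[0].size(), p = b.size();
    Matrix c(n, std::vector<double>(m, 0.0));
    auto band = [&](std::size_t r0, std::size_t r1) {
        for (std::size_t i = r0; i < r1; ++i)
            for (std::size_t k = 0; k < p; ++k) {  // k-then-j order walks b row-wise
                double aik = a[i][k];
                for (std::size_t j = 0; j < m; ++j)
                    c[i][j] += aik * b[k][j];
            }
    };
    std::vector<std::thread> pool;
    std::size_t chunk = (n + nthreads - 1) / nthreads;
    for (std::size_t r = 0; r < n; r += chunk)
        pool.emplace_back(band, r, std::min(n, r + chunk));
    for (auto& t : pool) t.join();
    return c;
}
```

Spawning threads per call only pays off for large matrices; for 1000x1000 the thread-creation cost is negligible next to the arithmetic.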
Also, if you get frustrated enough, the CLAPACK family of libraries does it for you, though it has a moderate learning curve, as all the routines expect you to factor, normalize, and do various things before you can do anything. I used it on the eigen stuff and in a few places; my code was set up to take a .M file and, with very few changes, it was C++. I had a lot of M files to convert at that time.
Did you remember to compile the code with optimizations on? If you forget, it will be a "debug" build, and the executable is full of junk that makes it slower; in cases like this one, MUCH slower.
E.g., for g++ that is -O2 or similar options. For Visual Studio it is flipping the "Release build" toggle. Other tools do it differently, but this is critical.
Your answer indicates that this is likely the problem; that, and your code looks pretty darn good apart from the little stuff I offered.
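For example, with g++ the difference is just a flag (filenames here are assumed):

```shell
# Optimized "release" build:
g++ -O2 -o matmul main.cpp matrix.cpp
# Unoptimized "debug" build for comparison; expect it to be much slower:
g++ -O0 -g -o matmul_debug main.cpp matrix.cpp
```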
OK, I am trying to translate the code I had, which used the <complex> data type (with conditions for it) coupled with single-dimensional data storage and manual indexing, into something you can read. I HOPE I did not mangle anything in the process; I know my original works.
matrix operator*(matrix b)
{
    // set up result matrix
    matrix at = transpose(*this); // remember, I didn't copy anything but a pointer here
                                  // due to the internal memory management. This is a
                                  // little more costly for your code.
    for (brows = 0; brows < b.rows; brows++)
    {
        for (bcols = 0; bcols < b.cols; bcols++)
        {
            con = b.data[brows][bcols];
            if (con) // is not zero...
            {
                // ...grumble... the version I found does not actually use the
                // transpose correctly. The idea is that whatever is in the inner
                // loop accesses both the result and the AT matrix by iterating
                // the column dimension while the rows stay fixed. It looks
                // something like result[something] += at[something]*con, but I
                // am having trouble re-creating it tonight.
            }
        }
    }
    return result;
}
Hopefully that makes sense to explain the transpose & skip-zeros ideas.
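A hedged reconstruction of those two ideas, transposing so both operands are walked row-major plus the skip-zeros test, using plain std::vector storage rather than the original pool-based class (all names are illustrative):

```cpp
#include <vector>
#include <cstddef>

using Matrix = std::vector<std::vector<double>>;

Matrix transpose(const Matrix& m) {
    Matrix t(m[0].size(), std::vector<double>(m.size()));
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = 0; j < m[0].size(); ++j)
            t[j][i] = m[i][j];
    return t;
}

// C = A * B via BT = transpose(B): each inner product walks a row of A
// and a row of BT, so both operands are traversed contiguously.
Matrix multiply_transposed(const Matrix& a, const Matrix& b) {
    Matrix bt = transpose(b);
    std::size_t n = a.size(), m = bt.size(), p = b.size();
    Matrix c(n, std::vector<double>(m, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < p; ++k) {
                double bkj = bt[j][k];
                if (bkj != 0.0)  // the "skip zeros" shortcut
                    sum += a[i][k] * bkj;
            }
            c[i][j] = sum;
        }
    return c;
}
```

The zero test only pays off when the matrices are genuinely sparse; on dense random data the extra branch can cost more than it saves.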
> I just discovered the problem lies in the line kk+=(the_matrix[i][k])*(M.get(k,j)) ... why can that be??
(For large matrices) "every iteration of the inner loop incurs a cache miss when accessing an element of B. This means that the algorithm incurs Θ(n³) cache misses in the worst case. ... the cache misses, rather than the actual calculations, dominate the running time for sizable matrices." https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm#Cache_behavior
that page fault hit is exactly what the transpose fixes.
Back when, Matlab was slower, probably because it was doing correct numerics (normalization and probably L/U). At some point around the time multi-core CPUs came out, it started to really smoke the brute-force C. Most likely whatever is under the hood is multi-threaded, uses a better algorithm (this is, after all, brute force), uses correct memory access, and so on. If there is any ONE routine in LA that is examined and tweaked to the max, it's the multiply.
We were doing control-theory stuff with stable, well-conditioned values, so we could skip the 'general purpose' stability checks and adjustments.
> that page fault hit is exactly what the transpose fixes.
Even after the transpose, there may be many cache misses for large matrices. High-performance matrix multiplication algorithms typically use tiling or block partitioning (in conjunction with expression templates in C++).
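The tiling idea can be sketched like this, assuming vector-of-vector storage; the tile size is illustrative and should be tuned to the cache sizes of the target CPU:

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

using Matrix = std::vector<std::vector<double>>;

// Blocked (tiled) multiply: process the matrices tile by tile so each
// tile triple fits in cache and gets reused before being evicted.
Matrix multiply_tiled(const Matrix& a, const Matrix& b, std::size_t block = 64) {
    std::size_t n = a.size(), m = b[0].size(), p = b.size();
    Matrix c(n, std::vector<double>(m, 0.0));
    for (std::size_t ii = 0; ii < n; ii += block)
        for (std::size_t kk = 0; kk < p; kk += block)
            for (std::size_t jj = 0; jj < m; jj += block)
                // mini-multiply on one tile triple
                for (std::size_t i = ii; i < std::min(n, ii + block); ++i)
                    for (std::size_t k = kk; k < std::min(p, kk + block); ++k) {
                        double aik = a[i][k];
                        for (std::size_t j = jj; j < std::min(m, jj + block); ++j)
                            c[i][j] += aik * b[k][j];
                    }
    return c;
}
```

Real libraries combine this with SIMD, flat storage, and threads, but blocking alone usually gives a large speedup over the naive loop at 1000x1000.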
UPDATE: OK guys, thank you all for your responses and for taking an interest in the problem. I have set the compiler optimization level to the fastest (-Ox), and the elapsed time of the single product A*B for 1000x1000 matrices is 10.6 seconds, compared to the ~18 seconds we had initially. Although there is some improvement, I still think that is a considerable amount of calculation time.
> (For large matrices) "every iteration of the inner loop incurs a cache miss when accessing an element of B. This means that the algorithm incurs Θ(n³) cache misses in the worst case. ... the cache misses, rather than the actual calculations, dominate the running time for sizable matrices." https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm#Cache_behavior
Is there a way to make this better somehow? Of course I could take the easy path: grab Eigen, learn all the declarations, and run my program... but that's not where the challenge is! Also, I want to understand what makes such a big difference between writing your own matrix class with operations and the ones used by programs like MATLAB.
Well, I have the whole matrix class defined in a separate .h file, and the function bodies are in another .cpp file, unlike the code you are posting above (where everything is defined in the single file with the main function). Can this also make a difference? Also, I thought I had turned the optimizer on, because the time dropped from ~17-18 seconds to 10 seconds or so, but it still cannot reach the sub-second scale!!
I suspect, but haven't had time to investigate, that the use of the highly optimised std::inner_product makes a huge difference. Pre-computing the transpose allows one contiguously stored row vector to be inner-producted with another, which should help as well (although it demands some resources in both time and memory to set up).
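A sketch of that suggestion, assuming vector-of-vector storage: transpose B once, then form each result element with std::inner_product over two contiguous rows (names are illustrative):

```cpp
#include <numeric>
#include <vector>
#include <cstddef>

using Matrix = std::vector<std::vector<double>>;

// C = A * B where each c[i][j] is std::inner_product of row i of A
// with row j of BT (= column j of B), both stored contiguously.
Matrix multiply_inner(const Matrix& a, const Matrix& b) {
    Matrix bt(b[0].size(), std::vector<double>(b.size()));
    for (std::size_t k = 0; k < b.size(); ++k)
        for (std::size_t j = 0; j < b[0].size(); ++j)
            bt[j][k] = b[k][j];
    Matrix c(a.size(), std::vector<double>(bt.size()));
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < bt.size(); ++j)
            c[i][j] = std::inner_product(a[i].begin(), a[i].end(),
                                         bt[j].begin(), 0.0);
    return c;
}
```

With optimizations on, the compiler can vectorize the contiguous inner products, which is where the speedup would come from.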
Once you have linked the object code, the fact that things started in different files is irrelevant. I can't see repeated calls to M.get() for single elements being very helpful, though. You are within a member function of the class, so it shouldn't be necessary.
You may want to have a look at this; much of the code is "C compiled with a C++ compiler", but it is quite easy to understand: https://github.com/attractivechaos/matmul
We can see from the results (caveat: an obsolete GNU compiler, though that is typical mainstream among the Linux crowd) that Eigen is very fast. Eigen is just a set of headers (it is copyleft, but the license is a lot saner than the LGPL); have a look at its source code if you are comfortable with expression templates (C++98 would do).
If you just want results, that Eigen looks like a very good option. You have 3-4 other options: one of the LAPACK family (I used the Intel Math Kernel version); you can also tell Matlab itself to generate an executable, and you MAY be able to beat a C or C++ file out of it (I forget whether they got that working properly); or one of the other libraries mentioned (some of this stuff is pay-to-win, but time is money).
Writing your own can be fun, and the code for basic matrix operations is not overwhelming at first, but you are re-inventing the wheel: the pro library guys have a tire factory while you have a rock, a hammer, and a chisel. This is a place to "buy, not build".
If you insist on writing your own, one last thing on top of the above: you can try valarray instead and use slices to generate the inner-product chunks. Multi-thread that. You might even beat Matlab's speed if you decide to dig into the problem. I'm not sure whether valarray is any better; it's just something to try. Multi-threaded should tear it apart, though.
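A minimal sketch of the valarray/slice suggestion, assuming square matrices stored flat in row-major order (function names are illustrative):

```cpp
#include <valarray>
#include <cstddef>

// Inner product of row i of A with column j of B, both n x n, stored
// flat in row-major order in valarrays.
double dot_row_col(const std::valarray<double>& a,
                   const std::valarray<double>& b,
                   std::size_t n, std::size_t i, std::size_t j) {
    std::valarray<double> row = a[std::slice(i * n, n, 1)];  // row i of A (contiguous)
    std::valarray<double> col = b[std::slice(j, n, n)];      // column j of B (stride n)
    return (row * col).sum();  // element-wise product, then reduce
}

std::valarray<double> multiply_va(const std::valarray<double>& a,
                                  const std::valarray<double>& b,
                                  std::size_t n) {
    std::valarray<double> c(0.0, n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            c[i * n + j] = dot_row_col(a, b, n, i, j);
    return c;
}
```

Each (i, j) pair is independent, which is what makes this layout easy to split across threads.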