When should I break up math equations or pre-calculate values, especially where loops are concerned?

closed account (Ezyq4iN6)
I know that functions are used to eliminate repeated code, but I figured that a similar concept would probably help with math equations in loops. My professor mentioned that some things, cosine/sine for example, take longer to compute than simpler math like addition. Given that, I'm wondering if there is a way to determine when you should either pre-calculate things or make a variable inside a loop to hold a value that is used by multiple equations. Please see my example below.

  // Doing the cosine calculation in the loop.
  for (int i = 0; i < 1000; i++)
    my_array[i] = my_int * cos(some_value);

  // Calculating a cosine once before the loop.
  double some_cosine = cos(some_value);
  for (int i = 0; i < 1000; i++)
    my_array[i] = my_int * some_cosine;

  // Another example where you have to have cosine
  // in the loop, but it is calculated twice.
  for (int i = 0; i < 1000; i++)
  {
    my_array[i].x = my_int * cos(i);
    array_2[i].x = other_int * cos(i);
  }

  // Pre-calculating the cosine.
  for (int i = 0; i < 1000; i++)
  {
    double cosine_i = cos(i);
    my_array[i].x = my_int * cosine_i;
    array_2[i].x = other_int * cosine_i;
  }


My professor is relatively old, so maybe this is no longer an issue with modern programming, but he said that many people used pre-calculated sine/cosine tables to avoid calculating them in the program. If people went to that extreme, then I figured there must be some benefit to trying to reduce the number of calculations, especially in a very large loop (like 1,000,000+).

Is there any way to know when to split things up, or is it not needed, and if it is, what does it depend on?
What did your professor say when you asked him this question (minus the old part of course)?

closed account (Ezyq4iN6)
@jlb
He said it was up to me to figure it out, which is his standard answer for all questions. He is more than happy to tell us things in lecture, but believes that all other knowledge should be self-sought or self-taught as he put it.

While I can do some of my own testing on syntax and possibly timing as well, I am simply concerned that what may pass in academia, where only the basics of coding are taught, may be laughed at in the industry and viewed as poor practice.

It would seem logical that doing anything you can ahead of time would shorten the calculation time, but the effect may not be worth it in some situations, especially if it costs readability, adds extra variables, etc. However, even then "best practices" sometimes win out because they are expected.
for (int i = 0; i < 1000; i++)
  my_array[i] = my_int * cos(some_value);

You do assign the same value to each element.

These make it more obvious, don't they?
const auto foo = my_int * cos(some_value);
for ( int i = 0; i < 1000; ++i )
  my_array[i] = foo;

// or
std::fill( my_array, my_array+1000, my_int * cos(some_value) );

More obvious and possibly more efficient too; that sounds like a win-win.

This is clearly different from the previous case, since each element gets a different value:
for (int i = 0; i < 1000; i++)
{
  my_array[i].x = my_int * cos(i);
  array_2[i].x = other_int * cos(i);
}

Two arrays at the same time? CPUs have limited registers and cache memory.
Have you considered:
for (int i = 0; i < 1000; i++)
{
  my_array[i].x = my_int * cos(i);
}
for (int i = 0; i < 1000; i++)
{
  array_2[i].x = other_int * cos(i);
}


I'm wondering if there is a way to determine when you should either pre-calculate things

Yes. It is called profiling.

You run your program in a way(1) that shows how much time each part of your program takes.
If some part is significantly slow(2) in correctly working code with representative data, then pay attention to that code.

When you have a different version, profile it too. Compare the timings.


1) Compilers have options for that.
2) The complexity of the chosen algorithm (its "Big O") can matter far more than one value (which an optimizing compiler might keep in a register even without a code change).
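
If a profiler isn't handy, a hand-rolled timing check is a common first approximation. Here is a minimal sketch that times the two loop variants from the original post (the repetition count is arbitrary, just enough to make the work measurable):

#include <chrono>
#include <cmath>
#include <iostream>

int main()
{
    const double some_value = 1.234;
    const int my_int = 7;
    static double my_array[1000];

    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    for (int rep = 0; rep < 10000; ++rep)
        for (int i = 0; i < 1000; ++i)
            my_array[i] = my_int * std::cos(some_value);   // cos inside the loop
    auto t1 = clock::now();

    const double hoisted = my_int * std::cos(some_value);  // cos hoisted out
    for (int rep = 0; rep < 10000; ++rep)
        for (int i = 0; i < 1000; ++i)
            my_array[i] = hoisted;
    auto t2 = clock::now();

    std::cout << "in-loop : " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "hoisted : " << std::chrono::duration<double>(t2 - t1).count() << " s\n"
              << my_array[0] << '\n';   // use the result so the loops aren't optimized away
}

Build it with the same optimization settings you plan to ship with; if the optimizer hoists the cos() itself, the two timings will come out nearly identical, which is itself a useful data point.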
closed account (Ezyq4iN6)
I get the use of const, but why use auto for foo instead of a double?

Also, does splitting the loop into two loops speed it up because it frees up registers and cache? How do you know when to put things in separate loops for cases like this?
Modern compilers have gotten very good at optimization, including invariant optimization.
Invariant optimization involves recognizing what instructions don't change during each iteration of the loop and moving those instructions outside the loop.

Using your first example:
some_register = my_int * cos(some_value);  // moved outside the loop by the compiler
for ( int i = 0; i < 1000; ++i )
  my_array[i] = some_register;


Of course, this doesn't help if you have optimization turned off.

why use auto for foo instead of a double?
auto lets the compiler deduce the type.

does splitting the loop into two loops speed it up because it frees up registers and cache? How do you know when to put things in separate loops for cases like this?

Not necessarily. Some processors, such as the Intel Itanium, have lots and lots of registers.
If you have performance critical code, it's a good idea to understand the generated assembly language. For shallow loops such as the examples you've given, the compiler optimization is probably going to do a better job of optimizing than you can. For deep, complex loops, profiling and assembly inspection is called for. Where the trade-off point is between compiler optimization and hard optimization depends on the machine and the complexity of the loop.
closed account (Ezyq4iN6)
I know it may sound like a strange question, but when would you not want the optimizer if it makes things better? I use Code::Blocks and I believe g++ is my compiler. Can I just select the -O optimize for speed option to get the benefit?

If auto deduces the type, should I switch all of my variables to auto from now on? And if not, when should I use it?

So I guess it really only matters in situations where you are programming a scientific device, or something that requires precise timing, right? Assembly looks a bit out of my league for right now, so I'll probably leave that to the compiler.

Thanks for your advice.
when would you not want the optimizer if it makes things better?

When you're debugging.

Depending on the optimization level, the optimizer can remove statement boundaries and can also remove variables. For example, the optimizer is very likely to assign a loop index to a register, since you would typically not have any references to the loop index outside of the loop. If you have a typical loop idiom of the form for (int i=0; ... ), then depending on your debugger, if i is assigned to a register, trying to watch the contents of i is not as straightforward as simply watching a variable.

For development testing, optimization is usually turned off. When a program is turned over for production, it's usually built with the highest level of optimization.

If auto deduces the type, should I switch all of my variables to auto from now on? And if not, when should I use it?

My rule is that I let the compiler deduce the type when the type is complex. I see it as lazy to specify auto when the type is obvious. I'm sure others will have their opinions.
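
For instance, a small illustration of that rule of thumb (the map of strings is just a stand-in for a "complex" type):

#include <map>
#include <string>
#include <vector>

int main()
{
    std::map<std::string, std::vector<int>> scores;
    scores["alice"].push_back(10);

    // Complex type: letting the compiler deduce the iterator type saves a long spelling.
    for (auto it = scores.begin(); it != scores.end(); ++it)
        it->second.push_back(0);

    // Obvious type: writing it out documents the intent just fine.
    double total = 0.0;
    for (const auto& entry : scores)
        for (int s : entry.second)
            total += s;

    return total > 0 ? 0 : 1;
}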



closed account (Ezyq4iN6)
Since I'm still new to coding I haven't really used the debugging features much, so I guess I never considered that. I mainly just do things in release mode, because a professor got me used to doing that. I'll at least make sure that I don't have the option selected until I am ready for the final version.

Ok, so maybe just use auto when accepting the results from equations with multiple data types, but not for things like base variables you know are of one type or another, or maybe not for loop counters.
I'm getting on up there myself, and I once tried to do a sin/cos lookup. It was slower than the built-in sin/cos by a significant margin.

pre-calculation is tricky on modern compilers. If you have the same expression many times, it can help to stuff that into a double and compute it once, but the compiler may already be doing that for you when it optimizes. Sometimes the compiler misses, but it's very smart. Lookup tables are awesome, but they can't beat a dedicated piece of hardware, so a CPU with a hot sin/cos built into it (it really only has one and shifts for the other) is going to beat anything you do in code all day long.

It comes down to a few rules of thumb, really.
the first thing to ask is whether your code is fast enough. No one cares if you can do something twice as fast if it's already giving the user the desired output in sub-nanosecond time (well, unless the machine is tapped out and you want to free up some CPU for something else). I'm a big fan of making code as fast as you can with minimal effort -- get rid of the easy-to-fix slowness where you can. But after that, if it's not running too slow, let it be.

Once you find something that is too slow, profile it and find where it is slow.
Then rewrite that little bit a few different ways, or research algorithms that might help, and try to fix it.

If you do it that way, you will know when to optimize, and that just leaves the easy-to-fix slowness that I mentioned. That comes with experience, but the things new coders do that are slow are the same few things... unnecessary memory allocation/deallocation (it uses OS resources and is sluggish), cache misses (which force the CPU to re-fetch data from slower memory), unnecessary copying of large objects, unnecessary I/O (almost all I/O is slow compared to anything else the machine can do, whether it's disk I/O or just writing to the console), poor choice of algorithm (usually brute-force solutions)... stuff like that. You will eventually get a handle on this stuff; try reading the 'how can I make this faster' threads in here, too. You can learn a bit from many of those.



closed account (Ezyq4iN6)
@jonnin
I'm surprised that using a const array lookup table would be slower than a math function. In that case, is it actually better to do the same calculation multiple times (2-4 times) rather than doing it once and storing the answer in a variable, because the math is faster than the variable access?

I think there is a profiler in Code::Blocks, so I'll see if I can get that to work. Do most use an IDE profiler, or do they set up tests with variables to find the time difference between the start and the end of the code in question? Also, I get your point about letting something be. If it isn't affecting the performance, don't waste extra time tweaking something that won't produce a noticeable result.

Since I am a new programmer, can you correct some of my misconceptions on those errors? I'll also read over the "how I can make things faster" threads.

For memory allocation and deallocation, what exactly are the major pitfalls?

With cache missing, can I avoid that by declaring large arrays as static and using my array loops with i then j, rather than j then i (array[i][j] form)?

As for large object copying, is it better to pass a const reference or a const pointer to avoid that?

To eliminate disk reading/writing should I simply read everything at once and then sort it later and the same for writing but reversed, as opposed to reading/writing it into/from one variable at a time? As for console writing, can I do something like with graphics and double buffering, where I write a lot of text to memory and then write it all to the screen less often, or maybe use a single cout statement instead of multiple ones?

I get using efficient sort algorithms for things like finding names in a list, but what about if you are drawing pixels to a buffer and want ones with certain values? Can you use algorithms for that, or do you have to pretty much brute force through them all and have if statements to check for when you do or don't draw them?

Thanks!
For memory allocation and deallocation, what exactly are the major pitfalls?

The biggest pitfall is too frequent allocation and deallocation of dynamic objects.
Allocation requires searching the heap for a free block of memory of the appropriate size.
Modern heap managers keep this to a minimum by keeping lists of blocks of various sizes. If there is a block available of an appropriate size, allocation is relatively quick. If the heap gets extremely fragmented, the search for a block of the appropriate size can involve searching multiple lists and possibly splitting a larger block into smaller blocks. Deallocation can involve coalescing adjacent free blocks (if any).

With cache missing, can I avoid that by declaring large arrays as static and using my array loops with i then j, rather than j then i (array[i][j] form)?

Probably not. Declaring the block static will have no effect on cache misses. If you're referencing the entire array, the order of the indexes will probably not have any effect either. If you're referencing only part of the array, then keeping your references close together will help.

To eliminate disk reading/writing should I simply read everything at once and then sort it later and the same for writing but reversed, as opposed to reading/writing it into/from one variable at a time?

If you're reading an entire file, the total time is going to be the same one way or the other.
This is what I regard as a false optimization. You're not going to save any time overall.
The amount of time "reading" the file is tiny compared to the amount of time required to do the physical I/O.

maybe use a single cout statement instead of multiple ones?

You can get some minor performance improvement by using '\n' instead of std::endl. std::endl causes a call to cout.flush(). Again, the amount of time saved is insignificant.
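
For reference, a tiny illustration of the point:

#include <iostream>

int main()
{
    std::cout << "hello" << std::endl;  // writes '\n' and then flushes the stream
    std::cout << "hello" << '\n';       // writes '\n' only; flushing happens when the
                                        // buffer fills or the stream is destroyed
}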




You can get some minor performance improvement by using '\n' instead of std::endl. std::endl causes a call to cout.flush(). Again, the amount of time saved is insignificant.

It can make a big difference if you are writing to a file.
I'm surprised that using a const array lookup table would be slower than a math function:

maybe read this.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.396&rep=rep1&type=pdf

Think about what you actually would DO if you did this in C++.
you'll make a table of values, say every 1 degree as a simple case. That table exists in memory, so it's going to cost you to move it in and out if it's not in the CPU cache. It's going to cost you two pointer hits (get the left and right values around the answer you want, e.g. if you want cos(33.33) you need 33 and 34 from the table). Then it costs you a few math operations to do the interpolation. There are some register moves, FPU access/use, etc. in all that too. And a good implementation has to handle converting out-of-bounds angles back into the known range, e.g. if you pass 1000 degrees it converts that to the range 0-360 or 0-90 or whatever.

The CPU cos circuit does not have to deal with ANY of that.
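
To make that concrete, here is a sketch of what such a table might look like in C++. The names cos_table and cos_deg_lookup are made up, and a real implementation would tune the step size:

#include <cmath>
#include <cstdio>
#include <vector>

// Table of cos() at every whole degree, 0..360 inclusive.
static const std::vector<double> cos_table = [] {
    std::vector<double> t(361);
    for (int d = 0; d <= 360; ++d)
        t[d] = std::cos(d * 3.14159265358979323846 / 180.0);
    return t;
}();

double cos_deg_lookup(double degrees)
{
    degrees = std::fmod(degrees, 360.0);   // fold out-of-range angles back in
    if (degrees < 0) degrees += 360.0;
    int lo = static_cast<int>(degrees);    // two table hits...
    double frac = degrees - lo;
    return cos_table[lo] + frac * (cos_table[lo + 1] - cos_table[lo]);  // ...plus interpolation
}

int main()
{
    std::printf("%f vs %f\n", cos_deg_lookup(33.33),
                std::cos(33.33 * 3.14159265358979323846 / 180.0));
}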

-----------------
For memory allocation and deallocation, what exactly are the major pitfalls?
Doing it :)
these cost you a fair bit of time because they go through the heap manager and, ultimately, the operating system. If you can allocate once, use it for a long time, and deallocate, that's great. If you create an object, use it once, destroy it, over and over in a tight loop, it's incredibly slow (relative to doing the allocate/deallocate once).

-----------
cache management is not something you can always control. Large blocks of memory that you iterate somewhat randomly instead of sequentially will burn you. For example, in image processing, if you stepped down one pixel at a time, going row to row (vertically), while C++ stores the memory the other direction (row by row), that would be very expensive in cache misses on a big image. If you went across a row (horizontally) instead, it would not. Does this make sense? Something that simple can make a huge difference.
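
A small sketch of the two traversal orders being described, using a made-up image buffer:

#include <cstddef>

const int H = 4000, W = 4000;
static unsigned char image[H][W];   // ~16 MB, far bigger than most caches

int main()
{
    // Cache-friendly: walk each row left to right, the order the bytes sit in memory.
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            image[y][x] = 0;

    // Cache-hostile on a big image: walk down each column, jumping a whole row per step.
    for (int x = 0; x < W; ++x)
        for (int y = 0; y < H; ++y)
            image[y][x] = 0;
}
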
-------------------
As for large object copying, is it better to pass a const reference or a const pointer to avoid that?

There are any number of ways to avoid it, depending on the code. A reference or a pointer makes no difference in this scenario; prefer a reference, but the mechanics are identical, so it's just code readability, and pointers should be used only if nothing else works for the job at hand.

-------------
but what about if you are drawing pixels to a buffer and want ones with certain values?
It depends on how it is stored and what you can do. Any random image loaded from disk, find the pixels that are exactly some RGB value? Brute force is about your only option without additional storage. You can multi-thread the search, taking 1/2 or 1/4 in each thread, if it's huge. You can keep a greyscale copy of the image, search that 3 times faster for candidates, and then check the candidate locations against the real image. You can keep a 1/4 or 1/8 or whatever sized copy and search that for boxes to search inside in the real image, stuff like that. How applicable these kinds of ideas are depends on what you are doing. Also, it's the old time-space tradeoff: the more memory/space you are willing to spend, the more ways you can look at the data, which means one of them may be faster than the others. You can even build a map of the colors (this color is at these locations, for each distinct color). That last one is expensive, but if you are going to search for a bunch of colors, paying once to build it and getting the answer nearly instantly after that for each query may be the best overall solution. Etc.
some_register = my_int * cos(some_value);  // moved outside the loop by the compiler
for ( int i = 0; i < 1000; ++i )
  my_array[i] = some_register;

Unless the compiler is aware of what cos() does, I doubt that it can do this optimization. Consider this:

file1.cpp:
int val;
int f() {
    return ++val;
};


file2.cpp:
for ( int i = 0; i < 1000; ++i )
  my_array[i] = f();


This can't be optimized to:
int temp = f();
for ( int i = 0; i < 1000; ++i )
  my_array[i] = temp;
because f() has a visible side effect. And unless the compiler can see f() while compiling file2.cpp, it must assume that f() has side effects so it must make the calls.
I think the compiler is aware of what cos() does, on the machines where it's a CPU instruction. On chips that lack the instruction, it should still know, since it's provided by the language/library... but whether they added optimization to take advantage of that knowledge is the question (answer: probably yes). It's less capable with user-defined f()s.
closed account (Ezyq4iN6)
@AbstractionAnon

I'll try to minimize my dynamic memory allocation then. I guess it would be a smart idea for me to replace my vectors with arrays that have enough elements to handle any normal amount of values placed in it.

Usually I read all of my arrays at one time (drawing pixels to the screen, etc.), but I suppose there are times when I may just need part of one. I had read somewhere that reading row-by-row was faster than column-by-column, but I guess not.

Well that's good to know about the files. Instead of one read statement into a char array, now I can read the file once for each variable and it will make it easier to work with.

I have been using \n for a bit instead of endl, but when would it be necessary to use cout.flush()?



@Peter87

Do you mean using \n instead of endl when writing a file? I have only used \n when writing files, so I didn't know you could use endl doing that.



@jonnin

I'll have to read over that PDF. Thanks for the link.

I see your point about all of the additional calculations, but if lookup tables have all that extra cost, then why were they ever used? Also, do most modern CPUs have the cos capability?

So as far as allocating and deallocating, it's better to minimize dynamic objects? I'm guessing it would be far better to create something like an array instead of a vector, and to assign enough elements to that array that it would cover any amount of elements your vector would have used, right? In other words, you reserve more memory, but then you don't have to waste time chasing it later? I usually make the mistake of declaring objects, structs and variables in the scope of a function instead of the main function (is this dynamic?). Would it be better to define all of these objects in main at the beginning?

As far as the image reading, I should basically read like [0][0], [0][1], ... [0][n] instead of [0][0], [1][0], ... [n][0], right? I tend to do my image arrays as the following:

for (int i = 0; i < width; i++)
{
  for (int j = 0; j < height; j++)
    draw_pixel(pixel[i][j]);
}


I'll stick with references then. I often see people using pointers when creating an object, or a pointer to that object. See code:

// How I make an object.
MyObject object;

// What others usually do.
MyObject *object;


They then pass the pointer, whereas I would pass a reference to my object. What's the difference, and why do I see that pointer format so much? Is their way much better and can I still pass it by reference?

Is it worth me adding multi-threading to my loops, and is it hard to implement? Also, as I understand it each core has 2 threads, correct? What if I have 4 cores and I make my code to use 8 threads (can I use them all?) and then someone who only has 2 cores runs the program? Will it only use 4 threads, will it fail, or something else?

So a grayscale image is faster for searching? Would I create one of those by taking the average of the RGB values and then writing that value to all RGB for that pixel? Also, how can I use the grayscale to search for colors easily?

Sorry, you lost me after the grayscale part. What do you mean that I could create a map of colors? Would that be like a 2D vector where every color is represented in the first element and the next elements are the x,y points where that color can be found in the image?
I'll try to minimize my dynamic memory allocation then. I guess it would be a smart idea for me to replace my vectors with arrays that have enough elements to handle any normal amount of values placed in it.

std::vector is generally preferred over arrays. See vector.reserve() for a way to pre-declare the expected size of a vector and prevent repeated reallocation.
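
A quick sketch of the difference reserve() makes; the element count is arbitrary:

#include <cstddef>
#include <vector>

int main()
{
    const std::size_t n = 1000000;

    std::vector<int> a;            // grows as needed: several reallocations and copies
    for (std::size_t i = 0; i < n; ++i)
        a.push_back(static_cast<int>(i));

    std::vector<int> b;
    b.reserve(n);                  // one allocation up front; size() stays 0 until push_back
    for (std::size_t i = 0; i < n; ++i)
        b.push_back(static_cast<int>(i));
}
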
I have been using \n for a bit instead of endl, but when would it be necessary to use cout.flush()?

You want to call stream.flush() when you want to ensure the data from the buffer is written to the disk. stream.flush() gets called automatically when the stream buffer is full and of course when the file is closed.
I usually make the mistake of declaring objects, structs and variables in the scope of a function instead of the main function (is this dynamic?).

No, that's not dynamic allocation. Dynamic allocation is when you use new. It is best to define objects as locally as possible. If an object has the same lifetime as your program, then it's best to define it in main.
why do I see that pointer format so much? Is their way much better and can I still pass it by reference?

Your method of defining an object is preferred if there is no requirement for a pointer. Polymorphism requires the use of a pointer. Pointers (i.e. new) are also used when the lifetime of an object doesn't line up with the function where it is created.
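
If it helps, a tiny sketch of the polymorphism case; the Shape/Circle names are made up for illustration:

#include <iostream>
#include <memory>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Circle : Shape {
    double r;
    explicit Circle(double radius) : r(radius) {}
    double area() const override { return 3.14159265358979 * r * r; }
};

int main()
{
    // A pointer to the base class is what lets the virtual call dispatch to Circle.
    std::unique_ptr<Shape> s = std::make_unique<Circle>(2.0);
    std::cout << s->area() << '\n';
}
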
They then pass the pointer, whereas I would pass a reference to my object. What's the difference, and why do I see that pointer format so much? Is their way much better and can I still pass it by reference?

The compiler generates the same code whether you're passing by pointer or by reference so there is no difference in performance. Yes, you can pass a dynamic object by reference. Simply dereference the pointer. i.e. myfunc (*pobj); where myfunc is expecting an object by reference.





closed account (Ezyq4iN6)
@AbstractionAnon

I didn't know about the reserve method for vectors, but it sounds useful so I'll have to start using that.

So is stream.flush() often called manually? I have written files for class that were several megabytes, but never used flush(). What is a scenario where a person would actually need to call it when writing?

Thanks for clearing that up on dynamic memory. I guess it is basically a way of creating a scope for an object or variable that is user controlled rather than by a function, object, etc.

I never thought about it being related to polymorphism like that, but that makes sense.

I thought a dynamic object was global until 'delete' was used. Would there not be another way you could access it rather than passing it?
I see your point about all of the additional calculations, but if lookup tables have all that extra cost, then why were they ever used? Also, do most modern CPUs have the cos capability?

-- the very first CPUs I worked on had a cos. They had this goofy thing called microcode for any lookup tables they needed way back when, but I don't know that trig functions did it that way, given that they are easy to compute. I think most CPUs have trig instructions, but I am not 100% sure on that. Everything I use does. Lookup tables are for things that are harder to compute, where the cost is justified, or in your own high-level code, where they can save a ton. Very few things have a CPU instruction: trig, roots/exponents/logs, and a couple of others are about all you get. Anything else you do in your high-level code, and for that, lookup tables start to look pretty. Let's take an example... factorials are used in a lot of approximations. You can write a function that computes a factorial, or you can just make an array of the values. The CPU cannot compute a factorial, so the table is faster than a loop to multiply up the values... that is a good place to pre-compute something. fact[value] is pretty darn fast; recursive_fact(value) is relatively much slower.
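
A sketch of that factorial table; 20! is the largest factorial that fits in a 64-bit unsigned integer, and recursive_fact is the "compute it each time" version for comparison:

#include <cstdint>
#include <iostream>

// Precomputed 0! .. 20!
static const std::uint64_t fact[21] = {
    1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800,
    39916800, 479001600, 6227020800ULL, 87178291200ULL, 1307674368000ULL,
    20922789888000ULL, 355687428096000ULL, 6402373705728000ULL,
    121645100408832000ULL, 2432902008176640000ULL
};

std::uint64_t recursive_fact(std::uint64_t n)
{
    return n < 2 ? 1 : n * recursive_fact(n - 1);
}

int main()
{
    std::cout << fact[12] << ' ' << recursive_fact(12) << '\n';  // 479001600 479001600
}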

---------
again, I can't say it more plainly: use dynamic memory. You have to; string, vector, all that good stuff use it for you (hidden, but it's there). Just avoid doing it in a loop, or calling a function that does it over and over in a loop, in the places where you care about speed; find a better way there. If you don't care about speed, let it go and just code.

local variables, arrays, classes/structs all go on the stack unless their constructors use dynamic memory or they contain classes that do so (string, vector, anything stl really, etc). You can't avoid it. You should NOT try to avoid it. Just try to avoid this:
for (long long i = 0; i < 1000000000000000; i++)
{
    vector<classtype> v;        // constructed inside the loop
    v.push_back(something);     // memory grab
}                               // memory release

in favor of this:

vector<classtype> v(1);         // memory grab, once
for (long long i = 0; i < 1000000000000000; i++)
{
    v[0] = something;           // reuse the memory
}
-----------------
Pointers used as references and actual references are very similar. The differences are subtle -- do not worry about them right now.

I don't know where you are seeing all the pointers, but they should not be used unless necessary. Use a reference where you can. Some class designs require a pointer. Some code does as well; you cannot change what a reference refers to on the fly, so a pointer into a vector that changes which item it points at is not as easily done with a reference, unless there is a trick I missed.
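
A small illustration of that reseating difference (the variable names are arbitrary):

#include <vector>

int main()
{
    std::vector<int> v { 10, 20, 30 };

    int* current = &v[0];   // a pointer can be pointed at a different element later
    current = &v[2];        // fine: now refers to v[2]

    int& ref = v[0];        // a reference is bound once, at initialization
    ref = v[2];             // this does NOT rebind ref; it assigns v[2]'s value into v[0]
}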

multithreading is not that hard, but designing it and getting it right is tricky. It's a linear speedup for most code: if you have 4 CPUs and 4 threads, it runs nearly 4 times faster. The baseline/standard CPU has 4-6 cores now, and soon will be 8.
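
A minimal sketch of the "split the search in half" idea from earlier, using std::thread. The count_range helper and the data values are made up for the example; real code would likely size the split from std::thread::hardware_concurrency(), and with g++ you build with -pthread:

#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// Count how many elements in [begin, end) equal target; the result goes through *out.
void count_range(const std::vector<int>& data, std::size_t begin, std::size_t end,
                 int target, std::size_t* out)
{
    std::size_t n = 0;
    for (std::size_t i = begin; i < end; ++i)
        if (data[i] == target) ++n;
    *out = n;
}

int main()
{
    std::vector<int> data(10000000, 0);
    data[1234567] = 42;

    std::size_t count_lo = 0, count_hi = 0;
    std::size_t mid = data.size() / 2;

    std::thread t1(count_range, std::cref(data), std::size_t{0}, mid, 42, &count_lo);
    std::thread t2(count_range, std::cref(data), mid, data.size(), 42, &count_hi);
    t1.join();
    t2.join();

    std::cout << "found " << (count_lo + count_hi) << " match(es)\n";
}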

a greyscale image is 1/3 the size, so it is faster to iterate via brute force. Yes, it is the average of the RGB. Let's see... say you wanted to find color 1/2/3 and had a greyscale image. The average of 1/2/3 is 6/3, which is 2. Look at the pixel in the greyscale image. Is it 2? Yes? Then look at the same pixel in the color image too. Is it 1/2/3? Found it! If it were 3/2/1, you didn't find it; it has the same average, so you get some false alarms, but depending on the data this may give you a massive speedup. Computing the greyscale costs, so again, looking for 1 color once is not so good, but looking for 50 colors back to back is a big savings.
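
Roughly, in code (the Pixel struct and flat pixel vector are stand-ins for whatever image format you actually use):

#include <cstddef>
#include <vector>

struct Pixel { unsigned char r, g, b; };

// Build the 1/3-size greyscale copy once.
std::vector<unsigned char> make_grey(const std::vector<Pixel>& img)
{
    std::vector<unsigned char> grey(img.size());
    for (std::size_t i = 0; i < img.size(); ++i)
        grey[i] = static_cast<unsigned char>((img[i].r + img[i].g + img[i].b) / 3);
    return grey;
}

// Find every index whose pixel matches target, screening with the grey copy first.
std::vector<std::size_t> find_color(const std::vector<Pixel>& img,
                                    const std::vector<unsigned char>& grey, Pixel target)
{
    const unsigned char target_grey =
        static_cast<unsigned char>((target.r + target.g + target.b) / 3);
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < grey.size(); ++i)
        if (grey[i] == target_grey &&                        // cheap screen (false alarms possible)
            img[i].r == target.r && img[i].g == target.g && img[i].b == target.b)
            hits.push_back(i);                               // confirmed against the real image
    return hits;
}

int main()
{
    std::vector<Pixel> img(1920 * 1080, Pixel{0, 0, 0});
    img[5000] = Pixel{1, 2, 3};
    auto grey = make_grey(img);
    auto hits = find_color(img, grey, Pixel{1, 2, 3});
    return hits.empty() ? 1 : 0;
}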

The scaled images are the same idea. If you look at the image scaled to 1/4 the size, find a 'similar' color to what you seek, then look at the box that represents that pixel back in the original and scan for it. Here again, it works on some kinds of images for some kinds of searches.

Yes, a color map would be a 2-D vector of colors with a list of coordinates that have that color. It's very, very, very fast to use once you've generated it. It costs a bit to create. Which STL container is best for that... not sure. You could also do buckets of colors that are 'similar' (hmm, what if you had 255 buckets of the possible greyscale values...)... see how you can mix and match these ideas to your needs? No one generic solution is best; it depends on what you want to do, how often, how long you have to do it, how much memory you are willing to waste... etc.
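
And a rough sketch of that color map idea, again with made-up types and std::map chosen just as one possible container; it costs a pass over the image to build, but each query afterwards is a single lookup:

#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct Pixel { unsigned char r, g, b; };

// Pack RGB into one integer key so it can index a std::map.
std::uint32_t key(Pixel p)
{
    return (std::uint32_t(p.r) << 16) | (std::uint32_t(p.g) << 8) | p.b;
}

int main()
{
    std::vector<Pixel> img(1920 * 1080, Pixel{0, 0, 0});
    img[4242] = Pixel{10, 20, 30};

    // Pay once: every distinct color maps to the list of indices where it occurs.
    std::map<std::uint32_t, std::vector<std::size_t>> locations;
    for (std::size_t i = 0; i < img.size(); ++i)
        locations[key(img[i])].push_back(i);

    // Then each color query is a single lookup.
    auto it = locations.find(key(Pixel{10, 20, 30}));
    return (it == locations.end() || it->second.empty()) ? 1 : 0;
}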



