I am doing very large data manipulations (looking at Bayesian networks, trying to produce a program that will be self-learning for as large a network as possible). My background is logic and I am only beginning to program for this specific task. Where possible, I have used the 'short' type, figuring that keeping memory usage to a minimum would be essential to getting the most out of the program. As I learn more about computers, it seems that 32-bit systems pick up 32 bits in one go and hence (I think) could process an int as easily as a short. There would still be a difference in total memory usage of course, but is this important?
There would still be a difference in total memory usage of course, but is this important?
That is really the question that needs to be answered. If you really are working with very large sets of data, then it could indeed be an issue. Keep in mind that 32-bit systems are limited to a total of 4 GB of addressable memory; with all the subsystems allocating memory space, that typically leaves you with a maximum of around 3 GB of usable system memory (possibly less if you have a high-end video card with lots of video memory). The OS itself will use up some of that system memory, and Windows will give any one application no more than 2 GB of address space to work with.
So, what is the largest 'chunk' of data that you'd need to work with directly at any given time? For your Bayesian network, determine what data points need to be stored in your nodes and your edges. Then calculate the memory requirements for those data as both 16-bit and 32-bit types. Then multiply those requirements by the total number of nodes and edges your network needs to work with, and you roughly have your memory requirements for 16-bit vs. 32-bit types.
Note that, unless your nodes and/or edges are going to contain more than several hundred data points each, your network would need to contain over a million nodes and edges to get into the 1 GB+ storage-requirement range.
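To put rough numbers on that back-of-envelope calculation, here is a minimal sketch; the field counts and element counts are made-up placeholders, so plug in the figures from your own node/edge design:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical figures -- substitute the field counts from your own design.
    const double fieldsPerNode = 8;        // e.g. a key, a state count, a few indices
    const double fieldsPerEdge = 2;        // e.g. parent and child keys
    const double nodeCount     = 1.0e6;
    const double edgeCount     = 4.0e6;

    const double totalFields = fieldsPerNode * nodeCount + fieldsPerEdge * edgeCount;

    // Total storage if every field is a 16-bit short versus a 32-bit int.
    const double mb16 = totalFields * sizeof(std::int16_t) / (1024.0 * 1024.0);
    const double mb32 = totalFields * sizeof(std::int32_t) / (1024.0 * 1024.0);

    std::printf("16-bit fields: %.1f MB\n", mb16);   // ~30.5 MB with these numbers
    std::printf("32-bit fields: %.1f MB\n", mb32);   // ~61.0 MB with these numbers
    return 0;
}
```

With those example figures you're nowhere near the 2 GB per-process ceiling; the 16-bit/32-bit choice only starts to matter for capacity once the per-element field counts or the element counts grow by a couple of orders of magnitude.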
The largest chunk of data I need to work with is the power set of any given node's parents. But these are probability values that, to resemble real numbers, are stored as doubles (and quick calculations here tell me to worry about this in any densely connected network of reasonable size!). The shorts are the values/names of the nodes and serve only as probability-table keys.
As a follow-up, now with the concern about doubles: would it be prohibitive in terms of processing power to store multiple sets of 10,000+ values as shorts (say between 0 and 10,000) and divide by 10,000 to get a probability number? I am thinking that this would entail a massive number of calls to a division operation when processing the probabilities.
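Something like this is what I have in mind (a rough sketch only; the names and the table are made up for illustration):

```cpp
#include <cstdint>
#include <vector>

// Sketch of the idea: probabilities stored as shorts in [0, 10000],
// converted back to a double in [0.0, 1.0] on demand.
inline std::uint16_t encodeProb(double p)        { return static_cast<std::uint16_t>(p * 10000.0 + 0.5); }
inline double        decodeProb(std::uint16_t q) { return q / 10000.0; }

int main() {
    std::vector<std::uint16_t> table;      // one row of a hypothetical probability table
    table.push_back(encodeProb(0.4375));   // stored as 4375: 2 bytes instead of 8
    double p = decodeProb(table[0]);       // back to 0.4375, at a resolution of 1/10,000
    return static_cast<int>(p);            // 0; just here so p isn't unused
}
```

The obvious cost besides the extra divides is that the resolution is capped at 1/10,000.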
Also - does performance deteriorate markedly as the free parts of the 4 GB get used up, or is it to all intents and purposes an all-or-nothing event?
Also - does performance deteriorate markedly as the free parts of the 4 GB get used up, or is it to all intents and purposes an all-or-nothing event?
Do you mean: your application's 2 GB + Windows' ~1 GB + whatever other variables etc. taking up the remaining 0-1 GB?
Typically OSes write to the hard drive as this starts to fill up (up to about another 2 GB of "swap space"), which is when you start seeing increasingly slower system performance (RAM > HDD). This differs from OS to OS, but you shouldn't assume your user has 4 GB of actual RAM unless you know that for a fact.
Generally yes, anything working over 1 GB would start to see the system slow down somewhat.
edit: For example, I have 3 GB of RAM. On a Unix system I will set my swap space to about 1 GB, to make up the rest of what a 32-bit system can use. On Windows this is done automatically; I'm not sure how well it does it, but it generally works on the same principle and tries to mimic the 4 GB maximum.
Does anyone have anything wise to say about the trade-off between using types (short or int) which take up less memory and then using functions to turn them into the applicable real-number approximations, compared with storing the numbers as, say, doubles in the first place?
Sorry to go OT. Just remembered that in Windows it's referred to as virtual memory, and on Unix as swap space, so if you search help for "change the size of virtual memory" you get these directions:
1. Click to open System.
2. In the left pane, click Advanced system settings.
3. Under the Advanced tab, under Performance, click Settings.
4. Click the Advanced tab, and then, under Virtual memory, click Change.
5. Clear the "Automatically manage paging file size for all drives" check box.
6. Under Drive [Volume Label], click the drive that contains the paging file you want to change.
7. Click Custom size, type a new size in MB in the Initial size (MB) or Maximum size (MB) box, click Set, then OK.
Generally not needed unless you change (remove some of) your RAM after install. I'm pretty sure (I'm guessing) that on install the OS checks the RAM and the paging file is set automagically on that basis. It could be that it resizes itself when needed too. I'm not 100% sure, but going by how reliable MS is, I would say it probably is not automagical after the initial install.
edit: mine is set to 3 GB for some silly reason... go figure... apparently 3 GB + 3 GB = 4 GB :-/
edit2: at least my laptop is somewhat closer: 2 GB RAM + 2345 MB virtual memory (auto-set by Windows)
Here's a good way to think of it:
Accessing CPU cache is like taking a sheet of paper from your desk
Accessing RAM is like taking a book from the shelf across the room
Accessing the disk is like walking all the way down the road and carrying a crate back with you.
Does anyone have anything wise to say about the trade-off between using types (short or int) which take up less memory and then using functions to turn them into the applicable real-number approximations, compared with storing the numbers as, say, doubles in the first place?
Short integers on a standard x86 are 16 bits; you can use the register keyword and they'll probably get stored temporarily in AX or somewhere ("probably" because register is a hint, not a command), but using the standard int tends to be fastest. The x86 CPU deals with 32 bits at a time, so using a 32-bit variable will tend to be faster.
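If you want to see the sizes your own compiler/target actually uses, a quick check along these lines works; the figures in the comment are typical for 32-bit x86, not guarantees:

```cpp
#include <cstdio>

int main() {
    // On a typical 32-bit x86 build: short is 2 bytes, int is 4 bytes, double is 8 bytes.
    // The standard only guarantees minimum sizes, so it's worth checking your own target.
    std::printf("sizeof(short)  = %u\n", (unsigned)sizeof(short));
    std::printf("sizeof(int)    = %u\n", (unsigned)sizeof(int));
    std::printf("sizeof(double) = %u\n", (unsigned)sizeof(double));
    return 0;
}
```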
edit: For example, I have 3 GB of RAM. On a Unix system I will set my swap space to about 1 GB, to make up the rest of what a 32-bit system can use. On Windows this is done automatically; I'm not sure how well it does it, but it generally works on the same principle and tries to mimic the 4 GB maximum.
Does that mean that I, having a 64-bit system with 6 GiB RAM, should attempt to set about 1.8 × 10^19 bytes of swap space (that's just under 16 exbibytes)?
I have no swap space. I deleted it (it was really big; I made a mistake with my binary multiples and set 13 GiB instead of 1.3 GiB) and couldn't be bothered to recreate it.
As a follow-up, now with the concern about doubles: would it be prohibitive in terms of processing power to store multiple sets of 10,000+ values as shorts (say between 0 and 10,000) and divide by 10,000 to get a probability number? I am thinking that this would entail a massive number of calls to a division operation when processing the probabilities.
That will really depend on what else you are doing with those sets of data. The more complicated the full calculation is, the less significant an additional divide will be.
For example, if you are performing something like Fourier transforms on the dataset, then adding in another divide will have a relatively small impact on the overall processing time.
If, on the other hand, you are just generating a histogram and/or a standard deviation, the divides become a more significant portion of the overall work.
Your real bottleneck is likely going to be cache misses caused by your graph structure. Your average multi-GHz entry-level CPU today has no problem doing 10k or even 100k divides in the blink of an eye, IF you can keep its cache full of data to divide. That won't be easy with a graph structure. If you had a linear array of 100,000 values that you wanted to divide by some constant, the computer would crunch through it in a flash. If those 100,000 values are scattered randomly through your memory space, the CPU is going to stall a lot as it continually fetches individual values from all those locations, and it will take a lot longer to crunch through them.
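To make the contiguous-versus-scattered point concrete, here's the shape of the two cases (a toy sketch with made-up types, not code from your project):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Contiguous layout: the values for a probability table sit in one flat array.
// The CPU can stream these through its cache with very few misses.
double sumContiguous(const std::vector<std::uint16_t>& probs) {
    double total = 0.0;
    for (std::size_t i = 0; i < probs.size(); ++i)
        total += probs[i] / 10000.0;
    return total;
}

// Scattered layout: each value lives in a separately allocated node reached
// through a pointer, so nearly every access is a potential cache miss.
struct Node {                 // hypothetical node type, just for illustration
    std::uint16_t prob;
    Node*         next;       // imagine these strewn all over the heap
};

double sumScattered(const Node* head) {
    double total = 0.0;
    for (const Node* n = head; n != nullptr; n = n->next)
        total += n->prob / 10000.0;
    return total;
}
```

The arithmetic is identical in both functions; the difference in running time comes almost entirely from where the data lives in memory.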