Java...

Note that if the client JRE is running in 32-bit mode, it cannot take advantage of x86-64's extra registers (twice as many as x86, IIRC). That can have a serious impact on performance.
This whole exercise led me to believe the JVM is extremely important. I guess the word HotSpot means something. I will not be surprised if some commercial companies are selling proprietary JVMs that promise performance many times better than the default provided by Oracle.

Hence all Java program timings will be skewed, so asking which JVM was used for a benchmark becomes critical; otherwise the metrics taken are not representative at all.

I guess I learnt something new about Java behaviour again, or maybe this is old news and I just did not pay much attention to this issue.
sohguanh wrote:
You have also not explained why virtual calls in C++ are excruciatingly slow in comparison to Java :P

Perhaps because in my tests C++ is twice as fast as Java for virtual calls. I suspect you did not have any optimisations turned on in your C++ build? Also the VM, as we are discovering, makes a whole lot of difference.
The client VM is not optimised for heavy computational tasks; it is optimised for fast startup and good responsiveness in GUI applications. It performs only lightweight optimisations: not much PGO, no speculative optimisations, only very basic inlining. Its GC is also tuned by default for low pause times rather than for throughput. The usual performance difference between client and server is 1.5-3x. Since Java 6 update 24 there is a new option, -XX:+TieredCompilation, which brings the advantages of the client and server compilers together, but this option is present only in the server HotSpot.
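For reference, switching VMs is just a command-line flag (Bench here is a placeholder class name):

java -client Bench                          (fast startup, light optimisation)
java -server Bench                          (the optimising server compiler)
java -server -XX:+TieredCompilation Bench   (tiered mode, as described above)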

But that still doesn't explain the very poor performance of Java in Galik's tests. I'd expect the bigger problem is using 32-bit Java on a 64-bit system and running a GC-heavy test on a single-core processor (it is measuring GC time, not allocation time, because GC cannot be moved to a separate core). Getting good throughput and low pauses in a GCed application on a single core is quite tricky.

As soon as I install a recent version of 64-bit MinGW I'll post my comparison.

Anyway, choosing a good JVM for the task is important. Companies like Azul or Excelsior claim to deliver much better JVMs than Oracle's (but commercial ones).
Also the choice of hardware is very important. There is specially optimised hardware for running Java:
http://www.youtube.com/watch?v=5uljtqyBLxI
(Java on a 1000 Cores - Tales of Hardware / Software CoDesign)
GC time, not allocation time
Come on. Are you serious?
Yes, I'm serious. If GC is offloaded to a spare core, it costs almost nothing. By contrast, you cannot move the new/delete work in C++ to a spare core - these operations always run within the thread that called them. This is a huge advantage of GC - it makes better use of multicore machines.
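For what it's worth, which cores do the GC work is also a command-line choice in HotSpot (Bench is again a placeholder class name):

java -XX:+UseParallelGC Bench          (stop-the-world, but collects on several cores at once)
java -XX:+UseConcMarkSweepGC Bench     (does most of its marking concurrently with the application)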
rapidcoder wrote:
If GC is offloaded to a spare core, it costs almost nothing.

It does cost. It occupies a core that could be running your application.

rapidcoder wrote:
By contrast, you cannot move the new/delete work in C++ to a spare core

But it really doesn't matter which core the allocation runs on, it still occupies processor time that could otherwise be used by the program.

rapidcoder wrote:
This is a huge advantage of GC - it makes better use of multicore machines.

I don't see why. C++ can use multiple cores and gain advantages.

Furthermore, this would explain why your results for Java are so good: you are using two cores in your tests for Java and only one core for C++. I am using just one core for both.

It does cost. It occupies a core that could be running your application.


The problem is that most applications can't utilise even two cores; what about the 4 cores of my laptop and the 12 cores of my PC? We are soon facing the problem of having too many cores with nothing to do.


But it really doesn't matter which core the allocation runs on, it still occupies processor time that could otherwise be used by the program


It doesn't matter only if your program is 100% scalable, i.e. can keep all the cores at 100% utilisation. This is almost never the case.


I don't see why. C++ can use multiple cores and gain advantages.


Yes, it can, but it is very hard to get to 100% CPU utilisation. A centralised allocator is often one reason for that. Also, parallel programming in C++ is really hard.

To illustrate this, let's do a challenge: create the most efficient program to compute factorial(500000) on a 4-core machine, and let's compare the code. Yours in C++, mine in any other language I choose. We will also check whether CPU usage really is 100% in this task - I seriously doubt that (for C++, Java, and whatever else gets used).
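To make the challenge concrete, here is a minimal sketch of the shape I'd expect a multithreaded C++ entry to take (assumes C++11 <thread> and GMP; link with -lgmpxx -lgmp -pthread). It splits 1..n into contiguous per-thread ranges, which leaves the partial products unbalanced - a serious entry would use a product tree:

#include <gmpxx.h>
#include <cstdio>
#include <thread>
#include <vector>

// Multiply the integers in [lo, hi] into *out.
void range_product(unsigned long lo, unsigned long hi, mpz_class* out)
{
    mpz_class p = 1;
    for (unsigned long i = lo; i <= hi; ++i)
        p *= i;
    *out = p;
}

int main()
{
    const unsigned long n = 500000;   // the challenge value
    const unsigned nthreads = 4;      // assumed 4-core machine

    std::vector<mpz_class> part(nthreads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back(range_product,
                          t * n / nthreads + 1,     // range start
                          (t + 1) * n / nthreads,   // range end, inclusive
                          &part[t]);
    for (auto& th : pool)
        th.join();

    mpz_class result = 1;
    for (auto& p : part)
        result *= p;                  // final combine is serial

    std::printf("~%lu decimal digits\n",
                (unsigned long) mpz_sizeinbase(result.get_mpz_t(), 10));
}

Note that the last few big multiplications run on one core only - which is exactly why I doubt anyone will see a flat 100% CPU usage.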

BTW: I ran your unmodified memory benchmark (the first one, with arrays) in C++:

g++ (GCC) 4.4.7 20110510 (prerelease) [svn/rev.173611 - mingw-w64/oz]
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

C:\Users\...\Temp\Testy>g++ -O3 -m64 alloc.cpp

C:\Users\...\Temp\Testy>a.exe
Run time: 2.761

C:\Users\...\Temp\Testy>a.exe
Run time: 2.854

C:\Users\...\Temp\Testy>a.exe
Run time: 2.854

C:\Users\...\Temp\Testy>a.exe
Run time: 2.808


The Java version, modified with s/char/byte/, ran in 0.9 s (I already posted that). And keep in mind it zeroes the arrays, which the C++ version doesn't do.
If GC is offloaded to a spare core, it costs almost nothing.
Of course, because the thread that's doing the timing is a different one. So, is it alright if I write a benchmark like this?
clock_t t0 = clock();
for (int a = 0; a < N; a++)
    array[a] = new T;      // allocation is inside the timed region...
clock_t t1 = clock();
for (int a = 0; a < N; a++)
    delete array[a];       // ...but the deletes conveniently are not
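(The apples-to-apples version would, of course, keep the clock running until the memory is actually given back:)

clock_t t0 = clock();
for (int a = 0; a < N; a++)
    array[a] = new T;
for (int a = 0; a < N; a++)
    delete array[a];
clock_t t1 = clock();   // allocation and reclamation both land in the timed region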


it is very hard to get to 100% CPU utilisation
Not really. I've gotten up to 80% with heavy graphical I/O code.
Not really. I've gotten up to 80% with heavy graphical I/O code.


Up to 80%? Huh, so 2 cores of my PC would sit idle then... You wouldn't notice GC.

PS: I just checked the GC usage of that memory benchmark with JVisualVM: it consistently reports <1%.
So it uses 1.01 cores, and is 3 times faster than the C++ app using 1.00 core.
Huh, so 2 cores of my PC would sit idle then...
Sure, if you follow Oven Logic. Reality, however, is not a linear function.

create the most efficient program to compute factorial(500000) on a 4 core machine
I'll play only if there's no whining and complaining afterwards, like "oh, but mine does better on such and such scenario"; and if it can be on n threads instead, since I can only run 2 concurrent threads.
Why don't you make it the factorial of one million ... and say one month's time ... I wanna give it a go too.

[Edit]: Also, we need to set up the format of the output.
I propose that timing starts at the beginning of the computation and ends after printing it out in decimal.
ends after printing it out in decimal
I don't agree with that. You would only be measuring the speed of the output device.
I propose instead to measure the entire computation, plus the time it takes to free the resources used to perform it.
helios, setting up the format of the number is key. If I represent the number in base 2^32, converting it to decimal will not be a trivial operation. Conversely, if I store it in base 1 000 000 000, converting it to hexadecimal will be non-trivial.

The way we store the number is key to our considerations: otherwise, if you allow any form of notation, I am already done with computing 1 000 000! (I don't even need a computer for that - I already wrote it down!).

[Edit:] We can simply put the output into some standard string class in memory, and stop the clock when the string is computed. I agree the printing-to-the-screen itself shouldn't be timed.
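To make that concrete, a hypothetical sketch (the limb layout and to_decimal are made-up names, not from any posted code): with base-1 000 000 000 limbs, producing the decimal string is mere formatting, while base-2^32 limbs would need repeated long division by powers of ten:

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical layout: value = sum of limbs[i] * 1000000000^i, least significant limb first.
std::string to_decimal(const std::vector<std::uint32_t>& limbs)
{
    char buf[16];
    std::snprintf(buf, sizeof buf, "%u", (unsigned) limbs.back());   // top limb, no padding
    std::string s = buf;
    for (std::size_t i = limbs.size() - 1; i-- > 0; ) {
        std::snprintf(buf, sizeof buf, "%09u", (unsigned) limbs[i]); // 9 digits per lower limb
        s += buf;
    }
    return s;
}

// e.g. to_decimal({999999999u, 1u}) yields "1999999999"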
You can use whatever output format you want, as long as it's an exact representation of the number and it doesn't contain any implicit operations, except those of a positional system.
Are you going to try to roll your own bignums again, though?
Yes, but I will read up on how to do a fast Fourier transform, and try to implement it.
Hm. Well, suit yourself.

By the way, I got mine down to 10.5 s and averaging 95% CPU efficiency. I'm fairly sure I'd have to stop system processes to get a higher efficiency reading and that synchronization inside malloc() is not a bottleneck.

EDIT: Oh, I forgot to mention. That's not including the conversion to decimal in memory.
That is, 10.5 s for the factorial of one million (sounds like a great time!)? Also, did you use GMP, or did you roll your own bignums? And how many threads?
For half a million, using GMP, and 4 threads per core (8 total).