Java...

Well, I kinda doubt I can beat gmp too easily:

http://www.cims.nyu.edu/cgi-systems/info2html?(gmp)FFT%2520Multiplication

They use a combination of FFT, Karatsuba/Toom 3-way and high-school multiplication, depending on the sizes of the integers being multiplied, to achieve the fastest results. I read that some other libraries can do better here and there by employing some floating-point trickery (which works, I guess, because of how processors are built).
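To illustrate the size-based algorithm switching, here is a toy sketch (not GMP's actual code): schoolbook multiplication below a threshold, Karatsuba above it. The base-10 digit representation and the threshold value are made up for readability; GMP works on machine words and tunes its thresholds per CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Little-endian base-10 digit vectors: 1234 is stored as {4,3,2,1}.
using Digits = std::vector<int>;

// Propagate carries and trim leading zeros (all values are non-negative here).
void normalize(Digits &r){
    int carry = 0;
    for (size_t i = 0; i < r.size(); ++i){
        r[i] += carry;
        carry = r[i] / 10;
        r[i] -= carry * 10;
    }
    while (carry){ r.push_back(carry % 10); carry /= 10; }
    while (r.size() > 1 && r.back() == 0) r.pop_back();
}

// "High-school" O(n*m) multiplication.
Digits school_mul(const Digits &a, const Digits &b){
    Digits r(a.size() + b.size(), 0);
    for (size_t i = 0; i < a.size(); ++i)
        for (size_t j = 0; j < b.size(); ++j)
            r[i + j] += a[i] * b[j];
    normalize(r);
    return r;
}

Digits add(const Digits &a, const Digits &b){
    Digits r(std::max(a.size(), b.size()), 0);
    for (size_t i = 0; i < a.size(); ++i) r[i] += a[i];
    for (size_t i = 0; i < b.size(); ++i) r[i] += b[i];
    normalize(r);
    return r;
}

// r -= s, digit by digit; callers guarantee r >= s numerically.
void sub_in_place(Digits &r, const Digits &s){
    int borrow = 0;
    for (size_t i = 0; i < r.size(); ++i){
        r[i] -= borrow + (i < s.size() ? (int)s[i] : 0);
        borrow = r[i] < 0;
        if (borrow) r[i] += 10;
    }
    while (r.size() > 1 && r.back() == 0) r.pop_back();
}

const size_t KARATSUBA_THRESHOLD = 16;  // made-up value; real libraries tune this

Digits mul(const Digits &a, const Digits &b){
    if (std::min(a.size(), b.size()) < KARATSUBA_THRESHOLD)
        return school_mul(a, b);            // small operands: schoolbook wins
    // Split a = a1*10^m + a0 and b = b1*10^m + b0.
    size_t m = std::max(a.size(), b.size()) / 2;
    Digits a0(a.begin(), a.begin() + std::min(m, a.size()));
    Digits a1(a.begin() + std::min(m, a.size()), a.end());
    Digits b0(b.begin(), b.begin() + std::min(m, b.size()));
    Digits b1(b.begin() + std::min(m, b.size()), b.end());
    if (a1.empty()) a1.push_back(0);
    if (b1.empty()) b1.push_back(0);
    Digits z0 = mul(a0, b0);
    Digits z2 = mul(a1, b1);
    Digits z1 = mul(add(a0, a1), add(b0, b1)); // three recursive products
    sub_in_place(z1, z0);                      // instead of four:
    sub_in_place(z1, z2);                      // z1 = a0*b1 + a1*b0
    Digits r(a.size() + b.size() + 4, 0);      // z2*10^(2m) + z1*10^m + z0
    for (size_t i = 0; i < z0.size(); ++i) r[i] += z0[i];
    for (size_t i = 0; i < z1.size(); ++i) r[m + i] += z1[i];
    for (size_t i = 0; i < z2.size(); ++i) r[2 * m + i] += z2[i];
    normalize(r);
    return r;
}
```

The point of the threshold dispatch: Karatsuba's recursion overhead only pays off once operands are large enough, which is exactly why libraries like GMP keep several algorithms around.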
Alright, I'm done. Here's a VC++ 10.0 project file and executable if anyone wants to run it:
http://www.fileden.com/files/2008/6/23/1971683/gmp.7z

#include <iostream>
#include <string>
#include <ctime>
#include <gmp.h>
#include <boost/thread.hpp>

class parallel_multiplier{
	unsigned id;
	boost::thread *thread;
	void thread_f();
public:
	static unsigned next_id;
	mpz_t integer;
	parallel_multiplier();
	~parallel_multiplier();
	void start(){
		this->thread=new boost::thread(&parallel_multiplier::thread_f,this);
	}
	void end(){
		this->thread->join();
	}
};

unsigned parallel_multiplier::next_id;
unsigned N;
unsigned thread_count;

parallel_multiplier::parallel_multiplier(){
	this->id=next_id++;
	mpz_init(this->integer);
	mpz_set_ui(this->integer,this->id);
	this->start();
}

parallel_multiplier::~parallel_multiplier(){
	if (this->thread)
		delete this->thread;
	mpz_clear(this->integer);
}

void parallel_multiplier::thread_f(){
	for (unsigned a=this->id+thread_count;a<=N;a+=thread_count)
		mpz_mul_ui(this->integer,this->integer,a);
}

void compute_factorial(bool output_result,unsigned n,unsigned threads_per_core){
	thread_count=boost::thread::hardware_concurrency()*threads_per_core;
	N=n;
	parallel_multiplier::next_id=2;
	clock_t t0,t1,t2,t3;
	t0=clock();
	parallel_multiplier *pm=new parallel_multiplier[thread_count];
	for (unsigned a=0;a<thread_count;a++)
		pm[a].end();
	for (unsigned a=1;a<thread_count;a++)
		mpz_mul(pm[0].integer,pm[0].integer,pm[a].integer);
	char *s;
	if (output_result){
		t1=clock();
		s=new char[mpz_sizeinbase(pm[0].integer,10)+2];
		mpz_get_str(s,10,pm[0].integer);
		t2=clock();
	}
	delete[] pm;
	t3=clock();
	if (output_result){
		std::cout <<s<<std::endl;
		delete[] s;
	}
	std::cout <<"Done in "<<double(t3-t0)/double(CLOCKS_PER_SEC)<<" s.\n";
	if (output_result)
		std::cout <<"Not including output time: "<<double(t1-t0+t3-t2)/double(CLOCKS_PER_SEC)<<" s.\n";
}

int main(){
	compute_factorial(true,500000,4);
	return 0;
}

I could have done a few things more nicely, like putting the globals in an object to make compute_factorial() reentrant. But I didn't want to.
I've already verified the output with a reference implementation, so no need to bother.
	for (unsigned a=this->id+thread_count;a<=N;a+=thread_count)
		mpz_mul_ui(this->integer,this->integer,a);

I understand this is the computation cycle. However, I don't get why you initialize with a=this->id+thread_count instead of a=this->id+1. Doesn't this mean you will miss multiplying by 1, 2, ..., thread_count?

I see, in the constructor you initialize mpz_t integer with the thread id. Where do you handle the cases when you compute less than 8!?
this->integer is initialized to this->id.
If I looped from this->id+1, this is what two threads would do:
thread 0: 2*3*5*7*...
thread 1: 3*4*6*8*...

EDIT:
Where do you handle the cases when you compute less than 8!?
I don't. The code has a very specific purpose, so I don't care what happens for small operands.
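The partition can be sanity-checked: each worker's accumulator starts at its own seed (ids start at 2, matching next_id = 2 in the code) and then strides by thread_count, so the seeds 2..thread_count+1 together hit every factor in [2, N] exactly once. A minimal sketch of that coverage argument, with a hypothetical helper name:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// The factors one worker touches: its seed (the accumulator's initial value)
// plus seed + thread_count, seed + 2*thread_count, ... up to N.
std::vector<unsigned> covered_factors(unsigned seed, unsigned thread_count, unsigned N){
    std::vector<unsigned> f{seed};
    for (unsigned a = seed + thread_count; a <= N; a += thread_count)
        f.push_back(a);
    return f;
}
```

Since the seeds 2..thread_count+1 cover every residue class modulo thread_count, the union of the workers' factor lists is exactly {2, ..., N}, each factor appearing once.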
Oh, come on, for a few nanoseconds...

[Edit:] Another algorithmic suggestion: sorting out the primes from 1 to 500 000, raising each to the power in which it is contained in 500 000!, and then multiplying everything gave a slight (but noticeable) improvement in my Python experiments (I couldn't judge exactly how much, because Python run times varied greatly depending on which algorithm I ran first).
[Edit:] And what is your time on 1 000 000!?
This works nicely:
thread_count=/*...*/;
if (n<=thread_count)
	thread_count=1;


the power in which it is contained in 500 000!
What does this mean?

couldn't judge well on exactly how, because python run times varied greatly depending on which algorithm I ran first
There might have been some memoization at work. Did you try breaking it up into different scripts and running python once for each?

And what is your time on 1000 000!?
40.5 seconds.
the power in which it is contained in 500 000!


I think he means "raise it to the highest power you can such that the result is < 500 000!"
More like
"raise it to the highest power you can such that the result is a divisor of 500 000!"

For a prime p, that is the number floor(500000/p) + floor(500000/p^2) + floor(500000/p^3) + ...
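That sum (Legendre's formula) can be computed without forming the powers p^k explicitly, by repeatedly dividing the running quotient by p. A small sketch:

```cpp
#include <cassert>

// Exponent of prime p in n!:  floor(n/p) + floor(n/p^2) + floor(n/p^3) + ...
// Dividing the running quotient by p produces the same terms without ever
// computing p^k itself, so the power cannot overflow.
unsigned prime_exponent_in_factorial(unsigned n, unsigned p){
    unsigned e = 0;
    while (n >= p){
        n /= p;     // n is now floor(original_n / p^k)
        e += n;
    }
    return e;
}
```

For example, the exponent of 2 in 10! is 5 + 2 + 1 = 8.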

[Edit:]For Python: I tried running in separate scripts, and again I got great variance. Anyway, I posted the two scripts in http://www.cplusplus.com/forum/lounge/48984/2/#msg266825 so you can test on your own system (I guess you have Python installed?)

Also, why not
if (n<thread_count)
	thread_count=n;
?
Well, now multiplying takes only 6.6 seconds, but factorizing everything takes 6 seconds, so it's actually taking a little longer.
Can you think of some way to speed this up?
void parallel_factorizer::factorize(unsigned n){
	for (size_t a=0;n>1 && a<primes.size();a++){
		unsigned m=primes[a];
		if (n==m){
			this->factorizations[a]++;
			break;
		}
		while (!(n%m)){
			this->factorizations[a]++;
			n/=m;
		}
	}
}


EDIT: I forgot.
if (n<thread_count) thread_count=n;
I don't want to try to figure out what happens with few threads and small values of n. It's easier to just set the count to 1.

EDIT 2: Got it down to less than 5 seconds:
#include <iostream>
#include <fstream>
#include <string>
#include <ctime>
#include <gmp.h>
#include <boost/thread.hpp>
#include <boost/shared_ptr.hpp>

const unsigned INTEGER_COUNT=10;

class parallel_multiplier{
	unsigned id;
	boost::shared_ptr<boost::thread> thread;
	void thread_f();
public:
	static unsigned next_id;
	mpz_t integer[INTEGER_COUNT];
	parallel_multiplier();
	~parallel_multiplier();
	void start(){
		this->thread.reset(new boost::thread(&parallel_multiplier::thread_f,this));
	}
	void end(){
		this->thread->join();
	}
};

unsigned parallel_multiplier::next_id;
unsigned N;
unsigned thread_count;

parallel_multiplier::parallel_multiplier(){
	this->id=next_id++;
	for (int a=0;a<INTEGER_COUNT;a++){
		mpz_init(this->integer[a]);
		mpz_set_ui(this->integer[a],1);
	}
	this->start();
}

parallel_multiplier::~parallel_multiplier(){
	this->thread.reset();
	for (int a=0;a<INTEGER_COUNT;a++)
		mpz_clear(this->integer[a]);
}

void parallel_multiplier::thread_f(){
	for (unsigned a=this->id,b=0;a<=N;a+=thread_count,b=(b+1)%INTEGER_COUNT)
		mpz_mul_ui(this->integer[b],this->integer[b],a);
	for (int a=1;a<INTEGER_COUNT;a++)
		mpz_mul(this->integer[0],this->integer[0],this->integer[a]);
}

void compute_factorial(bool output_result,unsigned n){
	thread_count=boost::thread::hardware_concurrency();
	if (n<=thread_count)
		thread_count=1;
	N=n;
	parallel_multiplier::next_id=2;
	clock_t t0,t1,t2,t3;
	t0=clock();
	parallel_multiplier *pm=new parallel_multiplier[thread_count];
	for (unsigned a=0;a<thread_count;a++)
		pm[a].end();
	mpz_t *result;
	result=&pm[0].integer[0];
	for (unsigned a=1;a<thread_count;a++)
		mpz_mul(pm[0].integer[0],pm[0].integer[0],pm[a].integer[0]);
	char *s;
	if (output_result){
		t1=clock();
		s=new char[mpz_sizeinbase(*result,10)+2];
		mpz_get_str(s,10,*result);
		t2=clock();
	}
	delete[] pm;
	t3=clock();
	if (output_result){
		std::ofstream output("output.txt");
		output <<s;
		delete[] s;
	}
	std::cout <<"Done in "<<double(t3-t0)/double(CLOCKS_PER_SEC)<<" s.\n";
	if (output_result)
		std::cout <<"Not including output time: "<<double(t1-t0+t3-t2)/double(CLOCKS_PER_SEC)<<" s.\n";
}

int main(){
	compute_factorial(true,500000);
	return 0;
}
closed account (DSLq5Di1)
Some good answers regarding unmanaged vs managed performance here:
http://stackoverflow.com/questions/145110/c-performance-vs-java-c
[Edit:]@helios So what did you do to make the computation twice as fast? I don't see any significant difference between the last version and the previous one.

The time to figure out all the primes and their exponents should be ultra-quick. I get all primes smaller than 1 000 000, and the powers in which they participate in 1 000 000!, in less than 0.2 seconds (1 core, 1.4 GHz). I commented out all lines that were specific to my system.

Sorry for not posting a modification of your code, but I simply can't get boost to run. After half an hour of fighting I managed to set up gmp. However, after an hour of fighting with boost and googling, I still can't get it to run. I guess I will need another 2-3 years of C++ experience to be able to set up boost...

This is one serious, objective critique of C++, however: installing stuff is a great pain in the butt. Instead of copying the .cpp and .h files into a folder, typing the proper includes, and not worrying about anything, I have to take care of details such as .so folder locations, etc.

On the same note: where are the boost lib file locations on Ubuntu, and does anyone know how to install boost on Ubuntu?

[edit:] Here are the two versions: with and without separating primes. Haven't worked on making it parallel (can do with pthreads.h).

Run time with treating each prime separately: 192 seconds.
Run time with the straightforward implementation: 215 seconds.
//implementation raising each prime to power and then multiplying
#include<iostream>
#include<vector>
#include<sys/time.h>
#include "gmp.h"

void GetAllPrimesAndPowersInXFactorial
  (unsigned int X, std::vector<unsigned int>& outputPrimes, std::vector<unsigned int>& outputPrimePowers)
{ std::vector<int> theSieve;
  theSieve.resize(X+1);
  for (unsigned int i=0; i<X+1; i++)
    theSieve[i]=1;
  outputPrimes.resize(0);
  outputPrimes.reserve(X/2);
  for (unsigned int i=2; i<=X; i++)
    if (theSieve[i]!=0)
    { outputPrimes.push_back(i);
      for (unsigned int j=i; j<=X; j+=i)
        theSieve[j]=0;
    }
  outputPrimePowers.resize(outputPrimes.size());
  for (unsigned int i=0; i<outputPrimes.size(); i++)
  { outputPrimePowers[i]=0;
    unsigned int currentPower=outputPrimes[i];
    do
    { outputPrimePowers[i]+=X/currentPower;
      currentPower*=outputPrimes[i];
    }
    while (currentPower<=X);
  }
}

timeval ComputationStartGlobal, LastMeasureOfCurrentTime;
double GetElapsedTimeInSeconds()
{ gettimeofday(&LastMeasureOfCurrentTime, NULL);
  int miliSeconds =(LastMeasureOfCurrentTime.tv_sec- ComputationStartGlobal.tv_sec)*1000+
(LastMeasureOfCurrentTime.tv_usec- ComputationStartGlobal.tv_usec)/1000;
  return ((double) miliSeconds)/1000;
}

int main()
{ std::vector <unsigned int> thePrimes, thePrimePowers;
  gettimeofday(&ComputationStartGlobal, NULL);
  double y=GetElapsedTimeInSeconds();
  unsigned int x=500000;
  GetAllPrimesAndPowersInXFactorial(x, thePrimes, thePrimePowers);
  mpz_t Accumulator, buffer1, buffer2;
  mpz_init(buffer1);
  mpz_init(buffer2);
  mpz_init(Accumulator);
  mpz_t* xFactorial=&buffer1;
  mpz_t* xFactorialBuffer=&buffer2;
  mpz_set_ui(*xFactorial, 1);
//to get this to run in parallel one must parallelize the following cycle
  for (unsigned int i=0; i<thePrimes.size(); i++)
  { mpz_ui_pow_ui(Accumulator, thePrimes[i], thePrimePowers[i]);
    mpz_mul(*xFactorialBuffer, Accumulator, *xFactorial);
    std::swap(xFactorialBuffer, xFactorial);
  }
  char *s;
  s=new char[mpz_sizeinbase(*xFactorial,10)+2];
  mpz_get_str(s, 10, *xFactorial);
  std::cout << "time to compute x!: " << GetElapsedTimeInSeconds()-y;
  //std::cout <<"\nx! equals: " << s;
  delete [] s;
  std::cin >> x;
  return 0;
}


//straightforward implementation
#include<iostream>
#include<sys/time.h>
#include "gmp.h"

timeval ComputationStartGlobal, LastMeasureOfCurrentTime;
double GetElapsedTimeInSeconds()
{ gettimeofday(&LastMeasureOfCurrentTime, NULL);
  int miliSeconds =(LastMeasureOfCurrentTime.tv_sec- ComputationStartGlobal.tv_sec)*1000+
(LastMeasureOfCurrentTime.tv_usec- ComputationStartGlobal.tv_usec)/1000;
  return ((double) miliSeconds)/1000;
}

int main()
{ //std::vector <unsigned int> thePrimes, thePrimePowers;
  gettimeofday(&ComputationStartGlobal, NULL);
  double y=GetElapsedTimeInSeconds();
  unsigned int x=500000;
  mpz_t Accumulator, buffer1, buffer2;
  mpz_init(buffer1);
  mpz_init(buffer2);
  mpz_init(Accumulator);
  mpz_t* xFactorial=&buffer1;
  mpz_t* xFactorialBuffer=&buffer2;
  mpz_set_ui(*xFactorial, 1);
  for (unsigned int i=1; i<=x; i++)
  { mpz_mul_ui(*xFactorialBuffer, *xFactorial, i);
    std::swap(xFactorialBuffer, xFactorial);
  }
  char *s;
  s=new char[mpz_sizeinbase(*xFactorial,10)+2];
  mpz_get_str(s, 10, *xFactorial);
  std::cout << "time to compute x!: " << GetElapsedTimeInSeconds()-y;
  //std::cout <<"\nx! equals: " << s;
  std::cout << "\n\n\n waiting! hopefully!";
  delete [] s;
  std::cin >> x;
  return 0;
}
Ok, I must admit parallelism in Scala rocks, but BigInteger in Java sucks. It seems multiplication has O(n^2) complexity. 90% of the time is eaten by the last multiplication of two big integers, which of course cannot be easily parallelised without rolling your own multiplication. So I have to look for a better bignum library.

BTW: I get the first million primes in 0.079 s, but I haven't optimised or parallelised that - it is not a bottleneck.

------------------------
Added later:

Here is my parallel code.
It differs from non-parallel code with only one method call.

import org.jscience.mathematics.number.LargeInteger

def sieve(n: Int) = {
  val primes = new collection.mutable.BitSet() + 2 ++ (3 to (n, 2))  
  primes.view.map({ p => primes --= (p + p) to (n, p); p }).force
}

def multiplicity(n: Int, p: Int) = {
  var q = n
  var m = 0
  while (q >= p) {
    q /= p
    m += q
  }
  m	
}

def multiply(n: Int, x: Iterable[Int]) = 
  x.map(p => LargeInteger.valueOf(p).pow(multiplicity(n, p)))
		
def factorial(n: Int) =        
  (multiply(n, sieve(n))).toSeq.par.reduce(_ times _)

def measureTime(code: => Unit) { 
  val start = System.currentTimeMillis
  code
  val end = System.currentTimeMillis
  println((end - start) / 1000.0) 
}


Timings on Core i5 2.6 GHz:

scala> measureTime(factorial(100000))
0.477

scala> measureTime(factorial(500000))
10.037

scala> measureTime(factorial(1000000))
65.666


The multiplication is implemented using the Karatsuba algorithm, so it could be better if I found a better library, e.g. one with the Toom-Cook algorithm. The CPU utilisation stays at about 85-98% while running this (both cores of the i5).

BTW: Helios, what hardware do you run your tests on?

BTW2: Seems that prime factorization does little good.
Another version of the complete code:

import org.jscience.mathematics.number.LargeInteger

def factorial2(n: Int) =
   (1 to n).map(LargeInteger.valueOf(_: Int)).par.reduce(_ times _) 

def measureTime(code: => Unit) { 
  val start = System.currentTimeMillis
  code
  val end = System.currentTimeMillis
  println((end - start) / 1000.0) 
}


And timings:
scala> measureTime(factorial2(100000))
0.534

scala> measureTime(factorial2(500000))
10.268

scala> measureTime(factorial2(1000000))
44.375


The second (simpler) version seems to use my CPU more efficiently, especially for the 1000000! case - it doesn't drop below 95%, and most of the time stays at 100%.
[Edit:]@rapidcoder: do you print the factorial? The code below strongly suggests that might be the bottleneck...
Hahaha, look what I just found with the built-in gmp factorial computation:

#include<iostream>
#include<sys/time.h>
#include "gmp.h"
timeval ComputationStartGlobal, LastMeasureOfCurrentTime;
double GetElapsedTimeInSeconds()
{ gettimeofday(&LastMeasureOfCurrentTime, NULL);
  int miliSeconds =(LastMeasureOfCurrentTime.tv_sec- ComputationStartGlobal.tv_sec)*1000+
(LastMeasureOfCurrentTime.tv_usec- ComputationStartGlobal.tv_usec)/1000;
  return ((double) miliSeconds)/1000;
}

int main()
{ gettimeofday(&ComputationStartGlobal, NULL);
  double y=GetElapsedTimeInSeconds();
  unsigned int x=500000;
  mpz_t xFactorial;
  mpz_init(xFactorial);
  mpz_fac_ui(xFactorial, x);
  double timeToComputeXF=GetElapsedTimeInSeconds()-y;
  y=GetElapsedTimeInSeconds();
  char *s;
  s=new char[mpz_sizeinbase(xFactorial,10)+2];
  mpz_get_str(s, 10, xFactorial);
  double timeToPrintOutXFinMmemory=GetElapsedTimeInSeconds()-y;
  std::cin >> x;
  std::cout <<"\nx! equals: " << s;
  std::cout << "\ntime to print out x! (in memory): " << timeToPrintOutXFinMmemory;
  std::cout << "\ntime to compute x!: " << timeToComputeXF;
  delete [] s;
  std::cin>> x;
  return 0;
}

Machine: 1.4Ghz single core processor.
Output:
time to print out x!(in memory): 7.574
time to compute x!: 1.539


For one million factorial:
Output:
time to print out x!(in memory): 20.538
time to compute x!: 3.867

So my gut feeling that converting the number to a string is the real bottleneck was right...
Yes, printing out is slow. I measured only computing. The bottleneck in computing is the multiplication. For 1000000! Karatsuba is way too slow.
I'm running a Core 2 Duo E6300 @ 1.86 GHz.

So what did you do to speed up the computation twice? I don't see any significant difference between the last and the previous version.
I reduced the number of threads from 4 to 1 per core and increased the number of accumulators each thread keeps from 1 to 10. In theory, this reduces the amount of context switching needed, while also allowing the input to be partitioned into more pieces.
When the operands are bignums, computing n! as one sequential product actually requires more operations than computing a and b separately and then multiplying them, where a and b are, respectively, the products of all even and all odd integers in [1, n]. Partitioning further is even faster, but there's a point of diminishing returns.
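The even/odd split is one instance of a more general trick: computing n! as a balanced product tree, so that multiplications get operands of similar size (which is where Karatsuba/FFT multiplication pays off). A sketch of the recursion shape, with unsigned long long standing in for mpz_t, so it only demonstrates the structure and n must stay at most 20 here to avoid overflow:

```cpp
#include <cassert>

// Product of the integers in [lo, hi], splitting the range in half.
// With bignums, the sequential loop keeps multiplying a huge accumulator
// by a tiny factor; the tree instead multiplies similarly-sized operands.
unsigned long long range_product(unsigned lo, unsigned hi){
    if (lo > hi) return 1;                 // empty range
    if (lo == hi) return lo;               // single factor
    unsigned mid = lo + (hi - lo) / 2;
    return range_product(lo, mid) * range_product(mid + 1, hi);
}

unsigned long long factorial(unsigned n){
    return range_product(1, n);
}
```

Swapping the unsigned long long arithmetic for mpz_mul calls gives the bignum version of the same idea.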
Well, anyway folks, as expected, the built-in gmp function beats all of us (or does it beat your multithreaded version, helios? Can you test?). Now here is a poke at Java: gmp is available natively in C (and also in C++), but is it in Java?
Hm... Well, now thanks to you, tition, it's taking only 3.5 seconds, but I seem to have screwed up somewhere, so it's not returning the same value anymore.

Now here is a poke at java: gmp is native in C (and also in C++), but is it in Java?


It is not, but if pressed hard, you could easily get to it through JNI, getting the best of both worlds - very fast multiplication from GMP and the clean, easy parallelism of Scala.

So we came to some conclusions:
1. Java BigInteger performance sucks.
2. Java arbitrary-precision libraries suck in general, albeit less than BigInteger.
3. Point 2 is probably caused by the fact that there are very few arbitrary-precision libraries for Java.
4. Parallel code utilising all CPU cores is IMHO much cleaner in Scala.
5. If you are doing heavy numerical stuff, it is very important to check the performance/quality of existing libraries.
6. GMP is hard to install and has a messy API (at least compared to BigInteger), but is awesomely fast. :)
Ah-hah! There was a bug in your factorization code, tition. It doesn't account for integer overflow.
#include <cmath>
#include <climits>
#include <vector>

const double log_of_2=0.69314718055994530941723212145818;
// renamed from log2 to avoid clashing with C++11's ::log2
inline double log2_of(double x){
	return log(x)/log_of_2;
}
const double max_bits=ceil(log2_of((double)UINT_MAX));

void factorize_primes(std::vector<unsigned> &factorizations,const std::vector<unsigned> &primes,unsigned n){
	for (size_t a=0;a<primes.size();a++){
		unsigned p=primes[a],
			power=p;
		double log2_of_p=log2_of(p),
			exponent=log2_of_p;
		do{
			factorizations[a]+=n/power;
			exponent+=log2_of_p;
			if (exponent>=max_bits)
				break;
			power*=p;
		}while (power<=n);
	}
}


7. JRE programmers will grab onto any excuses they can to justify why their code runs slower than native implementations.
*coolface*
So, I'm probably very wrong, but this particular operation we have here seems like it would be best executed with Lisp.