Help with k arm bandit problem

Hello, My friend and I are working on the k arm bandit problem using a upper confidence bound however we don't think the program is properly computing the upper confidence bound. Here is how we compute it:

double upperbound = sqrt((log((double)totalpulls) / slots[i]->numpulled) * min);

totalpulls is how many slots have been tried so far and numpulled is how many times a specific arm has been pulled. The problem we have is that regardless of the winning probability of each arm the upper confidence bounds all tend be the same and the program hardly doesn't exploit at all, just explores. Here is the method for determining which arm we pull:

double max = 0.0;
index = 0;
for(int i = 0; i < size; i++){
double mean = (double) slots[i]->totalpayout / slots[i]->numpulled;
double Vj = ((1.0 / slots[i]->numpulled) * slots[i]->totalpayout) - (mean * mean)
+ sqrt(2 * log((double)totalpulls) / slots[i]->numpulled);
double min = 1.0 / 4;
if (Vj < min) {
min = Vj;
}
double upperbound = sqrt((log((double)totalpulls) / slots[i]->numpulled) * min);
if(upperbound > max){
max = upperbound;
index = i;
}
}
slot = index;
}

The program simply pulls the arms in order so if there were 5 arms to pull it would pull them in order so the output would look like:
pulled 0, pulled 1, pulled 2, pulled 3, pulled 4, pulled 0 etc.
The program should instead be going with the arm that gives it the highest payout. Any assistance anyone could give towards helping understand whats wrong with our program would be greatly appreciated.
If anyone would want to see the original equation it is on page 245 from this paper http://www.springerlink.com/content/l7v1647363415h1t/
(this is link to where you can download the pdf).
closed account (zwA4jE8b)
So what exactly is going on here, is the for loop 'pulling' each arm to find out which gives the biggest payout?

also just for ease of readability stuff like..

double min = 1.0 / 4; can be declared outside the loop, because it is constant. also you can declare your variables outside the loop so there are not being reinitialized every loop.

That might be one of your problems.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
double max = 0.0;
double min = 1.0 / 4;
double mean, Vj, upperbound;

index = 0;
	for(int i = 0; i < size; i++)
	{
		mean = (double) slots[i]->totalpayout / slots[i]->numpulled;
		
		Vj = ((1.0 / slots[i]->numpulled) * slots[i]->totalpayout) - (mean * mean)
			+ sqrt(2 * log((double)totalpulls) / slots[i]->numpulled);
		
		if (Vj < min)
			min = Vj;
		
		upperbound = sqrt((log((double)totalpulls) / slots[i]->numpulled) * min);
		
		if(upperbound > max)
		{
		max = upperbound;
		index = i;
		}
	}
slot = index;
}


Also, numpulled and totalpulls is not being updated, as far as I can see.

What does slots[i]->numpulled mean? I have not seen that syntax before, is slots a structure?
Last edited on
Thanks for the reply and advice. I moved the declarations out of the loop but unfortunately that didn't fix the problem.

To answer your questions about numslot and totalpulls, I didn't post all of our code, just the part where we calculate the upper confidence bound. So we do update numpulls and totalpulls. Also slots is an array of objects which have the elements totalpayout and numpulled. The "->" is to access that element of the object stored at index "i" in the array.

As for what is going on, this loop is supposed to find the slot machine that gives the best payout by using an upper confidence bound to determine which arm is the best to pull.
Topic archived. No new replies allowed.