Hello, My friend and I are working on the k arm bandit problem using a upper confidence bound however we don't think the program is properly computing the upper confidence bound. Here is how we compute it:
double upperbound = sqrt((log((double)totalpulls) / slots[i]->numpulled) * min);
totalpulls is how many slots have been tried so far and numpulled is how many times a specific arm has been pulled. The problem we have is that regardless of the winning probability of each arm the upper confidence bounds all tend be the same and the program hardly doesn't exploit at all, just explores. Here is the method for determining which arm we pull:
double max = 0.0;
index = 0;
for(int i = 0; i < size; i++){
double mean = (double) slots[i]->totalpayout / slots[i]->numpulled;
double Vj = ((1.0 / slots[i]->numpulled) * slots[i]->totalpayout) - (mean * mean)
+ sqrt(2 * log((double)totalpulls) / slots[i]->numpulled);
double min = 1.0 / 4;
if (Vj < min) {
min = Vj;
}
double upperbound = sqrt((log((double)totalpulls) / slots[i]->numpulled) * min);
if(upperbound > max){
max = upperbound;
index = i;
}
}
slot = index;
}
The program simply pulls the arms in order so if there were 5 arms to pull it would pull them in order so the output would look like:
pulled 0, pulled 1, pulled 2, pulled 3, pulled 4, pulled 0 etc.
The program should instead be going with the arm that gives it the highest payout. Any assistance anyone could give towards helping understand whats wrong with our program would be greatly appreciated.
If anyone would want to see the original equation it is on page 245 from this paper
http://www.springerlink.com/content/l7v1647363415h1t/
(this is link to where you can download the pdf).