
I am reading Reinforcement Learning: An Introduction by Sutton and Barto. Several of their graphs plot either average reward or % optimal action against the number of steps for an $n$-armed bandit problem. I don't understand why these graphs are so noisy. Why should the average reward over 2000 different trials differ so much from one step to the next? Would the curves be smoother with more trials?

Procedure: They ran 2000 10-armed bandit tasks. The action values, $q(a)$, were drawn from an $N(0,1)$ distribution, and the reward at the $t$th time step was drawn from an $N(q(a),1)$ distribution, where $a$ is the action selected at that step.
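To make the procedure concrete, here is a minimal sketch of how I understand the experiment, written so that the number of trials can be varied. The action-selection rule is my assumption (epsilon-greedy with sample-average estimates), since the procedure above only fixes how $q(a)$ and the rewards are drawn. The names `average_reward_curve` and `roughness`, and all default parameter values, are hypothetical choices for illustration and are not taken from the book.

```python
import random
import statistics


def average_reward_curve(n_runs, n_steps=200, n_arms=10, epsilon=0.1, seed=0):
    """Average reward at each time step over n_runs independent bandit tasks."""
    rng = random.Random(seed)
    totals = [0.0] * n_steps
    for _ in range(n_runs):
        # True action values for this task: q(a) ~ N(0, 1).
        q_true = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]
        q_est = [0.0] * n_arms   # sample-average estimates of q(a)
        counts = [0] * n_arms    # how often each arm has been pulled
        for t in range(n_steps):
            # ASSUMPTION: epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = rng.randrange(n_arms)
            else:
                a = max(range(n_arms), key=q_est.__getitem__)
            # Reward at this step: N(q(a), 1) for the chosen arm.
            reward = rng.gauss(q_true[a], 1.0)
            counts[a] += 1
            q_est[a] += (reward - q_est[a]) / counts[a]
            totals[t] += reward
    return [total / n_runs for total in totals]


def roughness(curve, tail=100):
    """Standard deviation of step-to-step changes over the last `tail` steps."""
    last = curve[-tail:]
    diffs = [b - a for a, b in zip(last, last[1:])]
    return statistics.pstdev(diffs)
```

Comparing `roughness(average_reward_curve(50))` with `roughness(average_reward_curve(800))` gives a rough measure of how the step-to-step jitter changes with the number of trials. Since each point on the curve is a mean over independent runs, I would expect its standard error to shrink roughly like $1/\sqrt{\text{n\_runs}}$, but I have not checked this against the book's figures.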


  • Please add the graphs of your results so we can discuss them. Take a picture and attach it to your question. – BarzanHayati Sep 03 '19 at 17:30
  • As you say, the action values are random, and that level of randomness is fixed, meaning no amount of trials will completely remove the randomness (it's not e.g. Bayesian). – user3658307 Sep 04 '19 at 03:50
