
Question

I have an intuitive question about the convergence of Q-learning. In Q-learning, at each step a Q-value is learned for the state-action pair, where the action is selected according to the ε-greedy policy determined by the current Q-values.

Now, my question is: due to the ε-greedy policy (i.e. due to exploration), is it not possible that the Q-values oscillate and never converge? Each time I give a non-greedy action a chance, its value improves and may, after some steps, become higher than the value of the greedy action; it then becomes the greedy action itself, and the same thing continues. Is choosing a small ε enough to prevent this?

Explanation / Answer

The convergence proof for Q-learning requires that:

[...] all state-action pairs be visited infinitely often

So the convergence is understood in a mathematical sense: the Q-function converges to its fixed point as the number of steps approaches infinity.
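To make that concrete, here is the standard tabular update and the fixed point it targets, written in textbook notation rather than anything specific to this thread (α is the learning rate, γ the discount factor, and (s, a, r, s') the observed transition):

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

$$Q^*(s,a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s',a') \mid s, a \right]$$

Note that the update target always uses the greedy max over a', not the ε-greedy action that will actually be taken next. Exploration therefore only affects which entries get updated and how often, not the values they are driven towards.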

In practice, even after your Q-function is good enough to induce an optimal policy, you will observe "unnecessary" updates each time a non-greedy action is chosen. To avoid that, you can set the learning rate to 0 once the solution has reached an acceptable quality.

Also, your intuition about tuning the ε parameter of the ε-greedy policy is right. In a non-stochastic environment, setting ε to 0 has practically the same effect as setting the learning rate to 0 in the situation described above.
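For completeness, a minimal sketch of how both suggestions could look in code. Everything here is an illustrative assumption rather than something from the original answer: the environment `env` (with `n_states`, `n_actions`, `reset()`, and `step(action)` returning `(next_state, reward, done)`) is hypothetical, and the hyperparameters and the `freeze_after` threshold are arbitrary.

    import numpy as np

    def q_learning(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1,
                   freeze_after=4000):
        """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
        Q = np.zeros((env.n_states, env.n_actions))
        for episode in range(episodes):
            # Once the solution is judged good enough, freeze both knobs:
            # alpha = 0 stops all updates, epsilon = 0 stops exploration.
            if episode >= freeze_after:
                alpha, epsilon = 0.0, 0.0
            state = env.reset()
            done = False
            while not done:
                # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
                if np.random.rand() < epsilon:
                    action = np.random.randint(env.n_actions)
                else:
                    action = int(np.argmax(Q[state]))
                next_state, reward, done = env.step(action)
                # The update target uses the greedy max over the next state's actions,
                # independent of the action epsilon-greedy will actually pick there.
                target = reward + gamma * np.max(Q[next_state]) * (not done)
                Q[state, action] += alpha * (target - Q[state, action])
                state = next_state
        return Q

The hard cut at `freeze_after` mirrors the suggestion above; in a stochastic environment one would typically decay the learning rate gradually instead of cutting it to 0 at a fixed episode.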