As the number of training increases, the total reward value increases from a negative value at the beginning to a positive value, and finally within a limited number of times, the system can achieve convergence under different user settings
With the increase of training times, the total reward value increases from the initial negative value to positive value. Finally, the system can converge under different user settings within a limited number of times
With the increase of training times, the total reward value increases from negative value to positive value at the beginning, and finally, within a limited number of times, the system can converge under different user settings.