Adaptive moving average Q-learning
Tan, Tao1; Xie, Hong1; Xia, Yunni2; Shi, Xiaoyu3; Shang, Mingsheng3
2024-08-12
Abstract: A variety of algorithms have been proposed to address the long-standing overestimation bias problem of Q-learning. Reducing this overestimation bias may introduce an underestimation bias, as in double Q-learning. However, it remains unclear how to strike a good balance between overestimation and underestimation. We present a simple yet effective algorithm, called Moving Average Q-learning, to fill this gap. Specifically, we maintain two dependent Q-estimators. The first is used to estimate the maximum expected Q-value; the second is used to select the optimal action. In particular, the second estimator is the moving average of the historical Q-values generated by the first estimator. The second estimator has only one hyperparameter, namely the moving average parameter, which controls the dependence between the two estimators, ranging from independent to identical. Based on Moving Average Q-learning, we design an adaptive strategy for selecting the moving average parameter, resulting in AdaMA (Adaptive Moving Average) Q-learning. This adaptive strategy is a simple function under which the moving average parameter increases monotonically with the number of visits to a state-action pair. Moreover, we extend AdaMA Q-learning to AdaMA DQN for high-dimensional environments. Extensive experimental results reveal why Moving Average Q-learning and AdaMA Q-learning mitigate the overestimation bias, and show that AdaMA Q-learning and AdaMA DQN substantially outperform SOTA baselines. In particular, compared with the overestimation of 1.66 in Q-learning, AdaMA Q-learning underestimates by only 0.196, an improvement of 88.19%.
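Since the abstract specifies the update scheme only at a high level, the following is a minimal tabular sketch of one Moving Average Q-learning step with the adaptive parameter. The direction of the moving average, the schedule beta = n / (n + c), and all names introduced here (adama_q_update, visits, c) are illustrative assumptions drawn from the abstract's description, not the paper's exact formulation.

```python
import numpy as np

def adama_q_update(Q, Q_ma, visits, s, a, r, s_next,
                   alpha=0.1, gamma=0.99, c=10.0):
    """One sketched AdaMA Q-learning step on tabular estimators.

    Q      -- first estimator (evaluates the maximum expected Q-value)
    Q_ma   -- second estimator (moving average of Q; selects actions)
    visits -- per-(state, action) visit counts driving the adaptive schedule
    """
    visits[s, a] += 1
    # Adaptive moving-average parameter: increases monotonically with the
    # visit count of (s, a); the specific form n / (n + c) is an assumption.
    beta = visits[s, a] / (visits[s, a] + c)

    # The second estimator selects the greedy action,
    # the first estimator evaluates it (cf. double Q-learning).
    a_star = np.argmax(Q_ma[s_next])
    target = r + gamma * Q[s_next, a_star]

    # Standard TD update on the first estimator.
    Q[s, a] += alpha * (target - Q[s, a])

    # The second estimator tracks a moving average of the first:
    # beta -> 1 makes the two estimators identical (plain Q-learning),
    # beta -> 0 keeps them maximally decoupled.
    Q_ma[s, a] = beta * Q[s, a] + (1.0 - beta) * Q_ma[s, a]


# Usage on a toy 16-state, 4-action problem:
Q = np.zeros((16, 4))
Q_ma = np.zeros((16, 4))
visits = np.zeros((16, 4))
adama_q_update(Q, Q_ma, visits, s=0, a=1, r=1.0, s_next=3)
```

Under these assumptions the sketch matches the abstract's intuition: early on, low visit counts keep beta small and the action-selecting estimator loosely coupled to the value estimator, damping overestimation; as counts grow, beta approaches 1 and the update behaves like standard Q-learning.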
Keywords: Moving Average Q-learning; AdaMA Q-learning; AdaMA DQN
DOI: 10.1007/s10115-024-02190-8
Journal: KNOWLEDGE AND INFORMATION SYSTEMS
ISSN: 0219-1377
Pages: 29
Corresponding Author: Xie, Hong (xiehong2018@foxmail.com)
Indexed By: SCI
WOS ID: WOS:001289423300002
Language: English