Controlling Value Estimation Bias in Successor Features by Distributional Reinforcement Learning

doi:10.16088/j.issn.1001-6600.2024122302

Abstract

Abstract: The framework of successor features (SFs) and generalized policy improvement (GPI) has the potential to achieve zero-shot transfer across tasks in reinforcement learning (RL). This paper investigates the underestimation phenomenon in SFs&GPI. Empirically, the estimated Q-value for a new task is observed to fall below the true Q-value during training. Theoretically, the expected gap between the estimated and true Q-values is analyzed and proven to be non-positive. To address this issue, concepts from distributional RL are integrated into SFs&GPI, establishing distributional successor features (DSFs) and distributional generalized policy improvement (DGPI), which narrow the underestimation gap. Experimental results on MuJoCo environments show that, compared with the SFs&GPI-based algorithm, the DSFs&DGPI-based algorithm reduces value estimation bias, shows greater transfer potential, and transfers more stably.
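
As a rough illustration of the SFs&GPI mechanism the abstract refers to, the sketch below assumes the standard formulation in which rewards are linear in the features, so the Q-value of a stored policy on a new task with weight vector w is the inner product of that policy's successor features with w, and GPI acts greedily with respect to the maximum over stored policies. The function name `gpi_action`, the argument layout, and the toy numbers are hypothetical and are not taken from the paper; the distributional variants (DSFs, DGPI) are not shown.

```python
from typing import List, Sequence, Tuple

# psi[i][a] holds the successor-feature vector of stored policy i for action a
# at one fixed state (hypothetical layout chosen for this sketch).
SuccessorFeatures = Sequence[Sequence[Sequence[float]]]


def gpi_action(psi: SuccessorFeatures, w: Sequence[float]) -> Tuple[int, List[float]]:
    """Return the GPI action and the per-action GPI value estimates.

    For each action a, the estimate is max_i psi[i][a] . w, i.e. the best
    value any stored policy assigns to that action under task weights w.
    """
    n_actions = len(psi[0])
    q_gpi = []
    for a in range(n_actions):
        # Q-value of each stored policy for action a on the new task.
        q_per_policy = [
            sum(f * wi for f, wi in zip(policy_sf[a], w)) for policy_sf in psi
        ]
        q_gpi.append(max(q_per_policy))
    best_action = max(range(n_actions), key=lambda a: q_gpi[a])
    return best_action, q_gpi


# Toy example: two stored policies, two actions, two-dimensional features.
psi_example = [
    [[1.0, 0.0], [0.5, 0.5]],  # policy 0
    [[0.0, 1.0], [0.2, 0.9]],  # policy 1
]
w_new = [0.3, 0.7]  # weights of the (hypothetical) new task
action, q_values = gpi_action(psi_example, w_new)
print(action, q_values)
```

In this toy case the GPI estimate for each action comes from policy 1, and action 0 is selected. The per-action maximum is the quantity whose gap to the true Q-value of the resulting policy the paper analyzes.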

Key words: distributional reinforcement learning, successor features, generalized policy improvement, estimation bias, underestimation bias

CLC number: TP18

LU Mengxiao, ZHANG Yangchun, ZHANG Xiaofeng. Controlling Value Estimation Bias in Successor Features by Distributional Reinforcement Learning[J]. Journal of Guangxi Normal University (Natural Science Edition), 2025, 43(6): 107-119.
