Journal of Guangxi Normal University (Natural Science Edition), 2025, Vol. 43, Issue 6: 107-119. DOI: 10.16088/j.issn.1001-6600.2024122302

• Intelligence Information Processing •

Controlling Value Estimation Bias in Successor Features by Distributional Reinforcement Learning

LU Mengxiao1, ZHANG Yangchun1*, ZHANG Xiaofeng2

    1. School of Science, Shanghai University, Shanghai 200444, China;
    2. Newtouch Center for Mathematics, Shanghai University, Shanghai 200444, China
    Received: 2024-12-23; Revised: 2025-03-11; Published: 2025-11-19

Abstract: The framework of successor features (SFs) and generalized policy improvement (GPI) is recognized as a promising approach for achieving zero-shot transfer across tasks in reinforcement learning (RL). This paper investigates the underestimation phenomenon in SFs&GPI. First, it is observed that, during training, the estimated Q-value for a new task lies below the true Q-value. Then, to shed light on this issue, the expected gap between the estimated and true Q-values is analyzed theoretically and proven to be non-positive. Finally, concepts from distributional RL are integrated into SFs&GPI, yielding distributional successor features (DSFs) and distributional generalized policy improvement (DGPI), which effectively narrow the underestimation gap. Experimental results on MuJoCo show that the DSFs&DGPI-based algorithm reduces value estimation bias, enhances transfer potential, and improves transfer stability compared with the SFs&GPI-based approach.
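For readers who want a concrete picture of the action-selection rule the abstract refers to, the following is a minimal Python sketch of GPI over successor features and of a distributional variant in the spirit of DSFs&DGPI. It assumes discrete actions and precomputed successor features; the function names, array shapes, and the quantile-averaging step are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gpi_action(psi: np.ndarray, w: np.ndarray) -> int:
    """SFs&GPI sketch: psi[i, a] is the SF vector psi^{pi_i}(s, a) of
    source policy i for action a at the current state; w is the new
    task's reward-weight vector, so Q^{pi_i}(s, a) = psi[i, a] @ w.
    Act greedily with respect to the max over all source policies."""
    q = psi @ w                            # (n_policies, n_actions)
    return int(np.argmax(q.max(axis=0)))   # max over policies, argmax over actions

def dgpi_action(psi_quantiles: np.ndarray, w: np.ndarray) -> int:
    """DSFs&DGPI sketch (assumed form): psi_quantiles[i, a, k] is the
    k-th of K quantile estimates of the SF distribution. Projecting
    each quantile onto w and averaging the resulting return quantiles
    gives a distribution-aware Q estimate before the GPI max."""
    q_quantiles = psi_quantiles @ w        # (n_policies, n_actions, K)
    q = q_quantiles.mean(axis=-1)          # average over the quantile atoms
    return int(np.argmax(q.max(axis=0)))

# Toy usage: 3 source policies, 4 actions, 8-dimensional features, 5 quantiles.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
print(gpi_action(rng.normal(size=(3, 4, 8)), w))
print(dgpi_action(rng.normal(size=(3, 4, 5, 8)), w))
```

Averaging over quantile atoms before taking the max is one simple way a distributional critic can temper the bias of a point estimate; the paper's actual DSFs&DGPI construction should be consulted for the precise mechanism.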

Key words: distributional reinforcement learning, successor features, generalized policy improvement, estimation bias, underestimation bias

CLC Number:  TP18