Abstract:
A key challenge in reinforcement learning is how to guide agents to explore sparse-reward environments efficiently. To overcome this challenge, state-of-the-art methods introduce additional intrinsic rewards based on state-related information, such as the novelty of states. Unfortunately, these methods frequently fail in procedurally-generated tasks, where a different environment is generated in each episode, so the agent is unlikely to visit the same state more than once. Recently, some exploration methods designed specifically for procedurally-generated tasks have been proposed. However, they still consider only state-related information, which leads to relatively inefficient exploration. In this work, we propose a novel exploration method that utilizes cross-episode policy-related information and intra-episode state-related information to jointly encourage exploration in procedurally-generated tasks. In terms of policy-related information, we first use an imitator-based unbalanced policy diversity to measure the difference between the agent's current policy and its previous policies, and then encourage the agent to maximize this difference. In terms of state-related information, we encourage the agent to maximize the state diversity within an episode, thereby visiting as many different states as possible in an episode. We show that our method significantly improves sample efficiency over state-of-the-art methods on three challenging benchmarks: MiniGrid, MiniWorld, and the sparse-reward version of Procgen.
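To make the reward structure described in the abstract concrete, the following is a minimal, illustrative sketch, not the authors' implementation. It shows one way to combine an extrinsic reward with two intrinsic bonuses: a cross-episode policy-diversity term, approximated here by the negative log-probability that a toy imitator (fit to the agent's previous behavior) assigns to the current action, and an intra-episode state-diversity term that rewards first visits within an episode. All names (`Imitator`, `policy_diversity_bonus`, `state_diversity_bonus`, `beta_p`, `beta_s`) and the tabular setting are hypothetical assumptions for illustration only.

```python
# Illustrative sketch (hypothetical names; not the paper's implementation):
# extrinsic reward plus cross-episode policy-diversity and
# intra-episode state-diversity bonuses.
import numpy as np

class Imitator:
    """Toy tabular imitator of the agent's previous policies."""
    def __init__(self, n_states, n_actions):
        self.counts = np.ones((n_states, n_actions))  # Laplace smoothing

    def action_prob(self, state, action):
        row = self.counts[state]
        return row[action] / row.sum()

    def update(self, state, action):
        # Fit the imitator on behavior collected in past episodes.
        self.counts[state, action] += 1.0


def policy_diversity_bonus(imitator, state, action):
    # Large when the imitator (trained on previous policies) assigns low
    # probability to the current action, i.e. the current policy differs
    # from past policies at this state.
    return -np.log(imitator.action_prob(state, action))


def state_diversity_bonus(episode_states, state):
    # Episodic bonus: reward states not yet visited in the current episode,
    # encouraging the agent to cover as many distinct states as possible.
    return 1.0 if state not in episode_states else 0.0


def shaped_reward(r_ext, imitator, episode_states, state, action,
                  beta_p=0.1, beta_s=0.1):
    # Total reward = extrinsic reward + weighted intrinsic bonuses.
    r_int = (beta_p * policy_diversity_bonus(imitator, state, action)
             + beta_s * state_diversity_bonus(episode_states, state))
    episode_states.add(state)
    return r_ext + r_int


if __name__ == "__main__":
    imitator = Imitator(n_states=10, n_actions=4)
    episode_states = set()          # reset at the start of each episode
    print(shaped_reward(0.0, imitator, episode_states, state=3, action=2))
```

In this sketch, the coefficients `beta_p` and `beta_s` trade off the two bonuses against the extrinsic reward; the episodic state set is cleared every episode, while the imitator persists across episodes, mirroring the cross-episode versus intra-episode split described in the abstract.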
Source: IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS
ISSN: 2168-2216
Year: 2025
Impact Factor: 8.600 (JCR@2023)