Method | Berzerk | Boxing | Breakout | Crazy Climber | Montezuma Revenge | Pitfall | Private Eye | Riverraid | Skiing | Solaris | Video Pinball | Frostbite |
HUMAN | 2630.4 | 12.1 | 30.5 | 35829.4 | 4753.3 | 6463.7 | 69571.3 | 17118.0 | -4336.9 | 12326.7 | 17667.9 | 4202.80 |
DreamerV3 400K | 78 | 31 | 97190 | 882 | 909 | | | | | | | |
DreamerV3 200M | 1245 | 99 | 300 | 149986 | 2512 | -0 | 5538 | 15758 | -9623 | 2453 | 17416 | 19991 |
MEME @ 1B | 45729.94 ± 13228.29 | 99.86 ± 0.11 | 475.87 ± 53.73 | 291033.20 ± 5966.79 | 9429.20 ± 1485.32 | 7820.94 ± 16815.61 | 100775.10 ± 15.57 | 67631.40 ± 4517.53 | -3305.77 ± 8.09 | 28386.28 ± 2381.29 | 759284.69 ± 37920.13 | 498640.46 ± 38753.40 |
GDI-H3 (200M frames) | 14649 | 100 | 864 | 241170 | 2500 | -4.345 | 15100 | 28349 | -6025 | 9105 | 978190 | 11330 |
Agent57 | 61507.83 ± 26539.54 | 100.00 ± 0.00 | 790.40 ± 60.05 | 565909.85 ± 89183.85 | 9352.01 ± 2939.78 | 18756.01 ± 9783.91 | 79716.46 ± 29515.48 | 63318.67 ± 5659.55 | -4202.60 ± 607.85 | 44199.93 ± 8055.50 | 992340.74 ± 12867.87 | 4334.70 |
Become a Proficient Player with Limited Data through Watching Pure Videos | | | | | | | | | | | | |
EfficientZero | | | | | | | | | | | | |
MuZero | 85932.60 | 100.00 | 864.00 | 458315.40 | 0.00 | 0.00 | 15299.98 | 323417.18 | -29968.36 | 56.62 | 981791.88 | 631378.53 |
Go-Explore (domain knowledge) implementation | | | | | 666474 | 59494 | | | | | | |
Simulated Policy Learning (SimPLe) | | | | | | | | | | | | |
R2D3 | | | | | | | | | | | | |
R2D2 | 53318.7 | 98.5 | 837.7 | 366690.7 | 2061.3 | 0.0 | 5322.7 | 45632.1 | -30021.7 | 3787.2 | 999383.2 | 315456.4 |
RND implementation | | | | | 8152 | -3 | 8666 | | | 3282 | | |
APE-X | 57196.7 | 100.0 | 800.9 | 320426.0 | 2500.0 | -0.6 | 49.8 | 63864.4 | -10789.9 | 2892.9 | 546197.4 | 9328.6 |
IMPALA deep | 1852.7 | 100.0 | 787.3 | 136950.0 | 0.0 | -1.7 | 98.5 | 29608.0 | -10180.4 | 2365.0 | 572898.3 | |
DQN-PixelCNN | 15806.5 | 5501.5 | ||||||||||
NoisyNet-DuelingDQN | 1896 ± 604 | 100 ± 0 | 263 ± 20 | 171171 ± 2095 | 57 ± 15 | 0 ± 0 | 279 ± 109 | 23134 ± 1434 | -7550.0 | 6522 ± 750 | 870954 ± 135363 | |
ACKTR | 735.7 | 150444.0 | -1.1 | 17762.8 | 2368.6 | 100496.6 | ||||||
A3C/A2C | 496 ± 56 | 134783 ± 5495 | 14 ± 12 | 0 ± 0 | 3781 ± 2994 | 8135 ± 483 | -12972 ± 2846 | 12380 ± 519 | 229402 ± 153801 |
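Raw Atari scores like those above are usually compared via the human-normalized score, (agent − random) / (human − random), so that 0 corresponds to random play and 1 to the HUMAN row. A minimal sketch (the random-agent baseline used in the example is an illustrative assumption, not a value from this table):

```python
def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Human-normalized score: 0.0 = random play, 1.0 = human-level.

    Values above 1.0 mean the agent outperforms the human baseline.
    """
    if human == random:
        raise ValueError("human and random baselines must differ")
    return (agent - random) / (human - random)

# Boxing: HUMAN = 12.1 (from the table); the random baseline of 0.1 is
# an illustrative assumption. An agent scoring 100 is far superhuman:
print(round(human_normalized_score(100.0, 0.1, 12.1), 3))
```

Aggregate benchmark numbers (e.g. "median human-normalized score across 57 games") are built from exactly this per-game quantity.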
Task \ Method | DDPG | SAC | AWR | MEEE | TD3 | ADER |
Ant-v2 | 72 ± 1550 | 5909 ± 371 | 5067 ± 256 | |||
HalfCheetah-v2 | 10563 ± 382 | 9297 ± 1206 | 9136 ± 184 | |||
Hopper-v2 | 855 ± 282 | 2769 ± 552 | 3405 ± 121 | |||
Humanoid-v2 | 4382 ± 423 | 8048 ± 700 | 4996 ± 697 | |||
LunarLander-v2 | − | − | 229 ± 2 | |||
Walker2d-v2 | 401 ± 470 | 5805 ± 587 | 5813 ± 483 |
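The "mean ± spread" entries in the continuous-control table are typically the mean and standard deviation of final episode returns over several random seeds. A minimal sketch of how such a summary is computed (the per-seed returns below are made-up numbers, not results from any of the papers):

```python
from statistics import mean, stdev

def summarize_returns(returns):
    """Summarize per-seed final returns as (mean, sample standard deviation)."""
    return mean(returns), stdev(returns)

# Hypothetical final returns from five training seeds on one MuJoCo task:
returns = [3200, 3500, 3450, 3300, 3400]
m, s = summarize_returns(returns)
print(f"{m:.0f} ± {s:.0f}")  # prints 3370 ± 120
```

Note that papers differ in the details (number of seeds, evaluation episodes per seed, sample vs. population standard deviation), which is one reason the ± ranges across rows are not strictly comparable.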
- https://www.deepmind.com/tags/reinforcement-learni...
- OpenAI Baselines: ACKTR & A2C
- Proximal Policy Optimization (PPO)
- awesome-deep-reinforcement-learning
- Ray RLlib: A Framework for Distributed Reinforcement Learning
- DQNからRainbowまで 〜深層強化学習の最新動向〜 (From DQN to Rainbow: recent trends in deep reinforcement learning)
- Deep Reinforcement Learning Doesn't Work Yet
- ゼロからDeepまで学ぶ強化学習 (Reinforcement learning from zero to Deep)
- Preserving Outputs Precisely while Adaptively Rescaling Targets