Introduction to OpenAI Gym / State Spaces and Action Spaces

July 29, 2019, 20:51

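1. State spaces and action spaces

Each "environment" provided by OpenAI Gym has its own "input" and "output". The type of the input is called the "state space" (or "observation space"), and the type of the output is called the "action space".

The inputs and outputs of some representative environments are as follows.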

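◎CartPole-v1
The inputs and outputs of the pole-balancing game "CartPole" are as follows. The ranges are the bounds of the state space.

・Input
  ・Cart position (-4.8 to 4.8)
  ・Cart velocity (-Inf to Inf)
  ・Pole angle (-0.418 rad to 0.418 rad, about -24° to 24°)
  ・Pole angular velocity (-Inf to Inf)
・Output
  ・Action (0: move left, 1: move right)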
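In CartPole, the bounds reported by the state space are wider than the values at which an episode ends. The following is a rough sketch of where the numbers come from, assuming the termination thresholds used in Gym's CartPole implementation (cart position 2.4, pole angle 12°) and assuming the state-space bounds are set to twice those thresholds. The constant names are illustrative and are not Gym's API.

```python
import math

# Assumed termination thresholds (illustrative names, not Gym's API).
X_THRESHOLD = 2.4            # cart position at which an episode ends
THETA_THRESHOLD_DEG = 12.0   # pole angle in degrees at which an episode ends

# Assumption: the state-space bounds are twice the termination thresholds.
x_bound = 2 * X_THRESHOLD
theta_bound_rad = 2 * math.radians(THETA_THRESHOLD_DEG)
theta_bound_deg = math.degrees(theta_bound_rad)

print('cart position bound:', x_bound)
print('pole angle bound (rad):', round(theta_bound_rad, 4))
print('pole angle bound (deg):', round(theta_bound_deg, 1))
```

Under these assumptions the pole angle bound is about 0.419 rad, that is 24°, not 41.8°.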

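◎Breakout-v0
The inputs and outputs of the Atari block-breaking game "Breakout" are as follows.
The screen is 210 pixels high and 160 pixels wide in RGB color, so the shape of the input array is (210, 160, 3).

・Input
  ・210x160x3 screen image (0 to 255)
・Output
  ・Action (0: NOOP, 1: Fire, 2: Right, 3: Left)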
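As a small illustration of how these values might be handled in code, the sketch below names the four action indices and computes how many numbers a single frame contains. `BreakoutAction` and `FRAME_SHAPE` are hypothetical names introduced here for readability; they are not part of Gym.

```python
from enum import IntEnum


class BreakoutAction(IntEnum):
    """Hypothetical names for the four discrete action indices."""
    NOOP = 0
    FIRE = 1
    RIGHT = 2
    LEFT = 3


# One observation is a single RGB frame: height 210, width 160, 3 channels.
FRAME_SHAPE = (210, 160, 3)
values_per_frame = FRAME_SHAPE[0] * FRAME_SHAPE[1] * FRAME_SHAPE[2]

print(BreakoutAction(2).name)   # name of action index 2
print(values_per_frame)         # number of 0-255 values in one frame
```

Each frame therefore carries roughly a hundred thousand values, which is one reason image-based environments are usually learned with convolutional networks rather than small dense ones.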

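◎BipedalWalker-v2
The inputs and outputs of the 2D physics simulation "BipedalWalker" are as follows.
The output is the torque/velocity for each of the four joints. The input ranges listed here are nominal; the state space itself reports every bound as -inf to inf (see section 3).

・Input
  ・Body angle (0 to 2*π)
  ・Body angular velocity (-Inf to Inf)
  ・x velocity (-Inf to Inf)
  ・y velocity (-Inf to Inf)
  ・Hip joint 1 angle (-Inf to Inf)
  ・Hip joint 1 speed (-Inf to Inf)
  ・Knee joint 1 angle (-Inf to Inf)
  ・Knee joint 1 speed (-Inf to Inf)
  ・Leg 1 ground contact (0 to 1)
  ・Hip joint 2 angle (-Inf to Inf)
  ・Hip joint 2 speed (-Inf to Inf)
  ・Knee joint 2 angle (-Inf to Inf)
  ・Knee joint 2 speed (-Inf to Inf)
  ・Leg 2 ground contact (0 to 1)
  ・Lidar range measurement (-Inf to Inf) x 10
・Output
  ・Hip joint 1 (torque/velocity) (-1 to 1)
  ・Knee joint 1 (torque/velocity) (-1 to 1)
  ・Hip joint 2 (torque/velocity) (-1 to 1)
  ・Knee joint 2 (torque/velocity) (-1 to 1)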
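Because each of the four joint outputs is limited to the range -1 to 1, values produced by a policy usually need to be clamped before being passed to the environment. The following is a minimal sketch of that step, together with a count of the input elements. The joint labels and the `clip_action` helper are hypothetical and are not part of Gym.

```python
# Hypothetical labels for the four joint outputs, in the order listed above.
JOINTS = ['hip_1', 'knee_1', 'hip_2', 'knee_2']

# 14 scalar readings (body, two legs) plus 10 lidar measurements.
NUM_SCALAR_OBS = 14
NUM_LIDAR = 10
obs_size = NUM_SCALAR_OBS + NUM_LIDAR


def clip_action(raw, low=-1.0, high=1.0):
    """Clamp each joint value into [low, high]."""
    if len(raw) != len(JOINTS):
        raise ValueError('expected one value per joint')
    return [max(low, min(high, v)) for v in raw]


clipped = clip_action([1.7, -0.3, -2.0, 0.5])
print(obs_size)
print(dict(zip(JOINTS, clipped)))
```

The element count of 24 agrees with the `Box(24,)` state space reported for this environment.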

Other environments differ in still more ways.

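2. Types of state spaces and action spaces

OpenAI Gym supports six space types: Box, Discrete, MultiBinary, MultiDiscrete, Tuple, and Dict. This section covers the first four.
"Box" (continuous values) and "Discrete" (discrete values) are the most commonly used types. State spaces in particular are mostly "Box". For action spaces, "Discrete" is generally easier to learn with than "Box".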

◎Box
An n-dimensional array of floats holding continuous values in the range [low, high].

gym.spaces.Box(low=-100, high=100, shape=(2,))
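To make the meaning of such a space concrete, here is a pure-Python stand-in for a one-dimensional Box with `low=-100`, `high=100`, and `shape=(2,)`. It is only a sketch of the semantics and not Gym's implementation; Gym's own spaces expose `sample()` and `contains()` methods for the same purpose. The function names `box_sample` and `box_contains` are made up for this example.

```python
import random


def box_sample(low, high, shape):
    """Draw one float per element uniformly from [low, high] (1-D shapes only)."""
    (n,) = shape
    return [random.uniform(low, high) for _ in range(n)]


def box_contains(x, low, high, shape):
    """Return True if x has the right length and every element is within bounds."""
    (n,) = shape
    return len(x) == n and all(low <= v <= high for v in x)


point = box_sample(-100, 100, (2,))
print(point, box_contains(point, -100, 100, (2,)))
print(box_contains([0.0, 150.0], -100, 100, (2,)))  # 150.0 is out of range
```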

◎Discrete
A single integer holding a discrete value in the range [0, n-1].

gym.spaces.Discrete(4)

◎MultiBinary
A list of n binary flags (0 or 1), so that any combination of the n actions can be used at each step.

gym.spaces.MultiBinary(5)
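Since each of the 5 flags can be on or off independently, a space like this contains 2^5 = 32 distinct actions. The sketch below enumerates them with the standard library only; it does not use Gym itself.

```python
from itertools import product

N = 5  # number of binary flags, as in MultiBinary(5)

# Every combination of N independent on/off flags.
all_actions = list(product([0, 1], repeat=N))

print(len(all_actions))
print(all_actions[0])    # no action active
print(all_actions[-1])   # all actions active at once
```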

◎MultiDiscrete
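A list of discrete sets, where exactly one action from each set is used at each step. The argument gives the number of choices in each set, and the values in each set start at 0, so a range such as -10 to 10 is expressed as 21 choices and shifted by the caller.

gym.spaces.MultiDiscrete([21, 2])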
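The following sketch assumes the constructor form in which `MultiDiscrete` receives a list with the number of choices per set (often called `nvec`), with indices starting at 0. Under that assumption, two sets covering -10 to 10 and 0 to 1 would be described by the counts 21 and 2, and the first index has to be shifted by the caller. `OFFSET` and `decode` are hypothetical helpers, and the enumeration uses only the standard library.

```python
from itertools import product

NVEC = [21, 2]   # number of choices in each set; indices start at 0
OFFSET = -10     # hypothetical shift so that index 0..20 maps to -10..10

# One index is chosen from every set at each step.
all_actions = list(product(*(range(n) for n in NVEC)))


def decode(action):
    """Map raw indices to the intended values (first set shifted by OFFSET)."""
    first, second = action
    return first + OFFSET, second


print(len(all_actions))
print(decode(all_actions[0]), decode(all_actions[-1]))
```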

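3. Inspecting an environment's state space and action space

The following code checks which "state space" and "action space" an "environment" uses. For Box and Discrete spaces it also prints the minimum and maximum values.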

#!/usr/bin/env python
import gym
from gym.spaces import Box, Discrete

# Environment ID
ENV_ID = 'CartPole-v1'

def print_spaces(label, space):
    """Print a space and, for Box/Discrete, its minimum and maximum values."""
    print(label, space)

    # Box: the bounds are stored as arrays
    if isinstance(space, Box):
        print('    Min: ', space.low)
        print('    Max: ', space.high)
    # Discrete: values run from 0 to n-1
    if isinstance(space, Discrete):
        print('    Min: ', 0)
        print('    Max: ', space.n - 1)

# Create the environment
env = gym.make(ENV_ID)

# Print the state space and the action space
print('Environment ID: ', ENV_ID)
print_spaces('State space: ', env.observation_space)
print_spaces('Action space: ', env.action_space)

The output for each environment is as follows.

◎CartPole-v1

Environment ID:  CartPole-v1
State space:  Box(4,)
    Min:  [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
    Max:  [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
Action space:  Discrete(2)
    Min:  0
    Max:  1
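The velocity bounds print as 3.4028235e+38 rather than inf. That value matches the largest finite single-precision float, which suggests the bounds are stored as float32 maximums standing in for "unbounded"; this is an inference from the printed output, not something stated by Gym. The sketch below checks the number with the standard library and also converts the printed angle bound to degrees.

```python
import math
import struct

# Largest finite IEEE 754 single-precision value: (2 - 2^-23) * 2^127.
FLOAT32_MAX = (2 - 2 ** -23) * 2.0 ** 127

# Packing to 4 bytes and back leaves the value unchanged, so it is a valid float32.
round_trip = struct.unpack('f', struct.pack('f', FLOAT32_MAX))[0]

# The printed pole-angle bound, converted from radians to degrees.
angle_bound_deg = math.degrees(0.41887903)

print(FLOAT32_MAX)
print(round_trip == FLOAT32_MAX)
print(round(angle_bound_deg, 1))
```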

◎Breakout-v0

Environment ID:  Breakout-v0
State space:  Box(210, 160, 3)
    Min:  [[[0 0 0]
 [0 0 0]
 [0 0 0]
 ...
 [0 0 0]
 [0 0 0]
 [0 0 0]]]
    Max:  [[[255 255 255]
 [255 255 255]
 [255 255 255]
 ...
 [255 255 255]
 [255 255 255]
 [255 255 255]]]
Action space:  Discrete(4)
    Min:  0
    Max:  3

◎BipedalWalker-v2

Environment ID:  BipedalWalker-v2
State space:  Box(24,)
    Min:  [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf]
    Max:  [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf
inf inf inf inf inf inf]
Action space:  Box(4,)
    Min:  [-1. -1. -1. -1.]
    Max:  [1. 1. 1. 1.]
