Getting started with AWS DeepRacer

2019-06-10 aws machinelearning

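AWS DeepRacer is a self-driving 1/18-scale racing car. You train it with reinforcement learning using services such as SageMaker and RoboMaker, then run it on the physical car or compete in the virtual DeepRacer League. You do not have to process the camera images or implement the reinforcement learning algorithm yourself; writing a reward function is enough to get it driving, so the barrier to entry is low.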

Reinforcement learning and DQN (Deep Q-network) - sambaiz-net

Settings

Action space

The list of actions the car can take, each a combination of speed and steering angle. It is generated from the following items.

Maximum steering angle (1 - 30)
Steering angle granularity (3, 5, 7)
Maximum speed (0.8 - 8)
Speed granularity (1, 2, 3)
Loss type (Mean square error, Huber)
Number of experience episodes between each policy-updating iteration (5 - 100)
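
As a rough illustration of how the first four items could expand into a list of actions, here is a minimal sketch. The even spacing of angles and speeds is an assumption about how the console builds the list, and `build_action_space` is a hypothetical helper, not part of any AWS API.

```python
from itertools import product


def build_action_space(max_steering_angle, steering_granularity,
                       max_speed, speed_granularity):
    """Enumerate (steering_angle, speed) pairs from the four settings.

    Assumption: angles are spaced evenly over [-max, +max] and speeds
    evenly up to max_speed; the console's exact rule may differ.
    """
    # Steering: evenly spaced angles from -max to +max.
    # The granularity is odd (3, 5, 7), so 0 degrees is always included.
    step = 2 * max_steering_angle / (steering_granularity - 1)
    angles = [-max_steering_angle + i * step for i in range(steering_granularity)]

    # Speed: evenly spaced values, the last one being max_speed.
    speeds = [max_speed * (i + 1) / speed_granularity for i in range(speed_granularity)]

    # Every combination of one angle and one speed is one action.
    return [{"steering_angle": a, "speed": s} for a, s in product(angles, speeds)]


actions = build_action_space(30, 5, 3.0, 3)
for action in actions:
    print(action)
```

Under this assumption the number of actions is the steering granularity times the speed granularity, so finer granularity means a larger action space to learn over.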

Reward function

The reward function for reinforcement learning. It is implemented using the following input parameters.

{
    "all_wheels_on_track": Boolean,    # flag to indicate if the vehicle is on the track
    "x": float,                        # vehicle's x-coordinate in meters
    "y": float,                        # vehicle's y-coordinate in meters
    "distance_from_center": float,     # distance in meters from the track center
    "is_left_of_center": Boolean,      # flag to indicate if the vehicle is to the left of the track center
    "heading": float,                  # vehicle's yaw in degrees
    "progress": float,                 # percentage of track completed
    "steps": int,                      # number of steps completed
    "speed": float,                    # vehicle's speed in meters per second (m/s)
    "steering_angle": float,           # vehicle's steering angle in degrees
    "track_width": float,              # width of the track
    "waypoints": [[float, float], … ], # list of [x,y] as milestones along the track center
    "closest_waypoints": [int, int]    # indices of the two nearest waypoints
}
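
To show how several of these parameters can be combined, here is a minimal sketch of a reward function that favors pointing the car along the track. It assumes that `closest_waypoints` is ordered as [waypoint behind, waypoint ahead] and that `heading` is measured in the same coordinate frame as the waypoints; `DIRECTION_THRESHOLD` is a hypothetical tuning value, not something taken from the console.

```python
import math

# Hypothetical tuning value: how many degrees off the track direction is tolerated.
DIRECTION_THRESHOLD = 10.0


def reward_function(params):
    """Reward the car for heading along the track direction (illustrative sketch)."""
    # Near-zero reward when any wheel has left the track.
    if not params["all_wheels_on_track"]:
        return 1e-3

    waypoints = params["waypoints"]
    # Assumption: the first index is the waypoint behind, the second the one ahead.
    prev_idx, next_idx = params["closest_waypoints"]
    heading = params["heading"]

    # Direction of the center line between the two closest waypoints, in degrees.
    prev_x, prev_y = waypoints[prev_idx]
    next_x, next_y = waypoints[next_idx]
    track_direction = math.degrees(math.atan2(next_y - prev_y, next_x - prev_x))

    # Smallest absolute difference between the two angles (0 to 180 degrees).
    direction_diff = abs(track_direction - heading)
    if direction_diff > 180:
        direction_diff = 360 - direction_diff

    # Halve the reward when the car points too far away from the track direction.
    reward = 1.0
    if direction_diff > DIRECTION_THRESHOLD:
        reward *= 0.5
    return float(reward)
```

Calling it with a hand-built `params` dict (a few waypoints on a straight line, a heading of 0) is an easy way to check the logic locally before spending training time on it.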

Hyperparameters

PPO (Proximal Policy Optimization) is the only supported training algorithm, and the following hyperparameters can be set.

Gradient descent batch size (32, 64, 128, 256, 512)
Number of epochs (3 - 10)
Learning rate (1e-8 - 1e-3)
Entropy (0 - 1)
Discount factor (0 - 1)
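
Of these, the discount factor is perhaps the easiest to get a feel for numerically. The sketch below uses the standard reinforcement-learning definition of a discounted return (a reward k steps ahead is weighted by the discount factor raised to the power k); the episode of constant rewards is made up for illustration, and how DeepRacer applies the value internally is not shown here.

```python
def discounted_return(rewards, discount_factor):
    """Sum of rewards where the reward k steps ahead is weighted by discount_factor ** k."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + discount_factor * total
    return total


# Hypothetical episode: a reward of 1.0 at each of 100 steps.
rewards = [1.0] * 100
for gamma in (0.9, 0.99, 0.999):
    print(gamma, round(discounted_return(rewards, gamma), 2))
```

A value close to 1 makes rewards far in the future count almost as much as immediate ones, while a smaller value makes the agent focus on the next few steps.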

Stop conditions (5 - 480 mins)

The maximum training time. Training seems to cost around $3/hour, so letting it run too long gets expensive.
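
For a sense of scale, here is a small back-of-the-envelope calculation. The $3/hour rate is only the rough figure mentioned above, not an official price, and `estimated_cost_usd` is a hypothetical helper.

```python
HOURLY_RATE_USD = 3.0  # rough figure from the text, not an official price


def estimated_cost_usd(training_minutes, hourly_rate=HOURLY_RATE_USD):
    """Approximate training cost for a stop condition given in minutes."""
    # The stop condition accepts 5 to 480 minutes.
    if not 5 <= training_minutes <= 480:
        raise ValueError("stop condition must be between 5 and 480 minutes")
    return hourly_rate * training_minutes / 60


for minutes in (60, 120, 480):
    print(minutes, "min ->", estimated_cost_usd(minutes), "USD")
```

At that rate, a run that uses the full 480 minutes would come to roughly $24.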

Submitting the sample model

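I tried entering the currently running Kumo Torakku race with the sample model Sample-Follow-center-line. As the name suggests, it gives a higher reward the closer the car is to the center line.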

def reward_function(params):
    '''
    Example of rewarding the agent to follow center line
    '''
    
    # Read input parameters
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']
    
    # Calculate 3 markers that are at varying distances away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width
    
    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed/ close to off track
    
    return float(reward)

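After I submitted it, the car completed the course in a little over a minute. The top of the leaderboard is in the 10-second range, so there is still a long way to go.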

Creating a model

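As a first step, I kept the same Reward function, set the Action space's Maximum speed to its maximum of 8, and trained. The simulation video plays alongside the Total reward during training, so you are supposed to be able to watch the car gradually learn to drive well.

That was the idea, anyway, but it never managed to drive well and training ran out of time. I submitted it regardless and, unsurprisingly, it did not complete the course. A pity.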