Searching for good hyperparameters for Random Forest and XGBoost with Bayesian optimization

(2018-06-10)

Grid search, which scikit-learn also provides, is one way to find good hyperparameters for a machine learning model, but it tries every combination exhaustively and takes a long time.
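For example, even a modest grid multiplies quickly. A minimal sketch with scikit-learn's GridSearchCV (the value lists here are illustrative, and train_x / train_y are defined later in this post):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# every combination gets fitted: 6 * 5 * 4 = 120 candidates x 3 CV folds each
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [10, 50, 100, 150, 200, 250],
     'min_samples_split': [2, 5, 10, 15, 25],
     'max_features': [0.1, 0.4, 0.7, 0.999]},
    scoring='accuracy',
    cv=3,
    n_jobs=-1,
)
# grid.fit(train_x, train_y); grid.best_params_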

Bayesian optimization, by contrast, assumes the objective function follows a Gaussian process (GP) and defines an acquisition function on top of it, such as the probability of exceeding the current maximum or the expected improvement over it. A Gaussian process is a probabilistic model of a regression function in which the outputs (y1, y2, …, yn) corresponding to any inputs (x1, x2, …, xn) jointly follow a Gaussian (i.e. normal) distribution. Using the mean and variance the GP predicts at inputs that have not been tried yet, the next input to evaluate is chosen, which makes the search efficient.
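To make the idea concrete, here is a minimal sketch of the expected improvement (EI) acquisition function computed from a GP posterior with scikit-learn's GaussianProcessRegressor; the library used below does the equivalent internally.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, x_candidates, y_best):
    # GP posterior mean and standard deviation at untried inputs
    mu, sigma = gp.predict(x_candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero
    # expected amount by which each candidate improves on the current maximum
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

# three observations of a 1-D objective so far
x_obs = np.array([[0.1], [0.5], [0.9]])
y_obs = np.array([0.2, 0.8, 0.3])

gp = GaussianProcessRegressor().fit(x_obs, y_obs)
x_cand = np.linspace(0, 1, 100).reshape(-1, 1)
ei = expected_improvement(gp, x_cand, y_obs.max())
x_next = x_cand[np.argmax(ei)]  # the next input to evaluate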

This time, I check whether the hyperparameters found by Bayesian optimization improve accuracy, comparing against the results of solving Kaggle's Titanic tutorial with untuned Random Forest and XGBoost.

Solving Kaggle's Titanic tutorial with Random Forest - sambaiz-net

Solving Kaggle's Titanic tutorial with XGBoost - sambaiz-net

Random Forest

I use BayesianOptimization, a Python library for Bayesian optimization.

$ pip install bayesian-optimization

To search for values of the following RandomForestClassifier hyperparameters

  • n_estimators: the number of trees
  • min_samples_split: the minimum number of samples required to split a node
  • max_features: the fraction of features considered when looking for a split

pass the value to maximize (here, cross-validation accuracy) and the range of each parameter to BayesianOptimization.

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from bayes_opt import BayesianOptimization

import pandas as pd

def preprocess(df):
    # fill missing values
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Embarked'] = df['Embarked'].fillna('Unknown')
    # encode categorical columns as integers
    df['Sex'] = df['Sex'].apply(lambda x: 1 if x == 'male' else 0)
    df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2, 'Unknown': 3}).astype(int)
    # drop columns not used as features
    df = df.drop(['Cabin', 'Name', 'PassengerId', 'Ticket'], axis=1)
    return df

df = preprocess(pd.read_csv('./train.csv'))
train_x = df.drop('Survived', axis=1)
train_y = df.Survived

# Objective to maximize: mean 3-fold cross-validation accuracy.
# BayesianOptimization passes every parameter as a float, so cast the integer ones.
def randomforest_cv(n_estimators, min_samples_split, max_features):
    val = cross_val_score(
        RandomForestClassifier(
            n_estimators=int(n_estimators),
            min_samples_split=int(min_samples_split),
            max_features=max_features,
            random_state=0
        ),
        train_x, train_y,
        scoring = 'accuracy',
        cv = 3, # 3-fold
        n_jobs = -1 # use all CPUs
    ).mean()
    return val

randomforest_cv_bo = BayesianOptimization(
    randomforest_cv,
    {'n_estimators': (10, 250),
     'min_samples_split': (2, 25),
     'max_features': (0.1, 0.999)}
)

# alpha is forwarded to the internal GaussianProcessRegressor:
# noise added to the diagonal of the kernel matrix
gp_params = {"alpha": 1e-5}
randomforest_cv_bo.maximize(n_iter=50, **gp_params)
print(randomforest_cv_bo.res['max']['max_val'])
print(randomforest_cv_bo.res['max']['max_params'])

It first tries init_points random points, then uses those results as the starting point for the Bayesian optimization search.
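Spelled out with the defaults made explicit (a sketch assuming the 2018-era bayes_opt v0.6 API, where init_points defaults to 5 and 'ucb' is the default acquisition function; 'ei' or 'poi' would switch to expected improvement or probability of improvement):

# the call above, with the implicit defaults written out
randomforest_cv_bo.maximize(init_points=5, n_iter=50, acq='ucb', **gp_params)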

Initialization
-------------------------------------------------------------------------------------
 Step |   Time |      Value |   max_features |   min_samples_split |   n_estimators | 
    1 | 00m00s |    0.82267 |         0.7470 |             23.8836 |       103.9041 | 
    2 | 00m00s |    0.81818 |         0.2784 |             12.4106 |       188.5984 | 
    3 | 00m00s |    0.82267 |         0.4745 |              9.1743 |        16.8421 | 
    4 | 00m00s |    0.82267 |         0.6617 |              4.7600 |       222.8920 | 
    5 | 00m00s |    0.81033 |         0.3057 |              9.2044 |        42.1871 | 
Bayesian Optimization
-------------------------------------------------------------------------------------
 Step |   Time |      Value |   max_features |   min_samples_split |   n_estimators | 
    6 | 00m08s |    0.82043 |         0.5444 |             24.5978 |       249.8593 | 
    7 | 00m08s |    0.81033 |         0.3853 |             24.8421 |       249.9177 | 
    8 | 00m07s |    0.79012 |         0.8454 |              2.2098 |        10.0838 | 
    9 | 00m05s |    0.81257 |         0.6389 |             24.9582 |        10.1302 | 
...
   54 | 00m13s |    0.82379 |         0.9751 |             24.9576 |        73.5207 | 
   55 | 00m13s |    0.82043 |         0.9698 |             24.9442 |        65.3054 | 
0.82379349046
{'n_estimators': 73.520665913948847, 'min_samples_split': 24.957568460557685, 'max_features': 0.97511242524537167}

Accuracy improved a little, and the influence of Sex became considerably stronger.

0.810169491525
Sex             0.445553
Fare            0.180365
Pclass          0.162260
Age             0.147300
Embarked        0.037309
SibSp           0.014159
Parch           0.013054
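Feature importances like these can be inspected by refitting RandomForestClassifier with the found parameters. A sketch (the evaluation in the linked post may use a different split, so the numbers will not match exactly):

# refit with the found parameters, casting the integer-valued ones
model = RandomForestClassifier(
    n_estimators=73,        # int(73.52...)
    min_samples_split=24,   # int(24.95...)
    max_features=0.975,
    random_state=0,
).fit(train_x, train_y)

# per-feature importances, largest first
print(pd.Series(model.feature_importances_, index=train_x.columns)
        .sort_values(ascending=False))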

XGBoost

For XGBoost, I search for the following hyperparameters:

  • learning_rate
  • max_depth
  • subsample
  • colsample_bytree
  • min_child_weight
  • gamma
  • alpha


import pandas as pd
import xgboost as xgb
from bayes_opt import BayesianOptimization

df = preprocess(pd.read_csv('./train.csv'))  # preprocess() as defined in the Random Forest section
train_x = df.drop('Survived', axis=1)
train_y = df.Survived
xgtrain = xgb.DMatrix(train_x, label=train_y)

def xgboost_cv(
    learning_rate,
    max_depth,
    subsample,
    colsample_bytree,
    min_child_weight,
    gamma,
    alpha):
    
    params = {}
    params['learning_rate'] = learning_rate
    # maximum depth of a tree; increasing it makes the model more complex and more likely to overfit
    params['max_depth'] = int(max_depth)
    # subsample ratio of the training instances; 0.5 means XGBoost randomly samples half of the data to grow each tree, which helps prevent overfitting
    params['subsample'] = subsample
    # subsample ratio of columns when constructing each tree
    params['colsample_bytree'] = colsample_bytree
    # minimum sum of instance weight (hessian) needed in a child; the larger, the more conservative the algorithm
    params['min_child_weight'] = min_child_weight
    # minimum loss reduction required to make a further partition on a leaf node; the larger, the more conservative the algorithm
    params['gamma'] = gamma
    # L1 regularization term on weights; increasing it makes the model more conservative
    params['alpha'] = alpha
    params['objective'] = 'binary:logistic'

    cv_result = xgb.cv(
        params,
        xgtrain,
        num_boost_round=10, 
        nfold=3,
        seed=0,
        # Validation error needs to decrease at least every <stopping_rounds> round(s) to continue training.
        # callbacks=[xgb.callback.early_stop(20)]
    )

    return 1.0 - cv_result['test-error-mean'].values[-1]


xgboost_cv_bo = BayesianOptimization(
    xgboost_cv,
    {'learning_rate': (0.1, 0.9),
     'max_depth': (5, 15),
     'subsample': (0.5, 1),
     'colsample_bytree': (0.1, 1),
     'min_child_weight': (1, 20),
     'gamma': (0, 10),
     'alpha': (0, 10)}
)

xgboost_cv_bo.maximize(n_iter=50)
print(xgboost_cv_bo.res['max']['max_val'])
print(xgboost_cv_bo.res['max']['max_params'])

Initialization
---------------------------------------------------------------------------------------------------------------------------------------------
 Step |   Time |      Value |     alpha |   colsample_bytree |     gamma |   learning_rate |   max_depth |   min_child_weight |   subsample | 
    1 | 00m00s |    0.77441 |    5.3061 |             0.4302 |    1.1512 |          0.4484 |      8.4632 |            16.3455 |      0.5688 | 
    2 | 00m00s |    0.79349 |    6.5298 |             0.2519 |    8.8275 |          0.8277 |      9.6443 |             8.1207 |      0.5426 | 
    3 | 00m00s |    0.79349 |    0.4152 |             0.7973 |    7.5153 |          0.7173 |      8.2608 |            12.5433 |      0.7952 | 
    4 | 00m00s |    0.76768 |    3.9047 |             0.9619 |    2.0264 |          0.8893 |      9.8001 |            18.0125 |      0.8254 | 
    5 | 00m00s |    0.77217 |    3.0779 |             0.2957 |    3.1872 |          0.4871 |      8.8120 |            10.6444 |      0.6602 | 
Bayesian Optimization
---------------------------------------------------------------------------------------------------------------------------------------------
 Step |   Time |      Value |     alpha |   colsample_bytree |     gamma |   learning_rate |   max_depth |   min_child_weight |   subsample | 
    6 | 00m27s |    0.78676 |    8.5373 |             1.0000 |   10.0000 |          0.9000 |      5.0000 |            20.0000 |      0.5000 | 
    7 | 00m38s |    0.76768 |   10.0000 |             1.0000 |   10.0000 |          0.1000 |      5.0000 |             1.0000 |      1.0000 | 
    8 | 00m16s |    0.67901 |    0.2277 |             0.1000 |   10.0000 |          0.1000 |     15.0000 |            19.9844 |      0.5000 | 
    9 | 00m36s |    0.77666 |    0.0000 |             0.1000 |   10.0000 |          0.9000 |      5.0000 |             1.0000 |      0.5000 | 
...

   55 | 00m20s |    0.81818 |    0.3008 |             1.0000 |    2.5199 |          0.9000 |     13.8205 |             3.3037 |      1.0000 | 
0.833894666667
{'learning_rate': 0.46665290052625796, 'max_depth': 14.985905144970891, 'subsample': 0.96857695798880505, 'colsample_bytree': 0.74722905651892868, 'min_child_weight': 1.1211600650692968, 'gamma': 0.44876616653489076, 'alpha': 0.13669004333540569}

Accuracy increased only marginally. The original parameters may not have been so bad to begin with.

0.803389830508
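To feed the found parameters back into a final model, something like the following works (a sketch: max_depth is cast to an int as in xgboost_cv, and the values are rounded from the output above):

# train a final booster with the parameters found by the search
best_params = {
    'learning_rate': 0.4667,
    'max_depth': 14,  # int(14.9859...)
    'subsample': 0.9686,
    'colsample_bytree': 0.7472,
    'min_child_weight': 1.1212,
    'gamma': 0.4488,
    'alpha': 0.1367,
    'objective': 'binary:logistic',
}
final_model = xgb.train(best_params, xgtrain, num_boost_round=10)
# predicting on a DMatrix of the test set then yields survival probabilities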

References

機械学習のためのベイズ最適化入門 (An Introduction to Bayesian Optimization for Machine Learning)

ガウス過程の基礎と教師なし学習 (Fundamentals of Gaussian Processes and Unsupervised Learning)