Finding Good Hyperparameters for Random Forest and XGBoost with Bayesian Optimization


Use Bayesian optimization to find good hyperparameters more efficiently than an exhaustive grid search.

Bayesian optimization is a kind of SMBO (Sequential Model-Based Global Optimization): a family of methods that drive the search with a model approximating the objective function and an acquisition function that scores how worthwhile a candidate value is to evaluate, alternating between evaluating the real objective and updating the model. In Bayesian optimization, the model is the posterior distribution P(y|x, D) conditioned on D, the set of parameter–score pairs explored so far. As the probabilistic model, one typically assumes a Gaussian process (GP), in which the outputs (y1, y2, ..., yn) corresponding to any inputs (x1, x2, ..., xn) jointly follow a Gaussian (normal) distribution, or a TPE (Tree-structured Parzen Estimator).
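To make that loop concrete, here is a minimal sketch of SMBO for a 1-D objective, using sklearn's GaussianProcessRegressor as the surrogate and Expected Improvement as the acquisition function. The names expected_improvement and smbo are made up for this illustration; the BayesianOptimization library used below implements the same idea internally.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(candidates, gp, y_best):
    # EI(x) = E[max(f(x) - y_best, 0)] under the GP posterior (for maximization)
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def smbo(objective, bounds, n_init=5, n_iter=20):
    # D: the (x, y) pairs observed so far, seeded with n_init random evaluations
    xs = np.random.uniform(bounds[0], bounds[1], size=(n_init, 1))
    ys = np.array([objective(x[0]) for x in xs])
    gp = GaussianProcessRegressor(alpha=1e-5, normalize_y=True)
    for _ in range(n_iter):
        gp.fit(xs, ys)  # update the surrogate P(y|x, D)
        cand = np.random.uniform(bounds[0], bounds[1], size=(1000, 1))
        x_next = cand[np.argmax(expected_improvement(cand, gp, ys.max()))]
        xs = np.vstack([xs, [x_next]])
        ys = np.append(ys, objective(x_next[0]))  # evaluate the real objective
    return xs[np.argmax(ys), 0], ys.max()

# e.g. maximize a toy objective on [0, 10]; the optimum is at x = 3
best_x, best_y = smbo(lambda x: -(x - 3.0) ** 2, bounds=(0.0, 10.0))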

This time, I solve the Kaggle Titanic tutorial and compare against the results obtained earlier with untuned Random Forest and XGBoost, to check whether hyperparameters found by Bayesian optimization improve accuracy.

Solving the Kaggle Titanic tutorial with Random Forest - sambaiz-net

Solving the Kaggle Titanic tutorial with XGBoost - sambaiz-net

Random Forest

Use BayesianOptimization, a Python library for Bayesian optimization.

$ pip install bayesian-optimization

To find values for the following RandomForestClassifier hyperparameters

  • n_estimators: number of trees
  • min_samples_split: minimum number of samples required to split a node
  • max_features: fraction of features considered when looking for a split

pass the value to maximize (accuracy) and the range of each parameter to BayesianOptimization.

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from bayes_opt import BayesianOptimization

import pandas as pd

def preprocess(df):
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Embarked'] = df['Embarked'].fillna('Unknown')
    df['Sex'] = df['Sex'].apply(lambda x: 1 if x == 'male' else 0)
    df['Embarked'] = df['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2, 'Unknown': 3} ).astype(int)
    df = df.drop(['Cabin','Name','PassengerId','Ticket'],axis=1)
    return df

df = preprocess(pd.read_csv('./train.csv'))
train_x = df.drop('Survived', axis=1)
train_y = df.Survived

def randomforest_cv(n_estimators, min_samples_split, max_features):
    val = cross_val_score(
        RandomForestClassifier(
            n_estimators=int(n_estimators),
            min_samples_split=int(min_samples_split),
            max_features=max_features,
            random_state=0
        ),
        train_x, train_y,
        scoring = 'accuracy',
        cv = 3, # 3-fold
        n_jobs = -1 # use all CPUs
    ).mean()
    return val

randomforest_cv_bo = BayesianOptimization(
    randomforest_cv,
    {'n_estimators': (10, 250),
     'min_samples_split': (2, 25),
     'max_features': (0.1, 0.999)}
)

gp_params = {"alpha": 1e-5}  # passed through to the underlying GaussianProcessRegressor
randomforest_cv_bo.maximize(n_iter=50, **gp_params)
print(randomforest_cv_bo.res['max']['max_val'])
print(randomforest_cv_bo.res['max']['max_params'])

First the objective is evaluated at init_points random values, and Bayesian optimization then continues the search starting from those results.
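The number of random points, the number of optimization steps, and the acquisition function can also be set explicitly. Assuming the same (old) bayes_opt API used above, a call might look like:

# 'ucb', 'ei' or 'poi' select the acquisition function in this version of bayes_opt
randomforest_cv_bo.maximize(init_points=5, n_iter=50, acq='ei', **gp_params)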

Initialization
-------------------------------------------------------------------------------------
 Step |   Time |      Value |   max_features |   min_samples_split |   n_estimators | 
    1 | 00m00s |    0.82267 |         0.7470 |             23.8836 |       103.9041 | 
    2 | 00m00s |    0.81818 |         0.2784 |             12.4106 |       188.5984 | 
    3 | 00m00s |    0.82267 |         0.4745 |              9.1743 |        16.8421 | 
    4 | 00m00s |    0.82267 |         0.6617 |              4.7600 |       222.8920 | 
    5 | 00m00s |    0.81033 |         0.3057 |              9.2044 |        42.1871 | 
Bayesian Optimization
-------------------------------------------------------------------------------------
 Step |   Time |      Value |   max_features |   min_samples_split |   n_estimators | 
    6 | 00m08s |    0.82043 |         0.5444 |             24.5978 |       249.8593 | 
    7 | 00m08s |    0.81033 |         0.3853 |             24.8421 |       249.9177 | 
    8 | 00m07s |    0.79012 |         0.8454 |              2.2098 |        10.0838 | 
    9 | 00m05s |    0.81257 |         0.6389 |             24.9582 |        10.1302 | 
...
   54 | 00m13s |    0.82379 |         0.9751 |             24.9576 |        73.5207 | 
   55 | 00m13s |    0.82043 |         0.9698 |             24.9442 |        65.3054 | 
0.82379349046
{'n_estimators': 73.520665913948847, 'min_samples_split': 24.957568460557685, 'max_features': 0.97511242524537167}

Accuracy improved slightly, and the influence of Sex became considerably stronger.

0.810169491525
Sex             0.445553
Fare            0.180365
Pclass          0.162260
Age             0.147300
Embarked        0.037309
SibSp           0.014159
Parch           0.013054
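Note that BayesianOptimization returns every parameter as a float, so to refit a model with the values it found, the integer-valued ones need to be cast back. A minimal sketch, reusing train_x, train_y and the result object from above:

# Sketch: refit RandomForestClassifier with the best parameters found above,
# casting the integer-valued ones back to int, and inspect feature importances.
best = randomforest_cv_bo.res['max']['max_params']
clf = RandomForestClassifier(
    n_estimators=int(best['n_estimators']),
    min_samples_split=int(best['min_samples_split']),
    max_features=best['max_features'],
    random_state=0
).fit(train_x, train_y)
print(pd.Series(clf.feature_importances_, index=train_x.columns).sort_values(ascending=False))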

XGBoost

For XGBoost, search for good values of

  • learning_rate
  • max_depth
  • subsample
  • colsample_bytree
  • min_child_weight
  • gamma
  • alpha

by passing their ranges to BayesianOptimization in the same way.

import pandas as pd
import xgboost as xgb
from bayes_opt import BayesianOptimization

df = preprocess(pd.read_csv('./train.csv'))
train_x = df.drop('Survived', axis=1)
train_y = df.Survived
xgtrain = xgb.DMatrix(train_x, label=train_y)

def xgboost_cv(
    learning_rate,
    max_depth,
    subsample,
    colsample_bytree,
    min_child_weight,
    gamma,
    alpha):
    
    params = {}
    params['learning_rate'] = learning_rate
    # maximum depth of a tree, increase this value will make the model more complex / likely to be overfitting.
    params['max_depth'] = int(max_depth) 
    #  subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees and this will prevent overfitting.
    params['subsample'] = subsample
    # subsample ratio of columns when constructing each tree.
    params['colsample_bytree'] = colsample_bytree 
    # minimum sum of instance weight (hessian) needed in a child. The larger, the more conservative the algorithm will be.
    params['min_child_weight'] = min_child_weight
    # minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
    params['gamma'] = gamma 
    # L1 regularization term on weights, increase this value will make model more conservative. 
    params['alpha'] = alpha 
    params['objective'] = 'binary:logistic'

    cv_result = xgb.cv(
        params,
        xgtrain,
        num_boost_round=10, 
        nfold=3,
        seed=0,
        # Validation error needs to decrease at least every <stopping_rounds> round(s) to continue training.
        # callbacks=[xgb.callback.early_stop(20)]
    )

    return 1.0 - cv_result['test-error-mean'].values[-1]


xgboost_cv_bo = BayesianOptimization(xgboost_cv, 
                             {
                                 'learning_rate': (0.1, 0.9),
                                 'max_depth': (5, 15),
                                 'subsample': (0.5, 1),
                                 'colsample_bytree': (0.1, 1),
                                 'min_child_weight': (1, 20),
                                 'gamma': (0, 10),
                                 'alpha': (0, 10),
                             })

xgboost_cv_bo.maximize(n_iter=50)
print(xgboost_cv_bo.res['max']['max_val'])
print(xgboost_cv_bo.res['max']['max_params'])

Initialization
---------------------------------------------------------------------------------------------------------------------------------------------
 Step |   Time |      Value |     alpha |   colsample_bytree |     gamma |   learning_rate |   max_depth |   min_child_weight |   subsample | 
    1 | 00m00s |    0.77441 |    5.3061 |             0.4302 |    1.1512 |          0.4484 |      8.4632 |            16.3455 |      0.5688 | 
    2 | 00m00s |    0.79349 |    6.5298 |             0.2519 |    8.8275 |          0.8277 |      9.6443 |             8.1207 |      0.5426 | 
    3 | 00m00s |    0.79349 |    0.4152 |             0.7973 |    7.5153 |          0.7173 |      8.2608 |            12.5433 |      0.7952 | 
    4 | 00m00s |    0.76768 |    3.9047 |             0.9619 |    2.0264 |          0.8893 |      9.8001 |            18.0125 |      0.8254 | 
    5 | 00m00s |    0.77217 |    3.0779 |             0.2957 |    3.1872 |          0.4871 |      8.8120 |            10.6444 |      0.6602 | 
Bayesian Optimization
---------------------------------------------------------------------------------------------------------------------------------------------
 Step |   Time |      Value |     alpha |   colsample_bytree |     gamma |   learning_rate |   max_depth |   min_child_weight |   subsample | 
    6 | 00m27s |    0.78676 |    8.5373 |             1.0000 |   10.0000 |          0.9000 |      5.0000 |            20.0000 |      0.5000 | 
    7 | 00m38s |    0.76768 |   10.0000 |             1.0000 |   10.0000 |          0.1000 |      5.0000 |             1.0000 |      1.0000 | 
    8 | 00m16s |    0.67901 |    0.2277 |             0.1000 |   10.0000 |          0.1000 |     15.0000 |            19.9844 |      0.5000 | 
    9 | 00m36s |    0.77666 |    0.0000 |             0.1000 |   10.0000 |          0.9000 |      5.0000 |             1.0000 |      0.5000 | 
...

   55 | 00m20s |    0.81818 |    0.3008 |             1.0000 |    2.5199 |          0.9000 |     13.8205 |             3.3037 |      1.0000 | 
0.833894666667
{'learning_rate': 0.46665290052625796, 'max_depth': 14.985905144970891, 'subsample': 0.96857695798880505, 'colsample_bytree': 0.74722905651892868, 'min_child_weight': 1.1211600650692968, 'gamma': 0.44876616653489076, 'alpha': 0.13669004333540569}

Accuracy increased only slightly. The original parameters may not have been that bad to begin with.

0.803389830508
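
As with the random forest, max_depth comes back as a float and needs to be cast to int before training a final model with the found parameters. A sketch, assuming the same xgtrain and preprocess as above (test.csv is the Kaggle test file):

# Sketch: train a final booster with the best parameters found above.
best = xgboost_cv_bo.res['max']['max_params']
params = dict(best)
params['max_depth'] = int(params['max_depth'])  # cast back to int
params['objective'] = 'binary:logistic'
model = xgb.train(params, xgtrain, num_boost_round=10)

# predict survival probabilities for the test set
test_x = preprocess(pd.read_csv('./test.csv'))
pred = model.predict(xgb.DMatrix(test_x))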

References

機械学習のためのベイズ最適化入門 (Introduction to Bayesian Optimization for Machine Learning)

ガウス過程の基礎と教師なし学習 (Basics of Gaussian Processes and Unsupervised Learning)

Kaggleで勝つデータ分析の技術 (Data Analysis Techniques to Win Kaggle) - 技術評論社