Bayesian optimization searches for good hyperparameters more efficiently than an exhaustive grid search.
Bayesian optimization is a form of SMBO (Sequential Model-Based Global Optimization): it advances the search using a model that approximates the objective function and an acquisition function that evaluates whether a point is worth exploring next, then evaluates the actual objective function and updates the model.
As the model, Bayesian optimization uses the conditional posterior distribution P(y|x, D), where D is the set of (parameter, score) pairs explored so far.
As the probabilistic model, one assumes a Gaussian process (GP), in which the outputs (y1, y2, ..., yn) corresponding to arbitrary inputs (x1, x2, ..., xn) follow a Gaussian (i.e. normal) distribution, or a TPE (Tree-structured Parzen Estimator).
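To make this loop concrete, here is a minimal SMBO sketch (not the implementation of the BayesianOptimization library used below): a scikit-learn GP surrogate and an Expected Improvement acquisition function pick the next point to evaluate on a toy 1-D objective. All names and settings here are assumptions for illustration only.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    # hypothetical objective to maximize; its optimum is at x = 2
    return -(x - 2.0) ** 2

candidates = np.linspace(-5, 5, 200).reshape(-1, 1)  # discretized search space
X = np.array([[-4.0], [0.0], [4.0]])                 # initial observations
y = objective(X).ravel()

gp = GaussianProcessRegressor(alpha=1e-5)
for _ in range(10):
    gp.fit(X, y)                                      # update the surrogate model P(y|x, D)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y.max()
    with np.errstate(divide='ignore', invalid='ignore'):
        z = (mu - best) / sigma
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)  # Expected Improvement
        ei[sigma == 0.0] = 0.0
    x_next = candidates[np.argmax(ei)]                # point with the highest acquisition value
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next))               # evaluate the real objective and repeat

print(X[np.argmax(y)], y.max())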
This time, I check whether the hyperparameters found by Bayesian optimization improve accuracy, comparing against the results of solving the Kaggle Titanic tutorial with an untuned random forest and XGBoost.
Solving the Kaggle Titanic tutorial with a random forest - sambaiz-net
Solving the Kaggle Titanic tutorial with XGBoost - sambaiz-net
Random forest
Use BayesianOptimization, a Bayesian optimization library for Python.
$ pip install bayesian-optimization
To find values for the following RandomForestClassifier hyperparameters
- n_estimators: the number of trees
- min_samples_split: the minimum number of samples required to split a node
- max_features: the fraction of features considered when splitting
pass the value to maximize (accuracy) and the parameter ranges to BayesianOptimization.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from bayes_opt import BayesianOptimization
import pandas as pd
def preprocess(df):
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Embarked'] = df['Embarked'].fillna('Unknown')
    df['Sex'] = df['Sex'].apply(lambda x: 1 if x == 'male' else 0)
    df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2, 'Unknown': 3}).astype(int)
    df = df.drop(['Cabin', 'Name', 'PassengerId', 'Ticket'], axis=1)
    return df
df = preprocess(pd.read_csv('./train.csv'))
train_x = df.drop('Survived', axis=1)
train_y = df.Survived
def randomforest_cv(n_estimators, min_samples_split, max_features):
    val = cross_val_score(
        RandomForestClassifier(
            n_estimators=int(n_estimators),
            min_samples_split=int(min_samples_split),
            max_features=max_features,
            random_state=0
        ),
        train_x, train_y,
        scoring='accuracy',
        cv=3,       # 3-fold
        n_jobs=-1   # use all CPUs
    ).mean()
    return val
randomforest_cv_bo = BayesianOptimization(
    randomforest_cv,
    {'n_estimators': (10, 250),
     'min_samples_split': (2, 25),
     'max_features': (0.1, 0.999)}
)
gp_params = {"alpha": 1e-5}
randomforest_cv_bo.maximize(n_iter=50, **gp_params)
print(randomforest_cv_bo.res['max']['max_val'])
print(randomforest_cv_bo.res['max']['max_params'])
First, init_points random points are evaluated, and the Bayesian optimization search then proceeds from those results.
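In this version of the library, the number of random initialization points can be set explicitly via the init_points argument of maximize (it defaults to 5, which matches the five initialization steps below); for example:
# assumption: explicitly requesting 5 random initial points before 50 BO iterations
randomforest_cv_bo.maximize(init_points=5, n_iter=50, **gp_params)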
Initialization
-------------------------------------------------------------------------------------
Step | Time | Value | max_features | min_samples_split | n_estimators |
1 | 00m00s | 0.82267 | 0.7470 | 23.8836 | 103.9041 |
2 | 00m00s | 0.81818 | 0.2784 | 12.4106 | 188.5984 |
3 | 00m00s | 0.82267 | 0.4745 | 9.1743 | 16.8421 |
4 | 00m00s | 0.82267 | 0.6617 | 4.7600 | 222.8920 |
5 | 00m00s | 0.81033 | 0.3057 | 9.2044 | 42.1871 |
Bayesian Optimization
-------------------------------------------------------------------------------------
Step | Time | Value | max_features | min_samples_split | n_estimators |
6 | 00m08s | 0.82043 | 0.5444 | 24.5978 | 249.8593 |
7 | 00m08s | 0.81033 | 0.3853 | 24.8421 | 249.9177 |
8 | 00m07s | 0.79012 | 0.8454 | 2.2098 | 10.0838 |
9 | 00m05s | 0.81257 | 0.6389 | 24.9582 | 10.1302 |
...
54 | 00m13s | 0.82379 | 0.9751 | 24.9576 | 73.5207 |
55 | 00m13s | 0.82043 | 0.9698 | 24.9442 | 65.3054 |
0.82379349046
{'n_estimators': 73.520665913948847, 'min_samples_split': 24.957568460557685, 'max_features': 0.97511242524537167}
The accuracy improved slightly, and the influence of Sex became considerably stronger.
0.810169491525
Sex 0.445553
Fare 0.180365
Pclass 0.162260
Age 0.147300
Embarked 0.037309
SibSp 0.014159
Parch 0.013054
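The importance table above can presumably be reproduced by fitting a RandomForestClassifier with the best parameters found and reading feature_importances_; a minimal sketch (the rounded values are taken from the result above, truncated to int as in randomforest_cv):
model = RandomForestClassifier(
    n_estimators=73,          # int(73.52...) from the best parameters found
    min_samples_split=24,     # int(24.95...)
    max_features=0.975,
    random_state=0
)
model.fit(train_x, train_y)
# feature importances sorted in descending order
print(pd.Series(model.feature_importances_, index=train_x.columns).sort_values(ascending=False))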
XGBoost
For XGBoost, the following parameters are searched:
- learning_rate
- max_depth
- subsample
- colsample_bytree
- min_child_weight
- gamma
- alpha
import pandas as pd
import xgboost as xgb
from bayes_opt import BayesianOptimization
df = preprocess(pd.read_csv('./train.csv'))
train_x = df.drop('Survived', axis=1)
train_y = df.Survived
xgtrain = xgb.DMatrix(train_x, label=train_y)
def xgboost_cv(
        learning_rate,
        max_depth,
        subsample,
        colsample_bytree,
        min_child_weight,
        gamma,
        alpha):
    params = {}
    params['learning_rate'] = learning_rate
    # maximum depth of a tree, increase this value will make the model more complex / likely to be overfitting.
    params['max_depth'] = int(max_depth)
    # subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collected half of the data instances to grow trees and this will prevent overfitting.
    params['subsample'] = subsample
    # subsample ratio of columns when constructing each tree.
    params['colsample_bytree'] = colsample_bytree
    # minimum sum of instance weight (hessian) needed in a child. The larger, the more conservative the algorithm will be.
    params['min_child_weight'] = min_child_weight
    # minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
    params['gamma'] = gamma
    # L1 regularization term on weights, increase this value will make model more conservative.
    params['alpha'] = alpha
    params['objective'] = 'binary:logistic'
    cv_result = xgb.cv(
        params,
        xgtrain,
        num_boost_round=10,
        nfold=3,
        seed=0,
        # Validation error needs to decrease at least every <stopping_rounds> round(s) to continue training.
        # callbacks=[xgb.callback.early_stop(20)]
    )
    return 1.0 - cv_result['test-error-mean'].values[-1]
xgboost_cv_bo = BayesianOptimization(xgboost_cv,
    {
        'learning_rate': (0.1, 0.9),
        'max_depth': (5, 15),
        'subsample': (0.5, 1),
        'colsample_bytree': (0.1, 1),
        'min_child_weight': (1, 20),
        'gamma': (0, 10),
        'alpha': (0, 10),
    })
xgboost_cv_bo.maximize(n_iter=50)
Initialization
---------------------------------------------------------------------------------------------------------------------------------------------
Step | Time | Value | alpha | colsample_bytree | gamma | learning_rate | max_depth | min_child_weight | subsample |
1 | 00m00s | 0.77441 | 5.3061 | 0.4302 | 1.1512 | 0.4484 | 8.4632 | 16.3455 | 0.5688 |
2 | 00m00s | 0.79349 | 6.5298 | 0.2519 | 8.8275 | 0.8277 | 9.6443 | 8.1207 | 0.5426 |
3 | 00m00s | 0.79349 | 0.4152 | 0.7973 | 7.5153 | 0.7173 | 8.2608 | 12.5433 | 0.7952 |
4 | 00m00s | 0.76768 | 3.9047 | 0.9619 | 2.0264 | 0.8893 | 9.8001 | 18.0125 | 0.8254 |
5 | 00m00s | 0.77217 | 3.0779 | 0.2957 | 3.1872 | 0.4871 | 8.8120 | 10.6444 | 0.6602 |
Bayesian Optimization
---------------------------------------------------------------------------------------------------------------------------------------------
Step | Time | Value | alpha | colsample_bytree | gamma | learning_rate | max_depth | min_child_weight | subsample |
6 | 00m27s | 0.78676 | 8.5373 | 1.0000 | 10.0000 | 0.9000 | 5.0000 | 20.0000 | 0.5000 |
7 | 00m38s | 0.76768 | 10.0000 | 1.0000 | 10.0000 | 0.1000 | 5.0000 | 1.0000 | 1.0000 |
8 | 00m16s | 0.67901 | 0.2277 | 0.1000 | 10.0000 | 0.1000 | 15.0000 | 19.9844 | 0.5000 |
9 | 00m36s | 0.77666 | 0.0000 | 0.1000 | 10.0000 | 0.9000 | 5.0000 | 1.0000 | 0.5000 |
...
55 | 00m20s | 0.81818 | 0.3008 | 1.0000 | 2.5199 | 0.9000 | 13.8205 | 3.3037 | 1.0000 |
0.833894666667
{'learning_rate': 0.46665290052625796, 'max_depth': 14.985905144970891, 'subsample': 0.96857695798880505, 'colsample_bytree': 0.74722905651892868, 'min_child_weight': 1.1211600650692968, 'gamma': 0.44876616653489076, 'alpha': 0.13669004333540569}
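A minimal sketch of how the final model and the submission could then be produced with the parameters found (file names, the rounded values, and the 0.5 threshold are assumptions; preprocess is reused from the random forest section):
params = {
    'objective': 'binary:logistic',
    'learning_rate': 0.467,
    'max_depth': 14,            # int(14.98...) as in xgboost_cv
    'subsample': 0.969,
    'colsample_bytree': 0.747,
    'min_child_weight': 1.12,
    'gamma': 0.449,
    'alpha': 0.137,
}
model = xgb.train(params, xgtrain, num_boost_round=10)

test_df = pd.read_csv('./test.csv')
test_x = preprocess(test_df)                   # same preprocessing as the training data
pred = model.predict(xgb.DMatrix(test_x))      # predicted survival probabilities
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': (pred > 0.5).astype(int)
})
submission.to_csv('submission.csv', index=False)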
The accuracy increased marginally. Perhaps the original parameters were not that bad to begin with.
0.803389830508