K-Fold Cross-Validation [KFold]

February 12, 2022, 16:15

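When train_test_split is used, the machine learning model is trained without ever using the test data, so part of the data never contributes to training and the score depends on a single split. K-fold cross-validation is a validation method that addresses this concern.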

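Cross-Validation

Cross-validation evaluates the accuracy of a data analysis method in the following five steps.

1. Split the data.
2. Train on part of the data.
3. Evaluate the method's accuracy on the remaining data.
4. Change which part of the data is used for training, and repeat steps 2 and 3.
5. Evaluate the method's accuracy based on the results of the repeated tests.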
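
The five steps above can be sketched by hand before turning to a library helper. This is only a minimal illustration under stated assumptions: the synthetic data, the choice of LinearRegression, and the variable names (folds, scores) are not from the original article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=90)

# Step 1: split the (shuffled) sample indices into k groups
k = 3
folds = np.array_split(rng.permutation(len(X)), k)

scores = []
for i in range(k):
    # Step 2: train on every group except the i-th one
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Step 3: evaluate on the held-out group (R^2 for a regressor)
    scores.append(model.score(X[test_idx], y[test_idx]))
    # Step 4: the loop repeats steps 2 and 3 with a different held-out group

# Step 5: summarise the per-fold results
mean_score = float(np.mean(scores))
print(scores, mean_score)
```

Every sample is used for testing exactly once and for training k - 1 times, which is what distinguishes this procedure from a single train/test split.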

scikit-learn provides this as the KFold class (in sklearn.model_selection).

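[Arguments of KFold()]
・n_splits: number of folds to split the data into; the default is 5
・shuffle: whether to shuffle the data before splitting (True or False); with False, the default, each fold is a group of consecutive indices
・random_state: seed for the random number generator; it only has an effect when shuffle=True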
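
To make the effect of these arguments concrete, here is a small sketch on ten toy samples. The toy array and the variable names (ordered, shuffled_a, shuffled_b) are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy samples with indices 0..9

# shuffle=False (the default): each test fold is a block of consecutive indices
ordered = [test.tolist() for _, test in KFold(n_splits=5).split(X)]
print(ordered)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# shuffle=True: the indices are shuffled before being split into folds;
# fixing random_state makes the shuffled split reproducible
kf_a = KFold(n_splits=5, shuffle=True, random_state=0)
kf_b = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_a = [test.tolist() for _, test in kf_a.split(X)]
shuffled_b = [test.tolist() for _, test in kf_b.split(X)]
print(shuffled_a)
print(shuffled_a == shuffled_b)  # True
```

Shuffling is usually worth enabling when the rows are stored in a meaningful order (for example sorted by target value), since consecutive blocks would then not be representative of the whole dataset.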

In [1]: # Import libraries
       from sklearn.model_selection import KFold
       # Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
       from sklearn.datasets import load_boston

       # Load the data
       boston = load_boston()

       # Cross-validation (3 folds)
       kf = KFold(n_splits=3, shuffle=True, random_state=0)
       for fold, (train_index, test_index) in enumerate(kf.split(boston.data, boston.target)):
           print("Fold：{}".format(fold), "len(train_index)：{}".format(len(train_index)), "len(test_index)：{}".format(len(test_index)))

Out[1]: Fold：0 len(train_index)：337 len(test_index)：169
       Fold：1 len(train_index)：337 len(test_index)：169
       Fold：2 len(train_index)：338 len(test_index)：168
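
The cell above only prints the fold sizes. As a hedged follow-up, the sketch below passes the same kind of KFold object to cross_val_score to get one score per fold. The synthetic dataset from make_regression (sized at 506 samples and 13 features to mirror the Boston data) and the LinearRegression model are assumptions, not part of the original article.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption): 506 samples, 13 features
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Same splitter settings as in the cell above
kf = KFold(n_splits=3, shuffle=True, random_state=0)

# cross_val_score fits a fresh copy of the model on each training fold and
# scores it on the matching test fold (R^2 by default for regressors)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(scores)
print(scores.mean())
```

Reporting the mean together with the individual fold scores gives a rough sense of how much the result depends on the particular split.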

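・Validation methods such as K-fold cross-validation are important techniques that are also used in contract analytics work and data science competitions.
・Next time, we will look at stratified K-fold cross-validation, a validation method similar to K-fold cross-validation.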
