Automated Machine Learning with PyCaret

December 29, 2020, 12:16

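AutoML has become popular recently. A certain company used to sell it at a steep price, and apparently many firms were taken in and bought it, but today you can do almost the same thing for free.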

My top recommendation is PyCaret. The current version is 2.2.3, and it appears to be under active development.

As a first example, let's try classifying poisonous mushrooms.

import pandas as pd

# Load the mushroom dataset and look at the first few rows
mashroom = pd.read_csv("http://logopt.com/data/mashroom.csv")
mashroom.head()

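Every column in this dataset is categorical, so with something like scikit-learn you would have to preprocess it yourself, but PyCaret does that automatically. All you need to do is call setup.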

from pycaret.classification import *
clf = setup(data = mashroom, target = 'target', session_id=123)

Description	Value
0	session_id	123
1	Target	target
2	Target Type	Binary
3	Label Encoded	edible: 0, poisonous: 1
4	Original Data	(8123, 4)
5	Missing Values	False
6	Numeric Features	0
7	Categorical Features	3
8	Ordinal Features	False
9	High Cardinality Features	False
10	High Cardinality Method	None
11	Transformed Train Set	(5686, 20)
12	Transformed Test Set	(2437, 20)
13	Shuffle Train-Test	True
14	Stratify Train-Test	False
15	Fold Generator	StratifiedKFold
16	Fold Number	10
17	CPU Jobs	-1
18	Use GPU	False
19	Log Experiment	False
20	Experiment Name	clf-default-name
21	USI	e908
22	Imputation Type	simple
23	Iterative Imputation Iteration	None
24	Numeric Imputer	mean
25	Iterative Imputation Numeric Model	None
26	Categorical Imputer	constant
27	Iterative Imputation Categorical Model	None
28	Unknown Categoricals Handling	least_frequent
29	Normalize	False
30	Normalize Method	None
31	Transformation	False
32	Transformation Method	None
33	PCA	False
34	PCA Method	None
35	PCA Components	None
36	Ignore Low Variance	False
37	Combine Rare Levels	False
38	Rare Level Threshold	None
39	Numeric Binning	False
40	Remove Outliers	False
41	Outliers Threshold	None
42	Remove Multicollinearity	False
43	Multicollinearity Threshold	None
44	Clustering	False
45	Clustering Iteration	None
46	Polynomial Features	False
47	Polynomial Degree	None
48	Trignometry Features	False
49	Polynomial Threshold	None
50	Group Features	False
51	Feature Selection	False
52	Features Selection Threshold	None
53	Feature Interaction	False
54	Feature Ratio	False
55	Interaction Threshold	None
56	Fix Imbalance	False
57	Fix Imbalance Method	SMOTE
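
In the output above, 3 categorical features turn into 20 columns in the transformed train and test sets, which is what one would expect if each categorical level gets its own 0/1 column (one-hot encoding). The following is a minimal sketch of that idea in plain Python, not PyCaret's actual implementation; the column names and values are made up for illustration and are not the real columns of mashroom.csv.

```python
# Illustrative only: these column names and values are hypothetical,
# not the actual contents of mashroom.csv.
rows = [
    {"cap_shape": "convex", "odor": "none", "habitat": "woods"},
    {"cap_shape": "flat", "odor": "foul", "habitat": "urban"},
    {"cap_shape": "convex", "odor": "foul", "habitat": "woods"},
]


def one_hot_encode(rows):
    """Expand every categorical column into one 0/1 column per level."""
    # Collect column names and the distinct levels of each column,
    # sorted so that the output order is deterministic.
    columns = sorted({key for row in rows for key in row})
    levels = {col: sorted({row[col] for row in rows}) for col in columns}
    header = [f"{col}_{level}" for col in columns for level in levels[col]]
    encoded = [
        [1 if row[col] == level else 0 for col in columns for level in levels[col]]
        for row in rows
    ]
    return header, encoded


header, encoded = one_hot_encode(rows)
print(header)
for vector in encoded:
    print(vector)
```

Here 3 columns with 2 levels each become 6 indicator columns, in the same way that the 3 mushroom features expand to 20.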

That is a huge list of preprocessing steps, all handled automatically. Next, we search across a variety of methods for the best one.

best_model = compare_models()

Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
gbc	Gradient Boosting Classifier	0.6993	0.8015	0.7196	0.6777	0.6977	0.3991	0.4002	0.0310
lightgbm	Light Gradient Boosting Machine	0.6973	0.8062	0.6515	0.7013	0.6744	0.3925	0.3943	0.0160
catboost	CatBoost Classifier	0.6954	0.8054	0.6460	0.7001	0.6711	0.3884	0.3902	7.9400
dt	Decision Tree Classifier	0.6952	0.8050	0.6456	0.7000	0.6708	0.3880	0.3898	0.0070
et	Extra Trees Classifier	0.6952	0.8052	0.6456	0.7000	0.6708	0.3880	0.3898	0.0480
xgboost	Extreme Gradient Boosting	0.6952	0.8057	0.6460	0.6998	0.6710	0.3881	0.3898	0.0910
rf	Random Forest Classifier	0.6943	0.8050	0.6507	0.6967	0.6723	0.3866	0.3879	0.0560
knn	K Neighbors Classifier	0.6774	0.7534	0.6803	0.6699	0.6676	0.3547	0.3610	0.1320
lr	Logistic Regression	0.6627	0.7183	0.6657	0.6461	0.6557	0.3252	0.3255	0.1190
ridge	Ridge Classifier	0.6623	0.0000	0.6650	0.6459	0.6552	0.3245	0.3247	0.0050
lda	Linear Discriminant Analysis	0.6623	0.7155	0.6650	0.6459	0.6552	0.3245	0.3247	0.0070
ada	Ada Boost Classifier	0.6590	0.7140	0.6522	0.6453	0.6485	0.3174	0.3176	0.0220
svm	SVM - Linear Kernel	0.6490	0.0000	0.5614	0.6770	0.5946	0.2930	0.3084	0.0070
nb	Naive Bayes	0.5405	0.7039	0.9880	0.5123	0.6747	0.1079	0.2179	0.0060
qda	Quadratic Discriminant Analysis	0.4824	0.0000	1.0000	0.4824	0.6508	0.0000	0.0000	0.0060

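This time Gradient Boosting Classifier came out best, with an accuracy of 0.6993. This is a hard problem, so that is a decent result. Incidentally, building an interpretable tree as an optimal decision tree of depth 3 gives an accuracy of 0.70257, which is slightly better.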

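Since version 2.2, PyCaret can also handle time series data. As shown below, set fold_strategy to 'timeseries' and data_split_shuffle to False. The point at which the validation data begins is specified with train_size.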

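from pycaret.regression import *   # import the regression functions
# df (a DataFrame in time order) and ratio (the training fraction) must already be defined
reg = setup(df, target = 'demand', session_id=123,
            data_split_shuffle=False, fold_strategy='timeseries',
            fold=2, train_size=ratio)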
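
To make the effect of these settings concrete, here is a minimal sketch of an unshuffled holdout split followed by expanding-window folds, written in plain Python. It illustrates the general idea only and is not PyCaret's code; the function names holdout_split and expanding_window_folds, and the row counts, are hypothetical.

```python
def holdout_split(n_rows, train_size):
    """Split row indices in order: the first part trains, the rest is held out."""
    cut = int(n_rows * train_size)
    return list(range(cut)), list(range(cut, n_rows))


def expanding_window_folds(n_rows, n_folds):
    """Yield (train, validation) index lists in time order.

    The training window grows from fold to fold, and each validation
    block lies strictly after the rows used for training.
    """
    block = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = n_rows - (n_folds - k + 1) * block
        yield list(range(train_end)), list(range(train_end, train_end + block))


# 12 rows in time order, 75% of them used for training (values chosen for illustration)
train_idx, test_idx = holdout_split(12, 0.75)
print("train:", train_idx, "test:", test_idx)

# Two folds inside the training part, as with fold=2
folds = list(expanding_window_folds(len(train_idx), 2))
for fit_idx, val_idx in folds:
    print("fit:", fit_idx, "validate:", val_idx)
```

Because nothing is shuffled, every validation row is later in time than the rows the model was fitted on, which is the point of combining data_split_shuffle=False with a time series fold strategy.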

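On simple time series data it seems to give decent results. I expanded the dates into categorical data using fast.ai and compared the result with deep learning. For time series data with the customer and product sliced out (promotion data only), it can apparently beat deep learning. I tried only one example, but where deep learning had an MSE (mean squared error) of about 8, the best method brought it down to about 3. (Incidentally, Sony's Prediction One scored 30 :-)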

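With deep learning, customer and product data can also be used through embeddings. When I tried this on the same data, deep learning reached an MSE of about 5, whereas the best method (Gradient Boosting Regressor) scored 360, which is hopeless. The likely cause is that week, day of week, and month were fed in as integers rather than as embeddings. As a general rule, deep learning does better as the amount of data grows, while classical machine learning is the better choice when data is scarce, partly because it is so easy to use.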

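I do not know how to use embeddings in PyCaret, but simply turning day of week and month into categorical variables should improve things, at the cost of a longer run time.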
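
As a rough sketch of that idea, the calendar fields can be emitted as string labels instead of integers, so that a downstream tool treats them as categories rather than as ordered numbers. The helper name calendar_categories and the label formats below are my own choices for illustration, and I have not measured how much this helps.

```python
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def calendar_categories(day):
    """Return day of week, ISO week, and month as string labels."""
    return {
        "weekday": WEEKDAYS[day.weekday()],      # Monday is index 0
        "week": f"W{day.isocalendar()[1]:02d}",  # ISO week number, zero-padded
        "month": MONTHS[day.month - 1],
    }


features = calendar_categories(date(2020, 12, 29))
print(features)
```

Columns built this way could then be listed in setup's categorical_features argument so that they are encoded as categories instead of being read as numbers.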
