I read "Pythonではじめる機械学習" (Introduction to Machine Learning with Python) to review data science

きつね

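September 12, 2023, 20:56

Three years before enrolling at AIIT (in the 2020 academic year), I earned credits for several data science courses as a non-degree student. My memory of them has faded, so I bought this book and decided to read it cover to cover as a review.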

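Setup

Clone the official sample repository to your local machine.

Install the required packages by following the README. I also installed JupyterLab by following the linked instructions.

Launching JupyterLab and opening a sample file from the cloned repository shows it as a notebook. I worked through the material this way.

(Note) I must have set up the environment with the same steps when I took the AIIT courses, but I have since replaced my PC, so I had to set everything up again.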

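Caveats

Running the sample code as-is produced errors in several places with the package versions in my environment.

The Boston housing dataset has an ethical problem, so load_boston was removed from scikit-learn in version 1.2, and the sample code that imports it fails at runtime.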

ImportError:
`load_boston` has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

for the California housing dataset and::

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle.
"Racist data destruction?"
<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>

[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>

Downgrading scikit-learn apparently also works, but I followed the message and added the following code to datasets.py instead.

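import numpy as np
import pandas as pd

# Each record in the source file spans two lines, so stitch the row pairs back together.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]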
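
To see why the slicing in that snippet reassembles the records, here is a minimal offline sketch. It assumes the source file stores each record on two lines (11 values, then 2 values plus the target), which is what the slicing implies. The toy text and the name toy_text are made up for illustration and are not real Boston data.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy text in the same layout as the source file:
# each record spans two lines (11 values, then 2 values plus the target).
toy_text = (
    "1 2 3 4 5 6 7 8 9 10 11\n"
    "12 13 99.0\n"
    "21 22 23 24 25 26 27 28 29 30 31\n"
    "32 33 88.0\n"
)

# Short lines are padded with NaN, giving an 11-column frame.
raw_df = pd.read_csv(io.StringIO(toy_text), sep=r"\s+", header=None)

# Even rows hold features 1-11; odd rows hold features 12-13 and the target.
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

print(data.shape)  # (2, 13)
print(target)
```

Each row of data ends up with 13 features, and target picks the third value from every second line.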

Code that uses np.bool also raised an error, which I fixed by changing it to np.bool_.

AttributeError: module 'numpy' has no attribute 'bool'.
`np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
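
As a small sketch of the two replacements the message mentions, the snippet below builds the same boolean array with the builtin bool and with np.bool_. The array and variable names are hypothetical stand-ins, not the actual lines from the sample code.

```python
import numpy as np

# Hypothetical stand-in for a line that used to read: np.zeros(5, dtype=np.bool)
mask_builtin = np.zeros(5, dtype=bool)     # builtin bool, as the message suggests
mask_scalar = np.zeros(5, dtype=np.bool_)  # NumPy scalar type, the fix used here

# Both spellings resolve to the same boolean dtype.
same_dtype = mask_builtin.dtype == mask_scalar.dtype
print(same_dtype)  # True
```

Either spelling should work when np.bool is used as a dtype, so the choice is mostly a matter of taste.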

graphviz also raised an error, which I resolved with the following steps.
・pip install dtreeviz
・Install graphviz manually and add it to the system PATH environment variable
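
One way to confirm that the PATH step took effect is to look for the Graphviz dot executable from Python. This is only a sketch: the helper name graphviz_on_path is hypothetical, and it assumes the manual install provides an executable named dot.

```python
import shutil


def graphviz_on_path(executable: str = "dot") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if graphviz_on_path():
    print("Graphviz 'dot' found at:", shutil.which("dot"))
else:
    print("Graphviz 'dot' not found; check the PATH environment variable.")
```

If the check fails right after editing PATH, restarting the terminal or JupyterLab so that the new value is picked up may help.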

The line X_train = data_train.date[:, np.newaxis] raised the error below, so I changed it to X_train = data_train.date.to_numpy()[:, np.newaxis]. As the message instructs, this converts the column to a NumPy array before indexing.

ValueError: Multi-dimensional indexing (e.g. `obj[:, None]`) is no longer supported. Convert to a numpy array before indexing instead.
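
The fix can be tried in isolation with a tiny frame. The data_train below is a hypothetical stand-in with made-up values; only the presence of a date column matters.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_train; the values are made up.
data_train = pd.DataFrame(
    {"date": [2000.0, 2001.0, 2002.0], "price": [3.0, 2.0, 1.0]}
)

# Convert the Series to a NumPy array first, then add the second axis.
X_train = data_train.date.to_numpy()[:, np.newaxis]

print(X_train.shape)  # (3, 1)
```

The result is a two-dimensional array with one column, which is the shape scikit-learn estimators expect for a single feature.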

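Thoughts

As I had hoped, I was able to review the main machine learning techniques in Python from start to finish, which was good. That said, some explanations felt abbreviated. Logistic regression, for example, is only touched on as one of the supervised learning methods, with no explanation of the sigmoid curve. For those parts, rereading my lecture materials from that time filled in the gaps in my memory. The book also introduced techniques that (as far as I recall) were not covered in the lectures, such as regression with decision trees, so overall I think it was worthwhile.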
