[Python] Visualizing 160-Dimensional Medical Expense Data as Images: 10×16-Cell Image Display

May 9, 2022 00:05

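Introduction

Hello, this is Aojiru, currently studying machine learning.
In a previous article, I compressed the medical expense data (160 dimensions) down to 2 dimensions using dimensionality reduction techniques.
This time, I will visualize the 160-dimensional data as it is, by displaying each record as a 10×16-cell image.

The language is Python, and the environment is Google Colaboratory.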

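Data used

The data is df_yt_C10_sn, which I created in a previous article from the basic enrollee information and basic medical expense information of the Japan Health Insurance Association (Kyokai Kenpo). It holds 160-dimensional medical expense data (the "three elements of medical expenses" by type of care, broken down by sex and age class) for each of the 47 prefectures over 10 years.
The data has the shape
　(10 years × 47 prefectures) × (10 indicators × 2 sexes × 8 age classes)
　　= 470 rows × 160 dimensions.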

$$
\def\arraystretch{1.5}
\begin{array}{c:c|c:c:c:c}
\textsf{y} & \textsf{t} & \textsf{KperP\_1\_1\_1} & \textsf{KperP\_1\_1\_2} & \cdots & \textsf{TperN\_4\_2\_8} \\ \hline
2010 & 1 & {} & {} & {} & {} \\
2010 & 2 & {} & {} & {} & {} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
2019 & 47 & {} & {} & {} & {}
\end{array}
$$

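y: fiscal year
2010 to 2019 (10 fiscal years)
t: prefecture
1: Hokkaido, ..., 47: Okinawa (47 prefectures)
C10_s_n: the 10 indicators by sex s and age class n
KperP_1_1_1, KperP_1_1_2, ..., TperN_4_2_8 (160 items)
- C10: the "three elements of medical expenses" by type of care, given as 10 indicators of the form XperY_k (XperY for type of care k, where XperY = X / Y):
  - KperP_1: cases per person, inpatient
  - KperP_2: cases per person, outpatient
  - KperP_3: cases per person, dental
  - NperK_1: days per case, inpatient
  - NperK_2: days per case, outpatient
  - NperK_3: days per case, dental
  - TperN_1: points per day, inpatient
  - TperN_2: points per day, outpatient
  - TperN_3: points per day, dental
  - TperN_4: points per day, pharmacy dispensing
- s: sex
  1: male, 2: female
- n: age class
  1: ages 0-9, 2: ages 10-19, ..., 7: ages 60-69, 8: ages 70 and over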
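
As a rough sketch of the column layout described above, the 160 column names can be rebuilt from the three index sets. The variable names `indicators` and `column_names` are my own and do not appear in the original notebook; the ordering (indicator, then sex, then age class) is taken from the first and last column names in the table.

```python
# Rebuild the 160 column names in the order: indicator (XperY_k) -> sex (s) -> age class (n).
# `indicators` and `column_names` are illustrative names, not from the original notebook.
indicators = (
    [f"KperP_{k}" for k in (1, 2, 3)]        # cases per person: inpatient, outpatient, dental
    + [f"NperK_{k}" for k in (1, 2, 3)]      # days per case: inpatient, outpatient, dental
    + [f"TperN_{k}" for k in (1, 2, 3, 4)]   # points per day: inpatient, outpatient, dental, dispensing
)

column_names = [
    f"{ind}_{s}_{n}"
    for ind in indicators
    for s in (1, 2)           # 1: male, 2: female
    for n in range(1, 9)      # 1: ages 0-9, ..., 8: ages 70 and over
]

print(len(indicators), len(column_names))   # 10 160
print(column_names[0], column_names[-1])    # KperP_1_1_1 TperN_4_2_8
```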

# FY2010-2019 data
import pandas as pd
df = pd.read_csv('./df_yt_C10_sn.csv')
print(df.shape) # (470, 162)

# Extract only the numeric part (drop the y and t columns)
X = df.iloc[:,2:]
print(X.shape) # (470, 160)

# Scaling (each column is rescaled to the range [0, 1])
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
X = scaler.fit_transform(X)
print(X.shape) # (470, 160)
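
For reference, with its default settings MinMaxScaler rescales each column independently using that column's minimum and maximum over all rows. The following is a minimal NumPy sketch of the same arithmetic on synthetic data (the array `X_demo` is random and only stands in for the real 470×160 table).

```python
import numpy as np

# Synthetic stand-in for the 470 x 160 table (NOT the real medical expense data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(1.0, 5.0, size=(470, 160))

# Per-column min-max scaling: (x - min) / (max - min), computed column by column.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_scaled = (X_demo - col_min) / (col_max - col_min)

print(X_scaled.shape)                    # (470, 160)
print(X_scaled.min(), X_scaled.max())    # 0.0 1.0
```

Because the scaling is done per column, the color of a cell tells us where that prefecture-year sits within the range of the same indicator across all 470 records; it does not compare different indicators in absolute terms.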

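How the images are displayed

Along the column direction, the data consists of columns of the form XperY_k_s_n:
　KperP_1_1_1, KperP_1_1_2, ..., TperN_4_2_8
I reshape each record into a matrix with
・rows: XperY_k (the three elements of medical expenses (XperY) for each type of care (k)): 10 dimensions
・columns: _s_n (sex (s) and age class (n)): 2 × 8 = 16 dimensions
and display it as a 10×16-cell image (a heatmap).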
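
To make the cell layout concrete, here is a small sketch that maps a cell (row, col) of the 10×16 image back to its column name and to its position in the flat 160-element vector. The helper names `cell_to_label` and `flat_index` are hypothetical and only for illustration.

```python
INDICATORS = ["KperP_1", "KperP_2", "KperP_3",
              "NperK_1", "NperK_2", "NperK_3",
              "TperN_1", "TperN_2", "TperN_3", "TperN_4"]

def cell_to_label(row, col):
    """Map a (row, col) cell of the 10x16 image to its column name XperY_k_s_n."""
    s, n = divmod(col, 8)   # s: 0 = male, 1 = female; n: 0..7 = age class
    return f"{INDICATORS[row]}_{s + 1}_{n + 1}"

def flat_index(row, col):
    """Position of the cell in the flat 160-element vector (row-major reshape)."""
    return row * 16 + col

print(cell_to_label(0, 8))   # KperP_1_2_1
print(cell_to_label(5, 0))   # NperK_3_1_1
print(flat_index(9, 15))     # 159
```

So columns 0 to 7 are males and columns 8 to 15 are females, each running from the youngest to the oldest age class.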

For example, the first record (index 0, since Python is zero-based; y = FY2010, t = 1 Hokkaido) is:

import numpy as np
np.set_printoptions(precision=1, suppress=True) # set the NumPy print precision
print(X[0].shape)
# (160,)
print(X[0])
# [1.  0.5 0.4 0.6 0.6 0.6 0.7 0.5 1.  0.5 0.7 0.5 0.7 0.9 0.6 0.5 0.2 0.3
#  0.4 0.2 0.  0.  0.1 0.4 0.3 0.3 0.6 0.2 0.1 0.1 0.2 0.4 0.2 0.1 0.3 0.2
#  0.2 0.1 0.1 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3 0.3 0.5 0.4 0.5 0.6 0.6
#  0.6 0.5 0.3 0.3 0.4 0.6 0.6 0.8 0.6 0.6 0.4 0.4 0.4 0.5 0.4 0.4 0.3 0.4
#  0.4 0.4 0.4 0.5 0.4 0.4 0.3 0.3 0.8 0.9 0.9 1.  1.  1.  0.9 0.9 0.8 0.9
#  0.9 1.  0.9 1.  0.9 0.9 0.3 0.2 0.3 0.3 0.4 0.4 0.3 0.4 0.3 0.3 0.4 0.3
#  0.3 0.3 0.3 0.3 0.7 0.3 0.3 0.3 0.5 0.5 0.4 0.4 0.6 0.4 0.4 0.5 0.5 0.6
#  0.5 0.4 0.3 0.3 0.5 0.5 0.6 0.7 0.6 0.6 0.3 0.3 0.5 0.5 0.6 0.6 0.6 0.7
#  0.7 0.3 0.3 0.5 0.6 0.6 0.6 0.5 0.8 0.4 0.4 0.4 0.6 0.7 0.6 0.6]

Converted to a 10×16 matrix, it becomes:

print(X[0].reshape(10,16))
# [[1.  0.5 0.4 0.6 0.6 0.6 0.7 0.5 1.  0.5 0.7 0.5 0.7 0.9 0.6 0.5]
#  [0.2 0.3 0.4 0.2 0.  0.  0.1 0.4 0.3 0.3 0.6 0.2 0.1 0.1 0.2 0.4]
#  [0.2 0.1 0.3 0.2 0.2 0.1 0.1 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.3]
#  [0.3 0.5 0.4 0.5 0.6 0.6 0.6 0.5 0.3 0.3 0.4 0.6 0.6 0.8 0.6 0.6]
#  [0.4 0.4 0.4 0.5 0.4 0.4 0.3 0.4 0.4 0.4 0.4 0.5 0.4 0.4 0.3 0.3]
#  [0.8 0.9 0.9 1.  1.  1.  0.9 0.9 0.8 0.9 0.9 1.  0.9 1.  0.9 0.9]
#  [0.3 0.2 0.3 0.3 0.4 0.4 0.3 0.4 0.3 0.3 0.4 0.3 0.3 0.3 0.3 0.3]
#  [0.7 0.3 0.3 0.3 0.5 0.5 0.4 0.4 0.6 0.4 0.4 0.5 0.5 0.6 0.5 0.4]
#  [0.3 0.3 0.5 0.5 0.6 0.7 0.6 0.6 0.3 0.3 0.5 0.5 0.6 0.6 0.6 0.7]
#  [0.7 0.3 0.3 0.5 0.6 0.6 0.6 0.5 0.8 0.4 0.4 0.4 0.6 0.7 0.6 0.6]]

Displayed as an image:

%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
plt.imshow(X[0].reshape(10,16), vmin=0, vmax=1, cmap='bwr')
plt.tick_params(right=True, top=True, labelright=True, labeltop=True)
plt.xticks(np.arange(16))
plt.yticks(np.arange(10))
plt.colorbar()
plt.show()

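This produces the image for this record.
Because the colormap is 'bwr', higher values (closer to 1) appear redder and lower values (closer to 0) appear bluer.
The cells in row 0, columns 0 and 8 ((0,0) and (0,8)) are dark red, which shows that inpatient cases per person (the utilization rate) for males and females aged 0-9 are relatively high compared with the other prefecture-years.
Row 5 is red in every column, so days per case for dental care are relatively high in every sex and age class.
Conversely, rows 1 and 2 have many blue columns, which shows that cases per person (the utilization rate) for outpatient and dental care are relatively low.

Similarly, the last record, the 470th (y = FY2019, t = 47 Okinawa), is: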

print(X[470-1].reshape(10,16))
# [[0.4 0.3 0.3 0.3 0.6 0.4 0.4 0.3 0.4 0.3 0.6 0.9 0.7 0.4 0.4 0.4]
#  [0.1 0.1 0.1 0.1 0.3 0.3 0.1 0.1 0.2 0.1 0.1 0.3 0.3 0.2 0.  0.2]
#  [0.1 0.3 0.6 0.6 0.5 0.3 0.2 0.2 0.1 0.3 0.8 0.6 0.6 0.4 0.4 0.5]
#  [0.2 0.4 0.3 0.2 0.3 0.1 0.2 0.3 0.3 0.2 0.4 0.3 0.  0.1 0.1 0.4]
#  [0.1 0.  0.1 0.1 0.1 0.1 0.2 0.2 0.1 0.  0.1 0.1 0.1 0.1 0.2 0.2]
#  [0.4 0.4 0.3 0.3 0.3 0.3 0.3 0.2 0.4 0.5 0.3 0.3 0.3 0.3 0.3 0.3]
#  [0.9 0.7 0.6 0.6 0.8 0.9 0.8 0.8 0.7 0.6 0.7 0.8 0.8 0.8 0.8 0.6]
#  [0.9 0.5 0.6 0.6 0.8 0.7 0.7 0.8 0.8 0.9 0.8 0.8 0.9 0.8 0.8 0.8]
#  [0.4 0.5 0.8 0.9 0.8 0.7 0.7 0.7 0.4 0.6 0.9 0.8 0.7 0.7 0.6 0.6]
#  [0.7 0.6 0.6 0.8 0.9 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.9 0.8 0.7 0.5]]

plt.figure(figsize=(10,5))
plt.imshow(X[470-1].reshape(10,16), vmin=0, vmax=1, cmap='bwr')
plt.tick_params(right=True, top=True, labelright=True, labeltop=True)
plt.xticks(np.arange(16))
plt.yticks(np.arange(10))
plt.colorbar()
plt.show()

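This produces the image for this record.
In rows 0 to 4, cases per person (the utilization rate) and days per case are low overall, while in rows 6 to 9, medical cost per day (TperN, points per day) is high overall.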

# The 160 column names
import numpy as np
print(np.array(df.iloc[:,2:].columns).reshape(10,16))
# [['KperP_1_1_1' 'KperP_1_1_2' 'KperP_1_1_3' 'KperP_1_1_4' 'KperP_1_1_5'
#   'KperP_1_1_6' 'KperP_1_1_7' 'KperP_1_1_8' 'KperP_1_2_1' 'KperP_1_2_2'
#   'KperP_1_2_3' 'KperP_1_2_4' 'KperP_1_2_5' 'KperP_1_2_6' 'KperP_1_2_7'
#   'KperP_1_2_8']
#  ['KperP_2_1_1' 'KperP_2_1_2' 'KperP_2_1_3' 'KperP_2_1_4' 'KperP_2_1_5'
#   'KperP_2_1_6' 'KperP_2_1_7' 'KperP_2_1_8' 'KperP_2_2_1' 'KperP_2_2_2'
#   'KperP_2_2_3' 'KperP_2_2_4' 'KperP_2_2_5' 'KperP_2_2_6' 'KperP_2_2_7'
#   'KperP_2_2_8']
#  ['KperP_3_1_1' 'KperP_3_1_2' 'KperP_3_1_3' 'KperP_3_1_4' 'KperP_3_1_5'
#   'KperP_3_1_6' 'KperP_3_1_7' 'KperP_3_1_8' 'KperP_3_2_1' 'KperP_3_2_2'
#   'KperP_3_2_3' 'KperP_3_2_4' 'KperP_3_2_5' 'KperP_3_2_6' 'KperP_3_2_7'
#   'KperP_3_2_8']
#  ['NperK_1_1_1' 'NperK_1_1_2' 'NperK_1_1_3' 'NperK_1_1_4' 'NperK_1_1_5'
#   'NperK_1_1_6' 'NperK_1_1_7' 'NperK_1_1_8' 'NperK_1_2_1' 'NperK_1_2_2'
#   'NperK_1_2_3' 'NperK_1_2_4' 'NperK_1_2_5' 'NperK_1_2_6' 'NperK_1_2_7'
#   'NperK_1_2_8']
#  ['NperK_2_1_1' 'NperK_2_1_2' 'NperK_2_1_3' 'NperK_2_1_4' 'NperK_2_1_5'
#   'NperK_2_1_6' 'NperK_2_1_7' 'NperK_2_1_8' 'NperK_2_2_1' 'NperK_2_2_2'
#   'NperK_2_2_3' 'NperK_2_2_4' 'NperK_2_2_5' 'NperK_2_2_6' 'NperK_2_2_7'
#   'NperK_2_2_8']
#  ['NperK_3_1_1' 'NperK_3_1_2' 'NperK_3_1_3' 'NperK_3_1_4' 'NperK_3_1_5'
#   'NperK_3_1_6' 'NperK_3_1_7' 'NperK_3_1_8' 'NperK_3_2_1' 'NperK_3_2_2'
#   'NperK_3_2_3' 'NperK_3_2_4' 'NperK_3_2_5' 'NperK_3_2_6' 'NperK_3_2_7'
#   'NperK_3_2_8']
#  ['TperN_1_1_1' 'TperN_1_1_2' 'TperN_1_1_3' 'TperN_1_1_4' 'TperN_1_1_5'
#   'TperN_1_1_6' 'TperN_1_1_7' 'TperN_1_1_8' 'TperN_1_2_1' 'TperN_1_2_2'
#   'TperN_1_2_3' 'TperN_1_2_4' 'TperN_1_2_5' 'TperN_1_2_6' 'TperN_1_2_7'
#   'TperN_1_2_8']
#  ['TperN_2_1_1' 'TperN_2_1_2' 'TperN_2_1_3' 'TperN_2_1_4' 'TperN_2_1_5'
#   'TperN_2_1_6' 'TperN_2_1_7' 'TperN_2_1_8' 'TperN_2_2_1' 'TperN_2_2_2'
#   'TperN_2_2_3' 'TperN_2_2_4' 'TperN_2_2_5' 'TperN_2_2_6' 'TperN_2_2_7'
#   'TperN_2_2_8']
#  ['TperN_3_1_1' 'TperN_3_1_2' 'TperN_3_1_3' 'TperN_3_1_4' 'TperN_3_1_5'
#   'TperN_3_1_6' 'TperN_3_1_7' 'TperN_3_1_8' 'TperN_3_2_1' 'TperN_3_2_2'
#   'TperN_3_2_3' 'TperN_3_2_4' 'TperN_3_2_5' 'TperN_3_2_6' 'TperN_3_2_7'
#   'TperN_3_2_8']
#  ['TperN_4_1_1' 'TperN_4_1_2' 'TperN_4_1_3' 'TperN_4_1_4' 'TperN_4_1_5'
#   'TperN_4_1_6' 'TperN_4_1_7' 'TperN_4_1_8' 'TperN_4_2_1' 'TperN_4_2_2'
#   'TperN_4_2_3' 'TperN_4_2_4' 'TperN_4_2_5' 'TperN_4_2_6' 'TperN_4_2_7'
#   'TperN_4_2_8']]

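Results of the image display

Data by prefecture

I display 10 years of data for each prefecture as images, with the 10 years arranged horizontally and the 47 prefectures arranged vertically.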

plt.figure(figsize=(10*2, 47*2))
for t in range(47):
  for y in range(10):
    ax = plt.subplot(47, 10, t*10+y+1)
    plt.title('{}, {}'.format(df.iloc[t+47*y,1], df.iloc[t+47*y,0]), fontsize=16)
    plt.imshow(X[t+47*y].reshape(10,16), vmin=0, vmax=1, cmap='bwr')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
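
The index arithmetic in the loop above can be checked in isolation. This sketch assumes, as the table in the data section suggests, that the rows of df are sorted by year first and prefecture second; the helper names `data_row` and `subplot_position` are hypothetical.

```python
n_pref, n_year = 47, 10

def data_row(t, y):
    """Row of X for prefecture index t (0..46) and year index y (0..9)."""
    return t + n_pref * y

def subplot_position(t, y):
    """1-based subplot position in a 47-row x 10-column grid (row-major)."""
    return t * n_year + y + 1

rows = [data_row(t, y) for t in range(n_pref) for y in range(n_year)]
positions = [subplot_position(t, y) for t in range(n_pref) for y in range(n_year)]

print(sorted(rows) == list(range(470)))           # True: every record is drawn exactly once
print(positions == list(range(1, 471)))           # True: the grid is filled left to right, top to bottom
print(data_row(46, 9), subplot_position(46, 9))   # 469 470
```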

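Overall, in nearly every prefecture, the following trends can be seen as we move to the right (toward later years):
・days per case (rows 3 to 5) become lower (bluer)
・medical cost per day (rows 6 to 9) becomes higher (redder)

National average data

I also display 10 years of national average data as images.
I use the national average data calculated in a previous article.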

# National average data, FY2010-2019
df_zenkoku = pd.read_csv('./df_y_C10_sn.csv')
print(df_zenkoku.shape) # (10, 161)

# Extract only the numeric part (drop the y column)
X_zenkoku = df_zenkoku.iloc[:,1:]
print(X_zenkoku.shape) # (10, 160)

# Scaling (reuse the scaler fitted on the prefecture data)
X_zenkoku = scaler.transform(X_zenkoku)
print(X_zenkoku.shape) # (10, 160)

# Image display
plt.figure(figsize=(10*2, 2))
for y in range(10):
  ax = plt.subplot(1, 10, y+1)
  plt.title('{}'.format(df_zenkoku.iloc[y,0]), fontsize=16)
  plt.imshow(X_zenkoku[y].reshape(10,16), vmin=0, vmax=1, cmap='bwr')
  ax.get_xaxis().set_visible(False)
  ax.get_yaxis().set_visible(False)
plt.show()

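Figure: image display of the national average over 10 years

The overall trends seen earlier for the individual prefectures,
・days per case (rows 3 to 5) become lower (bluer)
・medical cost per day (rows 6 to 9) becomes higher (redder)
can also be confirmed in the national average.
In addition, cases per person (the utilization rate) decrease slightly for inpatient care and increase for outpatient and dental care (especially dental).

Over these 10 years, it appears that:
・the shift from inpatient to outpatient care progressed
・dental visits increased as awareness of dental health grew
・inpatient lengths of stay shortened, partly due to the spread of the DPC system
・outpatient visit days decreased because of long-term prescriptions
・medical care became more advanced

References:
・Survey of general consumer attitudes toward dental care (歯科医療に関する一般生活者意識調査)
・Overview of the Survey of Medical Institutions (dynamic survey) and the Hospital Report (医療施設(動態)調査･病院報告の概況)
・Annual trends in the average interval between consultations (平均診療間隔の年次推移)
・Trends in the average interval between consultations for returning patients (再来患者の平均診療間隔の推移)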

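Inverse transform of the PCA results

In a previous article, I compressed the 160 dimensions down to 2 with PCA. Here I apply the inverse transform to bring the result back to 160 dimensions and display it as images.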

# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
PC = pca.fit_transform(X)
print(PC.shape) # (470, 2)

# Inverse transform of the PCA result
X_pca_inv = pca.inverse_transform(PC)
print(X_pca_inv.shape) # (470, 160)

# Image display
plt.figure(figsize=(10*2, 47*2))
for t in range(47):
  for y in range(10):
    ax = plt.subplot(47, 10, t*10+y+1)
    plt.title('{}, {}'.format(df.iloc[t+47*y,1], df.iloc[t+47*y,0]), fontsize=16)
    plt.imshow(X_pca_inv[t+47*y].reshape(10,16), vmin=0, vmax=1, cmap='bwr')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()

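Compared with the images of the original data, the reconstructed images are paler (closer to white) overall, but the overall trends appear to be largely reproduced.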
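
One plausible reason for the paler look: with the default whiten=False, scikit-learn's PCA.inverse_transform returns the column means plus the scores multiplied by the components, so only the variation along the first 2 directions is kept and every cell varies less around its mean than in the original data. The following NumPy-only sketch illustrates this on synthetic data; it uses an SVD in place of the fitted `pca` object, and `X_demo` is random, not the article's data.

```python
import numpy as np

# Synthetic stand-in for the scaled 470 x 160 matrix (NOT the real data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(470, 160))

# PCA via SVD of the centered data; keep the top 2 components.
mean = X_demo.mean(axis=0)
Xc = X_demo - mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:2]                    # shape (2, 160), orthonormal rows

scores = Xc @ components.T             # forward transform, shape (470, 2)
X_rec = mean + scores @ components     # inverse transform, shape (470, 160)

# Each column of the reconstruction varies no more than the original column.
var_orig = X_demo.var(axis=0)
var_rec = X_rec.var(axis=0)
print(X_rec.shape)                                  # (470, 160)
print(bool(np.all(var_rec <= var_orig + 1e-12)))    # True
```

Values pulled toward the column means sit further from the extremes 0 and 1, which the 'bwr' colormap renders in less saturated colors.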

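Inverse transform of the PCA results for the national average

For the national average as well, I apply the inverse transform to the PCA result and compare the images with those of the original data.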

# PCA
PC_zenkoku = pca.transform(X_zenkoku)
print(PC_zenkoku.shape) # (10, 2)

# Inverse transform of the PCA result
X_zenkoku_pca_inv = pca.inverse_transform(PC_zenkoku)
print(X_zenkoku_pca_inv.shape) # (10, 160)

print(X_zenkoku[0].reshape(10,16))
# [[0.4 0.3 0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.3 0.4 0.2 0.3 0.4 0.4 0.4]
#  [0.5 0.5 0.6 0.4 0.4 0.3 0.4 0.5 0.5 0.5 0.7 0.4 0.3 0.3 0.5 0.6]
#  [0.3 0.2 0.5 0.5 0.4 0.5 0.5 0.6 0.3 0.2 0.4 0.4 0.4 0.4 0.5 0.5]
#  [0.5 0.5 0.4 0.5 0.5 0.6 0.6 0.5 0.5 0.4 0.4 0.4 0.6 0.7 0.6 0.6]
#  [0.6 0.5 0.5 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.7 0.6 0.6 0.6 0.5]
#  [0.6 0.6 0.7 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.7 0.7 0.6 0.7 0.6 0.6]
#  [0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.2 0.3 0.3 0.3]
#  [0.5 0.2 0.1 0.1 0.2 0.3 0.3 0.3 0.4 0.3 0.2 0.3 0.3 0.3 0.3 0.3]
#  [0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3]
#  [0.4 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.5 0.2 0.3 0.2 0.3 0.4 0.3 0.3]]

print(X_zenkoku_pca_inv[0].reshape(10,16))
# [[0.4 0.4 0.4 0.4 0.4 0.4 0.5 0.5 0.4 0.3 0.5 0.3 0.4 0.5 0.4 0.5]
#  [0.6 0.6 0.6 0.5 0.5 0.5 0.5 0.6 0.6 0.5 0.7 0.5 0.4 0.4 0.6 0.7]
#  [0.3 0.2 0.4 0.4 0.4 0.5 0.5 0.6 0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.6]
#  [0.6 0.5 0.4 0.5 0.6 0.6 0.6 0.6 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6]
#  [0.6 0.5 0.5 0.6 0.5 0.5 0.5 0.6 0.6 0.5 0.6 0.7 0.6 0.6 0.6 0.5]
#  [0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.5 0.6 0.6 0.6 0.6 0.6 0.6 0.6]
#  [0.4 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.3 0.3 0.3 0.3 0.2]
#  [0.5 0.2 0.2 0.2 0.3 0.3 0.3 0.3 0.4 0.3 0.2 0.3 0.3 0.3 0.3 0.4]
#  [0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.2 0.2 0.3 0.3 0.3 0.3 0.2 0.3]
#  [0.4 0.2 0.1 0.3 0.3 0.3 0.3 0.3 0.5 0.3 0.2 0.2 0.3 0.3 0.3 0.3]]

print('National average')
plt.figure(figsize=(10*2, 2))
for y in range(10):
  ax = plt.subplot(1, 10, y+1)
  plt.title('{}'.format(df_zenkoku.iloc[y,0]), fontsize=16)
  plt.imshow(X_zenkoku[y].reshape(10,16), vmin=0, vmax=1, cmap='bwr')
  ax.get_xaxis().set_visible(False)
  ax.get_yaxis().set_visible(False)
plt.show()

print('Inverse transform of the PCA result for the national average')
plt.figure(figsize=(10*2, 2))
for y in range(10):
  ax = plt.subplot(1, 10, y+1)
  plt.title('{}'.format(df_zenkoku.iloc[y,0]), fontsize=16)
  plt.imshow(X_zenkoku_pca_inv[y].reshape(10,16), vmin=0, vmax=1, cmap='bwr')
  ax.get_xaxis().set_visible(False)
  ax.get_yaxis().set_visible(False)
plt.show()

Figure: national average

Figure: inverse transform of the PCA result for the national average

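Comparing the images of the PCA result transformed back to 160 dimensions with the images of the original data, the original images are reproduced almost completely.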
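
A visual comparison can be complemented with a number. As a minimal sketch, the root-mean-square error per record could be computed as follows; the helper `rmse_per_row` is hypothetical, and the small arrays are only the first four (rounded) values of the first two rows printed above, used for illustration.

```python
import numpy as np

def rmse_per_row(X_true, X_approx):
    """Root-mean-square difference between two arrays, one value per row."""
    diff = np.asarray(X_true, dtype=float) - np.asarray(X_approx, dtype=float)
    return np.sqrt((diff ** 2).mean(axis=1))

# Illustrative excerpt only: rounded values taken from the printed output above.
original = np.array([[0.4, 0.3, 0.3, 0.3],
                     [0.5, 0.5, 0.6, 0.4]])
reconstructed = np.array([[0.4, 0.4, 0.4, 0.4],
                          [0.6, 0.6, 0.6, 0.5]])

print(rmse_per_row(original, reconstructed))
```

With the real arrays, the same function could be called as rmse_per_row(X_zenkoku, X_zenkoku_pca_inv) to get one error value per fiscal year.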

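Inverse transform of lattice points on the principal component plane

Consider the PC1×PC2 plane (the principal component plane) spanned by the first and second principal components of the PCA result.

On that plane, take a grid of lattice points.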

# Lattice points on the PC1×PC2 plane
import numpy as np
pc1s = np.arange(-3, 4+1, 1)
pc2s = np.arange(-2, 2.5+0.5, 0.5)
PC_lattice = np.zeros((len(pc1s)*len(pc2s), 2))
i = 0
for pc1 in pc1s:
  for pc2 in pc2s:
    i = i + 1
    # print(i, pc1, pc2)
    PC_lattice[i-1, 0] = pc1
    PC_lattice[i-1, 1] = pc2
print(PC_lattice.shape) # (80, 2)
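
As an alternative sketch, the same lattice can be built without an explicit counter by using np.meshgrid with indexing="ij", which keeps PC1 as the outer index and PC2 as the inner index, matching the loop above. The names `PC_lattice_alt` and `PC_lattice_loop` are my own.

```python
import numpy as np

pc1s = np.arange(-3, 4 + 1, 1)            # 8 values: -3, ..., 4
pc2s = np.arange(-2, 2.5 + 0.5, 0.5)      # 10 values: -2.0, ..., 2.5

# indexing="ij" makes g1[i, j] == pc1s[i] and g2[i, j] == pc2s[j],
# so ravel() walks through PC2 fastest, like the nested loop.
g1, g2 = np.meshgrid(pc1s, pc2s, indexing="ij")
PC_lattice_alt = np.column_stack([g1.ravel(), g2.ravel()])

# Nested-loop ordering written as a comprehension, for comparison.
PC_lattice_loop = np.array([[pc1, pc2] for pc1 in pc1s for pc2 in pc2s])

print(PC_lattice_alt.shape)                               # (80, 2)
print(np.array_equal(PC_lattice_alt, PC_lattice_loop))    # True
```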

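# NOTE: color_y, color_t and color_y_zenkoku are not defined in this article;
# they are assumed to be the color arrays prepared in the previous article's notebook.
n = 2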
for i in range(n-1):
  for j in range(i+1,n):
    if (i==0 or j==i+1):
      print('PC{} x PC{}'.format(i+1, j+1))
      plt.figure(figsize=(12,5))
      plt.subplot(1, 2, 1)
      plt.title('PC{} x PC{}, colored by y'.format(i+1, j+1))
      plt.xlabel('PC{}'.format(i+1))
      plt.ylabel('PC{}'.format(j+1))
      plt.scatter(x=PC[:,i], y=PC[:,j], c=color_y, s=50, alpha=0.3)
      plt.scatter(x=PC_zenkoku[:,i], y=PC_zenkoku[:,j], s=200, c=color_y_zenkoku)
      plt.scatter(x=PC_lattice[:,i], y=PC_lattice[:,j], s=50, c='gray')
      plt.subplot(1, 2, 2)
      plt.title('PC{} x PC{}, colored by t'.format(i+1, j+1))
      plt.xlabel('PC{}'.format(i+1))
      plt.ylabel('PC{}'.format(j+1))
      plt.scatter(x=PC[:,i], y=PC[:,j], c=color_t, s=50, alpha=0.3)
      plt.scatter(x=PC_zenkoku[:,i], y=PC_zenkoku[:,j], s=200, c='black')
      plt.scatter(x=PC_lattice[:,i], y=PC_lattice[:,j], s=50, c='gray')
      plt.show()

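I apply the inverse transform to these lattice points (2 dimensions, PC1×PC2) to bring them back to 160 dimensions and display them as images.

# Inverse transform of the lattice points
X_lattice_pca_inv = pca.inverse_transform(PC_lattice)
print(X_lattice_pca_inv.shape) # (80, 160)

# Image display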
plt.figure(figsize=(len(pc1s)*2, len(pc2s)*2))
for i in range(len(pc1s)):
  for j in range(len(pc2s)):
    # print(i,j)
    ax = plt.subplot(len(pc2s), len(pc1s), i+j*len(pc1s)+1)
    j = len(pc2s) - j - 1
    # print(i,j)
    plt.title('{}, {}'.format(pc1s[i], pc2s[j]), fontsize=16)
    plt.imshow(X_lattice_pca_inv[i*len(pc2s)+j].reshape(10,16),
               vmin=0, vmax=1, cmap='bwr')
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
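
The layout logic in the loop above is easy to misread because j is flipped after the subplot is chosen. The following sketch isolates that arithmetic; the helper name `grid_cell` is hypothetical, and the sizes 8 and 10 are len(pc1s) and len(pc2s) from the lattice definition.

```python
n1, n2 = 8, 10   # len(pc1s), len(pc2s)

def grid_cell(i, j):
    """For grid column i and grid row j (0 = top), return
    (1-based subplot position, row index into PC_lattice)."""
    position = i + j * n1 + 1        # row-major position in an n2-row x n1-column grid
    j_flipped = n2 - j - 1           # top row shows the largest PC2 value
    lattice_row = i * n2 + j_flipped
    return position, lattice_row

print(grid_cell(0, 0))   # (1, 9)   -> top-left image: PC1 = -3, PC2 = 2.5
print(grid_cell(7, 9))   # (80, 70) -> bottom-right image: PC1 = 4, PC2 = -2
```

With this flip, the image grid has the same orientation as the scatter plot: PC1 increases to the right and PC2 increases upward.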

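The first principal component PC1 roughly corresponded to the time axis (see the previous article). Looking at the lattice-point images from right to left along the PC1 direction, the following trends can be confirmed:
・days per case (rows 3 to 5) become lower (bluer)
・medical cost per day (rows 6 to 9) becomes higher (redder)
The national average moves from around (2, -0.5) to around (-3, 0) on the principal component plane, and the images reflect this as well.
The second and later principal components represented some kind of regional difference (see the previous article). Looking at the lattice-point images in the vertical direction, a striped pattern is visible, so the second principal component PC2 seems to capture differences in how cases per person and days per case behave across the types of care (inpatient, outpatient, dental).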
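
Because the inverse transform is affine (column means plus scores times components, assuming the default whiten=False), moving one lattice step along PC1 or PC2 always adds the same pattern to the image. The sketch below checks this with a stand-in model built from random numbers; `mean`, `components` and `inverse_transform` here are illustrative substitutes for the fitted `pca` object, not the article's results.

```python
import numpy as np

# Stand-in for a fitted 2-component PCA on 160 features (random, for illustration only).
rng = np.random.default_rng(0)
mean = rng.uniform(0.3, 0.7, size=160)
Q, _ = np.linalg.qr(rng.normal(size=(160, 2)))
components = Q.T                      # shape (2, 160), orthonormal rows

def inverse_transform(pc):
    """Affine map from principal component coordinates back to 160 dimensions."""
    return mean + np.asarray(pc, dtype=float) @ components

# One lattice step along PC1 (step 1) and along PC2 (step 0.5).
step_pc1 = inverse_transform([1.0, 0.0]) - inverse_transform([0.0, 0.0])
step_pc2 = inverse_transform([0.0, 0.5]) - inverse_transform([0.0, 0.0])

print(np.allclose(step_pc1, components[0]))          # True
print(np.allclose(step_pc2, 0.5 * components[1]))    # True
print(components[0].reshape(10, 16).shape)           # (10, 16)
```

With the real model, this suggests that displaying pca.components_[0].reshape(10, 16) and pca.components_[1].reshape(10, 16) as images would show directly which cells each principal component raises or lowers.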

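Conclusion

In this article, I displayed 160-dimensional medical expense data as 10×16-cell images as a way to visualize it as is.
This made it possible to check the characteristics of the 160-dimensional data visually.
When the data compressed to 2 dimensions with PCA was transformed back to 160 dimensions and compared with the original data as images, the images became paler overall but were largely reproduced.
I was also able to check the meaning of the principal components visually.

Thank you for reading to the end.
If you notice anything, I would appreciate your comments.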

