TensorFlow walkthrough: loading CSV data with tf.data

August 8, 2020, 17:12

Loading CSV data with tf.data https://www.tensorflow.org/tutorials/load_data/csv
These are my notes on the places where I got stuck while reading the TensorFlow tutorial above.

What is np.set_printoptions(precision=3, suppress=True)?

import numpy as np

np.set_printoptions(precision=3, suppress=True)

precision=3
limits printed floats to three digits after the decimal point.

print(np.array([0.123456789]))

[0.123]

The number 0.123456789 is now printed only as far as 0.123.

suppress
is the option that controls whether scientific (exponent) notation is suppressed.

np.set_printoptions(precision=3, suppress=False)
print(np.array([0.0000123456789]))

[1.235e-05]

With suppress=False, 0.0000123456789 is displayed as 1.235e-05.

1.235e-05 means 1.235 × 10 to the power of -5.

np.set_printoptions(precision=3, suppress=True)
print(np.array([0.0000123456789]))

[0.]

With suppress=True, 0.0000123456789 is displayed as 0.
(With precision=3 you might expect 0.000, but every digit after the decimal point is 0, so the trailing zeros are omitted.)

The default value of suppress is False.
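
The two settings can also be compared side by side without touching the global print options. This is only a minimal sketch: it assumes NumPy 1.15 or later (for the np.printoptions context manager), and the variable names are mine, not the tutorial's.

```python
import numpy as np

small = np.array([0.0000123456789])

# suppress=False (the default): very small values use scientific notation.
with np.printoptions(precision=3, suppress=False):
    as_scientific = str(small)

# suppress=True: fixed-point notation, so the value rounds to zero at 3 digits.
with np.printoptions(precision=3, suppress=True):
    as_fixed = str(small)

print(as_scientific)
print(as_fixed)
```

Because the context manager restores the previous settings on exit, nothing printed later in the notebook is affected.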

$ head {train_file_path} raises an error

$ head {train_file_path}

The tutorial shows
$ head {train_file_path}
so I typed it into Google Colaboratory exactly as written, but it raised an error.

Dropping the dollar sign and entering
head {train_file_path}
also raised an error.

!head {train_file_path}

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n

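Putting ! at the start of the line made it run correctly.

In Google Colaboratory, a line that starts with ! is run as a Linux shell command.

head
is a Linux command that prints only the first 10 lines of a file.

To use a Python variable inside a ! line, wrap the variable name in {}.

train_file_path
holds the string
/root/.keras/datasets/train.csv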

So instead of

!head {train_file_path}

you could write

!head /root/.keras/datasets/train.csv

and it would mean the same thing.
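
If you would rather not depend on the ! syntax (it only works in notebook environments such as Colaboratory), the same peek at the file can be done in plain Python. A minimal sketch; head_lines is a hypothetical helper of my own, not something from the tutorial.

```python
from itertools import islice


def head_lines(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so the whole file is never read.
        return [line.rstrip("\n") for line in islice(f, n)]
```

In the notebook you would call it as head_lines(train_file_path) and print each returned line.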

What does get_dataset() do?

def get_dataset(file_path, **kwargs):
 dataset = tf.data.experimental.make_csv_dataset(
     file_path,
     batch_size=5, # Artificially small to make examples easier to show.
     label_name=LABEL_COLUMN,
     na_value="?",
     num_epochs=1,
     ignore_errors=True, 
     **kwargs)
 return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

Let's pull out just one batch to see what the data looks like.

for x in raw_train_data.take(1):
   print(x)

(OrderedDict([('sex', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'male', b'male', b'female', b'male', b'male'], dtype=object)>), ('age', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 9., 60., 33., 32., 47.], dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 1, 0, 0], dtype=int32)>), ('parch', <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 0, 0, 0], dtype=int32)>), ('fare', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([15.9 , 79.2 , 53.1 ,  8.05,  7.25], dtype=float32)>), ('class', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Third', b'First', b'First', b'Third', b'Third'], dtype=object)>), ('deck', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'unknown', b'B', b'E', b'E', b'unknown'], dtype=object)>), ('embark_town', <tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'Southampton', b'Cherbourg', b'Southampton', b'Southampton',
      b'Southampton'], dtype=object)>), ('alone', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'n', b'n', b'n', b'y', b'y'], dtype=object)>)]), <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 1, 1, 0], dtype=int32)>)

The contents are shown above.

What is batch_size=5?

Because of the
batch_size=5
option, the data comes in batches of five records.

('sex', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'male', b'male', b'female', b'male', b'male'], dtype=object)>)
That is five values for sex,

('age', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 9., 60., 33., 32., 47.], dtype=float32)>)
five values for age,

and so on: every column comes as a set of five.
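
To get a feel for this "one list of five per column" layout, here is a rough imitation in plain Python. It is only a sketch of the idea, not how make_csv_dataset is implemented: column_batches is a hypothetical helper, the sample is trimmed to three columns, and the values stay as strings.

```python
import csv
import io

SAMPLE = """survived,sex,age
0,male,22.0
1,female,38.0
1,female,26.0
1,female,35.0
0,male,28.0
0,male,2.0
1,female,27.0
"""


def column_batches(text, batch_size):
    """Yield dicts mapping column name -> list of up to batch_size values."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {name: [row[name] for row in chunk] for name in chunk[0]}


batches = list(column_batches(SAMPLE, batch_size=5))
print(batches[0]["sex"])  # one value per record in the first batch
```

The seven sample rows produce one full batch of five and a final batch holding the remaining two.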

What is label_name=LABEL_COLUMN?

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n

That was the output of head.

The leftmost column is survived, and reading down it the values are 0, 1, 1, 1, 0.
Next to it is sex, and reading down it the values are male, female, female, female, male.

In the raw file, the columns
survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
are all treated alike.

This time we build a model that uses the data in
sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
to predict the value of
survived

So we split the columns into
survived
as the label, and
sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
as the input data.

LABEL_COLUMN = 'survived'

def get_dataset(file_path, **kwargs):
 dataset = tf.data.experimental.make_csv_dataset(
     file_path,
     batch_size=5, # Artificially small to make examples easier to show.
     label_name=LABEL_COLUMN,
     na_value="?",
     num_epochs=1,
     ignore_errors=True, 
     **kwargs)
 return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

The call to
tf.data.experimental.make_csv_dataset()
passes
label_name=LABEL_COLUMN
as an argument.

LABEL_COLUMN
holds the string
survived

label_name=LABEL_COLUMN
This option is what separates the data from the labels.

Looking at the first batch with

for x in raw_train_data.take(1):
   print(x)

gives

(OrderedDict([('sex', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'male', b'male', b'female', b'male', b'male'], dtype=object)>), ('age', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 9., 60., 33., 32., 47.], dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 1, 0, 0], dtype=int32)>), ('parch', <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 0, 0, 0], dtype=int32)>), ('fare', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([15.9 , 79.2 , 53.1 ,  8.05,  7.25], dtype=float32)>), ('class', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Third', b'First', b'First', b'Third', b'Third'], dtype=object)>), ('deck', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'unknown', b'B', b'E', b'E', b'unknown'], dtype=object)>), ('embark_town', <tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'Southampton', b'Cherbourg', b'Southampton', b'Southampton',
      b'Southampton'], dtype=object)>), ('alone', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'n', b'n', b'n', b'y', b'y'], dtype=object)>)]), <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 1, 1, 0], dtype=int32)>)

That is what comes out.

It is cluttered and hard to read, but on closer inspection it is a two-element tuple of the form
(○○, ××)

The ○○ part is actually the following data:
OrderedDict([('sex', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'male', b'male', b'female', b'male', b'male'], dtype=object)>), ('age', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 9., 60., 33., 32., 47.], dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 1, 0, 0], dtype=int32)>), ('parch', <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 0, 0, 0], dtype=int32)>), ('fare', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([15.9 , 79.2 , 53.1 , 8.05, 7.25], dtype=float32)>), ('class', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Third', b'First', b'First', b'Third', b'Third'], dtype=object)>), ('deck', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'unknown', b'B', b'E', b'E', b'unknown'], dtype=object)>), ('embark_town', <tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'Southampton', b'Cherbourg', b'Southampton', b'Southampton',
b'Southampton'], dtype=object)>), ('alone', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'n', b'n', b'n', b'y', b'y'], dtype=object)>)]),

The ×× part is actually the following data:
<tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 1, 1, 0], dtype=int32)>

In other words, it is a tuple of the form
(data, label)

The data comes in sets of five, like
[b'male', b'male', b'female', b'male', b'male']
[ 9., 60., 33., 32., 47.]

and the labels also come as a set of five, like
[1, 1, 1, 1, 0]
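
The (data, label) split can be pictured as popping one column out of the batch. A minimal sketch in plain Python; split_label is a hypothetical helper and the batch is hand-written sample data, so this only illustrates the shape of the result, not TensorFlow's implementation.

```python
from collections import OrderedDict


def split_label(batch, label_name):
    """Return (features, labels), leaving the original batch untouched."""
    features = OrderedDict(batch)        # shallow copy so batch is not modified
    labels = features.pop(label_name)    # remove the label column from the features
    return features, labels


batch = OrderedDict([
    ("survived", [0, 1, 1, 1, 0]),
    ("sex", ["male", "female", "female", "female", "male"]),
    ("age", [22.0, 38.0, 26.0, 35.0, 28.0]),
])
features, labels = split_label(batch, "survived")
print(list(features), labels)
```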

What does show_batch(dataset) do?

def show_batch(dataset):
 for batch, label in dataset.take(1):
   for key, value in batch.items():
     print("{:20s}: {}".format(key, value.numpy()))

The line
for batch, label in dataset.take(1)
unpacks each element of the dataset. Given data like this:

(OrderedDict([('sex', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'male', b'male', b'female', b'male', b'male'], dtype=object)>), ('age', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 9., 60., 33., 32., 47.], dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 1, 0, 0], dtype=int32)>), ('parch', <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 0, 0, 0], dtype=int32)>), ('fare', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([15.9 , 79.2 , 53.1 , 8.05, 7.25], dtype=float32)>), ('class', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Third', b'First', b'First', b'Third', b'Third'], dtype=object)>), ('deck', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'unknown', b'B', b'E', b'E', b'unknown'], dtype=object)>), ('embark_town', <tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'Southampton', b'Cherbourg', b'Southampton', b'Southampton',
b'Southampton'], dtype=object)>), ('alone', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'n', b'n', b'n', b'y', b'y'], dtype=object)>)]), <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 1, 1, 0], dtype=int32)>)

the first half, this part,
OrderedDict([('sex', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'male', b'male', b'female', b'male', b'male'], dtype=object)>), ('age', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 9., 60., 33., 32., 47.], dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 1, 0, 0], dtype=int32)>), ('parch', <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 0, 0, 0], dtype=int32)>), ('fare', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([15.9 , 79.2 , 53.1 , 8.05, 7.25], dtype=float32)>), ('class', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Third', b'First', b'First', b'Third', b'Third'], dtype=object)>), ('deck', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'unknown', b'B', b'E', b'E', b'unknown'], dtype=object)>), ('embark_town', <tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'Southampton', b'Cherbourg', b'Southampton', b'Southampton',
b'Southampton'], dtype=object)>), ('alone', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'n', b'n', b'n', b'y', b'y'], dtype=object)>)]),
goes into batch,

and the second half, this part,
<tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 1, 1, 0], dtype=int32)>
goes into label.

The line
for key, value in batch.items():

walks through the contents of batch, that is, the following,
OrderedDict([('sex', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'male', b'male', b'female', b'male', b'male'], dtype=object)>), ('age', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 9., 60., 33., 32., 47.], dtype=float32)>), ('n_siblings_spouses', <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 1, 0, 0], dtype=int32)>), ('parch', <tf.Tensor: shape=(5,), dtype=int32, numpy=array([1, 1, 0, 0, 0], dtype=int32)>), ('fare', <tf.Tensor: shape=(5,), dtype=float32, numpy=array([15.9 , 79.2 , 53.1 , 8.05, 7.25], dtype=float32)>), ('class', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'Third', b'First', b'First', b'Third', b'Third'], dtype=object)>), ('deck', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'unknown', b'B', b'E', b'E', b'unknown'], dtype=object)>), ('embark_town', <tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'Southampton', b'Cherbourg', b'Southampton', b'Southampton',
b'Southampton'], dtype=object)>), ('alone', <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'n', b'n', b'n', b'y', b'y'], dtype=object)>)]),

assigning them pair by pair:
key: sex
value: [b'male', b'male', b'female', b'male', b'male']

key: age
value: [ 9., 60., 33., 32., 47.]

and so on.

show_batch(raw_train_data)

sex                 : [b'male' b'female' b'male' b'female' b'male']
age                 : [ 0.83 27.   32.5  28.   28.  ]
n_siblings_spouses  : [0 1 1 1 1]
parch               : [2 0 0 0 0]
fare                : [29.    21.    30.071 89.104 82.171]
class               : [b'Second' b'Second' b'Second' b'First' b'First']
deck                : [b'unknown' b'unknown' b'unknown' b'C' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Cherbourg' b'Cherbourg' b'Cherbourg']
alone               : [b'n' b'n' b'n' b'n' b'n']

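And the result of running show_batch() is shown above.

It exists only so that whoever is working with the data can check what is inside it,
so it has no bearing on the implementation of the model itself.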
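
The neatly aligned colons in that output come from the {:20s} format spec. A small sketch of just that formatting step; format_column is a hypothetical name of mine, and plain lists stand in for the tensors.

```python
def format_column(key, value):
    """Format one column the way show_batch() prints it."""
    # {:20s} left-aligns the key and pads it with spaces to 20 characters.
    return "{:20s}: {}".format(key, value)


line = format_column("sex", ["male", "female"])
print(line)
```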

What does temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMNS) do?

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n

If the source data looks like the above,

temp_dataset = get_dataset(train_file_path)

is all you need to load it.

However,

0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n

if the source data looks like this (no column names on the first line),

CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']
temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMNS)

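set the column names yourself like this.

The data used here already includes the column names, so this step is not actually needed.
The tutorial only touches on it briefly by way of explanation.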
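
The standard library has a close analogue of column_names, which helped me see what the option is for. A minimal sketch using csv.DictReader with its fieldnames argument; the two-row sample and the shortened column list are made up for illustration.

```python
import csv
import io

CSV_COLUMNS = ["survived", "sex", "age"]
HEADERLESS = "0,male,22.0\n1,female,38.0\n"

# With fieldnames given, DictReader treats the first line as data, not as a header.
rows = list(csv.DictReader(io.StringIO(HEADERLESS), fieldnames=CSV_COLUMNS))
print(rows[0]["sex"])
```

Without fieldnames, the first data row would be consumed as the header, which is the same problem column_names avoids.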

shuffle=False

def get_dataset(file_path, **kwargs):
 dataset = tf.data.experimental.make_csv_dataset(
     file_path,
     batch_size=5, # Artificially small to make examples easier to show.
     label_name=LABEL_COLUMN,
     na_value="?",
     num_epochs=1,
     ignore_errors=True, 
     **kwargs)
 return dataset

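tf.data.experimental.make_csv_dataset()
defaults to
shuffle=True

Unless you pass
shuffle=False
as an argument, it behaves as
shuffle=True

In other words, the records come back in shuffled order every time you read the file through
get_dataset()
That gets confusing when you are checking how things work, so while following the tutorial it is better to set
shuffle=False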

What is DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]?

SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
temp_dataset = get_dataset(train_file_path, 
                          select_columns=SELECT_COLUMNS,
                          shuffle=False,
                          column_defaults=DEFAULTS)

show_batch(temp_dataset)

age                 : [22. 38. 26. 35. 28.]
n_siblings_spouses  : [1. 1. 0. 1. 0.]
parch               : [0. 0. 0. 0. 0.]
fare                : [ 7.25  71.283  7.925 53.1    8.458]

DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
What are these numbers that appear out of nowhere?
Why is only the first one 0 while the rest are 0.0?

Let's change
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
to
DEFAULTS = [0, 0.0, 0, 0, 0.0]
and run it again.

SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']
DEFAULTS = [0, 0.0, 0, 0, 0.0]
temp_dataset = get_dataset(train_file_path, 
                          select_columns=SELECT_COLUMNS,
                          shuffle=False,
                          column_defaults=DEFAULTS)

show_batch(temp_dataset)

age                 : [22. 38. 26. 35. 28.]
n_siblings_spouses  : [1 1 0 1 0]
parch               : [0 0 0 0 0]
fare                : [ 7.25  71.283  7.925 53.1    8.458]

DEFAULTS = [0, 0.0, 0, 0, 0.0]
turned out to be a type specification: column_defaults gives one default value per selected column, and the type of each default decides how that column is parsed.

With
SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']
DEFAULTS = [0, 0.0, 0, 0, 0.0]
the correspondence is
survived: 0
age: 0.0
n_siblings_spouses: 0
parch: 0
fare: 0.0

As a result, the columns given 0
(survived, n_siblings_spouses, parch)
are treated as int, and the columns given 0.0
(age, fare)
are treated as float.

With
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
the output was
n_siblings_spouses : [1. 1. 0. 1. 0.]
parch : [0. 0. 0. 0. 0.]

whereas with
DEFAULTS = [0, 0.0, 0, 0, 0.0]
it is
n_siblings_spouses : [1 1 0 1 0]
parch : [0 0 0 0 0]
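
The "type of the default decides the type of the column" idea can be imitated in a few lines of plain Python. This is a simplified sketch of the concept only, not TensorFlow's parser: parse_row is a hypothetical helper, and the sample row with an empty parch field is invented to show the default being filled in.

```python
def parse_row(fields, defaults):
    """Convert CSV strings using the type of each default; empty fields get the default."""
    parsed = []
    for text, default in zip(fields, defaults):
        if text == "":
            parsed.append(default)              # missing value: fall back to the default
        else:
            parsed.append(type(default)(text))  # int("1") or float("1"), depending on the default
    return parsed


# Columns: survived, age, n_siblings_spouses, parch (empty), fare
row = ["0", "22.0", "1", "", "7.25"]
as_float = parse_row(row, [0, 0.0, 0.0, 0.0, 0.0])
as_int = parse_row(row, [0, 0.0, 0, 0, 0.0])
print(as_float)
print(as_int)
```

The same text "1" ends up as a float in the first call and as an int in the second, which mirrors the difference between the two show_batch() outputs.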

What does def pack(features, label): do?

def pack(features, label):
 return tf.stack(list(features.values()), axis=-1), label

packed_dataset = temp_dataset.map(pack)

for features, labels in packed_dataset.take(1):
 print(features.numpy())
 print()
 print(labels.numpy())

[[22.     1.     0.     7.25 ]
[38.     1.     0.    71.283]
[26.     0.     0.     7.925]
[35.     1.     0.    53.1  ]
[28.     0.     0.     8.458]]

[0 1 1 1 0]

First, check the contents of temp_dataset.

next(iter(temp_dataset))

(OrderedDict([('age',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([22., 38., 26., 35., 28.], dtype=float32)>),
             ('n_siblings_spouses',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([1., 1., 0., 1., 0.], dtype=float32)>),
             ('parch',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([0., 0., 0., 0., 0.], dtype=float32)>),
             ('fare',
              <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 7.25 , 71.283,  7.925, 53.1  ,  8.458], dtype=float32)>)]),
<tf.Tensor: shape=(5,), dtype=int32, numpy=array([0, 1, 1, 1, 0], dtype=int32)>)

age holds [22., 38., 26., 35., 28.]
n_siblings_spouses holds [1., 1., 0., 1., 0.]
and so on, five values per column.

Let's split this into data and labels.

features, labels = next(iter(temp_dataset))
features

OrderedDict([('age',
             <tf.Tensor: shape=(5,), dtype=float32, numpy=array([22., 38., 26., 35., 28.], dtype=float32)>),
            ('n_siblings_spouses',
             <tf.Tensor: shape=(5,), dtype=float32, numpy=array([1., 1., 0., 1., 0.], dtype=float32)>),
            ('parch',
             <tf.Tensor: shape=(5,), dtype=float32, numpy=array([0., 0., 0., 0., 0.], dtype=float32)>),
            ('fare',
             <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 7.25 , 71.283,  7.925, 53.1  ,  8.458], dtype=float32)>)])

Above is the data.

labels

<tf.Tensor: shape=(5,), dtype=int32, numpy=array([0, 1, 1, 1, 0], dtype=int32)>

Above is the label.

features (the data) is an OrderedDict, so we take out just the values.

features.values()

odict_values([<tf.Tensor: shape=(5,), dtype=float32, numpy=array([22., 38., 26., 35., 28.], dtype=float32)>, <tf.Tensor: shape=(5,), dtype=float32, numpy=array([1., 1., 0., 1., 0.], dtype=float32)>, <tf.Tensor: shape=(5,), dtype=float32, numpy=array([0., 0., 0., 0., 0.], dtype=float32)>, <tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 7.25 , 71.283,  7.925, 53.1  ,  8.458], dtype=float32)>])

What used to look like
age: [22., 38., 26., 35., 28.]
n_siblings_spouses: [1., 1., 0., 1., 0.]
now looks like
[22., 38., 26., 35., 28.]
[1., 1., 0., 1., 0.]
(the keys such as age and n_siblings_spouses have been dropped).

list(features.values())

[<tf.Tensor: shape=(5,), dtype=float32, numpy=array([22., 38., 26., 35., 28.], dtype=float32)>,
<tf.Tensor: shape=(5,), dtype=float32, numpy=array([1., 1., 0., 1., 0.], dtype=float32)>,
<tf.Tensor: shape=(5,), dtype=float32, numpy=array([0., 0., 0., 0., 0.], dtype=float32)>,
<tf.Tensor: shape=(5,), dtype=float32, numpy=array([ 7.25 , 71.283,  7.925, 53.1  ,  8.458], dtype=float32)>]

What was of the form
odict_values([○○○])
has been converted to the form
[○○○]

tf.stack(list(features.values()))

<tf.Tensor: shape=(4, 5), dtype=float32, numpy=
array([[22.   , 38.   , 26.   , 35.   , 28.   ],
      [ 1.   ,  1.   ,  0.   ,  1.   ,  0.   ],
      [ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ],
      [ 7.25 , 71.283,  7.925, 53.1  ,  8.458]], dtype=float32)>

Originally there were four Tensors of shape (5,) side by side:
<tf.Tensor: shape=(5,), ...>
<tf.Tensor: shape=(5,), ...>
<tf.Tensor: shape=(5,), ...>
<tf.Tensor: shape=(5,), ...>

Using
tf.stack()
merged them into a single Tensor:
<tf.Tensor: shape=(4, 5), ...>

tf.stack(list(features.values()), axis=-1)

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[22.   ,  1.   ,  0.   ,  7.25 ],
      [38.   ,  1.   ,  0.   , 71.283],
      [26.   ,  0.   ,  0.   ,  7.925],
      [35.   ,  1.   ,  0.   , 53.1  ],
      [28.   ,  0.   ,  0.   ,  8.458]], dtype=float32)>

Adding the option
axis=-1
to
tf.stack()
swapped the rows and columns.

Incidentally, because the result here is two-dimensional,
tf.stack(list(features.values()))
means the same as
tf.stack(list(features.values()), axis=0)

and
tf.stack(list(features.values()), axis=-1)
means the same as
tf.stack(list(features.values()), axis=1)
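
The same axis behaviour can be tried outside TensorFlow. The sketch below uses NumPy's np.stack as a stand-in for tf.stack (my assumption is only that the axis argument behaves the same way for this simple case), with the column values copied from the output above.

```python
import numpy as np

columns = [
    np.array([22.0, 38.0, 26.0, 35.0, 28.0]),      # age
    np.array([1.0, 1.0, 0.0, 1.0, 0.0]),           # n_siblings_spouses
    np.array([0.0, 0.0, 0.0, 0.0, 0.0]),           # parch
    np.array([7.25, 71.283, 7.925, 53.1, 8.458]),  # fare
]

by_column = np.stack(columns)           # axis=0 by default: one row per feature
by_record = np.stack(columns, axis=-1)  # last axis: one row per record

print(by_column.shape, by_record.shape)
```

With one row per record, each row is a complete feature vector for one passenger, which is the layout a model expects.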

What is functools.partial()?

def normalize_numeric_data(data, mean, std):
 # Center the data
 return (data-mean)/std

# See what you just created.
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)

numeric_column = tf.feature_column.numeric_column('numeric', normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])
numeric_columns = [numeric_column]
numeric_column

def normalize_numeric_data(data, mean, std):
defines a function called
normalize_numeric_data()

To call
normalize_numeric_data()
you have to supply the arguments, as in
normalize_numeric_data(data=○○, mean=××, std=△△)

The line
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)
creates a new callable,
normalizer()
that wraps
normalize_numeric_data()

What's more, it fixes the arguments
mean=MEAN, std=STD
in advance.

In other words, where you would otherwise have to write

normalize_numeric_data(example_batch['numeric'], MEAN, STD)

you can now write

normalizer(example_batch['numeric'])

Both give the same result.

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[ 0.429,  0.395, -0.479,  1.019],
      [ 0.269,  0.395,  2.043, -0.122],
      [-0.77 , -0.474, -0.479, -0.501],
      [-0.13 , -0.474, -0.479, -0.485],
      [ 0.189, -0.474, -0.479, -0.485]], dtype=float32)>

Values like these are displayed.
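
Here is functools.partial on its own, stripped of the TensorFlow parts. A minimal sketch: the scalar MEAN and STD are made-up illustrative values (the tutorial uses arrays), chosen so the arithmetic is easy to follow.

```python
import functools


def normalize_numeric_data(data, mean, std):
    # Center the data and scale it by the standard deviation.
    return (data - mean) / std


# Illustrative scalars; the tutorial's MEAN and STD are arrays.
MEAN = 30.0
STD = 12.0

# Pre-bind mean and std, leaving only `data` to be supplied later.
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)

explicit = normalize_numeric_data(42.0, MEAN, STD)
shortcut = normalizer(42.0)
print(explicit, shortcut)
```

The partial object keeps the wrapped function in normalizer.func and the pre-bound arguments in normalizer.keywords, which is also what shows up in the NumericColumn repr.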

What is tf.feature_column.numeric_column()?

def normalize_numeric_data(data, mean, std):
 # Center the data
 return (data-mean)/std

# See what you just created.
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)

numeric_column = tf.feature_column.numeric_column('numeric', normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])
numeric_columns = [numeric_column]
numeric_column

NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalize_numeric_data at 0x7f3710dfcc80>, mean=array([29.631,  0.545,  0.38 , 34.385]), std=array([12.512,  1.151,  0.793, 54.598])))

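The contents of
numeric_column
are shown above.

tf.feature_column.numeric_column()
only declares that
"the column named 'numeric'
 has shape=(4,)
 and should be normalized with normalizer."
No data is transformed at this point.

As far as I can tell, the normalization only happens when the data is actually used, for example in model.fit().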

What is tf.keras.layers.DenseFeatures(numeric_columns)?

def normalize_numeric_data(data, mean, std):
 # Center the data
 return (data-mean)/std

# See what you just created.
normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)

numeric_column = tf.feature_column.numeric_column('numeric', normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])
numeric_columns = [numeric_column]

numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()

array([[ 0.429,  0.395, -0.479,  1.019],
      [ 0.269,  0.395,  2.043, -0.122],
      [-0.77 , -0.474, -0.479, -0.501],
      [-0.13 , -0.474, -0.479, -0.485],
      [ 0.189, -0.474, -0.479, -0.485]], dtype=float32)

The line
numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
defines a layer that normalizes the input data.

For example, suppose train_data contains

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[35. , 1. , 0. , 90. ],
[33. , 1. , 2. , 27.75 ],
[20. , 0. , 0. , 7.05 ],
[28. , 0. , 0. , 7.879],
[32. , 0. , 0. , 7.925]], dtype=float32)>

something like this.

array([[ 0.429, 0.395, -0.479, 1.019],
[ 0.269, 0.395, 2.043, -0.122],
[-0.77 , -0.474, -0.479, -0.501],
[-0.13 , -0.474, -0.479, -0.485],
[ 0.189, -0.474, -0.479, -0.485]], dtype=float32)

Normally you would first normalize train_data into data like this and then train with
model.fit(train_data)

But if you set up a layer that normalizes the data with
numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
then you can take

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[35. , 1. , 0. , 90. ],
[33. , 1. , 2. , 27.75 ],
[20. , 0. , 0. , 7.05 ],
[28. , 0. , 0. , 7.879],
[32. , 0. , 0. , 7.925]], dtype=float32)>

data like this, pass it straight to
model.fit(train_data)
and the normalization happens inside
model.fit()
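
To convince myself that the layer really is just applying (data - mean) / std, the numbers can be reproduced by hand. The sketch below uses NumPy instead of the Keras layer; MEAN and STD are copied from the NumericColumn output shown earlier, and raw is the 'numeric' entry of example_batch.

```python
import numpy as np

MEAN = np.array([29.631, 0.545, 0.38, 34.385])
STD = np.array([12.512, 1.151, 0.793, 54.598])

raw = np.array([
    [35.0, 1.0, 0.0, 90.0],
    [33.0, 1.0, 2.0, 27.75],
    [20.0, 0.0, 0.0, 7.05],
    [28.0, 0.0, 0.0, 7.879],
    [32.0, 0.0, 0.0, 7.925],
])

# Broadcasting subtracts each column's mean and divides by each column's std.
normalized = (raw - MEAN) / STD
print(np.round(normalized, 3))
```

The result agrees with the numeric_layer(example_batch) output to within the rounding of the printed MEAN and STD.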

What does tf.keras.layers.DenseFeatures() do?

CATEGORIES = {
   'sex': ['male', 'female'],
   # 'class' : ['First', 'Second', 'Third'],
   # 'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
   # 'embark_town' : ['Cherbourg', 'Southampton', 'Queenstown'],
   # 'alone' : ['y', 'n']
}

categorical_columns = []
for feature, vocab in CATEGORIES.items():
 cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
       key=feature, vocabulary_list=vocab)
 categorical_columns.append(tf.feature_column.indicator_column(cat_col))

categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(example_batch).numpy()[:5])

[[0. 1.]
[0. 1.]
[1. 0.]
[0. 1.]
[1. 0.]]

With many entries it gets hard to follow, so inside
CATEGORIES = {}
I commented out every item except
sex

I also changed
print(categorical_layer(example_batch).numpy()[0])
to
print(categorical_layer(example_batch).numpy()[:5])

example_batch

OrderedDict([('sex',
             <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'female', b'female', b'male', b'female', b'male'], dtype=object)>),
            ('class',
             <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'First', b'Second', b'Third', b'Third', b'Third'], dtype=object)>),
            ('deck',
             <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'C', b'unknown', b'unknown', b'unknown', b'unknown'], dtype=object)>),
            ('embark_town', <tf.Tensor: shape=(5,), dtype=string, numpy=
             array([b'Southampton', b'Southampton', b'Southampton', b'Queenstown',
                    b'Southampton'], dtype=object)>),
            ('alone',
             <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'n', b'n', b'y', b'y', b'y'], dtype=object)>),
            ('numeric', <tf.Tensor: shape=(5, 4), dtype=float32, numpy=
             array([[35.   ,  1.   ,  0.   , 90.   ],
                    [33.   ,  1.   ,  2.   , 27.75 ],
                    [20.   ,  0.   ,  0.   ,  7.05 ],
                    [28.   ,  0.   ,  0.   ,  7.879],
                    [32.   ,  0.   ,  0.   ,  7.925]], dtype=float32)>)])

Those are the contents of
example_batch

For the sex column, the data was
[b'female', b'female', b'male', b'female', b'male']

Written as a column:
sex
female
female
male
female
male

tf.keras.layers.DenseFeatures()
converted the data into this one-hot form:
male  female
0     1
0     1
1     0
0     1
1     0

Rows that were female in the source data became male = 0, female = 1,
and rows that were male became male = 1, female = 0.
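
The conversion itself is easy to mimic in plain Python. A minimal sketch of one-hot encoding against a vocabulary list; one_hot is a hypothetical helper, and its handling of unknown values (all zeros) is a choice made for this sketch.

```python
def one_hot(values, vocabulary):
    """Encode each value as a 0/1 list with a 1 at its vocabulary index."""
    # Values missing from the vocabulary become all zeros in this sketch.
    return [[1.0 if value == word else 0.0 for word in vocabulary]
            for value in values]


sex = ["female", "female", "male", "female", "male"]
encoded = one_hot(sex, vocabulary=["male", "female"])
for row in encoded:
    print(row)
```

The column order follows the vocabulary list, so putting 'male' first is what makes male the left column.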
