6.2. Parameter Management

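Once we have chosen an architecture and set our hyperparameters, we proceed to the training loop, where our goal is to find parameter values that minimize our loss function. After training, we will need these parameters in order to make future predictions. Additionally, we will sometimes wish to extract the parameters, perhaps to reuse them in some other context, to save our model to disk so that it may be executed in other software, or to examine them in the hope of gaining scientific understanding.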

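Most of the time, we can ignore the nitty-gritty details of how parameters are declared and manipulated, relying on deep learning frameworks to do the heavy lifting. However, when we move away from stacked architectures with standard layers, we will sometimes need to get into the weeds of declaring and manipulating parameters. This section covers the following: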

Accessing parameters for debugging, diagnostics, and visualization.
Sharing parameters across different model components.

pytorch mxnet jax tensorflow

import torch
from torch import nn

from mxnet import init, np, npx
from mxnet.gluon import nn
npx.set_np()

from d2l import jax as d2l
from flax import linen as nn
import jax
from jax import numpy as jnp

import tensorflow as tf

We start by focusing on an MLP with one hidden layer.

pytorch mxnet jax tensorflow

net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))

X = torch.rand(size=(2, 4))
net(X).shape

torch.Size([2, 1])

net = nn.Sequential()
net.add(nn.Dense(8, activation='relu'))
net.add(nn.Dense(1))
net.initialize()  # Use the default initialization method

X = np.random.uniform(size=(2, 4))
net(X).shape

(2, 1)

net = nn.Sequential([nn.Dense(8), nn.relu, nn.Dense(1)])

X = jax.random.uniform(d2l.get_key(), (2, 4))
params = net.init(d2l.get_key(), X)
net.apply(params, X).shape

(2, 1)

net = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation=tf.nn.relu),
    tf.keras.layers.Dense(1),
])

X = tf.random.uniform((2, 4))
net(X).shape

TensorShape([2, 1])

6.2.1. Parameter Access

Let's start with how to access parameters from the models that we already know.

When a model is defined via the Sequential class, we can access any layer by indexing into the model as though it were a list. Each layer's parameters are conveniently stored in its attributes.

We can inspect the parameters of the second fully connected layer as follows.

pytorch mxnet jax tensorflow

net[2].state_dict()

OrderedDict([('weight',
              tensor([[-0.0915,  0.2323,  0.0672,  0.0951,  0.0561,  0.2846,  0.3074, -0.2919]])),
             ('bias', tensor([0.2610]))])

net[1].params

dense1_ (
  Parameter dense1_weight (shape=(1, 8), dtype=float32)
  Parameter dense1_bias (shape=(1,), dtype=float32)
)

params['params']['layers_2']

FrozenDict({
    kernel: Array([[-0.5666766 ],
           [-0.2169885 ],
           [ 0.37719882],
           [-0.2976346 ],
           [ 0.15896064],
           [ 0.72404635],
           [-0.44518152],
           [-0.08496901]], dtype=float32),
    bias: Array([0.], dtype=float32),
})

net.layers[2].weights

[<tf.Variable 'dense_1/kernel:0' shape=(4, 1) dtype=float32, numpy=
 array([[ 0.2972461],
        [ 0.2490077],
        [-0.9411764],
        [ 1.0483162]], dtype=float32)>,
 <tf.Variable 'dense_1/bias:0' shape=(1,) dtype=float32, numpy=array([0.], dtype=float32)>]

This fully connected layer contains two parameters, corresponding to the layer's weights and biases, respectively.
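As a minimal PyTorch sketch of the list-style indexing described above, the loop below walks the layers of a Sequential model and records which parameter names each layer holds. It uses `nn.Linear` with explicit input sizes instead of `nn.LazyLinear`, an assumption made only so that no forward pass is needed; the name `layer_params` is likewise illustrative.

```python
import torch
from torch import nn

# Same architecture as above, with input sizes given explicitly (assumption).
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# A Sequential model supports len() and integer indexing like a list.
layer_params = {}
for i in range(len(net)):
    layer_params[i] = list(net[i].state_dict().keys())
    print(i, type(net[i]).__name__, layer_params[i])
```

The activation layer holds no parameters, so its entry is an empty list, while each fully connected layer reports a weight and a bias.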

6.2.1.1. Targeted Parameters

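Note that each parameter is represented as an instance of the parameter class. To do anything useful with a parameter, we first need to access its underlying numerical values. There are several ways to do this; some are simpler, while others are more general. The following code extracts the bias from the second neural network layer, which returns a parameter class instance, and then accesses that parameter's value. (In the JAX snippet, the entry in the parameter dictionary is already a plain array.)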

pytorch mxnet jax tensorflow

type(net[2].bias), net[2].bias.data

(torch.nn.parameter.Parameter, tensor([0.2610]))

type(net[1].bias), net[1].bias.data()

(mxnet.gluon.parameter.Parameter, array([0.]))

bias = params['params']['layers_2']['bias']
type(bias), bias

(jaxlib.xla_extension.ArrayImpl, Array([0.], dtype=float32))

type(net.layers[2].weights[1]), tf.convert_to_tensor(net.layers[2].weights[1])

(tensorflow.python.ops.resource_variable_ops.ResourceVariable,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.], dtype=float32)>)

Parameters are complex objects, containing values, gradients, and additional information. That is why we need to request the value explicitly.
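The following PyTorch sketch illustrates a few ways of requesting a parameter's value explicitly. The standalone `nn.Linear` layer and the variable names are assumptions chosen for illustration; they are not the network defined above.

```python
import torch
from torch import nn

layer = nn.Linear(8, 1)  # illustrative layer, not the network above
bias = layer.bias

# The Parameter object itself: a Tensor subclass flagged for gradient tracking.
print(type(bias).__name__, bias.requires_grad)

# Three ways to get at the numbers it holds.
as_tensor = bias.detach()      # tensor sharing the same storage, no gradient tracking
as_list = as_tensor.tolist()   # plain Python list of floats
as_float = bias.item()         # Python float, valid only for one-element tensors
print(as_tensor, as_list, as_float)
```

`detach()` returns a tensor that shares memory with the parameter, so in-place changes to it also change the parameter, whereas `tolist()` and `item()` return independent Python values.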

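In addition to its value, each parameter also lets us access its gradient. Because we have not yet invoked backpropagation for this network, the gradient is still in its initial state. JAX and TensorFlow do not attach a gradient to the parameter object, so the JAX and TensorFlow snippets below instead list the shapes and values of all parameters.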

pytorch mxnet jax tensorflow

net[2].weight.grad is None

True

net[1].weight.grad()

array([[0., 0., 0., 0., 0., 0., 0., 0.]])

jax.tree_util.tree_map(lambda x: x.shape, params)

FrozenDict({
    params: {
        layers_0: {
            bias: (8,),
            kernel: (4, 8),
        },
        layers_2: {
            bias: (1,),
            kernel: (8, 1),
        },
    },
})

net.get_weights()

[array([[ 0.8407554 , -0.70623875, -0.5730775 ,  0.19883841],
        [ 0.22767669, -0.00824153, -0.27324384, -0.05706435],
        [ 0.66750854,  0.72794133, -0.57247126, -0.4776808 ],
        [ 0.35080236,  0.47631615, -0.53025633, -0.19356781]],
       dtype=float32),
 array([0., 0., 0., 0.], dtype=float32),
 array([[ 0.2972461],
        [ 0.2490077],
        [-0.9411764],
        [ 1.0483162]], dtype=float32),
 array([0.], dtype=float32)]
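To see a gradient leave its initial state, one can run a single backward pass. The PyTorch sketch below does this. The explicit `nn.Linear` sizes and the summed output used as a stand-in loss are assumptions for illustration.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.rand(2, 4)

before = net[2].weight.grad   # None: no backward pass has run yet
net(X).sum().backward()       # summed output as a stand-in scalar loss
after = net[2].weight.grad    # now a tensor with the same shape as the weight

print(before is None, after.shape == net[2].weight.shape)
```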

6.2.1.2. All Parameters at Once

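When we need to perform operations on all parameters, accessing them one by one can grow tedious. The situation becomes especially unwieldy with more complex models, for example nested modules, since we would need to recurse through the entire tree to extract each submodule's parameters. The PyTorch and MXNet snippets below show how to access the parameters of all layers; the JAX and TensorFlow snippets that follow them build a network with a shared layer, the topic of Section 6.2.2.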

pytorch mxnet jax tensorflow

[(name, param.shape) for name, param in net.named_parameters()]

[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]

net.collect_params()

sequential0_ (
  Parameter dense0_weight (shape=(8, 4), dtype=float32)
  Parameter dense0_bias (shape=(8,), dtype=float32)
  Parameter dense1_weight (shape=(1, 8), dtype=float32)
  Parameter dense1_bias (shape=(1,), dtype=float32)
)

# We need to give the shared layer a name so that we can refer to its
# parameters
shared = nn.Dense(8)
net = nn.Sequential([nn.Dense(8), nn.relu,
                     shared, nn.relu,
                     shared, nn.relu,
                     nn.Dense(1)])

params = net.init(jax.random.PRNGKey(d2l.get_seed()), X)

# Check whether the parameters are different
print(len(params['params']) == 3)

True

# tf.keras behaves a bit differently. It removes the duplicate layer
# automatically
shared = tf.keras.layers.Dense(4, activation=tf.nn.relu)
net = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    shared,
    shared,
    tf.keras.layers.Dense(1),
])

net(X)
# Check whether the parameters are different
print(len(net.layers) == 3)

True
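For the nested-module case mentioned at the start of this subsection, a short PyTorch sketch may help. The helper `block` and all layer sizes are hypothetical choices for this example. It shows that `named_parameters()` and `parameters()` already recurse through submodules, so no manual tree traversal is needed.

```python
import torch
from torch import nn

def block():
    # A small sub-network; the sizes are arbitrary choices for this sketch.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                         nn.Linear(8, 4), nn.ReLU())

net = nn.Sequential(block(), block(), nn.Linear(4, 1))

# named_parameters() recurses through submodules and yields dotted names.
names = [name for name, _ in net.named_parameters()]
print(names)

# The same dotted names index the model's state dictionary.
print(net.state_dict()['0.2.bias'].shape)

# Aggregate over every parameter without manual recursion.
total = sum(p.numel() for p in net.parameters())
print(total)
```

A dotted name such as `0.2.bias` reads as "bias of layer 2 inside block 0", which mirrors the nested indexing `net[0][2].bias`.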

6.2.2. Shared Parameters

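Often, we want to share parameters across multiple layers. Let's see how to do this elegantly. In the following, we allocate a fully connected layer and then use its parameters to set those of another layer. Here we need to run the forward propagation net(X) before accessing the parameters.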

pytorch mxnet

# We need to give the shared layer a name so that we can refer to its
# parameters
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))

net(X)
# Check whether the parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])
net[2].weight.data[0, 0] = 100
# Make sure that they are actually the same object rather than just having the
# same value
print(net[2].weight.data[0] == net[4].weight.data[0])

tensor([True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True])

net = nn.Sequential()
# We need to give the shared layer a name so that we can refer to its
# parameters
shared = nn.Dense(8, activation='relu')
net.add(nn.Dense(8, activation='relu'),
        shared,
        nn.Dense(8, activation='relu', params=shared.params),
        nn.Dense(10))
net.initialize()

X = np.random.uniform(size=(2, 20))

net(X)
# Check whether the parameters are the same
print(net[1].weight.data()[0] == net[2].weight.data()[0])
net[1].weight.data()[0, 0] = 100
# Make sure that they are actually the same object rather than just having the
# same value
print(net[1].weight.data()[0] == net[2].weight.data()[0])

[ True  True  True  True  True  True  True  True]
[ True  True  True  True  True  True  True  True]

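This example shows that the parameters of the second and third layers are tied. They are not just equal; they are represented by the exact same tensor. Thus, if we change one of the parameters, the other one changes, too.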
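The identity of tied parameters can also be checked directly rather than by comparing values. The PyTorch sketch below does so, again assuming `nn.Linear` layers with explicit sizes in place of `nn.LazyLinear` so that no forward pass is required; the variable names are illustrative.

```python
import torch
from torch import nn

shared = nn.Linear(8, 8)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(8, 1))

# Both positions hold the very same Parameter object.
same_object = net[2].weight is net[4].weight
print(same_object)

# named_parameters() reports each shared parameter once by default ...
param_names = [name for name, _ in net.named_parameters()]
print(param_names)

# ... while state_dict() keeps an entry for every position in the model.
state_keys = list(net.state_dict().keys())
print(state_keys)
```

Because `named_parameters()` skips duplicates by default, an optimizer built from `net.parameters()` sees the shared weights only once.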

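You might wonder what happens to the gradients when parameters are tied. Since the model parameters contain gradients, the gradients of the second hidden layer and the third hidden layer are added together during backpropagation.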
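One way to check this summation numerically is to compare a tied network against an untied copy whose two layers start from the same values. The sketch below is a minimal PyTorch illustration under assumed layer sizes, with the summed output as a stand-in loss; the names `tied`, `untied`, `first`, and `second` are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.rand(2, 4)

# Tied network: the same layer is applied twice.
shared = nn.Linear(4, 4)
tied = nn.Sequential(shared, nn.ReLU(), shared)

# Untied copy: two separate layers initialized with the same values.
first, second = nn.Linear(4, 4), nn.Linear(4, 4)
first.load_state_dict(shared.state_dict())
second.load_state_dict(shared.state_dict())
untied = nn.Sequential(first, nn.ReLU(), second)

# Summed output as a stand-in scalar loss for both networks.
tied(X).sum().backward()
untied(X).sum().backward()

# The tied gradient should match the sum of the two untied gradients.
summed = first.weight.grad + second.weight.grad
grads_match = torch.allclose(shared.weight.grad, summed, atol=1e-6)
print(grads_match)
```

Both networks compute the same function here, so the gradient with respect to the shared weight is the sum of the contributions from each place the layer is used.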

6.2.3. Summary

We have several ways of accessing and sharing model parameters.

6.2.4. Exercises

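Use the NestMLP model defined in Section 6.1 and access the parameters of the various layers.
Construct an MLP containing a shared parameter layer and train it. During the training process, observe the model parameters and gradients of each layer.
Why is sharing parameters a good idea?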