%load_ext d2lbook.tab
tab.interact_select(['mxnet', 'pytorch', 'tensorflow', 'jax'])

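2.3. Linear Algebra

By now, we can load datasets into tensors and manipulate them with basic mathematical operations. To build more sophisticated models, we also need a few tools from linear algebra. This section introduces the most essential concepts step by step, starting from scalar arithmetic and working up to matrix multiplication.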

pytorch mxnet jax tensorflow

import torch

from mxnet import np, npx
npx.set_np()

from jax import numpy as jnp

import tensorflow as tf

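2.3.1. Scalars

Most everyday mathematics consists of manipulating numbers one at a time. Formally, we call these values scalars. For example, suppose the temperature in Palo Alto is \(72\) degrees Fahrenheit. To convert it to Celsius, substitute \(72\) for \(f\) and evaluate \(c = \frac{5}{9}(f - 32)\). In this equation, the values \(5\), \(9\), and \(32\) are constant scalars. The variables \(c\) and \(f\) in general represent unknown scalars.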
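The conversion formula can be tried out directly. The following is a minimal sketch in plain Python; the function name `fahrenheit_to_celsius` is an illustrative choice and not part of any library used in this section.

```python
def fahrenheit_to_celsius(f: float) -> float:
    """Evaluate c = 5/9 * (f - 32) for a scalar temperature f."""
    return 5 / 9 * (f - 32)


# The Palo Alto example from the text: 72 degrees Fahrenheit.
c = fahrenheit_to_celsius(72)
print(round(c, 2))
```

Here `5`, `9`, and `32` play the role of the constant scalars, while `f` and `c` are the variable ones.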

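We denote scalars by ordinary lower-cased letters (e.g., \(x\), \(y\), and \(z\)) and the space of all (continuous) real-valued scalars by \(\mathbb{R}\). For brevity, we skip the rigorous definition of spaces. For now, read the expression \(x \in \mathbb{R}\) as a formal way of saying that \(x\) is a real-valued scalar. The symbol \(\in\) denotes membership in a set. For example, \(x, y \in \{0, 1\}\) means that \(x\) and \(y\) can each take only the values \(0\) or \(1\).

Scalars are implemented as tensors that contain only one element. Below, we assign two scalars and perform the familiar addition, multiplication, division, and exponentiation operations.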

pytorch mxnet jax tensorflow

x = torch.tensor(3.0)
y = torch.tensor(2.0)

x + y, x * y, x / y, x**y

(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))

x = np.array(3.0)
y = np.array(2.0)

x + y, x * y, x / y, x ** y

(array(5.), array(6.), array(1.5), array(9.))

x = jnp.array(3.0)
y = jnp.array(2.0)

x + y, x * y, x / y, x**y

(Array(5., dtype=float32, weak_type=True),
 Array(6., dtype=float32, weak_type=True),
 Array(1.5, dtype=float32, weak_type=True),
 Array(9., dtype=float32, weak_type=True))

x = tf.constant(3.0)
y = tf.constant(2.0)

x + y, x * y, x / y, x**y

(<tf.Tensor: shape=(), dtype=float32, numpy=5.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=6.0>,
 <tf.Tensor: shape=(), dtype=float32, numpy=1.5>,
 <tf.Tensor: shape=(), dtype=float32, numpy=9.0>)

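2.3.2. Vectors

For now, you can think of a vector as a fixed-length array of scalars. As with ordinary arrays, we call these scalars the elements of the vector (also called entries or components). When vectors represent examples from real-world datasets, their values carry real-world meaning. For example, if we were training a model to predict the risk of a loan defaulting, we might associate each applicant with a vector whose components correspond to quantities such as income, length of employment, or number of previous defaults. If we were studying heart attack risk, each vector might represent a patient, with components corresponding to their most recent vital signs, cholesterol levels, amount of daily exercise, and so on. We denote vectors by bold lower-cased letters (e.g., \(\mathbf{x}\), \(\mathbf{y}\), and \(\mathbf{z}\)).

Vectors are implemented as \(1^{\textrm{st}}\)-order tensors. In general, such tensors can have arbitrary lengths, subject to memory limitations. Caution: in Python, as in most programming languages, vector indices start at \(0\), which is known as zero-based indexing, whereas in linear algebra subscripts conventionally begin at \(1\) (one-based indexing).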

pytorch mxnet jax tensorflow

x = torch.arange(3)
x

tensor([0, 1, 2])

x = np.arange(3)
x

array([0., 1., 2.])

x = jnp.arange(3)
x

Array([0, 1, 2], dtype=int32)

x = tf.range(3)
x

<tf.Tensor: shape=(3,), dtype=int32, numpy=array([0, 1, 2], dtype=int32)>

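We can refer to an element of a vector by using a subscript. For example, \(x_2\) denotes the second element of \(\mathbf{x}\). Since \(x_2\) is a scalar, we do not bold it. By default, we visualize vectors by stacking their elements vertically.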

(2.3.1)¶\[\begin{split}\mathbf{x} =\begin{bmatrix}x_{1} \\ \vdots \\x_{n}\end{bmatrix}.\end{split}\]

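Here \(x_1, \ldots, x_n\) are the elements of the vector. Later on, we will distinguish between such column vectors and row vectors, whose elements are stacked horizontally. Recall that we access a tensor's elements via indexing. Because code uses zero-based indexing, x[2] returns the third element, i.e., \(x_3\) in the notation above.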

x[2]

tensor(2)

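To indicate that a vector contains \(n\) elements, we write \(\mathbf{x} \in \mathbb{R}^n\). Formally, we call \(n\) the dimensionality of the vector. In code, this corresponds to the tensor's length, accessible via Python's built-in len function.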

len(x)

We can also access the length via the shape attribute. The shape is a tuple that gives the tensor's length along each axis. Tensors with just one axis have shapes with just one element.

x.shape

torch.Size([3])

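The word "dimension" is often overloaded to mean both the number of axes and the length along a particular axis. To avoid this confusion, we use order to refer to the number of axes and dimensionality exclusively to refer to the number of components.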

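2.3.3. Matrices

Just as scalars are \(0^{\textrm{th}}\)-order tensors and vectors are \(1^{\textrm{st}}\)-order tensors, matrices are \(2^{\textrm{nd}}\)-order tensors. We denote matrices by bold capital letters (e.g., \(\mathbf{X}\), \(\mathbf{Y}\), and \(\mathbf{Z}\)) and represent them in code by tensors with two axes. The expression \(\mathbf{A} \in \mathbb{R}^{m \times n}\) indicates that the matrix \(\mathbf{A}\) contains \(m \times n\) real-valued scalars, arranged as \(m\) rows and \(n\) columns. When \(m = n\), we say that the matrix is square. Visually, we can illustrate any matrix as a table. To refer to an individual element, we subscript both the row and column indices; e.g., \(a_{ij}\) is the value in the \(i^{\textrm{th}}\) row and \(j^{\textrm{th}}\) column of \(\mathbf{A}\):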

(2.3.2)¶\[\begin{split}\mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix}.\end{split}\]

In code, we represent a matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) by a \(2^{\textrm{nd}}\)-order tensor with shape (\(m\), \(n\)). Any tensor with exactly \(m \times n\) elements can be converted into an \(m \times n\) matrix by passing the desired shape to reshape.

pytorch mxnet jax tensorflow

A = torch.arange(6).reshape(3, 2)
A

tensor([[0, 1],
        [2, 3],
        [4, 5]])

A = np.arange(6).reshape(3, 2)
A

array([[0., 1.],
       [2., 3.],
       [4., 5.]])

A = jnp.arange(6).reshape(3, 2)
A

Array([[0, 1],
       [2, 3],
       [4, 5]], dtype=int32)

A = tf.reshape(tf.range(6), (3, 2))
A

<tf.Tensor: shape=(3, 2), dtype=int32, numpy=
array([[0, 1],
       [2, 3],
       [4, 5]], dtype=int32)>

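Sometimes we want to flip the axes. When we exchange a matrix's rows and columns, the result is called its transpose. Formally, we denote the transpose of a matrix \(\mathbf{A}\) by \(\mathbf{A}^\top\); if \(\mathbf{B} = \mathbf{A}^\top\), then \(b_{ij} = a_{ji}\) for all \(i\) and \(j\). Thus, the transpose of an \(m \times n\) matrix is an \(n \times m\) matrix: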

(2.3.3)¶\[\begin{split}\mathbf{A}^\top = \begin{bmatrix} a_{11} & a_{21} & \dots & a_{m1} \\ a_{12} & a_{22} & \dots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \dots & a_{mn} \end{bmatrix}.\end{split}\]

In code, we can access any matrix's transpose as follows.

pytorch mxnet jax tensorflow

A.T

tensor([[0, 2, 4],
        [1, 3, 5]])

A.T

array([[0., 2., 4.],
       [1., 3., 5.]])

A.T

Array([[0, 2, 4],
       [1, 3, 5]], dtype=int32)

tf.transpose(A)

<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
array([[0, 2, 4],
       [1, 3, 5]], dtype=int32)>
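To connect the definition \(b_{ij} = a_{ji}\) with the framework calls above, here is a framework-free sketch that stores a matrix as a list of rows. The helper name `transpose` is hypothetical and only mirrors what `A.T` or `tf.transpose(A)` compute.

```python
def transpose(A):
    """Return the transpose of a matrix stored as a list of equal-length rows."""
    # zip(*A) groups the j-th entries of every row, i.e. the columns of A.
    return [list(column) for column in zip(*A)]


A = [[0, 1],
     [2, 3],
     [4, 5]]          # 3 x 2, same values as the example above
B = transpose(A)      # 2 x 3

# Check the defining property b_ij = a_ji for every index pair.
property_holds = all(
    B[i][j] == A[j][i] for i in range(len(B)) for j in range(len(A))
)
print(B, property_holds)
```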

A symmetric matrix is a square matrix that is equal to its own transpose: \(\mathbf{A} = \mathbf{A}^\top\). The following matrix is symmetric.

pytorch mxnet jax tensorflow

A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == A.T

tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])

A = np.array([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == A.T

array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])

A = jnp.array([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == A.T

Array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)

A = tf.constant([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == tf.transpose(A)

<tf.Tensor: shape=(3, 3), dtype=bool, numpy=
array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])>

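Matrices are also useful for representing datasets. Typically, rows correspond to individual records and columns correspond to distinct attributes.

2.3.4. Tensors

While you can go far in machine learning with only scalars, vectors, and matrices, eventually you will need to work with higher-order tensors. Tensors give us a generic way of describing extensions to \(n^{\textrm{th}}\)-order arrays. We call software objects of the tensor class "tensors" precisely because they too can have arbitrary numbers of axes. While it may be confusing to use the word tensor for both the mathematical object and its realization in code, the meaning should usually be clear from context. We denote general tensors by capital letters in a special font face (e.g., \(\mathsf{X}\), \(\mathsf{Y}\), and \(\mathsf{Z}\)), and their indexing mechanism (e.g., \(x_{ijk}\) and \([\mathsf{X}]_{1, 2i-1, 3}\)) follows naturally from that of matrices.

Tensors become more important once we start working with images. Each image arrives as a \(3^{\textrm{rd}}\)-order tensor with axes corresponding to the height, width, and channel. At each spatial location, the intensities of each color (red, green, and blue) are stacked along the channel axis. Furthermore, a collection of images is represented in code by a \(4^{\textrm{th}}\)-order tensor, where distinct images are indexed along the first axis. Higher-order tensors are constructed, as were vectors and matrices, by growing the number of shape components.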

pytorch mxnet jax tensorflow

torch.arange(24).reshape(2, 3, 4)

tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])

np.arange(24).reshape(2, 3, 4)

array([[[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]],

       [[12., 13., 14., 15.],
        [16., 17., 18., 19.],
        [20., 21., 22., 23.]]])

jnp.arange(24).reshape(2, 3, 4)

Array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]], dtype=int32)

tf.reshape(tf.range(24), (2, 3, 4))

<tf.Tensor: shape=(2, 3, 4), dtype=int32, numpy=
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]], dtype=int32)>

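2.3.5. Basic Properties of Tensor Arithmetic

Scalars, vectors, matrices, and higher-order tensors all have some handy properties. For example, elementwise operations produce outputs that have the same shape as their operands.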

pytorch mxnet jax tensorflow

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()  # Assign a copy of A to B by allocating new memory
A, A + B

A = np.arange(6).reshape(2, 3)
B = A.copy()  # Assign a copy of A to B by allocating new memory
A, A + B

A = jnp.arange(6, dtype=jnp.float32).reshape(2, 3)
B = A
A, A + B

(Array([[0., 1., 2.],
        [3., 4., 5.]], dtype=float32),
 Array([[ 0.,  2.,  4.],
        [ 6.,  8., 10.]], dtype=float32))

A = tf.reshape(tf.range(6, dtype=tf.float32), (2, 3))
B = A  # No cloning of A to B by allocating new memory
A, A + B

The elementwise product of two matrices is called their Hadamard product (denoted \(\odot\)). The entries of the Hadamard product of two matrices \(\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}\) can be written as follows:

(2.3.4)¶\[\begin{split}\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} a_{11} b_{11} & a_{12} b_{12} & \dots & a_{1n} b_{1n} \\ a_{21} b_{21} & a_{22} b_{22} & \dots & a_{2n} b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} b_{m1} & a_{m2} b_{m2} & \dots & a_{mn} b_{mn} \end{bmatrix}.\end{split}\]

A * B

tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])
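For readers who want to see the Hadamard product spelled out without a tensor library, the following sketch applies the entrywise definition to nested lists. The function name `hadamard` is an assumption made for this illustration; in the frameworks above the same result comes from `A * B`.

```python
def hadamard(A, B):
    """Multiply two equally shaped matrices entry by entry."""
    same_shape = len(A) == len(B) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(A, B)
    )
    if not same_shape:
        raise ValueError("Hadamard product needs operands of identical shape")
    return [
        [a * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(A, B)
    ]


A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]
B = [row[:] for row in A]   # an independent copy of A
C = hadamard(A, B)
print(C)
```

The output has the same shape as both operands, as expected for any elementwise operation.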

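Adding or multiplying a scalar and a tensor produces a result with the same shape as the original tensor. Here, each element of the tensor is added to (or multiplied by) the scalar.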

pytorch mxnet jax tensorflow

a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape

(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],

         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]),
 torch.Size([2, 3, 4]))

a = 2
X = np.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape

(array([[[ 2.,  3.,  4.,  5.],
         [ 6.,  7.,  8.,  9.],
         [10., 11., 12., 13.]],

        [[14., 15., 16., 17.],
         [18., 19., 20., 21.],
         [22., 23., 24., 25.]]]),
 (2, 3, 4))

a = 2
X = jnp.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape

(Array([[[ 2,  3,  4,  5],
         [ 6,  7,  8,  9],
         [10, 11, 12, 13]],

        [[14, 15, 16, 17],
         [18, 19, 20, 21],
         [22, 23, 24, 25]]], dtype=int32),
 (2, 3, 4))

a = 2
X = tf.reshape(tf.range(24), (2, 3, 4))
a + X, (a * X).shape

(<tf.Tensor: shape=(2, 3, 4), dtype=int32, numpy=
 array([[[ 2,  3,  4,  5],
         [ 6,  7,  8,  9],
         [10, 11, 12, 13]],

        [[14, 15, 16, 17],
         [18, 19, 20, 21],
         [22, 23, 24, 25]]], dtype=int32)>,
 TensorShape([2, 3, 4]))

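2.3.6. Reduction

Often, we wish to calculate the sum of a tensor's elements. To express the sum of the elements in a vector \(\mathbf{x}\) of length \(n\), we write \(\sum_{i=1}^n x_i\). There is a simple function for it: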

pytorch mxnet jax tensorflow

x = torch.arange(3, dtype=torch.float32)
x, x.sum()

(tensor([0., 1., 2.]), tensor(3.))

x = np.arange(3)
x, x.sum()

(array([0., 1., 2.]), array(3.))

x = jnp.arange(3, dtype=jnp.float32)
x, x.sum()

(Array([0., 1., 2.], dtype=float32), Array(3., dtype=float32))

x = tf.range(3, dtype=tf.float32)
x, tf.reduce_sum(x)

(<tf.Tensor: shape=(3,), dtype=float32, numpy=array([0., 1., 2.], dtype=float32)>,
 <tf.Tensor: shape=(), dtype=float32, numpy=3.0>)

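To express sums over the elements of tensors of arbitrary shape, we simply sum over all of their axes. For example, the sum of the elements of an \(m \times n\) matrix \(\mathbf{A}\) can be written \(\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}\).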

pytorch mxnet jax tensorflow

A.shape, A.sum()

(torch.Size([2, 3]), tensor(15.))

A.shape, A.sum()

((2, 3), array(15.))

A.shape, A.sum()

((2, 3), Array(15., dtype=float32))

A.shape, tf.reduce_sum(A)

(TensorShape([2, 3]), <tf.Tensor: shape=(), dtype=float32, numpy=15.0>)

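By default, invoking the sum function reduces a tensor along all of its axes, eventually producing a scalar. Our libraries also allow us to specify the axes along which the tensor should be reduced. To sum over all elements along the rows (axis 0), we specify axis=0 in sum. Since the input matrix reduces along axis 0 to generate the output vector, this axis is missing from the shape of the output.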

pytorch mxnet jax tensorflow

A.shape, A.sum(axis=0).shape

(torch.Size([2, 3]), torch.Size([3]))

A.shape, A.sum(axis=0).shape

((2, 3), (3,))

A.shape, A.sum(axis=0).shape

((2, 3), (3,))

A.shape, tf.reduce_sum(A, axis=0).shape

(TensorShape([2, 3]), TensorShape([3]))

Specifying axis=1 reduces the column dimension (axis 1): for each row, the elements across all columns are summed.

pytorch mxnet jax tensorflow

A.shape, A.sum(axis=1).shape

(torch.Size([2, 3]), torch.Size([2]))

A.shape, A.sum(axis=1).shape

((2, 3), (2,))

A.shape, A.sum(axis=1).shape

((2, 3), (2,))

A.shape, tf.reduce_sum(A, axis=1).shape

(TensorShape([2, 3]), TensorShape([2]))
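Which axis disappears is easy to mix up, so the sketch below reproduces both reductions on nested lists. The helper names `sum_axis0` and `sum_axis1` are hypothetical; they only imitate `A.sum(axis=0)` and `A.sum(axis=1)` for a matrix.

```python
def sum_axis0(A):
    """Collapse the row axis: one sum per column."""
    return [sum(column) for column in zip(*A)]


def sum_axis1(A):
    """Collapse the column axis: one sum per row."""
    return [sum(row) for row in A]


A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]          # shape (2, 3)

column_sums = sum_axis0(A)     # length 3, axis 0 is gone
row_sums = sum_axis1(A)        # length 2, axis 1 is gone
print(column_sums, row_sums)
```

Summing either result once more gives the total over all elements, which matches the full reduction.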

Reducing a matrix along both rows and columns via summation is equivalent to summing up all the elements of the matrix.

pytorch mxnet jax tensorflow

A.sum(axis=[0, 1]) == A.sum()  # Same as A.sum()

A.sum(axis=[0, 1]) == A.sum()  # Same as A.sum()

A.sum(axis=[0, 1]) == A.sum()  # Same as A.sum()

tf.reduce_sum(A, axis=[0, 1]), tf.reduce_sum(A)  # Same as tf.reduce_sum(A)

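A related quantity is the mean, also called the average. We calculate the mean by dividing the sum by the total number of elements. Because computing the mean is so common, it gets a dedicated library function that works analogously to sum.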

pytorch mxnet jax tensorflow

A.mean(), A.sum() / A.numel()

(tensor(2.5000), tensor(2.5000))

A.mean(), A.sum() / A.size

(array(2.5), array(2.5))

A.mean(), A.sum() / A.size

(Array(2.5, dtype=float32), Array(2.5, dtype=float32))

tf.reduce_mean(A), tf.reduce_sum(A) / tf.size(A).numpy()

(<tf.Tensor: shape=(), dtype=float32, numpy=2.5>,
 <tf.Tensor: shape=(), dtype=float32, numpy=2.5>)

Likewise, the function for calculating the mean can also reduce a tensor along specific axes.

pytorch mxnet jax tensorflow

A.mean(axis=0), A.sum(axis=0) / A.shape[0]

(tensor([1.5000, 2.5000, 3.5000]), tensor([1.5000, 2.5000, 3.5000]))

A.mean(axis=0), A.sum(axis=0) / A.shape[0]

(array([1.5, 2.5, 3.5]), array([1.5, 2.5, 3.5]))

A.mean(axis=0), A.sum(axis=0) / A.shape[0]

(Array([1.5, 2.5, 3.5], dtype=float32), Array([1.5, 2.5, 3.5], dtype=float32))

tf.reduce_mean(A, axis=0), tf.reduce_sum(A, axis=0) / A.shape[0]

(<tf.Tensor: shape=(3,), dtype=float32, numpy=array([1.5, 2.5, 3.5], dtype=float32)>,
 <tf.Tensor: shape=(3,), dtype=float32, numpy=array([1.5, 2.5, 3.5], dtype=float32)>)

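2.3.7. Non-Reduction Sum

Sometimes it can be useful to keep the number of axes unchanged when invoking the function for calculating the sum or mean. This matters when we want to use the broadcast mechanism.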

pytorch mxnet jax tensorflow

sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape

(tensor([[ 3.],
         [12.]]),
 torch.Size([2, 1]))

sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape

(array([[ 3.],
        [12.]]),
 (2, 1))

sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape

(Array([[ 3.],
        [12.]], dtype=float32),
 (2, 1))

sum_A = tf.reduce_sum(A, axis=1, keepdims=True)
sum_A, sum_A.shape

(<tf.Tensor: shape=(2, 1), dtype=float32, numpy=
 array([[ 3.],
        [12.]], dtype=float32)>,
 TensorShape([2, 1]))

For instance, since sum_A keeps its two axes after summing each row, we can divide A by sum_A with broadcasting to create a matrix where each row sums up to \(1\).

A / sum_A

tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])
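The role of the kept axis can be made explicit with nested lists. In this sketch, `row_sums` is stored with shape (2, 1), mimicking `keepdims=True`, and the division is written out by hand instead of relying on broadcasting. All names are illustrative.

```python
A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]

# One single-element list per row: shape (2, 1), like sum(axis=1, keepdims=True).
row_sums = [[sum(row)] for row in A]

# Broadcasting pairs every entry of row i with row_sums[i][0].
normalized = [
    [value / total[0] for value in row]
    for row, total in zip(A, row_sums)
]

for row in normalized:
    print([round(value, 4) for value in row], "sums to", round(sum(row), 4))
```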

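If we want to calculate the cumulative sum of elements of A along some axis, say axis=0 (row by row), we can call the cumsum function. By design, this function does not reduce the input tensor along any axis.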

pytorch mxnet jax tensorflow

A.cumsum(axis=0)

tensor([[0., 1., 2.],
        [3., 5., 7.]])

A.cumsum(axis=0)

array([[0., 1., 2.],
       [3., 5., 7.]])

A.cumsum(axis=0)

Array([[0., 1., 2.],
       [3., 5., 7.]], dtype=float32)

tf.cumsum(A, axis=0)

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0., 1., 2.],
       [3., 5., 7.]], dtype=float32)>
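A cumulative sum along axis 0 keeps running totals down each column. The following sketch uses `itertools.accumulate` from the Python standard library; the wrapper name `cumsum_axis0` is an assumption made for this example.

```python
from itertools import accumulate


def cumsum_axis0(A):
    """Running totals down each column; the output keeps the shape of A."""
    running_columns = [list(accumulate(column)) for column in zip(*A)]
    # Transpose back so that the result is again a list of rows.
    return [list(row) for row in zip(*running_columns)]


A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]
result = cumsum_axis0(A)
print(result)
```

The last row of the result equals the reduction along axis 0, while earlier rows hold the partial sums.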

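2.3.8. Dot Products

So far, we have only performed elementwise operations, sums, and averages. If this were all we could do, linear algebra would not deserve its own section. Fortunately, this is where things get more interesting. One of the most fundamental operations is the dot product. Given two vectors \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^d\), their dot product \(\mathbf{x}^\top \mathbf{y}\) (also known as the inner product, \(\langle \mathbf{x}, \mathbf{y} \rangle\)) is a sum over the products of the elements at the same position: \(\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i\).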

pytorch mxnet jax tensorflow

y = torch.ones(3, dtype = torch.float32)
x, y, torch.dot(x, y)

(tensor([0., 1., 2.]), tensor([1., 1., 1.]), tensor(3.))

y = np.ones(3)
x, y, np.dot(x, y)

(array([0., 1., 2.]), array([1., 1., 1.]), array(3.))

y = jnp.ones(3, dtype = jnp.float32)
x, y, jnp.dot(x, y)

(Array([0., 1., 2.], dtype=float32),
 Array([1., 1., 1.], dtype=float32),
 Array(3., dtype=float32))

y = tf.ones(3, dtype=tf.float32)
x, y, tf.tensordot(x, y, axes=1)

(<tf.Tensor: shape=(3,), dtype=float32, numpy=array([0., 1., 2.], dtype=float32)>,
 <tf.Tensor: shape=(3,), dtype=float32, numpy=array([1., 1., 1.], dtype=float32)>,
 <tf.Tensor: shape=(), dtype=float32, numpy=3.0>)

Equivalently, we can calculate the dot product of two vectors by performing an elementwise multiplication followed by a sum:

pytorch mxnet jax tensorflow

torch.sum(x * y)

tensor(3.)

np.sum(x * y)

array(3.)

jnp.sum(x * y)

Array(3., dtype=float32)

tf.reduce_sum(x * y)

<tf.Tensor: shape=(), dtype=float32, numpy=3.0>

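Dot products are useful in a wide range of contexts. For example, given some set of values, denoted by a vector \(\mathbf{x} \in \mathbb{R}^n\), and a set of weights, denoted by \(\mathbf{w} \in \mathbb{R}^n\), the weighted sum of the values in \(\mathbf{x}\) according to the weights \(\mathbf{w}\) can be expressed as the dot product \(\mathbf{x}^\top \mathbf{w}\). When the weights are nonnegative and sum to \(1\), i.e., \(\left(\sum_{i=1}^{n} {w_i} = 1\right)\), the dot product expresses a weighted average. After normalizing two vectors to have unit length, the dot products express the cosine of the angle between them. Later in this section, we formally introduce this notion of length.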
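Both uses of the dot product mentioned above, the weighted average and the cosine of an angle, can be sketched in a few lines of plain Python. The sample values, weights, and the helper names `dot` and `cosine` are made up for illustration.

```python
import math


def dot(x, y):
    """Sum of the products of elements at the same position."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))


# Weighted average: the weights are nonnegative and sum to 1.
values = [1.0, 2.0, 3.0]
weights = [0.2, 0.3, 0.5]
weighted_average = dot(values, weights)


def cosine(x, y):
    """Dot product of x and y after scaling both to unit length."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


cos_orthogonal = cosine([1.0, 0.0], [0.0, 1.0])   # perpendicular vectors
cos_parallel = cosine([1.0, 2.0], [2.0, 4.0])     # same direction
print(weighted_average, cos_orthogonal, cos_parallel)
```

The length used in `cosine` is the Euclidean one, computed here as the square root of a vector's dot product with itself.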

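2.3.9. Matrix–Vector Products

Now that we know how to calculate dot products, we can begin to understand the product between an \(m \times n\) matrix \(\mathbf{A}\) and an \(n\)-dimensional vector \(\mathbf{x}\). To start off, we visualize our matrix in terms of its row vectors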

(2.3.5)¶\[\begin{split}\mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix},\end{split}\]

where each \(\mathbf{a}^\top_{i} \in \mathbb{R}^n\) is a row vector representing the \(i^\textrm{th}\) row of the matrix \(\mathbf{A}\).

The matrix–vector product \(\mathbf{A}\mathbf{x}\) is simply a column vector of length \(m\), whose \(i^\textrm{th}\) element is the dot product \(\mathbf{a}^\top_i \mathbf{x}\):

(2.3.6)¶\[\begin{split}\mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{x} \\ \mathbf{a}^\top_{2} \mathbf{x} \\ \vdots\\ \mathbf{a}^\top_{m} \mathbf{x}\\ \end{bmatrix}.\end{split}\]

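We can think of multiplication with a matrix \(\mathbf{A}\in \mathbb{R}^{m \times n}\) as a transformation that maps vectors from \(\mathbb{R}^{n}\) to \(\mathbb{R}^{m}\). These transformations are remarkably useful. For example, we can represent rotations as multiplications by certain square matrices. Matrix–vector products also describe the key calculation involved in computing the outputs of each layer in a neural network given the outputs from the previous layer.

To express a matrix–vector product in code, we use the mv function. Note that the column dimension of A (its length along axis 1) must be the same as the dimension of x (its length). Python has a convenience operator @ that can execute both matrix–vector and matrix–matrix products (depending on its arguments). Thus we can write A@x.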

pytorch mxnet jax tensorflow

A.shape, x.shape, torch.mv(A, x), A@x

(torch.Size([2, 3]), torch.Size([3]), tensor([ 5., 14.]), tensor([ 5., 14.]))

A.shape, x.shape, np.dot(A, x)

((2, 3), (3,), array([ 5., 14.]))

A.shape, x.shape, jnp.matmul(A, x)

((2, 3), (3,), Array([ 5., 14.], dtype=float32))

A.shape, x.shape, tf.linalg.matvec(A, x)

(TensorShape([2, 3]),
 TensorShape([3]),
 <tf.Tensor: shape=(2,), dtype=float32, numpy=array([ 5., 14.], dtype=float32)>)
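The row-by-row view of the matrix–vector product translates directly into code. The sketch below is framework-free; `matvec` is a hypothetical helper, and the rotation by 90 degrees is one illustrative instance of a transformation expressed by a square matrix.

```python
import math


def matvec(A, x):
    """Return Ax, whose i-th entry is the dot product of row i of A with x."""
    if any(len(row) != len(x) for row in A):
        raise ValueError("length of x must equal the number of columns of A")
    return [sum(a * b for a, b in zip(row, x)) for row in A]


A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]      # maps vectors from R^3 to R^2
x = [0.0, 1.0, 2.0]
y = matvec(A, x)

# A 2 x 2 rotation matrix for a counterclockwise turn by theta.
theta = math.pi / 2
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
rotated = matvec(R, [1.0, 0.0])
print(y, [round(value, 6) for value in rotated])
```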

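2.3.10. Matrix–Matrix Multiplication

Once you have gotten the hang of dot products and matrix–vector products, matrix–matrix multiplication should be straightforward.

Say that we have two matrices \(\mathbf{A} \in \mathbb{R}^{n \times k}\) and \(\mathbf{B} \in \mathbb{R}^{k \times m}\):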

(2.3.7)¶\[\begin{split}\mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \\ \end{bmatrix},\quad \mathbf{B}=\begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{km} \\ \end{bmatrix}.\end{split}\]

Let \(\mathbf{a}^\top_{i} \in \mathbb{R}^k\) denote the row vector representing the \(i^\textrm{th}\) row of the matrix \(\mathbf{A}\) and let \(\mathbf{b}_{j} \in \mathbb{R}^k\) denote the column vector from the \(j^\textrm{th}\) column of the matrix \(\mathbf{B}\):

(2.3.8)¶\[\begin{split}\mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix}, \quad \mathbf{B}=\begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix}.\end{split}\]

To form the matrix product \(\mathbf{C} \in \mathbb{R}^{n \times m}\), we simply compute each element \(c_{ij}\) as the dot product between the \(i^\textrm{th}\) row of \(\mathbf{A}\) and the \(j^\textrm{th}\) column of \(\mathbf{B}\), i.e., \(\mathbf{a}^\top_i \mathbf{b}_j\):

(2.3.9)¶\[\begin{split}\mathbf{C} = \mathbf{AB} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix} \begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1}\mathbf{b}_2& \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\ \mathbf{a}^\top_{2}\mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\ \vdots & \vdots & \ddots &\vdots\\ \mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n}\mathbf{b}_2& \cdots& \mathbf{a}^\top_{n} \mathbf{b}_m \end{bmatrix}.\end{split}\]

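We can think of the matrix–matrix multiplication \(\mathbf{AB}\) as performing \(m\) matrix–vector products or \(m \times n\) dot products and stitching the results together to form an \(n \times m\) matrix. In the following snippet, we perform matrix multiplication on A and B. Here, A is a matrix with two rows and three columns, and B is a matrix with three rows and four columns. After multiplication, we obtain a matrix with two rows and four columns.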

pytorch mxnet jax tensorflow

B = torch.ones(3, 4)
torch.mm(A, B), A@B

(tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]),
 tensor([[ 3.,  3.,  3.,  3.],
         [12., 12., 12., 12.]]))

B = np.ones(shape=(3, 4))
np.dot(A, B)

array([[ 3.,  3.,  3.,  3.],
       [12., 12., 12., 12.]])

B = jnp.ones((3, 4))
jnp.matmul(A, B)

Array([[ 3.,  3.,  3.,  3.],
       [12., 12., 12., 12.]], dtype=float32)

B = tf.ones((3, 4), tf.float32)
tf.matmul(A, B)

<tf.Tensor: shape=(2, 4), dtype=float32, numpy=
array([[ 3.,  3.,  3.,  3.],
       [12., 12., 12., 12.]], dtype=float32)>
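The rule \(c_{ij} = \mathbf{a}^\top_i \mathbf{b}_j\) can also be written out by hand. The following sketch is a naive, framework-free implementation intended only to mirror the definition; the name `matmul` is chosen for this illustration and the code is not optimized in any way.

```python
def matmul(A, B):
    """Return AB, where entry (i, j) is the dot product of row i of A and column j of B."""
    inner = len(B)
    if any(len(row) != inner for row in A):
        raise ValueError("columns of A must match rows of B")
    columns_of_B = list(zip(*B))
    return [
        [sum(a * b for a, b in zip(row, column)) for column in columns_of_B]
        for row in A
    ]


A = [[0.0, 1.0, 2.0],
     [3.0, 4.0, 5.0]]                        # 2 x 3
B = [[1.0] * 4 for _ in range(3)]            # 3 x 4, all ones
C = matmul(A, B)                             # 2 x 4
print(C)
```

Because B consists of ones, every entry in row i of the result is simply the sum of row i of A.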

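The term matrix–matrix multiplication is often simplified to matrix multiplication, and should not be confused with the Hadamard product.

2.3.11. Norms

Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big it is. For instance, the \(\ell_2\) norm measures the (Euclidean) length of a vector. Here, we are employing a notion of size that concerns the magnitude of a vector's components (not its dimensionality).

A norm is a function \(\| \cdot \|\) that maps a vector to a scalar and satisfies the following three properties:

Given any vector \(\mathbf{x}\), if we scale (all elements of) the vector by a scalar \(\alpha \in \mathbb{R}\), its norm scales by the factor \(|\alpha|\):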

(2.3.10)¶\[\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|.\]
For any vectors \(\mathbf{x}\) and \(\mathbf{y}\), norms satisfy the triangle inequality:

(2.3.11)¶\[\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|.\]
The norm of a vector is nonnegative and it only vanishes if the vector is zero:

(2.3.12)¶\[\|\mathbf{x}\| > 0 \textrm{ for all } \mathbf{x} \neq 0.\]

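Many functions are valid norms and different norms encode different notions of size. The Euclidean norm that we all learned in elementary school geometry when calculating the hypotenuse of a right triangle is the square root of the sum of squares of a vector's elements. Formally, this is called the \(\ell_2\) norm and expressed as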

(2.3.13)¶\[\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.\]

The method norm calculates the \(\ell_2\) norm.

pytorch mxnet jax tensorflow

u = torch.tensor([3.0, -4.0])
torch.norm(u)

tensor(5.)

u = np.array([3, -4])
np.linalg.norm(u)

array(5.)

u = jnp.array([3.0, -4.0])
jnp.linalg.norm(u)

Array(5., dtype=float32)

u = tf.constant([3.0, -4.0])
tf.norm(u)

<tf.Tensor: shape=(), dtype=float32, numpy=5.0>

The \(\ell_1\) norm is also common and the associated measure is called the Manhattan distance. By definition, the \(\ell_1\) norm sums the absolute values of a vector's elements:

(2.3.14)¶\[\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.\]

Compared to the \(\ell_2\) norm, it is less sensitive to outliers. To compute the \(\ell_1\) norm, we compose the absolute value with the sum operation.

pytorch mxnet jax tensorflow

torch.abs(u).sum()

tensor(7.)

np.abs(u).sum()

array(7.)

jnp.linalg.norm(u, ord=1)  # Same as jnp.abs(u).sum()

tf.reduce_sum(tf.abs(u))

<tf.Tensor: shape=(), dtype=float32, numpy=7.0>

Both the \(\ell_2\) and \(\ell_1\) norms are special cases of the more general \(\ell_p\) norms:

(2.3.15)¶\[\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.\]
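The general \(\ell_p\) formula is short enough to implement directly. The sketch below, with the hypothetical helper `p_norm`, recovers the \(\ell_1\) and \(\ell_2\) values for the vector used earlier and spot-checks two of the norm properties on sample vectors. Such a check illustrates the properties; it does not prove them.

```python
def p_norm(x, p):
    """Return (sum_i |x_i|^p)^(1/p); assumes p >= 1 so that this is a norm."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(value) ** p for value in x) ** (1 / p)


u = [3.0, -4.0]
l1 = p_norm(u, 1)   # |3| + |-4|
l2 = p_norm(u, 2)   # sqrt(3^2 + 4^2)

# Scaling by alpha should scale the norm by |alpha|.
alpha = -3.0
scaled = p_norm([alpha * value for value in u], 2)
scaling_holds = abs(scaled - abs(alpha) * l2) < 1e-9

# Triangle inequality on a second sample vector v.
v = [1.0, 2.0]
summed = p_norm([a + b for a, b in zip(u, v)], 2)
triangle_holds = summed <= l2 + p_norm(v, 2) + 1e-12
print(l1, l2, scaling_holds, triangle_holds)
```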

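In the case of matrices, matters are more complicated. After all, matrices can be viewed both as collections of individual entries and as objects that operate on vectors and transform them into other vectors. For instance, we can ask by how much longer the matrix–vector product \(\mathbf{X} \mathbf{v}\) could be relative to \(\mathbf{v}\). This line of thought leads to what is called the spectral norm. For now, we introduce the Frobenius norm, which is much easier to compute and defined as the square root of the sum of the squares of a matrix's elements: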

(2.3.16)¶\[\|\mathbf{X}\|_\textrm{F} = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.\]

The Frobenius norm behaves as if it were an \(\ell_2\) norm of a matrix-shaped vector. Invoking the following function will calculate the Frobenius norm of a matrix.

pytorch mxnet jax tensorflow

torch.norm(torch.ones((4, 9)))

tensor(6.)

np.linalg.norm(np.ones((4, 9)))

array(6.)

jnp.linalg.norm(jnp.ones((4, 9)))

Array(6., dtype=float32)

tf.norm(tf.ones((4, 9)))

<tf.Tensor: shape=(), dtype=float32, numpy=6.0>
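The claim that the Frobenius norm acts like an \(\ell_2\) norm of the flattened matrix can be checked with a small framework-free sketch. The helper names `frobenius_norm` and `l2_norm` are assumptions made for this illustration.

```python
import math


def frobenius_norm(X):
    """Square root of the sum of squares of all matrix entries."""
    return math.sqrt(sum(value * value for row in X for value in row))


def l2_norm(x):
    """Euclidean length of a vector."""
    return math.sqrt(sum(value * value for value in x))


X = [[1.0] * 9 for _ in range(4)]                 # 4 x 9 matrix of ones
flattened = [value for row in X for value in row]  # the same 36 entries as one vector

fro = frobenius_norm(X)
flat_l2 = l2_norm(flattened)
print(fro, flat_l2)
```

With 36 entries equal to one, both quantities are the square root of 36.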

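While we do not want to get too far ahead of ourselves, we can already plant some intuition about why these concepts are useful. In deep learning, we are often trying to solve optimization problems: maximize the probability assigned to observed data; maximize the revenue associated with a recommender model; minimize the distance between predictions and the ground truth observations; minimize the distance between representations of photos of the same person while maximizing the distance between representations of photos of different people. These distances, which constitute the objectives of deep learning algorithms, are often expressed as norms.

2.3.12. Discussion

In this section, we have reviewed all the linear algebra that you will need to understand a significant chunk of modern deep learning. There is a lot more to linear algebra, though, and much of it is useful for machine learning. For example, matrices can be decomposed into factors, and these decompositions can reveal low-dimensional structure in real-world datasets. There are entire subfields of machine learning that focus on using matrix decompositions and their generalizations to high-order tensors to discover structure in datasets and solve prediction problems. But this book focuses on deep learning. And we believe you will be more inclined to learn more mathematics once you have gotten your hands dirty applying machine learning to real datasets. So while we reserve the right to introduce more mathematics later on, we wrap up this section here.

If you are eager to learn more linear algebra, there are many excellent books and online resources. For a more advanced crash course, consider checking out Strang (1993), Kolter (2008), and Petersen and Pedersen (2008).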

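To recap:

Scalars, vectors, matrices, and tensors are the basic mathematical objects used in linear algebra and have zero, one, two, and an arbitrary number of axes, respectively.
Tensors can be sliced or reduced along specified axes via indexing, or operations such as sum and mean, respectively.
Elementwise products are called Hadamard products. By contrast, dot products, matrix–vector products, and matrix–matrix products are not elementwise operations and in general return objects having shapes that are different from the operands.
Compared to Hadamard products, matrix–matrix products take considerably longer to compute (cubic rather than quadratic time).
Norms capture various notions of the magnitude of a vector (or matrix), and are commonly applied to the difference of two vectors to measure their distance apart.
Common vector norms include the \(\ell_1\) and \(\ell_2\) norms, and common matrix norms include the spectral and Frobenius norms.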
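The "cubic rather than quadratic" remark in the recap can be made concrete by counting scalar multiplications for two \(n \times n\) matrices. The sketch below assumes the naive schoolbook algorithm for the matrix product (optimized libraries organize the work differently), and the function names are hypothetical.

```python
def hadamard_multiplications(n):
    """Count multiplications for the Hadamard product of two n x n matrices."""
    count = 0
    for _ in range(n):          # every row
        for _ in range(n):      # every column: one multiplication per entry
            count += 1
    return count


def matmul_multiplications(n):
    """Count multiplications for the naive product of two n x n matrices."""
    count = 0
    for _ in range(n):          # every row of the result
        for _ in range(n):      # every column of the result
            for _ in range(n):  # one multiplication per term of the dot product
                count += 1
    return count


for n in (10, 20):
    print(n, hadamard_multiplications(n), matmul_multiplications(n))
```

Doubling n multiplies the Hadamard count by four but the naive matrix product count by eight.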

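2.3.13. Exercises

Prove that the transpose of the transpose of a matrix is the matrix itself: \((\mathbf{A}^\top)^\top = \mathbf{A}\).
Given two matrices \(\mathbf{A}\) and \(\mathbf{B}\), show that sum and transposition commute: \(\mathbf{A}^\top + \mathbf{B}^\top = (\mathbf{A} + \mathbf{B})^\top\).
Given any square matrix \(\mathbf{A}\), is \(\mathbf{A} + \mathbf{A}^\top\) always symmetric? Can you prove the result by using only the results of the previous two exercises?
We defined the tensor X of shape (2, 3, 4) in this section. What is the output of len(X)? Write your answer without implementing any code, then check your answer using code.
For a tensor X of arbitrary shape, does len(X) always correspond to the length of a certain axis of X? What is that axis?
Run A / A.sum(axis=1) and see what happens. Can you analyze the results?
When traveling between two points in downtown Manhattan, what is the distance that you need to cover in terms of the coordinates, i.e., in terms of avenues and streets? Can you travel diagonally?
Consider a tensor of shape (2, 3, 4). What are the shapes of the summation outputs along axes 0, 1, and 2?
Feed a tensor with three or more axes to the linalg.norm function and observe its output. What does this function compute for tensors of arbitrary shape?
Consider three large matrices, say \(\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}\), \(\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}\) and \(\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{14}}\), initialized with Gaussian random variables. You want to compute the product \(\mathbf{A} \mathbf{B} \mathbf{C}\). Is there any difference in memory footprint and speed, depending on whether you compute \((\mathbf{A} \mathbf{B}) \mathbf{C}\) or \(\mathbf{A} (\mathbf{B} \mathbf{C})\)? Why?
Consider three large matrices, say \(\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}\), \(\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}\) and \(\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{16}}\). Is there any difference in speed depending on whether you compute \(\mathbf{A} \mathbf{B}\) or \(\mathbf{A} \mathbf{C}^\top\)? Why? What changes if you initialize \(\mathbf{C} = \mathbf{B}^\top\) without cloning memory? Why?
Consider three matrices, say \(\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{100 \times 200}\). Construct a tensor with three axes by stacking \([\mathbf{A}, \mathbf{B}, \mathbf{C}]\). What is the dimensionality? Slice out the second coordinate of the third axis to recover \(\mathbf{B}\). Check that your answer is correct.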