22.8. 分布¶

これまで、離散的な場合と連続的な場合の両方で確率を扱う方法を学んできた。ここでは、よく出会う代表的な分布について見ていこう。機械学習の分野によっては、これらよりはるかに多くの分布に精通している必要があるかもしれないし、深層学習のある分野ではまったく不要なこともある。しかし、基本的な一覧として知っておくのは有益である。まず、いくつかの共通ライブラリをインポートしよう。

pytorch mxnet jax tensorflow

%matplotlib inline
from d2l import torch as d2l
from IPython import display
from math import erf, factorial
import torch

torch.pi = torch.acos(torch.zeros(1)) * 2  # torchでpiを定義する

%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from math import erf, factorial
import numpy as np

p = 0.3

d2l.set_figsize()
d2l.plt.stem([0, 1], [1 - p, p], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

%matplotlib inline
from d2l import tensorflow as d2l
from IPython import display
from math import erf, factorial
import tensorflow as tf
import tensorflow_probability as tfp

tf.pi = tf.acos(tf.zeros(1)) * 2  # TensorFlowでpiを定義する

22.8.1. ベルヌーイ分布¶

通常最初に出会う最も単純な確率変数である。この確率変数は、コイン投げを表し、\(1\) が確率 \(p\)、\(0\) が確率 \(1-p\) で出ることを表す。この分布に従う確率変数 \(X\) は

(22.8.1)¶\[X \sim \textrm{Bernoulli}(p).\]

と書く。

累積分布関数は

(22.8.2)¶\[\begin{split}F(x) = \begin{cases} 0 & x < 0, \\ 1-p & 0 \le x < 1, \\ 1 & x >= 1 . \end{cases}\end{split}\]

である。

確率質量関数を以下に描く。

pytorch mxnet jax tensorflow

p = 0.3

d2l.set_figsize()
d2l.plt.stem([0, 1], [1 - p, p], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

../_images/output_distributions_1214ac_18_0.svg

p = 0.3

d2l.set_figsize()
d2l.plt.stem([0, 1], [1 - p, p], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

../_images/output_distributions_1214ac_21_0.svg

n = 5

d2l.plt.stem([i+1 for i in range(n)], n*[1 / n], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

p = 0.3

d2l.set_figsize()
d2l.plt.stem([0, 1], [1 - p, p], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

../_images/output_distributions_1214ac_27_0.svg

では、累積分布関数 (22.8.2) を描いてみよう。

pytorch mxnet tensorflow

x = torch.arange(-1, 2, 0.01)

def F(x):
    return 0 if x < 0 else 1 if x > 1 else 1 - p

d2l.plot(x, torch.tensor([F(y) for y in x]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_33_0.svg

x = np.arange(-1, 2, 0.01)

def F(x):
    return 0 if x < 0 else 1 if x > 1 else 1 - p

d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_36_0.svg

x = tf.range(-1, 2, 0.01)

def F(x):
    return 0 if x < 0 else 1 if x > 1 else 1 - p

d2l.plot(x, tf.constant([F(y) for y in x]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_39_0.svg

\(X \sim \textrm{Bernoulli}(p)\) ならば、

\(\mu_X = p\)、
\(\sigma_X^2 = p(1-p)\)。

ベルヌーイ確率変数から任意の形状の配列をサンプルするには、次のようにする。

pytorch mxnet tensorflow

1*(torch.rand(10, 10) < p)

tensor([[0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
        [0, 1, 1, 1, 0, 0, 1, 1, 1, 0],
        [0, 0, 1, 0, 0, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
        [0, 1, 1, 0, 1, 1, 1, 1, 0, 1],
        [1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
        [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]])

1*(np.random.rand(10, 10) < p)

array([[0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
       [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 1, 1, 0, 1, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]])

tf.cast(tf.random.uniform((10, 10)) < p, dtype=tf.float32)

<tf.Tensor: shape=(10, 10), dtype=float32, numpy=
array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 1., 1., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 1., 1., 0., 0., 1.],
       [0., 0., 0., 1., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 1., 0., 0., 0.]], dtype=float32)>

22.8.2. 離散一様分布¶

次によく現れる確率変数は離散一様分布である。ここでは、\(\{1, 2, \ldots, n\}\) の整数上に台を持つものを考えるが、他の任意の値集合を自由に選んでも構いない。この文脈で一様というのは、取りうるすべての値が等確率であることを意味する。各値 \(i \in \{1, 2, 3, \ldots, n\}\) の確率は \(p_i = \frac{1}{n}\) である。この分布に従う確率変数 \(X\) は

(22.8.3)¶\[X \sim U(n).\]

と表す。

累積分布関数は

(22.8.4)¶\[\begin{split}F(x) = \begin{cases} 0 & x < 1, \\ \frac{k}{n} & k \le x < k+1 \textrm{ with } 1 \le k < n, \\ 1 & x >= n . \end{cases}\end{split}\]

である。

まず確率質量関数を描きよう。

n = 5

d2l.plt.stem([i+1 for i in range(n)], n*[1 / n], use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

../_images/output_distributions_1214ac_55_0.svg

では、累積分布関数 (22.8.4) を描いてみよう。

pytorch mxnet tensorflow

x = torch.arange(-1, 6, 0.01)

def F(x):
    return 0 if x < 1 else 1 if x > n else torch.floor(x) / n

d2l.plot(x, torch.tensor([F(y) for y in x]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_59_0.svg

x = np.arange(-1, 6, 0.01)

def F(x):
    return 0 if x < 1 else 1 if x > n else np.floor(x) / n

d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_62_0.svg

x = tf.range(-1, 6, 0.01)

def F(x):
    return 0 if x < 1 else 1 if x > n else tf.floor(x) / n

d2l.plot(x, [F(y) for y in x], 'x', 'c.d.f.')

../_images/output_distributions_1214ac_65_0.svg

\(X \sim U(n)\) ならば、

\(\mu_X = \frac{1+n}{2}\)、
\(\sigma_X^2 = \frac{n^2-1}{12}\)。

離散一様確率変数から任意の形状の配列をサンプルするには、次のようにする。

pytorch mxnet tensorflow

torch.randint(1, n, size=(10, 10))

tensor([[3, 3, 4, 2, 1, 4, 3, 3, 3, 1],
        [2, 3, 4, 4, 1, 1, 4, 3, 4, 2],
        [1, 2, 1, 2, 2, 3, 2, 4, 3, 3],
        [3, 4, 1, 4, 2, 3, 1, 2, 3, 3],
        [1, 2, 4, 1, 2, 4, 1, 3, 2, 1],
        [3, 2, 2, 1, 2, 4, 2, 4, 2, 1],
        [4, 2, 2, 1, 3, 4, 3, 3, 1, 4],
        [4, 4, 4, 1, 1, 3, 3, 2, 3, 4],
        [4, 3, 3, 3, 4, 4, 4, 2, 2, 4],
        [2, 2, 2, 2, 1, 2, 2, 1, 4, 2]])

np.random.randint(1, n, size=(10, 10))

array([[2, 2, 2, 3, 4, 1, 4, 2, 1, 2],
       [1, 1, 3, 3, 4, 1, 2, 4, 3, 1],
       [4, 2, 1, 3, 2, 2, 1, 4, 1, 2],
       [3, 4, 1, 1, 2, 4, 2, 4, 4, 1],
       [1, 2, 2, 1, 4, 1, 1, 1, 4, 4],
       [4, 4, 2, 3, 1, 2, 2, 2, 2, 1],
       [2, 3, 1, 4, 2, 3, 2, 4, 4, 3],
       [4, 1, 3, 1, 3, 2, 2, 4, 2, 1],
       [4, 3, 2, 4, 4, 3, 1, 1, 1, 2],
       [2, 2, 1, 4, 1, 2, 4, 1, 1, 1]])

tf.random.uniform((10, 10), 1, n, dtype=tf.int32)

<tf.Tensor: shape=(10, 10), dtype=int32, numpy=
array([[4, 1, 3, 1, 4, 2, 3, 3, 3, 2],
       [4, 3, 1, 2, 4, 4, 3, 1, 4, 4],
       [4, 1, 4, 3, 3, 1, 4, 3, 4, 4],
       [4, 2, 2, 2, 3, 3, 1, 3, 1, 4],
       [3, 1, 3, 3, 1, 3, 4, 4, 3, 2],
       [1, 1, 4, 1, 3, 1, 3, 4, 3, 1],
       [1, 4, 4, 1, 3, 4, 4, 2, 3, 4],
       [4, 3, 2, 3, 2, 2, 4, 1, 3, 1],
       [1, 3, 3, 1, 3, 4, 3, 4, 2, 3],
       [4, 2, 2, 2, 1, 3, 2, 2, 1, 3]], dtype=int32)>

22.8.3. 連続一様分布¶

次に、連続一様分布について説明しよう。この確率変数の考え方は、離散一様分布の \(n\) を増やし、それを区間 \([a, b]\) に収まるようにスケーリングすると、\([a, b]\) の中の任意の値をすべて等確率で選ぶ連続確率変数に近づく、というものである。この分布は

(22.8.5)¶\[X \sim U(a, b).\]

と表す。

確率密度関数は

(22.8.6)¶\[\begin{split}p(x) = \begin{cases} \frac{1}{b-a} & x \in [a, b], \\ 0 & x \not\in [a, b].\end{cases}\end{split}\]

である。

累積分布関数は

(22.8.7)¶\[\begin{split}F(x) = \begin{cases} 0 & x < a, \\ \frac{x-a}{b-a} & x \in [a, b], \\ 1 & x >= b . \end{cases}\end{split}\]

である。

まず確率密度関数 (22.8.6) を描きよう。

pytorch mxnet tensorflow

a, b = 1, 3

x = torch.arange(0, 4, 0.01)
p = (x > a).type(torch.float32)*(x < b).type(torch.float32)/(b-a)
d2l.plot(x, p, 'x', 'p.d.f.')

../_images/output_distributions_1214ac_83_0.svg

a, b = 1, 3

x = np.arange(0, 4, 0.01)
p = (x > a)*(x < b)/(b - a)

d2l.plot(x, p, 'x', 'p.d.f.')

../_images/output_distributions_1214ac_86_0.svg

a, b = 1, 3

x = tf.range(0, 4, 0.01)
p = tf.cast(x > a, tf.float32) * tf.cast(x < b, tf.float32) / (b - a)
d2l.plot(x, p, 'x', 'p.d.f.')

../_images/output_distributions_1214ac_89_0.svg

では、累積分布関数 (22.8.7) を描いてみよう。

pytorch mxnet tensorflow

def F(x):
    return 0 if x < a else 1 if x > b else (x - a) / (b - a)

d2l.plot(x, torch.tensor([F(y) for y in x]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_95_0.svg

def F(x):
    return 0 if x < a else 1 if x > b else (x - a) / (b - a)

d2l.plot(x, np.array([F(y) for y in x]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_98_0.svg

def F(x):
    return 0 if x < a else 1 if x > b else (x - a) / (b - a)

d2l.plot(x, [F(y) for y in x], 'x', 'c.d.f.')

../_images/output_distributions_1214ac_101_0.svg

\(X \sim U(a, b)\) ならば、

\(\mu_X = \frac{a+b}{2}\)、
\(\sigma_X^2 = \frac{(b-a)^2}{12}\)。

一様確率変数から任意の形状の配列をサンプルするには、次のようにする。デフォルトでは \(U(0,1)\) からサンプルされるので、別の範囲が欲しい場合はスケーリングする必要がある。

pytorch mxnet tensorflow

(b - a) * torch.rand(10, 10) + a

tensor([[2.0679, 2.4100, 2.5762, 1.8499, 1.0621, 1.9996, 2.3631, 2.6283, 1.6981,
         1.3658],
        [2.2581, 2.8063, 2.5712, 2.9981, 1.2781, 1.3680, 2.5652, 2.8932, 1.1143,
         1.6694],
        [2.9813, 2.2677, 2.7297, 1.4408, 2.9802, 2.4943, 2.5363, 1.6275, 1.4729,
         2.5822],
        [2.9420, 2.5521, 2.1417, 1.2070, 1.2284, 2.7954, 2.9322, 1.7272, 2.8715,
         1.5982],
        [2.3660, 1.6054, 2.5989, 2.2134, 1.2444, 2.1781, 1.5832, 2.1513, 1.8993,
         2.8935],
        [1.7960, 2.7038, 1.1465, 1.3827, 1.9866, 2.1409, 2.0720, 1.9597, 2.5218,
         2.1478],
        [1.8626, 2.3913, 2.0165, 2.2057, 1.9050, 2.0541, 1.1899, 2.6062, 2.8070,
         1.3668],
        [2.4310, 1.1270, 1.9386, 2.7942, 1.9744, 2.6841, 2.5107, 1.2452, 2.3955,
         1.8620],
        [2.0176, 2.9634, 1.2320, 2.4971, 1.4820, 2.8725, 2.6219, 2.1171, 2.1623,
         1.8533],
        [2.1182, 1.2056, 1.6436, 1.7542, 1.7792, 2.2062, 2.9191, 1.1343, 2.0459,
         1.9995]])

(b - a) * np.random.rand(10, 10) + a

array([[1.10493389, 2.18332184, 2.55880868, 2.40644506, 1.98823581,
        2.03304566, 2.54314285, 2.28893216, 1.96832242, 1.21947864],
       [1.00116734, 1.87327363, 2.06307085, 1.37704206, 2.98520983,
        2.85361327, 2.614708  , 1.67971392, 2.31319471, 2.08720314],
       [2.9417875 , 2.58942767, 1.36652719, 2.76520312, 1.6291505 ,
        2.57007544, 2.18738455, 1.08755762, 1.69884709, 1.9310117 ],
       [1.82504201, 1.80920424, 1.52510532, 2.17378476, 1.211283  ,
        2.15483981, 2.14476987, 1.30869007, 2.7514197 , 2.93051749],
       [1.46794909, 1.1367775 , 1.35302173, 2.6377266 , 2.43745649,
        2.98864716, 1.26447563, 1.48047024, 2.08178225, 1.32105626],
       [1.25845706, 2.03397717, 2.73034409, 2.70469505, 2.34734372,
        1.74027474, 2.96683054, 1.37055974, 1.53226889, 2.36267969],
       [2.94902185, 1.77360701, 1.30736801, 2.21610289, 2.60674295,
        1.97389815, 1.96275176, 2.9917009 , 2.66415546, 2.43568684],
       [1.41549567, 2.78340407, 1.49017239, 1.92110928, 2.44749996,
        1.94183467, 2.55685336, 2.06993835, 1.62822726, 1.99831404],
       [2.4445657 , 2.44027101, 2.11419221, 1.94853489, 1.81220577,
        1.74003385, 1.05746194, 1.94546557, 2.53363203, 1.74536491],
       [2.2648276 , 2.99783037, 2.15039751, 1.75704603, 1.46434226,
        2.05093236, 1.80039954, 1.37730793, 2.03797871, 1.62589064]])

(b - a) * tf.random.uniform((10, 10)) + a

<tf.Tensor: shape=(10, 10), dtype=float32, numpy=
array([[1.5681243, 1.8772495, 2.4839144, 1.5758138, 1.51053  , 2.5540862,
        1.1616733, 2.970998 , 2.3602724, 1.4549215],
       [1.1370218, 1.2265592, 1.267734 , 2.7446856, 1.3154321, 2.1516864,
        2.9253604, 1.2757335, 1.1784961, 1.4303627],
       [1.3996754, 1.6577711, 2.4583366, 2.9051504, 2.5880024, 1.2249057,
        2.5468373, 1.6416607, 2.4872744, 2.6817133],
       [2.1866   , 2.4692721, 2.0441496, 1.6734836, 2.4291286, 2.7591374,
        2.8040156, 2.3101315, 2.9683428, 2.2392678],
       [1.4663892, 2.7784042, 2.6934195, 2.1465902, 1.9948542, 1.8531013,
        2.4695644, 1.7235866, 1.9122441, 1.3478694],
       [1.7405694, 1.8584762, 1.3478763, 1.194274 , 1.7994637, 2.0353231,
        1.4105015, 2.9599195, 1.3373148, 1.2692168],
       [2.0873346, 2.6769884, 2.5507674, 2.666769 , 2.8097057, 2.0322669,
        1.9368191, 2.791131 , 2.3000367, 1.9854867],
       [1.270735 , 1.7274289, 1.680018 , 1.3198214, 2.4480646, 2.0401797,
        2.9680197, 1.4028776, 1.9465854, 2.1036496],
       [2.446667 , 1.9727476, 2.4787436, 1.896085 , 1.8473601, 2.78446  ,
        1.8548636, 2.0098755, 2.0615952, 1.9896467],
       [1.8838322, 2.0051875, 2.7124982, 2.7813394, 1.4981465, 2.5291965,
        1.2720184, 2.5645955, 1.8139408, 2.5306706]], dtype=float32)>

22.8.4. 二項分布¶

少し複雑にして、二項確率変数を見てみよう。この確率変数は、成功確率 \(p\) の独立な試行を \(n\) 回行い、成功が何回起こるかを数えることから生まれる。

これを数学的に表しよう。各試行は独立な確率変数 \(X_i\) であり、成功を \(1\)、失敗を \(0\) で表す。それぞれが成功確率 \(p\) の独立なコイン投げなので、\(X_i \sim \textrm{Bernoulli}(p)\) と書ける。すると、二項確率変数は

(22.8.8)¶\[X = \sum_{i=1}^n X_i.\]

である。

このとき、

(22.8.9)¶\[X \sim \textrm{Binomial}(n, p).\]

と書く。

累積分布関数を得るには、ちょうど \(k\) 回成功する場合が \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) 通りあり、それぞれが確率 \(p^k(1-p)^{n-k}\) で起こることに注意する必要がある。したがって、累積分布関数は

(22.8.10)¶\[\begin{split}F(x) = \begin{cases} 0 & x < 0, \\ \sum_{m \le k} \binom{n}{m} p^m(1-p)^{n-m} & k \le x < k+1 \textrm{ with } 0 \le k < n, \\ 1 & x >= n . \end{cases}\end{split}\]

である。

まず確率質量関数を描きよう。

pytorch mxnet tensorflow

n, p = 10, 0.2

# 二項係数を計算する
def binom(n, k):
    comb = 1
    for i in range(min(k, n - k)):
        comb = comb * (n - i) // (i + 1)
    return comb

pmf = d2l.tensor([p**i * (1-p)**(n - i) * binom(n, i) for i in range(n + 1)])

d2l.plt.stem([i for i in range(n + 1)], pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

../_images/output_distributions_1214ac_119_0.svg

n, p = 10, 0.2

# 二項係数を計算する
def binom(n, k):
    comb = 1
    for i in range(min(k, n - k)):
        comb = comb * (n - i) // (i + 1)
    return comb

pmf = np.array([p**i * (1-p)**(n - i) * binom(n, i) for i in range(n + 1)])

d2l.plt.stem([i for i in range(n + 1)], pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

../_images/output_distributions_1214ac_122_0.svg

n, p = 10, 0.2

# 二項係数を計算する
def binom(n, k):
    comb = 1
    for i in range(min(k, n - k)):
        comb = comb * (n - i) // (i + 1)
    return comb

pmf = tf.constant([p**i * (1-p)**(n - i) * binom(n, i) for i in range(n + 1)])

d2l.plt.stem([i for i in range(n + 1)], pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

../_images/output_distributions_1214ac_125_0.svg

では、累積分布関数 (22.8.10) を描いてみよう。

pytorch mxnet tensorflow

x = torch.arange(-1, 11, 0.01)
cmf = torch.cumsum(pmf, dim=0)

def F(x):
    return 0 if x < 0 else 1 if x > n else cmf[int(x)]

d2l.plot(x, torch.tensor([F(y) for y in x.tolist()]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_131_0.svg

x = np.arange(-1, 11, 0.01)
cmf = np.cumsum(pmf)

def F(x):
    return 0 if x < 0 else 1 if x > n else cmf[int(x)]

d2l.plot(x, np.array([F(y) for y in x.tolist()]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_134_0.svg

x = tf.range(-1, 11, 0.01)
cmf = tf.cumsum(pmf)

def F(x):
    return 0 if x < 0 else 1 if x > n else cmf[int(x)]

d2l.plot(x, [F(y) for y in x.numpy().tolist()], 'x', 'c.d.f.')

../_images/output_distributions_1214ac_137_0.svg

\(X \sim \textrm{Binomial}(n, p)\) ならば、

\(\mu_X = np\)、
\(\sigma_X^2 = np(1-p)\)。

これは、\(n\) 個のベルヌーイ確率変数の和に対する期待値の線形性と、独立な確率変数の和の分散が分散の和になることから従う。次のようにサンプルできる。

pytorch mxnet tensorflow

m = torch.distributions.binomial.Binomial(n, p)
m.sample(sample_shape=(10, 10))

tensor([[2., 2., 2., 1., 3., 1., 2., 2., 1., 2.],
        [3., 3., 2., 1., 2., 4., 1., 3., 1., 2.],
        [1., 2., 4., 1., 4., 4., 1., 0., 0., 1.],
        [1., 1., 0., 4., 2., 2., 2., 1., 2., 2.],
        [2., 3., 2., 1., 0., 2., 4., 3., 1., 2.],
        [1., 1., 5., 2., 3., 2., 3., 1., 3., 2.],
        [1., 6., 2., 1., 0., 3., 4., 4., 2., 4.],
        [2., 3., 2., 1., 2., 2., 3., 2., 2., 3.],
        [2., 1., 5., 0., 1., 3., 1., 2., 2., 5.],
        [0., 2., 5., 0., 0., 2., 3., 3., 0., 2.]])

np.random.binomial(n, p, size=(10, 10))

array([[5, 3, 2, 1, 3, 1, 1, 1, 3, 1],
       [2, 2, 2, 0, 1, 1, 1, 3, 0, 3],
       [1, 2, 3, 2, 1, 1, 2, 2, 2, 1],
       [1, 1, 1, 2, 2, 2, 0, 3, 4, 1],
       [2, 1, 2, 3, 1, 2, 1, 5, 2, 4],
       [4, 3, 2, 4, 0, 1, 2, 4, 3, 2],
       [1, 3, 1, 4, 1, 0, 2, 2, 3, 0],
       [2, 3, 1, 1, 3, 4, 3, 2, 1, 3],
       [5, 1, 4, 2, 2, 2, 0, 0, 2, 4],
       [3, 1, 3, 2, 0, 6, 2, 0, 2, 1]])

m = tfp.distributions.Binomial(n, p)
m.sample(sample_shape=(10, 10))

WARNING:tensorflow:From /home/ci/.local/lib/python3.10/site-packages/tensorflow_probability/python/internal/batched_rejection_sampler.py:102: calling while_loop_v2 (from tensorflow.python.ops.control_flow_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.while_loop(c, b, vars, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.while_loop(c, b, vars))

<tf.Tensor: shape=(10, 10), dtype=float32, numpy=
array([[4., 6., 7., 5., 6., 6., 5., 6., 6., 5.],
       [6., 7., 6., 6., 6., 4., 7., 3., 4., 7.],
       [5., 5., 6., 6., 6., 5., 6., 5., 5., 3.],
       [6., 4., 6., 7., 8., 4., 8., 5., 6., 7.],
       [5., 8., 6., 3., 6., 7., 4., 6., 7., 7.],
       [7., 8., 4., 5., 4., 6., 5., 5., 9., 5.],
       [7., 5., 4., 6., 9., 7., 8., 7., 5., 6.],
       [6., 6., 5., 6., 5., 4., 4., 6., 5., 8.],
       [7., 5., 5., 3., 6., 6., 6., 4., 8., 7.],
       [7., 3., 7., 5., 7., 7., 7., 4., 5., 6.]], dtype=float32)>

22.8.5. ポアソン分布¶

では、思考実験をしてみよう。バス停に立っていて、次の1分間に何台のバスが到着するかを知りたいとする。まず、\(X^{(1)} \sim \textrm{Bernoulli}(p)\) を考える。、1分間の窓の中でバスが到着する確率そのものである。都市中心部から離れたバス停では、かなり良い近似かもしれない。1分間に2台以上のバスを見ることはまずないだろう。

しかし、混雑した地域にいるなら、2台のバスが到着することもありえますし、むしろ起こりやすいかもしれない。これを、最初の30秒と次の30秒に分けて確率変数を分割することでモデル化できる。この場合、

(22.8.11)¶\[X^{(2)} \sim X^{(2)}_1 + X^{(2)}_2,\]

と書ける。ここで \(X^{(2)}\) は合計であり、\(X^{(2)}_i \sim \textrm{Bernoulli}(p/2)\) である。すると全体の分布は \(X^{(2)} \sim \textrm{Binomial}(2, p/2)\) になる。

ここで止める必要はあるだろうか。さらにその1分を \(n\) 個に分割してみよう。上と同じ推論により、

(22.8.12)¶\[X^{(n)} \sim \textrm{Binomial}(n, p/n).\]

が得られる。

これらの確率変数を考える。前節より、(22.8.12) の平均は \(\mu_{X^{(n)}} = n(p/n) = p\)、分散は \(\sigma_{X^{(n)}}^2 = n(p/n)(1-(p/n)) = p(1-p/n)\) である。\(n \rightarrow \infty\) とすると、これらの値は \(\mu_{X^{(\infty)}} = p\)、分散 \(\sigma_{X^{(\infty)}}^2 = p\) に安定することがわかる。、この無限分割の極限で定義できる確率変数が 存在するかもしれない ことを示している。

現実世界ではバスの到着回数を数えればよいだけなので、それほど驚くことではないが、数学的モデルがきちんと定義されていることを見るのは良いことである。この議論は 稀な事象の法則 として形式化できる。

この推論を丁寧にたどると、次のモデルに到達できる。\(X \sim \textrm{Poisson}(\lambda)\) とは、\(\{0,1,2, \ldots\}\) の値を確率

(22.8.13)¶\[p_k = \frac{\lambda^ke^{-\lambda}}{k!}.\]

でとる確率変数のことである。

\(\lambda > 0\) は率（あるいは形状パラメータ）と呼ばれ、1単位時間あたりに期待される到着回数を表す。

この確率質量関数を和をとって累積分布関数を得ることができる。

(22.8.14)¶\[\begin{split}F(x) = \begin{cases} 0 & x < 0, \\ e^{-\lambda}\sum_{m = 0}^k \frac{\lambda^m}{m!} & k \le x < k+1 \textrm{ with } 0 \le k. \end{cases}\end{split}\]

まず確率質量関数 (22.8.13) を描きよう。

pytorch mxnet tensorflow

lam = 5.0

xs = [i for i in range(20)]
pmf = torch.tensor([torch.exp(torch.tensor(-lam)) * lam**k
                    / factorial(k) for k in xs])

d2l.plt.stem(xs, pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

../_images/output_distributions_1214ac_155_0.svg

lam = 5.0

xs = [i for i in range(20)]
pmf = np.array([np.exp(-lam) * lam**k / factorial(k) for k in xs])

d2l.plt.stem(xs, pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

../_images/output_distributions_1214ac_158_0.svg

lam = 5.0

xs = [i for i in range(20)]
pmf = tf.constant([tf.exp(tf.constant(-lam)).numpy() * lam**k
                    / factorial(k) for k in xs])

d2l.plt.stem(xs, pmf, use_line_collection=True)
d2l.plt.xlabel('x')
d2l.plt.ylabel('p.m.f.')
d2l.plt.show()

../_images/output_distributions_1214ac_161_0.svg

では、累積分布関数 (22.8.14) を描いてみよう。

pytorch mxnet tensorflow

x = torch.arange(-1, 21, 0.01)
cmf = torch.cumsum(pmf, dim=0)
def F(x):
    return 0 if x < 0 else 1 if x > n else cmf[int(x)]

d2l.plot(x, torch.tensor([F(y) for y in x.tolist()]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_167_0.svg

x = np.arange(-1, 21, 0.01)
cmf = np.cumsum(pmf)
def F(x):
    return 0 if x < 0 else 1 if x > n else cmf[int(x)]

d2l.plot(x, np.array([F(y) for y in x.tolist()]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_170_0.svg

x = tf.range(-1, 21, 0.01)
cmf = tf.cumsum(pmf)
def F(x):
    return 0 if x < 0 else 1 if x > n else cmf[int(x)]

d2l.plot(x, [F(y) for y in x.numpy().tolist()], 'x', 'c.d.f.')

../_images/output_distributions_1214ac_173_0.svg

上で見たように、平均と分散は特に簡潔である。\(X \sim \textrm{Poisson}(\lambda)\) ならば、

\(\mu_X = \lambda\)、
\(\sigma_X^2 = \lambda\)。

次のようにサンプルできる。

pytorch mxnet tensorflow

m = torch.distributions.poisson.Poisson(lam)
m.sample((10, 10))

tensor([[ 1.,  7.,  6.,  3.,  1.,  4.,  3.,  5.,  6.,  5.],
        [ 3.,  5.,  4.,  5.,  8.,  8.,  5.,  2.,  6.,  8.],
        [ 6.,  7.,  7.,  6.,  8.,  5.,  6.,  6.,  4.,  4.],
        [ 4.,  6.,  5.,  3.,  3.,  4.,  8.,  8.,  5.,  5.],
        [ 3.,  7.,  7.,  6.,  6.,  3.,  8.,  8.,  7.,  3.],
        [ 7.,  4.,  4.,  8.,  5.,  3.,  4.,  5.,  5.,  3.],
        [ 4.,  7.,  3.,  2.,  5.,  9.,  4.,  6.,  4.,  8.],
        [ 5.,  5.,  4.,  4.,  4.,  4.,  3.,  7.,  2.,  6.],
        [ 4.,  8., 10.,  4.,  6.,  7.,  5., 10.,  5.,  5.],
        [ 2.,  5.,  4.,  5.,  7.,  4.,  2.,  4.,  5.,  8.]])

np.random.poisson(lam, size=(10, 10))

array([[ 5,  5,  6,  4,  2,  4,  8,  3,  4,  7],
       [ 0,  3, 11,  5,  3,  2,  7,  8,  3,  1],
       [ 6,  5,  4,  9,  7,  2,  1,  6,  6,  5],
       [ 6,  5,  5,  3,  6,  4,  5,  3,  6,  4],
       [ 6,  6,  5,  5,  5,  5,  5,  9,  7,  7],
       [ 5,  4,  4,  6,  2,  5,  4,  7,  7,  2],
       [ 1,  4,  7,  2,  4,  8,  1,  4,  6,  4],
       [ 3,  4,  6,  8,  8,  4,  3,  4,  5,  3],
       [ 4,  6,  6, 10,  6,  4,  6,  3,  8,  2],
       [ 2,  9,  8,  3,  7,  5,  6,  4,  5,  3]])

m = tfp.distributions.Poisson(lam)
m.sample((10, 10))

<tf.Tensor: shape=(10, 10), dtype=float32, numpy=
array([[ 5.,  4.,  3.,  4.,  3.,  6.,  3., 10.,  5.,  4.],
       [ 9.,  7.,  3.,  3.,  5.,  3.,  5.,  5.,  3.,  1.],
       [ 7.,  6.,  5.,  6.,  8.,  3.,  2.,  4.,  4.,  3.],
       [ 7.,  2.,  4.,  4.,  5.,  9.,  2.,  5.,  6.,  6.],
       [ 6.,  8.,  3.,  3.,  4.,  5.,  6.,  5.,  3.,  6.],
       [ 3.,  3.,  6.,  5.,  3.,  0.,  5.,  8.,  5.,  7.],
       [ 2.,  4.,  1.,  9.,  8.,  8.,  7.,  5.,  5.,  6.],
       [ 5.,  4.,  6., 10.,  4., 10.,  8.,  6.,  3.,  6.],
       [ 4.,  3.,  5.,  4.,  5.,  6.,  4.,  7.,  7.,  2.],
       [ 5.,  4.,  5.,  5.,  5.,  5.,  6.,  5.,  4.,  3.]], dtype=float32)>

22.8.6. ガウス分布¶

では、別の、しかし関連した実験をしてみよう。ここでも、\(n\) 個の独立な \(\textrm{Bernoulli}(p)\) 測定 \(X_i\) を行うとする。これらの和の分布は \(X^{(n)} \sim \textrm{Binomial}(n, p)\) である。\(n\) を増やし \(p\) を減らす極限をとる代わりに、\(p\) を固定して \(n \rightarrow \infty\) としてみよう。この場合、\(\mu_{X^{(n)}} = np \rightarrow \infty\) かつ \(\sigma_{X^{(n)}}^2 = np(1-p) \rightarrow \infty\) となるので、この極限がうまく定義されるとは考えにくい。

しかし、希望はまだある。平均と分散がうまく振る舞うように、次を定義しよう。

(22.8.15)¶\[Y^{(n)} = \frac{X^{(n)} - \mu_{X^{(n)}}}{\sigma_{X^{(n)}}}.\]

平均0、分散1を持つことがわかるので、何らかの極限分布に収束すると考えるのはもっともらしい。これらの分布がどのように見えるかを描いてみると、さらにそれがうまくいきそうだと確信できるだろう。

pytorch mxnet tensorflow

p = 0.2
ns = [1, 10, 100, 1000]
d2l.plt.figure(figsize=(10, 3))
for i in range(4):
    n = ns[i]
    pmf = torch.tensor([p**i * (1-p)**(n-i) * binom(n, i)
                        for i in range(n + 1)])
    d2l.plt.subplot(1, 4, i + 1)
    d2l.plt.stem([(i - n*p)/torch.sqrt(torch.tensor(n*p*(1 - p)))
                  for i in range(n + 1)], pmf,
                 use_line_collection=True)
    d2l.plt.xlim([-4, 4])
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('p.m.f.')
    d2l.plt.title("n = {}".format(n))
d2l.plt.show()

../_images/output_distributions_1214ac_191_0.svg

p = 0.2
ns = [1, 10, 100, 1000]
d2l.plt.figure(figsize=(10, 3))
for i in range(4):
    n = ns[i]
    pmf = np.array([p**i * (1-p)**(n-i) * binom(n, i) for i in range(n + 1)])
    d2l.plt.subplot(1, 4, i + 1)
    d2l.plt.stem([(i - n*p)/np.sqrt(n*p*(1 - p)) for i in range(n + 1)], pmf,
                 use_line_collection=True)
    d2l.plt.xlim([-4, 4])
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('p.m.f.')
    d2l.plt.title("n = {}".format(n))
d2l.plt.show()

../_images/output_distributions_1214ac_194_0.svg

p = 0.2
ns = [1, 10, 100, 1000]
d2l.plt.figure(figsize=(10, 3))
for i in range(4):
    n = ns[i]
    pmf = tf.constant([p**i * (1-p)**(n-i) * binom(n, i)
                        for i in range(n + 1)])
    d2l.plt.subplot(1, 4, i + 1)
    d2l.plt.stem([(i - n*p)/tf.sqrt(tf.constant(n*p*(1 - p)))
                  for i in range(n + 1)], pmf,
                 use_line_collection=True)
    d2l.plt.xlim([-4, 4])
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('p.m.f.')
    d2l.plt.title("n = {}".format(n))
d2l.plt.show()

../_images/output_distributions_1214ac_197_0.svg

注意すべき点が1つある。ポアソンの場合と比べると、ここでは標準偏差で割っているため、取りうる結果をより狭い領域に押し込めている。、この極限がもはや離散的ではなく、連続的になることを示している。

何が起こるかの導出はこの文書の範囲を超えるが、中心極限定理 によれば、\(n \rightarrow \infty\) のとき、ガウス分布（あるいは正規分布）を与える。より明示的には、任意の \(a, b\) に対して

(22.8.16)¶\[\lim_{n \rightarrow \infty} P(Y^{(n)} \in [a, b]) = P(\mathcal{N}(0,1) \in [a, b]),\]

が成り立ちる。ここで、確率変数が平均 \(\mu\)、分散 \(\sigma^2\) を持つ正規分布に従うとき、\(X \sim \mathcal{N}(\mu, \sigma^2)\) と書き、その密度は

(22.8.17)¶\[p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}.\]

である。

まず確率密度関数 (22.8.17) を描きよう。

pytorch mxnet tensorflow

mu, sigma = 0, 1

x = torch.arange(-3, 3, 0.01)
p = 1 / torch.sqrt(2 * torch.pi * sigma**2) * torch.exp(
    -(x - mu)**2 / (2 * sigma**2))

d2l.plot(x, p, 'x', 'p.d.f.')

../_images/output_distributions_1214ac_203_0.svg

mu, sigma = 0, 1

x = np.arange(-3, 3, 0.01)
p = 1 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-(x - mu)**2 / (2 * sigma**2))

d2l.plot(x, p, 'x', 'p.d.f.')

../_images/output_distributions_1214ac_206_0.svg

mu, sigma = 0, 1

x = tf.range(-3, 3, 0.01)
p = 1 / tf.sqrt(2 * tf.pi * sigma**2) * tf.exp(
    -(x - mu)**2 / (2 * sigma**2))

d2l.plot(x, p, 'x', 'p.d.f.')

../_images/output_distributions_1214ac_209_0.svg

では、累積分布関数を描いてみよう。ここでは詳しく述べないが、ガウス分布の c.d.f. は、より初等的な関数を用いた閉じた形の式を持たない。そこで、erf を使ってこの積分を数値的に計算する。

pytorch mxnet tensorflow

def phi(x):
    return (1.0 + erf((x - mu) / (sigma * torch.sqrt(d2l.tensor(2.))))) / 2.0

d2l.plot(x, torch.tensor([phi(y) for y in x.tolist()]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_215_0.svg

def phi(x):
    return (1.0 + erf((x - mu) / (sigma * np.sqrt(2)))) / 2.0

d2l.plot(x, np.array([phi(y) for y in x.tolist()]), 'x', 'c.d.f.')

../_images/output_distributions_1214ac_218_0.svg

def phi(x):
    return (1.0 + erf((x - mu) / (sigma * tf.sqrt(tf.constant(2.))))) / 2.0

d2l.plot(x, [phi(y) for y in x.numpy().tolist()], 'x', 'c.d.f.')

../_images/output_distributions_1214ac_221_0.svg

注意深い読者なら、これらの項のいくつかに見覚えがあるだろう。実際、この積分は 22.5 章で扱いた。まさにその計算によって、この \(p_X(x)\) の全体の面積が1であり、したがって有効な密度であることがわかる。

コイン投げを用いたのは計算を簡単にするためでしたが、その選択自体に本質的な意味はない。実際、独立同分布な確率変数 \(X_i\) の任意の集まりを取り、

(22.8.18)¶\[X^{(N)} = \sum_{i=1}^N X_i.\]

としたとき、

(22.8.19)¶\[\frac{X^{(N)} - \mu_{X^{(N)}}}{\sigma_{X^{(N)}}}\]

はおおよそガウス分布に従う。うまく成り立つためには追加の条件が必要で、最も一般的には \(E[X^4] < \infty\) であるが、考え方は明確である。

中心極限定理は、ガウス分布が確率、統計、機械学習において基本的である理由である。何か測定したものが多くの小さな独立な寄与の和だと言えるなら、その測定対象はガウス分布に近いと仮定できる。

ガウス分布にはさらに多くの興味深い性質があるが、ここではもう1つだけ述べよう。ガウス分布は 最大エントロピー分布 として知られている。エントロピーについては 22.11 章でより深く扱うが、ここで知っておくべきことは、それがランダムさの尺度であるということだけである。厳密な数学的意味では、ガウス分布は平均と分散が固定された確率変数の中で最もランダムな選択だと考えられる。したがって、確率変数の平均と分散がわかっているなら、ガウス分布はある意味で最も保守的な分布の選択である。

節を閉じる前に、\(X \sim \mathcal{N}(\mu, \sigma^2)\) ならば、

\(\mu_X = \mu\)、
\(\sigma_X^2 = \sigma^2\)。

であることを思い出そう。

ガウス分布（あるいは標準正規分布）からのサンプルは以下のように得られる。

pytorch mxnet tensorflow

torch.normal(mu, sigma, size=(10, 10))

tensor([[ 5.6402e-01, -2.9638e+00, -4.0225e-01, -1.0502e+00, -8.5967e-01,
         -1.8663e+00,  1.2591e+00, -4.8977e-01, -1.8643e-01, -8.4658e-01],
        [ 9.8634e-01,  2.5091e-01,  5.5886e-01, -4.9260e-01, -1.7324e-01,
         -4.1209e-01, -5.2245e-01,  5.0280e-01, -6.3664e-01, -9.0722e-01],
        [-1.1246e+00, -1.2719e+00, -1.0566e+00,  1.5562e+00, -3.7938e-01,
          8.7909e-02,  1.5184e+00, -6.9584e-01, -1.4542e+00, -2.9577e+00],
        [-1.3921e-02,  1.1291e+00,  2.1045e-01,  1.5291e+00,  2.8422e-02,
          1.3347e+00,  2.0202e+00,  2.0668e-01,  8.9387e-01,  2.1965e-03],
        [ 1.3032e+00,  8.9880e-01, -3.0384e-01,  7.6841e-01,  4.2073e-02,
         -3.1455e-01, -1.5442e+00, -1.0123e-01, -2.5143e-01, -1.8960e+00],
        [-2.0620e+00,  7.7326e-01, -9.0571e-02,  6.0874e-01,  3.1394e-01,
         -3.9894e-01, -3.0525e-01, -9.5778e-01,  1.6019e-01, -5.7222e-02],
        [-5.1813e-01, -1.4124e-01,  4.8078e-01,  1.9416e+00,  8.5896e-01,
          5.8826e-01,  6.5373e-01,  5.8058e-01, -1.4395e+00,  4.8357e-01],
        [-1.5158e+00, -2.5533e-01, -6.5353e-01,  1.0617e+00,  4.6570e-01,
          1.8963e-01,  1.6243e+00,  5.5373e-01,  1.8233e-01,  1.3987e+00],
        [ 1.7548e+00, -5.5406e-02,  7.8020e-01, -1.9451e+00, -1.1238e+00,
          1.3765e+00,  4.4821e-01, -8.6358e-01, -7.7124e-01, -1.8096e-01],
        [ 8.7309e-01,  6.8673e-01,  6.5082e-01,  1.7901e+00, -6.1880e-01,
         -3.6823e-01,  3.9230e-01, -7.9143e-01,  8.1201e-01, -2.5098e-01]])

np.random.normal(mu, sigma, size=(10, 10))

array([[-0.70057691, -1.08498021,  0.61045712,  0.38959336, -0.39842221,
         0.44038231,  0.71311424,  0.87899763, -1.04249347, -1.99314222],
       [-1.19042449,  1.42323131, -2.03115771, -0.55207844, -0.34944086,
        -0.88541313,  0.68772012, -1.08336504,  0.0043346 , -0.84555549],
       [-1.01220272, -0.18602191, -0.96093954, -0.06832284,  0.67980878,
        -0.84498207,  0.12785912,  0.42840493, -1.02172581,  0.94296419],
       [-1.47794148, -1.0995401 ,  0.32091134, -1.38076821,  0.53228056,
         1.37388727,  0.85008316, -0.8838563 ,  0.25436875,  0.8664244 ],
       [-0.86416781,  0.68150441,  0.91070442, -0.31040679, -1.24914887,
        -1.09228335,  0.23755172,  0.46968028, -1.38901965, -1.3061524 ],
       [ 0.92143486,  2.25051919,  0.93860495, -0.04962321, -0.94616495,
        -0.30157155,  0.25928661,  0.52036172, -0.33842228, -1.16051394],
       [ 0.39067333, -0.29209444,  2.04488421,  1.26964076,  0.02560868,
         0.09079073, -0.89648959, -0.805082  ,  0.0373519 ,  1.77227377],
       [-1.30955062,  1.55305581, -0.87499763, -0.37879967,  1.41977948,
        -1.56414618, -0.19597413,  0.17618799,  1.39520775,  0.8540339 ],
       [ 2.00466515,  0.78036987,  1.50414668,  0.8129593 , -0.33577407,
        -0.74829381, -0.51340285, -0.16907909,  0.26541118, -0.52972822],
       [-0.12352982,  1.31952519, -0.64136017, -0.11811265,  1.37662606,
        -0.09667546,  0.41606841, -0.06347404,  0.78929749, -0.40645727]])

tf.random.normal((10, 10), mu, sigma)

<tf.Tensor: shape=(10, 10), dtype=float32, numpy=
array([[ 1.7760942 , -0.4816362 , -0.35947865, -1.9446675 , -1.0896894 ,
         1.5467461 , -1.2217262 ,  1.1850185 ,  0.8834578 ,  0.06816649],
       [-0.6936654 ,  1.2579052 ,  0.588886  , -0.82313985, -0.4076043 ,
         0.8245478 , -0.4964524 ,  0.39823017,  0.6062789 ,  0.5577762 ],
       [ 0.25759804, -1.9598231 , -0.26861465, -0.34204927,  0.91185987,
        -1.2176397 ,  0.72053534,  0.67268974, -1.6410838 , -0.4068907 ],
       [-0.46769965,  0.75001246,  0.14487791, -0.98179877, -1.1259598 ,
         1.1308956 , -0.5497258 ,  0.3834835 ,  0.9722934 ,  0.6679007 ],
       [-0.92601836, -1.2266558 ,  0.274593  ,  0.43597487,  0.52778405,
        -0.32230726,  0.25692734, -1.5467511 ,  0.26726183, -1.2071371 ],
       [ 1.1377119 ,  1.420974  ,  0.3854779 , -0.30418903, -2.0691662 ,
         1.4372991 ,  0.41003123, -0.18349372,  0.4678633 ,  1.1770911 ],
       [ 0.7949645 , -0.24854268, -0.2203115 , -0.10029124, -0.17249188,
         0.59759915,  2.479362  ,  0.3725324 ,  0.0325313 ,  0.25079554],
       [ 0.44387078,  0.60749555, -0.29627955, -0.93514496, -0.7731776 ,
        -1.6067672 ,  2.6808097 , -2.1464765 ,  0.03318942,  0.9157554 ],
       [-0.59047383,  0.6210598 , -0.28843474, -0.20017219,  1.0729624 ,
         1.8634268 , -0.36915967, -0.53046304, -1.0391091 , -0.6351748 ],
       [-0.73569167, -0.7207913 , -1.1421863 ,  0.47300977, -0.03475155,
        -1.6653922 ,  1.6053184 , -1.5436187 , -0.03438377,  0.4566425 ]],
      dtype=float32)>

22.8.7. 指数型分布族¶

上で挙げた分布に共通する性質の1つは、すべてが 指数型分布族 に属することである。指数型分布族は、密度を次の形で表せる分布の集合である。

(22.8.20)¶\[p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \cdot \exp \left( \boldsymbol{\eta}^{\top} \cdot T(\mathbf{x}) - A(\boldsymbol{\eta}) \right)\]

この定義は少し繊細なので、詳しく見ていこう。

まず、\(h(\mathbf{x})\) は 基底測度 あるいは ベース測度 と呼ばれる。、指数重みで修正する元の測度とみなせる。

次に、\(\boldsymbol{\eta} = (\eta_1, \eta_2, ..., \eta_l) \in \mathbb{R}^l\) というベクトルがあり、これを 自然パラメータ または 標準パラメータ と呼ぶ。基底測度をどのように修正するかを定める。自然パラメータは、\(\mathbf{x}= (x_1, x_2, ..., x_n) \in \mathbb{R}^n\) のある関数 \(T(\cdot)\) に対してこれらのパラメータとの内積を取り、それを指数化することで新しい測度に入る。ベクトル \(T(\mathbf{x})= (T_1(\mathbf{x}), T_2(\mathbf{x}), ..., T_l(\mathbf{x}))\) は、\(\boldsymbol{\eta}\) に対する 十分統計量 と呼ばれる。この名前は、\(T(\mathbf{x})\) に表される情報だけで確率密度を計算するのに十分であり、サンプル \(\mathbf{x}\) の他の情報は不要だからである。

第三に、\(A(\boldsymbol{\eta})\) がある。 累積関数 と呼ばれ、上の分布 (22.8.20) が1に積分されること、すなわち

(22.8.21)¶\[A(\boldsymbol{\eta}) = \log \left[\int h(\mathbf{x}) \cdot \exp \left(\boldsymbol{\eta}^{\top} \cdot T(\mathbf{x}) \right) d\mathbf{x} \right].\]

を保証する。

具体的に、ガウス分布を考えてみよう。\(\mathbf{x}\) が1変数であるとすると、その密度は

(22.8.22)¶\[\begin{split}\begin{aligned} p(x \mid \mu, \sigma) &= \frac{1}{\sqrt{2 \pi \sigma^2}} \cdot \exp \left\{ \frac{-(x-\mu)^2}{2 \sigma^2} \right\} \\ &= \frac{1}{\sqrt{2 \pi}} \cdot \exp \left\{ \frac{\mu}{\sigma^2}x -\frac{1}{2 \sigma^2} x^2 - \left( \frac{1}{2 \sigma^2} \mu^2 +\log(\sigma) \right) \right\}. \end{aligned}\end{split}\]

指数型分布族の定義に次のように対応する。

基底測度: \(h(x) = \frac{1}{\sqrt{2 \pi}}\)、
自然パラメータ: \(\boldsymbol{\eta} = \begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} = \begin{bmatrix} \frac{\mu}{\sigma^2} \\ \frac{1}{2 \sigma^2} \end{bmatrix}\)、
十分統計量: \(T(x) = \begin{bmatrix}x\\-x^2\end{bmatrix}\)、および
累積関数: \(A({\boldsymbol\eta}) = \frac{1}{2 \sigma^2} \mu^2 + \log(\sigma) = \frac{\eta_1^2}{4 \eta_2} - \frac{1}{2}\log(2 \eta_2)\)。

上の各項の正確な選び方は、ある程度任意であることに注意する価値がある。重要なのは、分布がこの形で表せることであって、その厳密な形そのものではない。

4.1.2.2 章でほのめかしたように、広く使われる手法の1つは、最終出力 \(\mathbf{y}\) が指数型分布族に従うと仮定することである。指数型分布族は、機械学習で頻繁に現れる、一般的で強力な分布族である。

22.8.8. まとめ¶

ベルヌーイ確率変数は、はい/いいえの結果を持つ事象をモデル化できる。
離散一様分布は、有限個の候補からの選択をモデル化する。
連続一様分布は、区間からの選択をモデル化する。
二項分布は、ベルヌーイ確率変数の系列をモデル化し、成功回数を数える。
ポアソン確率変数は、稀な事象の到着をモデル化する。
ガウス確率変数は、多数の独立な確率変数を足し合わせた結果をモデル化する。
上記の分布はすべて指数型分布族に属する。

22.8.9. 演習¶

独立な二項確率変数 \(X, Y \sim \textrm{Binomial}(16, 1/2)\) の差 \(X-Y\) である確率変数の標準偏差はいくつか。
ポアソン確率変数 \(X \sim \textrm{Poisson}(\lambda)\) を取り、\(\lambda \rightarrow \infty\) のときの \((X - \lambda)/\sqrt{\lambda}\) を考えると、これが近似的にガウス分布になることを示せる。なぜ自然なのか。
\(n\) 個の要素を持つ2つの離散一様確率変数の和の確率質量関数は何か。