14.3. 物体検出とバウンディングボックス¶

前の節（たとえば 8.1 章– 8.4 章）では、画像分類のためのさまざまなモデルを紹介した。画像分類タスクでは、画像中に主要な物体がただ 1つだけ存在すると仮定し、そのカテゴリを認識することだけに注目する。しかし、関心のある画像にはしばしば複数の物体が含まれる。その場合、それらのカテゴリだけでなく、画像内での具体的な位置も知りたいのである。コンピュータビジョンでは、このようなタスクを 物体検出（object detection、または 物体認識）と呼ぶ。

物体検出は、多くの分野で広く応用されている。たとえば、自動運転では、撮影した動画画像から車両、歩行者、道路、障害物の位置を検出することで、走行経路を計画する必要がある。また、ロボットはこの技術を用いて、環境内を移動する間に関心のある物体を検出し、その位置を特定することがある。さらに、セキュリティシステムでは、侵入者や爆弾のような異常な物体を検出する必要があるかもしれない。

次のいくつかの節では、物体検出のためのいくつかの深層学習手法を紹介する。まずは、物体の位置（または場所）の紹介から始める。

pytorch mxnet jax tensorflow

%matplotlib inline
from d2l import torch as d2l
import torch

%matplotlib inline
from d2l import mxnet as d2l
from mxnet import image, npx, np

npx.set_np()

#@save
def box_corner_to_center(boxes):
    """Convert from (upper-left, lower-right) to (center, width, height)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    boxes = d2l.stack((cx, cy, w, h), axis=-1)
    return boxes

#@save
def box_center_to_corner(boxes):
    """Convert from (center, width, height) to (upper-left, lower-right)."""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = cx - 0.5 * w
    y1 = cy - 0.5 * h
    x2 = cx + 0.5 * w
    y2 = cy + 0.5 * h
    boxes = d2l.stack((x1, y1, x2, y2), axis=-1)
    return boxes

%matplotlib inline
from d2l import tensorflow as d2l
import tensorflow as tf

この節で使用するサンプル画像を読み込みる。画像の左側に犬、右側に猫がいることがわかる。これらがこの画像における2つの主要な物体である。

pytorch mxnet jax tensorflow

d2l.set_figsize()
img = d2l.plt.imread('../img/catdog.jpg')
d2l.plt.imshow(img);

../_images/output_bounding-box_a9e907_18_0.svg

d2l.set_figsize()
img = image.imread('../img/catdog.jpg').asnumpy()
d2l.plt.imshow(img);

[07:02:09] ../src/storage/storage.cc:196: Using Pooled (Naive) StorageManager for CPU

../_images/output_bounding-box_a9e907_21_1.svg

# ここで `bbox` はバウンディングボックスの略である
dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]

d2l.set_figsize()
img = d2l.plt.imread('../img/catdog.jpg')
d2l.plt.imshow(img);

../_images/output_bounding-box_a9e907_27_0.svg

14.3.1. バウンディングボックス¶

物体検出では、通常、物体の空間的位置を表すために バウンディングボックス を用いる。バウンディングボックスは長方形であり、長方形の左上隅の \(x\) 座標と \(y\) 座標、および右下隅の同様の座標によって決まる。もう1つの一般的なバウンディングボックス表現は、バウンディングボックス中心の \((x, y)\) 座標と、ボックスの幅および高さである。

ここでは、これら2つの表現の間を変換する関数を定義する。 box_corner_to_center は2隅表現から中心・幅・高さ表現へ変換し、 box_center_to_corner はその逆を行う。入力引数 boxes は形状 (\(n\), 4) の2次元テンソルである必要があり、ここで \(n\) はバウンディングボックスの数である。

pytorch mxnet jax tensorflow

#@save
def box_corner_to_center(boxes):
    """Convert from (upper-left, lower-right) to (center, width, height)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    boxes = d2l.stack((cx, cy, w, h), axis=-1)
    return boxes

#@save
def box_center_to_corner(boxes):
    """Convert from (center, width, height) to (upper-left, lower-right)."""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = cx - 0.5 * w
    y1 = cy - 0.5 * h
    x2 = cx + 0.5 * w
    y2 = cy + 0.5 * h
    boxes = d2l.stack((x1, y1, x2, y2), axis=-1)
    return boxes

#@save
def box_corner_to_center(boxes):
    """Convert from (upper-left, lower-right) to (center, width, height)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    boxes = d2l.stack((cx, cy, w, h), axis=-1)
    return boxes

#@save
def box_center_to_corner(boxes):
    """Convert from (center, width, height) to (upper-left, lower-right)."""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = cx - 0.5 * w
    y1 = cy - 0.5 * h
    x2 = cx + 0.5 * w
    y2 = cy + 0.5 * h
    boxes = d2l.stack((x1, y1, x2, y2), axis=-1)
    return boxes

boxes = d2l.tensor((dog_bbox, cat_bbox))
box_center_to_corner(box_corner_to_center(boxes)) == boxes

#@save
def box_corner_to_center(boxes):
    """Convert from (upper-left, lower-right) to (center, width, height)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    boxes = d2l.stack((cx, cy, w, h), axis=-1)
    return boxes

#@save
def box_center_to_corner(boxes):
    """Convert from (center, width, height) to (upper-left, lower-right)."""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = cx - 0.5 * w
    y1 = cy - 0.5 * h
    x2 = cx + 0.5 * w
    y2 = cy + 0.5 * h
    boxes = d2l.stack((x1, y1, x2, y2), axis=-1)
    return boxes

座標情報に基づいて、画像中の犬と猫のバウンディングボックスを定義する。画像における座標の原点は画像の左上隅であり、右方向と下方向がそれぞれ \(x\) 軸と \(y\) 軸の正の方向である。

pytorch mxnet jax tensorflow

# ここで `bbox` はバウンディングボックスの略である
dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]

# ここで `bbox` はバウンディングボックスの略である
dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]

#@save
def bbox_to_rect(bbox, color):
    """Convert bounding box to matplotlib format."""
    # バウンディングボックス（左上の x、左上の y、右下の x、
    # 左上x、右下y）の形式をmatplotlibの形式に変換する: ((upper-left x,
    # 左上の y), 幅, 高さ)
    return d2l.plt.Rectangle(
        xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
        fill=False, edgecolor=color, linewidth=2)

# ここで `bbox` はバウンディングボックスの略である
dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]

2回変換することで、2つのバウンディングボックス変換関数の正しさを確認できる。

pytorch mxnet jax tensorflow

boxes = d2l.tensor((dog_bbox, cat_bbox))
box_center_to_corner(box_corner_to_center(boxes)) == boxes

tensor([[True, True, True, True],
        [True, True, True, True]])

boxes = d2l.tensor((dog_bbox, cat_bbox))
box_center_to_corner(box_corner_to_center(boxes)) == boxes

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True]])

fig = d2l.plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));

boxes = d2l.tensor((dog_bbox, cat_bbox))
box_center_to_corner(box_corner_to_center(boxes)) == boxes

<tf.Tensor: shape=(2, 4), dtype=bool, numpy=
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True]])>

正確に描けているか確認するために、画像にバウンディングボックスを描画してみましょう。描画する前に、補助関数 bbox_to_rect を定義する。バウンディングボックスを matplotlib パッケージのバウンディングボックス形式で表す。

#@save
def bbox_to_rect(bbox, color):
    """Convert bounding box to matplotlib format."""
    # バウンディングボックス（左上の x、左上の y、右下の x、
    # 左上x、右下y）の形式をmatplotlibの形式に変換する: ((upper-left x,
    # 左上の y), 幅, 高さ)
    return d2l.plt.Rectangle(
        xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
        fill=False, edgecolor=color, linewidth=2)

画像にバウンディングボックスを追加すると、 2つの物体の主な輪郭が、基本的に2つのボックスの内側に収まっていることがわかる。

fig = d2l.plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));

../_images/output_bounding-box_a9e907_78_0.svg

14.3.2. まとめ¶

物体検出では、画像中の関心のあるすべての物体を認識するだけでなく、それらの位置も認識する。位置は一般に長方形のバウンディングボックスで表される。
よく使われる2種類のバウンディングボックス表現の間で変換できる。

14.3.3. 演習¶

別の画像を見つけて、物体を含むバウンディングボックスをラベル付けしてみなさい。バウンディングボックスのラベル付けとカテゴリのラベル付けを比較すると、通常どちらに時間がかかりますか？
box_corner_to_center と box_center_to_corner の入力引数 boxes の最内次元が常に 4 であるのはなぜであるか？