.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   # PyTorch version
   from d2l import torch as d2l
   import torch
   from torch import nn
   import os

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   # MXNet version
   from d2l import mxnet as d2l
   from mxnet import np, npx
   import os

   npx.set_np()

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   #@save
   d2l.DATA_HUB['glove.6b.50d'] = (d2l.DATA_URL + 'glove.6B.50d.zip',
                                   '0b8703943ccdb6eb788e6f091b8946e82231bc4d')

   #@save
   d2l.DATA_HUB['glove.6b.100d'] = (d2l.DATA_URL + 'glove.6B.100d.zip',
                                    'cd43bfb07e44e6f27cbcc7bc9ae3d80284fdaf5a')

   #@save
   d2l.DATA_HUB['glove.42b.300d'] = (d2l.DATA_URL + 'glove.42B.300d.zip',
                                     'b5116e234e9eb9076672cfeabf5469f3eec904fa')

   #@save
   d2l.DATA_HUB['wiki.en'] = (d2l.DATA_URL + 'wiki.en.zip',
                              'c1816da3821ae9f43899be655002f6c723e91b88')

To load these pretrained GloVe and fastText embeddings, we define the
following ``TokenEmbedding`` class.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   #@save
   class TokenEmbedding:
       """Token Embedding."""
       def __init__(self, embedding_name):
           self.idx_to_token, self.idx_to_vec = self._load_embedding(
               embedding_name)
           self.unknown_idx = 0
           self.token_to_idx = {token: idx for idx, token in
                                enumerate(self.idx_to_token)}

       def _load_embedding(self, embedding_name):
           # Index 0 is reserved for the unknown token
           idx_to_token, idx_to_vec = [''], []
           data_dir = d2l.download_extract(embedding_name)
           # GloVe website: https://nlp.stanford.edu/projects/glove/
           # fastText website: https://fasttext.cc/
           with open(os.path.join(data_dir, 'vec.txt'), 'r') as f:
               for line in f:
                   elems = line.rstrip().split(' ')
                   token, elems = elems[0], [float(elem) for elem in elems[1:]]
                   # Skip header information, such as the top row in fastText
                   if len(elems) > 1:
                       idx_to_token.append(token)
                       idx_to_vec.append(elems)
           # Prepend an all-zero vector for the unknown token
           idx_to_vec = [[0] * len(idx_to_vec[0])] + idx_to_vec
           return idx_to_token, d2l.tensor(idx_to_vec)

       def __getitem__(self, tokens):
           # Tokens missing from the vocabulary map to the unknown index
           indices = [self.token_to_idx.get(token, self.unknown_idx)
                      for token in tokens]
           vecs = self.idx_to_vec[d2l.tensor(indices)]
           return vecs

       def __len__(self):
           return len(self.idx_to_token)
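The parsing logic in ``_load_embedding`` can be tried without downloading
anything. The following is a minimal sketch under stated assumptions: the
function name ``parse_embedding_lines``, the ``'<unk>'`` label, and the
three-line toy file are hypothetical and are not part of ``d2l``. It shows
how a fastText-style header row is skipped and how an all-zero vector ends
up at index 0.

.. code:: python

   import io

   def parse_embedding_lines(lines):
       """Parse 'token v1 v2 ...' lines (hypothetical helper for illustration)."""
       # Index 0 is reserved for the unknown token; '<unk>' is an assumed label
       idx_to_token, idx_to_vec = ['<unk>'], []
       for line in lines:
           elems = line.rstrip().split(' ')
           token, values = elems[0], [float(v) for v in elems[1:]]
           # A header such as "2 3" leaves only one value after the first
           # field, so it fails this check and is skipped
           if len(values) > 1:
               idx_to_token.append(token)
               idx_to_vec.append(values)
       # Prepend an all-zero vector for the unknown token
       idx_to_vec = [[0.0] * len(idx_to_vec[0])] + idx_to_vec
       return idx_to_token, idx_to_vec

   # Toy file: one header row followed by two 3-dimensional vectors
   toy_file = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
   idx_to_token, idx_to_vec = parse_embedding_lines(toy_file)
   print(idx_to_token)
   print(idx_to_vec)

Note that the ``len(values) > 1`` check would also drop genuine
one-dimensional embeddings, which is harmless for the files used here.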


Below we load 50-dimensional GloVe embeddings (pretrained on a Wikipedia
subset). When the ``TokenEmbedding`` instance is created, the specified
embedding file is downloaded if it is not already present.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   glove_6b50d = TokenEmbedding('glove.6b.50d')

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
   :class: output

   Downloading ../data/glove.6B.50d.zip from http://d2l-data.s3-accelerate.amazonaws.com/glove.6B.50d.zip...


Output the vocabulary size. The vocabulary contains 400000 words (tokens)
and a special unknown token.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   len(glove_6b50d)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
   :class: output

   400001


We can get the index of a word in the vocabulary, and vice versa.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   glove_6b50d.token_to_idx['beautiful'], glove_6b50d.idx_to_token[3367]

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
   :class: output

   (3367, 'beautiful')
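Direct indexing into ``token_to_idx`` raises ``KeyError`` for a word that is
not in the vocabulary, whereas ``TokenEmbedding.__getitem__`` falls back to
the unknown index via ``dict.get``. The sketch below illustrates that
fallback on a tiny vocabulary. The class ``ToyVocab``, its method
``to_indices``, and the ``'<unk>'`` label are hypothetical names introduced
only for this illustration.

.. code:: python

   class ToyVocab:
       """Hypothetical stand-in for the token/index tables of TokenEmbedding."""
       def __init__(self, tokens):
           # Index 0 is reserved for the unknown token
           self.idx_to_token = ['<unk>'] + list(tokens)
           self.token_to_idx = {token: idx
                                for idx, token in enumerate(self.idx_to_token)}
           self.unknown_idx = 0

       def to_indices(self, tokens):
           # dict.get returns the unknown index for out-of-vocabulary tokens
           return [self.token_to_idx.get(token, self.unknown_idx)
                   for token in tokens]

   vocab = ToyVocab(['the', 'beautiful', 'chip'])
   indices = vocab.to_indices(['chip', 'not-a-word'])
   print(indices)  # [3, 0]
   print([vocab.idx_to_token[i] for i in indices])  # ['chip', '<unk>']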

The ``knn`` function below returns the :math:`k` nearest neighbors of a
query vector, ranked by cosine similarity.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   # PyTorch version
   def knn(W, x, k):
       # Add 1e-9 for numerical stability
       cos = torch.mv(W, x.reshape(-1,)) / (
           torch.sqrt(torch.sum(W * W, axis=1) + 1e-9) *
           torch.sqrt((x * x).sum()))
       _, topk = torch.topk(cos, k=k)
       return topk, [cos[int(i)] for i in topk]

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   # MXNet version
   def knn(W, x, k):
       # Add 1e-9 for numerical stability
       cos = np.dot(W, x.reshape(-1,)) / (
           np.sqrt(np.sum(W * W, axis=1) + 1e-9) * np.sqrt((x * x).sum()))
       topk = npx.topk(cos, k=k, ret_typ='indices')
       return topk, [cos[int(i)] for i in topk]

Next, we search for similar words using the pretrained word vectors from
the ``TokenEmbedding`` instance ``embed``.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   def get_similar_tokens(query_token, k, embed):
       # Request k + 1 neighbors because the query itself ranks first
       topk, cos = knn(embed.idx_to_vec, embed[[query_token]], k + 1)
       for i, c in zip(topk[1:], cos[1:]):  # Exclude the input word
           print(f'cosine sim={float(c):.3f}: {embed.idx_to_token[int(i)]}')
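As a framework-free illustration of what ``knn`` computes, the sketch below
ranks a few hand-made two-dimensional vectors by cosine similarity. The
function name ``cosine_knn`` and the toy vectors are assumptions made for
this example only. Row 0 plays the role of the all-zero unknown-token
vector, which is why the ``1e-9`` term is needed to avoid division by zero.

.. code:: python

   import math

   def cosine_knn(vectors, query, k):
       """Return (index, cosine) pairs for the k rows most similar to query."""
       q_norm = math.sqrt(sum(q * q for q in query))
       scores = []
       for idx, vec in enumerate(vectors):
           dot = sum(v * q for v, q in zip(vec, query))
           # The 1e-9 term keeps the norm of an all-zero row strictly positive
           v_norm = math.sqrt(sum(v * v for v in vec) + 1e-9)
           scores.append((idx, dot / (v_norm * q_norm)))
       # Sort by cosine similarity, largest first
       scores.sort(key=lambda pair: pair[1], reverse=True)
       return scores[:k]

   toy = [[0.0, 0.0],   # all-zero row, like the unknown token
          [1.0, 0.0],
          [0.9, 0.1],
          [0.0, 1.0]]
   top = cosine_knn(toy, toy[1], 2)
   for idx, cos in top:
       print(f'row {idx}: cosine sim={cos:.3f}')

The query row itself comes back first with a similarity of about 1, which
is the reason ``get_similar_tokens`` asks for ``k + 1`` neighbors and drops
the first one.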


The vocabulary of the pretrained word vectors in ``glove_6b50d`` contains
400000 words and a special unknown token. Excluding the input word and the
unknown token, let's find the three words in this vocabulary that are
semantically most similar to the word “chip”.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   get_similar_tokens('chip', 3, glove_6b50d)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
   :class: output

   cosine sim=0.856: chips
   cosine sim=0.749: intel
   cosine sim=0.749: electronics


Below we output words similar to “baby” and “beautiful”.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   get_similar_tokens('baby', 3, glove_6b50d)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
   :class: output

   cosine sim=0.839: babies
   cosine sim=0.800: boy
   cosine sim=0.792: girl

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   get_similar_tokens('beautiful', 3, glove_6b50d)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
   :class: output

   cosine sim=0.921: lovely
   cosine sim=0.893: gorgeous
   cosine sim=0.830: wonderful

Word Analogy
~~~~~~~~~~~~

Besides finding similar words, we can also apply word vectors to word
analogy tasks. For example, “man”:“woman”::“son”:“daughter” is the form of a
word analogy: “man” is to “woman” as “son” is to “daughter”. Specifically,
the word analogy completion task can be defined as follows: for a word
analogy :math:`a : b :: c : d`, given the first three words :math:`a`,
:math:`b`, and :math:`c`, find :math:`d`. Denote the vector of word
:math:`w` by :math:`\textrm{vec}(w)`. To complete the analogy, we find the
word whose vector is most similar to the result of
:math:`\textrm{vec}(c)+\textrm{vec}(b)-\textrm{vec}(a)`.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   def get_analogy(token_a, token_b, token_c, embed):
       vecs = embed[[token_a, token_b, token_c]]
       # vec(b) - vec(a) + vec(c)
       x = vecs[1] - vecs[0] + vecs[2]
       topk, cos = knn(embed.idx_to_vec, x, 1)
       # Return the single nearest token
       return embed.idx_to_token[int(topk[0])]
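The vector arithmetic behind analogy completion can be checked by hand on a
toy example. The sketch below is illustrative only: the three-dimensional
vectors are made up so that one axis loosely encodes gender and another
generation, and the names ``cosine``, ``toy_analogy``, and ``toy_vectors``
are hypothetical. Unlike ``get_analogy``, this sketch also excludes the
three input words from the candidates, which is a common variation rather
than something the function above does.

.. code:: python

   import math

   def cosine(u, v):
       """Cosine similarity between two non-zero vectors."""
       dot = sum(a * b for a, b in zip(u, v))
       norm = (math.sqrt(sum(a * a for a in u)) *
               math.sqrt(sum(b * b for b in v)))
       return dot / norm

   def toy_analogy(a, b, c, vectors):
       """Complete a : b :: c : ? using vec(c) + vec(b) - vec(a)."""
       target = [vb - va + vc for va, vb, vc in
                 zip(vectors[a], vectors[b], vectors[c])]
       # Assumption: the input words are not allowed as answers
       candidates = [w for w in vectors if w not in (a, b, c)]
       return max(candidates, key=lambda w: cosine(vectors[w], target))

   # Hand-made vectors: [female, young, shared component]
   toy_vectors = {
       'man': [0.0, 0.0, 1.0],
       'woman': [1.0, 0.0, 1.0],
       'son': [0.0, 1.0, 1.0],
       'daughter': [1.0, 1.0, 1.0],
       'apple': [0.2, 0.1, -1.0],
   }
   print(toy_analogy('man', 'woman', 'son', toy_vectors))

Here the target vector is ``[1.0, 1.0, 1.0]``, which coincides with the toy
vector for “daughter”. Real embeddings only satisfy such relations
approximately.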


Let's verify the “male-female” analogy using the loaded word vectors.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   get_analogy('man', 'woman', 'son', glove_6b50d)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
   :class: output

   'daughter'


The same function completes the analogy “beijing”:“china”::“tokyo”:“japan”.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   get_analogy('beijing', 'china', 'tokyo', glove_6b50d)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
   :class: output

   'japan'


For the “adjective-superlative adjective” analogy such as
“bad”:“worst”::“big”:“biggest”, we can see that the pretrained word vectors
may capture syntactic information.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   get_analogy('bad', 'worst', 'big', glove_6b50d)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
   :class: output

   'biggest'

To show that the pretrained word vectors capture the notion of past tense,
we can test the syntax with the “present tense-past tense” analogy
“do”:“did”::“go”:“went”.

.. raw:: latex

   \diilbookstyleinputcell

.. code:: python

   get_analogy('do', 'did', 'go', glove_6b50d)

.. raw:: latex

   \diilbookstyleoutputcell

.. parsed-literal::
   :class: output

   'went'

Summary
-------

- In practice, word vectors pretrained on large corpora can be applied to
  downstream natural language processing tasks.
- Pretrained word vectors can be applied to word similarity and analogy
  tasks.

Exercises
---------

1. Test the fastText results using ``TokenEmbedding('wiki.en')``.
2. When the vocabulary is extremely large, how can we find similar words or
   complete a word analogy faster?