A Simple Word2Vec Example

This example shows how to build and train a Word2Vec model with DSW. Word2Vec is a fairly fundamental data-processing technique in NLP; for the details of how it works, see the paper "Efficient Estimation of Word Representations in Vector Space".

We prepare a text corpus and use it to learn a vector representation for every word in the text, so that distance in the embedding space reflects semantic distance: the closer two words are in the space, the more similar their meanings.
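
To make "distance reflects semantic similarity" concrete, here is a minimal illustration of comparing word vectors with cosine similarity. The three 3-dimensional vectors are made up purely for illustration; the real 128-dimensional vectors are learned later in this notebook.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: values close to 1 mean similar direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors for illustration only.
king  = np.array([0.9, 0.1, 0.3])
queen = np.array([0.85, 0.15, 0.35])
apple = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(king, queen))  # close to 1: semantically similar
print(cosine_similarity(king, apple))  # much smaller: less related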

First we read the text into the words list. As the output shows, the text contains 17,005,207 words in total.

In [1]:
import tensorflow as tf
if tf.__version__ >= '2.0.0':
    raise RuntimeError('This demo is only applicable to TensorFlow 1.x.')

import sys
if (sys.version_info < (3, 0)):
    raise RuntimeError('This demo is only compatible with Python 3.x')
TF_KEY: py3.6+1.15.0+10.0.130+7.6.5.32
In [2]:
import tensorflow as tf
import zipfile
import urllib.request


urllib.request.urlretrieve("https://notebook-dataset.oss-cn-beijing.aliyuncs.com/word2vec_text.zip", "./text.zip")   

with zipfile.ZipFile("text.zip") as f:
    words = tf.compat.as_str(f.read(f.namelist()[0])).split()
    
print('words size', len(words))
words size 17005207
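
If you want a quick sanity check of the corpus format, the raw token list can be inspected directly (not required for training):

print(words[:8])   # the first few whitespace-separated tokens of the corpus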

Next we build a vocabulary. In this example we cap the vocabulary size at 50,000: the vocabulary keeps the 49,999 most frequent words in the text, and every other word is mapped to 'UNK'.

In [27]:
import collections
import math

vocabulary_size = 50000
count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
print("最多5个单词以及出现次数", count[1:6])
最多5个单词以及出现次数 [('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764), ('in', 372201)]

To make later training easier, we replace each word with its index in the vocabulary and encode the original text with these indices.

In [28]:
dictionary = dict()
for word, _ in count:
    dictionary[word] = len(dictionary)
data = list()
unk_count = 0
for word in words:
    if word in dictionary:
        index = dictionary[word]
    else:
        index = 0  # dictionary['UNK']
        unk_count += 1
    data.append(index)
count[0][1] = unk_count

print('Encoded text:', data[:10], '...')
Encoded text: [5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156] ...

Build a reverse lookup table so that, after training, word indices can be mapped back to the original words.

In [29]:
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
print([reverse_dictionary[i] for i in data[:10]])
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']

Now we get to the main task: constructing the training samples for word2vec using the skip-gram method described in the paper. We wrap this in a function so it can be reused later during training.
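
Before looking at the implementation, a minimal sketch of what skip-gram sampling produces: for each center word, every word within skip_window positions to the left or right can form a (center, context) training pair. The toy sentence below is only for illustration:

sentence = ['anarchism', 'originated', 'as', 'a', 'term']
skip_window = 1  # one word to the left and one to the right

pairs = []
for pos, center in enumerate(sentence):
    for offset in range(-skip_window, skip_window + 1):
        ctx = pos + offset
        if offset != 0 and 0 <= ctx < len(sentence):
            pairs.append((center, sentence[ctx]))

print(pairs)
# [('anarchism', 'originated'), ('originated', 'anarchism'), ('originated', 'as'), ...]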

In [30]:
import numpy as np
import random
data_index = 0
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1 # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    # Fill the buffer with the first `span` words of the window.
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    # For each center word, sample `num_skips` context words from its window.
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [ skip_window ]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

Let's take a look at what the generated training samples look like.

In [31]:
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(batch[i], reverse_dictionary[batch[i]],
        '->', labels[i, 0], reverse_dictionary[labels[i, 0]])
3081 originated -> 12 as
3081 originated -> 5234 anarchism
12 as -> 3081 originated
12 as -> 6 a
6 a -> 12 as
6 a -> 195 term
195 term -> 2 of
195 term -> 6 a

Now we define the DNN model needed for training.

In [32]:
batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64    # Number of negative examples to sample.
In [33]:
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                     biases=nce_biases,
                     labels=train_labels,
                     inputs=embed,
                     num_sampled=num_sampled,
                     num_classes=vocabulary_size))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
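
tf.nn.nce_loss hides the sampling details. As a rough intuition only (this is the simplified negative-sampling form; the actual NCE loss also includes corrections for the noise distribution), the per-example objective looks something like this numpy sketch:

import numpy as np

def sketch_neg_sampling_loss(center_vec, true_w, true_b, neg_ws, neg_bs):
    # center_vec: embedding of the input word, shape [embedding_size]
    # true_w, true_b: output weight row / bias of the true context word
    # neg_ws, neg_bs: output weight rows / biases of the sampled negative words
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -np.log(sigmoid(np.dot(true_w, center_vec) + true_b))   # pull the true pair together
    for w, b in zip(neg_ws, neg_bs):
        loss -= np.log(sigmoid(-(np.dot(w, center_vec) + b)))       # push negative samples apart
    return loss
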
In [34]:
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

average_loss = 0
for step in range(100001):
    batch_inputs, batch_labels = generate_batch(
        batch_size, num_skips, skip_window)
    feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}

    # We perform one update step by evaluating the optimizer op (including it
    # in the list of returned values for session.run()
    _, loss_val = sess.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += loss_val

    if step % 2000 == 0:
        if step > 0:
            average_loss /= 2000
        print("Average loss at step ", step, ": ", average_loss)
        average_loss = 0

    if step % 10000 == 0:
        sim = similarity.eval()
        for i in range(valid_size):
            valid_word = reverse_dictionary[valid_examples[i]]
            top_k = 8 # number of nearest neighbors
            nearest = (-sim[i, :]).argsort()[1:top_k+1]
            log_str = "Nearest to %s:" % valid_word
            for k in range(top_k):
                close_word = reverse_dictionary[nearest[k]]
                log_str = "%s %s," % (log_str, close_word)
            print(log_str)
final_embeddings = normalized_embeddings.eval()
Average loss at step  0 :  293.93377685546875
Nearest to more: ona, conquistador, loa, vichy, escoffier, miss, humber, gulliver,
Nearest to six: degrees, californian, ascertaining, internal, undecidable, lease, meteorologists, terabytes,
Nearest to be: jupiter, surinam, detritus, involves, spongebob, cluniac, mulholland, postgraduate,
Nearest to known: noticing, senators, scripture, hahn, demolished, castilian, modernity, wiser,
Nearest to two: gions, composing, flames, foolish, tempting, ali, silt, captive,
Nearest to use: passport, morgue, uma, messiaen, scrap, studio, excelled, garcia,
Nearest to for: patricio, klerk, puritanical, modernity, arminius, belgica, lenovo, pentagonal,
Nearest to other: marianas, aarseth, legates, justus, confocal, exe, indented, rockwell,
Nearest to first: ingredients, massing, bigelow, disarm, allein, discouraged, cygnus, greenberg,
Nearest to from: batter, clarified, unpaired, neurological, monetarism, shoe, imprisoned, affair,
Nearest to after: rightist, diz, formulaic, police, apap, nanotubes, samaria, enters,
Nearest to there: oxidise, uttar, ndebele, quotient, newsgroups, larry, omnivores, bites,
Nearest to years: vasco, outlined, bashar, aramaean, crashed, muscle, coinciding, lombardy,
Nearest to between: emphasised, quadrature, alcibiades, muted, symbolist, traps, spokes, cordell,
Nearest to nine: symbolize, sacramento, zenith, yad, hemp, repeaters, syst, tactician,
Nearest to so: hab, winston, tendons, ethnic, signatories, wrecking, transportation, training,
Average loss at step  2000 :  113.83043353939057
Average loss at step  4000 :  52.04903098034859
Average loss at step  6000 :  33.55314987397194
Average loss at step  8000 :  23.63758887195587
Average loss at step  10000 :  18.26611834061146
Nearest to more: miss, refrain, located, icelandic, abbots, austin, vs, basins,
Nearest to six: zero, nine, austin, five, gollancz, altenberg, vaccination, vs,
Nearest to be: gland, jupiter, six, have, parallels, involves, inequality, died,
Nearest to known: sign, modernity, aspiration, analogue, senators, cheap, converter, archive,
Nearest to two: one, victoriae, gland, austin, nine, reginae, alpina, afghani,
Nearest to use: reginae, seed, passport, victoriae, anatole, austin, studio, aluminium,
Nearest to for: and, of, with, in, by, sanjaks, to, as,
Nearest to other: austin, hit, phylum, dreyfus, roman, reginae, households, indented,
Nearest to first: austin, discouraged, mantle, version, genocide, airport, vaccination, barrel,
Nearest to from: in, of, vs, and, to, tanto, neurological, near,
Nearest to after: two, across, one, extremely, police, egg, reginae, ludwig,
Nearest to there: larry, easily, buddhism, buried, means, history, assigned, persecute,
Nearest to years: painter, zero, detector, outlined, principal, colspan, analogue, anarchism,
Nearest to between: emphasised, in, and, vaccination, nine, of, sa, resident,
Nearest to nine: zero, austin, victoriae, reginae, alpina, six, implicit, vs,
Nearest to so: one, ethnic, transportation, winston, territories, competition, austin, chloride,
Average loss at step  12000 :  14.036011875867844
Average loss at step  14000 :  11.746295873045922
Average loss at step  16000 :  9.960373474597931
Average loss at step  18000 :  8.668071280956267
Average loss at step  20000 :  7.740298974275589
Nearest to more: absalom, refrain, miss, located, hebron, abbots, icelandic, intended,
Nearest to six: nine, eight, five, seven, zero, dasyprocta, two, four,
Nearest to be: have, by, parallels, six, was, make, as, veil,
Nearest to known: aspiration, noticing, cheap, demolished, lahore, modernity, dexter, atkinson,
Nearest to two: five, three, one, seven, six, eight, dasyprocta, four,
Nearest to use: seed, dasyprocta, backslash, reginae, passport, anatole, uma, homomorphism,
Nearest to for: in, with, of, and, by, to, as, dasyprocta,
Nearest to other: indented, households, nn, hit, multiple, roman, austin, edgar,
Nearest to first: backslash, austin, for, discouraged, vaccination, bigelow, numa, mantle,
Nearest to from: in, of, to, and, at, vs, for, agouti,
Nearest to after: dasyprocta, across, two, and, in, with, as, heterosexual,
Nearest to there: it, they, which, he, easily, larry, buried, newsgroups,
Nearest to years: two, lombardy, detector, outlined, dasyprocta, gather, imran, painter,
Nearest to between: in, and, dasyprocta, of, vaccination, emphasised, quadrature, seven,
Nearest to nine: eight, seven, six, zero, dasyprocta, five, agouti, four,
Nearest to so: winston, ethnic, endeavour, signatories, transportation, territories, competition, defended,
Average loss at step  22000 :  7.200594078540802
Average loss at step  24000 :  6.964842308521271
Average loss at step  26000 :  6.659653395533562
Average loss at step  28000 :  6.239654780030251
Average loss at step  30000 :  6.1383238241672515
Nearest to more: absalom, refrain, ona, hebron, humber, intended, abbots, miss,
Nearest to six: eight, five, nine, seven, four, three, zero, two,
Nearest to be: have, by, was, parallels, make, as, is, veil,
Nearest to known: reuptake, hamas, aspiration, atkinson, noticing, modernity, cheap, dexter,
Nearest to two: four, three, five, one, six, seven, eight, dasyprocta,
Nearest to use: passport, seed, dasyprocta, microseconds, backslash, abitibi, risk, positivists,
Nearest to for: with, of, in, and, by, to, from, dasyprocta,
Nearest to other: reuptake, ligature, rockwell, nn, austin, multiple, households, hit,
Nearest to first: backslash, austin, numa, potsdam, vaccination, discouraged, on, for,
Nearest to from: in, and, of, vs, at, for, on, amalthea,
Nearest to after: abitibi, two, dasyprocta, in, and, across, with, as,
Nearest to there: it, they, he, which, easily, buried, now, larry,
Nearest to years: lombardy, two, autocad, detector, gather, outlined, crashed, colspan,
Nearest to between: and, in, of, with, dasyprocta, on, two, emphasised,
Nearest to nine: eight, seven, six, five, four, three, zero, dasyprocta,
Nearest to so: winston, endeavour, amalthea, ethnic, signatories, advantages, territories, ending,
Average loss at step  32000 :  5.876646659016609
Average loss at step  34000 :  5.817600245714187
Average loss at step  36000 :  5.715285798311234
Average loss at step  38000 :  5.253283932566643
Average loss at step  40000 :  5.463549421310425
Nearest to more: absalom, not, ops, refrain, intended, roskilde, library, hebron,
Nearest to six: seven, four, eight, five, three, nine, zero, two,
Nearest to be: have, by, was, is, parallels, make, been, were,
Nearest to known: noticing, reuptake, modernity, atkinson, rakyat, hamas, used, exploding,
Nearest to two: three, four, six, five, one, seven, eight, dasyprocta,
Nearest to use: passport, dasyprocta, backslash, microseconds, risk, abitibi, uma, homomorphism,
Nearest to for: with, in, of, to, from, dasyprocta, by, and,
Nearest to other: reuptake, ligature, two, multiple, rockwell, six, austin, nn,
Nearest to first: backslash, austin, potsdam, numa, lerner, vaccination, lipids, hus,
Nearest to from: in, vs, on, agouti, at, through, of, amalthea,
Nearest to after: abitibi, dasyprocta, and, with, as, across, two, in,
Nearest to there: they, it, he, which, easily, now, philanthropy, buried,
Nearest to years: lombardy, two, six, autocad, detector, one, gather, vasco,
Nearest to between: in, with, and, recitative, from, dasyprocta, to, on,
Nearest to nine: eight, seven, six, zero, five, three, four, dasyprocta,
Nearest to so: winston, endeavour, amalthea, signatories, ethnic, advantages, fidonet, territories,
Average loss at step  42000 :  5.311039533734322
Average loss at step  44000 :  5.285898304700852
Average loss at step  46000 :  5.267841118693352
Average loss at step  48000 :  5.037707761883736
Average loss at step  50000 :  5.153288539290428
Nearest to more: absalom, not, less, trolleybus, roskilde, ops, refrain, intended,
Nearest to six: eight, four, seven, five, three, nine, two, one,
Nearest to be: have, by, was, is, were, make, been, are,
Nearest to known: noticing, reuptake, used, atkinson, modernity, rakyat, exploding, hahn,
Nearest to two: three, one, four, six, five, eight, seven, dasyprocta,
Nearest to use: risk, dasyprocta, passport, backslash, microseconds, abitibi, reginae, seed,
Nearest to for: in, with, and, of, from, dasyprocta, to, against,
Nearest to other: reuptake, multiple, ligature, austin, two, rockwell, certain, nn,
Nearest to first: backslash, bigelow, potsdam, numa, austin, lerner, abitibi, word,
Nearest to from: in, through, at, vs, on, agouti, and, into,
Nearest to after: abitibi, prism, dasyprocta, as, in, and, three, when,
Nearest to there: it, they, he, which, easily, now, philanthropy, buried,
Nearest to years: lombardy, two, aramaean, autocad, gather, detector, outlined, colspan,
Nearest to between: with, in, dasyprocta, and, from, recitative, vaccination, everywhere,
Nearest to nine: eight, seven, six, zero, four, three, five, agouti,
Nearest to so: winston, endeavour, signatories, amalthea, ethnic, advantages, fidonet, territories,
Average loss at step  52000 :  5.159958806276322
Average loss at step  54000 :  5.105936750173568
Average loss at step  56000 :  5.042992538094521
Average loss at step  58000 :  5.137826075315475
Average loss at step  60000 :  4.93975215446949
Nearest to more: less, absalom, microsite, not, roskilde, quantifiers, trolleybus, microscopy,
Nearest to six: eight, five, four, seven, nine, three, zero, dasyprocta,
Nearest to be: have, by, been, was, were, is, make, refer,
Nearest to known: used, noticing, modernity, reuptake, atkinson, rakyat, exploding, hahn,
Nearest to two: three, four, one, five, six, seven, eight, dasyprocta,
Nearest to use: risk, passport, dasyprocta, microseconds, backslash, callithrix, abitibi, substituting,
Nearest to for: of, in, or, and, with, to, dasyprocta, against,
Nearest to other: reuptake, multiple, xhtml, conic, cebus, rockwell, ligature, michelob,
Nearest to first: backslash, bigelow, numa, potsdam, austin, tamarin, lerner, heavy,
Nearest to from: in, through, into, at, vs, agouti, and, amalthea,
Nearest to after: in, as, before, prism, when, abitibi, dasyprocta, marmoset,
Nearest to there: they, it, he, which, easily, now, cebus, hardly,
Nearest to years: vasco, lombardy, autocad, aramaean, four, callithrix, outlined, detector,
Nearest to between: with, in, everywhere, vaccination, recitative, dasyprocta, from, on,
Nearest to nine: eight, six, seven, five, four, zero, three, callithrix,
Nearest to so: endeavour, tamarin, winston, advantages, ethnic, amalthea, msg, fidonet,
Average loss at step  62000 :  4.812170468568802
Average loss at step  64000 :  4.798992517709732
Average loss at step  66000 :  4.980687629342079
Average loss at step  68000 :  4.907456518173218
Average loss at step  70000 :  4.795046325325966
Nearest to more: less, absalom, roskilde, ona, microsite, quantifiers, most, not,
Nearest to six: eight, four, five, seven, three, nine, zero, two,
Nearest to be: been, have, by, were, is, are, refer, was,
Nearest to known: used, noticing, reuptake, modernity, atkinson, exploding, hahn, rakyat,
Nearest to two: three, four, six, one, five, seven, eight, dasyprocta,
Nearest to use: risk, passport, dasyprocta, microseconds, callithrix, backslash, abitibi, microcebus,
Nearest to for: in, of, and, with, including, or, against, dasyprocta,
Nearest to other: reuptake, multiple, cebus, many, xhtml, ligature, different, conic,
Nearest to first: bigelow, backslash, thaler, numa, potsdam, austin, heavy, abitibi,
Nearest to from: through, in, into, vs, amalthea, during, agouti, on,
Nearest to after: before, when, in, prism, abitibi, dasyprocta, marmoset, while,
Nearest to there: they, it, he, which, easily, now, cebus, also,
Nearest to years: four, lombardy, autocad, aramaean, vasco, six, callithrix, govern,
Nearest to between: with, in, from, everywhere, dasyprocta, vaccination, recitative, around,
Nearest to nine: eight, six, seven, five, zero, four, three, callithrix,
Nearest to so: endeavour, winston, tamarin, advantages, amalthea, thz, ethnic, fidonet,
Average loss at step  72000 :  4.805254952311516
Average loss at step  74000 :  4.778901010006666
Average loss at step  76000 :  4.864288411319256
Average loss at step  78000 :  4.7959640386104585
Average loss at step  80000 :  4.8224020563364025
Nearest to more: less, most, absalom, roskilde, microsite, ona, not, quantifiers,
Nearest to six: five, four, seven, eight, three, nine, two, zero,
Nearest to be: have, been, by, was, were, are, refer, is,
Nearest to known: used, noticing, reuptake, modernity, hahn, atkinson, exploding, microscope,
Nearest to two: three, four, six, five, seven, one, eight, callithrix,
Nearest to use: risk, passport, dasyprocta, microseconds, callithrix, backslash, microcebus, substituting,
Nearest to for: dasyprocta, or, with, in, against, cebus, patricio, primigenius,
Nearest to other: many, xhtml, reuptake, cebus, multiple, different, these, some,
Nearest to first: bigelow, backslash, second, thaler, numa, potsdam, latter, austin,
Nearest to from: through, into, in, on, during, amalthea, vs, at,
Nearest to after: before, when, prism, in, during, abitibi, dasyprocta, marmoset,
Nearest to there: they, it, he, which, now, easily, cebus, hardly,
Nearest to years: six, autocad, aramaean, lombardy, vasco, callithrix, govern, coinciding,
Nearest to between: with, in, recitative, from, dasyprocta, vaccination, everywhere, emphasised,
Nearest to nine: eight, seven, six, five, four, zero, three, callithrix,
Nearest to so: endeavour, advantages, tamarin, amalthea, winston, msg, fidonet, thz,
Average loss at step  82000 :  4.811216924786567
Average loss at step  84000 :  4.795806077718734
Average loss at step  86000 :  4.753829827547073
Average loss at step  88000 :  4.6853781598806385
Average loss at step  90000 :  4.764363196372986
Nearest to more: less, most, roskilde, absalom, ona, not, microsite, bengali,
Nearest to six: eight, five, seven, four, nine, three, two, zero,
Nearest to be: been, have, was, were, are, by, refer, is,
Nearest to known: used, noticing, modernity, reuptake, exploding, hahn, atkinson, contrasting,
Nearest to two: three, four, five, six, one, seven, eight, dasyprocta,
Nearest to use: risk, passport, morgue, dasyprocta, substituting, backslash, callithrix, microseconds,
Nearest to for: of, with, or, including, dasyprocta, patricio, cebus, microcebus,
Nearest to other: xhtml, many, different, linebarger, multiple, reuptake, cebus, some,
Nearest to first: backslash, second, bigelow, thaler, numa, latter, tamarin, potsdam,
Nearest to from: through, into, in, amalthea, during, on, at, agouti,
Nearest to after: before, when, during, prism, abitibi, dasyprocta, until, while,
Nearest to there: they, it, he, which, now, cebus, easily, hardly,
Nearest to years: autocad, lombardy, six, aramaean, five, months, vasco, coinciding,
Nearest to between: with, in, from, dasyprocta, vaccination, recitative, everywhere, emphasised,
Nearest to nine: eight, seven, six, five, four, zero, callithrix, dasyprocta,
Nearest to so: advantages, endeavour, amalthea, winston, fidonet, msg, disabled, believe,
Average loss at step  92000 :  4.711258682847023
Average loss at step  94000 :  4.632245309472084
Average loss at step  96000 :  4.737491478085518
Average loss at step  98000 :  4.6130690822601315
Average loss at step  100000 :  4.670656157255173
Nearest to more: less, most, roskilde, absalom, microsite, ona, olympias, bengali,
Nearest to six: seven, eight, five, four, nine, three, two, zero,
Nearest to be: been, have, are, by, were, was, is, refer,
Nearest to known: used, noticing, modernity, reuptake, called, tamarin, genuine, microscope,
Nearest to two: four, three, five, six, seven, one, eight, callithrix,
Nearest to use: risk, morgue, passport, dasyprocta, substituting, microseconds, microcebus, callithrix,
Nearest to for: or, with, dasyprocta, in, during, and, cebus, including,
Nearest to other: xhtml, many, linebarger, cebus, various, reuptake, different, multiple,
Nearest to first: second, backslash, thaler, bigelow, numa, latter, next, under,
Nearest to from: through, in, into, during, at, amalthea, on, agouti,
Nearest to after: before, when, during, abitibi, prism, in, dasyprocta, until,
Nearest to there: they, it, he, now, which, cebus, hardly, still,
Nearest to years: six, autocad, aramaean, lombardy, months, days, bokassa, callithrix,
Nearest to between: with, in, from, everywhere, vaccination, pleated, of, around,
Nearest to nine: eight, seven, six, five, four, zero, three, callithrix,
Nearest to so: advantages, endeavour, msg, believe, amalthea, tamarin, aon, winston,
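
With training finished, final_embeddings together with dictionary and reverse_dictionary can be used to query the nearest neighbours of any in-vocabulary word. A minimal helper sketch (the query word 'three' is just an example):

def nearest_words(query_word, top_k=8):
    # final_embeddings rows are already L2-normalized, so a dot product is cosine similarity.
    query_vec = final_embeddings[dictionary[query_word]]
    sims = np.dot(final_embeddings, query_vec)
    nearest = (-sims).argsort()[1:top_k + 1]   # skip the query word itself
    return [reverse_dictionary[i] for i in nearest]

print(nearest_words('three'))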

Finally, we visualize the learned word vectors on a 2D canvas using t-SNE.

In [35]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 200
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:])
labels = [reverse_dictionary[i] for i in range(plot_only)]

plt.figure(figsize=(18, 18))  #in inches
for i, label in enumerate(labels):
    x, y = low_dim_embs[i,:]
    plt.scatter(x, y)
    plt.annotate(label,
                 xy=(x, y),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')

plt.savefig('result.png')
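
If you want to reuse the trained vectors outside this notebook, one simple option is to dump them to disk (the file names here are just examples):

import numpy as np

np.save('final_embeddings.npy', final_embeddings)    # the 50000 x 128 embedding matrix
with open('vocabulary.txt', 'w') as f:
    for i in range(vocabulary_size):
        f.write(reverse_dictionary[i] + '\n')        # row i of the matrix corresponds to this word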

[result.png: t-SNE visualization of the first 200 word embeddings]
