Word Vectors: From One-hot Encoding to Word2Vec, In-Depth Understanding and Practice - Original Notes - 慕课网

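1. Introducing Word Vectors: Motivation and Concept

In natural language processing (NLP), a key problem is how to turn text into a mathematical representation that a computer can work with. Traditionally, each word is encoded as a "one-hot" vector: a vector as long as the vocabulary, with a single 1 at the word's index and 0 everywhere else. In practice this has serious limitations: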

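Dimension explosion: the vector length equals the vocabulary size, so every new word adds one dimension to every vector. With vocabularies of hundreds of thousands of words, the storage and computation costs become substantial.
Sparsity: all but one element of a one-hot vector are 0, which makes comparisons and computations between vectors wasteful and uninformative.
Loss of semantic information: any two distinct one-hot vectors are orthogonal, so the encoding cannot express similarity between words or any contextual information.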
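The limitations listed above can be seen on a tiny example. The following is a minimal sketch in NumPy; the four-word vocabulary and the helper name `one_hot` are made up for illustration and do not come from any particular library.

```python
import numpy as np

# Toy vocabulary, invented for this illustration.
vocabulary = ["king", "queen", "apple", "orange"]
word_to_index = {word: idx for idx, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of length len(vocabulary) with a single 1."""
    vector = np.zeros(len(vocabulary))
    vector[word_to_index[word]] = 1.0
    return vector

king, queen, apple = one_hot("king"), one_hot("queen"), one_hot("apple")

print(king.shape)            # one dimension per vocabulary word
print(np.dot(king, queen))   # distinct words share no non-zero position
print(np.dot(king, apple))   # so "queen" is no closer to "king" than "apple" is
```

Both dot products are zero, which is exactly the "loss of semantic information" problem: the encoding treats every pair of different words as equally unrelated.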

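Distributed representations address these problems by representing each word as a low-dimensional, dense vector. This reduces the dimensionality and also lets the vector capture semantic features of the word.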
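To make the contrast with one-hot vectors concrete, here is a small sketch of how dense vectors can express similarity. The three-dimensional values below are hand-picked for illustration, not trained embeddings, and `cosine_similarity` is a helper defined here rather than a library function.

```python
import numpy as np

# Hand-picked 3-dimensional vectors (assumed values, NOT trained embeddings).
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])

# With dense vectors, related words can score higher than unrelated ones.
print(sim_king_queen > sim_king_apple)
```

With real embeddings the values are learned from text rather than chosen by hand, but the comparison works the same way.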

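2. Vector Representations and Problem Analysis

The core idea behind these representations is the distributional hypothesis: the meaning of a word can be inferred from its distribution across a large body of text, that is, from the contexts in which it appears. Word2Vec builds on this idea and uses two models, CBOW and Skip-gram, to learn semantic features of words efficiently from large corpora.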
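One simple way to look at a word's "distribution" is to count which words appear next to it. The sketch below does this for a made-up three-sentence corpus; it illustrates the idea behind the hypothesis and is not how Word2Vec itself is trained. The function name `cooccurrence_counts` is an assumption of this example.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this illustration.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the dog".split(),
]

def cooccurrence_counts(sentences, window_size=1):
    """Count, for every word, the words seen within window_size positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for position, word in enumerate(tokens):
            start = max(0, position - window_size)
            end = min(len(tokens), position + window_size + 1)
            for neighbor_position in range(start, end):
                if neighbor_position != position:
                    counts[word][tokens[neighbor_position]] += 1
    return counts

counts = cooccurrence_counts(corpus)
print(counts["cat"])
print(counts["dog"])
```

In this corpus "cat" and "dog" share the neighbors "the" and "drinks", which is the kind of overlap that leads distributional methods to place the two words close together.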

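3. Exploring Distributed Representations

A distributed representation maps each word to a low-dimensional dense vector, replacing the one-hot encoding. Word2Vec learns this mapping with either CBOW or Skip-gram; both learn from the relationship between a word and its surrounding context.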
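The step from one-hot to dense vectors can be read as a matrix lookup: multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix. The sketch below checks this with toy sizes and a randomly initialized matrix; all names and sizes are assumptions chosen for illustration.

```python
import numpy as np

vocab_size, embedding_dim = 5, 3          # toy sizes chosen for illustration
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

via_matmul = one_hot @ embedding_matrix   # one-hot vector times the matrix
via_lookup = embedding_matrix[word_index] # direct row lookup

print(np.allclose(via_matmul, via_lookup))
```

This is why implementations usually store only the word index and read the row directly instead of materializing the one-hot vector.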

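4. The Word2Vec Models in Detail

CBOW (Continuous Bag of Words): predicts the target word from its context, much as a reader uses the surrounding words to make sense of the word in the middle. The context vectors are combined (typically averaged), so word order inside the window is ignored.
Skip-gram: predicts the context words from the target word. Each (target word, context word) pair is a separate training example, which makes training slower than CBOW but is commonly reported to work better for infrequent words.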
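The difference between the two models shows up in the shape of their training examples. The following sketch generates both kinds from one short sentence; the function names `skipgram_pairs` and `cbow_examples` are hypothetical helpers written for this illustration.

```python
def skipgram_pairs(tokens, window_size):
    """Skip-gram: one (center, context) pair per context word."""
    pairs = []
    for position, center in enumerate(tokens):
        start = max(0, position - window_size)
        end = min(len(tokens), position + window_size + 1)
        for i in range(start, end):
            if i != position:
                pairs.append((center, tokens[i]))
    return pairs

def cbow_examples(tokens, window_size):
    """CBOW: one (context words, center) example per position."""
    examples = []
    for position, center in enumerate(tokens):
        start = max(0, position - window_size)
        end = min(len(tokens), position + window_size + 1)
        context = [tokens[i] for i in range(start, end) if i != position]
        examples.append((context, center))
    return examples

tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window_size=1))
print(cbow_examples(tokens, window_size=1))
```

For the same text, Skip-gram yields more and smaller examples than CBOW, which is one way to see why it is slower to train.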

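5. Building Training Data and Optimization Strategies

Training data is usually built with a sliding window: the window moves over the text one position at a time, and the center word together with the words inside the window forms the input/output pairs. Optimization choices include the embedding dimension, the window size, and negative sampling, which replaces the full softmax over the vocabulary with a few sampled "negative" words per training pair. These choices affect both training efficiency and how well the vectors generalize.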
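As an illustration of negative sampling, the sketch below builds a sampling distribution from word counts and draws a few negative words. It assumes the commonly used variant in which unigram frequencies are raised to the power 0.75 before normalizing; the toy text and the function name `negative_sampling_distribution` are made up for this example.

```python
import numpy as np
from collections import Counter

# Toy text, invented for this illustration.
tokens = "the cat sat on the mat and the dog sat too".split()
counts = Counter(tokens)
vocabulary = sorted(counts)
frequencies = np.array([counts[word] for word in vocabulary], dtype=float)

def negative_sampling_distribution(frequencies, power=0.75):
    """Smooth the unigram distribution so frequent words are sampled a bit less."""
    weights = frequencies ** power
    return weights / weights.sum()

unigram = frequencies / frequencies.sum()
smoothed = negative_sampling_distribution(frequencies)

# Draw five negative word indices from the smoothed distribution.
rng = np.random.default_rng(0)
negatives = rng.choice(len(vocabulary), size=5, p=smoothed)
print([vocabulary[i] for i in negatives])
```

Compared with the raw unigram distribution, the smoothed one gives the most frequent word ("the" here) a somewhat lower probability and rare words a somewhat higher one.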

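6. Hands-on: Model and Code Implementation

To put the theory into practice, the following are simplified steps for implementing a Word2Vec model in Python with NumPy: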

Initialization:

import numpy as np

class Word2Vec:
    def __init__(self, vocabulary, window_size, embedding_dim=50, seed=0):
        self.vocabulary = list(vocabulary)
        self.window_size = window_size
        self.embedding_dim = embedding_dim
        self.word_to_index = {word: idx for idx, word in enumerate(self.vocabulary)}
        self.rng = np.random.default_rng(seed)
        self.build_word_vectors()

Build the word-vector matrices (method of Word2Vec):

    def build_word_vectors(self):
        vocab_size = len(self.vocabulary)
        # Input (center-word) vectors: one low-dimensional row per word, small random values.
        self.word_vectors = self.rng.normal(size=(vocab_size, self.embedding_dim)) / np.sqrt(self.embedding_dim)
        # Output (context-word) vectors: same shape, initialized to zero.
        self.context_vectors = np.zeros((vocab_size, self.embedding_dim))

Train the model (methods of Word2Vec; Skip-gram with negative sampling):

    def train(self, corpus, learning_rate=0.025, num_epochs=5, num_negative=5):
        # corpus is a list of sentences, each a list of word strings.
        for _ in range(num_epochs):
            for sentence in corpus:
                # Drop out-of-vocabulary words and map the rest to indices.
                token_ids = [self.word_to_index[w] for w in sentence if w in self.word_to_index]
                for position, center_index in enumerate(token_ids):
                    for context_index in self.get_context_words(token_ids, position):
                        self.update_word_vectors(center_index, context_index, learning_rate, num_negative)

    def update_word_vectors(self, center_index, context_index, learning_rate=0.025, num_negative=5):
        # One positive target (label 1) plus negatives (label 0).
        # Simplification: negatives are drawn uniformly from the vocabulary.
        negatives = self.rng.integers(0, len(self.vocabulary), size=num_negative)
        negatives = negatives[negatives != context_index]
        targets = np.concatenate(([context_index], negatives))
        labels = np.zeros(len(targets))
        labels[0] = 1.0

        center = self.word_vectors[center_index]
        outputs = self.context_vectors[targets]
        scores = 1.0 / (1.0 + np.exp(-(outputs @ center)))  # sigmoid of the dot products
        errors = scores - labels                            # gradient of the logistic loss
        grad_center = errors @ outputs

        # subtract.at accumulates correctly when a negative index is sampled more than once.
        np.subtract.at(self.context_vectors, targets, learning_rate * np.outer(errors, center))
        self.word_vectors[center_index] -= learning_rate * grad_center

Get the context words (method of Word2Vec):

    def get_context_words(self, token_ids, position):
        # Words within window_size positions of the center, skipping the center itself.
        start = max(0, position - self.window_size)
        end = min(len(token_ids), position + self.window_size + 1)
        return [token_ids[i] for i in range(start, end) if i != position]

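The code above sketches the basic Word2Vec training flow: initializing the word vectors, extracting context words, and updating the vectors during training. It is a teaching sketch rather than a complete pipeline; data loading, preprocessing, and downstream use of the vectors still have to be added around it.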
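Once word vectors exist, a typical first application is to look up a word's nearest neighbors by cosine similarity. The sketch below shows one way to do that over any (vocabulary, matrix) pair; the function name `most_similar` and the hand-picked vectors are assumptions for illustration, not trained results.

```python
import numpy as np

def most_similar(word, vocabulary, word_vectors, top_n=3):
    """Return the top_n words closest to `word` by cosine similarity."""
    word_to_index = {w: i for i, w in enumerate(vocabulary)}
    norms = np.linalg.norm(word_vectors, axis=1, keepdims=True)
    normalized = word_vectors / np.clip(norms, 1e-12, None)  # avoid division by zero
    similarities = normalized @ normalized[word_to_index[word]]
    ranked = np.argsort(-similarities)                       # highest similarity first
    results = [(vocabulary[i], float(similarities[i])) for i in ranked if vocabulary[i] != word]
    return results[:top_n]

# Hand-picked vectors (assumed values, NOT trained embeddings).
vocabulary = ["king", "queen", "apple", "orange"]
word_vectors = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.1],
    [0.1, 0.2, 0.9],
    [0.2, 0.1, 0.8],
])

neighbors = most_similar("king", vocabulary, word_vectors, top_n=2)
print(neighbors)
```

The same function can be pointed at a trained embedding matrix, with rows ordered like the vocabulary list.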