# Implementing skip-gram in Python: an example
What is skip-gram? In one sentence: use the center word to predict the words around it. It is the most commonly used training scheme in Word2Vec.

## Example

### 1. Install dependencies

```bash
pip install matplotlib
# other dependencies such as torch are already installed
```

### 2. Create the Python file `skip_gram_demo.py`

Code (part 1: data preparation, model definition, and training):

```python
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

# ============================================
# 1. Data preparation and preprocessing
# ============================================
# A simple toy corpus
corpus = ("deep learning is powerful machine learning is a subset of "
          "artificial intelligence deep learning models are inspired by "
          "the brain natural language processing uses deep learning")

# Text cleaning and tokenization
words = corpus.lower().split()

# Build the vocabulary (word <-> index)
vocab = list(set(words))
word_to_idx = {w: i for i, w in enumerate(vocab)}
idx_to_word = {i: w for i, w in enumerate(vocab)}
vocab_size = len(vocab)

print(f"Vocabulary size: {vocab_size}")
print(f"Vocabulary: {vocab}")

# Generate training data (skip-gram: center word in -> context word out)
def create_dataloader(words, word_to_idx, window_size=2):
    inputs = []
    targets = []
    # Note: the first and last words are never used as centers here
    for i in range(1, len(words) - 1):
        center_word = words[i]
        center_idx = word_to_idx[center_word]
        # Collect the context window:
        # with window_size=2, take the 2 words on each side
        for j in range(i - window_size, i + window_size + 1):
            if j != i and 0 <= j < len(words):
                context_word = words[j]
                context_idx = word_to_idx[context_word]
                inputs.append(center_idx)
                targets.append(context_idx)
    return (torch.tensor(inputs, dtype=torch.long),
            torch.tensor(targets, dtype=torch.long))

inputs, targets = create_dataloader(words, word_to_idx, window_size=2)

# ============================================
# 2. Define the skip-gram model
# ============================================
class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGramModel, self).__init__()
        # Center-word embedding layer (W_in)
        self.w_in = nn.Embedding(vocab_size, embedding_dim)
        # Context-word embedding layer (W_out)
        self.w_out = nn.Embedding(vocab_size, embedding_dim)
        # Initialize the weights
        nn.init.xavier_uniform_(self.w_in.weight)
        nn.init.xavier_uniform_(self.w_out.weight)

    def forward(self, x):
        # x: (batch_size,)
        # Look up the center-word vectors
        embeds = self.w_in(x)  # (batch_size, embedding_dim)
        return embeds

    def loss(self, x, y):
        # x: center-word indices, y: context-word indices
        # 1. Center-word vectors
        v_center = self.w_in(x)    # (batch_size, dim)
        # 2. Context-word vectors
        v_context = self.w_out(y)  # (batch_size, dim)
        # 3. Dot product (similarity): the larger the dot product,
        #    the more probable the pair is taken to be
        score = torch.sum(torch.mul(v_center, v_context), dim=1)  # (batch_size,)
        # 4. Simplified negative-log-likelihood-style loss (no negative
        #    sampling). Real large-scale training usually combines this with
        #    negative sampling; here, to keep the demo simple, we directly
        #    maximize the target word's score.
        loss = -torch.mean(score)
        return loss

# ============================================
# 3. Train the model
# ============================================
embedding_dim = 10    # word-vector dimensionality
learning_rate = 0.01
epochs = 1000

model = SkipGramModel(vocab_size, embedding_dim)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

print("\nStarting training...")
for epoch in range(epochs):
    optimizer.zero_grad()
    # Forward pass
    loss = model.loss(inputs, targets)
    # Backward pass
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 200 == 0:
        print(f"Epoch {epoch + 1}, Loss: {loss.item():.4f}")
```
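One caveat before looking at the results: the simplified loss above only pushes the positive score up, so it is unbounded below, which is why the printed loss values keep drifting negative. As the code comment notes, real large-scale training combines this objective with negative sampling. Below is a minimal sketch of what a negative-sampling loss could look like; the helper name `neg_sampling_loss`, the `num_neg` parameter, and the uniform noise sampler are my own illustrative assumptions (Word2Vec proper draws negatives from a smoothed unigram distribution), not part of the original article.

```python
import torch
import torch.nn.functional as F

def neg_sampling_loss(model, x, y, num_neg=5):
    """Sketch of a skip-gram negative-sampling loss (hypothetical helper).

    Assumes `model` is the SkipGramModel above. Negatives are drawn
    uniformly for simplicity, not from the unigram^0.75 distribution
    used by the original Word2Vec.
    """
    v_center = model.w_in(x)        # (batch, dim)
    v_context = model.w_out(y)      # (batch, dim)

    # Positive pairs: maximize log sigmoid(u_o . v_c)
    pos_score = torch.sum(v_center * v_context, dim=1)   # (batch,)
    pos_loss = F.logsigmoid(pos_score)                   # (batch,)

    # Negative pairs: draw num_neg noise words per example and
    # maximize log sigmoid(-u_k . v_c) for each of them
    neg_idx = torch.randint(0, model.w_out.num_embeddings,
                            (x.size(0), num_neg))
    v_neg = model.w_out(neg_idx)                          # (batch, num_neg, dim)
    neg_score = torch.bmm(v_neg, v_center.unsqueeze(2)).squeeze(2)
    neg_loss = F.logsigmoid(-neg_score).sum(dim=1)        # (batch,)

    return -(pos_loss + neg_loss).mean()
```

To try it, you would replace `model.loss(inputs, targets)` in the training loop with `neg_sampling_loss(model, inputs, targets)`. Unlike the plain dot-product loss, this objective is bounded and penalizes high scores for random word pairs, so the loss curve stays meaningful.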
Code (part 2: evaluating and visualizing the results):

```python
# ============================================
# 4. Evaluate and visualize the results
# ============================================
import numpy as np
from sklearn.decomposition import PCA

print("\nTraining finished; inspecting word-vector similarities...")

# Pull out the learned embedding weights
embeddings = model.w_in.weight.data.numpy()

# Plain cosine similarity
def cosine_similarity(w1, w2):
    return np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))

# Probe a few words
test_words = ["learning", "deep", "artificial", "brain"]

for w1 in test_words:
    if w1 in word_to_idx:
        vec1 = embeddings[word_to_idx[w1]]
        print(f"\nWords most similar to '{w1}':")
        similarities = []
        for w2 in vocab:
            if w1 != w2:
                vec2 = embeddings[word_to_idx[w2]]
                sim = cosine_similarity(vec1, vec2)
                similarities.append((w2, sim))
        # Sort and print the top 3
        similarities.sort(key=lambda x: x[1], reverse=True)
        for word, score in similarities[:3]:
            print(f"  {word}: {score:.4f}")

# 2D visualization (PCA projection)
pca = PCA(n_components=2)
reduced_embeds = pca.fit_transform(embeddings)

plt.figure(figsize=(10, 8))
for i, word in enumerate(vocab):
    plt.scatter(reduced_embeds[i, 0], reduced_embeds[i, 1])
    plt.annotate(word, (reduced_embeds[i, 0], reduced_embeds[i, 1]))
plt.title("Word Embeddings Visualization (PCA)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.show()
```

### Output

```
Vocabulary size: 20
Vocabulary: ['artificial', 'inspired', 'brain', 'natural', 'is', 'are', 'learning', 'by', 'machine', 'powerful', 'processing', 'language', 'a', 'intelligence', 'uses', 'subset', 'deep', 'models', 'the', 'of']

Starting training...
Epoch 200, Loss: -0.0312
Epoch 400, Loss: -0.0661
Epoch 600, Loss: -0.1041
Epoch 800, Loss: -0.1467
Epoch 1000, Loss: -0.1957

Training finished; inspecting word-vector similarities...

Words most similar to 'learning':
  inspired: 0.6657
  are: 0.4793
  is: 0.4745

Words most similar to 'deep':
  machine: 0.6026
  intelligence: 0.5229
  processing: 0.4629

Words most similar to 'artificial':
  is: 0.5218
  by: 0.5195
  the: 0.5013

Words most similar to 'brain':
  subset: 0.2076
  powerful: 0.1457
  language: 0.0755
```

### Interpretation

We handed the model nothing but a jumble of raw text, and it still managed to sort these words into near/far relationships. Success.
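As a final sanity check on the "center word predicts its neighbors" idea from the introduction, here is a small snippet (my addition, assuming the variables from `skip_gram_demo.py` are still in scope) that prints the first few training pairs produced by `create_dataloader`:

```python
# Assumes inputs, targets, and idx_to_word from skip_gram_demo.py are in scope.
for center_idx, context_idx in zip(inputs[:6].tolist(), targets[:6].tolist()):
    print(f"center={idx_to_word[center_idx]:<10} -> context={idx_to_word[context_idx]}")
```

For this toy corpus it prints pairs such as `center=learning -> context=deep`, matching the window logic in `create_dataloader`: every word within two positions of the center becomes a prediction target.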