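The cause is wup_similarity, which internally uses the synsets of single tokens to compute similarity:

"Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node)."

The ancestor nodes of cricket and football are the same, and both sentences contain exactly the same tokens, so a token-level comparison cannot tell them apart and wup_similarity returns 1.

If you want to fix this, wup_similarity is not a good choice. The simplest token-based approach is to fit a vectorizer and then compute the similarity. For example:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = ["football is good,cricket is bad", "cricket is good,football is bad"]

    # Word n-grams up to length 3 capture some word order, unlike single tokens.
    vectorizer = CountVectorizer(ngram_range=(1, 3))
    vectorizer.fit(corpus)

    x1 = vectorizer.transform(["football is good,cricket is bad"])
    x2 = vectorizer.transform(["cricket is good,football is bad"])

    # Returns a 1x1 array holding the cosine similarity of the two sentences.
    print(cosine_similarity(x1, x2))

There are, however, smarter ways to measure semantic similarity. One that is easy to try is Google's USE Encoder. See this link.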
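To see why the n-gram range matters for the vectorizer approach, here is a minimal dependency-free sketch. It is not scikit-learn's implementation: the helpers `ngram_counts` and `cosine` are hypothetical names introduced for illustration, and the tokenization is a simplified assumption (lowercased runs of word characters). It compares the two example sentences first with single tokens only, then with 1- to 3-grams.

```python
import math
import re
from collections import Counter


def ngram_counts(text, n_min=1, n_max=3):
    """Count word n-grams of length n_min..n_max (hypothetical helper)."""
    # Simplified tokenization: lowercase, split on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


s1 = "football is good,cricket is bad"
s2 = "cricket is good,football is bad"

# Single tokens only: both sentences have identical counts, so the score
# is (up to floating-point rounding) 1.0.
unigram_score = cosine(ngram_counts(s1, 1, 1), ngram_counts(s2, 1, 1))

# 1- to 3-grams: the sentences share 4 of 5 bigrams and no trigrams,
# which gives 12/17, roughly 0.71.
ngram_score = cosine(ngram_counts(s1, 1, 3), ngram_counts(s2, 1, 3))

print(unigram_score, ngram_score)
```

With single tokens the two sentences are indistinguishable, which is the same blind spot as the token-level synset comparison. Adding bigrams and trigrams lowers the score because phrases such as "football is good" and "cricket is good" no longer match.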