I need to connect the word 4G with mobile phones or Internet so that sentences about technology are clustered together. I have the following sentences:
4G is the fourth generation of broadband network.
4G is slow.
4G is defined as the fourth generation of mobile technology
I bought a new mobile phone.
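I need the sentences above to end up in the same cluster. At the moment they do not, probably because no relationship is found between 4G and mobile. I first tried wordnet.synsets to look for synonyms that connect 4G to Internet or mobile phone, but unfortunately it did not find any connection. I am clustering the sentences as follows: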
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import numpy
texts = ["4G is the fourth generation of broadband network.",
"4G is slow.",
"4G is defined as the fourth generation of mobile technology",
"I bought a new mobile phone."]
# vectorization of the sentences
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)
words = vectorizer.get_feature_names_out()
print("words", words)
n_clusters=3
number_of_seeds_to_try=10
max_iter = 300
model = KMeans(n_clusters=n_clusters, max_iter=max_iter, n_init=number_of_seeds_to_try).fit(X)
labels = model.labels_
# indices of the highest-weighted words in each cluster
ordered_words = model.cluster_centers_.argsort()[:, ::-1]
print("centers:", model.cluster_centers_)
print("labels", labels)
print("inertia:", model.inertia_)
texts_per_cluster = numpy.zeros(n_clusters)
# count how many texts were assigned to each cluster
for i_cluster in range(n_clusters):
    for label in labels:
        if label == i_cluster:
            texts_per_cluster[i_cluster] += 1
print("Top words per cluster:")
for i_cluster in range(n_clusters):
    print("Cluster:", i_cluster, "texts:", int(texts_per_cluster[i_cluster]))
    for term in ordered_words[i_cluster, :10]:
        print("\t" + words[term])
print("\n")
print("Prediction")
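To make the suspected cause concrete, here is a minimal sketch that uses only the standard library. It is not the scikit-learn pipeline above: it compares the sentences by cosine similarity over raw term counts, first on the plain tokens and then after adding related terms from a small hand-written map. The stop-word list, the `RELATED` map, and the helper names `tokenize`, `expand`, and `cosine` are all assumptions made up for this illustration.

```python
import math
from collections import Counter

texts = [
    "4G is the fourth generation of broadband network.",
    "4G is slow.",
    "4G is defined as the fourth generation of mobile technology",
    "I bought a new mobile phone.",
]

# Tiny illustrative stop-word list (an assumption, not scikit-learn's list).
STOP_WORDS = {"is", "the", "of", "as", "i", "a", "new"}

# Hypothetical hand-written map from a term to related concept terms.
RELATED = {
    "4g": ["mobile", "network"],
    "phone": ["mobile"],
    "broadband": ["network"],
}


def tokenize(text):
    """Lowercase, strip trailing punctuation, and drop stop words."""
    words = [w.strip(".,").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]


def expand(tokens):
    """Append the related terms of every token that has an entry in RELATED."""
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(RELATED.get(token, []))
    return expanded


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words (raw term counts)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


plain = [tokenize(t) for t in texts]
expanded = [expand(p) for p in plain]

# Print the similarity of every sentence pair before and after expansion.
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        before = cosine(plain[i], plain[j])
        after = cosine(expanded[i], expanded[j])
        print(f"sentences {i} and {j}: plain={before:.2f} expanded={after:.2f}")
```

On the plain tokens, "4G is slow." and "I bought a new mobile phone." share no term, so their similarity is exactly 0 and a bag-of-words model has nothing to group them by. Once the hypothetical map adds "mobile" to every sentence that mentions 4G, the similarity becomes positive. Whether a similar expansion step before TfidfVectorizer would be enough for KMeans to put all four sentences in one cluster is something I have not verified.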
Any help with this would be greatly appreciated.