
How can I adjust the sensitivity of kMeans clustering?

I have the following dataset:


            node        bc  cluster
    1    russian  0.457039        1
    48       man  0.286875        1
    155    woman  0.129939        0
    3        bit  0.092721        0
    5      write  0.065424        0
    98       age  0.064347        0
    97     escap  0.062675        0
    74      game  0.062606        0

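I then run kMeans clustering on the bc values to split the nodes into two distinct groups. With the code below I get the result shown above (the cluster assignment is in the cluster column).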


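    import pandas as pd
    from sklearn.cluster import KMeans

    # bc_nodes and bc_values hold the node names and their bc scores
    bc_df = pd.DataFrame({"node": bc_nodes, "bc": bc_values})
    bc_df = bc_df.sort_values("bc", ascending=False)
    # Cluster on the single bc column
    km = KMeans(n_clusters=2).fit(bc_df[['bc']])
    bc_df['cluster'] = km.labels_
    print(bc_df.head(8))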

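This works, but I would like it to behave slightly differently: put the top 4 nodes into the first cluster and the remaining nodes into the second cluster, because those are more similar to each other.

Is there something I can tweak in kMeans, or do you know of another algorithm in sklearn that can do this?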


开满天机
3 Answers

HUH函数

It looks like what you want is clustering of one-dimensional data. One way to approach this is Jenks Natural Breaks (search for it to find an explanation). I did not write this function myself; much of the credit goes to @Frank's solution.

Given your dataframe:

    import pandas as pd

    df = pd.DataFrame([['russian', 0.457039],
                       ['man', 0.286875],
                       ['woman', 0.129939],
                       ['bit', 0.092721],
                       ['write', 0.065424],
                       ['age', 0.064347],
                       ['escap', 0.062675],
                       ['game', 0.062606]], columns=['node', 'bc'])

Code using the Jenks natural breaks function:

    def get_jenks_breaks(data_list, number_class):
        # Work on a sorted copy so the caller's list is not modified
        data_list = sorted(data_list)
        # mat1 stores the start position of the last class, mat2 the variance cost
        mat1 = []
        for i in range(len(data_list) + 1):
            temp = []
            for j in range(number_class + 1):
                temp.append(0)
            mat1.append(temp)
        mat2 = []
        for i in range(len(data_list) + 1):
            temp = []
            for j in range(number_class + 1):
                temp.append(0)
            mat2.append(temp)
        for i in range(1, number_class + 1):
            mat1[1][i] = 1
            mat2[1][i] = 0
            for j in range(2, len(data_list) + 1):
                mat2[j][i] = float('inf')
        v = 0.0
        for l in range(2, len(data_list) + 1):
            s1 = 0.0
            s2 = 0.0
            w = 0.0
            for m in range(1, l + 1):
                i3 = l - m + 1
                val = float(data_list[i3 - 1])
                s2 += val * val
                s1 += val
                w += 1
                # Sum of squared deviations of the last m values
                v = s2 - (s1 * s1) / w
                i4 = i3 - 1
                if i4 != 0:
                    for j in range(2, number_class + 1):
                        if mat2[l][j] >= (v + mat2[i4][j - 1]):
                            mat1[l][j] = i3
                            mat2[l][j] = v + mat2[i4][j - 1]
            mat1[l][1] = 1
            mat2[l][1] = v
        k = len(data_list)
        kclass = []
        for i in range(number_class + 1):
            kclass.append(min(data_list))
        kclass[number_class] = float(data_list[len(data_list) - 1])
        count_num = number_class
        # Walk back through mat1 to read off the break values
        while count_num >= 2:
            idx = int((mat1[k][count_num]) - 2)
            kclass[count_num - 1] = data_list[idx]
            k = int((mat1[k][count_num] - 1))
            count_num -= 1
        return kclass

    # Get values to find the natural breaks
    x = list(df['bc'])

    # Calculate the break values.
    # I want 2 groups, so the parameter is 2.
    # If you print(get_jenks_breaks(x, 2)), it gives you 3 values: [min, break1, max]
    # If you want more groups, adjust this and also the assign_cluster function below.
    breaking_point = get_jenks_breaks(x, 2)[1]

    # Create the group for the bc column
    def assign_cluster(bc):
        if bc < breaking_point:
            return 0
        else:
            return 1

    # Apply `assign_cluster` to `df['bc']`
    df['cluster'] = df['bc'].apply(assign_cluster)

Output:

    print(df)

          node        bc  cluster
    0  russian  0.457039        1
    1      man  0.286875        1
    2    woman  0.129939        1
    3      bit  0.092721        0
    4    write  0.065424        0
    5      age  0.064347        0
    6    escap  0.062675        0
    7     game  0.062606        0
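For the two-class case, the criterion that natural breaks minimizes (the total within-class sum of squared deviations) can be checked by brute force. The following is only a minimal sketch under my own assumptions: the helper name `best_two_class_split` is hypothetical, and placing the threshold midway between the two neighbouring values is my choice, not part of the Jenks definition.

```python
import numpy as np

def best_two_class_split(values):
    """Try every cut of the sorted values and keep the one with the
    smallest total within-class sum of squared deviations."""
    x = np.sort(np.asarray(values, dtype=float))
    best_cost, best_i = np.inf, None
    for i in range(1, len(x)):
        low, high = x[:i], x[i:]
        cost = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        if cost < best_cost:
            best_cost, best_i = cost, i
    # Assumption: put the threshold halfway between the neighbouring values
    threshold = (x[best_i - 1] + x[best_i]) / 2
    return threshold, best_cost

bc = [0.457039, 0.286875, 0.129939, 0.092721,
      0.065424, 0.064347, 0.062675, 0.062606]
threshold, cost = best_two_class_split(bc)
labels = (np.asarray(bc) > threshold).astype(int)
print(threshold, labels)
```

On the sample data the lowest-cost cut falls between 0.129939 and 0.286875, so a variance-based criterion on its own separates the top two values from the other six.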

MM们

The first two values will always land in a different class from the rest, because every value from the third one onward is below the mean of ~0.152703 while the first two are above it. Since your problem can also be read as a simple two-class problem, you could instead use the median of ~0.0790725 to separate the two classes:

    idx = df['bc'] > df['bc'].median()

You can now use this index to select the two classes separated by the median:

    df[idx]

gives

            node        bc  cluster
    1    russian  0.457039        1
    48       man  0.286875        1
    155    woman  0.129939        0
    3        bit  0.092721        0

and

    df[~idx]

gives

         node        bc  cluster
    5   write  0.065424        0
    98    age  0.064347        0
    97  escap  0.062675        0
    74   game  0.062606        0
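If the goal is a label column rather than two separate frames, the same boolean mask can be written back into the dataframe. This is a small sketch that rebuilds the sample data with the original index labels; note that a median split always divides the rows into halves, so it yields the top 4 here only because there are eight rows.

```python
import pandas as pd

df = pd.DataFrame(
    {"node": ["russian", "man", "woman", "bit", "write", "age", "escap", "game"],
     "bc": [0.457039, 0.286875, 0.129939, 0.092721,
            0.065424, 0.064347, 0.062675, 0.062606]},
    index=[1, 48, 155, 3, 5, 98, 97, 74],
)

# 1 for values above the median, 0 for the rest
df["cluster"] = (df["bc"] > df["bc"].median()).astype(int)
print(df)
```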

波斯汪

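Just choose the threshold yourself. Hacking an algorithm until it gives you the result you want is not appropriate. If you want the top five terms to be one cluster, simply label them that way. Don't pretend it is a clustering result.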
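Labeling by hand could look like the sketch below. The cut-off `top_n` and the column name `group` are my own illustrative choices (the column is deliberately not called `cluster`); the value 4 matches the split asked for in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"node": ["russian", "man", "woman", "bit", "write", "age", "escap", "game"],
     "bc": [0.457039, 0.286875, 0.129939, 0.092721,
            0.065424, 0.064347, 0.062675, 0.062606]},
    index=[1, 48, 155, 3, 5, 98, 97, 74],
)

top_n = 4  # chosen by hand, not derived from the data

# Mark the rows holding the top_n largest bc values
top_index = df["bc"].nlargest(top_n).index
df["group"] = 0
df.loc[top_index, "group"] = 1
print(df)
```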