HUH函数
It looks like what you want is clustering of one-dimensional data. One way to solve this is Jenks Natural Breaks (search for it to find an explanation). I didn't write this function; much of the credit goes to @Frank for the original solution.

Given your dataframe:

    import pandas as pd

    df = pd.DataFrame([['russian', 0.457039],
                       ['man', 0.286875],
                       ['woman', 0.129939],
                       ['bit', 0.092721],
                       ['write', 0.065424],
                       ['age', 0.064347],
                       ['escap', 0.062675],
                       ['game', 0.062606]],
                      columns=['node', 'bc'])

Code using the Jenks Natural Breaks function:

    def get_jenks_breaks(data_list, number_class):
        # Work on a sorted copy so the caller's list is not modified.
        data_list = sorted(data_list)

        # mat1[l][j]: 1-based index where the last of j classes starts
        # when only the first l values are considered.
        mat1 = []
        for i in range(len(data_list) + 1):
            temp = []
            for j in range(number_class + 1):
                temp.append(0)
            mat1.append(temp)

        # mat2[l][j]: smallest total sum of squared deviations found so far
        # for splitting the first l values into j classes.
        mat2 = []
        for i in range(len(data_list) + 1):
            temp = []
            for j in range(number_class + 1):
                temp.append(0)
            mat2.append(temp)

        for i in range(1, number_class + 1):
            mat1[1][i] = 1
            mat2[1][i] = 0
            for j in range(2, len(data_list) + 1):
                mat2[j][i] = float('inf')

        v = 0.0
        for l in range(2, len(data_list) + 1):
            s1 = 0.0  # running sum of values
            s2 = 0.0  # running sum of squared values
            w = 0.0   # number of values in the running window
            for m in range(1, l + 1):
                i3 = l - m + 1
                val = float(data_list[i3 - 1])
                s2 += val * val
                s1 += val
                w += 1
                # Sum of squared deviations of values i3..l.
                v = s2 - (s1 * s1) / w
                i4 = i3 - 1
                if i4 != 0:
                    for j in range(2, number_class + 1):
                        if mat2[l][j] >= (v + mat2[i4][j - 1]):
                            mat1[l][j] = i3
                            mat2[l][j] = v + mat2[i4][j - 1]
            mat1[l][1] = 1
            mat2[l][1] = v

        # Walk back through mat1 to recover the break values.
        k = len(data_list)
        kclass = []
        for i in range(number_class + 1):
            kclass.append(min(data_list))
        kclass[number_class] = float(data_list[len(data_list) - 1])
        count_num = number_class
        while count_num >= 2:
            # print("rank = " + str(mat1[k][count_num]))
            idx = int((mat1[k][count_num]) - 2)
            # print("val = " + str(data_list[idx]))
            kclass[count_num - 1] = data_list[idx]
            k = int((mat1[k][count_num] - 1))
            count_num -= 1
        return kclass

    # Get values to find the natural breaks
    x = list(df['bc'])

    # Calculate the break values.
    # I want 2 groups, so the parameter is 2.
    # If you print(get_jenks_breaks(x, 2)), it will give you 3 values: [min, break1, max]
    # Obviously, if you want more groups, you'll need to adjust this and also
    # adjust the assign_cluster function below.
    breaking_point = get_jenks_breaks(x, 2)[1]

    # Create the group for the bc column
    def assign_cluster(bc):
        if bc < breaking_point:
            return 0
        else:
            return 1

    # Apply `assign_cluster` to `df['bc']`
    df['cluster'] = df['bc'].apply(assign_cluster)

Output:

    print(df)

          node        bc  cluster
    0  russian  0.457039        1
    1      man  0.286875        1
    2    woman  0.129939        1
    3      bit  0.092721        0
    4    write  0.065424        0
    5      age  0.064347        0
    6    escap  0.062675        0
    7     game  0.062606        0
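
If you need more than two groups, the hard-coded comparison in `assign_cluster` no longer fits. One possible way to generalize it is sketched below. This is only an illustration, not part of the original solution: the name `assign_cluster_multi` and the example break values are hypothetical, and it assumes the breaks list has the shape `[min, break1, ..., max]`. It uses `bisect_right` from the standard library so that a value equal to a break lands in the upper group, mirroring the `<` comparison in `assign_cluster`.

```python
from bisect import bisect_right


def assign_cluster_multi(value, breaks):
    """Return a 0-based group index for `value`.

    Assumes `breaks` looks like [min, break1, ..., max]. Only the interior
    breaks are used; a value equal to a break falls into the upper group,
    which mirrors the `<` comparison in `assign_cluster`.
    """
    interior = breaks[1:-1]
    # Number of interior breaks that are <= value.
    return bisect_right(interior, value)


# Hypothetical break values, typed in by hand for illustration only.
example_breaks = [0.0, 0.1, 0.3, 0.5]
example_values = [0.05, 0.1, 0.2, 0.3, 0.45]
example_groups = [assign_cluster_multi(v, example_breaks) for v in example_values]
print(example_groups)  # [0, 1, 1, 2, 2]
```

With a dataframe, something like `df['bc'].apply(lambda bc: assign_cluster_multi(bc, breaks))` should work, where `breaks` is whatever list your break function returned for the number of groups you asked for.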