Linear / order-preserving clustering in Python

As mentioned earlier, I think the straightforward(ish) way to get the result you want is to run ordinary K-means clustering and then post-process the labels it produces.

Explanation: take the K-means output and iterate over it, keeping track of the previous item's cluster label and the current item's cluster label, and control when a new cluster gets created based on a condition. The details are in the code comments.

    import numpy as np
    from sklearn.cluster import KMeans

    lst = [10, 11.1, 30.4, 30.0, 32.9, 4.5, 7.2]
    km = KMeans(n_clusters=3).fit(np.array(lst).reshape(-1, 1))
    print(km.labels_)
    # [0 0 1 1 1 2 2]: OK output

    lst = [10, 11.1, 30.4, 30.0, 32.9, 6.2, 31.2, 29.8, 12.3, 10.5]
    km = KMeans(n_clusters=3).fit(np.array(lst).reshape(-1, 1))
    print(km.labels_)
    # [0 0 1 1 1 2 1 1 0 0]. Desired output: [0 0 1 1 1 1 1 1 2 2]

    def linear_order_clustering(km_labels, outlier_tolerance=1):
        '''Expects clustering outputs as an array/list'''
        prev_label = km_labels[0]  # last seen item's (possibly overwritten) K-means label
        cluster = 0                # counter for our new linear clustering outputs
        result = [cluster]         # initialize first entry
        # i is offset by one: the current item sits at km_labels[i + 1]
        for i, label in enumerate(km_labels[1:]):
            if prev_label == label:
                # just written for clarity of control flow,
                # do nothing special here
                pass
            else:  # current cluster label did not match previous label
                # check if the previous cluster label reappears
                # to the right of the current position
                # (i.e. the current non-matching item is sandwiched
                # within a reasonable tolerance)
                if (outlier_tolerance and
                        prev_label in km_labels[i + 1: i + 2 + outlier_tolerance]):
                    label = prev_label  # if so, overwrite current label
                else:
                    cluster += 1  # it's genuinely a new cluster
            result.append(cluster)
            prev_label = label
        return result

Note that I have only tested this with a tolerance of 1 outlier, and I can't guarantee it works as-is in every case. It should be enough to get you started, though.

Output:

    print(km.labels_)
    result = linear_order_clustering(km.labels_)
    print(result)

    [1 1 0 0 0 2 0 0 1 1]
    [0, 0, 1, 1, 1, 1, 1, 1, 2, 2]

(K-means label numbering is arbitrary and can change between runs, which is why the labels printed here differ from the ones in the comment above. The grouping is the same.)
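To check whether a label sequence is actually order-preserving (each cluster occupies one unbroken run), something like the following could be used. This is only a rough sketch: the helper names `contiguous_runs` and `is_order_preserving` are made up for illustration and are not part of the answer's code or of scikit-learn. The two label lists are the ones shown in the output above.

```python
from itertools import groupby


def contiguous_runs(labels):
    """Collapse a label sequence into (label, run_length) pairs.

    Hypothetical helper, not part of the original answer.
    """
    return [(label, len(list(group))) for label, group in groupby(labels)]


def is_order_preserving(labels):
    """Return True if every label appears in exactly one contiguous run.

    Hypothetical helper, not part of the original answer.
    """
    run_labels = [label for label, _ in contiguous_runs(labels)]
    # A label that starts more than one run shows up more than once here.
    return len(run_labels) == len(set(run_labels))


raw = [1, 1, 0, 0, 0, 2, 0, 0, 1, 1]    # raw K-means labels from the output above
fixed = [0, 0, 1, 1, 1, 1, 1, 1, 2, 2]  # post-processed labels from the output above

print(contiguous_runs(raw))        # [(1, 2), (0, 3), (2, 1), (0, 2), (1, 2)]
print(is_order_preserving(raw))    # False
print(is_order_preserving(fixed))  # True
```

The raw labels fail the check because labels 0 and 1 each start two separate runs. The post-processed labels pass. Running a check like this on a few more inputs is a cheap way to probe tolerances other than 1, which were not tested.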
