Finding K-means distance

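I have a dataset with 13 features and 10 million rows. I want to apply k-means to remove any anomalies. My approach is to run k-means, create one new column holding the distance between each data point and its cluster centroid and another holding the mean distance, and then drop every row whose distance is greater than the mean. But the code I wrote does not seem to work.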

Dataset sample: https://drive.google.com/open?id=1iB1qjnWQyvoKuN_Pa8Xk4BySzXVTwtUk

df = pd.read_csv('Final After Simple Filtering.csv',index_col=None,low_memory=True)

# Dropping columns with low feature importance

del df['AmbTemp_DegC']

del df['NacelleOrientation_Deg']

del df['MeasuredYawError']

#applying kmeans

kmeans = KMeans( n_clusters=8)

clusters= kmeans.fit_predict(df)

centroids = kmeans.cluster_centers_

distance1 = kmeans.fit_transform(df)

distance2 = distance1.mean()

df['distances']=distance1-distance2

df = df[df['distances'] >=0]

del df['distances']

df.to_csv('/content//drive/My Drive/K TEST.csv', index=False)

Error:

KeyError Traceback (most recent call last)

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)

2896 try:

-> 2897 return self._engine.get_loc(key)

2898 except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'distances'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)

9 frames

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'distances'

During handling of the above exception, another exception occurred:

青春有我

2 answers

HUH函数

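Here is a follow-up answer to your last question.

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import zscore
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import DBSCAN

# Load the example data and drop rows with missing values
titanic = sns.load_dataset('titanic')
titanic = titanic.copy()
titanic = titanic.dropna()
titanic['age'].plot.hist(bins = 50, title = "Histogram of the age variable")

# Flag ages whose z-score is at least 2.5 in absolute value
titanic["age_zscore"] = zscore(titanic["age"])
titanic["is_outlier"] = titanic["age_zscore"].apply(lambda x: x <= -2.5 or x >= 2.5)
titanic[titanic["is_outlier"]]

# Scale age and fare to [0, 1] before clustering
ageAndFare = titanic[["age", "fare"]]
ageAndFare.plot.scatter(x = "age", y = "fare")
scaler = MinMaxScaler()
ageAndFare = scaler.fit_transform(ageAndFare)
ageAndFare = pd.DataFrame(ageAndFare, columns = ["age", "fare"])
ageAndFare.plot.scatter(x = "age", y = "fare")

# Cluster with DBSCAN and color the scatter plot by cluster label
outlier_detection = DBSCAN(eps = 0.5, metric = "euclidean", min_samples = 3, n_jobs = -1)
clusters = outlier_detection.fit_predict(ageAndFare)
clusters
# plt.get_cmap replaces cm.get_cmap, which recent Matplotlib releases no longer provide
cmap = plt.get_cmap('Accent')
ageAndFare.plot.scatter(x = "age", y = "fare", c = clusters, cmap = cmap, colorbar = False)

See this link for all the details.

https://www.mikulskibartosz.name/outlier-detection-with-scikit-learn/

Before today I had never heard of "Local Outlier Factor". When I googled it, I found some information that seems to suggest it is a derivative of DBSCAN. In the end, I think my first answer is actually the best way to detect outliers.

DBSCAN is a clustering algorithm that happens to find outliers, which it treats as "noise". I think its main purpose is clustering, not anomaly detection. In short, choosing the hyperparameters well takes some skill. DBSCAN can also be slow on very large datasets, because it implicitly needs to compute the empirical density around every sample point, which gives it quadratic worst-case time complexity.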
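The snippet above plots the DBSCAN labels but never removes anything. As a rough sketch of that last step: scikit-learn's DBSCAN gives the label -1 to points it considers noise, so those rows can be filtered out with a boolean mask. The synthetic data, the column names `x` and `y`, and the `eps` and `min_samples` values below are all assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption, not the Titanic columns):
# one dense blob of 200 points plus three isolated far-away points.
rng = np.random.default_rng(0)
blob = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_points = np.array([[30.0, 30.0], [-30.0, 30.0], [30.0, -30.0]])
data = pd.DataFrame(np.vstack([blob, far_points]), columns=["x", "y"])

# Scale every feature to [0, 1] so that eps means the same thing on each axis.
scaled = MinMaxScaler().fit_transform(data)

# eps and min_samples are illustrative values only.
labels = DBSCAN(eps=0.05, min_samples=3).fit_predict(scaled)

# DBSCAN marks noise points with the label -1; keep everything else.
is_noise = labels == -1
cleaned = data[~is_noise]
print(f"dropped {is_noise.sum()} of {len(data)} rows as noise")
```

On 10 million rows this may be too slow or memory hungry, for the reason given above, so it is probably worth trying on a sample first.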

慕虎7371278

You: I want to apply k-means to remove any anomalies.

In practice, KMeans does not drop anomalies: every point, anomalous or not, is assigned to its nearest cluster. The loss function is the sum of squared distances from each point to its assigned cluster centroid, and the algorithm minimizes it. If you want to remove outliers, consider using the z-score method.

import numpy as np
import pandas as pd
from scipy import stats

# import your data
df = pd.read_csv('C:\\Users\\your_file.csv')

# keep only the numeric columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
df = newdf

# count rows in DF before kicking out records with z-score over 3
df.shape

# handle NaNs
df = df.fillna(0)

# keep rows where every column has |z-score| < 3
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

# count rows in DF after kicking out records with z-score over 3
df.shape

# the same filter on random data
df = pd.DataFrame(np.random.randn(100, 3))
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

Also, when you have some free time, take a look at these links.

https://medium.com/analytics-vidhya/effect-of-outliers-on-k-means-algorithm-using-python-7ba85821ea23

https://statisticsbyjim.com/basics/outliers/
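If you still want the distance-based filter from the question, here is a minimal sketch of how it might look. In scikit-learn, `KMeans.fit_transform` returns an array of shape (n_samples, n_clusters) holding the distance from each row to every centroid, so it cannot be stored in a single `distances` column as is, which is probably what triggers the `KeyError`. Taking the minimum over the cluster axis gives one distance per row. Note also that the question keeps rows with `distances >= 0`, which is the opposite of the stated goal. The synthetic data, the column names, the scaling step, and the mean cutoff below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real table (an assumption): 1000 rows, 4 features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))

# Standardize so that no single feature dominates the Euclidean distance.
X = StandardScaler().fit_transform(df)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)

# Shape (n_samples, n_clusters): distance from each row to every centroid.
all_distances = kmeans.fit_transform(X)

# Distance to the nearest centroid: exactly one value per row.
nearest = all_distances.min(axis=1)

# Cutoff taken from the question: drop rows farther away than the mean distance.
# This removes a large share of the rows; a looser cutoff may be more sensible.
threshold = nearest.mean()
cleaned = df[nearest <= threshold]
print(f"kept {len(cleaned)} of {len(df)} rows")
```

The filter uses a separate array as the mask, so no helper column has to be added to `df` and deleted again afterwards.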
