Infinite while loop in Python when computing the standard deviation with pandas

We are trying to remove outliers, but we end up in an infinite loop

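For a school project, a friend and I thought it would be a good idea to build a data-science-based tool. To do that, we started cleaning our database (I won't include it here because it is too large; it consists of xlsx and csv files). We are now trying to remove outliers from the "duration_minutes" column using the "standard deviation * 3 + mean" rule.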

This is the code we use to calculate the standard deviation and the mean:

def calculateSD(database, column):
    column = database[[column]]
    SD = column.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None)
    return SD

def calculateMean(database, column):
    column = database[[column]]
    mean = column.mean()
    return mean
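One detail worth noting about these helpers: `database[[column]]` with double brackets selects a one-column DataFrame, so `.std()` and `.mean()` return a one-element Series rather than a plain number. A minimal sketch of scalar-returning variants, assuming the column is numeric (the names `column_std` and `column_mean` and the sample data are hypothetical, not part of the original code):

```python
import pandas as pd

def column_std(database, column):
    # Single brackets select a Series, so .std() returns a single number
    return database[column].std(ddof=1)

def column_mean(database, column):
    # Same idea: Series.mean() returns a single number
    return database[column].mean()

# Made-up sample data, for illustration only
sample = pd.DataFrame({"duration_minutes": [10, 20, 30, 40]})
sd = column_std(sample, "duration_minutes")
mean = column_mean(sample, "duration_minutes")
```

With scalar results, the later `int(...)` conversions no longer rely on converting a one-element Series.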

This is what we came up with:

#Now we have to remove the outliers using the code from the SD.py and SDfunction.py files
minutes = trainsData['duration_minutes'].tolist() #takes the column duration_minutes and puts it in a list
SD = int(calculateSD(trainsData, 'duration_minutes')) #calculates the SD of the column
mean = int(calculateMean(trainsData, 'duration_minutes'))
SDhigh = mean+3*SD

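The code above computes the starting values. We then start a while loop to remove the outliers. After each removal we recalculate the standard deviation, the mean, and SDhigh. This is the while loop: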

while np.any(i >= SDhigh for i in minutes): #used to be >=, it doesnt matter for the outcome
    trainsData = trainsData[trainsData['duration_minutes'] < SDhigh] #used to be >=, this caused an infinite loop so I changed it to <=. Then to <
    minutes = trainsData['duration_minutes'].tolist()
    SD = int(calculateSD(trainsData, 'duration_minutes')) #calculates the SD of the column
    mean = int(calculateMean(trainsData, 'duration_minutes'))
    SDhigh = mean+3*SD
    print(SDhigh) #to see how the values changed and to confirm it is an infinite loop

The output is:

611
652
428
354
322
308
300
296

It keeps printing 296. After several hours of trying to solve this, we have concluded that we are not as smart as we had hoped.

繁华开满天机

1 Answer

呼啦一阵风

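You are making this harder than it needs to be. Computing the standard deviation to remove outliers, then recomputing it and repeating, is overly complicated (and statistically unsound). You would be better off using percentiles instead of the standard deviation:

import numpy as np
import pandas as pd

# create data
nums = np.random.normal(50, 8, 200)
df = pd.DataFrame(nums, columns=['duration'])

# set threshold based on percentiles
threshold = df['duration'].quantile(.95) * 2

# now only keep rows that are below the threshold
df = df[df['duration'] < threshold]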
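As for why the original loop never stops, a likely cause is the condition `np.any(i >= SDhigh for i in minutes)`. The argument is a generator expression, and as far as I know `np.any` does not iterate over it the way the built-in `any` does, so the condition stays truthy even when no value is at or above the limit. That would match the output settling at 296: nothing more is removed, yet the loop keeps running. A minimal sketch of a terminating version, assuming the same mean + 3 * SD rule and using made-up data (the function name `trim_outliers` is hypothetical):

```python
import numpy as np
import pandas as pd

def trim_outliers(df, column, n_sd=3):
    """Repeatedly drop rows at or above mean + n_sd * SD until none remain."""
    while True:
        limit = df[column].mean() + n_sd * df[column].std(ddof=1)
        too_high = df[column] >= limit   # boolean Series, one entry per row
        if not too_high.any():           # Series.any() inspects the actual values
            return df, limit
        df = df[~too_high]               # each pass removes at least one row

# Made-up data: 200 ordinary durations plus a few extreme ones
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 8, 200), [400, 500, 600]])
trains = pd.DataFrame({"duration_minutes": values})

cleaned, limit = trim_outliers(trains, "duration_minutes")
```

Keep in mind that trimming repeatedly shrinks the standard deviation on every pass, which is the statistical concern raised above; a single pass or a percentile threshold avoids that.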
