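I am trying to train a model that predicts departure delay from the airline, the day of the month, the destination, and the origin. I have tried several approaches, but the accuracy is very low.
First I used the raw delay as the label, which ranges from -20 to +20 minutes. I then tried to make the problem easier by grouping the delays into intervals: [0, 5) => 0, [5, 10) => 1, and so on.
The accuracy is still poor. Things I have tried so far:
changing the layers
not normalizing the features
removing features and adding new ones
but I still can't find anything that works.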
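################### Load the dataset
import pandas as pd
# .copy() avoids SettingWithCopyWarning when the columns are modified later
df = dataset[['UniqueCarrier', 'DayofMonth', 'DepDelay', 'Dest', 'Origin']].copy()
df.tail()
df = df.dropna()
# Keep only delays between -20 and +20 minutes
df = df[(df['DepDelay'] >= -20) & (df['DepDelay'] <= 20)]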
############### Mask the delay values
mask = (df.DepDelay >= 0) & (df.DepDelay < 5)
column_name = 'DepDelay'
df.loc[mask, column_name] = 0
mask = (df.DepDelay >= 5) & (df.DepDelay < 10)
column_name = 'DepDelay'
df.loc[mask, column_name] = 1
mask = (df.DepDelay >= 10) & (df.DepDelay < 15)
column_name = 'DepDelay'
df.loc[mask, column_name] = 2
mask = (df.DepDelay >= 15) & (df.DepDelay <= 20)
column_name = 'DepDelay'
df.loc[mask, column_name] = 3
mask = (df.DepDelay >= -5) & (df.DepDelay < 0)
column_name = 'DepDelay'
df.loc[mask, column_name] = -1
mask = (df.DepDelay >= -10) & (df.DepDelay < -5)
column_name = 'DepDelay'
df.loc[mask, column_name] = -2
mask = (df.DepDelay >= -15) & (df.DepDelay < -10)
column_name = 'DepDelay'
df.loc[mask, column_name] = -3
mask = (df.DepDelay >= -20) & (df.DepDelay < -15)
column_name = 'DepDelay'
df.loc[mask, column_name] = -4
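
The repeated mask-and-assign steps above could also be written as a single binning call. The following is only a sketch, assuming the same 5-minute intervals and the same labels (-4 to 3); the sample `delays` values are hypothetical and not taken from my dataset.

```python
import pandas as pd

# Hypothetical sample delays in minutes (not from the real dataset)
delays = pd.Series([-20, -16, -15, -11, -6, -5, -1, 0, 4, 5, 9, 10, 14, 15, 20])

# 5-minute bins; right=False makes each interval closed on the left, e.g. [0, 5).
# The last edge is 21 so that a delay of exactly 20 falls into the final bin.
edges = [-20, -15, -10, -5, 0, 5, 10, 15, 21]
labels = [-4, -3, -2, -1, 0, 1, 2, 3]

# pd.cut returns a categorical; cast to int to get plain integer labels
binned = pd.cut(delays, bins=edges, labels=labels, right=False).astype(int)
print(binned.tolist())
```

Because every value is assigned from its original delay in one pass, there is no risk of a later mask matching a value that an earlier step already relabelled.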
############### Split labels and features
y = df['DepDelay']
df.drop(columns=['DepDelay'], inplace=True)
################ Replace string values with integer codes
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['Dest'] = le.fit_transform(df.Dest.values)
df['Origin'] = le.fit_transform(df.Origin.values)
df['UniqueCarrier'] = le.fit_transform(df.UniqueCarrier.values)
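
Refitting one `LabelEncoder` on each column overwrites the previous mapping, so only the last column can be decoded afterwards. A possible variant, shown here as a sketch on hypothetical rows, keeps one encoder per column in a dictionary (the name `encoders` is my own choice for illustration).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows standing in for the real flight data
df = pd.DataFrame({
    "UniqueCarrier": ["WN", "AA", "WN"],
    "Dest": ["LAX", "JFK", "LAX"],
    "Origin": ["SFO", "SFO", "ORD"],
})

# One encoder per column, so each mapping can be inverted later
encoders = {}
for col in ["UniqueCarrier", "Dest", "Origin"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Recover the original airport codes from the integer codes
original_dest = encoders["Dest"].inverse_transform(df["Dest"])
print(df)
print(list(original_dest))
```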
######################### Standardization
from sklearn.preprocessing import StandardScaler
# Normalize training data
std_scale = StandardScaler().fit(df)
df_norm = std_scale.transform(df)
training_norm_col1 = pd.DataFrame(df_norm, index=df.index, columns=df.columns)
# Reassign instead of df.update(), which would write floats into integer columns
df = training_norm_col1
print(df.head())
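
As a sanity check on the standardization step, a compact version can be run on a small hypothetical frame to confirm that each column ends up with mean 0 and unit variance. The `features` values below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical, already label-encoded features
features = pd.DataFrame({"DayofMonth": [1, 15, 30, 8], "Dest": [0, 1, 2, 1]})

# fit_transform returns a NumPy array; wrap it to keep the index and column names
scaled = pd.DataFrame(
    StandardScaler().fit_transform(features),
    index=features.index,
    columns=features.columns,
)

# StandardScaler uses the population standard deviation (ddof=0)
print(scaled.mean().round(6).tolist())
print(scaled.std(ddof=0).round(6).tolist())
```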