The quality of the data and features sets the upper bound for machine learning; models and algorithms merely approximate that bound.
The four steps of feature engineering
Data cleaning
Data sampling
Outliers (null-value handling)
Data sample collection (sampling)
Samples must be representative of the population.
Class proportions should be balanced; when they are not, handle the imbalance explicitly (a resampling sketch follows this list).
When feasible, consider using the full dataset (with big-data tools such as Hadoop and Spark).
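When classes are imbalanced, one simple option is to resample with pandas before modeling. A minimal sketch of random over-sampling of the minority class; the toy DataFrame and its columns are made up for illustration:

import pandas as pd

data = pd.DataFrame({"x":range(10),"label":[0]*8+[1]*2})  # label 1 is the minority class
minority = data[data["label"]==1]
majority = data[data["label"]==0]
# Draw with replacement from the minority class until it matches the majority size
oversampled = minority.sample(n=len(majority),replace=True,random_state=0)
balanced = pd.concat([majority,oversampled])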
Outliers (null-value handling)
Identify outliers and duplicates --- Pandas: isnull() / duplicated()
Drop them directly (including duplicate rows) --- Pandas: drop() / dropna() / drop_duplicates()
Record whether a value is anomalous as a new attribute, or replace the original value (e.g., with a central value or a boundary value) --- Pandas: fillna()
Interpolate missing values --- Pandas: interpolate() (on a Series)
The following simulates some data to walk through common outlier-handling operations.
import pandas as pd
df = pd.DataFrame({"A":["a0","a1","a1","a2","a3","a4"],
                   "B":["b0","b1","b2","b2","b3",None],
                   "C":[1,2,None,3,4,5],
                   "D":[0.1,10.2,11.4,8.9,9.1,12],
                   "E":[10,19,32,25,8,None],
                   "F":["f0","f1","g2","f3","f4","f5"]})
df
df.isnull()
df.dropna()
# Drop only the rows whose value in column B is null
df.dropna(subset=["B"])
df.duplicated(["A"])
# With two columns given, a row is flagged only when the combination of both values repeats
df.duplicated(["A","B"])
# Keep the first occurrence of each value in A and drop later duplicates
df.drop_duplicates(["A"],keep="first")
# Fill every null value in the DataFrame with the mean of column E
df.fillna(df["E"].mean())
# Linearly interpolate the missing value in column E
df["E"].interpolate()
pd.Series([1,None,4,5,20]).interpolate()
# Remove outliers with the interquartile range: keep values inside [Q1 - k*IQR, Q3 + k*IQR]
upper_q = df["D"].quantile(0.75)
lower_q = df["D"].quantile(0.25)
q_int = upper_q - lower_q
k = 1.5
df[df["D"]>lower_q-q_int*k][df["D"]<upper_q+q_int*k]
# Keep only the rows whose F value starts with "f" (drops the anomalous "g2")
df[[True if item.startswith("f") else False for item in list(df["F"].values)]]

Feature preprocessing
Feature selection
Feature transformation
Exponentiation/logarithm, discretization, data smoothing, normalization (standardization), numericalization, regularization
Feature dimensionality reduction
Feature derivation
Feature selection
Common approaches (a form of data reduction): filter methods (e.g., SelectKBest), wrapper methods (e.g., RFE), and embedded methods (e.g., SelectFromModel).
import numpy as np
import pandas as pd
import scipy.stats as ss
df = pd.DataFrame({"A":ss.norm.rvs(size=10),
                   "B":ss.norm.rvs(size=10),
                   "C":ss.norm.rvs(size=10),
                   "D":np.random.randint(low=0,high=2,size=10)})
df
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
X = df.loc[:,["A","B","C"]]
y = df.loc[:,"D"]
from sklearn.feature_selection import SelectKBest,RFE,SelectFromModel
# Filter approach: score each feature against the label and keep the top k
skb = SelectKBest(k=2)
skb.fit(X,y)
skb.transform(X)
# Wrapper approach: recursively eliminate features with a linear SVR, one feature per step
rfe = RFE(estimator=SVR(kernel="linear"),n_features_to_select=2,step=1)
rfe.fit_transform(X,y)
# Embedded approach: keep features whose tree-based importance exceeds the threshold
sfm = SelectFromModel(estimator=DecisionTreeRegressor(),threshold=0.1)
sfm.fit_transform(X,y)

Feature transformation
Exponentiation and logarithm
Exponentiation
Applying an exponential amplifies the differences between values (the data is sometimes log-transformed and normalized first): a raw gap of 0.1 can grow to about 0.14 after exponentiation, as the sketch below shows.
Logarithm
Taking the logarithm shrinks large values into a range that is easier to compute with.
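A minimal NumPy sketch of both effects (the numbers are chosen only for illustration):

import numpy as np
# Exponentiation widens a gap of 0.1 to roughly 0.14
np.exp(0.4) - np.exp(0.3)   # ~0.142
# Taking the log pulls widely spread values into a small range
np.log10([10,1000,100000])  # [1., 3., 5.]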
Discretization
Equal depth: each bin holds the same number of values
Equal width: each bin spans the same value interval
Binning
lst = [6,8,10,15,16,24,25,40,67]
# Equal-depth (equal-frequency) binning
pd.qcut(lst,q=3)
pd.qcut(lst,q=3,labels=["low","medium","high"])
# Equal-width binning
pd.cut(lst,bins=3)
pd.cut(lst,bins=3,labels=["low","medium","high"])
Normalization and standardization: min-max scaling maps values into [0,1] via (x - min) / (max - min), while standardization maps them to zero mean and unit variance via (x - mean) / std.
from sklearn.preprocessing import MinMaxScaler,StandardScaler
MinMaxScaler().fit_transform(np.array([1,4,10,20,30]).reshape(-1,1))
StandardScaler().fit_transform(np.array([1,1,1,1,0,0,0,0]).reshape(-1,1))
StandardScaler().fit_transform(np.array([1,0,0,0,0,0,0,0]).reshape(-1,1))
Numericalization
The four data types: nominal, ordinal, interval, and ratio.
Ordinal data is usually label-encoded; in general it is enough to preserve the relative order of the values.
Nominal data is handled with one-hot encoding, which preserves the distinctions between values while keeping every pair equidistant (this also suits ordinal data whose numeric gaps barely affect the label).
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
# "high" maps to 0 because LabelEncoder sorts the labels alphabetically before assigning codes
LabelEncoder().fit_transform(np.array(["low","medium","high","low"]))
LabelEncoder().fit_transform(np.array(["up","down","down"]))
le = LabelEncoder()
lb_tran_f = le.fit_transform(np.array(["Red","Yellow","Blue","Green"]))
oht_encoder = OneHotEncoder().fit(lb_tran_f.reshape(-1,1))
oht_encoder.transform(LabelEncoder().fit_transform(np.array(["Red","Yellow","Blue","Green"])).reshape(-1,1)).toarray()
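For comparison, pandas can one-hot encode in a single call (the HR example below uses the same function on a DataFrame):

pd.get_dummies(pd.Series(["Red","Yellow","Blue","Green"]))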
Feature derivation
Feature derivation builds new, reasonable, and useful features on top of the existing ones.
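A minimal sketch of one common derivation, a ratio of two existing columns (the column names are hypothetical, loosely modeled on the HR table used below):

import pandas as pd

df = pd.DataFrame({"average_monthly_hours":[160,220,280],"number_project":[2,4,7]})
# Derived ratio feature: average hours spent per project
df["hours_per_project"] = df["average_monthly_hours"]/df["number_project"]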
Feature preprocessing on the HR table
import pandas as pd
from sklearn.preprocessing import MinMaxScaler,StandardScaler,OneHotEncoder,LabelEncoder
from sklearn.decomposition import PCA

d = dict([("low",0),("medium",1),("high",2)])
def map_salary(s):
    return d.get(s,0)

# sl: satisfaction_level --- False: MinMaxScaler; True: StandardScaler
# le: last_evaluation --- False: MinMaxScaler; True: StandardScaler
# (the remaining boolean flags choose the scaler for the other numeric columns in the same way)
def hr_preprocessing(sl=False,le=False,npr=False,amh=False,tsc=False,wa=False,pl5=False,dp=False,slr=False,lower_d=False,ld_n=1):
    # Clean the data: drop nulls, out-of-range satisfaction values, and the invalid salary value "nme"
    df = pd.read_csv("./data/HR.csv")
    df = df.dropna(subset=["satisfaction_level","last_evaluation"])
    df = df[df["satisfaction_level"]<=1][df["salary"]!="nme"]
    # Get the label values
    label = df["left"]
    df = df.drop("left",axis=1)
    # Feature selection
    # Feature processing: mainly normalization, standardization, one-hot and label encoding
    scaler_list = [sl,le,npr,amh,tsc,wa,pl5]
    column_list = ["satisfaction_level","last_evaluation","number_project",
                   "average_monthly_hours","time_spend_company","Work_accident",
                   "promotion_last_5years"]
    for i in range(len(scaler_list)):
        if not scaler_list[i]:
            df[column_list[i]] = MinMaxScaler().fit_transform(df[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
        else:
            df[column_list[i]] = StandardScaler().fit_transform(df[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
    scaler_list = [slr,dp]
    column_list = ["salary","department"]
    for i in range(len(scaler_list)):
        if not scaler_list[i]:
            if column_list[i] == "salary":
                df[column_list[i]] = [map_salary(s) for s in df["salary"].values]
            else:
                df[column_list[i]] = LabelEncoder().fit_transform(df[column_list[i]].values)
            df[column_list[i]] = MinMaxScaler().fit_transform(df[column_list[i]].values.reshape(-1,1)).reshape(1,-1)[0]
        else:
            # Applying sklearn's OneHotEncoder to a DataFrame directly is cumbersome,
            # so use pandas get_dummies instead
            df = pd.get_dummies(df,columns=[column_list[i]])
    # Dimensionality reduction: return the reduced features together with the label
    if lower_d:
        return PCA(n_components=ld_n).fit_transform(df.values),label
    return df,label
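A usage sketch, assuming ./data/HR.csv exists and contains the columns referenced above:

features,label = hr_preprocessing(sl=True,dp=True,lower_d=True,ld_n=3)
print(features.shape,label.shape)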
Author: IceSource
Link: https://www.jianshu.com/p/0824786a1019