The quality of the data and features sets the upper bound of machine learning; models and algorithms can only approach that bound.
The four steps of feature engineering

Data cleaning
- Data sampling
- Outliers (and missing-value handling)
Data sampling
- Samples must be representative
- Keep class proportions balanced, and know how to handle imbalanced samples (see the sketch after this list)
- Consider working with the full dataset (using big-data tools such as Hadoop and Spark)
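A minimal pandas sketch of these sampling ideas, assuming the ./data/HR.csv file and its left label column that the HR example at the end of this post works with:

```python
import pandas as pd

df = pd.read_csv("./data/HR.csv")  # assumed path, same file as the HR example below

# Simple random sampling
sample = df.sample(frac=0.1, random_state=42)

# Stratified sampling: sample within each label group so proportions stay representative
stratified = df.groupby("left", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=42))

# Naive undersampling for an imbalanced label:
# shrink the majority class to the size of the minority class
minority = df[df["left"] == 1]
majority = df[df["left"] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority])
```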
Outliers and missing values
- Identify outliers and duplicates --- Pandas: isnull() / duplicated()
- Drop them directly (including duplicate rows) --- Pandas: drop() / dropna() / drop_duplicates()
- Treat "has an anomaly" as a new attribute, or replace the original value (e.g. with a central value or a boundary value) --- Pandas: fillna()
- Interpolate --- Pandas: interpolate() (on a Series)
Below we simulate some data to walk through common outlier and missing-value handling.
```python
import pandas as pd

df = pd.DataFrame({"A": ["a0", "a1", "a1", "a2", "a3", "a4"],
                   "B": ["b0", "b1", "b2", "b2", "b3", None],
                   "C": [1, 2, None, 3, 4, 5],
                   "D": [0.1, 10.2, 11.4, 8.9, 9.1, 12],
                   "E": [10, 19, 32, 25, 8, None],
                   "F": ["f0", "f1", "g2", "f3", "f4", "f5"]})
df
df.isnull()
df.dropna()
# Drop only the rows whose value in column "B" is null
df.dropna(subset=["B"])
df.duplicated(["A"])
# With two columns, a row counts as a duplicate only if both values repeat together
df.duplicated(["A", "B"])
df.drop_duplicates(["A"], keep="first")
# Fill every missing value with the mean of column "E" (a simple demonstration)
df.fillna(df["E"].mean())
df["E"].interpolate()
pd.Series([1, None, 4, 5, 20]).interpolate()
# Remove outliers in "D" using the interquartile range
upper_q = df["D"].quantile(0.75)
lower_q = df["D"].quantile(0.25)
q_int = upper_q - lower_q
k = 1.5
df[(df["D"] > lower_q - q_int * k) & (df["D"] < upper_q + q_int * k)]
# Keep only the rows whose "F" value starts with "f"
df[[item.startswith("f") for item in df["F"].values]]
```
Feature preprocessing
- Feature selection
- Feature transformation: exponential/logarithmic transforms, discretization, data smoothing, normalization (standardization), numerical encoding, regularization
- Feature dimensionality reduction
- Feature derivation
Feature selection

Common approaches to data reduction: filter methods (e.g. SelectKBest), wrapper methods (e.g. RFE), and embedded methods (e.g. SelectFromModel), all demonstrated below.
```python
import numpy as np
import pandas as pd
import scipy.stats as ss
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import SelectKBest, RFE, SelectFromModel

df = pd.DataFrame({"A": ss.norm.rvs(size=10),
                   "B": ss.norm.rvs(size=10),
                   "C": ss.norm.rvs(size=10),
                   "D": np.random.randint(low=0, high=2, size=10)})
df
X = df.loc[:, ["A", "B", "C"]]
y = df.loc[:, "D"]

# Filter: score each feature independently and keep the k best
skb = SelectKBest(k=2)
skb.fit(X, y)
skb.transform(X)

# Wrapper: recursively eliminate features using a linear SVR
rfe = RFE(estimator=SVR(kernel="linear"), n_features_to_select=2, step=1)
rfe.fit_transform(X, y)

# Embedded: keep features whose model importance exceeds the threshold
sfm = SelectFromModel(estimator=DecisionTreeRegressor(), threshold=0.1)
sfm.fit_transform(X, y)
```
Feature transformation

Exponential and logarithmic transforms

Exponentiation
Taking the exponent of the data amplifies the differences between values (exponentiate first, then normalize); for example, a raw gap of 0.1 can grow to about 0.14 after exponentiation.
Log transform

The log transform compresses large values into a range that is easy to compute with.
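A minimal NumPy sketch of both transforms (the sample values are made up for illustration):

```python
import numpy as np

x = np.array([0.3, 0.4, 0.5])
np.exp(x)                    # gaps widen from 0.1 to roughly 0.14-0.16
np.exp(x) / np.exp(x).sum()  # exponentiate, then normalize

big = np.array([10, 1000, 100000])
np.log10(big)                # compresses the range to [1, 5]
```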
Discretization
- Depth: the number of samples in each bin (equal-depth binning puts the same count in every bin)
- Width: the value range of each bin (equal-width binning uses intervals of equal size)

Binning
```python
lst = [6, 8, 10, 15, 16, 24, 25, 40, 67]
# Equal-depth (equal-frequency) binning
pd.qcut(lst, q=3)
pd.qcut(lst, q=3, labels=["low", "medium", "high"])
# Equal-width binning
pd.cut(lst, bins=3)
pd.cut(lst, bins=3, labels=["low", "medium", "high"])
```
Normalization and standardization
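For reference: min-max normalization maps each value to x' = (x - min) / (max - min), squeezing the column into [0, 1]; standardization (z-score) maps it to x' = (x - mean) / std, giving zero mean and unit variance.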
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-max normalization: scale into [0, 1]
MinMaxScaler().fit_transform(np.array([1, 4, 10, 20, 30]).reshape(-1, 1))
# Standardization: zero mean, unit variance
StandardScaler().fit_transform(np.array([1, 1, 1, 1, 0, 0, 0, 0]).reshape(-1, 1))
StandardScaler().fit_transform(np.array([1, 0, 0, 0, 0, 0, 0, 0]).reshape(-1, 1))
```
Numerical encoding

The four data types: nominal, ordinal, interval, and ratio.

Ordinal data is usually label-encoded; in general it is enough to preserve the relative ordering of the values.

Nominal data is handled with one-hot encoding, which preserves the distinctions between categories while keeping every pairwise distance equal (this approach also suits ordinal data whose numeric gaps have little effect on the label).
```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# "high" maps to 0 because LabelEncoder sorts the classes alphabetically
LabelEncoder().fit_transform(np.array(["low", "medium", "high", "low"]))
LabelEncoder().fit_transform(np.array(["up", "down", "down"]))

le = LabelEncoder()
lb_tran_f = le.fit_transform(np.array(["Red", "Yellow", "Blue", "Green"]))
oht_encoder = OneHotEncoder().fit(lb_tran_f.reshape(-1, 1))
oht_encoder.transform(
    LabelEncoder().fit_transform(np.array(["Red", "Yellow", "Blue", "Green"])).reshape(-1, 1)
).toarray()
```
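Note: recent scikit-learn versions let OneHotEncoder fit string categories directly, so the LabelEncoder round-trip above is no longer required; pandas' get_dummies, used in the HR example below, is another convenient option for DataFrames.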
Feature derivation

Feature derivation builds new, reasonable, and effective features on top of the existing ones.
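A tiny illustration (the orders table and its columns are hypothetical, not part of the HR data below):

```python
import pandas as pd

orders = pd.DataFrame({"revenue": [120.0, 80.0, 200.0],
                       "n_items": [4, 2, 5]})
# Derive a new feature from two existing ones: average revenue per item
orders["revenue_per_item"] = orders["revenue"] / orders["n_items"]
```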
Feature preprocessing for the HR table
```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.decomposition import PCA

d = dict([("low", 0), ("medium", 1), ("high", 2)])

def map_salary(s):
    return d.get(s, 0)

# sl: satisfaction_level --- False: MinMaxScaler; True: StandardScaler
# le: last_evaluation    --- False: MinMaxScaler; True: StandardScaler
# (the remaining flags follow the same pattern for their columns)
def hr_preprocessing(sl=False, le=False, npr=False, amh=False, tsc=False, wa=False,
                     pl5=False, dp=False, slr=False, lower_d=False, ld_n=1):
    df = pd.read_csv("./data/HR.csv")

    # Clean the data: drop missing values and known-bad rows
    df = df.dropna(subset=["satisfaction_level", "last_evaluation"])
    df = df[(df["satisfaction_level"] <= 1) & (df["salary"] != "nme")]

    # Get the label
    label = df["left"]
    df = df.drop("left", axis=1)

    # Feature selection

    # Feature processing: normalization, standardization, label and one-hot encoding
    scaler_list = [sl, le, npr, amh, tsc, wa, pl5]
    column_list = ["satisfaction_level", "last_evaluation", "number_project",
                   "average_monthly_hours", "time_spend_company", "Work_accident",
                   "promotion_last_5years"]
    for i in range(len(scaler_list)):
        if not scaler_list[i]:
            df[column_list[i]] = MinMaxScaler().fit_transform(
                df[column_list[i]].values.reshape(-1, 1)).reshape(1, -1)[0]
        else:
            df[column_list[i]] = StandardScaler().fit_transform(
                df[column_list[i]].values.reshape(-1, 1)).reshape(1, -1)[0]

    scaler_list = [slr, dp]
    column_list = ["salary", "department"]
    for i in range(len(scaler_list)):
        if not scaler_list[i]:
            if column_list[i] == "salary":
                df[column_list[i]] = [map_salary(s) for s in df["salary"].values]
            else:
                # LabelEncoder expects a 1-D array
                df[column_list[i]] = LabelEncoder().fit_transform(df[column_list[i]].values)
            df[column_list[i]] = MinMaxScaler().fit_transform(
                df[column_list[i]].values.reshape(-1, 1)).reshape(1, -1)[0]
        else:
            # sklearn's OneHotEncoder is clumsy to apply directly to a DataFrame,
            # so use pandas' get_dummies instead
            df = pd.get_dummies(df, columns=[column_list[i]])

    # Dimensionality reduction
    if lower_d:
        # Return the label as well, so both branches have the same shape
        return PCA(n_components=ld_n).fit_transform(df.values), label
    return df, label
```
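A quick usage sketch, assuming ./data/HR.csv is in place (the flag values are arbitrary, chosen only for illustration):

```python
# Standardize satisfaction_level, one-hot encode department, defaults elsewhere
features, label = hr_preprocessing(sl=True, dp=True)
features.head()
```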
Author: IceSource
Link: https://www.jianshu.com/p/0824786a1019