Priority-based categorization using pandas/python


I have invoice-related data in the DataFrame below, along with several lists of codes.

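import pandas as pd

# One row per invoice line item; the last row (invoice 7) matches the table below.
df = pd.DataFrame({
    'invoice': [1,1,2,2,2,3,3,3,4,4,4,5,5,6,6,6,7],
    'code': [101,104,105,101,106,106,104,101,104,105,111,109,111,110,101,114,104],
    'qty': [2,1,1,3,2,4,7,1,1,1,1,4,2,1,2,2,2]
})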

+---------+------+-----+
| invoice | code | qty |
+---------+------+-----+
|       1 |  101 |   2 |
|       1 |  104 |   1 |
|       2 |  105 |   1 |
|       2 |  101 |   3 |
|       2 |  106 |   2 |
|       3 |  106 |   4 |
|       3 |  104 |   7 |
|       3 |  101 |   1 |
|       4 |  104 |   1 |
|       4 |  105 |   1 |
|       4 |  111 |   1 |
|       5 |  109 |   4 |
|       5 |  111 |   2 |
|       6 |  110 |   1 |
|       6 |  101 |   2 |
|       6 |  114 |   2 |
|       7 |  104 |   2 |
+---------+------+-----+

The code lists are:

Soda = [101,102]
Hot = [103,109]
Juice = [104,105]
Milk = [106,107,108]
Dessert = [110,111]

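My task is to add a new column, category, based on the Order of Priority specified below.

Priority No. 1: If any invoice has a total qty of more than 10, it should be classified as Mega. E.g., the qty sum of invoice 3 is 12.
Priority No. 2: From the rest of the invoices, if any code in the invoice is in the Milk list, the category should be Healthy. E.g., invoice 2 has code 106, which is in Milk. Hence the full invoice is classified as Healthy, irrespective of the other items (code 101 & 105) present in the invoice, because the priority is applied to the full invoice.
Priority No. 3: From the rest of the invoices, if any code in the invoice is in the Juice list, there are 2 parts:

(3.1) If the sum of the qty of those juice items is equal to 1, the category should be OneJuice. E.g., invoice 1 has code 104 with qty 1. Invoice 1 gets OneJuice irrespective of the other items (code 101) present in the invoice, because the priority is applied to the full invoice.

(3.2) If the sum of the qty of those juice items is greater than 1, the category should be ManyJuice. E.g., invoice 4 has code 104 & 105 with qty 1 + 1 = 2.

Priority No. 4: From the rest of the invoices, if any code in the invoice is in the Hot list, it should be classified as HotLovers, irrespective of the other items present in the invoice.
Priority No. 5: From the rest of the invoices, if any code in the invoice is in the Dessert list, it should be classified as DessertLovers.
Finally, all remaining invoices should be classified as Others.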
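The rules above can be read as a first-match-wins cascade evaluated once per invoice. As a minimal sketch in plain Python (the function name `classify_invoice` and the `(code, qty)` tuple format are assumptions made for illustration, not part of the question), the cascade might be spelled out like this, using the rows from the table above:

```python
from collections import defaultdict

# Code lists from the question (Soda has no rule of its own, so it is omitted).
Hot = [103, 109]
Juice = [104, 105]
Milk = [106, 107, 108]
Dessert = [110, 111]


def classify_invoice(items):
    """Return the category of one invoice.

    `items` is a list of (code, qty) pairs (hypothetical format).
    Rules are checked in priority order; the first match wins.
    """
    codes = {code for code, _ in items}
    total_qty = sum(qty for _, qty in items)
    juice_qty = sum(qty for code, qty in items if code in Juice)

    if total_qty > 10:            # Priority 1: more than 10 items in total
        return 'Mega'
    if codes & set(Milk):         # Priority 2: any Milk code
        return 'Healthy'
    if juice_qty == 1:            # Priority 3.1: exactly one juice
        return 'OneJuice'
    if juice_qty > 1:             # Priority 3.2: more than one juice
        return 'ManyJuice'
    if codes & set(Hot):          # Priority 4: any Hot code
        return 'HotLovers'
    if codes & set(Dessert):      # Priority 5: any Dessert code
        return 'DessertLovers'
    return 'Others'               # Nothing matched


# Rows from the question's table: (invoice, code, qty)
rows = [
    (1, 101, 2), (1, 104, 1), (2, 105, 1), (2, 101, 3), (2, 106, 2),
    (3, 106, 4), (3, 104, 7), (3, 101, 1), (4, 104, 1), (4, 105, 1),
    (4, 111, 1), (5, 109, 4), (5, 111, 2), (6, 110, 1), (6, 101, 2),
    (6, 114, 2), (7, 104, 2),
]

# Group line items by invoice, then classify each invoice as a whole.
by_invoice = defaultdict(list)
for invoice, code, qty in rows:
    by_invoice[invoice].append((code, qty))

categories = {inv: classify_invoice(items) for inv, items in by_invoice.items()}
print(categories)
```

A per-invoice function like this is easy to check against the rules by hand, which makes it useful as a reference when validating a vectorized solution. It will likely be slower on large frames.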

翻翻过去那场雪


1 Answer

天涯尽头无女友

You can try np.select:

import numpy as np

# Conditions are listed in priority order; each one is broadcast to every row of its invoice.
df['category'] = np.select(
    [
        df.groupby('invoice')['qty'].transform('sum') > 10,
        df['code'].isin(Milk).groupby(df.invoice).transform('any'),
        (df['qty']*df['code'].isin(Juice)).groupby(df.invoice).transform('sum') == 1,
        (df['qty']*df['code'].isin(Juice)).groupby(df.invoice).transform('sum') > 1,
        df['code'].isin(Hot).groupby(df.invoice).transform('any'),
        df['code'].isin(Dessert).groupby(df.invoice).transform('any'),
    ],
    ['Mega', 'Healthy', 'OneJuice', 'ManyJuice', 'HotLovers', 'DessertLovers'],
    'Others')
print(df)

Output:

    invoice  code  qty       category
0         1   101    2       OneJuice
1         1   104    1       OneJuice
2         2   105    1        Healthy
3         2   101    3        Healthy
4         2   106    2        Healthy
5         3   106    4           Mega
6         3   104    7           Mega
7         3   101    1           Mega
8         4   104    1      ManyJuice
9         4   105    1      ManyJuice
10        4   111    1      ManyJuice
11        5   109    4      HotLovers
12        5   111    2      HotLovers
13        6   110    1  DessertLovers
14        6   101    2  DessertLovers
15        6   114    2  DessertLovers
16        7   104    2      ManyJuice

Micro-benchmark

pd.show_versions()

commit      : None
python      : 3.7.5.final.0
python-bits : 64
OS          : Linux
OS-release  : 4.4.0-18362-Microsoft
machine     : x86_64
processor   : x86_64
byteorder   : little
LC_ALL      : None
LANG        : C.UTF-8
LOCALE      : en_US.UTF-8
pandas      : 0.25.3
numpy       : 1.17.4

Data created with:

def make_data(n):
    return pd.DataFrame({
        'invoice': np.arange(n)//3,
        'code': np.random.choice(np.arange(101,112), n),
        'qty': np.random.choice(np.arange(1,8), n, p=[10/25,10/25,1/25,1/25,1/25,1/25,1/25])})

Results:

perfplot.show(
    setup=make_data,
    kernels=[get_category, get_with_np_select],
    n_range=[2**k for k in range(8, 20)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
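The np.select approach depends on the conditions being tried in order: for each row, the choice belonging to the first condition that is True is used, and the default applies only when none match. A minimal sketch of that behaviour follows. The arrays `qty_total` and `has_milk` are made-up illustrative values, not data from the question.

```python
import numpy as np

# Hypothetical per-row values: the invoice's total qty and whether it contains Milk.
qty_total = np.array([12, 6, 3])
has_milk = np.array([True, True, False])

labels = np.select(
    [qty_total > 10, has_milk],   # conditions, highest priority first
    ['Mega', 'Healthy'],          # one choice per condition
    default='Others',             # used where no condition is True
)
print(labels)
```

The first row satisfies both conditions but is labelled Mega because that condition comes first. This ordering is what lets the list of conditions encode the priority ranking directly.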
