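I tried to balance my data with the pandas code below, but I have not figured out how to do the same thing in Spark with groupBy and a UDF, and I also found that a UDF cannot return a DataFrame.
Is there a way to do this in Spark, or some other approach that can handle imbalanced data?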
import pandas as pd

ratio = 3  # keep up to `ratio` non-picked comments per editor's pick

def balance_classes(grp):
    picked = grp.loc[grp.editorsSelection == True]
    n = round(picked.shape[0] * ratio)
    if n:
        try:
            not_picked = grp.loc[grp.editorsSelection == False].sample(n)
        except ValueError:  # fewer than n comments with `editorsSelection == False`
            not_picked = grp.loc[grp.editorsSelection == False]
        balanced_grp = pd.concat([picked, not_picked])
        return balanced_grp
    else:  # If there is no editor's pick for an article, discard all comments from that article
        return None

comments = comments.groupby('articleID').apply(balance_classes).reset_index(drop=True)
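One possible direction, sketched under assumptions rather than offered as a definitive answer: rewrite the per-article function so that it always returns a DataFrame (an empty one instead of `None`) and never relies on an exception to handle small groups. The toy `comments` frame, the `RATIO` constant name, and the fixed `random_state` are made up for illustration; only the `articleID` and `editorsSelection` columns come from the question, and `editorsSelection` is assumed to be a boolean column.

```python
import pandas as pd

RATIO = 3  # non-picked comments kept per editor's pick, as in the question


def balance_classes(grp: pd.DataFrame) -> pd.DataFrame:
    """Downsample the non-picked comments of one article to RATIO x the picked ones."""
    picked = grp[grp["editorsSelection"]]
    not_picked = grp[~grp["editorsSelection"]]
    n = round(len(picked) * RATIO)
    if n == 0:
        # No editor's pick: return an empty frame with the same columns, not None.
        return grp.iloc[0:0]
    # Cap n so sample() never asks for more rows than the group has.
    not_picked = not_picked.sample(n=min(n, len(not_picked)), random_state=0)
    return pd.concat([picked, not_picked])


# Hypothetical toy data: article "a" has 1 pick and 5 others,
# "b" has 1 pick and 2 others, "c" has no pick at all.
comments = pd.DataFrame({
    "articleID": ["a"] * 6 + ["b"] * 3 + ["c"] * 2,
    "editorsSelection": [True] + [False] * 5 + [True] + [False] * 2 + [False] * 2,
})

balanced = pd.concat(
    [balance_classes(grp) for _, grp in comments.groupby("articleID")],
    ignore_index=True,
)
print(balanced.groupby("articleID")["editorsSelection"].value_counts())
```

If the data lives in a Spark DataFrame (called `sdf` here, a hypothetical name), PySpark 3.0 and later provide `sdf.groupBy("articleID").applyInPandas(balance_classes, schema=sdf.schema)`, which runs a pandas function like this one once per group. It expects a DataFrame matching the schema back for every group, which is why the sketch returns an empty frame rather than `None`. That Spark call is not exercised in the block above and requires PyArrow to be installed.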