I'm new to pandas. I have an initial dataframe with some data: for example, a table of N rows and M columns filled with numbers from 0 to 999.
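import numpy as np
import pandas as pd

# initial dataframe with random numbers
np.random.seed(123)
M = 100   # number of value columns
N = 1000  # number of rows
# np.array coerces the mixed (label, numbers...) tuples to strings,
# so every cell of raw_df is stored as a str
raw_df = pd.DataFrame(
    np.array([(np.random.choice([f'index_{i}' for i in range(1, 5)]),
               *[np.random.randint(1000) for i in range(M)])
              for n in range(N)]),
    columns=['index', *range(M)])
raw_df.set_index('index', inplace=True)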
It looks like this:
index 0 1 2 3 4 ... 95 96 97 98 99
index_3 365 382 322 988 98 ... 980 824 305 780 530
index_2 513 51 940 885 745 ... 493 77 8 206 390
index_2 222 198 552 887 970 ... 791 731 695 290 293
index_2 855 853 665 401 186 ... 803 881 83 350 583
index_4 855 501 851 886 334 ... 771 735 233 219 247
For each index label, I want to count how many times each value occurs, like this:
index 0 1 2 3 4 ... 995 996 997 998 999
index_1 19 19 29 30 19 ... 21 16 19 24 31
index_2 26 29 32 18 18 ... 22 26 38 38 19
index_3 24 23 32 36 22 ... 23 17 23 24 22
index_4 41 21 24 28 26 ... 26 30 33 33 37
My code below takes about 12 seconds to run. Is there a way to make it faster, for example twice as fast?
# create new df with one row per unique index label
df = pd.DataFrame(raw_df.index.unique(), columns=['index']).set_index('index')
df.sort_index(inplace=True)

# create new columns: collect every distinct value found in raw_df
unique_values = set()
for column in raw_df.columns:
    unique_values.update(raw_df[column].unique())
df_rows = sorted(unique_values, key=lambda x: int(x))

# fill all dataframe by zeros
for row in df_rows:
    df.loc[:, str(row)] = 0

# fill new dataframe
for column in raw_df.columns:
    small_df = raw_df.groupby(by=['index', column])[column].count().to_frame(name='count').reset_index()
    for index in small_df.index:
        name = small_df.at[index, 'index']       # e.g. index_1
        raw_column = small_df.at[index, column]  # a value from raw_df (stored as a string)
        count = small_df.at[index, 'count']      # e.g. 1
        # single .loc call instead of chained indexing df[raw_column][name]
        df.loc[name, raw_column] += count
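One possible direction, offered as a sketch rather than a measured answer: flatten the whole table into two aligned arrays (row label and cell value) and let pd.crosstab do the counting in one pass, instead of looping over columns and rows in Python. The helper name count_values_per_label and the tiny demo frame are made up for illustration, and the int cast assumes every cell holds an integer (stored as a string, as in raw_df above). No timing is claimed here; it would need to be checked against the real data.

```python
import numpy as np
import pandas as pd


def count_values_per_label(raw_df):
    """Count how often each value occurs under each index label."""
    n_cols = raw_df.shape[1]
    # Repeat every row label once per column so the labels line up with
    # the row-major flattened values below.
    labels = np.repeat(raw_df.index.to_numpy(), n_cols)
    # Cells may be stored as strings; cast to int so the resulting
    # columns sort numerically (assumes every cell is an integer).
    values = raw_df.to_numpy().ravel().astype(int)
    return pd.crosstab(labels, values, rownames=['index'], colnames=['value'])


# Tiny hand-made frame (hypothetical data) to show the shape of the result.
demo = pd.DataFrame(
    [['1', '2', '2'], ['2', '2', '3'], ['1', '1', '1']],
    index=pd.Index(['index_1', 'index_2', 'index_1'], name='index'),
)
result = count_values_per_label(demo)
print(result)
```

As in the loop version, only values that actually occur somewhere in the table get a column. Calling count_values_per_label(raw_df) on the frame built above should give one row per index label and one integer-labelled column per distinct value.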