
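Select a subset of a DataFrame in which each variable has N years of data

I have a dataset of yearly growth indicators for more than 100 countries, covering 1970 to 2013. Not every country has data for every year; the country with the fewest years has 30 years of data. I want to level things out so that every country shows 30 years of data, dropping years from the countries that have more than 30. I give an example below.

I thought about using loops to drop rows from the dataframe until every country appears 30 times, and then building a brand-new dataframe, but I'd like to believe there is a better solution.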


import pandas as pd

data = {'Country':['Israel','Congo','Denmark',
                   'Israel','Denmark',
                   'Israel','Congo',
                   'Israel','Congo','Denmark'],
        'Year':[2000,2000,2000,
                2001,2001,
                2002,2002,
                2003,2003,2003],
        'Value':[2.5,1.2,3.1,2.8,1.1,2.9,3.1,1.9,3.0,3.1]}
df = pd.DataFrame(data=data)
df

   Country  Year  Value
0   Israel  2000    2.5
1    Congo  2000    1.2
2  Denmark  2000    3.1
3   Israel  2001    2.8
4  Denmark  2001    1.1
5   Israel  2002    2.9
6    Congo  2002    3.1
7   Israel  2003    1.9
8    Congo  2003    3.0
9  Denmark  2003    3.1

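The code above creates a dataframe from an example with only 3 countries and 4 years. From the dataframe you can see that Israel has 4 years of data, while Denmark and Congo have only 3. I want to drop one year from Israel so that all countries have 3 years. In the real dataframe I want to drop years from the countries that have more than 30, so that all countries have the same number of years, preferably dropping the years with the lowest values.

Here is my solution using for loops, which takes a lot of lines of code: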


gp = df.groupby('Country').groups  # Map each country name to its row labels.
d = {}  # Country name => list of row labels to keep.

# For each country, collect the labels of its first 3 rows. A country appears
# once per year in the dataset, so 3 rows means 3 years; if a country appears
# more than 3 times, only its first 3 occurrences are kept.
for i in gp:
    d[i] = []
    for j in gp[i]:
        if len(d[i]) < 3:
            d[i].append(j)

indeces = []  # Gather the row labels to keep in the dataframe.
for i in d:
    for j in d[i]:
        if len(d[i]) == 3:  # Make sure the list has exactly 3 items.
            indeces.append(j)
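
The loop keeps the first three rows of each country and skips any country with fewer than three. A minimal sketch of the same selection without explicit loops, assuming the example `df` from the question; `n_years`, `position`, `group_size`, and `first_n` are names made up for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Israel', 'Congo', 'Denmark', 'Israel', 'Denmark',
                'Israel', 'Congo', 'Israel', 'Congo', 'Denmark'],
    'Year': [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2003],
    'Value': [2.5, 1.2, 3.1, 2.8, 1.1, 2.9, 3.1, 1.9, 3.0, 3.1],
})

n_years = 3  # hypothetical name for the target number of rows per country

by_country = df.groupby('Country')
# Position of each row within its country (0, 1, 2, ...), in original row order.
position = by_country.cumcount()
# Number of rows of the row's country, broadcast back onto every row.
group_size = by_country['Year'].transform('size')

# Keep the first n_years rows of every country that has at least n_years rows.
first_n = df[(position < n_years) & (group_size >= n_years)]
```

Like the loop, this keeps rows by order of appearance rather than by `Value`, so it only drops the lowest values if the frame is sorted accordingly beforehand.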




素胚勾勒不出你
2 Answers

慕工程0101907

You can build a list of the most recent years from the unique values in the Year column, and use boolean indexing to select the rows of the dataframe whose year is in that list.

recent_years = df.Year.unique()[-3:]
df[df.Year.isin(recent_years)]

   Country  Year  Value
3   Israel  2001    2.8
4  Denmark  2001    1.1
5   Israel  2002    2.9
6    Congo  2002    3.1
7   Israel  2003    1.9
8    Congo  2003    3.0
9  Denmark  2003    3.1

If your Year values are not necessarily in order, use numpy's unique, which returns a sorted array, unlike pandas' unique():

import numpy as np

recent_years = np.unique(df.Year)[-3:]
df[df.Year.isin(recent_years)]

Here is another solution, which returns the 3 most recent years for each country. If the data is not sorted by year, it needs to be sorted first.

idx = df.groupby('Country').apply(lambda x: x['Year'].tail(3)).index
df.set_index(['Country', df.index]).reindex(idx).reset_index().drop(columns='level_1')

   Country  Year  Value
0    Congo  2000    1.2
1    Congo  2002    3.1
2    Congo  2003    3.0
3  Denmark  2000    3.1
4  Denmark  2001    1.1
5  Denmark  2003    3.1
6   Israel  2001    2.8
7   Israel  2002    2.9
8   Israel  2003    1.9

If the data is not sorted, sort it first with:

df = df.sort_values(by = 'Year')
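
For the per-country variant, `groupby(...).tail()` on a year-sorted frame may be a shorter route to the same rows. This is only a sketch, assuming the example `df` from the question and at most one row per country and year; `latest` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Israel', 'Congo', 'Denmark', 'Israel', 'Denmark',
                'Israel', 'Congo', 'Israel', 'Congo', 'Denmark'],
    'Year': [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2003],
    'Value': [2.5, 1.2, 3.1, 2.8, 1.1, 2.9, 3.1, 1.9, 3.0, 3.1],
})

latest = (
    df.sort_values('Year')              # make sure "last" means "most recent"
      .groupby('Country')
      .tail(3)                          # last 3 rows of each country
      .sort_values(['Country', 'Year'])
      .reset_index(drop=True)
)
print(latest)
```

Countries with fewer than 3 rows keep all of their rows here, so a separate size check is still needed if they should be excluded.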

慕村9548890

Here is my solution using Pandas. Even though it takes many lines of code, it does what it has to do. Thanks to @Vaishali for the help:

threshold = 3  # Anything that occurs fewer times than this will be removed;
               # if it occurs more often, the extra occurrences with the
               # lowest values will be removed.
newIndex = df.set_index('Country')  # Set a new index to make selection by index possible.
values = newIndex.index.value_counts()  # Count occurrences of index values.
to_keep = values[values>=threshold].index.values  # Keep index values that occur >= threshold times.
rank_df = newIndex.loc[to_keep,['Value','Year']]  # Select rows and columns to keep.
# Sort values in descending order before applying the threshold.
rank_df = rank_df.sort_values('Value',ascending=False)
# Group again; since values are sorted, head() returns the highest values.
rank_df = rank_df.groupby(rank_df.index).head(threshold)
rank_df = rank_df.groupby([rank_df.index,'Year']).mean() \
              .sort_values('Value',ascending=False)
# Finally, reset the index to turn the Year level into a column, and sort by year.
rank_df.reset_index(level=1).sort_values('Year')

Output:

         Year  Value
Country
Denmark  2000    3.1
Israel   2000    2.5
Congo    2000    1.2
Israel   2001    2.8
Denmark  2001    1.1
Congo    2002    3.1
Israel   2002    2.9
Denmark  2003    3.1
Congo    2003    3.0
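
The same selection can probably be written more compactly by filtering on group size, sorting on `Value`, and taking `head(threshold)` per country. A hedged sketch, assuming the example `df` from the question; `counts`, `top`, and the final sort order are choices made for this illustration, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Israel', 'Congo', 'Denmark', 'Israel', 'Denmark',
                'Israel', 'Congo', 'Israel', 'Congo', 'Denmark'],
    'Year': [2000, 2000, 2000, 2001, 2001, 2002, 2002, 2003, 2003, 2003],
    'Value': [2.5, 1.2, 3.1, 2.8, 1.1, 2.9, 3.1, 1.9, 3.0, 3.1],
})

threshold = 3

# Number of rows per country, aligned with the rows of df.
counts = df.groupby('Country')['Year'].transform('size')

top = (
    df[counts >= threshold]                  # drop countries with too few years
      .sort_values('Value', ascending=False)
      .groupby('Country')
      .head(threshold)                       # keep the highest values per country
      .sort_values(['Year', 'Country'])
)
print(top)
```

On the example data this drops only the Israel 2003 row, the lowest of Israel's four values.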