
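Compare DataFrames / CSVs and return only the columns that differ, including the key value

I have two CSV files that I am comparing, and I want to return, side by side, only the columns whose values differ.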


df1

Country    1980     1981     1982     1983     1984
Bermuda    0.00793  0.00687  0.00727  0.00971  0.00752
Canada     9.6947   9.58952  9.20637  9.18989  9.78546
Greenland  0.00791  0.00746  0.00722  0.00505  0.00799
Mexico     3.72819  4.11969  4.33477  4.06414  4.18464


df2

Country    1980     1981     1982     1983     1984
Bermuda    0.77777  0.00687  0.00727  0.00971  0.00752
Canada     9.6947   9.58952  9.20637  9.18989  9.78546
Greenland  0.00791  0.00746  0.00722  0.00505  0.00799
Mexico     3.72819  4.11969  4.33477  4.06414  4.18464


import pandas as pd
import numpy as np

df1 = pd.read_csv('csv1.csv')
df2 = pd.read_csv('csv2.csv')


def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames"""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        print("Data Types are different, trying to convert")
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        print("Dataframes are the same")
        return None
    else:
        # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        # this only renames the index levels; level 0 is still the row number
        changed.index.names = ['Country', 'Column']
        difference_locations = np.where(diff_mask)
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        y = pd.DataFrame({'From': changed_from, 'To': changed_to},
                         index=changed.index)
        print(y)
        return y


diff_pd(df1, df2)

My current output is:


                   From       To
Country Column                  
0       1980    0.00793  0.77777

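So, for each row with a mismatch, I would like the index to show the country name instead of the row number 0. An example is below.


I would like my output to be: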


                   From       To
Country Column                  
Bermuda 1980    0.00793  0.77777

Thanks to anyone who can offer a solution.


千万里不及你
Views: 154 | Answers: 1
1 Answer

函数式编程

A shorter approach, which also does the renaming along the way:

def process_df(df):
    # use Country as the index, then stack the year columns into a second level
    res = df.set_index('Country').stack()
    res.index.rename('Column', level=1, inplace=True)
    return res

df1 = process_df(df1)
df2 = process_df(df2)
mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
df3 = pd.concat([df1[mask], df2[mask]], axis=1).rename({0:'From', 1:'To'}, axis=1)
df3

                   From       To
Country Column                  
Bermuda 1980    0.00793  0.77777
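For anyone who wants to try this without the CSV files, here is a minimal self-contained sketch of the same idea. The helper name `diff_by_country`, the `key` parameter, and the inline whitespace-separated sample data are my own assumptions for illustration, not part of the original post. It also assumes both tables have the same countries and columns and no missing cells, since `stack()` may drop NaN values and the comparison needs identically labelled Series.

```python
import io

import pandas as pd

# Inline copies of the two sample tables (whitespace-separated for brevity;
# the original post reads real CSV files with pd.read_csv).
CSV1 = """Country 1980 1981 1982 1983 1984
Bermuda 0.00793 0.00687 0.00727 0.00971 0.00752
Canada 9.6947 9.58952 9.20637 9.18989 9.78546
Greenland 0.00791 0.00746 0.00722 0.00505 0.00799
Mexico 3.72819 4.11969 4.33477 4.06414 4.18464
"""
# Second table: identical except for Bermuda's 1980 value.
CSV2 = CSV1.replace("Bermuda 0.00793", "Bermuda 0.77777")


def diff_by_country(left, right, key="Country"):
    """Return a From/To table of differing cells, indexed by (key, Column).

    Hypothetical helper: assumes both frames share the same keys and columns.
    """
    # Move the key into the index, then stack so every cell becomes one row
    # labelled by (key value, column name).
    s1 = left.set_index(key).stack().rename_axis([key, "Column"])
    s2 = right.set_index(key).stack().rename_axis([key, "Column"])
    # Keep cells that differ, ignoring positions where both sides are NaN.
    mask = (s1 != s2) & ~(s1.isnull() & s2.isnull())
    # keys= names the two resulting columns directly.
    return pd.concat([s1[mask], s2[mask]], axis=1, keys=["From", "To"])


df1 = pd.read_csv(io.StringIO(CSV1), sep=r"\s+")
df2 = pd.read_csv(io.StringIO(CSV2), sep=r"\s+")
result = diff_by_country(df1, df2)
print(result)
```

Passing `keys=["From", "To"]` to `pd.concat` is just a small variation on the `rename({0: 'From', 1: 'To'}, axis=1)` step above. Note that the column labels come from the CSV header, so the second index level holds the string '1980' rather than an integer.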