How can I concatenate multiple csv files into an xarray and define the coordinates?

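I have multiple csv files with the same rows and columns; the data they contain differs by date. Each csv file belongs to a different date, which is given in its name, e.g. data.2018-06-01.csv. A minimal example of my data looks like this: I have 2 files, data.2018-06-01.csv and data.2019-06-01.csv, which contain, respectively: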


user_id, weight, status
001, 70, healthy
002, 90, healthy

user_id, weight, status
001, 72, healthy
002, 103, obese

My question: how can I concatenate the csv files into an xarray and define user_id and date as the coordinates of that xarray?


I tried the following code:


import pandas as pd
import xarray as xr

df_all = []
date_arr = []

for f in ['data.2018-06-01.csv', 'data.2019-06-01.csv']:
  date = f.split('.')[1]
  df = pd.read_csv(f)
  df_all.append(df)
  date_arr.append(date)

x_arr = xr.concat([df.to_xarray() for df in df_all], coords=[date_arr, 'user_id'])

But coords=[...] causes an error. What can I do? Thanks.


慕桂英546537
2 Answers

慕的地8271018

Recall that although xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, it is inspired by pandas. So, to answer the question, you can proceed as follows.

import os
from glob import glob

import numpy as np
import pandas as pd

# Get the list of all the csv files in data path
csv_flist = glob(data_path + "/*.csv")

df_list = []
for _file in csv_flist:
    # get the file name from the data path (portable across path separators)
    file_name = os.path.basename(_file)

    # extract the date from a file name, e.g. "data.2018-06-01.csv"
    date = file_name.split(".")[1]

    # read the data in _file
    df = pd.read_csv(_file)

    # add a date column, knowing that all the data in df are recorded at the same date
    df["date"] = np.repeat(date, df.shape[0])
    df["date"] = df.date.astype("datetime64[ns]")  # convert the date column to a proper datetime dtype

    # append df to df_list
    df_list.append(df)

Let's check, for example, the first df in df_list:

print(df_list[0])

    status  user_id  weight       date
0  healthy        1      72 2019-06-01
1    obese        2     103 2019-06-01

Concatenate all the dfs along axis=0:

df_all = pd.concat(df_list, ignore_index=True).sort_index()
print(df_all)

    status  user_id  weight       date
0  healthy        1      72 2019-06-01
1    obese        2     103 2019-06-01
2  healthy        1      70 2018-06-01
3  healthy        2      90 2018-06-01

Set the index of df_all to a MultiIndex with two levels, levels[0] = "date" and levels[1] = "user_id":

data = df_all.set_index(["date", "user_id"]).sort_index()
print(data)

                     status  weight
date       user_id
2018-06-01 1        healthy      70
           2        healthy      90
2019-06-01 1        healthy      72
           2          obese     103

Subsequently, you can convert the resulting pandas.DataFrame into an xarray.Dataset using .to_xarray(), as follows.

xds = data.to_xarray()
print(xds)

<xarray.Dataset>
Dimensions:  (date: 2, user_id: 2)
Coordinates:
  * date     (date) datetime64[ns] 2018-06-01 2019-06-01
  * user_id  (user_id) int64 1 2
Data variables:
    status   (date, user_id) object 'healthy' 'healthy' 'healthy' 'obese'
    weight   (date, user_id) int64 70 90 72 103

This fully answers the question.
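For readers who would rather stay close to the attempt in the question, the per-file datasets can probably also be combined with xr.concat directly. The sketch below is not part of the answer above. It assumes that user_id is set as the index before calling .to_xarray(), and that the dates are passed through the dim argument as a named pandas.DatetimeIndex, because the coords argument of xr.concat selects which existing coordinate variables are concatenated rather than defining new ones. The csv_by_name dictionary is a hypothetical in-memory stand-in for the real files, and skipinitialspace=True only matters if the files really contain a space after each comma.

```python
import io

import pandas as pd
import xarray as xr

# Hypothetical in-memory stand-ins for the two CSV files from the question.
csv_by_name = {
    "data.2018-06-01.csv": "user_id, weight, status\n001, 70, healthy\n002, 90, healthy\n",
    "data.2019-06-01.csv": "user_id, weight, status\n001, 72, healthy\n002, 103, obese\n",
}

datasets = []
dates = []
for name, text in csv_by_name.items():
    # The date is the second dot-separated field of the file name.
    dates.append(name.split(".")[1])
    # Index by user_id so that it becomes a dimension coordinate in xarray;
    # skipinitialspace strips the blank that follows each comma.
    df = pd.read_csv(io.StringIO(text), index_col="user_id", skipinitialspace=True)
    datasets.append(df.to_xarray())

# Concatenate along a new "date" dimension labelled with the parsed dates.
date_index = pd.DatetimeIndex(dates, name="date")
xds = xr.concat(datasets, dim=date_index)
print(xds)
```

With real files, the dictionary loop would be replaced by a loop over the file paths, and io.StringIO(text) by the path itself.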

阿晨1998

Try this:

import glob
import pandas as pd

path = r'ur file'  # placeholder from the original answer: the folder that contains the csv files
all_file = glob.glob(path + "/*.csv")

li = []
for filename in all_file:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
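Note that concatenating with ignore_index=True drops the information about which file, and therefore which date, each row came from. One possible way to keep it, sketched here under the assumption that the date can be parsed from the file name as in the question, is to pass the frames to pd.concat keyed by date so that the keys form the outer level of a MultiIndex. The csv_by_name dictionary is again a hypothetical in-memory stand-in for the real files.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the two CSV files from the question.
csv_by_name = {
    "data.2018-06-01.csv": "user_id, weight, status\n001, 70, healthy\n002, 90, healthy\n",
    "data.2019-06-01.csv": "user_id, weight, status\n001, 72, healthy\n002, 103, obese\n",
}

frames = []
dates = []
for name, text in csv_by_name.items():
    # Parse the date from the file name, e.g. "data.2018-06-01.csv".
    dates.append(pd.Timestamp(name.split(".")[1]))
    frames.append(
        pd.read_csv(io.StringIO(text), index_col="user_id", skipinitialspace=True)
    )

# keys become the outer index level, so each row keeps its date.
frame = pd.concat(frames, keys=dates, names=["date", "user_id"])

# A (date, user_id) MultiIndex converts to a Dataset with those two dimensions.
xds_from_frame = frame.to_xarray()
print(xds_from_frame)
```

DataFrame.to_xarray() needs the xarray package to be installed, even though it is not imported explicitly here.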