Removing stray header rows from a DataFrame with Python/Pandas

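I have a corrupted data file in which the header row is repeated at random places among the data. How can I ignore or remove these rows when loading the DataFrame?


Because of this stray header inside the data, pandas raises an error while loading. I would like either to ignore that row when loading the file with pandas, or to remove it somehow before pandas loads it.


The file looks like this: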


col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3  <- this is the random copy of the header inside the dataframe
0, 1, 1
0, 0, 0
1, 1, 1

What I want:


col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
0, 1, 1
0, 0, 0
1, 1, 1


慕丝7291255
2 Answers

白衣染霜花

Pass na_filter=False so that no value is converted to NaN; because the repeated header row is not numeric, pandas reads every column as strings anyway. Then find all rows that contain the bad data and filter them out of your DataFrame.

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('sample.csv', header=0, na_filter=False)
>>> df
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3  col1  col2  col3
4     0     1     1
5     0     0     0
6     1     1     1
>>> type(df.iloc[0, 0])
<class 'str'>

Now that the data in every column has been parsed as strings, look for the values col1, col2 and col3 inside df. Wherever a row holds something other than '0' or '1' in every column, mark it in a new column built with np.where(), like this:

>>> df['Tag'] = np.where(((df['col1'] != '0') & (df['col1'] != '1')) & ((df['col2'] != '0') & (df['col2'] != '1')) & ((df['col3'] != '0') & (df['col3'] != '1')), 'Remove', "Don't remove")
>>> df
   col1  col2  col3           Tag
0     0     1     1  Don't remove
1     0     0     0  Don't remove
2     1     1     1  Don't remove
3  col1  col2  col3        Remove
4     0     1     1  Don't remove
5     0     0     0  Don't remove
6     1     1     1  Don't remove

Now use isin() to filter out the rows marked Remove in the Tag column.

>>> df2 = df[~df['Tag'].isin(['Remove'])]
>>> df2
  col1 col2 col3           Tag
0    0    1    1  Don't remove
1    0    0    0  Don't remove
2    1    1    1  Don't remove
4    0    1    1  Don't remove
5    0    0    0  Don't remove
6    1    1    1  Don't remove

Drop the Tag column:

>>> df2 = df2[['col1', 'col2', 'col3']]
>>> df2
  col1 col2 col3
0    0    1    1
1    0    0    0
2    1    1    1
4    0    1    1
5    0    0    0
6    1    1    1

Finally, convert the DataFrame to int if you need integers (the exact integer type reported depends on your platform):

>>> df2 = df2.astype(int)
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
4     0     1     1
5     0     0     0
6     1     1     1
>>> type(df2['col1'][0])
<class 'numpy.int32'>

Note: if you want a standard 0..n-1 index, use:

>>> df2.reset_index(inplace=True, drop=True)
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3     0     1     1
4     0     0     0
5     1     1     1
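The tagging steps above only work when the valid values are known in advance ('0' and '1' here). As a rough sketch of a more general variant, the code below drops any row in which every cell repeats its own column name. The function name drop_repeated_headers and the in-memory SAMPLE_CSV text are illustrative assumptions, not part of the answer above, and the final numeric conversion assumes all remaining values are numbers.

```python
import io
import pandas as pd

# Hypothetical stand-in for sample.csv, with the header repeated mid-file.
SAMPLE_CSV = (
    "col1,col2,col3\n"
    "0,1,1\n"
    "0,0,0\n"
    "1,1,1\n"
    "col1,col2,col3\n"
    "0,1,1\n"
    "0,0,0\n"
    "1,1,1\n"
)


def drop_repeated_headers(df):
    """Drop rows in which every cell repeats its own column name."""
    is_header = pd.Series(True, index=df.index)
    for name in df.columns:
        # Compare as stripped text so stray spaces do not hide a match.
        is_header &= df[name].astype(str).str.strip() == str(name).strip()
    cleaned = df.loc[~is_header].reset_index(drop=True)
    # Assumption: the remaining values are numeric; to_numeric raises otherwise.
    return cleaned.apply(pd.to_numeric)


# dtype=str keeps every cell as text, so the comparison is well defined.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype=str)
clean = drop_repeated_headers(df)
print(clean)
```

Requiring a match in every column, rather than in one, avoids dropping a genuine data row that happens to contain a header-like value in a single field.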

BIG阳

All you need is the following. Assume df_raw is your raw DataFrame, with the header as its column names and the same header repeated in a few other rows; the corrected DataFrame is df.

    # Keep only the rows that are not copies of the header.
    headers = df_raw.columns.tolist()
    df = df_raw[df_raw[headers[0]] != headers[0]].reset_index(drop=True)

Assumption: a row must be dropped whenever the first column's header appears in its first column.

Below is a more detailed code block that lets anyone
- create the data,
- write it to a csv file,
- load it as a DataFrame, and then
- drop the rows that are headers.

    import pandas as pd

    # Make a csv file to load as a DataFrame
    data = '''col1, col2, col3
    0, 1, 1
    0, 0, 0
    1, 1, 1
    col1, col2, col3
    0, 1, 1
    0, 0, 0
    1, 1, 1
    '''

    # Write the data to a csv file
    with open('data.csv', 'w') as f:
        f.write(data)

    # Load your data with header=None
    df_raw = pd.read_csv('data.csv', header=None)

    # Declare which row holds the header data:
    #    assuming the top one, we set this to zero.
    header_row_number = 0

    # Read in the column headers
    headers = df_raw.iloc[header_row_number].tolist()

    # Set the new column headers
    df_raw.columns = headers

    # Keep only the rows without the headers in them.
    # We assume that the appearance of the
    # first column header means that row has to be dropped.
    # Then reset the index (and drop the old index column).
    df = df_raw[df_raw[headers[0]] != headers[0]].reset_index(drop=True)
    df
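The question also asks about removing the stray rows before pandas loads the file. A minimal sketch of that idea follows: filter the text lines first, then hand the cleaned text to pd.read_csv so the columns are parsed as numbers directly. The helper name read_csv_without_repeated_headers is hypothetical, and the sketch assumes a repeated header matches the first line exactly (apart from surrounding whitespace) and that the file fits in memory.

```python
import io
import pandas as pd


def read_csv_without_repeated_headers(handle, **read_csv_kwargs):
    """Read CSV text from an open handle, skipping repeats of its first line."""
    header = handle.readline()
    kept = [header]
    for line in handle:
        # Skip blank lines and exact repeats of the header line.
        if line.strip() and line.strip() != header.strip():
            kept.append(line)
    return pd.read_csv(io.StringIO("".join(kept)), **read_csv_kwargs)


# In-memory stand-in for the file; with a real file, pass open('data.csv') instead.
raw = io.StringIO(
    "col1, col2, col3\n"
    "0, 1, 1\n"
    "0, 0, 0\n"
    "1, 1, 1\n"
    "col1, col2, col3\n"
    "0, 1, 1\n"
    "0, 0, 0\n"
    "1, 1, 1\n"
)

# skipinitialspace=True strips the space that follows each comma in this file.
df = read_csv_without_repeated_headers(raw, skipinitialspace=True)
print(df)
print(df.dtypes)
```

Because the header copies never reach the parser, no string-to-integer conversion is needed afterwards.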