Filter a text file based on the timestamp at the start of each line

3 Answers

MMTTMM

Using pandas

The main difference is that pandas converts all of the data to the correct dtype (e.g. datetime, int, and float), and the code is more concise. The data also ends up in a format that is useful for time series analysis and plotting, though I'd suggest adding column names:

    df.columns = ['datetime', ..., 'price']

The filtering itself is a single vectorized operation. As the timeit test further down shows, for 1M rows of data pandas is roughly 16% slower (774 ms vs. 668 ms) than reading the file with "with open" and searching each string for ':00'.

Read the file with pandas.read_csv and parse the dates in column 0. Use header=None, because the test data has no header row. Then use Boolean indexing to select the rows whose seconds are 0, using the .dt accessor to get .second.

    import pandas as pd

    # read the file, which apparently has no header, and parse the date column
    df = pd.read_csv('test.csv', header=None, parse_dates=[0])

    # use Boolean indexing to select the rows where seconds == 00
    top_of_the_minute = df[df[0].dt.second == 0]

    # save the data
    top_of_the_minute.to_csv('clean.csv', header=False, index=False)

display(top_of_the_minute) shows:

                         0  1  2     3     4   5      6      7        8
    5  2020-08-03 22:17:00  0  0  4803  4800  91  28.05  24.05  58.8917
    6  2020-08-03 22:17:00  0  0  4802  4800  91  28.05  24.05  58.8925
    7  2020-08-03 22:17:00  0  0  4805  4800  91  28.05  24.05  58.9341
    8  2020-08-03 22:17:00  0  0  4802  4800  91  28.05  24.05  58.9683
    9  2020-08-03 22:17:00  0  0  4802  4800  91  28.05  23.05  58.9780

    # example: rename columns
    top_of_the_minute.columns = ['datetime', 'v1', 'v2', 'v3', 'v4', 'v5', 'p1', 'p2', 'p3']

    # example: plot the data
    p = top_of_the_minute.plot('datetime', 'p3')
    p.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    p.set_xlim('2020-08', '2020-09')

test.csv

    2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
    2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
    2020-08-03 22:17:14,0,0,4805,4800,91,28.05,24.05,58.9341
    2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
    2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978
    2020-08-03 22:17:00,0,0,4803,4800,91,28.05,24.05,58.8917
    2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.8925
    2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341
    2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.9683
    2020-08-03 22:17:00,0,0,4802,4800,91,28.05,23.05,58.978

%%timeit test

Create test data

    # read test.csv
    df = pd.read_csv('test.csv', header=None, parse_dates=[0])

    # create a dataframe with 1M rows
    df = pd.concat([df] * 100000)

    # save the new test data (this overwrites test.csv)
    df.to_csv('test.csv', index=False, header=False)

test_sk

    def test_sk(path: str):
        zero_entries = []
        with open(path, "r") as file:
            for line in file:
                semi_index = line.index(',')
                if line[:semi_index].endswith(':00'):
                    zero_entries.append(line)
        return zero_entries

    %%timeit
    result_sk = test_sk('test.csv')
    [out]:
    668 ms ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

test_tm

    def test_tm(path: str):
        df = pd.read_csv(path, header=None, parse_dates=[0])
        return df[df[0].dt.second == 0]

    %%timeit
    result_tm = test_tm('test.csv')
    [out]:
    774 ms ± 7.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
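For anyone wondering what the `df[0].dt.second == 0` mask checks for each row, here is a minimal plain-Python sketch of the same test. It is only an illustration, not how pandas does it internally. The helper `is_top_of_minute` and the `rows` list are names made up for this example, and the format string is an assumption based on the sample data.

```python
from datetime import datetime

# Two rows copied from the sample test.csv
rows = [
    "2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917",
    "2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341",
]


def is_top_of_minute(line: str) -> bool:
    """Hypothetical helper: True when the leading timestamp has seconds == 0."""
    # The timestamp is assumed to be the first comma-separated cell.
    stamp = line.split(",", 1)[0]
    # Assumed format, inferred from the sample data.
    parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return parsed.second == 0


# One boolean per row, analogous to the Boolean mask used for indexing
mask = [is_top_of_minute(r) for r in rows]
print(mask)  # [False, True]
```

Unlike a string comparison, parsing raises a ValueError on a malformed timestamp, which may or may not be what you want for a large, messy file.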


慕桂英4014372

Try this:

    # altmasterlist is assumed to be a list of rows, each row a list of strings
    finalmasterlist2 = []
    for row in altmasterlist:
        # endswith(":00") matches zero seconds only; a plain ":00" substring
        # test would also match a zero minute such as 22:00:12
        if row[0].endswith(":00"):
            finalmasterlist2.extend(row)
    print("finalemasterlist_2")
    print(finalmasterlist2)

Input:

    2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
    2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
    2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341
    2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
    2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978

Output:

    ['2020-08-03 22:17:00', '0', '0', '4805', '4800', '91', '28.05', '24.05', '58.9341']
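The snippet above does not show where `altmasterlist` comes from, so here is a self-contained sketch that assumes it is built with the standard `csv` module. It also compares a substring test with a suffix test on the timestamp. The `raw` sample is illustrative: the second row's timestamp (22:00:12) is made up to show the difference, and `substring_hits` and `suffix_hits` are hypothetical names.

```python
import csv
import io

# Illustrative input; the 22:00:12 row is invented to show the edge case.
raw = """2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
2020-08-03 22:00:12,0,0,4802,4800,91,28.05,24.05,58.8925
2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341
"""

# Each row becomes a list of strings, which is the shape assumed for altmasterlist.
altmasterlist = list(csv.reader(io.StringIO(raw)))

# Substring test: also matches a zero minute (22:00:12).
substring_hits = [row for row in altmasterlist if ":00" in row[0]]

# Suffix test: matches only timestamps whose seconds are 00.
suffix_hits = [row for row in altmasterlist if row[0].endswith(":00")]

print(len(substring_hits))  # 2
print(len(suffix_hits))     # 1
```

Collecting whole rows in a list comprehension keeps each match as its own list, whereas `extend` flattens every matching row into one long list once there is more than one match.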


长风秋雁

You said your file is very large? Then it is probably best to filter the data as you read it. You can do this without any libraries:

    zero_entries = []
    with open(path_to_file, "r") as file:
        # iterate over every line
        for line in file:
            # find the end of the first cell
            timestamp_end = line.index(',')
            # check whether the timestamp ends on zero seconds and, if so, keep the line
            if line[:timestamp_end].endswith(':00'):
                zero_entries.append(line)
    print(zero_entries)

I'm assuming that the timestamp is always the first element of the line. Depending on your file size, this will be much faster than Trenton's solution (I tested it with ~58k lines):

    import time
    import pandas as pd

    path = r"txt.csv"

    # built-in approach
    start = time.time()
    zero_entries = []
    with open(path, "r") as file:
        for line in file:
            semi_index = line.index(',')
            if line[:semi_index].endswith(':00'):
                zero_entries.append(line)
    end = time.time()
    print(end-start)

    # pandas approach
    start = time.time()
    df = pd.read_csv(path, header=None, parse_dates=[0])
    # using Boolean indexing to select data when seconds = 00
    top_of_the_minute = df[df[0].dt.second == 0]
    end = time.time()
    print(end-start)

Output:

    0.04886937141418457 # built-in
    0.27971720695495605 # pandas
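If the file is too large to keep even the matching lines in memory, the same idea can write matches straight to an output file. This is only a sketch under the same assumption that the timestamp is the first cell. The function name `filter_top_of_minute` and the temporary file names are made up for the demo, and `str.partition` is swapped in for `str.index` so that a line without a comma does not raise a ValueError.

```python
import os
import tempfile


def filter_top_of_minute(src_path: str, dst_path: str) -> int:
    """Hypothetical helper: copy lines whose first cell ends in ':00'; return the count."""
    kept = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            # partition returns the whole line as the first part if there is no comma
            timestamp, _, _ = line.partition(",")
            if timestamp.endswith(":00"):
                dst.write(line)
                kept += 1
    return kept


# Tiny demo with two rows from the sample data, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.csv")
    dst_path = os.path.join(tmp, "clean.csv")
    with open(src_path, "w") as f:
        f.write("2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917\n")
        f.write("2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341\n")
    kept = filter_top_of_minute(src_path, dst_path)
    with open(dst_path) as f:
        result = f.read()

print(kept)  # 1
```

Only one line is held in memory at a time, so memory use stays flat regardless of how many rows match.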
