
Filter a text file based on the timestamp at the start of each line

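I have a huge text file, and I want to get the lines, with their associated data, that fall at the top of the minute. Here are a few lines from that text file; it is a snippet of more than 36 hours of data. By "associated" I mean the 8 data points that follow the timestamp.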


2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917

2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925

2020-08-03 22:17:14,0,0,4805,4800,91,28.05,24.05,58.9341

2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683

2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978

...

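I can't find a way to get Python to look at the seconds part of the timestamp and then build a new list containing only the data associated with ":00" seconds.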


for line in fh:

    line = line.rstrip("\n")

    line = line.split(",")

    masterlist.extend(line) #this is putting the information into one list

    altmasterlist.append(line) #this is putting the lines of information into a list


for line in altmasterlist:

    if ":00" in line:

        finalmasterlist.extend(line) #Nothing is entering this if statement


print(finalmasterlist)

Am I even in the right area with these two for loops?


千巷猫影
3 answers

MMTTMM

Use pandas

The main difference is that pandas converts all of the data to the correct dtype (for example datetime, int, and float), and the code is more concise.

In addition, the data ends up in a format that is useful for time-series analysis and plotting, but I suggest adding column names: df.columns = ['datetime', ..., 'price']

The filtering itself can be done in one line with a vectorized operation.

As the timeit test further down shows, for 1M rows of data pandas is somewhat slower (774 ms versus 668 ms) than reading the file with "with open" and searching each str for :00.

Read the file with pandas.read_csv and parse the dates in column 0.
Use header=None, because the test data has no header row.
Use Boolean indexing to select the dates whose seconds are 0.
Use the .dt accessor to get .second.

    import pandas as pd

    # read the file which apparently has no header and parse the date column
    df = pd.read_csv('test.csv', header=None, parse_dates=[0])

    # using Boolean indexing to select data when seconds = 00
    top_of_the_minute = df[df[0].dt.second == 0]

    # save the data
    top_of_the_minute.to_csv('clean.csv', header=False, index=False)

    # display(top_of_the_minute)
                        0  1  2     3     4   5      6      7        8
    5 2020-08-03 22:17:00  0  0  4803  4800  91  28.05  24.05  58.8917
    6 2020-08-03 22:17:00  0  0  4802  4800  91  28.05  24.05  58.8925
    7 2020-08-03 22:17:00  0  0  4805  4800  91  28.05  24.05  58.9341
    8 2020-08-03 22:17:00  0  0  4802  4800  91  28.05  24.05  58.9683
    9 2020-08-03 22:17:00  0  0  4802  4800  91  28.05  23.05  58.9780

    # example: rename columns
    top_of_the_minute.columns = ['datetime', 'v1', 'v2', 'v3', 'v4', 'v5', 'p1', 'p2', 'p3']

    # example: plot the data
    p = top_of_the_minute.plot('datetime', 'p3')
    p.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    p.set_xlim('2020-08', '2020-09')

test.csv

    2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
    2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
    2020-08-03 22:17:14,0,0,4805,4800,91,28.05,24.05,58.9341
    2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
    2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978
    2020-08-03 22:17:00,0,0,4803,4800,91,28.05,24.05,58.8917
    2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.8925
    2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341
    2020-08-03 22:17:00,0,0,4802,4800,91,28.05,24.05,58.9683
    2020-08-03 22:17:00,0,0,4802,4800,91,28.05,23.05,58.978

%%timeit test

Create the test data

    # read test.csv
    df = pd.read_csv('test.csv', header=None, parse_dates=[0])

    # create a dataframe with 1M rows
    df = pd.concat([df] * 100000)

    # save the new test data
    df.to_csv('test.csv', index=False, header=False)

test_sk

    def test_sk(path: str):
        zero_entries = []
        with open(path, "r") as file:
            for line in file:
                semi_index = line.index(',')
                if line[:semi_index].endswith(':00'):
                    zero_entries.append(line)
        return zero_entries

    %%timeit
    result_sk = test_sk('test.csv')
    [out]:
    668 ms ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

test_tm

    def test_tm(path: str):
        df = pd.read_csv(path, header=None, parse_dates=[0])
        return df[df[0].dt.second == 0]

    %%timeit
    result_tm = test_tm('test.csv')
    [out]:
    774 ms ± 7.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
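As a small variation on the pandas approach, the column names can be supplied when the file is read rather than assigned afterwards. This is only a sketch: the names are the placeholder names used in the rename example (the question never says what the fields mean), and the three sample rows are made up in the same shape as the question's data and read from an in-memory string instead of test.csv.

```python
import io

import pandas as pd

# Placeholder column names; the real meaning of the fields is not given.
columns = ['datetime', 'v1', 'v2', 'v3', 'v4', 'v5', 'p1', 'p2', 'p3']

# Made-up sample rows in the same layout as the question's file.
sample = io.StringIO(
    "2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917\n"
    "2020-08-03 22:18:00,0,0,4802,4800,91,28.05,24.05,58.8925\n"
    "2020-08-03 23:00:15,0,0,4805,4800,91,28.05,24.05,58.9341\n"
)

# Name the columns while reading and parse the timestamp column by name.
df = pd.read_csv(sample, header=None, names=columns, parse_dates=['datetime'])

# Keep only rows whose seconds are zero; 23:00:15 is correctly excluded.
top_of_the_minute = df[df['datetime'].dt.second == 0]
print(top_of_the_minute)
```

Because the comparison is made on the parsed seconds value, a row such as 23:00:15 (zero minutes, non-zero seconds) is not selected.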

慕桂英4014372

Try this:

    finalmasterlist2 = []
    for i in range(len(altmasterlist)):
        # look only at the timestamp (first field); endswith avoids matching ":00" in the minutes
        if altmasterlist[i][0].endswith(":00"):
            finalmasterlist2.extend(altmasterlist[i])
    print("finalmasterlist2")
    print(finalmasterlist2)

Input:

    2020-08-03 22:17:12,0,0,4803,4800,91,28.05,24.05,58.8917
    2020-08-03 22:17:13,0,0,4802,4800,91,28.05,24.05,58.8925
    2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341
    2020-08-03 22:17:15,0,0,4802,4800,91,28.05,24.05,58.9683
    2020-08-03 22:17:18,0,0,4802,4800,91,28.05,23.05,58.978

Output:

    ['2020-08-03 22:17:00', '0', '0', '4805', '4800', '91', '28.05', '24.05', '58.9341']
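A short sketch of why the check in the question never matches, and why testing the end of the timestamp is stricter than a substring test. The row is one of the sample lines from this answer; the "22:00:15" timestamp is a made-up value used only to show the false positive.

```python
# One parsed row, as produced by line.split(",") in the question.
row = "2020-08-03 22:17:00,0,0,4805,4800,91,28.05,24.05,58.9341".split(",")

# `in` on a list asks whether some element EQUALS ":00"; none does,
# which is why nothing entered the if statement in the question.
list_membership = ":00" in row

# `in` on the timestamp string is a substring test. It matches this row,
# but it also matches a time where only the minutes are zero.
substring_match = ":00" in row[0]
false_positive = ":00" in "2020-08-03 22:00:15"

# endswith() looks only at the seconds field at the end of the timestamp.
seconds_zero = row[0].endswith(":00")
minutes_only = "2020-08-03 22:00:15".endswith(":00")

print(list_membership, substring_match, false_positive, seconds_zero, minutes_only)
```

Over 36 hours of data, every row in the first minute of each hour would pass the substring test, so the endswith form is the safer of the two.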

长风秋雁

You said your file is huge? Then it may be better to split the data while reading it. You can do this without any libraries:

    zero_entries = []
    with open(path_to_file, "r") as file:
        # iterates over every line
        for line in file:
            # finds the end of the first cell
            timestamp_end = line.index(',')
            # checks if the timestamp ends on zero seconds and adds it to a list.
            if line[:timestamp_end].endswith(':00'):
                zero_entries.append(line)
    print(zero_entries)

I assume your timestamp will always be the first element of the line. Depending on your file size, this will be much faster than Trenton's pandas solution (I tested it with ~58k lines):

    import time
    import pandas as pd

    path = r"txt.csv"

    start = time.time()
    zero_entries = []
    with open(path, "r") as file:
        for line in file:
            semi_index = line.index(',')
            if line[:semi_index].endswith(':00'):
                zero_entries.append(line)
    end = time.time()
    print(end-start)

    start = time.time()
    df = pd.read_csv(path, header=None, parse_dates=[0])
    # using Boolean indexing to select data when seconds = 00
    top_of_the_minute = df[df[0].dt.second == 0]
    end = time.time()
    print(end-start)

    0.04886937141418457 # built-in
    0.27971720695495605 # pandas
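If the string check feels too fragile, the same line-by-line idea can be sketched with the standard library's csv and datetime modules, which parse the timestamp and compare the actual seconds value. This is an illustrative variant, not the answer's method: the function name top_of_minute_rows and the sample rows are made up, and the timestamp format is assumed to be the one shown in the question. Parsing every timestamp will likely cost more time than the endswith check.

```python
import csv
import io
from datetime import datetime


def top_of_minute_rows(lines):
    """Yield parsed rows whose timestamp (first field) has zero seconds."""
    for row in csv.reader(lines):
        if not row:
            continue  # csv.reader yields [] for blank lines; skip them
        stamp = datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S")
        if stamp.second == 0:
            yield row


# Made-up sample in the question's layout, including a blank line.
sample = io.StringIO(
    "2020-08-03 22:17:59,0,0,4803,4800,91,28.05,24.05,58.8917\n"
    "2020-08-03 22:18:00,0,0,4802,4800,91,28.05,24.05,58.8925\n"
    "\n"
    "2020-08-03 23:00:15,0,0,4805,4800,91,28.05,24.05,58.9341\n"
)

rows = list(top_of_minute_rows(sample))
print(rows)
```

Since the function is a generator, it can be fed an open file object directly and the matching rows written out one at a time, so the whole file never has to sit in memory.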