Reading a delimited file with wrapped lines

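Apologies if there is already an obvious answer to this.

I have a very large file that poses a few parsing challenges. I receive these files from outside my organization, so I cannot change their format.

First, the file is space-delimited, but the fields that make up the "columns" of one data row can be spread over several lines. For example, a row that should hold 25 columns of data might be written in the file as: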

1 2 3 4 5 6 7 8 9 10 11 12 13 14

15 16 17 18 19 20 21

22 23 24 25

1 2 3 4 5 6 7 8 9 10 11 12 13

14 15 16 17 18

19 20 21 22 23 24 25

As you can see, I cannot rely on each set of data being on a single line, but I can rely on each set having the same number of columns.
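Because only the column count is reliable, one way to look at the problem is to treat a line break as just another space and regroup the fields by count. The following is a minimal sketch of that idea, not a tested parser for the real files; `group_wrapped` is a name made up for this illustration, and the sample text is the 25-column example from above.

```python
from typing import Iterable, Iterator, List


def group_wrapped(lines: Iterable[str], n_cols: int) -> Iterator[List[str]]:
    """Yield logical rows of n_cols fields, ignoring where the lines wrap."""
    pending: List[str] = []
    for line in lines:
        # str.split() with no argument splits on any run of whitespace,
        # so blank lines simply contribute no fields.
        pending.extend(line.split())
        while len(pending) >= n_cols:
            yield pending[:n_cols]
            pending = pending[n_cols:]
    if pending:
        raise ValueError(f"{len(pending)} leftover fields do not fill a row")


sample = """\
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18
19 20 21 22 23 24 25
"""
rows = list(group_wrapped(sample.splitlines(), n_cols=25))
```

Since the function is a generator, it only holds one partial row in memory at a time, which matters for a very large file.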

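To make matters worse, the file follows a "definition, then data" type of format: the first 3 or so lines describe the data (including a field that tells me how many rows there are), and the next N rows are the data. Then it goes back to the 3-line format to describe the next set of data. This means I cannot just set up a reader for an N-column format and let it run to EOF.

I am afraid the built-in Python file reading will get very ugly, but I cannot find anything in csv or numpy that works.

Any suggestions?

Edit: just as an example of a different solution:

We have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns, so we do the following: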

data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);

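This reads the data no matter how it is wrapped, while leaving the file handle open so the next section can be processed later. It is done this way because the files are so large that reading them whole could use too much RAM.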
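A rough Python counterpart to that textscan call might look like the sketch below. It reads exactly `n_rows * n_cols` numbers from an already open handle and then stops, so the caller can go on to read the next header from the same handle. `read_block` and its parameters are names invented for this illustration, and the sketch assumes each block ends at a line boundary, so that the next header starts on a fresh line.

```python
import io
from typing import List, TextIO


def read_block(fh: TextIO, n_rows: int, n_cols: int) -> List[List[float]]:
    """Read n_rows * n_cols numbers from fh, however the lines wrap."""
    wanted = n_rows * n_cols
    values: List[float] = []
    while len(values) < wanted:
        line = fh.readline()
        if line == "":  # readline() returns '' only at end of file
            raise EOFError("file ended before the block was complete")
        values.extend(float(tok) for tok in line.split())
    if len(values) != wanted:
        # Assumption violated: the last data line ran past the block.
        raise ValueError("block did not end at a line boundary")
    return [values[i:i + n_cols] for i in range(0, wanted, n_cols)]


# In-memory stand-in for an open file: a 2 x 3 block, then more content.
fh = io.StringIO("1 2\n3 4 5\n6\nnext header line\n")
block = read_block(fh, n_rows=2, n_cols=3)
rest = fh.readline()  # the handle is positioned right after the block
```

Only one block is held in memory at a time, which is the same property the MATLAB tool relies on.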

慕村225694

1 Answer

慕尼黑5688855

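Here is a sketch of how you could proceed (edit: with some modifications):

    def parse_header(l1, l2, l3):
        # Extract the number of rows and columns of the next section
        # from the three header lines. The header layout is specific
        # to your files, so this part has to be filled in.
        raise NotImplementedError

    # store data for the different sections here
    datasections = []

    with open("testfile.txt", "r") as file:
        while True:
            # read the three header lines
            l1 = file.readline()
            if l1 == '':  # EOF, or some other end condition
                break
            l2 = file.readline()
            l3 = file.readline()
            nrows, ncols = parse_header(l1, l2, l3)

            # read lines until the section is complete
            current_row = []
            while len(current_row) < nrows * ncols:
                line = file.readline()
                if line == '':
                    raise ValueError("file ended in the middle of a section")
                # isolate the items using str.split() and collect them
                current_row.extend(line.split())

            # break current_row into rows after each ncols-th item
            rows = [current_row[i:i + ncols]
                    for i in range(0, len(current_row), ncols)]
            datasections.append(rows)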
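The sketch above keeps every section in `datasections`, which may be too much for a file that is described as very large. One possible variation is to yield each section as soon as it is complete, so only one section is in memory at a time. This is a minimal sketch under assumptions: `iter_sections` and `demo_parse_header` are hypothetical names, and the demo header layout (third line holding "n_rows n_cols") is made up purely so the example can run, since the real header format is not shown in the question.

```python
import io
from typing import Callable, Iterator, List, TextIO, Tuple

Header = Tuple[str, str, str]


def iter_sections(
    fh: TextIO,
    parse_header: Callable[[Header], Tuple[int, int]],
) -> Iterator[Tuple[Header, List[List[str]]]]:
    """Yield (header, rows) one section at a time until EOF."""
    while True:
        l1 = fh.readline()
        if l1 == "":  # clean EOF between sections
            return
        header = (l1, fh.readline(), fh.readline())
        n_rows, n_cols = parse_header(header)
        items: List[str] = []
        while len(items) < n_rows * n_cols:
            line = fh.readline()
            if line == "":
                raise EOFError("file ended in the middle of a section")
            items.extend(line.split())
        yield header, [items[i:i + n_cols] for i in range(0, len(items), n_cols)]


# Hypothetical header layout, for the demo only: third line is "n_rows n_cols".
def demo_parse_header(header: Header) -> Tuple[int, int]:
    n_rows, n_cols = header[2].split()
    return int(n_rows), int(n_cols)


demo = io.StringIO(
    "set A\nunits: m\n2 3\n1 2\n3 4 5\n6\n"
    "set B\nunits: s\n1 2\n7\n8\n"
)
shapes = [(len(rows), len(rows[0]))
          for _, rows in iter_sections(demo, demo_parse_header)]
```

Passing the header parser in as a callable keeps the file-specific part separate from the wrapping logic, so the loop itself does not need to know the header layout.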
