
Parsing multi-line headers in a space-formatted report

I am trying to parse a file that contains a table with a multi-line header:


                        Categ_1   Categ_2   Categ_3    Categ_4
data1 Group             Data      Data      Data       Data     (     %)  Options
--------------------------------------------------------------------------------
param_group1            6.366e-03 6.644e-03 6.943e-05    0.0131 (57.42%)  i
param_group2            1.251e-05 7.253e-06 4.256e-04 4.454e-04 ( 1.96%)
param_group3            2.205e-05 6.421e-05 2.352e-03 2.438e-03 (10.70%)
param_group4            1.579e-07    0.0000 1.479e-05 1.495e-05 ( 0.07%)
param_group5            3.985e-03 2.270e-07 2.789e-03 6.775e-03 (29.74%)
param_group6            0.0000    0.0000    0.0000    0.0000 ( 0.00%)
param_group7            -8.121e-09
                                     0.0000 1.896e-08 1.084e-08 ( 0.00%)


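I have successfully used pyparsing in the past to parse tables like this, but in those cases the header was on a single line and none of the header fields contained multiple spaces, as "(     %)" does here.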
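To make the difficulty concrete, here is a minimal sketch of what happens when the second header line, which has seven columns, is split on whitespace. It is only an illustration; the variable names are made up for this example.

```python
import re

header_line = "data1 Group             Data      Data      Data       Data     (     %)  Options"

# Splitting on any whitespace breaks both "data1 Group" and "(     %)" apart.
naive_fields = header_line.split()

# Splitting on runs of two or more spaces keeps "data1 Group" together,
# but still cuts "(     %)" in two because of the spaces inside it.
wide_gap_fields = re.split(r"\s{2,}", header_line)

print(len(naive_fields), naive_fields)
print(len(wide_gap_fields), wide_gap_fields)
```

Neither split yields seven fields, which is why the header has to be cut by column position rather than by delimiter.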


This is how I did it:


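from pyparsing import Optional, col

def mustMatchCols(startloc, endloc):
    # condition: the match must start within the expected column range
    return lambda s, l, t: startloc <= col(l, s) <= endloc + 1

def tableValue(expr, colstart, colend):
    # Optional, so that an empty cell does not fail the whole row
    return Optional(expr.copy().addCondition(mustMatchCols(colstart, colend), message="text not in expected columns"))

# determine_header_column_widths is my own helper (not shown here)
if header:
    column_lengths = determine_header_column_widths(header_line)

# Then run the tableValue function for each start,end pair.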

Is there any built-in construct or example, in pyparsing or using any other approach, for parsing this kind of space-formatted table?


波斯汪
1 Answer

达令说

If you can determine the column widths in advance, here is some code that stitches the multi-line column headers together:

headers = """\
                        Categ_1   Categ_2   Categ_3    Categ_4
data1 Group             Data      Data      Data       Data     (     %)  Options
"""

col_widths = [24, 10, 10, 11, 9, 10, 10]

# convert widths to slices
col_slices = []
prev = 0
for cw in col_widths:
    col_slices.append(slice(prev, prev + cw))
    prev += cw

# verify slices
# for line in headers.splitlines():
#     for slc in col_slices:
#         print(line[slc])

def extract_line_parts(slices, line_string):
    return [line_string[slc].strip() for slc in slices]

# extract the different column header parts
parts = [extract_line_parts(col_slices, line) for line in headers.splitlines()]
for p in parts:
    print(p)

# use zip(*parts) to transpose list of row parts into list of column parts
header_cols = list(zip(*parts))
print(header_cols)

for header in header_cols:
    print(' '.join(filter(None, header)))

Prints:

['', 'Categ_1', 'Categ_2', 'Categ_3', 'Categ_4', '', '']
['data1 Group', 'Data', 'Data', 'Data', 'Data', '(     %)', 'Options']
[('', 'data1 Group'), ('Categ_1', 'Data'), ('Categ_2', 'Data'), ('Categ_3', 'Data'), ('Categ_4', 'Data'), ('', '(     %)'), ('', 'Options')]
data1 Group
Categ_1 Data
Categ_2 Data
Categ_3 Data
Categ_4 Data
(     %)
Options
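The steps above can also be folded into a single reusable helper. This is only a minimal sketch under the same assumption (column widths known up front); `combine_header_lines` is a hypothetical name, not something pyparsing or the standard library provides.

```python
from itertools import accumulate

def combine_header_lines(header_lines, col_widths):
    """Join the pieces of each column found on successive header lines."""
    # Column boundaries are the running totals of the widths: 0, 24, 34, ...
    bounds = [0] + list(accumulate(col_widths))
    slices = [slice(start, end) for start, end in zip(bounds, bounds[1:])]
    # Cut every line into columns; slicing past the end of a short line gives ''.
    rows = [[line[slc].strip() for slc in slices] for line in header_lines]
    # zip(*rows) transposes rows into columns; empty pieces are skipped when joining.
    return [" ".join(filter(None, column)) for column in zip(*rows)]

header_lines = [
    "                        Categ_1   Categ_2   Categ_3    Categ_4",
    "data1 Group             Data      Data      Data       Data     (     %)  Options",
]
names = combine_header_lines(header_lines, [24, 10, 10, 11, 9, 10, 10])
print(names)
```

Note that these widths are tuned to the header lines. Some data rows in the sample are right-aligned and start a character earlier than their header, so it is worth checking the slices against a few data rows before reusing them there.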