
How do I extract HTML tables from a file using pandas?

I'm new to pandas, and I'm trying to extract some data from a few HTML files.


How can I convert multiple HTML tables like these:


       PS4

Game Name | Price

GoW       | 49.99

FF VII R  | 59.99


       XBX

Game Name | Price

Gears 5   | 49.99

Forza 5   | 59.99

<table>

  <tr colspan="2">

    <td>PS4</td>

  </tr>

  <tr>

    <td>Game Name</td>

    <td>Price</td>

  </tr>

  <tr>

    <td>GoW</td>

    <td>49.99</td>

  </tr>

  <tr>

    <td>FF VII R</td>

    <td>59.99</td>

  </tr>

</table>


<table>

  <tr colspan="2">

    <td>XBX</td>

  </tr>

  <tr>

    <td>Game Name</td>

    <td>Price</td>

  </tr>

  <tr>

    <td>Gears 5</td>

    <td>49.99</td>

  </tr>

  <tr>

    <td>Forza 5</td>

    <td>59.99</td>

  </tr>

</table>

into a JSON object like this:


[

  { "Game Name": "GoW", "Price": "49.99", "platform": "PS4"},

  { "Game Name": "FF VII R", "Price": "59.99", "platform": "PS4"},

  { "Game Name": "Gears 5", "Price": "49.99", "platform": "XBX"},

  { "Game Name": "Forza 5", "Price": "59.99", "platform": "XBX"}

]


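I tried loading the HTML file that contains the tables with pandas.read_html(path/to/file). It does return a list of DataFrames, but I don't know how to extract the data from there, especially since the platform name sits in the table's header row rather than in a column of its own.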


I'm using pandas because I'm extracting these tables from local .htm files that also contain other kinds of tables and HTML code, so I use:


tables = pandas.read_html(file_path, match="Game Name")

The match parameter, keyed on that column name, lets me quickly isolate the tables I need.
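For reference, here is a minimal sketch of what that call hands back for tables shaped like the ones above. The inline HTML string and the extra "Unrelated" table are made up for illustration, and the snippet assumes an HTML parser that pandas can use (lxml, or bs4 with html5lib) is installed. Because the rows use only <td> cells, pandas is not expected to infer a header, so the platform name should land in row 0 and the real column names in row 1.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a local .htm file: one unrelated table
# plus one table in the shape shown in the question.
html = """
<table>
  <tr><td>Unrelated</td><td>Data</td></tr>
  <tr><td>a</td><td>1</td></tr>
</table>
<table>
  <tr colspan="2"><td>PS4</td></tr>
  <tr><td>Game Name</td><td>Price</td></tr>
  <tr><td>GoW</td><td>49.99</td></tr>
  <tr><td>FF VII R</td><td>59.99</td></tr>
</table>
"""

# match keeps only the tables whose text matches "Game Name"
tables = pd.read_html(StringIO(html), match="Game Name")
raw = tables[0]

print(len(tables))   # number of tables that matched
print(raw)           # raw frame: platform in row 0, column names in row 1
```

Printing the raw frame first makes it easier to see which rows are really headers before any cleanup is attempted.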


守候你守候我
Views: 82 · Answers: 1
1 Answer

红颜莎娜

import pandas as pd

# list to hold every DataFrame from all tables in all files
df_list = list()

# list of files to load
list_of_files = ['test.html']

# iterate through your files
for file in list_of_files:

    # create a list of DataFrames from the tables in the file
    dfl = pd.read_html(file, match='Game Name')

    # fix the headers and columns
    for d in dfl:
        # select row 1 as the headers
        d.columns = d.iloc[1]
        # select row 0, column 0 as the platform
        d['platform'] = d.iloc[0, 0]
        # select row 2 and below as the data; rows 0 and 1 were the headers
        d = d.iloc[2:]
        # append the cleaned DataFrame to df_list
        df_list.append(d.copy())

# create a single DataFrame
df = pd.concat(df_list).reset_index(drop=True)

# create a list of dicts from df
records = df.to_dict('records')

print(records)

[out]:
[{'Game Name': 'GoW', 'Price': '49.99', 'platform': 'PS4'},
 {'Game Name': 'FF VII R', 'Price': '59.99', 'platform': 'PS4'},
 {'Game Name': 'Gears 5', 'Price': '49.99', 'platform': 'XBX'},
 {'Game Name': 'Forza 5', 'Price': '59.99', 'platform': 'XBX'}]
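Since the question asks for JSON, one possible last step is to serialize the records with the standard json module. The sketch below is only an illustration: tidy_platform_table is a hypothetical helper name, and the raw frame is built by hand as a stand-in for one DataFrame from pd.read_html, assuming the platform sits in row 0 and the column names in row 1.

```python
import json

import pandas as pd


def tidy_platform_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: row 0 holds the platform, row 1 the column names."""
    platform = raw.iloc[0, 0]
    tidy = raw.iloc[2:].copy()          # data rows only
    tidy.columns = list(raw.iloc[1])    # promote row 1 to column names
    tidy["platform"] = platform         # repeat the platform on every row
    return tidy.reset_index(drop=True)


# Hand-built stand-in for one table as returned by pd.read_html (assumed shape)
raw = pd.DataFrame(
    [
        ["PS4", None],
        ["Game Name", "Price"],
        ["GoW", "49.99"],
        ["FF VII R", "59.99"],
    ]
)

tidy = tidy_platform_table(raw)

# to_dict("records") gives a list of dicts; json.dumps turns it into JSON text
json_text = json.dumps(tidy.to_dict("records"), indent=2)
print(json_text)
```

Copying the data rows before assigning the new columns avoids modifying a slice of the original frame, which is the same reason the loop above appends d.copy().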