Replacing table contents with BeautifulSoup


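I want to use Beautiful Soup to parse an HTML document that also contains tabular data. I am doing some NLP on it.

A table cell may contain only numbers, or it may contain a lot of text. So, before calling soup.get_text(), I want to change the contents of the table cells based on the following condition.

Condition: if a cell has more than two words (a number counts as one word), keep it as is; otherwise, replace the cell's contents with an empty string.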
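The condition above can be sketched as a small predicate. This is only an illustration: the function name `has_enough_words` and the `min_words` parameter are hypothetical, and it assumes that words are separated by any run of whitespace (`str.split()` with no argument), which is slightly more forgiving than splitting on a single space.

```python
def has_enough_words(cell_text: str, min_words: int = 3) -> bool:
    """Return True if the text has at least `min_words` whitespace-separated tokens.

    Hypothetical helper: "more than two words" is read as "three or more tokens",
    and a number counts as one token.
    """
    return len(cell_text.split()) >= min_words


print(has_enough_words("not kept"))                         # False (2 tokens)
print(has_enough_words("Row 2, Column 1, should be kept"))  # True (7 tokens)
print(has_enough_words("42"))                               # False (1 token)
```

Using `split()` without an argument means that newlines, tabs and repeated spaces inside a cell do not inflate the word count.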

soup = BeautifulSoup(html, 'html.parser')
# <code to change table data based on condition>
text = soup.get_text()

Here is what I have tried:

tables = soup.find_all('table')
for table in tables:
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        for ele in cols:
            if len(ele.text.split(' ')) < 3:
                ele.text = ''  # this assignment is what raises the error

However, ele.text cannot be assigned to, so this raises an error.

Here is a simple HTML structure with a table:

<!DOCTYPE html>
<html>
   <head>
      <title>HTML Tables</title>
   </head>
   <body>
      <table border = "1">
         <tr>
            <td><p><span>Row 1, Column 1, This should be kept because it has more than two tokens</span></p></td>
            <td><p><span>not kept</span></p></td>
         </tr>
         <tr>
            <td><p><span>Row 2, Column 1, should be kept</span></p></td>
            <td><p><span>Row 2, Column 2, should be kept</span></p></td>
         </tr>
      </table>
   </body>
</html>

繁花不似锦


1 Answer

慕丝7291255

Once you have found the element, use ele.string.replace_with("").

Based on your sample HTML:

from bs4 import BeautifulSoup

html = '''<html>
   <head>
      <title>HTML Tables</title>
   </head>
   <body>
      <table border = "1">
         <tr>
            <td><p><span>Row 1, Column 1, This should be kept because it has more than two tokens</span></p></td>
            <td><p><span>not kept</span></p></td>
         </tr>
         <tr>
            <td><p><span>Row 2, Column 1, should be kept</span></p></td>
            <td><p><span>Row 2, Column 2, should be kept</span></p></td>
         </tr>
      </table>
   </body>
</html>'''

soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
for table in tables:
    rows = table.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        for ele in cols:
            # Fewer than three space-separated tokens: blank the cell text
            if len(ele.text.split(' ')) < 3:
                ele.string.replace_with("")
print(soup)

Output (whitespace between tags omitted):

<html><head><title>HTML Tables</title></head><body><table border="1"><tr><td><p><span>Row 1, Column 1, This should be kept because it has more than two tokens</span></p></td><td><p><span></span></p></td></tr><tr><td><p><span>Row 2, Column 1, should be kept</span></p></td><td><p><span>Row 2, Column 2, should be kept</span></p></td></tr></table></body></html>
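One caveat with the approach above: `.string` is `None` when a tag has more than one child, so `ele.string.replace_with("")` would fail on a cell such as `<td><b>12</b> <i>34</i></td>`. The following is a minimal sketch of a variant that empties the whole cell with `Tag.clear()` instead. The function name `blank_short_cells`, the `min_words` parameter, and the shortened sample HTML are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened sample; the third cell has several children, so its .string is None.
html = """
<table border="1">
  <tr>
    <td><p><span>Row 1, Column 1, kept because it is long</span></p></td>
    <td><p><span>not kept</span></p></td>
    <td><b>12</b> <i>34</i></td>
  </tr>
</table>
"""


def blank_short_cells(soup, min_words=3):
    """Empty every <td> whose text has fewer than `min_words` tokens (hypothetical helper)."""
    for cell in soup.find_all("td"):
        # split() with no argument splits on any run of whitespace
        if len(cell.get_text().split()) < min_words:
            cell.clear()  # removes all children of the cell, nested tags included
    return soup


soup = blank_short_cells(BeautifulSoup(html, "html.parser"))
text = soup.get_text(" ", strip=True)
print(text)
```

Note that `clear()` also drops the nested `<p>` and `<span>` tags inside a short cell, whereas `replace_with("")` keeps them and only blanks the text. For a pipeline that only calls `get_text()` afterwards, the difference should not matter.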
