Replacing table cell contents with BeautifulSoup

I want to use Beautiful Soup to parse an HTML document that also contains tabular data. I am doing some NLP on the result.


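A table cell may contain only a number, or it may contain a lot of text. So, before calling soup.get_text(), I want to change the contents of the table cells based on the following condition.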


Condition: if a cell has more than two words (a number counts as one word), keep it as is; otherwise, change the cell's contents to an empty string.
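To make the condition concrete, here is a minimal sketch of it as a predicate. The helper name `keep_cell` is hypothetical, and it assumes that "words" means whitespace-separated tokens.

```python
def keep_cell(text: str) -> bool:
    """Return True when the text has more than two whitespace-separated tokens."""
    # str.split() with no argument splits on any run of whitespace
    # and ignores leading/trailing whitespace.
    return len(text.split()) > 2


print(keep_cell("Row 2, Column 1, should be kept"))  # True
print(keep_cell("not kept"))                         # False
print(keep_cell("42"))                               # False
```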


<code to change table data based on condition>


soup = BeautifulSoup(html)

text = soup.get_text()

Here is what I tried.


    tables = soup.find_all('table')
    for table in tables:
        table_body = table.find('tbody')
        rows = table_body.find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            for ele in cols:
                if len(ele.text.split(' ')) < 3:
                    ele.text = ''

However, ele.text cannot be assigned to, so this raises an error.


Here is a simple HTML structure with a table:


<!DOCTYPE html>
<html>
   <head>
      <title>HTML Tables</title>
   </head>
   <body>
      <table border = "1">
         <tr>
            <td><p><span>Row 1, Column 1, This should be kept because it has more than two tokens</span></p></td>
            <td><p><span>not kept</span></p></td>
         </tr>
         <tr>
            <td><p><span>Row 2, Column 1, should be kept</span></p></td>
            <td><p><span>Row 2, Column 2, should be kept</span></p></td>
         </tr>
      </table>
   </body>
</html>


繁花不似锦
1 Answer

慕丝7291255

Once you have found the element, use ele.string.replace_with(""). Based on your sample HTML:

    from bs4 import BeautifulSoup

    html = '''<html>
       <head>
          <title>HTML Tables</title>
       </head>
       <body>
          <table border = "1">
             <tr>
                <td><p><span>Row 1, Column 1, This should be kept because it has more than two tokens</span></p></td>
                <td><p><span>not kept</span></p></td>
             </tr>
             <tr>
                <td><p><span>Row 2, Column 1, should be kept</span></p></td>
                <td><p><span>Row 2, Column 2, should be kept</span></p></td>
             </tr>
          </table>
       </body>
    </html>'''

    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.find_all('table')
    for table in tables:
        rows = table.find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            for ele in cols:
                # Fewer than three space-separated tokens: empty the cell's text
                if len(ele.text.split(' ')) < 3:
                    ele.string.replace_with("")
    print(soup)

Output:

    <html>
       <head>
          <title>HTML Tables</title>
       </head>
       <body>
          <table border="1">
             <tr>
                <td><p><span>Row 1, Column 1, This should be kept because it has more than two tokens</span></p></td>
                <td><p><span></span></p></td>
             </tr>
             <tr>
                <td><p><span>Row 2, Column 1, should be kept</span></p></td>
                <td><p><span>Row 2, Column 2, should be kept</span></p></td>
             </tr>
          </table>
       </body>
    </html>
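One caveat: ele.string is None when a cell holds more than one child (for example two <p> elements), so calling replace_with on it would fail there. The following is a hedged sketch of a variant that may be more robust for such cells. It is not the answer's code: the function name `blank_short_cells` and the `min_words` parameter are made up for illustration, and it assumes that splitting on any whitespace is an acceptable definition of a word.

```python
from bs4 import BeautifulSoup


def blank_short_cells(soup, min_words=3):
    """Empty every <td> whose text has fewer than min_words tokens.

    Hypothetical helper: the name and the min_words parameter are
    illustrative assumptions, not part of Beautiful Soup.
    """
    for cell in soup.find_all("td"):
        # Join text from nested tags with a space, then split on whitespace.
        words = cell.get_text(" ", strip=True).split()
        if len(words) < min_words:
            cell.clear()  # removes all children, nested tags included
    return soup


html = """
<table>
  <tr><td><p><span>Row 1, Column 1, kept</span></p></td><td><p>not</p><p>kept</p></td></tr>
  <tr><td>42</td><td></td></tr>
</table>
"""
soup = blank_short_cells(BeautifulSoup(html, "html.parser"))
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
print(cells)  # ['Row 1, Column 1, kept', '', '', '']
```

Unlike replace_with on the inner string, clear() also removes the nested <p> and <span> tags from short cells, which makes no difference to the result of soup.get_text().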