A Python solution for converting HTML tables to readable plain text

I'm looking for a way to convert HTML tables cleanly into readable plain text.

That is, given the input:


<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>

I expect the output:

Height: 200
Width: 440

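I would rather not use external tools such as w3m -dump file.html, because (1) they are platform-dependent, (2) I want some control over the process, and (3) I assume this is doable with Python alone (with or without extra modules).

I don't need any word wrapping or adjustable cell separator width. Using tabs as cell separators would be good enough.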


慕仙森

翻阅古今

How about using this: Parsing an HTML table into a Python list? However, use collections.OrderedDict() instead of a plain dict to preserve the column order. Once you have the dictionary, getting the text out of it and formatting it is very easy.

Using @Colt 45's solution:

import collections
import xml.etree.ElementTree

s = """\
<table>
    <tr>
        <th>Height</th>
        <th>Width</th>
        <th>Depth</th>
    </tr>
    <tr>
        <td>10</td>
        <td>12</td>
        <td>5</td>
    </tr>
    <tr>
        <td>0</td>
        <td>3</td>
        <td>678</td>
    </tr>
    <tr>
        <td>5</td>
        <td>3</td>
        <td>4</td>
    </tr>
</table>
"""

# Parse the table as XML; this requires well-formed markup.
table = xml.etree.ElementTree.XML(s)
rows = iter(table)

# The first row holds the column headers.
headers = [col.text for col in next(rows)]

# Pair each remaining row's cells with the headers, keeping column order.
for row in rows:
    values = [col.text for col in row]
    for key, value in collections.OrderedDict(zip(headers, values)).items():
        print(key, value)

Output:

Height 10
Width 12
Depth 5
Height 0
Width 3
Depth 678
Height 5
Width 3
Depth 4
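The answer above prints one header/value pair per line and depends on the table being well-formed XML. For the tab-separated, one-line-per-row output the question asks for, a minimal sketch using only the standard library's html.parser might look like the following. The names TableToText and table_to_text are hypothetical, chosen for illustration; the sketch assumes a single, non-nested table and ignores rowspan/colspan.

```python
from html.parser import HTMLParser


class TableToText(HTMLParser):
    """Hypothetical helper: collects cell text, one list of cells per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the open cell, or None

    def _flush_cell(self):
        # Close the open cell, collapsing whitespace so it stays on one line.
        if self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _flush_row(self):
        # Close the open row (and any cell still open inside it).
        self._flush_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()

    def handle_data(self, data):
        # Text outside a cell (indentation between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)


def table_to_text(html, sep="\t"):
    """Return the table as text: one line per row, cells joined by sep."""
    parser = TableToText()
    parser.feed(html)
    parser.close()
    return "\n".join(sep.join(row) for row in parser.rows)


html = """
<table>
    <tr>
        <td>Height:</td>
        <td>200</td>
    </tr>
    <tr>
        <td>Width:</td>
        <td>440</td>
    </tr>
</table>
"""

print(table_to_text(html))
```

Passing sep=" " instead of the default tab yields exactly the "Height: 200" / "Width: 440" lines from the question. Because html.parser does not require well-formed XML, this sketch also tolerates omitted closing tags, which the ElementTree approach would reject.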