How do I parse an HTML document in Python and extract a specific element?

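Python has many XML and HTML parsers. I am looking for a simple way to extract a section of an HTML document, preferably with an XPath expression, though that part is optional.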

Here is an example:

src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"

I want to extract the whole element with id=content, so the result should be: <div id=content>AAA<B>BBB</B>CCC</div>

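Ideally this could be done without installing a new library.

I would also like to get the original content of the element, not a reformatted version.

Regular expressions are not allowed, because they are not safe for parsing XML/HTML.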

慕仙森


2 Answers

绝地无双

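If you are going to parse with a library, the best option is BeautifulSoup. Here is a short snippet showing how it would work for you:

    # BeautifulSoup 4 (package "bs4"); the older "from BeautifulSoup import BeautifulSoup" only works on Python 2
    from bs4 import BeautifulSoup

    src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
    soupy = BeautifulSoup(src, "html.parser")
    content_divs = soupy.find_all(attrs={"id": "content"})
    if len(content_divs) > 0:
        # print the first one
        print(str(content_divs[0]))
        # to print the text contents
        print(content_divs[0].text)
        # or to print the HTML of every match (re-serialized by BeautifulSoup,
        # so tag case and attribute quoting can differ from the source)
        for each in content_divs:
            print(each)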
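The question also asks for something that needs no new library and returns the element exactly as written. Here is a minimal sketch under those constraints, using only the standard library's html.parser. The names RawElementFinder, extract_raw_element and _to_index are made up for this illustration. It assumes the whole document is passed in one string, and it finds the matching end tag by counting open tags with the same name as the target, so treat it as a starting point rather than a complete HTML solution.

```python
from html.parser import HTMLParser


class RawElementFinder(HTMLParser):
    """Locate the element with a given id and remember its source positions."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.target_tag = None  # tag name of the matched element, e.g. "div"
        self.depth = 0          # open elements with that tag name, target included
        self.start = None       # (line, column) of the opening tag's "<"
        self.end = None         # (line, column) of the closing tag's "<"

    def handle_starttag(self, tag, attrs):
        if self.end is not None:
            return  # the whole element has already been found
        if self.target_tag is None:
            if dict(attrs).get("id") == self.target_id:
                self.target_tag = tag
                self.depth = 1
                self.start = self.getpos()
        elif tag == self.target_tag:
            self.depth += 1  # nested element with the same tag name

    def handle_endtag(self, tag):
        if self.target_tag is None or self.end is not None:
            return
        if tag == self.target_tag:
            self.depth -= 1
            if self.depth == 0:
                self.end = self.getpos()


def _to_index(src, pos):
    """Convert HTMLParser's (1-based line, 0-based column) into a string index."""
    line, col = pos
    index = 0
    for _ in range(line - 1):
        index = src.index("\n", index) + 1
    return index + col


def extract_raw_element(src, target_id):
    """Return the untouched source text of the element with this id, or None."""
    finder = RawElementFinder(target_id)
    finder.feed(src)
    finder.close()
    if finder.start is None or finder.end is None:
        return None
    begin = _to_index(src, finder.start)
    close_tag = _to_index(src, finder.end)
    end = src.index(">", close_tag) + 1  # include the closing tag itself
    return src[begin:end]


src = "<html><body>...<div id=content>AAA<B>BBB</B>CCC</div>...</body></html>"
print(extract_raw_element(src, "content"))
```

For the src from the question this prints <div id=content>AAA<B>BBB</B>CCC</div>, with the uppercase <B> and the unquoted attribute preserved, because the result is sliced out of the original string instead of being rebuilt from a parse tree.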


德玛西亚99

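Yes, I have done this. It may not be the best way to do it, but it works along the lines of the code below. I have not tested it.

    import re

    # re.search returns a single match object (re.finditer returns an iterator, which has no .start())
    match = re.search("<div id=content>", src)
    src = src[match.start():]
    # At this point the string starts with your div; everything preceding it has been stripped.
    # This next part works because the first </div> in the string is the end of your div section.
    match = re.search("</div>", src)
    src = src[:match.end()]

src now contains only the div you are after. If there are cases where the content you want contains another </div>, you will need to build a more advanced search pattern for the re.search calls.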
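For the nested case mentioned at the end, one possible "more advanced" approach is to count opening and closing div tags until they balance. The sketch below is only an illustration: slice_div is a hypothetical helper name, the opening tag is assumed to appear literally as "<div id=content>", and it is still regex-based, so it inherits the fragility the question warns about (comments, scripts, or attribute values that happen to contain "<div" will confuse it).

```python
import re


def slice_div(src, opening="<div id=content>"):
    """Return the div that starts with `opening`, up to its matching </div>."""
    begin = src.find(opening)
    if begin == -1:
        return None
    depth = 0
    # Walk over every opening or closing div tag from the target onwards.
    for match in re.finditer(r"<div\b|</div\s*>", src[begin:], flags=re.IGNORECASE):
        if match.group().startswith("</"):
            depth -= 1
        else:
            depth += 1
        if depth == 0:
            # The tags balance here, so this closing tag belongs to the target div.
            return src[begin:begin + match.end()]
    return None  # no balanced closing tag was found


nested = "<p><div id=content>A<div>B</div>C</div>D</p>"
print(slice_div(nested))
```

With the nested string above, stopping at the first </div> would cut the result off after B, while the depth count returns the whole outer div.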

