Extracting specific text from repeated tags with BeautifulSoup

In this case we can use CSS selectors. Assuming you are using BeautifulSoup 4.7+, CSS selector support is provided by the Soupsieve library. We first use the CSS Level 4 :has() selector to find <p> tags that have a direct child <b> tag, then use Soupsieve's non-standard :contains selector to make sure that <b> tag contains Description:. Then we simply print the content of every element that matches, stripping leading and trailing whitespace and removing Description:. Keep in mind that there are multiple ways to do this; this is just the one I chose to illustrate the approach:

    import bs4

    # Each label is assumed to be a <b> inside its own <p>,
    # which is the structure the selector below expects.
    markup = """
    </div>
    <div class="col-sm-6">
        <p><b>Book Title:</b>
        <A HREF="book_detail.cfm?ID=2449">The Holy Bible containing the Old and New Testaments, according to the authorised version. With illustrations by Gustave Doré</a>
        </p>
        <p><b>Author:</b> Doré, Gustave, 1832-1883</p>
        <p><b>Image Title:</b> Baptism of Jesus</p>
        <p><b>Scripture Reference:</b><ul><li>John 1 (<a href='search.cfm?biblicalbook=John&biblicalbookchapter=1'>further images</a> / <a rel='shadowbox;height=500;width=600' href='http://www.commonenglishbible.com/explore/passage-lookup/?query=John+1'>scripture text</a>)</li></ul></p>
        <p><b>Description:</b> John the Baptist baptizes Jesus in the Jordan River; the Holy Spirit appears overhead in the form of a dove. The artist, Gustave Doré (1832-1883), has placed his signature at the lower left of the woodcut, and the engraver’s signature, A. Ligny, is located at the lower right.</p>
        <p><A HREF="book_list.cfm?ID=2449">Click here
            </a> for additional images available from this book.</p>
        <p>For information on licensing this image, please send an email, including a link to the image, to
            <a href="mailto:dia@emory.edu?subject=Licensing%20Image%20From%20DIA - 17250">dia@emory.edu</a></p>
    </div>
    """

    soup = bs4.BeautifulSoup(markup, "html.parser")

    # Select every <p> whose direct child <b> contains the label text.
    for el in soup.select('p:has(> b:contains("Description:"))'):
        # Strip surrounding whitespace, then drop the label itself.
        print(el.get_text().strip().replace('Description: ', ''))

Output:

    John the Baptist baptizes Jesus in the Jordan River; the Holy Spirit appears overhead in the form of a dove. The artist, Gustave Doré (1832-1883), has placed his signature at the lower left of the woodcut, and the engraver’s signature, A. Ligny, is located at the lower right.
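Since there are several ways to do this, here is a rough sketch of one alternative that avoids CSS selectors altogether: walk over every <b> label, check that it sits directly inside a <p>, and collect the rest of the paragraph text in a dict keyed by label. The helper name extract_fields and the shortened sample markup are made up for illustration, and the sketch assumes each label is the first thing in its paragraph.

```python
import bs4


def extract_fields(html):
    """Map each bold label (e.g. 'Description') to the rest of its <p> text."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    fields = {}
    for label in soup.find_all("b"):
        parent = label.parent
        # Only consider <b> tags that are direct children of a <p>.
        if parent is None or parent.name != "p":
            continue
        label_text = label.get_text().strip()
        text = parent.get_text().strip()
        # Assumption: the label leads the paragraph; skip anything else.
        if not text.startswith(label_text):
            continue
        key = label_text.rstrip(":")
        # Remaining paragraph text once the label is cut off the front.
        fields[key] = text[len(label_text):].strip()
    return fields


# Hypothetical, shortened sample with the same <p><b>Label:</b> text</p> shape.
sample = """
<div class="col-sm-6">
    <p><b>Author:</b> Doré, Gustave, 1832-1883</p>
    <p><b>Image Title:</b> Baptism of Jesus</p>
    <p><b>Description:</b> John the Baptist baptizes Jesus in the Jordan River.</p>
</div>
"""

fields = extract_fields(sample)
print(fields["Description"])
```

This gives you every labelled field at once, which can be handy when the same block repeats across many pages. If you stay with the selector approach instead, note that Soupsieve 2.1 and later spell the text-matching pseudo-class :-soup-contains and emit a deprecation warning for :contains.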
