Extracting links to specific pages from <a href> tags with BeautifulSoup

I am using BeautifulSoup to extract all the links from this page: http://kern.humdrum.org/search?s=t&keyword=Haydn

This is how I get all of the links:

# -*- coding: utf-8 -*-
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'http://kern.humdrum.org/search?s=t&keyword=Haydn'

# Open the connection and grab the page
uClient = uReq(my_url)

# Put all the content in a variable
page_html = uClient.read()

# Close the connection
uClient.close()

# Parse the HTML
page_soup = soup(page_html, "html.parser")

# Grab all anchor tags that have an href attribute
containers = page_soup.findAll('a', href=True)
#print(type(containers))

for container in containers:
    link = container
    #start_index = link.index('href="')
    print(link)
    print("---")
    #print(start_index)

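Part of my output is:

Note that it returns quite a few links, but what I really want is every link that has ">Something" in it (for example, ">Allegro", ">Allegro vivace", and so on).

In other words, at this point I have a bunch of anchor tags (roughly 1000). Many of them are just "junk", and roughly 350 are the ones I want to extract. All of these tags look almost identical; the only difference is that the tags I need end with ">some name</a>". I want to extract only the links from the anchor tags that have this characteristic.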
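One possible way to express this filter, offered as a minimal sketch rather than a definitive answer: keep only the anchors whose visible text is non-empty, and read the URL from the `href` attribute instead of printing the whole tag. The helper name `links_with_text` and the `sample_html` snippet are hypothetical and only stand in for the real page, whose exact markup is not shown here; it is also an assumption that the "junk" anchors have no visible text at all.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def links_with_text(html, base_url=""):
    """Return (text, absolute_url) pairs for anchors that have visible text.

    Hypothetical helper: assumes the unwanted anchors contain no text
    (for example, they are empty or wrap only an image).
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for a in soup.find_all("a", href=True):
        # get_text(strip=True) collapses to "" when the anchor has no text
        text = a.get_text(strip=True)
        if text:
            # Resolve relative hrefs against the page URL
            results.append((text, urljoin(base_url, a["href"])))
    return results


# Hypothetical stand-in for the real page content
sample_html = """
<a href="/file1.krn"></a>
<a href="/file2.krn"><img src="icon.gif"></a>
<a href="/work?id=1">Allegro</a>
<a href="/work?id=2"> Allegro vivace </a>
"""

for text, url in links_with_text(sample_html, "http://kern.humdrum.org/"):
    print(text, url)
```

With the code above, the same function could be called as `links_with_text(page_html, my_url)`. If the unwanted anchors turn out to contain text as well, the `if text:` test would need a stricter condition, such as a check on the `href` pattern.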
