Python: extracting a title from a URL


I am using the function below to try to extract titles from a list of URLs that I scraped from the web.

I did look at some SO answers, but noticed that many people advise against regex solutions. I would like to fix and build on my existing solution, but I am happy to receive suggestions for other elegant solutions.

Sample URL 1: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg

Sample URL 2: https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/220px-Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg

The code (a function) that tries to extract the title from a URL:

    def titleextract(url):
        #return unquote(url[58:url.rindex("/",58)-8].replace('_',''))
        cleanedtitle1=url[58:]
        title= cleanedtitle1.strip("-_Google_Art_Project.jpg/220px-")
        return title

The function above produces the following for the URLs:

URL 1: Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/220px-Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg

URL 2: Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg

However, the desired output is:

URL 1: Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son

URL 2: Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh2C_the_Wife_of_the_Artist

What I am struggling with is removing everything after the title, i.e. _-_Google_Art_Project.jpg/220px-Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg, which is different for each URL, and then removing unwanted characters if they are present, such as the % in URL 2.

Ideally, I would also like to get rid of the underscores in the title.

Any suggestions that build on my existing code, with a step-by-step explanation, would be much appreciated.

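My attempt to remove the beginning worked:

    cleanedtitle1=url[58:]

But I have tried various ways of stripping characters to remove the end, and none of them worked:

    title= cleanedtitle1.strip("-_Google_Art_Project.jpg/220px-")

Following one suggestion, I also tried:

    return unquote(url[58:url.rindex("/",58)-8].replace('_',''))

...but this does not remove the unwanted text correctly. It only drops the last 8 characters, and because the length of the unwanted text varies, that does not work.

I also tried this, which was meant to remove the underscores as well. No luck.

    cleanedtitle1=url[58:]
    cleanedtitle2= cleanedtitle1.strip("-_Google_Art_Project.jpg/220px-")
    title = cleanedtitle2.strip("_")
    return title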

慕尼黑的夜晚无繁华

250 views, 3 answers

3 Answers

阿晨1998

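Starting with your first line:

    cleanedtitle1 = url[58:]

This works, but a hard-coded offset is not very robust, so let's start from the character after the second-to-last "/" instead. You could do that with a regular expression, but more simply it might look like this:

    pos1 = url.rindex("/")  # index of last /
    pos2 = url[:pos1].rindex("/")  # index of second-to-last /
    cleanedtitle1 = url[pos2 + 1:]

In fact, you are only interested in the part between the second-to-last and the last "/", so let's also use the intermediate value pos1 as the end of the slice:

    pos1 = url.rindex("/")  # index of last /
    pos2 = url[:pos1].rindex("/")  # index of second-to-last /
    cleanedtitle1 = url[pos2 + 1: pos1]

For the first sample URL, this gives cleanedtitle1 the value:

    'Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg'

Now for your strip. It does not do quite what you want: strip treats its argument as a set of individual characters and removes any of them from both ends of the string, stopping at the first character that is not in the set. It never removes the argument as a whole substring. So let's use replace, and replace the unwanted string with an empty string:

    title = cleanedtitle1.replace("_-_Google_Art_Project.jpg", "")

Then we can do something similar for the underscores:

    title = title.replace("_", " ")

and we get:

    'Rembrandt van Rijn - Self-Portrait'

Putting it together:

    pos1 = url.rindex("/")
    pos2 = url[:pos1].rindex("/")
    cleanedtitle1 = url[pos2 + 1: pos1]
    title = cleanedtitle1.replace("_-_Google_Art_Project.jpg", "")
    title = title.replace("_", " ")
    return title

Update

I missed the fact that the URLs may contain sequences such as %2C that we want to replace. These can be handled the same way with replace, for example:

    url = url.replace("%2C", ",")

But you would have to do this for every similar sequence that might occur, so it is better to use unquote from urllib. If you put this at the top of your code:

    from urllib.parse import unquote

then you can make all of these substitutions with url = unquote(url) before the rest of the processing:

    from urllib.parse import unquote

    def titleextract(url):
        url = unquote(url)
        pos1 = url.rindex("/")
        pos2 = url[:pos1].rindex("/")
        cleanedtitle1 = url[pos2 + 1: pos1]
        title = cleanedtitle1.replace("_-_Google_Art_Project.jpg", "")
        title = title.replace("_", " ")
        return title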
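To see why strip misbehaves here, a small sketch may help. It uses the file name from the first sample URL; remove_suffix is a hypothetical helper written only for this illustration (on Python 3.9+ the built-in str.removesuffix should do the same job).

```python
name = "Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg"
suffix = "_-_Google_Art_Project.jpg"

# strip() treats `suffix` as a set of characters and trims them from both ends.
# It keeps going past the suffix and also eats the final "t" of "Portrait",
# because "t" happens to be one of the characters in the set.
stripped = name.strip(suffix)
print(stripped)  # Rembrandt_van_Rijn_-_Self-Portrai


def remove_suffix(text, suffix):
    """Hypothetical helper: drop `suffix` only if `text` really ends with it."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


print(remove_suffix(name, suffix))  # Rembrandt_van_Rijn_-_Self-Portrait
```

The replace call in the answer gives the same result for these URLs; the suffix check is just a little stricter, since it only touches the end of the string.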


POPMUISE

This should work; let me know if there are any problems.

    def titleextract(url):
        title = url[58:]
        marker = "_-_Google_Art_Project.jpg"
        if marker in title:
            x = title.index(marker)
            title = title[:x]  # Keep only the text before the marker.
        disallowed_chars = "%"  # Edit which chars should go.
        # Python looks at each character in turn. If it is not in disallowed_chars,
        # it is kept. "".join() joins together all the characters that are kept.
        title = "".join(c for c in title if c not in disallowed_chars)
        title = title.replace("_", " ")  # Change underscores to spaces.
        return title
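A possible variation that avoids the hard-coded offset 58 is sketched below. It is only a sketch under assumptions: the name title_from_thumb_url and the MARKER constant are made up for this example, and it assumes thumbnail URLs of the form .../<file name>/<width>px-<file name>, as in the samples. Instead of deleting "%", it decodes escapes such as %2C with unquote, so the comma is restored rather than leaving "2C" behind.

```python
from urllib.parse import unquote, urlsplit

# Assumed marker, taken from the sample URLs in the question.
MARKER = "_-_Google_Art_Project"


def title_from_thumb_url(url):
    """Sketch: derive a readable title from a thumbnail URL."""
    # Split the path before unquoting, so a decoded "%2F" cannot add segments.
    segments = urlsplit(url).path.split("/")
    # The second-to-last segment is the original file name.
    filename = unquote(segments[-2])
    # partition() returns the whole string as the first item if MARKER is absent.
    title, _, _ = filename.partition(MARKER)
    if title == filename:
        # No marker found: fall back to dropping the file extension.
        title = filename.rsplit(".", 1)[0]
    return title.replace("_", " ")


url = ("https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/"
       "Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/"
       "220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg")
print(title_from_thumb_url(url))  # Rembrandt van Rijn - Self-Portrait
```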


四季花海

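There are a few ways to do this.

If you only want to use built-in Python string functions, you can first split everything on "/" and then strip off the parts that are common to all of the URLs:

    def titleextract(url):
        cleanedtitle1 = url.split("/")[-1]
        return cleanedtitle1[6:-4].replace('_',' ')

The slice [6:-4] drops the leading "220px-" and the trailing ".jpg", so it assumes exactly that prefix and a four-character extension, and it leaves the "- Google Art Project" suffix in place.

Since you are already importing bs4, you could also do it this way:

    soup = BeautifulSoup(htmlString, 'html.parser')
    title = soup.title.text

Note that this reads the text of the page's title element, so it needs the HTML of the page (htmlString) rather than the image URL.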
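If the thumbnail width or the file extension can vary, the fixed slice could be replaced with something like the sketch below. The function name title_from_last_segment and the WIDTH_PREFIX pattern are assumptions made for this example, not part of the answer above, and the pattern assumes prefixes of the form "<digits>px-". Like the slice version, it keeps any "- Google Art Project" suffix.

```python
import re

# Assumed pattern for a leading thumbnail width such as "220px-".
WIDTH_PREFIX = re.compile(r"^\d+px-")


def title_from_last_segment(url):
    """Sketch: build a title from the last path segment of a thumbnail URL."""
    last = url.rsplit("/", 1)[-1]       # text after the final "/"
    name = WIDTH_PREFIX.sub("", last)   # drop "220px-", "1024px-", ...
    stem = name.rsplit(".", 1)[0]       # drop the extension, whatever its length
    return stem.replace("_", " ")


url = ("https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/"
       "Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/"
       "220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg")
print(title_from_last_segment(url))
# Rembrandt van Rijn - Self-Portrait - Google Art Project
```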
