Scraping links from a website that returns a 403 error (Python)

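I am trying to scrape links from a list of URLs (all pointing to different pages of the same website), but I keep running into a 403 error. Here are examples of the URLs I am trying to scrape: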

https://www.spectatornews.com/page/6/?s=band

https://www.spectatornews.com/page/7/?s=band

and so on.

Here is my code:

from bs4 import BeautifulSoup
import urllib.request

getarticles = []

for i in listoflinks:
    resp = urllib.request.urlopen(i)
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        getarticles.append(link['href'])

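I have been trying some of the answers from "HTTP error 403 in Python 3 Web Scraping", but without much success. I am not sure whether I am applying them correctly to my whole list of links. I tried one of those solutions, which sets request headers, but it returns HTTP Error 406: Not Acceptable.

Here is the code from my attempted fix: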

from bs4 import BeautifulSoup
import urllib.request

getarticles = []

for i in listoflinks:
    req = urllib.request.Request(i, headers={'User-Agent': 'Mozilla/5.0'})
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        getarticles.append(link['href'])

Any help is greatly appreciated. I am very new to this, so please explain as much as you can. I just want to collect the links from my list of pages!
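One thing that may be worth trying: some servers reject a bare 'Mozilla/5.0' User-Agent and answer with 406, but accept a request that carries a complete browser-style header set. The sketch below is untested against this particular site. The header values, the names BROWSER_HEADERS, LinkCollector, extract_links, build_request and collect_links, and the 30-second timeout are all illustrative assumptions. It also swaps BeautifulSoup for the standard library's HTMLParser only to stay dependency-free; the soup.find_all('a', href=True) loop from the question would work just as well on the downloaded HTML.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: the server filters requests that do not look like a browser.
# These values are illustrative; the site may require something different.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
                   "Gecko/20100101 Firefox/115.0"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip <a> tags with a missing or empty href
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string as absolute URLs."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def build_request(url):
    """Build a request that carries the browser-like headers above."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def collect_links(urls):
    """Fetch every URL; return (links, failures) instead of stopping on an HTTP error."""
    links, failures = [], []
    for url in urls:
        try:
            with urllib.request.urlopen(build_request(url), timeout=30) as resp:
                charset = resp.headers.get_content_charset() or "utf-8"
                html = resp.read().decode(charset, errors="replace")
        except urllib.error.HTTPError as err:
            # Record the status code (403, 406, ...) and move on to the next page.
            failures.append((url, err.code))
            continue
        links.extend(extract_links(html, url))
    return links, failures
```

With the list from the question, this would be called as getarticles, failed = collect_links(listoflinks). Any page that still answers with 403 or 406 ends up in failed together with its status code, which shows whether the headers made a difference for the whole list or only for some pages. If every page is still refused, the site may be blocking automated access on purpose, and headers alone will not change that.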

ibeautiful
