Why do I get only an empty list when I try to scrape the PDF links on this web page?

I am trying to scrape the links to the PDFs on this web page, but I get an empty list back. Any help with this would be greatly appreciated.


This is the code I used:


import requests
from bs4 import BeautifulSoup
import lxml
import csv

url = "https://occ.ca/our-publications/"
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
match = soup.find_all('div')
print(match)


慕斯王
3 answers

慕桂英546537

The following works:

import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent avoids the 403 the site returns to the default one.
response = requests.get('https://occ.ca/our-publications/', headers={'User-Agent': 'Mozilla'})
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Each publication link sits inside a div with class "publicationoverlay".
    pdfs = soup.find_all('div', {"class": "publicationoverlay"})
    links = [pdf.find('a').attrs['href'] for pdf in pdfs]
    print(links)

Output (wrapped here for readability):

['https://occ.ca/wp-content/uploads/The-Great-Mosaic-Reviving-Ontarios-Regional-Economies.pdf',
 'https://occ.ca/wp-content/uploads/OCC-Letter-in-support-of-the-OPG-Pickering-Nuclear-Nomination.pdf',
 'https://occ.ca/wp-content/uploads/OCC-Beverage-Alcohol-Report.pdf',
 'https://occ.ca/wp-content/uploads/Industrial-Electricity-Rates.pdf',
 'https://occ.ca/wp-content/uploads/OCC-Letter_Strategic-Approach-to-Alcohol-Sales.pdf',
 'https://occ.ca/wp-content/uploads/OCC-Submission-Modernizing-Ontarios-Environmental-Assessment-Program.pdf',
 'https://occ.ca/wp-content/uploads/OCC-Letter-on-Ticket-Sales-Act.pdf',
 'https://occ.ca/wp-content/uploads/2018-2019-Policy-Report-Card.pdf',
 'https://occ.ca/wp-content/uploads/Letter-on-Right-to-Repair-May-1.pdf',
 'https://occ.ca/wp-content/uploads/Federal-Carbon-Tax-Transparency-Act-2019-OCC.pdf',
 'https://occ.ca/wp-content/uploads/Waste-and-Litter-Submission-_-Final.pdf',
 'https://occ.ca/wp-content/uploads/Supporting-Ontarios-Budding-Cannabis-Industry.pdf']
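If the class name on the page ever changes, selecting by it will silently return nothing again. As a rough alternative sketch, you could collect every anchor whose URL path ends in ".pdf" instead. The helper name extract_pdf_links and the sample HTML below are made up for illustration; they are not taken from the real page.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_pdf_links(html, base_url):
    """Return absolute, de-duplicated URLs of all links that point to a PDF."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs against the page URL.
        href = urljoin(base_url, anchor["href"])
        # Compare only the path, so query strings do not hide the extension.
        if urlparse(href).path.lower().endswith(".pdf") and href not in links:
            links.append(href)
    return links


# Hypothetical markup, only loosely modelled on the structure described above.
sample_html = """
<div class="publicationoverlay"><a href="/wp-content/uploads/report.pdf">Report</a></div>
<div><a href="https://occ.ca/about/">About</a></div>
<div><a href="/wp-content/uploads/report.pdf">Same report again</a></div>
"""

print(extract_pdf_links(sample_html, "https://occ.ca/our-publications/"))
```

On a live page you would pass response.text in place of sample_html, after checking that the request actually succeeded.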

素胚勾勒不出你

The page returns 403 (Forbidden) together with an error page. If you add a User-Agent header, it returns 200 (OK) and the page you need: requests.get(url, headers={'User-Agent': 'Mozilla'})
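The original code never looked at the status code, so the 403 error page was parsed as if it were the real page. A minimal sketch of a fetch helper that fails loudly instead is shown here. The name fetch_html, the optional session parameter, and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, session=None):
    """Fetch a page and return its HTML, raising on 4xx/5xx responses."""
    http = session if session is not None else requests.Session()
    response = http.get(
        url,
        headers={"User-Agent": "Mozilla"},  # browser-like User-Agent, as suggested above
        timeout=10,                         # assumed value; avoids hanging forever
    )
    # Raises requests.HTTPError for 403 and other error statuses,
    # so an error page is never handed to the parser by mistake.
    response.raise_for_status()
    return response.text
```

Calling fetch_html(url) with the URL from the question and passing the result to BeautifulSoup would then either give you the real page or stop with a clear HTTPError.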

慕丝7291255

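That is because your original request received a 403 Forbidden response. By default, Python requests sends headers like these on a plain GET (the version number and the exact Accept-Encoding value depend on your installation):

{ 'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive' }

Content-Length and Content-Type are only added when the request carries a body, so they do not appear here. Some websites block requests that carry the default python-requests User-Agent, which is why you get a 403 HTTP error.

source = requests.get(url, headers={'User-Agent': 'Mozilla'}).text

Adding this header fixes the problem, and you will get the content you need.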
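To check what your own installation sends, you can inspect the headers without contacting the server at all. This is a small sketch that uses requests.utils.default_headers() and Session.prepare_request(); the URL is the one from the question and nothing is actually sent.

```python
import requests

# Headers that requests attaches to every request unless you override them.
defaults = requests.utils.default_headers()
print(dict(defaults))

# Prepare (but do not send) a GET with a custom User-Agent, to see the
# headers that would go out after the override is merged with the defaults.
session = requests.Session()
request = requests.Request(
    "GET",
    "https://occ.ca/our-publications/",
    headers={"User-Agent": "Mozilla"},
)
prepared = session.prepare_request(request)
print(dict(prepared.headers))
```

The second printout should show the User-Agent replaced by "Mozilla" while the other default headers are left in place.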